<a href="https://colab.research.google.com/github/Seb85vickz/DAS7000Data_Analytics_and_Visualization/blob/main/tracks_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tracks Data — Cleaning, EDA and Interactive Visualisations

1. Data loading from the provided URL.
2. Detailed data cleaning using imputation techniques only (no dropping of rows).
3. Extensive EDA.
4. At least 10 beginner interactive plots and 10 advanced interactive plots (Plotly).
5. Feature engineering and key findings.


In [1]:
# Configuration: URL and core imports
CSV_URL = 'https://raw.githubusercontent.com/Seb85vickz/DAS7000Data_Analytics_and_Visualization/refs/heads/main/tracks.csv'

import numpy as np
import pandas as pd
import ast
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_colwidth', 400)


## 1. Load data

We read the CSV directly from the GitHub raw URL. If the list-like columns are serialized as strings, we `ast.literal_eval` them to Python lists.

In [2]:
df = pd.read_csv(CSV_URL)
df.head()


Unnamed: 0,oid,timestamp,x,y,body_roll,body_pitch,body_yaw,head_roll,head_pitch,head_yaw,other_oid,other_class,other_x,other_y
0,50187,1842.4,495854.64031,5405751.0,,,,,,,"[47646, 50181, 50184, 50187]","[0, 4, 4, 4]","[495923.373133135, 495899.069769386, 495899.056786096, 495854.640309584]","[5405744.32136751, 5405738.47595118, 5405739.18984693, 5405750.91234782]"
1,50187,1842.5,495854.792078,5405751.0,,,,,,,"[50181, 50187, 50184, 47646]","[4, 4, 4, 0]","[495899.234566716, 495854.792078353, 495899.224798791, 495922.569930677]","[5405738.39126416, 5405750.93930797, 5405739.20502755, 5405744.42285387]"
2,50187,1842.6,495854.943847,5405751.0,,,,,,,"[47646, 50187, 50184, 50181]","[0, 4, 4, 4]","[495921.779445452, 495854.943847121, 495899.357695912, 495899.399364046]","[5405744.51929698, 5405750.96626812, 5405739.15318381, 5405738.30657713]"
3,50187,1842.7,495855.095616,5405751.0,,,,,,,"[50187, 47646, 50184, 50181]","[4, 0, 4, 4]","[495855.09561589, 495920.943052671, 495899.490593033, 495899.564161375]","[5405750.99322827, 5405744.63008031, 5405739.10134006, 5405738.22189011]"
4,50187,1842.8,495855.256935,5405751.0,,,,,,,"[50187, 50184, 50181, 47646]","[4, 4, 4, 0]","[495855.256935427, 495899.585908147, 495899.720312982, 495920.115044655]","[5405751.02150176, 5405739.0332702, 5405738.08456954, 5405744.73152952]"


### Convert list-like string columns to actual lists (if necessary)
Columns: `other_oid`, `other_class`, `other_x`, `other_y` may be stored as strings like "[1,2,3]". We'll convert them.


In [3]:
list_cols = ['other_oid','other_class','other_x','other_y']
for c in list_cols:
    # if dtype is object (strings), attempt literal_eval for each non-null
    if df[c].dtype == object:
        def try_eval(v):
            if pd.isna(v):
                return v
            if isinstance(v, list):
                return v
            try:
                return ast.literal_eval(v)
            except Exception:
                return v
        df[c] = df[c].apply(try_eval)

df[list_cols].head()


Unnamed: 0,other_oid,other_class,other_x,other_y
0,"[47646, 50181, 50184, 50187]","[0, 4, 4, 4]","[495923.373133135, 495899.069769386, 495899.056786096, 495854.640309584]","[5405744.32136751, 5405738.47595118, 5405739.18984693, 5405750.91234782]"
1,"[50181, 50187, 50184, 47646]","[4, 4, 4, 0]","[495899.234566716, 495854.792078353, 495899.224798791, 495922.569930677]","[5405738.39126416, 5405750.93930797, 5405739.20502755, 5405744.42285387]"
2,"[47646, 50187, 50184, 50181]","[0, 4, 4, 4]","[495921.779445452, 495854.943847121, 495899.357695912, 495899.399364046]","[5405744.51929698, 5405750.96626812, 5405739.15318381, 5405738.30657713]"
3,"[50187, 47646, 50184, 50181]","[4, 0, 4, 4]","[495855.09561589, 495920.943052671, 495899.490593033, 495899.564161375]","[5405750.99322827, 5405744.63008031, 5405739.10134006, 5405738.22189011]"
4,"[50187, 50184, 50181, 47646]","[4, 4, 4, 0]","[495855.256935427, 495899.585908147, 495899.720312982, 495920.115044655]","[5405751.02150176, 5405739.0332702, 5405738.08456954, 5405744.73152952]"
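
As a quick sanity check on the parsing step, the sketch below parses the list-like cells of one record and confirms that the four lists have the same length, which the neighbour features later rely on. The helper name `parse_list_cell` is hypothetical, and the values are copied from the first record shown above.

```python
import ast

def parse_list_cell(value):
    """Parse a string such as "[1, 2, 3]" into a list; return anything else unchanged."""
    if not isinstance(value, str):
        return value
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # not a Python literal, keep the raw string
    return parsed if isinstance(parsed, list) else value

# First record of the dataset, as serialized in the CSV
row = {
    'other_oid': "[47646, 50181, 50184, 50187]",
    'other_class': "[0, 4, 4, 4]",
    'other_x': "[495923.373133135, 495899.069769386, 495899.056786096, 495854.640309584]",
    'other_y': "[5405744.32136751, 5405738.47595118, 5405739.18984693, 5405750.91234782]",
}

parsed_row = {name: parse_list_cell(cell) for name, cell in row.items()}
lengths = {name: len(values) for name, values in parsed_row.items()}
aligned = len(set(lengths.values())) == 1  # True when all four lists line up
print(lengths, aligned)
```

Running the same length check over every row of `df[list_cols]` would flag records where ids and coordinates are out of step.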


## 2. Inspect missingness and initial stats
We'll inspect missing counts and basic distributions.

In [4]:
print('Shape:', df.shape)
display(df.info())
display(df.isna().sum())
display(df.describe(include='all'))


Shape: (4759, 14)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4759 entries, 0 to 4758
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   oid          4759 non-null   int64  
 1   timestamp    4759 non-null   float64
 2   x            4759 non-null   float64
 3   y            4759 non-null   float64
 4   body_roll    2061 non-null   float64
 5   body_pitch   2061 non-null   float64
 6   body_yaw     2061 non-null   float64
 7   head_roll    2061 non-null   float64
 8   head_pitch   2061 non-null   float64
 9   head_yaw     2061 non-null   float64
 10  other_oid    4759 non-null   object 
 11  other_class  4759 non-null   object 
 12  other_x      4759 non-null   object 
 13  other_y      4759 non-null   object 
dtypes: float64(9), int64(1), object(4)
memory usage: 520.6+ KB


None

Unnamed: 0,0
oid,0
timestamp,0
x,0
y,0
body_roll,2698
body_pitch,2698
body_yaw,2698
head_roll,2698
head_pitch,2698
head_yaw,2698


Unnamed: 0,oid,timestamp,x,y,body_roll,body_pitch,body_yaw,head_roll,head_pitch,head_yaw,other_oid,other_class,other_x,other_y
count,4759.0,4759.0,4759.0,4759.0,2061.0,2061.0,2061.0,2061.0,2061.0,2061.0,4759,4759,4759,4759
unique,,,,,,,,,,,3734,2055,4701,4701
top,,,,,,,,,,,"[9776, 7219]","[4, 0]","[496170.433691562, 496164.085499575, 496159.107809483, 496154.556815939, 496161.400876046, 496156.941724541, 496156.520173907, 496155.159422905, 496154.083392208]","[5405739.8926471, 5405732.63293484, 5405733.08550224, 5405733.08256904, 5405732.54753168, 5405734.40031554, 5405735.61981056, 5405734.68631012, 5405735.55517041]"
freq,,,,,,,,,,,32,67,3,3
mean,36158.947258,1169.678924,496070.834761,5405964.0,0.304014,-0.711818,190.560107,-0.25381,-1.086118,186.370247,,,,
std,15992.323879,800.496866,109.83002,176.3856,1.36282,2.263132,80.499321,3.5134,5.660005,77.836386,,,,
min,7682.0,217.5,495813.501735,5405731.0,-7.0,-18.3342,0.0,-28.0,-25.1852,0.0,,,,
25%,19348.0,317.5,496008.570272,5405741.0,0.0,0.0,133.6,0.0,-1.72593,149.2,,,,
50%,42054.0,925.4,496062.81667,5406074.0,0.0,0.0,192.0,0.0,0.0,185.687,,,,
75%,49654.0,1837.85,496157.493305,5406116.0,0.0,0.0,247.2,0.0,0.0,233.2,,,,


### Observations
- The orientation columns (body_*, head_*) are each missing in 2,698 of 4,759 rows (about 57%). Since no rows may be dropped, we impute these values instead.
- other_* columns contain lists of neighbours; we can engineer useful features from them.
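
Before choosing an imputation method it helps to know whether the gaps sit inside tracks or cover whole tracks, because a track with no orientation values at all cannot be filled from its own data. The sketch below shows one way to measure this; the toy frame and its values are made up for illustration, and the same `groupby` could be applied to the real `df`.

```python
import numpy as np
import pandas as pd

# Toy data (not from the dataset): oid 1 has an interior gap,
# oid 2 has no orientation values at all, oid 3 is complete.
toy = pd.DataFrame({
    'oid': [1, 1, 1, 2, 2, 3, 3],
    'body_yaw': [10.0, np.nan, 30.0, np.nan, np.nan, 5.0, 6.0],
})

# Share of missing values per track
share_missing = toy.groupby('oid')['body_yaw'].agg(lambda s: s.isna().mean())

fully_missing_oids = share_missing[share_missing == 1.0].index.tolist()
partly_missing_oids = share_missing[(share_missing > 0) & (share_missing < 1)].index.tolist()

print('tracks with no values:', fully_missing_oids)
print('tracks with some gaps:', partly_missing_oids)
```

Tracks in the first list can only be filled by a global statistic, while tracks in the second list can be filled from their own neighbouring values.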


## 3. Cleaning / Imputation strategy (explanation)

We will use a combination of imputation techniques (never drop rows):

1. **Interpolation:** For each `oid` (object id), the orientation values typically change smoothly across timestamps, so we first use linear interpolation within each group to fill gaps that lie between two known orientation values.
2. **Group-wise forward/backward fill:** Leading and trailing gaps in a track, which interpolation cannot bridge, are filled with `groupby('oid')` plus `.ffill()` and `.bfill()`. Interpolation has to run first: once forward/backward fill has run, no interior gaps remain for it to fill.
3. **If still missing after group methods:** use **global median** (more robust than mean in presence of outliers) for continuous orientation columns.
4. **For categorical-ish `other_class` elements:** we won't change the lists; for engineered features like `num_neighbors` we compute counts directly.

This keeps temporal continuity and avoids dropping rows.


In [41]:
# 1. Make a clean copy and sort
df_clean = df.copy()
df_clean = df_clean.sort_values(['oid', 'timestamp']).reset_index(drop=True)

# Orientation columns containing missing values
orient_cols = [
    'body_roll','body_pitch','body_yaw',
    'head_roll','head_pitch','head_yaw'
]

# 2. Linear interpolation within each oid group (interior gaps only)
df_clean[orient_cols] = df_clean.groupby('oid')[orient_cols].transform(lambda g: g.interpolate(method='linear', limit_area='inside'))

# 3. Forward-fill and back-fill the remaining leading/trailing gaps within each oid group
df_clean[orient_cols] = df_clean.groupby('oid')[orient_cols].transform(lambda g: g.ffill().bfill())

# 4. Global median fill (in case any column still has missing values)
for col in orient_cols:
    df_clean[col] = df_clean[col].fillna(df_clean[col].median())

# Check if any missing values remain
print(df_clean[orient_cols].isna().sum())


body_roll     0
body_pitch    0
body_yaw      0
head_roll     0
head_pitch    0
head_yaw      0
dtype: int64
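
How the group-wise steps interact is easiest to see on a toy frame. The sketch below compares the two possible orderings of fill and interpolation on made-up values (the column name `yaw` and the function names are illustrative only), then applies a global median to the track that has no known values.

```python
import numpy as np
import pandas as pd

# Toy data: oid 1 has a leading gap and an interior gap, oid 2 has no values
toy = pd.DataFrame({
    'oid': [1, 1, 1, 1, 1, 2, 2],
    'yaw': [np.nan, 10.0, np.nan, np.nan, 40.0, np.nan, np.nan],
})

def fill_then_interpolate(s):
    # ffill/bfill closes every gap first, so interpolate has nothing left to do
    return s.ffill().bfill().interpolate(method='linear')

def interpolate_then_fill(s):
    # interpolate only between known values, then fill the track edges
    return s.interpolate(method='linear', limit_area='inside').ffill().bfill()

a = toy.groupby('oid')['yaw'].transform(fill_then_interpolate)
b = toy.groupby('oid')['yaw'].transform(interpolate_then_fill)

# Tracks without any known value stay NaN and fall back to the global median
b_filled = b.fillna(toy['yaw'].median())

print(pd.DataFrame({'oid': toy['oid'], 'fill_first': a, 'interp_first': b, 'final': b_filled}))
```

With fill first, the interior gap is a flat copy of the previous value; with interpolation first, it becomes a ramp between the two known values. Which behaviour is preferable depends on how smoothly the real orientation signal changes.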


## 4. Feature engineering
We'll create features useful for EDA and plotting:
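- num_neighbors: length of the other_oid list. In the sample rows shown in section 1 this list also contains the object's own oid, so the count includes the object itself.
- first_neighbor_oid, n1_x, n1_y: id and position of the first entry in the neighbour lists.
- centroid_other_x, centroid_other_y: mean position of the listed neighbours.
- dx, dy, dt, speed: change in position and time between consecutive rows of the same oid, and the resulting approximate speed.
- dist_to_centroid: Euclidean distance from the object to the neighbours' centroid.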

In [6]:
from math import hypot

df_feat = df_clean.copy()

df_feat['num_neighbors'] = df_feat['other_oid'].apply(lambda x: len(x) if isinstance(x, (list,tuple)) else 0)

def safe_get_list(lst, idx, default=np.nan):
    try:
        return lst[idx]
    except Exception:
        return default

df_feat['first_neighbor_oid'] = df_feat['other_oid'].apply(lambda x: safe_get_list(x,0))
df_feat['n1_x'] = df_feat['other_x'].apply(lambda x: safe_get_list(x,0))
df_feat['n1_y'] = df_feat['other_y'].apply(lambda x: safe_get_list(x,0))

def centroid(xs):
    try:
        xs = list(xs)
        return float(np.mean(xs)) if len(xs)>0 else np.nan
    except Exception:
        return np.nan

df_feat['centroid_other_x'] = df_feat['other_x'].apply(centroid)
df_feat['centroid_other_y'] = df_feat['other_y'].apply(centroid)

# dx, dy and speed per oid by timestamp differences
df_feat[['dx','dy']] = df_feat.groupby('oid')[['x','y']].diff()
df_feat['dt'] = df_feat.groupby('oid')['timestamp'].diff()
df_feat['speed'] = np.sqrt(df_feat['dx']**2 + df_feat['dy']**2) / df_feat['dt']

df_feat['dist_to_centroid'] = np.sqrt((df_feat['x']-df_feat['centroid_other_x'])**2 + (df_feat['y']-df_feat['centroid_other_y'])**2)

df_feat[['num_neighbors','n1_x','n1_y','centroid_other_x','centroid_other_y','dx','dy','dt','speed','dist_to_centroid']].head()


Unnamed: 0,num_neighbors,n1_x,n1_y,centroid_other_x,centroid_other_y,dx,dy,dt,speed,dist_to_centroid
0,5,496191.745181,5405736.0,496203.686993,5405733.0,,,,,21.40803
1,4,496182.301754,5405731.0,496201.164748,5405734.0,-0.105602,0.10749,0.1,1.506848,19.092244
2,3,496182.196345,5405731.0,496198.037843,5405735.0,-0.105409,0.107222,0.1,1.503587,16.293557
3,3,496218.200121,5405738.0,496197.898337,5405735.0,-0.105409,0.107222,0.1,1.503587,16.247667
4,3,496181.985526,5405731.0,496197.724425,5405735.0,-0.105409,0.107222,0.1,1.503587,16.167645
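
In the first rows shown in section 1, the tracked object's own `oid` (50187) appears inside its `other_oid` list, so the neighbour count and centroid computed from the raw lists include the object itself. If that holds across the dataset, a variant that removes the object first may be closer to what "neighbour" suggests. The sketch below does this for one record; `drop_self` is a hypothetical helper and the values are copied from the first record.

```python
def drop_self(oid, other_oid, other_x, other_y):
    """Return neighbour ids and positions with the tracked object itself removed."""
    kept = [(o, x, y) for o, x, y in zip(other_oid, other_x, other_y) if o != oid]
    ids = [o for o, _, _ in kept]
    xs = [x for _, x, _ in kept]
    ys = [y for _, _, y in kept]
    return ids, xs, ys

# First record of the dataset
oid = 50187
other_oid = [47646, 50181, 50184, 50187]
other_x = [495923.373133135, 495899.069769386, 495899.056786096, 495854.640309584]
other_y = [5405744.32136751, 5405738.47595118, 5405739.18984693, 5405750.91234782]

ids, xs, ys = drop_self(oid, other_oid, other_x, other_y)
num_true_neighbours = len(ids)
# Centroid of the remaining neighbours (NaN when the object is alone)
centroid_x = sum(xs) / len(xs) if xs else float('nan')
centroid_y = sum(ys) / len(ys) if ys else float('nan')
print(num_true_neighbours, centroid_x, centroid_y)
```

Applied row by row, this would lower every count by one wherever the object lists itself and would move the centroid away from the object's own position.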


## 5. EDA explanations
We'll perform EDA that inspects distributions, correlations, missingness (now reduced), trajectories, neighbor relationships, and orientation behaviors.
All plots use Plotly for interactivity.


In [7]:
# Helper: pick a sample of oids for some plots to keep rendering fast
sample_oids = df_feat['oid'].unique()[:6]

print('Sample oids used in examples:', sample_oids)


Sample oids used in examples: [7682 7683 7684 8072 8075 8217]


### Beginner interactive plots (10)
Each plot has a short caption in a markdown cell before it in the notebook. We'll create:
1. Trajectory (x,y) scatter + lines (interactive)
2. Timestamp vs x (line)
3. Timestamp vs y (line)
4. Speed over time
5. Histogram of num_neighbors
6. Scatter: num_neighbors vs speed
7. Heatmap (2D density) of positions
8. Boxplots of speed by oid (sample)
9. Orientation time-series (body_roll/pitch/yaw)
10. Missingness heatmap (after imputation should be clean)


# A. Trajectory (x,y) scatter + lines (interactive)

In [8]:
# 1 Trajectory: interactive for chosen oids
fig = px.line(df_feat[df_feat['oid'].isin(sample_oids)], x='x', y='y', color='oid', markers=True, title='Trajectories (x vs y) for sample oids')
fig.update_layout(height=600)
fig.show()


**Key points:** Shows the spatial paths of the selected oids and reveals overlapping tracks and direction of travel.


# B. Timestamp vs X and Y (line)

In [30]:
# 2 Timestamp vs x
fig = px.line(df_feat[df_feat['oid'].isin(sample_oids)], x='timestamp', y='x', color='oid', title='Timestamp vs X')
fig.show()

# 3 Timestamp vs y
fig = px.line(df_feat[df_feat['oid'].isin(sample_oids)], x='timestamp', y='y', color='oid', title='Timestamp vs Y')
fig.show()

**Key points:** Temporal progression of X and Y for each oid, useful for detecting sudden jumps or sensor errors.


# C. Speed Over Time

In [11]:
# 4 Speed over time
fig = px.line(df_feat[df_feat['oid'].isin(sample_oids)], x='timestamp', y='speed', color='oid', title='Approximate Speed over time')
fig.update_yaxes(type='log')
fig.show()


**Key points:** The log-scaled axis makes both bursts and near-zero motion visible in one plot. Frames with a speed of exactly zero cannot be placed on a log axis and are not drawn.
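
Because a log axis has no position for zero, stationary frames disappear from the speed plot. One possible workaround, sketched below on made-up values, is to count those frames and clip speeds to a small positive floor before plotting. The floor of one tenth of the smallest positive speed is an arbitrary assumption, not a recommendation from the data.

```python
import numpy as np
import pandas as pd

# Made-up speeds: NaN (first row of a track), a stationary frame, and moving frames
speed = pd.Series([np.nan, 0.0, 0.02, 1.5, 12.0])

# Frames that a log axis cannot show
n_not_drawable = int((speed <= 0).sum())

# Arbitrary floor: one tenth of the smallest positive speed
floor = speed[speed > 0].min() / 10
speed_for_log = speed.clip(lower=floor)  # NaN values stay NaN

print(n_not_drawable, floor)
print(speed_for_log.tolist())
```

Plotting `speed_for_log` instead of `speed` keeps stationary frames visible at the bottom of the axis, at the cost of showing a value that was not measured.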


# D. Histogram of num_neighbors

In [12]:
# 5 Histogram of num_neighbors
fig = px.histogram(df_feat, x='num_neighbors', nbins=20, title='Distribution of number of neighbours')
fig.show()


# E. Scatter: num_neighbors vs speed

In [13]:
# 6 Scatter: num_neighbors vs speed
fig = px.scatter(df_feat.sample(2000, random_state=1), x='num_neighbors', y='speed', hover_data=['oid','timestamp'], title='Neighbors vs Speed (sample)')
fig.show()


# F. Heatmap (2D density) of positions

In [14]:
# 7 2D density heatmap of positions
fig = px.density_heatmap(df_feat.sample(2000, random_state=2), x='x', y='y', nbinsx=60, nbinsy=60, title='Position density heatmap (sample)')
fig.update_layout(height=600)
fig.show()


# G. Boxplots of speed by oid (sample)

In [15]:
# 8 Boxplots of speed by oid (sample top 12 frequent oids)
top_oids = df_feat['oid'].value_counts().nlargest(12).index.tolist()
fig = px.box(df_feat[df_feat['oid'].isin(top_oids)], x='oid', y='speed', title='Speed distribution for top 12 oids')
fig.show()


# H. Orientation time-series (body_roll/pitch/yaw)

In [16]:
# 9 Orientation time-series (body)
track = df_feat[df_feat['oid']==sample_oids[0]]  # filter once and reuse for every trace
fig = go.Figure()
for c in ['body_roll','body_pitch','body_yaw']:
    fig.add_trace(go.Scatter(x=track['timestamp'], y=track[c], mode='lines', name=c))
fig.update_layout(title=f'Body orientations over time for oid {sample_oids[0]}')
fig.show()


# I. Missingness heatmap (after imputation should be clean)

In [17]:
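# 10 Missingness heatmap after imputation (binary)
# Note: dx, dy, dt and speed are still NaN on the first row of each oid, because diff() has no previous row there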
miss = df_feat.isna().astype(int)
fig = px.imshow(miss.T, aspect='auto', labels=dict(x='row', y='column'), title='Missingness matrix (1=missing)')
fig.update_layout(height=400)
fig.show()
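
A numeric summary is a useful companion to the missingness matrix. The sketch below, on a made-up frame, shows why the engineered difference columns are expected to keep exactly one missing value per track even after the orientation columns are fully imputed.

```python
import pandas as pd

# Toy tracks (made-up values): two oids with three and two rows
toy = pd.DataFrame({
    'oid': [1, 1, 1, 2, 2],
    'timestamp': [0.0, 0.1, 0.2, 5.0, 5.1],
    'x': [0.0, 1.0, 2.0, 10.0, 10.5],
})

# Same construction as the engineered features: differences within each track
toy['dx'] = toy.groupby('oid')['x'].diff()
toy['dt'] = toy.groupby('oid')['timestamp'].diff()

missing_per_column = toy.isna().sum()
n_tracks = toy['oid'].nunique()

print(missing_per_column)
print('tracks:', n_tracks)
```

If the count of missing `dx` values in the real data equals the number of distinct oids, the remaining gaps come from `diff()` alone and not from the source data.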


### Advanced interactive plots (10)
These use Plotly advanced features and combinations. Examples:
1. Animated trajectory (frame by timestamp)
2. 3D scatter of (x,y,timestamp) or (x,y,body_yaw)
3. Quiver-like orientation vectors (using line segments)
4. Parallel coordinates for orientation features
5. Clustered scatter (k-means) with interactive selection
6. Density contours + scatter
7. Trajectories colored by speed (continuous)
8. Time-series small multiples (facets) of num_neighbors
9. Sankey-style flow of top interactions (who appears together) — simplified
10. Spatial hexbin / aggregated map with hover summaries


# Animated trajectory (frame by timestamp)

In [18]:
# 1 Animated trajectory (frames by timestamp) - sample
sample = df_feat[df_feat['oid'].isin(sample_oids)].copy()
sample['t_str'] = sample['timestamp'].astype(str)
fig = px.scatter(sample, x='x', y='y', animation_frame='t_str', animation_group='oid', color='oid', size_max=8, title='Animated trajectories over timestamp (sample)')
fig.update_layout(height=600)
fig.show()


# 3D scatter of (x,y,timestamp) or (x,y,body_yaw)

In [40]:
# 2 3D scatter (x, y, timestamp)
fig = px.scatter_3d(df_feat.sample(2000, random_state=3), x='x', y='y', z='timestamp', color='num_neighbors', title='3D scatter: x,y,timestamp (sample)')
fig.show()


# Quiver-like orientation vectors (using line segments)

In [37]:
# 3 Quiver-ish plot: show orientation vectors for sample rows
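samp = df_feat[df_feat['oid'].isin(sample_oids)].sample(500, random_state=4)
samp = samp.dropna(subset=['body_yaw'])  # use body_yaw only where it is valid
scale = 0.5e-3
# use body_yaw (degrees) to create a direction vector per row
yaw = np.deg2rad(samp['body_yaw'].to_numpy())
x0, y0 = samp['x'].to_numpy(), samp['y'].to_numpy()
gap = np.full(len(samp), np.nan)  # NaN separators break the line between segments
seg_x = np.column_stack([x0, x0 + np.cos(yaw)*scale, gap]).ravel()
seg_y = np.column_stack([y0, y0 + np.sin(yaw)*scale, gap]).ravel()
# draw all segments as a single trace instead of one trace per row
fig = go.Figure(go.Scatter(x=seg_x, y=seg_y, mode='lines', line=dict(width=1), showlegend=False))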
fig.update_layout(title='Quiver-like plot using body_yaw (sample)')
fig.show()


# Parallel coordinates for orientation features

In [33]:
# 4 Parallel coordinates (orientations + speed) - sample
pcols = ['body_roll','body_pitch','body_yaw','head_roll','head_pitch','head_yaw','speed']
par_sample = df_feat[pcols + ['oid']].dropna().sample(600, random_state=5)
fig = px.parallel_coordinates(par_sample, dimensions=pcols, color=par_sample['speed'], title='Parallel coordinates: orientations and speed (sample)')
fig.show()


# Clustered scatter (k-means) with interactive selection

In [34]:
# 5 KMeans clustering on positions (simple) and interactive scatter
from sklearn.cluster import KMeans
s2 = df_feat[['x','y']].sample(2000, random_state=6).dropna()
kmeans = KMeans(n_clusters=6, random_state=6).fit(s2)
s2['cluster'] = kmeans.labels_
fig = px.scatter(s2, x='x', y='y', color='cluster', title='KMeans clusters of positions (sample)')
fig.show()


# Density contours + scatter

In [35]:
# 6 Contour density + scatter
s3 = df_feat.sample(2000, random_state=7)
fig = make_subplots(rows=1, cols=1)
fig.add_trace(go.Histogram2dContour(x=s3['x'], y=s3['y'], contours=dict(coloring='heatmap')))
fig.add_trace(go.Scatter(x=s3['x'], y=s3['y'], mode='markers', marker=dict(size=3), name='points'))
fig.update_layout(title='Density contours + scatter (sample)')
fig.show()


# Trajectories colored by speed (continuous)

In [24]:
# 7 Trajectories colored by speed
s4 = df_feat.sample(2000, random_state=8)
fig = px.scatter(s4, x='x', y='y', color='speed', size='num_neighbors', title='Trajectories colored by speed (sample)')
fig.show()


# Time-series small multiples (facets) of num_neighbors

In [25]:
# 8 Small multiples: num_neighbors over time per oid (facet)
small = df_feat[df_feat['oid'].isin(sample_oids)]
fig = px.line(small, x='timestamp', y='num_neighbors', color='oid', facet_col='oid', title='Num neighbors over time (facets)')
fig.show()


# Sankey-style flow of top interactions (who appears together) — simplified

In [26]:
# 9 Interaction summary: compute pairs frequency (simplified)
from collections import Counter
pairs = Counter()
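for other in df_feat['other_oid'].dropna():
    if isinstance(other, (list,tuple)):
        # The order of ids varies between rows, so sort first:
        # (a, b) and (b, a) are then counted as the same pair
        ids = sorted(other)
        for i in range(len(ids)):
            for j in range(i+1,len(ids)):
                pairs[(ids[i], ids[j])] += 1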
top_pairs = pairs.most_common(20)
nodes = set()
for (a,b),cnt in top_pairs:
    nodes.add(a); nodes.add(b)
nodes = list(nodes)
node_idx = {n:i for i,n in enumerate(nodes)}
sankey = dict(
    node = dict(label=[str(n) for n in nodes]),
    link = dict(source=[node_idx[a] for (a,b),_ in top_pairs], target=[node_idx[b] for (a,b),_ in top_pairs], value=[cnt for _,cnt in top_pairs])
)
fig = go.Figure(go.Sankey(sankey))
fig.update_layout(title='Top co-occurring pairs (simplified Sankey)')
fig.show()
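
The order of ids inside `other_oid` changes from row to row (compare the first two records in section 1), which matters when pairs are counted as tuples. The sketch below counts pairs for those two records with and without sorting, using `itertools.combinations`; the function name `count_pairs` is illustrative.

```python
from collections import Counter
from itertools import combinations

# other_oid lists of the first two records in the dataset
rows = [
    [47646, 50181, 50184, 50187],
    [50181, 50187, 50184, 47646],
]

def count_pairs(rows, sort_ids):
    """Count co-occurring id pairs; optionally sort ids so pair order is ignored."""
    counts = Counter()
    for ids in rows:
        ordered = sorted(ids) if sort_ids else ids
        counts.update(combinations(ordered, 2))
    return counts

as_listed = count_pairs(rows, sort_ids=False)
unordered = count_pairs(rows, sort_ids=True)

print(len(as_listed), 'keys when list order is kept')
print(len(unordered), 'keys when ids are sorted first')
```

With sorting, the same four objects give six pairs that each occur twice. Without it, several pairs are split across two keys such as `(47646, 50181)` and `(50181, 47646)`, which would understate their frequency in a top-20 ranking.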


# Spatial hexbin / aggregated map with hover summaries

In [27]:
# 10 Spatial aggregation: hexbin-like using binning and hover summary
s5 = df_feat.copy()
s5['bx'] = pd.cut(s5['x'], bins=40)
s5['by'] = pd.cut(s5['y'], bins=40)
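# observed=True keeps only bins that contain data instead of all 40 x 40 combinations
agg = s5.groupby(['bx','by'], observed=True).agg(count=('oid','count'), mean_speed=('speed','mean')).reset_index()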
agg['bx_mid'] = agg['bx'].apply(lambda r: r.mid if hasattr(r, 'mid') else np.nan)
agg['by_mid'] = agg['by'].apply(lambda r: r.mid if hasattr(r, 'mid') else np.nan)
fig = px.scatter(agg, x='bx_mid', y='by_mid', size='count', color='mean_speed', title='Spatial aggregated bins (count and mean speed)')
fig.show()
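
The binning approach can be checked on a small example. The sketch below uses made-up coordinates and two bins per axis to show what `pd.cut` plus a grouped aggregation produces, including how a missing speed is handled by the mean.

```python
import numpy as np
import pandas as pd

# Made-up points: two near the origin, two near (10, 10); one speed is missing
toy = pd.DataFrame({
    'x': [0.0, 1.0, 9.0, 10.0],
    'y': [0.0, 1.0, 9.0, 10.0],
    'speed': [1.0, 3.0, np.nan, 5.0],
})

toy['bx'] = pd.cut(toy['x'], bins=2)
toy['by'] = pd.cut(toy['y'], bins=2)

# observed=True returns only the bin combinations that contain points
agg = (toy.groupby(['bx', 'by'], observed=True)
          .agg(count=('x', 'size'), mean_speed=('speed', 'mean'))
          .reset_index())

# Interval midpoints serve as plotting coordinates for each bin
agg['bx_mid'] = [iv.mid for iv in agg['bx']]
agg['by_mid'] = [iv.mid for iv in agg['by']]
print(agg[['bx_mid', 'by_mid', 'count', 'mean_speed']])
```

The mean skips the missing speed, so a bin's `mean_speed` can rest on fewer rows than its `count`; showing both in the hover text makes that visible.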






## 6. Key findings and features (summary)

### Key findings
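- **Missingness:** Each orientation column was missing in 2,698 of 4,759 rows (about 57%). Group-wise filling within each oid (forward/backward fill and interpolation) plus a global-median fallback left no missing values. Because more than half of the orientation values are imputed, orientation plots should be read with that in mind.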
- **Trajectories:** Objects generally follow smooth trajectories; some overlap indicates interactions or crossing paths.
- **Neighbors:** The distribution of num_neighbors shows that most frames have a small number of nearby objects, while some frames have many, which is useful for detecting crowded scenes.
- **Speed patterns:** Mostly near-zero speeds with occasional spikes (bursts). Log scaling helps reveal large spikes.
- **Interactions:** Frequent co-occurring pairs can be extracted from other_oid lists to build interaction networks.

### Feature engineering implemented
- num_neighbors (count of other_oid)
- first_neighbor_oid, n1_x, n1_y
- centroid_other_x, centroid_other_y
- dx, dy, dt, speed
- dist_to_centroid

These features power visualisations and can be used for downstream tasks (anomaly detection, interaction modeling, clustering).


## 7. Notes, reproducibility and next steps
- All visualisations use Plotly for interactivity. If running locally, ensure plotly is installed (e.g., pip install plotly).
- For very large datasets, reduce samples in plotting to keep interactive responsiveness.
- Next steps: build a Streamlit dashboard or export cleaned dataset to CSV for modelling.
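
When exporting the cleaned dataset to CSV, the list-valued columns are written as strings again, so a reader of the file has to parse them as in section 1. The sketch below shows a round trip on a tiny made-up frame using an in-memory buffer; with a real file, the buffer would be replaced by a path of your choice.

```python
import ast
import io
import pandas as pd

# Tiny stand-in for the cleaned frame (ids taken from the first records)
toy = pd.DataFrame({
    'oid': [50187, 50187],
    'other_oid': [[47646, 50181], [50181, 47646]],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Parse the list column while reading, so it comes back as Python lists
restored = pd.read_csv(buffer, converters={'other_oid': ast.literal_eval})
print(restored['other_oid'].tolist())
```

Passing `converters` at read time avoids a separate conversion loop after loading. A binary format that stores nested values natively would be an alternative if the file is only used from Python.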
