# Traffic estimation

### Let's start by assigning categorical labels rather than continuous values, because it is simpler

**How can we estimate traffic in the city?**

In [13]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from geopy.distance import great_circle


In [9]:
all_data = pd.read_csv('Geolife_all_data.csv')
# read_csv loads the time column as plain strings, so it has to be converted to datetime again
all_data['time'] = pd.to_datetime(all_data['time'])
all_data

| Unnamed: 0 | lat | lon | alt | time | user_id |
|---|---|---|---|---|---|
| 0 | 41.741415 | 86.186028 | -777.0 | 2008-03-31 16:00:08 | 10 |
| 1 | 41.737063 | 86.179470 | -777.0 | 2008-03-31 16:01:07 | 10 |
| 2 | 41.734105 | 86.172823 | -777.0 | 2008-03-31 16:02:07 | 10 |
| 3 | 41.739110 | 86.166563 | -777.0 | 2008-03-31 16:03:06 | 10 |
| 4 | 41.744368 | 86.159987 | -777.0 | 2008-03-31 16:04:05 | 10 |
| ... | ... | ... | ... | ... | ... |
| 2044968 | 40.070186 | 116.314153 | -45.0 | 2008-11-29 02:01:31 | 179 |
| 2044969 | 40.070193 | 116.314041 | -48.0 | 2008-11-29 02:01:33 | 179 |
| 2044970 | 40.070224 | 116.313923 | -51.0 | 2008-11-29 02:01:35 | 179 |
| 2044971 | 40.070227 | 116.313843 | -56.0 | 2008-11-29 02:01:37 | 179 |


In [10]:
all_data.columns

Index(['lat', 'lon', 'alt', 'time', 'user_id'], dtype='object')

### Idea 1:


Calculate vehicle speeds from consecutive observations to gauge movement.

Aggregate data spatially and temporally to estimate local traffic conditions.

Use density and speed metrics to categorize traffic via unsupervised clustering.

In [None]:
df = all_data.copy()
print('Start step 1')
# Step 1: Calculate speeds
# Compute the speed (m/s) from each observation to the same user's next one; the last observation of each user gets NaN.
# Not yet verified to work correctly.
def compute_speed(group):
    group = group.sort_values('time')
    speeds = []
    for i in range(len(group) - 1):
        loc1 = (group.iloc[i]['lat'], group.iloc[i]['lon'])
        loc2 = (group.iloc[i+1]['lat'], group.iloc[i+1]['lon'])
        dt = (group.iloc[i+1]['time'] - group.iloc[i]['time']).total_seconds()
        speed = great_circle(loc1, loc2).meters / dt if dt > 0 else 0
        speeds.append(speed)
    speeds.append(np.nan)
    group['speed'] = speeds
    return group

df = df.groupby('user_id').apply(compute_speed).reset_index(drop=True)

print('Start step 2')
# Step 2: Assign grid cells and time windows
def get_grid_cell(lat, lon, size=100):
    """
    Map a latitude/longitude pair to a grid cell index (simple grid mapping).

    Each cell spans size / 10000 degrees per side, so the default size=100
    gives 0.01-degree cells (roughly 1.1 km north to south). Adjust size
    to change the precision.
    """
    return (int(lat * 10000 / size), int(lon * 10000 / size))

df['grid_cell'] = df.apply(lambda row: get_grid_cell(row['lat'], row['lon']), axis=1)
df['time_window'] = df['time'].dt.floor('5min')

print('Start step 3')
# Step 3: Aggregate
agg_data = []
for (cell, tw), group in df.groupby(['grid_cell', 'time_window']):
    N = group['user_id'].nunique()
    if N > 0:
        S = group.groupby('user_id')['speed'].mean().mean()  # Average of user averages
        agg_data.append({'grid_cell': cell, 'time_window': tw, 'N': N, 'S': S})

agg_df = pd.DataFrame(agg_data)

print('Start step 4')
# Step 4: Cluster
features = agg_df[['N', 'S']].dropna()
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
kmeans = KMeans(n_clusters=3, random_state=0)
labels = kmeans.fit_predict(features_scaled)
agg_df['traffic'] = pd.Series(labels, index=features.index)

# Assign labels (adjust based on actual centers)
centers = scaler.inverse_transform(kmeans.cluster_centers_)
traffic_map = {0: 'low', 1: 'medium', 2: 'high'}  # Example; validate with centers
agg_df['traffic'] = agg_df['traffic'].map(traffic_map)

# Step 5: Map back to observations
df = df.merge(agg_df[['grid_cell', 'time_window', 'traffic']], 
              on=['grid_cell', 'time_window'], how='left')
df['traffic'] = df['traffic'].fillna('unknown')

Start step 1


KeyboardInterrupt: 

### This takes too long to run for now, so let's try it on a subset: the first 500 observations of each user

In [12]:
# Keep the first 500 observations for each user_id
filtered_data = all_data.groupby('user_id').head(500).reset_index(drop=True)
filtered_data

| Unnamed: 0 | lat | lon | alt | time | user_id |
|---|---|---|---|---|---|
| 0 | 41.741415 | 86.186028 | -777.0 | 2008-03-31 16:00:08 | 10 |
| 1 | 41.737063 | 86.179470 | -777.0 | 2008-03-31 16:01:07 | 10 |
| 2 | 41.734105 | 86.172823 | -777.0 | 2008-03-31 16:02:07 | 10 |
| 3 | 41.739110 | 86.166563 | -777.0 | 2008-03-31 16:03:06 | 10 |
| 4 | 41.744368 | 86.159987 | -777.0 | 2008-03-31 16:04:05 | 10 |
| ... | ... | ... | ... | ... | ... |
| 23348 | 40.087501 | 116.313351 | 192.0 | 2008-08-21 09:03:44 | 179 |
| 23349 | 40.087483 | 116.313329 | 195.0 | 2008-08-21 09:03:46 | 179 |
| 23350 | 40.087469 | 116.313308 | 195.0 | 2008-08-21 09:03:48 | 179 |
| 23351 | 40.087469 | 116.313278 | 193.0 | 2008-08-21 09:03:50 | 179 |


In [16]:
df = filtered_data.copy()
print('Start step 1')
# Step 1: Calculate speeds
# Compute the speed (m/s) from each observation to the same user's next one; the last observation of each user gets NaN.
# Not yet verified to work correctly.
def compute_speed(group):
    group = group.sort_values('time')
    speeds = []
    for i in range(len(group) - 1):
        loc1 = (group.iloc[i]['lat'], group.iloc[i]['lon'])
        loc2 = (group.iloc[i+1]['lat'], group.iloc[i+1]['lon'])
        dt = (group.iloc[i+1]['time'] - group.iloc[i]['time']).total_seconds()
        speed = great_circle(loc1, loc2).meters / dt if dt > 0 else 0
        speeds.append(speed)
    speeds.append(np.nan)
    group['speed'] = speeds
    return group

df = df.groupby('user_id').apply(compute_speed).reset_index(drop=True)

print('Start step 2')
# Step 2: Assign grid cells and time windows
def get_grid_cell(lat, lon, size=100):
    """
    Map a latitude/longitude pair to a grid cell index (simple grid mapping).

    Each cell spans size / 10000 degrees per side, so the default size=100
    gives 0.01-degree cells (roughly 1.1 km north to south). Adjust size
    to change the precision.
    """
    return (int(lat * 10000 / size), int(lon * 10000 / size))

df['grid_cell'] = df.apply(lambda row: get_grid_cell(row['lat'], row['lon']), axis=1)
df['time_window'] = df['time'].dt.floor('5min')

print('Start step 3')
# Step 3: Aggregate
agg_data = []
for (cell, tw), group in df.groupby(['grid_cell', 'time_window']):
    N = group['user_id'].nunique()
    if N > 0:
        S = group.groupby('user_id')['speed'].mean().mean()  # Average of user averages
        agg_data.append({'grid_cell': cell, 'time_window': tw, 'N': N, 'S': S})

agg_df = pd.DataFrame(agg_data)

print('Start step 4')
# Step 4: Cluster
features = agg_df[['N', 'S']].dropna()
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
kmeans = KMeans(n_clusters=3, random_state=0)
labels = kmeans.fit_predict(features_scaled)
agg_df['traffic'] = pd.Series(labels, index=features.index)

# Assign labels (adjust based on actual centers)
centers = scaler.inverse_transform(kmeans.cluster_centers_)
traffic_map = {0: 'low', 1: 'medium', 2: 'high'}  # Example; validate with centers
agg_df['traffic'] = agg_df['traffic'].map(traffic_map)

# Step 5: Map back to observations
df = df.merge(agg_df[['grid_cell', 'time_window', 'traffic']], 
              on=['grid_cell', 'time_window'], how='left')
df['traffic'] = df['traffic'].fillna('unknown')

Start step 1
Start step 2
Start step 3
Start step 4
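
Step 1 was the only step the full-data run reached before it was interrupted, and its row-by-row `iloc` loop is a likely reason. Below is a minimal vectorized sketch of the same idea, not a drop-in replacement: it swaps geopy's `great_circle` for a hand-written haversine formula, assumes a mean Earth radius of 6,371 km, and returns NaN (not 0) when the time gap is not positive. The names `add_speed_vectorized` and `EARTH_RADIUS_M` are hypothetical.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in meters


def add_speed_vectorized(df):
    """Add a 'speed' column (m/s) from each fix to the same user's next fix."""
    df = df.sort_values(['user_id', 'time']).reset_index(drop=True)
    by_user = df.groupby('user_id')

    # Coordinates of the same user's next fix (NaN on each user's last row)
    lat1, lon1 = np.radians(df['lat']), np.radians(df['lon'])
    lat2 = np.radians(by_user['lat'].shift(-1))
    lon2 = np.radians(by_user['lon'].shift(-1))

    # Haversine great-circle distance in meters
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist = 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

    # Seconds until the same user's next fix
    dt = (by_user['time'].shift(-1) - df['time']).dt.total_seconds()

    # Keep the speed only where the time gap is positive; NaN otherwise
    df['speed'] = (dist / dt).where(dt > 0)
    return df
```

Running both this and `compute_speed` on a small sample and comparing the `speed` columns would also be one way to do the verification that the comment in Step 1 asks for.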


In [18]:
df.traffic.value_counts()

medium     20246
low         3093
unknown       10
high           4
Name: traffic, dtype: int64

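**Well, at least it is not "medium" all the time, so the approach is not entirely naive (especially for a first attempt). Keep in mind that the low/medium/high names are still placeholders until the mapping is validated against the cluster centers.**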
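
KMeans numbers its clusters arbitrarily, so a fixed `{0: 'low', 1: 'medium', 2: 'high'}` mapping only makes sense after looking at the centers. The sketch below shows one way to derive the names from the centers instead. It assumes a naming rule that is not in the original pipeline: the cluster with the slowest mean speed is called `'high'` traffic and the fastest `'low'`, ignoring density. The function name `label_clusters_by_speed` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def label_clusters_by_speed(features, names=('high', 'medium', 'low'), seed=0):
    """Cluster [N, S] rows and name the clusters from slowest to fastest center.

    Assumption of this sketch: the slowest cluster is 'high' traffic.
    """
    scaler = StandardScaler()
    scaled = scaler.fit_transform(features)
    kmeans = KMeans(n_clusters=len(names), n_init=10, random_state=seed)
    labels = kmeans.fit_predict(scaled)

    # Centers back in original units: column 0 is N, column 1 is S
    centers = scaler.inverse_transform(kmeans.cluster_centers_)
    order = np.argsort(centers[:, 1])  # cluster ids, slowest center first
    id_to_name = {int(cid): names[rank] for rank, cid in enumerate(order)}
    named = np.array([id_to_name[int(lab)] for lab in labels])
    return named, centers
```

Called as `label_clusters_by_speed(features.to_numpy())` on the `features` frame from Step 4, it returns one name per row plus the centers in original units, which can be printed to check whether the rule is reasonable for this data.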

### How to make continuous predictions and improve the pipeline?

In general, traffic load is influenced by:

Density: The number of unique vehicles (or users) in a specific area (e.g., a grid cell) over a given time window. Higher density suggests more crowded conditions.

Speed: The average speed of vehicles in that area and time window. Lower speeds often indicate congestion, while higher speeds suggest smoother flow.

The goal is to combine these indicators into a single continuous metric that increases with density and decreases with speed, reflecting real-world traffic dynamics.
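
One possible shape for such a metric is sketched below, purely as an illustration. It assumes a scoring rule that is not part of the pipeline above: density and speed are each min-max scaled to [0, 1], speed is inverted, and the two are blended with an arbitrary weight. The function name `traffic_load` and the parameter `w_density` are hypothetical.

```python
import pandas as pd


def traffic_load(agg, w_density=0.5):
    """Continuous load in [0, 1] from per-cell user counts N and mean speeds S.

    Hypothetical rule: weighted sum of scaled density and inverted scaled
    speed. The weight w_density is an arbitrary choice for this sketch.
    """
    def minmax(col):
        span = col.max() - col.min()
        # A constant column carries no information, so it is mapped to 0
        return (col - col.min()) / span if span > 0 else col * 0.0

    density = minmax(agg['N'].astype(float))
    slowness = 1.0 - minmax(agg['S'].astype(float))
    # Grows with density, shrinks with speed; NaN speeds stay NaN
    return w_density * density + (1.0 - w_density) * slowness
```

Applied to the aggregated frame, this would be `agg_df['load'] = traffic_load(agg_df)`. Because the scaling uses the observed minimum and maximum, the values are relative to this dataset, not absolute congestion levels.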

However, our data is probably not sufficient (at least for now) to estimate density, so we need either a function that takes speed as input and produces a continuous traffic value as output, or more data for a better representation. Another thing to consider is data aggregation. Currently we map every input observation to an output. A more realistic approach would map a group of inputs to a single output, since we do not need a traffic estimate every two seconds and per-observation estimates require a lot of computation.
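
A minimal sketch of that speed-only, aggregated variant follows. It produces one value per grid cell and time window and leaves the individual observations unlabeled. The scoring rule is an assumption: the median speed of the group is compared against a reference "free-flow" speed, which defaults to the 95th percentile of the aggregated speeds as a stand-in. The names `speed_based_traffic` and `free_flow` are hypothetical, and `grid_cell` is treated here as any hashable cell id.

```python
import pandas as pd


def speed_based_traffic(df, window='5min', free_flow=None):
    """One traffic score per (grid_cell, time window) from speed alone.

    Expects columns 'grid_cell', 'time' (datetime) and 'speed' (m/s).
    free_flow is a hypothetical reference speed in m/s.
    """
    valid = df.dropna(subset=['speed'])
    grouped = (valid
               .assign(time_window=valid['time'].dt.floor(window))
               .groupby(['grid_cell', 'time_window'])['speed']
               .median()
               .rename('median_speed')
               .reset_index())
    if free_flow is None:
        # Stand-in reference speed: 95th percentile of the group medians
        free_flow = grouped['median_speed'].quantile(0.95)

    # 0 = at or above the reference speed, 1 = standing still
    ratio = (grouped['median_speed'] / free_flow).clip(upper=1.0)
    grouped['traffic'] = 1.0 - ratio
    return grouped
```

Widening `window` or the grid cells reduces the number of groups and the computation, at the cost of coarser estimates.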

