<a href="https://colab.research.google.com/github/2ovisa/AH2179/blob/main/project5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Stop ahead prediction
- intressant för resenärer att veta hur stor förseningen blir vid sitt stop
- intressant för resenärer att veta hur stor förseningen är vid sin påstigande station

##Data Preparation

In [155]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

#-------------------------------------------------data preprocessing------------------------------------------------------------------

url = 'https://raw.githubusercontent.com/zhenliangma/Applied-AI-in-Transportation/master/ProjectAssignmentData/Dataset-PT.csv'
df = pd.read_csv(url, header = 1)
#print(df.head(1))
#df.info()
#print(df.columns.tolist())

#df = df.iloc[:1000]



###Kolumner att droppa:
- Identifiers: Calender_date, route_id, bus_id (innehåller inte nödvändig information
- They who leak future knowledge: Arrival_delay (should be target)
- Dummies: To avoid multi collineraty one of the examples should be dropped. factor(temperature)Normal, (weather)normal, weekend, off-peak. *these become the baseline?*


In [156]:
#skapa unika trippar
df = df.sort_values(['bus_id', 'Calendar_date', 'stop_sequence']).reset_index(drop=True)

df['trip_number'] = df.groupby(['bus_id','Calendar_date', 'stop_sequence']).cumcount()
df['unique_trip'] = (
    df['bus_id'].astype(str) + '_' +
    df['Calendar_date'].astype(str) + '_' +
    df['trip_number'].astype(str)
)


In [None]:
#Dubbelkolla så att tripparna är unika och har count = 27
df.groupby('unique_trip')['stop_sequence'].agg(['min','max','count']).sort_values('count')


In [None]:
# 1) Varje trip har exakt stoppen 1..27
stops_ok = (
    df.groupby('unique_trip')['stop_sequence']
      .apply(lambda s: set(s.tolist()) == set(range(1,28)))
      .all()
)
print('Stops exakt 1..27 per trip:', stops_ok)

# 2) Strikt stigande med steg 1 inom varje trip
mono_ok = (
    df.groupby('unique_trip')['stop_sequence']
      .apply(lambda s: (s.diff().fillna(1) == 1).all())
      .all()
)
print('Strikt +1 mellan rader inom trip:', mono_ok)

# 3) Horizons pekar h stopp framåt inom samma trip
for h in [1,3,6,12]:
    s_future = df.groupby('unique_trip')['stop_sequence'].shift(-h)
    ok = (s_future.dropna() - df.loc[s_future.notna(), 'stop_sequence'] == h).all()
    print(f'H={h} korrekt skift:', ok)

# 4) Inga korsningar mellan trips vid skift
for h in [1,3,6,12]:
    uid_future = df.groupby('unique_trip')['unique_trip'].shift(-h)
    cross_ok = (uid_future.dropna() == df.loc[uid_future.notna(), 'unique_trip']).all()
    print(f'H={h} ingen kors-trip:', cross_ok)


In [159]:
#multi horizon targets
#df = df.sort_values(['unique_trip', 'stop_sequence'])

for h in [1,3,6,12]:
  df[f'arrival_delay_t+{h}'] = df.groupby('unique_trip')['arrival_delay'].shift(-h)

# ta bort rader utan target
df = df[df["arrival_delay_t+1"].notna()].copy()

*Why is arrival_delay used and not stop sequence?*
- *stop sequence is just an index telling which stop number in the trip. It only contains ordering, not delay information*
- *the code says "for stop i, the target is the arrival delay at stop i+1*

*this assumes each stop is ~5 minutes apart. If travel times vary a lot, then “t+5 min” might not really equal “next stop”*

In [160]:
# 3. FEATURE ENGINEERING (temporala + dynamiska features)
#-------------------------------------------------
# Normaliserad stopp-position
df["stop_sequence_norm"] = df.groupby("unique_trip")["stop_sequence"].transform(
    lambda x: (x - x.min()) / (x.max() - x.min())
)

# Stopp kvar till slutet
df["stops_remaining"] = df.groupby("unique_trip")["stop_sequence"].transform(
    lambda x: x.max() - x
)

# Differens i försening mot föregående stopp
df["delay_diff"] = df.groupby("unique_trip")["arrival_delay"].diff().fillna(0)

# Glidande medelvärde (3 stopp bakåt)
df["delay_ma3"] = df.groupby("unique_trip")["arrival_delay"].transform(
    lambda x: x.rolling(window=3, min_periods=1).mean()
)
df["delay_trend"] = df["arrival_delay"] - df["delay_ma3"]

#ökning/minskning av försening
df["delay_growth"] = df.groupby("unique_trip")["arrival_delay"].diff().fillna(0)



*använda ovan för visualisation, för varför annars ska dessa vara med?*

In [161]:
unique_trips = df["unique_trip"].unique()
split_point = int(len(unique_trips) * 0.8)
train_trips = unique_trips[:split_point]
test_trips = unique_trips[split_point:]

# skapa mask
train_mask = df["unique_trip"].isin(train_trips)
test_mask = ~train_mask


In [162]:
#skapa kopia innan droppa för visualisering
df_vis = df.copy()

drop_cols = [
    "Calendar_date", "bus_id", "route_id", "arrival_time",
    "stop_sequence", "unique_trip", "new_trip", "trip_number"
]
df = df.drop(columns=[c for c in drop_cols if c in df.columns])

# Baseline dummies tas bort för att undvika multikollinearitet
to_drop = ["factor(weather)Normal", "factor(temperature)Normal",
           "factor(day_of_week)weekend", "factor(time_of_day)Off-peak"]
df = pd.get_dummies(df, drop_first=False)
df = df.drop(columns=[c for c in to_drop if c in df.columns], errors="ignore")


- *if stop_sequence is a freature, the model might cheat by just learning that higher stop numbers mean later in the trip*

In [163]:
# --- features/targets ---
targets = [f"arrival_delay_t+{h}" for h in [1, 3, 6, 12]]

X = df.drop(["arrival_delay"] + targets, axis=1, errors="ignore")
y = df[targets]

In [164]:
#skala
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [165]:
X_train, X_test = X_scaled[train_mask], X_scaled[test_mask]
y_train, y_test = y[train_mask], y[test_mask]

##Multi-output models

###*KNN*

In [None]:
X_train.dtypes.value_counts()

In [166]:
models = {}
for h in [1,3,6,12]:
    y_h = df[f"arrival_delay_t+{h}"].dropna()
    X_h = X.loc[y_h.index]
    model = KNeighborsRegressor(n_neighbors=5)
    model.fit(X_h, y_h)
    models[h] = model

In [None]:
from sklearn.neighbors import KNeighborsRegressor


model = KNeighborsRegressor(n_neighbors=5)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

###*Random Forest Regressor*

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Träna en enda multi-output RF
model = RandomForestRegressor(n_estimators=80, max_depth=15, random_state=42)
model.fit(X_train, y_train)

# Prediktion
y_pred = model.predict(X_test)



###*LGBMR // XGBoost*

In [None]:
from lightgbm import LGBMRegressor
model = LGBMRegressor(n_estimators=300, learning_rate=0.05, max_depth=-1, num_leaves=64)
model.fit(X_train, y_train)


In [None]:
from xgboost import XGBRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_absolute_error, r2_score
import matplotlib.pyplot as plt

# Multi-output XGBoost
xgb_model = MultiOutputRegressor(
    XGBRegressor(
        n_estimators=400,
        learning_rate=0.05,
        max_depth=8,
        subsample=0.8,
        colsample_bytree=0.8,
        random_state=42,
        tree_method='hist'  # snabbare, mindre minne
    )
)

xgb_model.fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)


##Evalutation

In [167]:
#utvärdera xgboost
mae = mean_absolute_error(y_test, y_pred, multioutput='raw_values')
r2 = [r2_score(y_test.iloc[:, i], y_pred[:, i]) for i in range(y_test.shape[1])]

for i, col in enumerate(y_test.columns):
    print(f"{col}: MAE={mae[i]:.2f}, R²={r2[i]:.3f}")


ValueError: Found input variables with inconsistent numbers of samples: [56532, 60570]

In [169]:
from sklearn.metrics import mean_absolute_error, r2_score

for h, model in models.items():
    y_h = df[f"arrival_delay_t+{h}"].dropna()
    X_h = X.loc[y_h.index]

    # dela data (train/test)
    split = int(len(X_h) * 0.8)
    X_train_h, X_test_h = X_h.iloc[:split], X_h.iloc[split:]
    y_train_h, y_test_h = y_h.iloc[:split], y_h.iloc[split:]

    # prediktera
    y_pred_h = model.predict(X_test_h)

    # utvärdera
    mae = mean_absolute_error(y_test_h, y_pred_h)
    r2 = r2_score(y_test_h, y_pred_h)
    print(f"t+{h}: MAE={mae:.2f}, R²={r2:.3f}")


t+1: MAE=18.31, R²=0.978
t+3: MAE=32.48, R²=0.937
t+6: MAE=46.89, R²=0.877
t+12: MAE=70.30, R²=0.747


In [168]:
#evaluate

mae = mean_absolute_error(y_test, y_pred, multioutput='raw_values')
for col, score in zip(y.columns, mae):
    print(f"MAE for {col}: {score:.2f}")

ValueError: Found input variables with inconsistent numbers of samples: [56532, 60570]

KNN
- MAE for arrival_delay_t+1: 27.19
- MAE for arrival_delay_t+3: 43.65
- MAE for arrival_delay_t+6: 61.55
- MAE for arrival_delay_t+12: 88.90

single output
- t+1: MAE=18.31, R²=0.978
- t+3: MAE=32.48, R²=0.937
- t+6: MAE=46.89, R²=0.877
- t+12: MAE=70.30, R²=0.747

RF
- MAE for arrival_delay_t+1: 20.42
- MAE for arrival_delay_t+3: 37.31
- MAE for arrival_delay_t+6: 55.17
- MAE for arrival_delay_t+12: 81.35

XGBoost
- arrival_delay_t+1: MAE=19.95, R²=0.922
- arrival_delay_t+3: MAE=36.88, R²=0.878
- arrival_delay_t+6: MAE=54.21, R²=0.820
- arrival_delay_t+12: MAE=80.16, R²=0.725

In [None]:
# MAE/r2 över prediktionshorisonter: Visar hur prognosprecisionen försämras längre fram i rutten
horizons = [1, 3, 6, 12]
mae = mean_absolute_error(y_test, y_pred, multioutput='raw_values')
r2 = [r2_score(y_test.iloc[:,i], y_pred[:,i]) for i in range(len(horizons))]

fig, ax1 = plt.subplots(figsize=(8,5))
ax1.plot(horizons, mae, marker='o', label='MAE')
ax1.set_xlabel("Stops ahead")
ax1.set_ylabel("MAE (s)", color='tab:blue')
ax2 = ax1.twinx()
ax2.plot(horizons, r2, marker='s', color='tab:red', label='R²')
ax2.set_ylabel("R²", color='tab:red')
plt.title("Prediction accuracy by horizon")
plt.show()

In [None]:
#Korrelation mellan faktisk och predikterad försening per horisont
import seaborn as sns
for i, h in enumerate([1,3,6,12]):
    sns.scatterplot(x=y_test.iloc[:,i], y=y_pred[:,i])
    plt.title(f"Actual vs Predicted delay (t+{h})")
    plt.xlabel("Actual delay (s)")
    plt.ylabel("Predicted delay (s)")
    plt.plot([0, max(y_test.iloc[:,i])],[0, max(y_test.iloc[:,i])],'r--')
    plt.show()


In [None]:
#förseningens tillväxt längs rutten (delay propagation)
#visar hur förseningar ackumuleras under resans gång och hur väl modellen följer detta
#trip_id = df_vis["unique_trip"].iloc[0]
trip = df_vis[df_vis["unique_trip"]].sort_values("stop_sequence")

plt.figure(figsize=(10,5))
plt.plot(trip["stop_sequence"], trip["arrival_delay"], label="Actual", marker='o')
for i, h in enumerate([1,3,6,12]):
    plt.plot(trip["stop_sequence"], y_pred[:len(trip), i], '--', label=f"Predicted t+{h}")
plt.xlabel("Stop sequence")
plt.ylabel("Delay (s)")
plt.title(f"Delay propagation along trip")
plt.legend()
plt.grid(True)
plt.show()


In [None]:
plt.figure(figsize=(8,5))
plt.plot([1,3,6,12], mae, marker='o', label='MAE')
plt.title("XGBoost – MAE vs Prediction Horizon")
plt.xlabel("Stops Ahead")
plt.ylabel("Mean Absolute Error (s)")
plt.grid(True)
plt.show()


In [None]:
# skapa en kategori från de one-hot-kodade väderkolumnerna
weather_cols = [c for c in df_vis.columns if "factor(weather)" in c]
df_vis["weather_category"] = df_vis[weather_cols].idxmax(axis=1).str.replace("factor(weather)", "")

sns.boxplot(x="weather_category", y="arrival_delay", data=df_vis)
plt.title("Arrival Delay by Weather Condition")
plt.show()


In [None]:
tod_cols = [c for c in df_vis.columns if "factor(time_of_day)" in c]
df_vis["time_of_day_category"] = df_vis[tod_cols].idxmax(axis=1).str.replace("factor(time_of_day)", "")

sns.boxplot(x="time_of_day_category", y="arrival_delay", data=df_vis)
plt.title("Arrival Delay by Time of Day")
plt.show()


In [None]:
dow_cols = [c for c in df_vis.columns if "factor(day_of_week)" in c]
df_vis["day_category"] = df_vis[dow_cols].idxmax(axis=1).str.replace("factor(day_of_week)", "")

sns.boxplot(x="day_category", y="arrival_delay", data=df_vis)
plt.title("Arrival Delay by Weekday/Weekend")
plt.show()


In [None]:
mean_growth = df_vis.groupby("stop_sequence")["delay_growth"].mean()
plt.plot(mean_growth.index, mean_growth.values)


In [None]:
sns.boxplot(x="factor(weather)Rain", y="delay_growth", data=df_vis)
sns.boxplot(x="factor(time_of_day)Morning_peak", y="delay_growth", data=df_vis)


In [None]:
corrs = df_vis.groupby("stop_sequence")[["delay_growth", "factor(weather)Rain", "factor(time_of_day)Morning_peak"]].corr().unstack().iloc[:,1]
plt.plot(corrs.index.get_level_values(0), corrs.values)


In [None]:
pivot = df_vis.pivot_table(values="arrival_delay", index="unique_trip", columns="stop_sequence")
sns.heatmap(pivot, cmap="coolwarm", cbar_kws={'label': 'Delay (s)'})


In [None]:
#visualization
#Horizon performance
plt.figure(figsize=(10,6))
plt.plot([1,3,6,12], mae, marker='o')
plt.title("MAE vs Prediction Horizon (Stops Ahead)")
plt.xlabel("Stops Ahead")
plt.ylabel("Mean Absolute Error (s)")
plt.grid(True)
plt.show()

In [None]:
#r2-fall över horisonter
r2_scores = [r2_score(y_test.iloc[:, i], y_pred[:, i]) for i in range(y.shape[1])]
plt.plot([1,3,6,12], r2_scores, marker='o')
plt.title("R² per Horizon")
plt.xlabel("Stops Ahead")
plt.ylabel("R²")
plt.grid(True)
plt.show()


In [None]:
#delay propagation (för några turer)
trip_example = df_vis[df_vis['unique_trip'].isin(df_vis['unique_trip'].unique()[:3])]
sns.lineplot(data=trip_example, x='stop_sequence', y='arrival_delay', hue='unique_trip')
plt.title("Delay Propagation Along Stops")
plt.show()

In [None]:
#feature importance
importances = model.feature_importances_
feat_imp = pd.Series(importances, index=X.columns).sort_values(ascending=False)
feat_imp.head(15).plot(kind='barh', figsize=(8,6))
plt.title("Top 15 Feature Importances")
plt.show()



In [None]:
#kontextanalys
sns.boxplot(x='factor(weather)', y='arrival_delay', data=df_vis)
plt.title("Arrival Delay by Weather Condition")
plt.show()

sns.boxplot(x='factor(day_of_week)', y='arrival_delay', data=df_vis)
plt.title("Arrival Delay by Day Type")
plt.show()


In [None]:
# skapa samma features som vid modellträning
trip_id = df["unique_trip"].iloc[0]
trip_data = df[df["unique_trip"] == trip_id].sort_values("stop_sequence")

# ta bara kolumner som användes i modellen
trip_X = trip_data[X.columns.intersection(trip_data.columns)].copy()

# eventuellt skapa dummies på samma sätt som innan
trip_X = pd.get_dummies(trip_X)
trip_X = trip_X.reindex(columns=X.columns, fill_value=0)

# prediktion
y_trip_pred = model.predict(trip_X)


In [None]:
plt.figure(figsize=(12, 8))
for i, col in enumerate(y.columns):
    plt.subplot(2, 2, i+1)
    plt.scatter(y_test[col], y_pred[:, i], alpha=0.4)
    plt.plot([y_test[col].min(), y_test[col].max()],
             [y_test[col].min(), y_test[col].max()],
             "r--")
    plt.xlabel("Actual")
    plt.ylabel("Predicted")
    plt.title(f"{col} | R² = {r2_score(y_test[col], y_pred[:, i]):.2f}")
plt.tight_layout()
plt.show()

In [None]:
#distributioncheck
fig, axes = plt.subplots(len(targets), 1, figsize=(8, 16))
for i, col in enumerate(targets):
    sns.histplot(df_h[col], bins=50, kde=True, ax=axes[i])
    axes[i].set_title(f"Distribution of {col}")
plt.tight_layout()
plt.show()

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

importances = model.feature_importances_
feat_imp = pd.Series(importances, index=X.columns).sort_values(ascending=False)

plt.figure(figsize=(8,5))
feat_imp.head(15).plot(kind='bar')
plt.title("Top 15 Feature Importances")
plt.show()


MAE = the average absolute difference between the models predictions and the actual delay

- MAE for delay_t+1_stop: 34.00
- MAE for delay_t+3_stop: 46.54
- MAE for delay_t+6_stop: 54.66
- MAE for delay_t+12_stop: 56.01

This means that on average, the prediction for the next stop is off by 34 seconds.

Interpretations:
- Errors grow as the horizon length increases, which is expected since uncertainty accumulates further into the future


#Visualization

- *points above the line = overestimations, below = underestimations*