## 12 – ML Prediction: Transit Volume Based on Weather & Time

This notebook builds a machine learning model to predict the number of real-time transit vehicle updates (as a proxy for transit volume) in Seattle, using weather and time-based features.

**Data Source:** Gold Layer table `gtfs_rt_weather_joined`  
**Goal:** Forecast daily transit activity using features like:
- Temperature
- Wind Speed
- Weather Conditions (One-hot encoded)
- Day of the Week

We compare **Linear Regression**, **Random Forest**, and **XGBoost** models.


### Setup & Imports

In [0]:
from pyspark.sql.functions import hour, dayofweek, col, count, to_date, split
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
import seaborn as sns

# Set timezone for Spark timestamps
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

### Load Gold Layer Data

In [0]:
# Load Gold-layer data that joins real-time GTFS updates with weather info
df = spark.read.format("delta").load("dbfs:/gold/gtfs_rt_weather_joined/")
df.limit(5).display()


### Feature Engineering – Daily Aggregation
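We group by date to predict the number of vehicle updates per day, together with that day's average weather conditions. Because the rows are first filtered to the 8 AM hour, each daily count covers only that hour.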

In [0]:
# Check format of wind speed column before cleaning
df.groupBy("windSpeed").count().orderBy("count", ascending=False).show()

In [0]:
# Convert "12 mph" to numeric 12
df = df.withColumn("windSpeed", split(col("windSpeed"), " ").getItem(0).cast("int"))

In [0]:
# Verify the cleaning
df.select('windSpeed').limit(5).display()
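
The split-based cleaning above keeps only the first token, which works for values such as `12 mph`. If the feed ever reported a range, that approach would silently keep the lower bound. The sketch below is a plain-Python illustration of a more defensive parser. The `parse_wind_speed` helper and the `5 to 10 mph` range format are assumptions made for illustration; neither is confirmed by the source data.

```python
import re
from typing import Optional


def parse_wind_speed(raw: Optional[str]) -> Optional[float]:
    """Hypothetical helper: average every integer found in the raw string."""
    if raw is None:
        return None
    numbers = [int(n) for n in re.findall(r"\d+", raw)]
    if not numbers:
        # No digits at all (for example a textual value): treat as missing
        return None
    return sum(numbers) / len(numbers)


single = parse_wind_speed("12 mph")        # one value
ranged = parse_wind_speed("5 to 10 mph")   # assumed range format, averaged
missing = parse_wind_speed("calm")         # no number present
```

In Spark the same idea could be expressed with built-in functions such as `regexp_extract`, which usually perform better than a Python UDF.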

In [0]:
# Extract the hour and keep only the 8 AM hour for a consistent day-to-day comparison
df = df.withColumn("hour", hour("event_ts"))
df = df.filter(col("hour") == 8)

In [0]:
from pyspark.sql.functions import avg, first

# Aggregate by day and engineer time features
df_daily = df.withColumn("event_date", to_date("event_ts")) \
    .groupBy("event_date") \
    .agg(
        count("*").alias("vehicle_update_count"),
        avg("temperature").alias("avg_temp"),
        # first() keeps an arbitrary row's value, not necessarily the most frequent one
        first("condition").alias("condition"),
        avg("windSpeed").alias("wind_speed"),
        first("event_ts").alias("sample_ts")  # only used to extract time features
    )

# dayofweek returns 1 (Sunday) through 7 (Saturday)
df_daily = df_daily.withColumn("day_of_week", dayofweek("sample_ts"))

df_daily.display()
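
The aggregation above takes the first `condition` seen for each day, which is not guaranteed to be the most frequent one. As a rough sketch of what "most frequent condition per day" could look like, the pandas example below uses a small made-up frame; the `most_frequent` helper and the toy values are assumptions for illustration only.

```python
import pandas as pd


def most_frequent(series: pd.Series):
    """Hypothetical helper: return the modal value (ties resolve to the first sorted mode)."""
    modes = series.mode()
    return modes.iloc[0] if not modes.empty else None


# Toy data standing in for the per-event rows
toy = pd.DataFrame({
    "event_date": ["2024-01-01"] * 3 + ["2024-01-02"] * 2,
    "condition": ["Rain", "Cloudy", "Rain", "Sunny", "Sunny"],
})

# One modal condition per day
daily_condition = toy.groupby("event_date")["condition"].agg(most_frequent)
```

The same result could be computed in Spark before calling `toPandas()`, for example by counting rows per date and condition and keeping the top-ranked condition for each date.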


### Convert to Pandas & One-Hot Encode Condition

In [0]:
# Convert to Pandas and prepare data for ML
pdf = df_daily.toPandas()

# One-hot encode weather condition
pdf = pd.get_dummies(pdf, columns=["condition"], prefix="cond")

# Drop unused columns
pdf = pdf.drop(columns=["event_date", "sample_ts"])

pdf.head()
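
Spark's `dayofweek` yields integers from 1 (Sunday) to 7 (Saturday), so `day_of_week` reaches the models as a single numeric column. A linear model then treats Saturday as "seven times Sunday", which may not be the intent. One possible alternative, sketched here on made-up rows, is to one-hot encode the weekday alongside the condition; the `dow` prefix and the toy values are assumptions.

```python
import pandas as pd

# Toy rows standing in for the daily frame
toy = pd.DataFrame({
    "avg_temp": [50.0, 55.0, 48.0],
    "condition": ["Rain", "Sunny", "Rain"],
    "day_of_week": [1, 2, 7],
})

# Encode both categorical columns; dtype=int gives 0/1 columns across pandas versions
encoded = pd.get_dummies(
    toy,
    columns=["condition", "day_of_week"],
    prefix={"condition": "cond", "day_of_week": "dow"},
    dtype=int,
)
```

Tree-based models such as Random Forest and XGBoost can usually split on the raw integer without trouble, so this change matters mostly for the linear baseline.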


In [0]:
# Count the columns that contain at least one null value
pdf.isna().any().sum()

### Train-Test Split

In [0]:
# Split into features and target
X = pdf.drop("vehicle_update_count", axis=1)
y = pdf["vehicle_update_count"]

# 80/20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
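
The random split above mixes early and late days in both sets. For daily data, a chronological split, which trains on earlier days and tests on the most recent ones, is a common alternative that better mimics forecasting. The sketch below assumes the date column is still available at split time (in this notebook `event_date` is dropped earlier); `chronological_split` is a hypothetical helper.

```python
import pandas as pd


def chronological_split(frame: pd.DataFrame, date_col: str, test_fraction: float = 0.2):
    """Hypothetical helper: hold out the most recent rows as the test set."""
    ordered = frame.sort_values(date_col).reset_index(drop=True)
    n_test = max(1, int(round(len(ordered) * test_fraction)))
    return ordered.iloc[:-n_test], ordered.iloc[-n_test:]


# Toy daily frame, deliberately shuffled to show that the helper sorts it
toy = pd.DataFrame({
    "event_date": pd.date_range("2024-01-01", periods=10),
    "vehicle_update_count": range(10),
}).sample(frac=1, random_state=0)

train_part, test_part = chronological_split(toy, "event_date")
```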


### Train Linear Regression

In [0]:
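import numpy as np

# Train Linear Regression model
lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)

# Evaluate (RMSE is taken as the square root of MSE, which works across scikit-learn versions)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"🔍 Linear Regression R2: {r2:.2f}")
print(f"🔍 Linear Regression RMSE: {rmse:.2f}")
print(f"📏 Linear Regression MAE: {mae:.2f}")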
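
To make the metrics concrete, the sketch below computes MAE, RMSE, and R2 directly from their definitions with NumPy. `regression_report` is a hypothetical helper, not part of scikit-learn, and the numbers are made up.

```python
import numpy as np


def regression_report(y_true, y_pred):
    """Hypothetical helper: MAE, RMSE and R2 from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))            # average absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))     # penalizes large errors more
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # zero if y_true is constant
    r2 = 1.0 - ss_res / ss_tot                  # 1.0 is perfect, 0.0 matches predicting the mean
    return {"mae": float(mae), "rmse": float(rmse), "r2": float(r2)}


report = regression_report([10, 20, 30], [12, 18, 33])
```

Applying the same helper to each model's predictions would give one consistent set of metrics for the comparison.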


### Train Random Forest

In [0]:
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Train Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Evaluation
mae_rf = mean_absolute_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)
print(f"🌲 Random Forest R2: {r2_rf:.2f}")
print(f"🌲 Random Forest MAE: {mae_rf:.2f}")
print(f"🌲 Random Forest RMSE: {rmse_rf:.2f}")
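
With one row per day, a single 20% hold-out set may contain only a handful of rows, so the scores above can vary noticeably with the split. K-fold cross-validation is one way to get a steadier estimate. The sketch below runs on synthetic data rather than the notebook's features; the data-generating formula is an assumption chosen only to make the example runnable.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the daily feature matrix and target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 3))
y_demo = 100 + 5 * X_demo[:, 0] + rng.normal(size=60)

model = RandomForestRegressor(n_estimators=20, random_state=42)
folds = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn reports negated errors so that higher is always better
scores = cross_val_score(model, X_demo, y_demo, cv=folds, scoring="neg_mean_absolute_error")
mae_per_fold = -scores
print(f"MAE per fold: {np.round(mae_per_fold, 2)}, mean: {mae_per_fold.mean():.2f}")
```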


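### Train XGBoost

Install XGBoost first. On Databricks, `%pip install` resets the notebook's Python state, so variables defined in earlier cells must be recomputed afterwards; running the install at the top of the notebook avoids this.

In [0]:
%pip install xgboost

In [0]:
from xgboost import XGBRegressor
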
# Train XGBoost
xgb = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)

# Evaluation
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"⚡ XGBoost R2: {r2_xgb:.2f}")
print(f"⚡ XGBoost MAE: {mae_xgb:.2f}")
print(f"⚡ XGBoost RMSE: {rmse_xgb:.2f}")


In [0]:
# Get feature importances
importances = list(zip(X.columns, xgb.feature_importances_))
importances_sorted = sorted(importances, key=lambda x: x[1], reverse=True)

In [0]:
importances_sorted
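
Because `condition` is one-hot encoded, its importance is spread across several `cond_*` columns, which can make the weather condition look less influential than it is. A minimal sketch of summing those columns back into one entry follows; `group_importances` is a hypothetical helper and the scores are made up.

```python
from collections import defaultdict


def group_importances(pairs, prefix="cond_"):
    """Hypothetical helper: merge one-hot columns sharing a prefix into one score."""
    grouped = defaultdict(float)
    for name, score in pairs:
        key = "condition" if name.startswith(prefix) else name
        grouped[key] += float(score)  # float() also accepts NumPy scalar types
    return sorted(grouped.items(), key=lambda item: item[1], reverse=True)


# Made-up importances in the same (name, score) shape as importances_sorted
toy = [
    ("avg_temp", 0.40),
    ("cond_Rain", 0.15),
    ("cond_Sunny", 0.10),
    ("wind_speed", 0.20),
    ("day_of_week", 0.15),
]
grouped = group_importances(toy)
```

Passing the notebook's `importances_sorted` list to the helper would produce the same kind of grouped ranking.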