## Traffic Forecasting Research Plan (Phase 1)
## Model Strategy, Evaluation Metrics, and Selection Criteria

### Objective
To identify the most accurate and robust model for traffic flow forecasting using real-world traffic sensor data. The best model will later feed into EV charging coordination simulations for improving power system resilience.

---

## Modeling Strategy

We will compare **9 models** across 4 categories:

| Category                    | Models |
|----------------------------|--------|
| **Baseline Models**        | Linear Regression, LSTM-only |
| **Published Hybrid Model** | CNN-GRU-LSTM (from literature) |
| **Graph-based Models**     | DCRNN-only, STGCN-only, GraphWaveNet-only |
| **Proposed Hybrid Models** | DCRNN-GRU-LSTM, STGCN-GRU-LSTM, GraphWaveNet-GRU-LSTM |

---

## Model Rationales

**1. Baseline Models**
- *Linear Regression*: Simple, interpretable benchmark.
- *LSTM-only*: Captures temporal patterns without spatial structure.

**2. CNN-GRU-LSTM (Literature Benchmark)**
- Combines a CNN (spatial features) with GRU and LSTM layers (short- and long-term temporal dependencies).
- Reproduced from a recent paper for benchmarking.

**3. Graph-based Models**
- *DCRNN*: Diffusion convolution (directed graphs) + RNNs.
- *STGCN*: Chebyshev graph convolution + temporal convolution.
- *GraphWaveNet*: Adaptive graph learning + dilated causal convolutions.
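
The Chebyshev graph convolution that STGCN relies on can be illustrated with a short NumPy sketch. It applies the standard recurrence T_0 = I, T_1 = L~, T_k = 2 L~ T_{k-1} - T_{k-2} on the rescaled graph Laplacian. The function names `scaled_laplacian` and `cheb_graph_conv`, the 3-station path graph, and the coefficient values are illustrative assumptions, and a symmetric adjacency matrix is assumed. This is not the STGCN reference implementation, which learns the coefficients and interleaves temporal convolutions.

```python
import numpy as np

def scaled_laplacian(adj):
    """Return 2L/lambda_max - I, where L is the symmetric normalized Laplacian."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    d_inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    # L = I - D^{-1/2} A D^{-1/2}
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    lam_max = np.linalg.eigvalsh(lap).max()
    return 2.0 * lap / lam_max - np.eye(len(adj))

def cheb_graph_conv(x, adj, theta):
    """Compute sum_k theta[k] * T_k(L~) @ x using the Chebyshev recurrence."""
    l_tilde = scaled_laplacian(adj)
    x = np.asarray(x, dtype=float)
    t_prev, t_curr = x, l_tilde @ x  # T_0 x = x, T_1 x = L~ x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_next = 2.0 * (l_tilde @ t_curr) - t_prev
        out = out + theta[k] * t_next
        t_prev, t_curr = t_curr, t_next
    return out

# Toy example (assumed): three stations in a line, one flow value per station.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
x = np.array([[10.0], [20.0], [30.0]])
mixed = cheb_graph_conv(x, adj, theta=[0.5, -0.25, 0.1])
```

With `len(theta) = K`, each station's output depends only on stations at most `K - 1` hops away, which is why the polynomial order controls the spatial receptive field.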

**4. Proposed Hybrid Models**
- Each combines a graph-based spatial module (DCRNN, STGCN, or GraphWaveNet) with GRU and LSTM layers to capture joint spatial-temporal dependencies.

---

## Forecasting Horizon

We forecast traffic up to **72 hours (3 days) ahead** using multi-output models. This lets us evaluate both short- and medium-term predictive accuracy.

- Models will be trained to predict flow at the following lead times:
  - **12 and 24 hours** ahead (short term),
  - **48 and 72 hours** ahead (medium term).
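
As a rough illustration of the multi-output setup, the sketch below slices one station's series into input windows and 72-step targets, then reads off the 12, 24, 48, and 72 hour lead times as columns of the target matrix. It assumes hourly data and a 72-hour input window; the function name `make_windows` and the synthetic series are hypothetical, not part of the notebook.

```python
import numpy as np

def make_windows(series, n_lags=72, horizon=72):
    """Slice a 1-D series into (input window, multi-step target) pairs."""
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - n_lags - horizon + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested window sizes")
    X = np.stack([series[i:i + n_lags] for i in range(n_samples)])
    y = np.stack([series[i + n_lags:i + n_lags + horizon] for i in range(n_samples)])
    return X, y

# Synthetic stand-in for one station's hourly flow (assumed data).
flow = np.arange(200, dtype=float)
X, y = make_windows(flow)

# Column h-1 of y holds the h-hour-ahead target for every sample.
checkpoints = {h: y[:, h - 1] for h in (12, 24, 48, 72)}
```

Training one model on the full 72-column target and evaluating selected columns keeps all lead times comparable across models.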

---

## Evaluation Metrics

We will use **three key metrics** for model comparison:

- **MAPE (Mean Absolute Percentage Error)**: 
  - Interpretable percentage error.
- **RMSE (Root Mean Squared Error)**: 
  - Penalizes larger errors more heavily.
- **MAE (Mean Absolute Error)**: 
  - Stable average error measure.

All models will be evaluated on the same test set using these metrics at both 24h and 72h forecasting horizons.
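
A minimal sketch of how the three metrics could be computed is shown below. The helper name `forecast_metrics` and the sample values are hypothetical. One assumption worth flagging: MAPE is undefined where the true flow is zero, so this sketch masks those points, which is only one of several possible conventions.

```python
import numpy as np

def forecast_metrics(y_true, y_pred, eps=1e-8):
    """Return MAE, RMSE, and MAPE (in percent) for two equally shaped arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    # Assumption: skip points whose true value is (near) zero when computing MAPE.
    mask = np.abs(y_true) > eps
    mape = np.mean(np.abs(err[mask] / y_true[mask])) * 100.0 if mask.any() else np.nan
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape}

# Made-up values for illustration only.
metrics = forecast_metrics([100, 200, 0, 400], [110, 180, 5, 400])
```

To report a specific horizon, the same function can be applied to the target column for that lead time, for example the 24th and 72nd columns of a multi-output prediction.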

---

## Model Selection Criteria

The final model will be selected based on:

1. **Lowest RMSE, MAE, and MAPE** on test data.
2. **Spatial consistency**: Station-level errors should not vary drastically.
3. **Generalization ability**: Minimal overfitting (close validation and test performance).
4. **Efficiency** *(secondary)*: Preference for simpler or faster models if performance is tied.
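
The spatial-consistency criterion can be made measurable by comparing per-station errors. The sketch below is one possible way to do that, assuming a results table with hypothetical columns `station`, `y_true`, and `y_pred`; the spread statistics chosen here (standard deviation and worst station) are illustrative, not a fixed part of the plan.

```python
import pandas as pd

def station_error_summary(results):
    """Summarize overall MAE and how much MAE varies between stations."""
    abs_err = (results["y_pred"] - results["y_true"]).abs()
    per_station = abs_err.groupby(results["station"]).mean()
    return {
        "mae_overall": abs_err.mean(),
        "mae_station_std": per_station.std(ddof=0),  # spread across stations
        "mae_station_max": per_station.max(),        # worst-performing station
    }

# Made-up predictions for two stations.
results = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "y_true": [100.0, 100.0, 50.0, 50.0],
    "y_pred": [110.0, 90.0, 50.0, 60.0],
})
summary = station_error_summary(results)
```

A model with a low overall MAE but a large `mae_station_std` would score well on criterion 1 and poorly on criterion 2.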

---

## Input Features (Used Across All Models)

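To ensure a fair comparison, all 9 models will use the same temporal, lag, sliding-window, and cyclical features. Graph features apply only to models with a graph module, and each graph model uses its own graph construction: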

- **Temporal Features**: hour, day, month, weekday, holiday flags.
- **Lag Features**: Flow_lag_1 to Flow_lag_72.
- **Sliding Window Stats**: rolling mean, min, max, std.
- **Cyclical Features**: sine/cosine of time variables.
- **Graph Features**:
  - *DCRNN*: fixed adjacency matrix from CV clustering.
  - *STGCN*: Chebyshev-based graph structure.
  - *GraphWaveNet*: learns the adjacency dynamically.
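
The non-graph features listed above could be built along the following lines for a single station. This is a sketch under assumptions: hourly data with a `DatetimeIndex`, a 24-hour rolling window, and hypothetical column names other than `Flow_lag_k`. Holiday flags are omitted because they need an external calendar.

```python
import numpy as np
import pandas as pd

def add_features(flow, n_lags=72, window=24):
    """Build lag, rolling, cyclical, and calendar features for one station."""
    feats = pd.DataFrame({"Flow": flow})
    # Lag features: shift(k) only looks backwards in time.
    for k in range(1, n_lags + 1):
        feats[f"Flow_lag_{k}"] = flow.shift(k)
    # Rolling stats over the previous `window` hours; shift(1) excludes the current value.
    past = flow.shift(1).rolling(window)
    feats["roll_mean"] = past.mean()
    feats["roll_min"] = past.min()
    feats["roll_max"] = past.max()
    feats["roll_std"] = past.std()
    # Cyclical encoding so that hour 23 and hour 0 end up close together.
    hour = feats.index.hour
    feats["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    feats["hour_cos"] = np.cos(2 * np.pi * hour / 24)
    feats["weekday"] = feats.index.dayofweek
    # Drop the warm-up rows that lack a full history.
    return feats.dropna()

# Synthetic hourly series standing in for one station's flow.
idx = pd.date_range("2024-10-01", periods=72, freq="h")
flow = pd.Series(np.arange(72, dtype=float), index=idx)
feats = add_features(flow, n_lags=3, window=24)
```

Shifting before rolling matters here: without it, the window statistics would include the value being predicted and leak the target into the inputs.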

---

## Summary

- Compare 9 models (baseline, benchmark, graph-based, hybrid).
- Forecast 24h and 72h ahead.
- Use MAPE, RMSE, MAE as evaluation metrics.
- Select best model for downstream EV charging coordination.


In [14]:
# Standard library
import glob
import os

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx

# Classical ML and statistics
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

# Deep learning
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential
# Note: TimeseriesGenerator is deprecated in recent TensorFlow releases
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

In [2]:
df = pd.read_csv(r"C:\Users\attafuro\Desktop\EV Charging Analysis\cleaned_traffic_data.csv")

In [3]:
df.head()

Unnamed: 0,Timestamp,Station,District,Route,Direction of Travel,Lane Type,Station Length,Samples,% Observed,Total Flow,...,Lane 5 Avg Speed,Lane 6 Flow,Lane 6 Avg Occ,Lane 6 Avg Speed,Lane 7 Flow,Lane 7 Avg Occ,Lane 7 Avg Speed,Lane 8 Flow,Lane 8 Avg Occ,Lane 8 Avg Speed
0,10/01/2024 00:00:00,308512,3,50,W,ML,3.995,197,0,497.0,...,,,,,,,,,,
1,10/01/2024 00:00:00,311831,3,5,S,OR,,101,92,27.0,...,,,,,,,,,,
2,10/01/2024 00:00:00,311832,3,5,S,FR,,101,92,78.0,...,,,,,,,,,,
3,10/01/2024 00:00:00,311844,3,5,N,OR,,202,92,43.0,...,,,,,,,,,,
4,10/01/2024 00:00:00,311847,3,5,N,OR,,303,92,73.0,...,,,,,,,,,,


In [4]:
# Define the final selected columns
selected_columns = [
    "Timestamp", "Station", "Route", "Direction of Travel",
    "Total Flow", "Avg Speed", "% Observed","Samples","Lane Type"
]

# Keep only the selected columns; .copy() avoids SettingWithCopyWarning on later in-place edits
df = df[selected_columns].copy()

In [5]:
df.head()

Unnamed: 0,Timestamp,Station,Route,Direction of Travel,Total Flow,Avg Speed,% Observed,Samples,Lane Type
0,10/01/2024 00:00:00,308512,50,W,497.0,64.1,0,197,ML
1,10/01/2024 00:00:00,311831,5,S,27.0,,92,101,OR
2,10/01/2024 00:00:00,311832,5,S,78.0,,92,101,FR
3,10/01/2024 00:00:00,311844,5,N,43.0,,92,202,OR
4,10/01/2024 00:00:00,311847,5,N,73.0,,92,303,OR


In [7]:
df.isna().sum()

Timestamp                    0
Station                      0
Route                        0
Direction of Travel          0
Total Flow              303776
Avg Speed              1582594
% Observed                   0
Samples                      0
Lane Type                    0
dtype: int64

In [8]:
df.drop(columns=["Avg Speed"], inplace=True)

In [9]:
df.head()

Unnamed: 0,Timestamp,Station,Route,Direction of Travel,Total Flow,% Observed,Samples,Lane Type
0,10/01/2024 00:00:00,308512,50,W,497.0,0,197,ML
1,10/01/2024 00:00:00,311831,5,S,27.0,92,101,OR
2,10/01/2024 00:00:00,311832,5,S,78.0,92,101,FR
3,10/01/2024 00:00:00,311844,5,N,43.0,92,202,OR
4,10/01/2024 00:00:00,311847,5,N,73.0,92,303,OR


In [10]:
df.isna().sum()

Timestamp                   0
Station                     0
Route                       0
Direction of Travel         0
Total Flow             303776
% Observed                  0
Samples                     0
Lane Type                   0
dtype: int64

In [11]:
# Drop rows with no flow reading; .copy() keeps later column assignments warning-free
df = df[df['Total Flow'].notna()].copy()

In [12]:
df.isna().sum()

Timestamp              0
Station                0
Route                  0
Direction of Travel    0
Total Flow             0
% Observed             0
Samples                0
Lane Type              0
dtype: int64

In [13]:
df.head()

Unnamed: 0,Timestamp,Station,Route,Direction of Travel,Total Flow,% Observed,Samples,Lane Type
0,10/01/2024 00:00:00,308512,50,W,497.0,0,197,ML
1,10/01/2024 00:00:00,311831,5,S,27.0,92,101,OR
2,10/01/2024 00:00:00,311832,5,S,78.0,92,101,FR
3,10/01/2024 00:00:00,311844,5,N,43.0,92,202,OR
4,10/01/2024 00:00:00,311847,5,N,73.0,92,303,OR


In [16]:
df.dtypes

Timestamp               object
Station                  int64
Route                    int64
Direction of Travel     object
Total Flow             float64
% Observed               int64
Samples                  int64
Lane Type               object
dtype: object

In [17]:
df.duplicated().sum()

0

In [18]:
# Timestamp is still a regular column here, so no reset_index() is needed
df.duplicated(subset=['Timestamp', 'Station']).sum()

0

In [19]:
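# Explicit month-first format matches the raw strings (e.g. 10/01/2024 00:00:00) and speeds up parsing
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%m/%d/%Y %H:%M:%S')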
df.set_index('Timestamp', inplace=True)
df.sort_index(inplace=True)

In [21]:
df.head()

Timestamp,Station,Route,Direction of Travel,Total Flow,% Observed,Samples,Lane Type
2024-10-01,308512,50,W,497.0,0,197,ML
2024-10-01,311831,5,S,27.0,92,101,OR
2024-10-01,311832,5,S,78.0,92,101,FR
2024-10-01,311844,5,N,43.0,92,202,OR
2024-10-01,311847,5,N,73.0,92,303,OR
