# Traffic Volume Predictor

### Data:

Dataset: `traffic_enhanced.csv` 

The underlying data is the Metro Interstate Traffic Volume dataset, hosted by the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/492/metro+interstate+traffic+volume).

Columns:

| Column name | Type | Description              |
|------------|------------|--------------------------|
|`temp`                   |Numeric            |Average temperature in Kelvin|
|`rain_1h`                |Numeric            |Amount in mm of rain that occurred in the hour|
|`snow_1h`                |Numeric            |Amount in mm of snow that occurred in the hour|
|`clouds_all`             |Numeric            |Percentage of cloud cover|
|`date_time`              |DateTime           |Hour at which the data was collected, in local time (CST)|
|`holiday_` (11 columns)  |Categorical        |US National holidays plus regional holiday, Minnesota State Fair|
|`weather_main_` (11 columns)|Categorical     |Short textual description of the current weather|
|`weather_description_` (35 columns)|Categorical|Longer textual description of the current weather|
|`hour_of_day`|Numeric|The hour of the day|
|`day_of_week`|Numeric|The day of the week (0=Monday, 6=Sunday)|
|`day_of_month`|Numeric|The day of the month|
|`month`|Numeric|The number of the month|
|`traffic_volume`         |Numeric            |Hourly I-94 ATR 301 reported westbound traffic volume|

Target var: `traffic_volume`
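
The four calendar columns can be recomputed from `date_time`. The sketch below uses only Python's standard library and checks the first row shown later in this notebook (`2012-10-02 09:00:00`). The helper name `calendar_features` is hypothetical, and how the enhanced CSV was actually produced is not stated here.

```python
from datetime import datetime

def calendar_features(timestamp: str) -> dict:
    """Derive the calendar columns from a 'YYYY-MM-DD HH:MM:SS' string."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "hour_of_day": dt.hour,        # 0-23
        "day_of_week": dt.weekday(),   # 0=Monday, 6=Sunday
        "day_of_month": dt.day,        # 1-31
        "month": dt.month,             # 1-12
    }

features = calendar_features("2012-10-02 09:00:00")
print(features)
```

For that timestamp the result is hour 9, weekday 1 (Tuesday), day 2, month 10, which agrees with the first data row.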

## 0. Setup

In [47]:
# Imports
import numpy as np
import pandas as pd

from sklearn.preprocessing import MinMaxScaler

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

import plotly.io as pio
import plotly.graph_objects as go

## 1. Data Preprocessing

In [6]:
traffic_data = pd.read_csv('traffic_enhanced.csv', index_col=0) # First column is the index
traffic_data.head()

Unnamed: 0,temp,rain_1h,snow_1h,clouds_all,date_time,holiday_Christmas Day,holiday_Columbus Day,holiday_Independence Day,holiday_Labor Day,holiday_Martin Luther King Jr Day,...,weather_description_thunderstorm with heavy rain,weather_description_thunderstorm with light drizzle,weather_description_thunderstorm with light rain,weather_description_thunderstorm with rain,weather_description_very heavy rain,hour_of_day,day_of_week,day_of_month,month,traffic_volume
0,288.28,0.0,0.0,40,2012-10-02 09:00:00,False,False,False,False,False,...,False,False,False,False,False,9,1,2,10,5545
1,289.36,0.0,0.0,75,2012-10-02 10:00:00,False,False,False,False,False,...,False,False,False,False,False,10,1,2,10,4516
2,289.58,0.0,0.0,90,2012-10-02 11:00:00,False,False,False,False,False,...,False,False,False,False,False,11,1,2,10,4767
3,290.13,0.0,0.0,90,2012-10-02 12:00:00,False,False,False,False,False,...,False,False,False,False,False,12,1,2,10,5026
4,291.14,0.0,0.0,75,2012-10-02 13:00:00,False,False,False,False,False,...,False,False,False,False,False,13,1,2,10,4918


> **🧠 Note:** PyTorch expects inputs in the form of `torch.Tensor`.  
> Let’s make sure our data is ready for that!

In [9]:
# Remove 'date_time' column before normalization
traffic_data_no_datetime = traffic_data.drop(columns=['date_time'])

# Normalize the data using MinMaxScaler
scaler = MinMaxScaler()
traffic_data_normalized = scaler.fit_transform(traffic_data_no_datetime)

# Convert the normalized data back to a DataFrame
traffic_data_normalized_df = pd.DataFrame(traffic_data_normalized, columns=traffic_data_no_datetime.columns)
traffic_data_normalized_df

Unnamed: 0,temp,rain_1h,snow_1h,clouds_all,holiday_Christmas Day,holiday_Columbus Day,holiday_Independence Day,holiday_Labor Day,holiday_Martin Luther King Jr Day,holiday_Memorial Day,...,weather_description_thunderstorm with heavy rain,weather_description_thunderstorm with light drizzle,weather_description_thunderstorm with light rain,weather_description_thunderstorm with rain,weather_description_very heavy rain,hour_of_day,day_of_week,day_of_month,month,traffic_volume
0,0.929726,0.0,0.0,0.40,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.391304,0.166667,0.033333,0.818182,0.761676
1,0.933209,0.0,0.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.434783,0.166667,0.033333,0.818182,0.620330
2,0.933918,0.0,0.0,0.90,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.478261,0.166667,0.033333,0.818182,0.654808
3,0.935692,0.0,0.0,0.90,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.521739,0.166667,0.033333,0.818182,0.690385
4,0.938949,0.0,0.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.565217,0.166667,0.033333,0.818182,0.675549
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40570,0.914148,0.0,0.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.826087,1.000000,0.966667,0.727273,0.486676
40571,0.911923,0.0,0.0,0.90,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.869565,1.000000,0.966667,0.727273,0.382005
40572,0.911826,0.0,0.0,0.90,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.913043,1.000000,0.966667,0.727273,0.296566
40573,0.909762,0.0,0.0,0.90,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.956522,1.000000,0.966667,0.727273,0.199176
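
Min-max scaling maps each column to the range [0, 1] via `(x - min) / (max - min)`. As a rough sketch of what `MinMaxScaler` does per column, the NumPy function below (hypothetical name `min_max_scale`) rescales a toy `hour_of_day` column covering 0 to 23. Hour 9 maps to 9/23 ≈ 0.391304, matching the first row of the output above.

```python
import numpy as np

def min_max_scale(column: np.ndarray) -> np.ndarray:
    """Scale a 1-D array to [0, 1] using its own min and max."""
    col_min, col_max = column.min(), column.max()
    return (column - col_min) / (col_max - col_min)

hours = np.arange(24, dtype=float)      # toy column: 0, 1, ..., 23
scaled = min_max_scale(hours)
print(round(scaled[9], 6))              # hour 9 -> 9 / 23
```

Note that this simplified version would divide by zero on a constant column, so it is meant for illustration only.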


## 2. Train/Test Data Preparation

> ⚠️ **Important:** Since we're working with time series data, we must **not shuffle** the rows before splitting.  
> The split is chronological: the model trains on the first 80% of the timeline and is tested on the final 20%, so no future observations leak into the training set.

In [22]:
train_size = 0.8
train_index = int(len(traffic_data_normalized_df) * train_size)

# Split the data
train_set = traffic_data_normalized_df[:train_index]
test_set = traffic_data_normalized_df[train_index:]

print("🧪 Train Sample:")
display(train_set)

print("\n🧾 Test Sample:")
display(test_set)

🧪 Train Sample:


Unnamed: 0,temp,rain_1h,snow_1h,clouds_all,holiday_Christmas Day,holiday_Columbus Day,holiday_Independence Day,holiday_Labor Day,holiday_Martin Luther King Jr Day,holiday_Memorial Day,...,weather_description_thunderstorm with heavy rain,weather_description_thunderstorm with light drizzle,weather_description_thunderstorm with light rain,weather_description_thunderstorm with rain,weather_description_very heavy rain,hour_of_day,day_of_week,day_of_month,month,traffic_volume
0,0.929726,0.0,0.0,0.40,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.391304,0.166667,0.033333,0.818182,0.761676
1,0.933209,0.0,0.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.434783,0.166667,0.033333,0.818182,0.620330
2,0.933918,0.0,0.0,0.90,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.478261,0.166667,0.033333,0.818182,0.654808
3,0.935692,0.0,0.0,0.90,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.521739,0.166667,0.033333,0.818182,0.690385
4,0.938949,0.0,0.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.565217,0.166667,0.033333,0.818182,0.675549
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32455,0.921244,0.0,0.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.565217,0.500000,0.833333,0.818182,0.717033
32456,0.918567,0.0,0.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.608696,0.500000,0.833333,0.818182,0.747665
32457,0.912117,0.0,0.0,0.90,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.652174,0.500000,0.833333,0.818182,0.842582
32458,0.907440,0.0,0.0,0.90,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.695652,0.500000,0.833333,0.818182,0.944231



🧾 Test Sample:


Unnamed: 0,temp,rain_1h,snow_1h,clouds_all,holiday_Christmas Day,holiday_Columbus Day,holiday_Independence Day,holiday_Labor Day,holiday_Martin Luther King Jr Day,holiday_Memorial Day,...,weather_description_thunderstorm with heavy rain,weather_description_thunderstorm with light drizzle,weather_description_thunderstorm with light rain,weather_description_thunderstorm with rain,weather_description_very heavy rain,hour_of_day,day_of_week,day_of_month,month,traffic_volume
32460,0.896217,0.0,0.0,0.90,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.782609,0.5,0.833333,0.818182,0.666621
32461,0.893411,0.0,0.0,0.90,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.826087,0.5,0.833333,0.818182,0.516896
32462,0.892218,0.0,0.0,0.90,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.869565,0.5,0.833333,0.818182,0.442582
32463,0.889090,0.0,0.0,0.90,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.913043,0.5,0.833333,0.818182,0.516621
32464,0.887380,0.0,0.0,0.90,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.956522,0.5,0.833333,0.818182,0.442033
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40570,0.914148,0.0,0.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.826087,1.0,0.966667,0.727273,0.486676
40571,0.911923,0.0,0.0,0.90,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.869565,1.0,0.966667,0.727273,0.382005
40572,0.911826,0.0,0.0,0.90,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.913043,1.0,0.966667,0.727273,0.296566
40573,0.909762,0.0,0.0,0.90,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.956522,1.0,0.966667,0.727273,0.199176
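
One caveat: the scaler above was fit on the full dataset before splitting, so the test rows influenced the min and max used to scale the training rows. A stricter variant, sketched below with NumPy on toy data, computes the statistics on the training slice only and reuses them for the test slice. The helper names are hypothetical, and this is an alternative to consider rather than what the notebook does.

```python
import numpy as np

def fit_min_max(train: np.ndarray):
    """Return per-column min and range computed on the training rows only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # avoid division by zero for constant columns
    return col_min, col_range

def apply_min_max(data: np.ndarray, col_min, col_range) -> np.ndarray:
    """Scale data with statistics that were fit elsewhere."""
    return (data - col_min) / col_range

# Toy time-ordered data: 10 rows, 2 features
data = np.column_stack([np.arange(10.0), np.arange(10.0) * 100])
split = int(len(data) * 0.8)                 # chronological 80/20 split
train, test = data[:split], data[split:]

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)
print(train_scaled.max(), test_scaled.max())   # test values may exceed 1
```

With scikit-learn, the equivalent is to call `scaler.fit` on the training rows and `scaler.transform` on both splits. Test values can then fall outside [0, 1], which is expected.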


### 🧩 Sequence Construction for Time Series Modeling

In time series forecasting, models like RNNs or LSTMs don't just look at individual data points — they learn from **sequences of past values** to predict what's next.

Here, we define a function to:
- Slide a window of `seq_length` over our data,
- Extract input sequences (past observations),
- Extract the corresponding label (the value that follows the sequence).

> 📌 **Why this matters:**  
> This structure transforms our raw dataset into a supervised learning format, where:
> - **Inputs = A sequence of previous timesteps**  
> - **Label = The next timestep value**

> 🔍 **Why `seq_length = 12`?**  
> The `seq_length` defines how much historical context the model sees at each step.  
> In this case, we chose `12` because:
> - ✅ **Temporal structure**: Our data is hourly, so 12 steps ≈ half a day of context.
> - ⚙️ **Model capacity**: 12 is a small enough window to keep computation efficient, but long enough to capture short-term trends.
> - 🔬 **Empirical choice**: This value can (and should) be tuned — try comparing performance with other values like 6, 24, or 48 to find the sweet spot.

We then:
- Apply this logic to both **training** and **testing** sets.
- Wrap the sequences and labels into `TensorDataset`s, making them compatible with PyTorch's `DataLoader`.
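- Use `shuffle=True` for the training loader only. This shuffles the order in which whole windows are drawn, while each window keeps its internal time order, so it does not conflict with the chronological train/test split.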

Finally, we define `DataLoader`s with a batch size of 64 to allow efficient mini-batch training.

In [31]:
# Create sequences and labels for training and testing
def create_sequences(data, seq_length):
    sequences = []
    labels = []
    for i in range(len(data) - seq_length):
        seq = data[i:i + seq_length]
        label = data[i + seq_length, -1]
        sequences.append(seq)
        labels.append(label)
    # Stack into a single NumPy array first; building a tensor from a list of arrays is slow
    return torch.tensor(np.array(sequences), dtype=torch.float32), torch.tensor(labels, dtype=torch.float32)

In [32]:
# Define sequence length
seq_length = 12

# Create sequences and labels for training set
train_sequences, train_labels = create_sequences(train_set.values, seq_length)

# Create sequences and labels for testing set
test_sequences, test_labels = create_sequences(test_set.values, seq_length)

print(test_sequences.shape)
print(test_labels.shape)

# Create TensorDatasets
train_dataset = TensorDataset(train_sequences, train_labels)
test_dataset = TensorDataset(test_sequences, test_labels)

# Create DataLoaders
batch_size = 64

# Shuffle the order of training windows each epoch; test windows stay chronological for plotting
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

torch.Size([8103, 12, 66])
torch.Size([8103])
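
To make the windowing concrete, here is a small NumPy-only sketch of the same sliding-window logic on a toy array with 6 timesteps and 2 features. A series of length `n` yields `n - seq_length` windows, so each split loses `seq_length` samples when converted to sequences. The function name `sliding_windows` is illustrative.

```python
import numpy as np

def sliding_windows(data: np.ndarray, seq_length: int):
    """Return (windows, labels); each label is the last column of the row after its window."""
    windows = np.stack([data[i:i + seq_length] for i in range(len(data) - seq_length)])
    labels = data[seq_length:, -1]
    return windows, labels

# 6 timesteps, 2 features; the last column plays the role of the target
toy = np.array([[0, 10], [1, 11], [2, 12], [3, 13], [4, 14], [5, 15]], dtype=float)
windows, labels = sliding_windows(toy, seq_length=3)

print(windows.shape)   # (3, 3, 2)
print(labels)          # [13. 14. 15.]
```

The first window covers rows 0 to 2 and its label is the target of row 3, mirroring `data[i + seq_length, -1]` in `create_sequences`.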


## 3. LSTM Model Build 

### 🧠 Building the LSTM Model Architecture

We define a custom `TrafficVolumeLSTM` class by extending `torch.nn.Module`.  
Using a class gives us full flexibility and clarity when defining model layers, internal logic, and forward pass behavior — a common practice in PyTorch.

> 📌 **Why use a class?**  
> - Encapsulates all the model logic cleanly  
> - Allows easy reuse, extension, and organization of layers  
> - Follows PyTorch best practices for complex architectures

Our model includes:
- 🔁 A **stacked LSTM** with `num_layers` layers of `hidden_size` hidden units each (and dropout of 0.2 between layers) to learn temporal dependencies.
- 🎯 A **fully connected layer** (`nn.Linear`) that projects the final hidden state of the top LSTM layer to the output, in this case a single value (regression).
- ⚙️ Zero-initialized hidden and cell states (`h0`, `c0`), created fresh at each forward pass. This can later be improved with stateful LSTMs or learned initial states.

> 🧠 **Key LSTM arguments:**
> - `input_size`: Number of features in each timestep.
> - `hidden_size`: Dimensionality of the LSTM's hidden state.
> - `num_layers`: How many stacked LSTM layers we want.
> - `batch_first=True`: Makes input/output format `(batch, seq, feature)`, which matches how we built our sequences.
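
As a quick shape check, the sketch below pushes a random batch through a bare `nn.LSTM` configured like the one in this section. The batch size of 4 is arbitrary; 12 and 66 match the sequence length and feature count printed earlier.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=66, hidden_size=64, num_layers=2, batch_first=True)

x = torch.randn(4, 12, 66)          # (batch, seq, feature)
output, (hn, cn) = lstm(x)          # initial states default to zeros when omitted

print(output.shape)   # torch.Size([4, 12, 64]) - top-layer hidden state at every timestep
print(hn.shape)       # torch.Size([2, 4, 64])  - final hidden state of each layer
print(torch.allclose(output[:, -1, :], hn[-1]))  # True: same values for a unidirectional LSTM
```

Note that `hn` keeps the layout `(num_layers, batch, hidden_size)` even with `batch_first=True`, which is why the model indexes `hn[-1]` to pick the top layer.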

We instantiate the model using:
```python
model = TrafficVolumeLSTM(input_size, hidden_size=64, num_layers=2, output_size=1)
```

In [28]:
class TrafficVolumeLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(TrafficVolumeLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True, dropout=0.2)
        self.linear = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        batch_size = x.shape[0]
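        # Zero initial hidden and cell states, created on the same device as the input.
        # They are constants, so they do not need gradients.
        h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size, device=x.device)
        c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size, device=x.device)

        # hn has shape (num_layers, batch, hidden_size); hn[-1] is the top layer's final state
        _, (hn, _) = self.lstm(x, (h0, c0))
        out = self.linear(hn[-1]).flatten()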

        return out

In [30]:
# Define model parameters
input_size = train_sequences.shape[2]  # Number of features
hidden_size = 64
num_layers = 2
output_size = 1

# Instantiate the model
model = TrafficVolumeLSTM(input_size, hidden_size, num_layers, output_size)

In [33]:
# Define the loss function for regression tasks
loss_function = nn.MSELoss()

# Define the optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [35]:
# Train the model for 5 epochs
final_training_loss = 0
num_epochs = 5
model.train()  # make sure dropout is active during training
# Training loop
for epoch in range(num_epochs):
    for batch_x, batch_y in train_loader:
        outputs = model(batch_x)
        loss = loss_function(outputs, batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Note: this reports the loss of the last mini-batch only, not the epoch average
    print("Epoch: %d, train loss: %1.5f" % (epoch+1, loss.item()))
    final_training_loss = loss.item()

Epoch: 1, train loss: 0.00394
Epoch: 2, train loss: 0.00337
Epoch: 3, train loss: 0.00680
Epoch: 4, train loss: 0.00284
Epoch: 5, train loss: 0.00395
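
The values printed above are the loss of the last mini-batch in each epoch, which is one reason they jump around. A sample-weighted epoch average is usually a steadier signal. The sketch below shows one way to compute it. `train_one_epoch` is a hypothetical helper, and the tiny linear model and random data exist only to make the example runnable.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, loss_function, optimizer) -> float:
    """Run one training epoch and return the mean loss over all samples."""
    model.train()
    total_loss, total_samples = 0.0, 0
    for batch_x, batch_y in loader:
        optimizer.zero_grad()
        loss = loss_function(model(batch_x).flatten(), batch_y)
        loss.backward()
        optimizer.step()
        # MSELoss averages within the batch, so weight by batch size
        total_loss += loss.item() * len(batch_x)
        total_samples += len(batch_x)
    return total_loss / total_samples

# Toy stand-ins for the real model and loader
torch.manual_seed(0)
toy_x, toy_y = torch.randn(100, 5), torch.randn(100)
toy_loader = DataLoader(TensorDataset(toy_x, toy_y), batch_size=32, shuffle=True)
toy_model = nn.Linear(5, 1)
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.001)

epoch_loss = train_one_epoch(toy_model, toy_loader, nn.MSELoss(), toy_optimizer)
print(f"mean train loss: {epoch_loss:.5f}")
```

In this notebook the helper could be called once per epoch with `model`, `train_loader`, `loss_function`, and `optimizer`.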


## 4. Model Evaluation

In [48]:
# Set the model to evaluation mode
model.eval()

# Initialize variables to store outputs and labels
all_predictions = []
all_labels = []

# Disable gradient calculation during inference
with torch.no_grad():
    for seqs, labels in test_loader:
        outputs = model(seqs)  # forward() already returns a flat (batch,) tensor
        all_predictions.append(outputs)
        all_labels.append(labels)
        
# Concatenate all predictions and labels as PyTorch tensors
all_predictions = torch.cat(all_predictions, dim=0)
all_labels = torch.cat(all_labels, dim=0)

# Calculate MSE directly with PyTorch
test_mse = F.mse_loss(all_predictions, all_labels)

print(f'Test MSE: {test_mse.item()}')

Test MSE: 0.0020351330749690533
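
An MSE on normalized data is hard to interpret on its own. Taking the square root and multiplying by the target's range gives an approximate error in vehicles per hour. The sketch below infers that range from the first data row shown earlier (a raw volume of 5545 maps to 0.761676), assuming the column minimum is 0. Treat the result as a rough estimate.

```python
import math

test_mse = 0.0020351330749690533      # value printed above (normalized units)

# Assumption: min-max scaling with a minimum of 0, so raw = scaled * range.
# The first data row maps a raw volume of 5545 to a scaled value of 0.761676.
volume_range = 5545 / 0.761676

rmse_normalized = math.sqrt(test_mse)
rmse_vehicles = rmse_normalized * volume_range
print(f"RMSE ≈ {rmse_vehicles:.0f} vehicles/hour")
```

Under these assumptions the typical error is on the order of a few hundred vehicles per hour.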


In [51]:
# ✅ Ensure tensors are detached from the computation graph before converting
predictions_np = all_predictions.detach().cpu().numpy()
labels_np = all_labels.detach().cpu().numpy()

# Create the plot
fig = go.Figure()

# Add traces for predictions and real values
fig.add_trace(go.Scatter(
    x=np.arange(len(predictions_np)),
    y=predictions_np,
    mode='lines',
    name='Predicted Traffic Volume'
))

fig.add_trace(go.Scatter(
    x=np.arange(len(labels_np)),
    y=labels_np,
    mode='lines',
    name='Actual Traffic Volume'
))

# Update layout
fig.update_layout(
    title='📈 Predicted vs Actual Traffic Volume',
    xaxis_title='Sample Index',
    yaxis_title='Traffic Volume',
    legend=dict(x=0.01, y=0.99, borderwidth=1)
)

pio.renderers.default = 'browser'
fig.show()

In [53]:
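# 🔁 Retrieve the original scale of the target column from the fitted scaler
target_idx = traffic_data_no_datetime.columns.get_loc('traffic_volume')
min_traffic_volume = scaler.data_min_[target_idx]
max_traffic_volume = scaler.data_max_[target_idx]

# 🎯 Invert the min-max scaling: x = x_scaled * (max - min) + min
# (multiplying by max alone is only correct when the column minimum is 0)
traffic_range = max_traffic_volume - min_traffic_volume
predictions_real = predictions_np * traffic_range + min_traffic_volume
labels_real = labels_np * traffic_range + min_traffic_volume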
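
The general inverse of min-max scaling is `x = x_scaled * (max - min) + min`. Multiplying by the maximum alone is only exact when the column minimum is 0. The toy example below, with made-up Kelvin temperatures, shows the difference.

```python
import numpy as np

temps = np.array([250.0, 275.0, 300.0])          # made-up values, min is not 0
t_min, t_max = temps.min(), temps.max()
scaled = (temps - t_min) / (t_max - t_min)       # -> [0.0, 0.5, 1.0]

restored = scaled * (t_max - t_min) + t_min      # correct inverse
naive = scaled * t_max                           # ignores the minimum

print(restored)   # [250. 275. 300.]
print(naive)      # [  0. 150. 300.]
```

Only the first version recovers the original values, so it is the safer choice when the same code is reused on other columns or datasets.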

In [54]:
# Create the plot
fig = go.Figure()

# Add traces for predictions and true labels
fig.add_trace(go.Scatter(x=list(range(len(predictions_real))), y=predictions_real, mode='lines', name='Predicted Traffic Volume'))
fig.add_trace(go.Scatter(x=list(range(len(labels_real))), y=labels_real, mode='lines', name='Real Traffic Volume'))

# Update layout
fig.update_layout(title='Predicted vs Actual Traffic Volume (original scale)',
                  xaxis_title='Sample Index',
                  yaxis_title='Traffic Volume (vehicles per hour)',
                  legend=dict(x=0, y=1, traceorder='normal'))

# Show the plot
fig.show()

## 🌍 How to Translate This Project to Your Own Location

One of the strengths of this traffic volume prediction model is that it's **highly adaptable**. Whether you're in Madrid, São Paulo, Tokyo, or any other city — you can apply the same core approach with just a few adjustments.

### 🔁 Steps to Customize:

1. **Find Local Traffic Data**
   - Look for open data platforms from your city or country.
   - Useful sources include:
     - City or municipality open data portals
     - Government transportation departments
     - Public APIs (e.g., Google Maps, TomTom, DGT)

2. **Ensure Similar Structure**
   - The dataset should have:
     - A timestamp or datetime column
     - A target variable (e.g. vehicle count, traffic volume, etc.)
     - Optional: weather, location, or road condition features

3. **Adapt Preprocessing**
   - Adjust column names and date formats to match your new dataset
   - Normalize or scale the data as needed (just like we did)
   - If necessary, re-tune the `seq_length`, model size, or batch size

4. **Retrain the Model**
   - Use the same architecture (`LSTM`) to learn from your new data
   - Train, evaluate, and visualize just like we did here

5. **Compare Results**
   - See how traffic trends differ between cities
   - Test how well your model predicts rush hours, quiet times, or anomalies
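
Steps 2 and 3 can be sketched as a small pandas helper. Everything specific here is hypothetical: the function name `prepare_local_dataset` and the column names `fecha` and `intensidad` stand in for whatever your dataset uses. The target is moved to the last column because `create_sequences` reads the label from column `-1`.

```python
import pandas as pd

def prepare_local_dataset(df: pd.DataFrame, time_col: str, target_col: str) -> pd.DataFrame:
    """Sort chronologically, add calendar features, and put the target last."""
    out = df.copy()
    out[time_col] = pd.to_datetime(out[time_col])
    out = out.sort_values(time_col).reset_index(drop=True)

    out["hour_of_day"] = out[time_col].dt.hour
    out["day_of_week"] = out[time_col].dt.dayofweek   # 0=Monday, 6=Sunday
    out["day_of_month"] = out[time_col].dt.day
    out["month"] = out[time_col].dt.month

    # Drop the raw timestamp and rename the target to match this notebook
    out = out.drop(columns=[time_col]).rename(columns={target_col: "traffic_volume"})
    feature_cols = [c for c in out.columns if c != "traffic_volume"]
    return out[feature_cols + ["traffic_volume"]]

# Hypothetical raw data with Spanish column names, deliberately out of order
raw = pd.DataFrame({
    "fecha": ["2024-03-04 08:00:00", "2024-03-04 07:00:00"],
    "intensidad": [1200, 800],
})
prepared = prepare_local_dataset(raw, time_col="fecha", target_col="intensidad")
print(prepared)
```

The resulting frame can then go through the same scaling, splitting, and sequence-building cells as the original data.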

### 🇪🇸 Example: Madrid, Spain

Madrid offers public traffic datasets via its [Open Data Portal](https://datos.madrid.es).

For example, the dataset **[Estado M30 en tiempo real](https://datos.madrid.es/sites/v/index.jsp?vgnextoid=d5ec05dc4d1ab410VgnVCM2000000c205a0aRCRD&vgnextchannel=374512b9ace9f310VgnVCM100000171f5a0aRCRD)** provides real-time traffic status on the M-30, including congestion levels and sensor data.

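After adapting the preprocessing to its format, you could train the same LSTM architecture on this data to predict traffic on the **M-30**, and extend the approach to other major roads such as **Gran Vía** or **Castellana** where comparable sensor data is available.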

---

> 💡 **Insight:** The power of ML isn’t just in building one model — it’s in applying that model framework to any data, anywhere.