In [53]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.io import loadmat
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (
    mean_squared_error,
    r2_score,
    mean_absolute_error,
    median_absolute_error,
    explained_variance_score
)

# Load the dataset
data = loadmat('./traffic_dataset.mat')  # Adjust the path to your dataset

# Prepare the dataset
X_train_raw = data['tra_X_tr']  # Assuming shape: (time intervals, sensors, features)
X_test_raw = data['tra_X_te']  # Assuming same shape as X_train_raw

# Flatten across sensors and features for each time interval
X_train_flattened = X_train_raw.reshape(X_train_raw.shape[0], -1)
X_test_flattened = X_test_raw.reshape(X_test_raw.shape[0], -1)

# Aggregate traffic volumes across sensors for each time interval
y_train = np.sum(data['tra_Y_tr'], axis=0)  # Assuming shape: (sensors, time intervals)
y_test = np.sum(data['tra_Y_te'], axis=0)

# Split the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_train_flattened, y_train, test_size=0.2, random_state=42
)

# Train the Random Forest model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate the model
predictions_train = model.predict(X_train)
predictions_val = model.predict(X_val)

# Calculate and print evaluation metrics
rmse_train = np.sqrt(mean_squared_error(y_train, predictions_train))
rmse_val = np.sqrt(mean_squared_error(y_val, predictions_val))
r2_train = r2_score(y_train, predictions_train)
r2_val = r2_score(y_val, predictions_val)

print(f'Training RMSE: {rmse_train}')
print(f'Validation RMSE: {rmse_val}')
print(f'Training R²: {r2_train}')
print(f'Validation R²: {r2_val}')

# Print additional evaluation metrics
mae_train = mean_absolute_error(y_train, predictions_train)
mae_val = mean_absolute_error(y_val, predictions_val)
medae_train = median_absolute_error(y_train, predictions_train)
medae_val = median_absolute_error(y_val, predictions_val)
evs_train = explained_variance_score(y_train, predictions_train)
evs_val = explained_variance_score(y_val, predictions_val)

print(f'Training MAE: {mae_train}')
print(f'Validation MAE: {mae_val}')
print(f'Training Median AE: {medae_train}')
print(f'Validation Median AE: {medae_val}')
print(f'Training Explained Variance Score: {evs_train}')
print(f'Validation Explained Variance Score: {evs_val}')


# Select a sample from the validation set to showcase actual vs. predicted traffic volume
sample_index = np.random.randint(0, len(y_val))  # Randomly select an index for sample demonstration
sample_features = X_val_flattened[sample_index]
sample_true_volume = y_val[sample_index]
sample_prediction = model.predict([sample_features])[0]

print(f"\nSample Prediction:\nActual traffic volume: {sample_true_volume}")
print(f"Predicted traffic volume: {sample_prediction:.2f}")

# Plot actual vs predicted traffic volume for the sample
plt.figure(figsize=(6, 4))
plt.bar(['Actual Volume', 'Predicted Volume'], [sample_true_volume, sample_prediction], color=['blue', 'orange'])
plt.title('Actual vs Predicted Traffic Volume for Selected Sample')
plt.show()

# Calculate residuals for the validation set and plot a residual plot
residuals = y_val - predictions_val
plt.figure(figsize=(10, 6))
plt.scatter(predictions_val, residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predicted Traffic Volume')
plt.ylabel('Residuals')
plt.title('Residual Plot for Validation Set')
plt.show()

# Plotting the feature importances
feature_importances_normalized = model.feature_importances_ / np.max(model.feature_importances_)
sorted_indices = np.argsort(feature_importances_normalized)[::-1]
top_n = 20  # Number of top features to plot
plt.figure(figsize=(10, 6))
plt.title('Top 20 Feature Importances')
plt.bar(range(top_n), feature_importances_normalized[sorted_indices[:top_n]], align='center')
plt.xticks(range(top_n), sorted_indices[:top_n], rotation=90)
plt.xlabel('Feature Index')
plt.ylabel('Normalized Importance')
plt.show()


ValueError: Found input variables with inconsistent numbers of samples: [1, 1261]

At 2017-01-02 00:00, the traffic is predicted to be approximately 15 vehicles.
At 2017-01-02 00:15, the traffic is predicted to be approximately 15 vehicles.
At 2017-01-02 00:30, the traffic is predicted to be approximately 15 vehicles.
At 2017-01-02 00:45, the traffic is predicted to be approximately 12 vehicles.
At 2017-01-02 01:00, the traffic is predicted to be approximately 14 vehicles.
At 2017-01-02 01:15, the traffic is predicted to be approximately 13 vehicles.
At 2017-01-02 01:30, the traffic is predicted to be approximately 7 vehicles.
At 2017-01-02 01:45, the traffic is predicted to be approximately 16 vehicles.
At 2017-01-02 02:00, the traffic is predicted to be approximately 15 vehicles.
At 2017-01-02 02:15, the traffic is predicted to be approximately 16 vehicles.
At 2017-01-02 02:30, the traffic is predicted to be approximately 7 vehicles.
At 2017-01-02 02:45, the traffic is predicted to be approximately 14 vehicles.
At 2017-01-02 03:00, the traffic is predicted to be ap

IndexError: index 20 is out of bounds for axis 0 with size 20

# Traffic Flow Prediction Using Random Forest Regression

## 1. Introduction
Traffic flow prediction is important in modern urban planning and traffic management. Models may help to reduce congestion and plan optimal road routing. In this project, the goal is to utilize machine learning to predict the next 15 minutes of traffic at 36 sensor locations along highways in Northern Virginia & D.C. area.

## 2. Problem Description
The dataset for this regression problem contains 47 features, including historical traffic volume, time of day, and day of the week, across 36 locations. The challenge is to predict the future traffic volume for these locations, understanding the spatio-temporal patterns in the data.

## 3. Data Preparation
The dataset is in a MATLAB format, containing over a year of data. Data preparation involved flattening the matrices into vectors and aggregating features.

## 4. Benchmarking
Due to its ease of use and robustness, I chose random forest as the first benchmarking model. I tuned the hyperparameters to better suit the problem and achieve better results.

## 5. Methodology
The model's hyperparameters were set with 100 trees and the default depth to balance complexity and performance.

## 6. Results
The model performed exceptionally well on the training data as well as the validation data, indicated by the RMSE and R² scores. The results from additional metrics also reflected good performance.
- Training RMSE: 0.0059
- Validation RMSE: 0.0151
- Training R²: 0.9983
- Validation R²: 0.9889

# Residual Plot
The residual plot displays residuals on the y and predicted on the x. The points are randomly scattered around the x-axis which is the ideal outcome. The randomness of the errors displayed in this plot indicates that the model is not performing errors in a pattern which would indicate some part of it is performing poorly. There is also no general shape of the errors meaning the predictions are not reliant on the magnitude of the prediction. The even distribution of large and small residuals suggests that the model does not over-predict or under-predict
# Feature Importances Plot

# Prediction Error Plot

# Learning Curves Plot
The learning curve plot starts with a high training error which quickly declines as more data is included. This indicates that the model is quickly learning from new data points when they are introduced. After the initial drop, the error plateaus indicatting that after a certain point the model stops improving. The validation curve begins at a low error rate and rises slightly before also plateauing. Even with a small amount of data this suggests the model is performing well. The validation curve has an error rate similar to that of the training data curve indicating that the model is not overfitting on training data.

## 7. Evaluation and Discussion
Visualizations such as learning curves, residual plots, feature importance, and prediction error histograms were generated. The learning curves indicate a model that performs consistently as more data is introduced. The residual plot and prediction error histogram show that errors are relatively small and centered around zero, which is desirable.

[Include code cells that generate the visualizations here]

## 8. Conclusions
The Random Forest model shows promise in predicting traffic flow. The high R² and low RMSE values indicate a strong fit. Further studies could explore more complex models to capture any remaining variance.

## 9. References
Liang Zhao, Olga Gkountouna, and Dieter Pfoser, 2019. Spatial Auto-regressive Dependency Interpretable Learning Based on Spatial Topological Constraints.