<div align="center">
    <h1><b><u>TRAFFIC OPTIMIZATION</u></b></h1>
</div>


**GROUP NUMBER: 7**
     
**GROUP MEMBERS**
   
- **Adebola**
- **Rahim**
- **Sayeed**
- **Yinka**
- **Minto**
   

<div align="center">
    <h3><b><u>4. Model Selection</u></b></h3>
</div>


- The purpose of this document is to find the best model to predict 'INJURY'.
- encoded_data.csv, the file containing the data after the pre-processing steps, is used for model selection.
- The following models are considered:
    - Linear Regression: a simple baseline model.
    - Ridge Regression: the dataset has 191 features, many of which might be correlated. Ridge reduces the effects of multicollinearity and overfitting by adding L2 regularization.
    - Random Forest Regressor: can capture complex non-linear interactions between variables.
    - XGBoost: a high-performance gradient boosting model.
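To make the L2 regularization point concrete, here is a minimal sketch on synthetic data (an assumption for illustration, not the project dataset). It uses the closed-form least-squares solution with no intercept, and the helper name `fit_linear` is hypothetical. With two nearly identical features, the unpenalized coefficients are poorly determined, while the penalized ones are pulled toward a smaller, more stable solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): x2 is almost a copy of x1,
# so the two feature columns are highly correlated.
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)


def fit_linear(X, y, alpha=0.0):
    """Closed-form least squares with an optional L2 penalty (no intercept).

    alpha = 0 gives ordinary least squares; alpha > 0 gives a ridge solution.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


ols_coef = fit_linear(X, y, alpha=0.0)
ridge_coef = fit_linear(X, y, alpha=1.0)

print("OLS coefficients:  ", ols_coef)
print("Ridge coefficients:", ridge_coef)
print("OLS coefficient norm:  ", np.linalg.norm(ols_coef))
print("Ridge coefficient norm:", np.linalg.norm(ridge_coef))
```

The ridge coefficient vector always has a smaller norm than the unpenalized one, which is the shrinkage effect that motivates including Ridge for a wide, correlated feature set.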

In [1]:
# import libraries 
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
import numpy as np

In [3]:
# load dataset
df = pd.read_csv('data/encoded_data.csv')
df.head()

| | ROAD_CLASS | ACCLOC | TRAFFCTL | VISIBILITY | LIGHT | RDSFCOND | IMPACTYPE | INVTYPE | INVAGE | INJURY | ... | VEHTYPE | MANOEUVER | DRIVACT | DRIVCOND | DIVISION | DISTRICT_North York | DISTRICT_Scarborough | DISTRICT_Toronto and East York | ACCLASS_Non-Fatal Injury | ACCLASS_Property Damage O |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.731234 | 0.084613 | 0.479823 | 0.864958 | 0.197605 | 0.165638 | 0.050588 | 0.152398 | 52.0 | 2 | ... | 0.595664 | 0.764625 | 0.723427 | 0.81495 | 0.080709 | False | False | True | True | False |
| 1 | 0.731234 | 0.084613 | 0.479823 | 0.864958 | 0.197605 | 0.165638 | 0.050588 | 0.152398 | 17.0 | 1 | ... | 0.595664 | 0.764625 | 0.723427 | 0.81495 | 0.080709 | False | False | True | True | False |
| 2 | 0.731234 | 0.084613 | 0.479823 | 0.864958 | 0.197605 | 0.165638 | 0.050588 | 0.457193 | 57.0 | 1 | ... | 0.595664 | 0.764625 | 0.723427 | 0.81495 | 0.080709 | False | False | True | True | False |
| 3 | 0.731234 | 0.084613 | 0.479823 | 0.864958 | 0.197605 | 0.165638 | 0.050588 | 0.152398 | 22.0 | 1 | ... | 0.595664 | 0.764625 | 0.723427 | 0.81495 | 0.080709 | False | False | True | True | False |
| 4 | 0.731234 | 0.084613 | 0.479823 | 0.864958 | 0.197605 | 0.165638 | 0.050588 | 0.152398 | 17.0 | 1 | ... | 0.595664 | 0.764625 | 0.723427 | 0.81495 | 0.080709 | False | False | True | True | False |


In [5]:
# Separate data into features and target
X = df.drop(columns=['INJURY'])
y = df['INJURY']

In [6]:
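# Split data into training (80%) and test (20%) sets.
# random_state fixes the shuffle so the split is reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)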

Steps for the cell below:
A dictionary called models is created, where each key is a model name and each value is the corresponding model object (the models are instantiated here, but not yet trained).

In [7]:
# Define models
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "XGBoost": XGBRegressor(random_state=42, verbosity=0)
}


Steps:
- We create an empty list called results.
- We iterate through the models dictionary.
- For each model, we fit on the training data, then predict on the test data.
- We calculate the RMSE, MAE and R2 for each model from the predictions and the actual y_test values.
- Each model's scores are appended to results.

In [8]:
# Train and evaluate models
results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    mae = mean_absolute_error(y_test, preds)
    r2 = r2_score(y_test, preds)
    results.append({
        "Model": name,
        "RMSE": rmse,
        "MAE": mae,
        "R2 Score": r2
    })
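To show what the three metrics in the loop measure, here is a small worked sketch that computes them by hand with NumPy. The values of `y_true` and `y_pred` are toy numbers chosen for illustration only; they are not taken from the project data.

```python
import numpy as np

# Toy values (illustrative assumption, not project data).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])

errors = y_true - y_pred                 # [-0.5, 0.0, 0.5, -1.0]

# Mean squared error and its square root (RMSE penalizes large errors more).
mse = np.mean(errors ** 2)               # (0.25 + 0 + 0.25 + 1) / 4 = 0.375
rmse = np.sqrt(mse)                      # about 0.612

# Mean absolute error (average size of an error, in target units).
mae = np.mean(np.abs(errors))            # (0.5 + 0 + 0.5 + 1) / 4 = 0.5

# R2: share of the target's variance explained by the predictions.
ss_res = np.sum(errors ** 2)                      # 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 5.0
r2 = 1 - ss_res / ss_tot                          # 0.7

print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

Lower RMSE and MAE are better, while a higher R2 is better. RMSE and MAE can rank models differently because RMSE weights large individual errors more heavily.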

In [9]:
# Display results as DataFrame
results_df = pd.DataFrame(results)
results_df

| | Model | RMSE | MAE | R2 Score |
|---|---|---|---|---|
| 0 | Linear Regression | 0.562325 | 0.371807 | 0.141616 |
| 1 | Ridge Regression | 0.56235 | 0.371768 | 0.141541 |
| 2 | Random Forest | 0.550922 | 0.321334 | 0.176078 |
| 3 | XGBoost | 0.539624 | 0.328111 | 0.209524 |
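The scores in the table above can also be compared metric by metric. The sketch below re-enters the reported values by hand (so it runs without the dataset) and looks up the best model for each metric. The dictionary name `best` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Scores copied from the results table above.
results_df = pd.DataFrame({
    "Model": ["Linear Regression", "Ridge Regression", "Random Forest", "XGBoost"],
    "RMSE": [0.562325, 0.562350, 0.550922, 0.539624],
    "MAE": [0.371807, 0.371768, 0.321334, 0.328111],
    "R2 Score": [0.141616, 0.141541, 0.176078, 0.209524],
}).set_index("Model")

# Lower is better for RMSE and MAE; higher is better for R2.
best = {
    "RMSE": results_df["RMSE"].idxmin(),
    "MAE": results_df["MAE"].idxmin(),
    "R2 Score": results_df["R2 Score"].idxmax(),
}

for metric, model_name in best.items():
    print(f"Best {metric}: {model_name}")
```

On these numbers, XGBoost has the lowest RMSE and the highest R2, while Random Forest has the lowest MAE.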


After evaluating four regression models (Linear Regression, Ridge Regression, Random Forest, and XGBoost), we selected **Random Forest** as the final model.

### Reason for Choosing Random Forest:

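- **Lowest MAE** (0.321): the smallest average absolute prediction error of the four models.
- **Good balance** of RMSE and R²: Random Forest (RMSE 0.551, R² 0.176) outperforms Linear and Ridge Regression (RMSE 0.562, R² 0.142). XGBoost scores slightly better on these two metrics (RMSE 0.540, R² 0.210) but has a higher MAE (0.328).
- **Robust to overfitting**: averaging many trees reduces variance, and the model handles feature interactions well.
- Easy to interpret through **feature importance** plots.
- Offers reasonable performance without extensive hyperparameter tuning; all models here were run with default settings.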
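As an illustration of the feature importance point, here is a minimal sketch of reading importances from a fitted `RandomForestRegressor`. It uses synthetic data rather than the project dataset, and the column names `signal`, `noise_a` and `noise_b` are hypothetical. Only `signal` actually drives the target, so it should receive most of the importance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic features (illustrative assumption): one informative, two pure noise.
X = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
y = 2.0 * X["signal"] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X, y)

# feature_importances_ holds one impurity-based score per column, summing to 1.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances)
```

With matplotlib installed, `importances.plot.barh()` would draw the same scores as a horizontal bar chart. The same pattern should apply to the fitted Random Forest in this notebook, with `X_train.columns` as the index.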

Thus, Random Forest was selected as the final model for deployment and further analysis.