<span style="color: Blue;">**Predicting Road Accident Risk**</span>

**Dataset Description**

This dataset (train and test) was generated from a deep learning model trained on the Simulated Roads Accident dataset. The features are similar to the original dataset, but not exactly the same.

I built the model using <span style="color: Blue;">XGBoost</span>, tuned its hyperparameters with <span style="color: Blue;">Random Search</span> and <span style="color: Blue;">Bayesian Optimization</span>, and evaluated performance using Root Mean Squared Error (RMSE) between predicted and actual accident risk values.
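
As a quick illustration of the metric, RMSE is the square root of the mean squared difference between predicted and actual values. The sketch below is plain Python with made-up numbers; the `actual` and `predicted` lists and both helper names are assumptions for illustration and are not part of the notebook. The second helper shows how a scikit-learn `neg_mean_squared_error` score can be converted back to an RMSE, which is what `np.sqrt(-opt.best_score_)` does in the Bayesian Optimization cell.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rmse_from_neg_mse(score):
    """Convert a negated MSE score (higher is better) into an RMSE."""
    return math.sqrt(-score)


# Hypothetical accident_risk values, chosen only for illustration
actual = [0.13, 0.35, 0.30, 0.21]
predicted = [0.15, 0.30, 0.33, 0.20]

print("RMSE:", rmse(actual, predicted))
print("RMSE from a negated MSE of -0.04:", rmse_from_neg_mse(-0.04))
```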

**Data Source:** `kaggle competitions download -c playground-series-s5e10`

In [None]:
# Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load and preview dataset
df = pd.read_csv('train.csv')
df.head()

Unnamed: 0,id,road_type,num_lanes,curvature,speed_limit,lighting,weather,road_signs_present,public_road,time_of_day,holiday,school_season,num_reported_accidents,accident_risk
0,0,urban,2,0.06,35,daylight,rainy,False,True,afternoon,False,True,1,0.13
1,1,urban,4,0.99,35,daylight,clear,True,False,evening,True,True,0,0.35
2,2,rural,4,0.63,70,dim,clear,False,True,morning,True,False,2,0.3
3,3,highway,4,0.07,35,dim,rainy,True,True,morning,False,False,1,0.21
4,4,rural,1,0.58,60,daylight,foggy,False,False,evening,True,False,1,0.56


In [None]:
# Dataset dimensions
print('Row:',df.shape[0])
print('Column:', df.shape[1])

Row: 517754
Column: 14


In [None]:
# Summary of numerical variables
df.describe(include = 'number').T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,517754.0,258876.5,149462.849974,0.0,129438.25,258876.5,388314.75,517753.0
num_lanes,517754.0,2.491511,1.120434,1.0,1.0,2.0,3.0,4.0
curvature,517754.0,0.488719,0.272563,0.0,0.26,0.51,0.71,1.0
speed_limit,517754.0,46.112575,15.788521,25.0,35.0,45.0,60.0,70.0
num_reported_accidents,517754.0,1.18797,0.895961,0.0,1.0,1.0,2.0,7.0
accident_risk,517754.0,0.352377,0.166417,0.0,0.23,0.34,0.46,1.0


In [None]:
# Summary of categorical (object or category) variables
df.describe(include = ['category', 'object']).T

Unnamed: 0,count,unique,top,freq
road_type,517754,3,highway,173672
lighting,517754,3,dim,183826
weather,517754,3,foggy,181463
time_of_day,517754,3,morning,173410


In [None]:
# Summary of all boolean columns
df.describe(include = 'bool').T

Unnamed: 0,count,unique,top,freq
road_signs_present,517754,2,False,259289
public_road,517754,2,True,260045
holiday,517754,2,True,260688
school_season,517754,2,False,260164


In [None]:
# Set index and inspect random samples
df.set_index('id', inplace=True)
df.sample(3)

id,road_type,num_lanes,curvature,speed_limit,lighting,weather,road_signs_present,public_road,time_of_day,holiday,school_season,num_reported_accidents,accident_risk
281010,urban,3,0.85,70,night,clear,False,False,afternoon,True,False,2,0.61
505482,highway,1,0.11,70,daylight,foggy,False,True,morning,False,False,1,0.33
425063,urban,3,0.31,45,dim,foggy,False,False,evening,True,False,2,0.19


In [None]:
# Identify and list the categorical and numerical columns
cat = df.select_dtypes(include = ['category', 'object']).columns.tolist()
num = df.select_dtypes(include = 'number').columns.tolist()

print("Categorical Variables:", cat)
print("Numerical Variables:", num)

Categorical Variables: ['road_type', 'lighting', 'weather', 'time_of_day']
Numerical Variables: ['num_lanes', 'curvature', 'speed_limit', 'num_reported_accidents', 'accident_risk']


In [None]:
# Missing data overview
df.isnull().sum().sum()

np.int64(0)

In [None]:
# Skewness of all numerical columns
for col in num:
    print(f'{col}: {df[col].skew()}')


num_lanes: 0.01277533250638846
curvature: -0.03868453373471708
speed_limit: 0.18115979057866655
num_reported_accidents: 0.3739685270993894
accident_risk: 0.37841797634228097


<span style="color: Red;">**Using Extreme Gradient Boosting to Predict Road Accident Risk**</span>


In [None]:
# Perform one-way ANOVA to test whether mean accident risk differs across the levels of each categorical feature
from scipy.stats import f_oneway

cat_cols = ['road_type', 'lighting', 'weather', 'time_of_day']

for col in cat_cols:
    groups = [group['accident_risk'].values for name, group in df.groupby(col)]
    f_stat, p = f_oneway(*groups)
    print(f"{col}: p-value = {p:.4f}")

road_type: p-value = 0.0000
lighting: p-value = 0.0000
weather: p-value = 0.0000
time_of_day: p-value = 0.0000


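I ran a one-way ANOVA to check which categorical variables are associated with the continuous target, accident_risk. All four p-values print as 0.0000, so every variable shows a statistically significant relationship and all four are encoded with Target Encoding. Note that with more than 500,000 rows even small differences between group means produce p-values this small, so significance alone does not show how strong each relationship is. The aim is to encode only variables that carry signal, so the model learns useful patterns without being misled by irrelevant features.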
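
Because p-values shrink as the sample grows, an effect size can be a useful companion to the ANOVA result. The sketch below computes eta squared (between-group sum of squares divided by total sum of squares) in plain Python. It is not part of the original analysis; the function name `eta_squared` and the toy groups are assumptions for illustration only.

```python
def eta_squared(groups):
    """Share of total variance explained by group membership.

    `groups` maps a category label to a list of target values.
    Returns SS_between / SS_total, a value between 0 and 1.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_between / ss_total


# Toy data: group means differ clearly, so most variance is between groups
toy_groups = {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]}
toy_eta = eta_squared(toy_groups)

# Toy data: identical group means, so group membership explains nothing
flat_groups = {"a": [1.0, 3.0], "b": [2.0, 2.0]}
flat_eta = eta_squared(flat_groups)

print(toy_eta, flat_eta)
```

In the notebook, the same dictionary could be built from the `df.groupby(col)` groups used in the ANOVA cell, with one list of `accident_risk` values per category.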

In [None]:
# Display unique values for each categorical variable
cat_cols = ['road_type', 'lighting', 'weather', 'time_of_day']

for col in cat_cols:
    print(f"{col}: {df[col].unique()}\n")

road_type: ['urban' 'rural' 'highway']

lighting: ['daylight' 'dim' 'night']

weather: ['rainy' 'clear' 'foggy']

time_of_day: ['afternoon' 'evening' 'morning']



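I listed the unique values of road_type, lighting, weather, and time_of_day to check how many categories each one has. Each variable has only three categories. Target Encoding is most useful for variables with many categories, but it also works here: it replaces each category with a number derived from the mean accident_risk of that category, so the model can capture each variable's relationship with accident_risk in a single numeric column.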
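
To make the idea concrete, the sketch below implements a simplified target encoding in plain Python: each category is replaced by the mean target observed for it, and unseen categories fall back to the global mean. This is an unsmoothed simplification and not the exact formula used by `category_encoders.TargetEncoder`; the helper names and toy rows are assumptions for illustration. It also fits the means on training rows only, which is one way to keep holdout targets out of the encoded features (in the cells below, the encoder is fit on the full training file before the train/test split).

```python
from collections import defaultdict


def fit_target_means(categories, targets):
    """Learn the mean target per category, plus the global mean as a fallback."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    global_mean = sum(targets) / len(targets)
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return means, global_mean


def transform_with_means(categories, means, global_mean):
    """Replace each category with its learned mean; unseen ones get the global mean."""
    return [means.get(cat, global_mean) for cat in categories]


# Hypothetical training rows: lighting category and accident_risk
train_lighting = ["daylight", "dim", "night", "night", "daylight"]
train_risk = [0.25, 0.25, 0.75, 0.5, 0.5]

means, global_mean = fit_target_means(train_lighting, train_risk)

# Holdout rows are encoded with statistics learned from the training rows only;
# "fog" is a made-up category that never appeared in training
holdout_encoded = transform_with_means(["night", "dim", "fog"], means, global_mean)
print(holdout_encoded)
```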

In [None]:
!pip install category_encoders



In [None]:
# Target encoding of categorical variables
from category_encoders import TargetEncoder

cat_cols = ['road_type', 'lighting', 'weather', 'time_of_day']
target = df['accident_risk']

te = TargetEncoder(cols=cat_cols)
df_encoded = te.fit_transform(df[cat_cols], target)

In [None]:
# Merging encoded features with the original dataset
df = pd.concat([df.drop(columns=cat_cols), df_encoded], axis=1)
df.head()

id,num_lanes,curvature,speed_limit,road_signs_present,public_road,holiday,school_season,num_reported_accidents,accident_risk,road_type,lighting,weather,time_of_day
0,2,0.06,35,False,True,False,True,1,0.13,0.357456,0.302923,0.361494,0.351428
1,4,0.99,35,True,False,True,True,0,0.35,0.357456,0.302923,0.31006,0.354736
2,4,0.63,70,False,True,True,False,2,0.3,0.349997,0.300109,0.31006,0.350966
3,4,0.07,35,True,True,False,False,1,0.21,0.349734,0.300109,0.361494,0.350966
4,1,0.58,60,False,False,True,False,1,0.56,0.349997,0.302923,0.386305,0.354736


In [None]:
# Defining features and target variable
X = df.drop('accident_risk', axis=1)
y = df['accident_risk']

In [None]:
# Importing required libraries for data splitting, evaluation, and model building
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

In [None]:
# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

<span style="color: Red;">**Using Randomized Search for Hyperparameter Tuning**</span>

In [None]:
# Importing RandomizedSearchCV for Hyperparameter Tuning
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# Define the model
xgb_model = XGBRegressor(objective='reg:squarederror', random_state=1)

# Define the hyperparameter space
param_dist = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.05, 0.1, 0.2]
}

# Setup Randomized Search
random_search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_dist,
    n_iter=50,
    scoring='neg_mean_squared_error',
    cv=3,
    verbose=1,
    random_state=1,
    n_jobs=-1
)

# Fit on training data
random_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", random_search.best_params_)

# Predict and evaluate on test set
y_pred = random_search.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Test RMSE:", rmse)

Fitting 3 folds for each of 50 candidates, totalling 150 fits
Best Parameters: {'n_estimators': 200, 'max_depth': 7, 'learning_rate': 0.05}
Test RMSE: 0.0560097910442412
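
For context on the search above: the three lists in `param_dist` define 5 × 4 × 4 = 80 possible combinations, and `n_iter=50` evaluates 50 of them. When every parameter is given as a list, scikit-learn samples candidates without replacement. The sketch below reproduces that idea with the standard library only. The random stream differs from scikit-learn's, so the candidates drawn here will not match the ones evaluated in the notebook.

```python
import itertools
import random

param_dist = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_depth": [3, 5, 7, 10],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
}

# Enumerate every combination of the listed values
names = list(param_dist)
grid = [dict(zip(names, combo)) for combo in itertools.product(*param_dist.values())]

# Draw 50 distinct candidates, mirroring n_iter=50
rng = random.Random(1)
candidates = rng.sample(grid, k=50)

print(len(grid), len(candidates))  # prints: 80 50
```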


<span style="color: Red;">**Using Bayesian Optimization for Hyperparameter Tuning**</span>


In [None]:
!pip install scikit-optimize



In [None]:
# Importing BayesSearchCV for Bayesian Hyperparameter Optimization
from skopt import BayesSearchCV

In [None]:
# Define the model
xgb_model_opt = XGBRegressor(objective='reg:squarederror', random_state=1)

# Define the search space
search_spaces = {
    'n_estimators': (100, 600),
    'max_depth': (3, 10),
    'learning_rate': (0.01, 0.3, 'log-uniform')
}

# Define the Bayesian Search
opt = BayesSearchCV(
    estimator=xgb_model_opt,
    search_spaces=search_spaces,
    n_iter=50,
    cv=3,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1,
    random_state=1
)

# Fit the search
opt.fit(X_train, y_train)

# Best parameters and score
print("Best Parameters:", opt.best_params_)
print("Best CV Score (RMSE):", np.sqrt(-opt.best_score_))

# Predict on test data
y_pred = opt.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Test RMSE:", rmse)


Fitting 3 folds for each of 1 candidates, totalling 3 fits
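
The `'log-uniform'` prior on `learning_rate` means values are spread evenly on a logarithmic scale, so small learning rates between 0.01 and 0.05 are explored about as often as larger ones between 0.05 and 0.3. The sketch below illustrates only that prior, using the standard library. It does not reproduce how `BayesSearchCV` chooses each next candidate, and the helper name `sample_log_uniform` is an assumption for illustration.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(1)
draws = [sample_log_uniform(0.01, 0.3, rng) for _ in range(10_000)]

# On a log scale, the midpoint of the range is the geometric mean of the bounds
log_midpoint = math.sqrt(0.01 * 0.3)

# Roughly half of the draws should fall below that midpoint
share_below = sum(d < log_midpoint for d in draws) / len(draws)
print("log-scale midpoint:", log_midpoint)
print("share of draws below it:", share_below)
```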

<span style="color: Red;">**Load the test data for making predictions**</span>

In [None]:
# Load the test dataset
df_test = pd.read_csv('test.csv')

In [None]:
# Previewing a sample of the test dataset
df_test.sample(5)

Unnamed: 0,id,road_type,num_lanes,curvature,speed_limit,lighting,weather,road_signs_present,public_road,time_of_day,holiday,school_season,num_reported_accidents
156074,673828,urban,1,0.7,25,dim,rainy,False,False,evening,False,False,2
81612,599366,highway,4,1.0,60,night,rainy,False,True,afternoon,True,False,0
130901,648655,highway,1,0.03,45,daylight,rainy,False,True,morning,True,True,1
91390,609144,urban,4,0.32,70,daylight,clear,True,False,afternoon,False,False,2
123371,641125,rural,3,0.15,45,daylight,foggy,False,False,afternoon,False,False,1


In [None]:
# Checking for missing values in the test dataset
df_test.isnull().sum().sum()

np.int64(0)

In [None]:
# Applying target encoding to categorical features in the test dataset
df_test[cat_cols] = te.transform(df_test[cat_cols])

In [None]:
# Setting test dataset index and previewing last rows
df_test.set_index('id', inplace=True)
df_test.tail(5)

id,road_type,num_lanes,curvature,speed_limit,lighting,weather,road_signs_present,public_road,time_of_day,holiday,school_season,num_reported_accidents
690334,0.349997,2,0.01,45,0.300109,0.361494,False,False,0.351428,True,True,2
690335,0.349997,1,0.74,70,0.302923,0.386305,False,True,0.351428,False,False,2
690336,0.357456,2,0.14,70,0.300109,0.31006,False,False,0.354736,True,True,1
690337,0.357456,1,0.09,45,0.302923,0.386305,True,True,0.350966,False,True,0
690338,0.349734,1,0.63,35,0.470467,0.386305,True,False,0.354736,False,False,0


In [None]:
# Aligning test features and making predictions with the Random Search model
X_test_pred = df_test[X_train.columns]

accident_risk = random_search.predict(X_test_pred)

In [None]:
# Creating a submission dataset for Random Search
submission = pd.DataFrame({
    'accident_risk': accident_risk
}, index=df_test.index)

submission.head(5)

In [None]:
# Save to CSV file with the index (Random Search)
submission.to_csv('submission.csv', index=True)

In [None]:
# Generating Predictions on Test Data Using the Bayesian-Optimized Model
accident_risk_opt = opt.predict(X_test_pred)

In [None]:
# Creating a submission dataset for Bayesian Optimization
submission_opt = pd.DataFrame({
    'accident_risk': accident_risk_opt
}, index=df_test.index)

submission_opt.head(5)

id,accident_risk
517754,0.296573
517755,0.122078
517756,0.183187
517757,0.318613
517758,0.406914


In [None]:
# Save to CSV file with the index (Bayesian Optimization)
# A separate file name keeps the Random Search submission from being overwritten
submission_opt.to_csv('submission_opt.csv', index=True)

# I sincerely appreciate you taking the time to review my work. Your feedback and guidance are highly valued and will greatly help in improving the quality of my work.
