<a href="https://colab.research.google.com/github/HassanZeb01/global-road-safety-analytics/blob/main/Accident_Severity_Prediction_Multi_Country_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# Core Libraries
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix, roc_auc_score, roc_curve,
)

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb

# Utilities
import warnings
warnings.filterwarnings('ignore')

# **Phase 01: GLOBAL ROAD SAFETY ANALYTICS: ACCIDENT SEVERITY PREDICTION**

---

## **1. Loading the Global Road Accident Dataset**

In [1]:
from google.colab import drive

# Mount your Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Path to the dataset CSV file in Google Drive
road_accident_df_url = '/content/drive/MyDrive/DataScienceProject/GlobalRoadSafetyAnalytics/road_accident_dataset.csv'

# Read the CSV file into a DataFrame
raw_df = pd.read_csv(road_accident_df_url)

# Display 10 randomly sampled rows of the dataset
raw_df.sample(10)

Unnamed: 0,Country,Year,Month,Day of Week,Time of Day,Urban/Rural,Road Type,Weather Conditions,Visibility Level,Number of Vehicles Involved,...,Number of Fatalities,Emergency Response Time,Traffic Volume,Road Condition,Accident Cause,Insurance Claims,Medical Cost,Economic Loss,Region,Population Density
73968,Japan,2023,January,Friday,Morning,Urban,Main Road,Snowy,369.282252,1,...,1,5.784282,8446.831439,Dry,Weather,6,39864.41736,25628.433544,Australia,3846.929161
8269,Germany,2009,March,Sunday,Morning,Urban,Highway,Foggy,288.20462,2,...,1,37.587917,7276.195351,Wet,Speeding,5,7522.340331,33929.656496,Australia,4125.342157
97145,China,2008,April,Friday,Morning,Rural,Highway,Windy,120.526142,3,...,4,22.837915,9490.144404,Wet,Speeding,6,9201.625666,76640.565619,Asia,211.416984
4336,Russia,2023,August,Saturday,Afternoon,Urban,Highway,Windy,109.284266,4,...,4,40.275919,3406.227071,Snow-covered,Mechanical Failure,3,36860.040361,14070.343293,Australia,1298.186576
87049,Canada,2020,December,Monday,Morning,Rural,Main Road,Rainy,166.391522,3,...,2,5.797319,9166.644066,Wet,Drunk Driving,3,33262.415017,78668.767707,South America,3297.523837
47630,USA,2018,January,Tuesday,Afternoon,Rural,Highway,Clear,135.807906,1,...,0,39.175239,755.517889,Icy,Mechanical Failure,9,13458.67188,94727.775709,Europe,2687.734796
42135,Australia,2000,June,Monday,Evening,Urban,Highway,Rainy,311.123409,1,...,0,17.190474,352.458493,Icy,Weather,8,36275.751186,54120.43864,Europe,1207.177616
45670,Brazil,2019,May,Sunday,Morning,Rural,Street,Rainy,376.48268,1,...,2,37.617562,8180.954047,Snow-covered,Distracted Driving,7,15348.865523,88061.803472,South America,1964.789397
53372,China,2020,June,Monday,Morning,Urban,Highway,Foggy,51.828428,3,...,0,9.839355,4999.771111,Snow-covered,Speeding,4,48827.738907,89140.402234,Asia,3744.683913
44470,China,2014,July,Saturday,Evening,Urban,Main Road,Foggy,253.805137,4,...,0,57.171936,3673.159077,Dry,Mechanical Failure,2,29612.179371,54612.088837,South America,472.822207
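
In the sample above, several rows pair a `Country` with a `Region` that does not contain it (for example, Japan with Australia, or China with South America). One way to quantify this is to cross-tabulate the two columns and compare each row against an expected mapping. The sketch below uses a small hypothetical frame, and both `toy_df` and the `expected_region` mapping are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame that mimics the Country and Region columns;
# the rows are illustrative and are not taken from the real dataset.
toy_df = pd.DataFrame({
    "Country": ["Japan", "Japan", "China", "Brazil", "Brazil"],
    "Region": ["Asia", "Australia", "Asia", "South America", "Europe"],
})

# Assumed country-to-region mapping (an assumption for this sketch).
expected_region = {"Japan": "Asia", "China": "Asia", "Brazil": "South America"}

# Count how often each Country/Region pair occurs.
pair_counts = pd.crosstab(toy_df["Country"], toy_df["Region"])
print(pair_counts)

# Flag rows whose Region differs from the expected region of the Country.
mismatch_mask = toy_df["Country"].map(expected_region) != toy_df["Region"]
mismatch_share = mismatch_mask.mean()
print(f"Mismatched rows: {mismatch_mask.sum()} ({mismatch_share:.0%})")
```

Applied to `raw_df`, a high mismatch share would suggest that `Region` was generated independently of `Country` and should be treated with caution as a feature.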


## **2. Preprocessing - Global Road Accident Dataset**

### **2.1. Dataset Dimensions:**

In [4]:
# Dataset dimensions as (rows, columns)
raw_df.shape

(132000, 30)

Total rows: 132,000 \
Total columns: 30

### **2.2. Inspecting Data Types and Format of the Dataset:**

In [5]:
# Summary of the dataset (non-null counts and data types)
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132000 entries, 0 to 131999
Data columns (total 30 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Country                      132000 non-null  object 
 1   Year                         132000 non-null  int64  
 2   Month                        132000 non-null  object 
 3   Day of Week                  132000 non-null  object 
 4   Time of Day                  132000 non-null  object 
 5   Urban/Rural                  132000 non-null  object 
 6   Road Type                    132000 non-null  object 
 7   Weather Conditions           132000 non-null  object 
 8   Visibility Level             132000 non-null  float64
 9   Number of Vehicles Involved  132000 non-null  int64  
 10  Speed Limit                  132000 non-null  int64  
 11  Driver Age Group             132000 non-null  object 
 12  Driver Gender                132000 non-null  object 
 13 

### **2.3. Checking Duplicate Rows:**

In [6]:
# Count the fully duplicated rows in the dataset
dup = raw_df.duplicated().sum()
print('Duplicate Rows:', dup)

Duplicate Rows: 0
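
The dataset has no duplicates, so no cleanup is needed here. For reference, the sketch below shows how duplicates could be inspected and dropped if the count were non-zero. It runs on a small hypothetical frame (`toy_df` and its values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical toy frame containing one repeated row.
toy_df = pd.DataFrame({
    "Country": ["USA", "USA", "Canada", "USA"],
    "Year": [2020, 2020, 2019, 2021],
})

# Count fully duplicated rows (the first occurrence is not counted).
dup_count = toy_df.duplicated().sum()

# Show every row in a duplicate group, including the first occurrence.
dup_rows = toy_df[toy_df.duplicated(keep=False)]

# Drop duplicates, keep the first occurrence, and rebuild the index.
deduped_df = toy_df.drop_duplicates().reset_index(drop=True)

print("Duplicate rows:", dup_count)
print(dup_rows)
print("Shape after dropping duplicates:", deduped_df.shape)
```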


### **2.4. Checking Missing Values:**


In [7]:
# Count the missing values in each column of the dataset
raw_df.isnull().sum()

Unnamed: 0,0
Country,0
Year,0
Month,0
Day of Week,0
Time of Day,0
Urban/Rural,0
Road Type,0
Weather Conditions,0
Visibility Level,0
Number of Vehicles Involved,0
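
With 30 columns, the per-column listing is long and mostly zeros. A more compact view reports only the columns that actually contain missing values, together with their percentage. The sketch below illustrates the idea on a hypothetical frame (`toy_df`, its values, and the `missing_summary` name are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries.
toy_df = pd.DataFrame({
    "Country": ["USA", "Canada", None, "Japan"],
    "Visibility Level": [120.5, np.nan, np.nan, 300.0],
    "Year": [2020, 2019, 2018, 2021],
})

# Missing count and missing percentage per column.
missing_summary = pd.DataFrame({
    "missing_count": toy_df.isnull().sum(),
    "missing_pct": toy_df.isnull().mean() * 100,
})

# Keep only columns with at least one missing value, worst first.
missing_summary = (
    missing_summary[missing_summary["missing_count"] > 0]
    .sort_values("missing_count", ascending=False)
)
print(missing_summary)
```

An empty result means that no column has missing values.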


### **2.5. Checking Total Countries:**


In [8]:
# Total number of distinct countries in the global accident dataset
total_countries = raw_df['Country'].nunique()
print('Total Countries', total_countries)

Total Countries 10
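
The count of unique countries says nothing about how the records are spread across them. A possible follow-up is `value_counts`, which gives the number and share of rows per country. The sketch below uses a hypothetical frame (`toy_df` and its values are assumptions, not the real distribution).

```python
import pandas as pd

# Hypothetical toy frame with a Country column.
toy_df = pd.DataFrame({
    "Country": ["USA", "China", "USA", "Japan", "China", "USA"],
})

# Number of distinct countries.
n_countries = toy_df["Country"].nunique()

# Rows per country, most frequent first.
country_counts = toy_df["Country"].value_counts()

# Share of rows per country (fractions that sum to 1).
country_share = toy_df["Country"].value_counts(normalize=True)

print("Total Countries", n_countries)
print(country_counts)
print(country_share.round(2))
```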
