# Weekly Project 2!

## Introduction to Road Traffic Accidents (RTA) Dataset

### Dataset Overview
The RTA Dataset provides a detailed snapshot of road traffic accidents, capturing a range of data from accident conditions to casualty details. This dataset is essential for analyzing patterns and causes of accidents to improve road safety.

### Data Characteristics
- **Entries**: The dataset contains 12,316 entries.
- **Features**: There are 32 features in the dataset, which include:
  - `Time`: Time when the accident occurred.
  - `Day_of_week`: Day of the week.
  - `Age_band_of_driver`: Age group of the driver involved.
  - `Sex_of_driver`: Gender of the driver.
  - `Educational_level`: Educational level of the driver.
  - `Type_of_vehicle`: Type of vehicle involved in the accident.
  - `Cause_of_accident`: Reported cause of the accident.
  - `Accident_severity`: Severity of the accident.
- **Target Column**: `Accident_severity` is used as the target column for modeling. This feature classifies the severity of each accident.

### Objective
Students will use this dataset to apply various data visualization, modeling, and evaluation techniques learned in class. The primary goal is to build models that can accurately predict the severity of accidents and to identify the key factors that contribute to severe accidents.

## Import Libraries
Import all the necessary libraries here. Include libraries for handling data (like pandas), visualization (like matplotlib and seaborn), and modeling (like scikit-learn).

In [556]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [557]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Load Data
Load the dataset from the provided CSV file into a DataFrame.

In [558]:
df = pd.read_csv('/content/drive/MyDrive/Datasets/RTA_Dataset.csv')

## Exploratory Data Analysis (EDA)
Perform EDA to understand the data better. This involves several steps to summarize the main characteristics, uncover patterns, and establish relationships:
* Find the dataset information and observe the datatypes.
* Check the shape of the data to understand its structure.
* View the data with various functions to get an initial sense of it.
* Perform summary statistics on the dataset to grasp central tendencies and variability.
* Check for duplicated data.
* Check for null values.

And apply more if needed!
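
One further check that may be worth adding: `isna()` only counts true nulls, while this dataset also uses placeholder strings such as `na` and `Unknown` (both are visible in the cell outputs of this notebook). The sketch below counts both kinds per column. The frame `toy` and the list `placeholders` are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame; the values mirror ones visible in the notebook outputs.
toy = pd.DataFrame({
    'Age_band_of_casualty': ['na', '18-30', 'na', '31-50'],
    'Service_year_of_vehicle': ['Unknown', '2-5yrs', None, '5-10yrs'],
})

# Assumed list of strings that stand in for missing values.
placeholders = ['na', 'Unknown', 'unknown']

summary = pd.DataFrame({
    'true_nulls': toy.isna().sum(),                   # real NaN/None values
    'placeholders': toy.isin(placeholders).sum(),     # strings that only look like data
})
print(summary)
```

Running the same two lines on `df` instead of `toy` would show how much missing data `df.isna().sum()` alone leaves out.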


In [559]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12316 entries, 0 to 12315
Data columns (total 32 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Time                         12316 non-null  object
 1   Day_of_week                  12316 non-null  object
 2   Age_band_of_driver           12316 non-null  object
 3   Sex_of_driver                12316 non-null  object
 4   Educational_level            11575 non-null  object
 5   Vehicle_driver_relation      11737 non-null  object
 6   Driving_experience           11487 non-null  object
 7   Type_of_vehicle              11366 non-null  object
 8   Owner_of_vehicle             11834 non-null  object
 9   Service_year_of_vehicle      8388 non-null   object
 10  Defect_of_vehicle            7889 non-null   object
 11  Area_accident_occured        12077 non-null  object
 12  Lanes_or_Medians             11931 non-null  object
 13  Road_allignment              12

In [560]:
df.iloc[10]

Unnamed: 0,10
Time,14:40:00
Day_of_week,Saturday
Age_band_of_driver,18-30
Sex_of_driver,Male
Educational_level,Above high school
Vehicle_driver_relation,Owner
Driving_experience,1-2yr
Type_of_vehicle,Public (13?45 seats)
Owner_of_vehicle,Owner
Service_year_of_vehicle,Unknown


In [561]:
df.sample(3)

Unnamed: 0,Time,Day_of_week,Age_band_of_driver,Sex_of_driver,Educational_level,Vehicle_driver_relation,Driving_experience,Type_of_vehicle,Owner_of_vehicle,Service_year_of_vehicle,...,Vehicle_movement,Casualty_class,Sex_of_casualty,Age_band_of_casualty,Casualty_severity,Work_of_casuality,Fitness_of_casuality,Pedestrian_movement,Cause_of_accident,Accident_severity
8713,9:15:00,Sunday,31-50,Male,Elementary school,Owner,5-10yr,Lorry (41?100Q),Organization,2-5yrs,...,Going straight,Driver or rider,Male,31-50,3,Driver,Normal,Not a Pedestrian,Turnover,Slight Injury
4753,8:20:00,Thursday,18-30,Male,Junior high school,Owner,5-10yr,Automobile,Owner,Unknown,...,Going straight,Driver or rider,Male,31-50,3,Driver,Normal,Not a Pedestrian,Overtaking,Slight Injury
12008,11:50:00,Monday,18-30,Male,Elementary school,Owner,5-10yr,Other,Owner,2-5yrs,...,Going straight,Pedestrian,Female,31-50,3,Driver,Normal,Crossing from nearside - masked by parked or s...,Driving carelessly,Slight Injury


In [562]:
df.head()

Unnamed: 0,Time,Day_of_week,Age_band_of_driver,Sex_of_driver,Educational_level,Vehicle_driver_relation,Driving_experience,Type_of_vehicle,Owner_of_vehicle,Service_year_of_vehicle,...,Vehicle_movement,Casualty_class,Sex_of_casualty,Age_band_of_casualty,Casualty_severity,Work_of_casuality,Fitness_of_casuality,Pedestrian_movement,Cause_of_accident,Accident_severity
0,17:02:00,Monday,18-30,Male,Above high school,Employee,1-2yr,Automobile,Owner,Above 10yr,...,Going straight,na,na,na,na,,,Not a Pedestrian,Moving Backward,Slight Injury
1,17:02:00,Monday,31-50,Male,Junior high school,Employee,Above 10yr,Public (> 45 seats),Owner,5-10yrs,...,Going straight,na,na,na,na,,,Not a Pedestrian,Overtaking,Slight Injury
2,17:02:00,Monday,18-30,Male,Junior high school,Employee,1-2yr,Lorry (41?100Q),Owner,,...,Going straight,Driver or rider,Male,31-50,3,Driver,,Not a Pedestrian,Changing lane to the left,Serious Injury
3,1:06:00,Sunday,18-30,Male,Junior high school,Employee,5-10yr,Public (> 45 seats),Governmental,,...,Going straight,Pedestrian,Female,18-30,3,Driver,Normal,Not a Pedestrian,Changing lane to the right,Slight Injury
4,1:06:00,Sunday,18-30,Male,Junior high school,Employee,2-5yr,,Owner,5-10yrs,...,Going straight,na,na,na,na,,,Not a Pedestrian,Overtaking,Slight Injury


In [563]:
df.tail()

Unnamed: 0,Time,Day_of_week,Age_band_of_driver,Sex_of_driver,Educational_level,Vehicle_driver_relation,Driving_experience,Type_of_vehicle,Owner_of_vehicle,Service_year_of_vehicle,...,Vehicle_movement,Casualty_class,Sex_of_casualty,Age_band_of_casualty,Casualty_severity,Work_of_casuality,Fitness_of_casuality,Pedestrian_movement,Cause_of_accident,Accident_severity
12311,16:15:00,Wednesday,31-50,Male,,Employee,2-5yr,Lorry (11?40Q),Owner,,...,Going straight,na,na,na,na,Driver,Normal,Not a Pedestrian,No distancing,Slight Injury
12312,18:00:00,Sunday,Unknown,Male,Elementary school,Employee,5-10yr,Automobile,Owner,,...,Other,na,na,na,na,Driver,Normal,Not a Pedestrian,No distancing,Slight Injury
12313,13:55:00,Sunday,Over 51,Male,Junior high school,Employee,5-10yr,Bajaj,Owner,2-5yrs,...,Other,Driver or rider,Male,31-50,3,Driver,Normal,Not a Pedestrian,Changing lane to the right,Serious Injury
12314,13:55:00,Sunday,18-30,Female,Junior high school,Employee,Above 10yr,Lorry (41?100Q),Owner,2-5yrs,...,Other,na,na,na,na,Driver,Normal,Not a Pedestrian,Driving under the influence of drugs,Slight Injury
12315,13:55:00,Sunday,18-30,Male,Junior high school,Employee,5-10yr,Other,Owner,2-5yrs,...,Stopping,Pedestrian,Female,5,3,Driver,Normal,Crossing from nearside - masked by parked or s...,Changing lane to the right,Slight Injury


In [564]:
df.duplicated().sum()

0

In [565]:
df.shape

(12316, 32)

In [566]:
df.describe()

Unnamed: 0,Number_of_vehicles_involved,Number_of_casualties
count,12316.0,12316.0
mean,2.040679,1.548149
std,0.68879,1.007179
min,1.0,1.0
25%,2.0,1.0
50%,2.0,1.0
75%,2.0,2.0
max,7.0,8.0


In [567]:
df.isna().sum()

Unnamed: 0,0
Time,0
Day_of_week,0
Age_band_of_driver,0
Sex_of_driver,0
Educational_level,741
Vehicle_driver_relation,579
Driving_experience,829
Type_of_vehicle,950
Owner_of_vehicle,482
Service_year_of_vehicle,3928


## Data Preprocessing
Data preprocessing is essential for transforming raw data into a format suitable for further analysis and modeling. Follow these steps to ensure your data is ready for predictive modeling or advanced analytics:
- **Handling Missing Values**: Replace missing values with appropriate statistics (mean, median, mode) or use more complex imputation techniques.
- **Normalization/Scaling**: Scale data to a small, specified range like 0 to 1, or transform it to have a mean of zero and a standard deviation of one.
- **Label Encoding**: Convert categorical text data into model-understandable numbers where the labels are ordered.
- **One-Hot Encoding**: Use for nominal categorical data where no ordinal relationship exists to transform the data into a binary column for each category. (Be careful not to increase the dimensionality significantly)
- **Detection and Treatment of Outliers**: Use statistical tests, box plots, or scatter plots to identify outliers and then cap, trim, or use robust methods to reduce the effect of outliers, depending on the context.
- **Feature Engineering**: Enhance your dataset by creating new features and transforming existing ones. This might involve combining data from different columns, applying transformations, or reducing dimensionality with techniques like PCA to improve model performance.

Consider these steps as a foundation, and feel free to introduce additional preprocessing techniques as needed to address specific characteristics of your dataset.
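
As a hedged illustration of the encoding step for ordered labels: an ordinal encoder only produces meaningful numbers when the category order is passed explicitly; otherwise the categories are sorted alphabetically. The sketch uses a small hypothetical frame `toy`, and the order in `experience_order` is an assumption based on the `Driving_experience` values visible in the sample rows.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame; the values mirror ones visible in the sample rows above.
toy = pd.DataFrame({'Driving_experience': ['5-10yr', '1-2yr', 'Above 10yr', '2-5yr']})

# Assumed ordering from least to most experience (the full dataset may contain more levels).
experience_order = ['1-2yr', '2-5yr', '5-10yr', 'Above 10yr']

# One list of categories per encoded column, in the intended rank order.
encoder = OrdinalEncoder(categories=[experience_order])
codes = encoder.fit_transform(toy[['Driving_experience']])
print(codes.ravel())  # [2. 0. 3. 1.]
```

Without the `categories` argument, `'Above 10yr'` would sort before `'1-2yr'` only by accident of string ordering, so the explicit list is the safer choice for any column treated as ordinal.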

In [568]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12316 entries, 0 to 12315
Data columns (total 32 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Time                         12316 non-null  object
 1   Day_of_week                  12316 non-null  object
 2   Age_band_of_driver           12316 non-null  object
 3   Sex_of_driver                12316 non-null  object
 4   Educational_level            11575 non-null  object
 5   Vehicle_driver_relation      11737 non-null  object
 6   Driving_experience           11487 non-null  object
 7   Type_of_vehicle              11366 non-null  object
 8   Owner_of_vehicle             11834 non-null  object
 9   Service_year_of_vehicle      8388 non-null   object
 10  Defect_of_vehicle            7889 non-null   object
 11  Area_accident_occured        12077 non-null  object
 12  Lanes_or_Medians             11931 non-null  object
 13  Road_allignment              12

In [569]:
for col in df.columns:  # remove the stray '?' characters found in some entries
  if df[col].dtype == 'object':
    df[col] = df[col].str.replace('?', '', regex=False)  # regex=False: '?' is a regex metacharacter

In [570]:
# Converting data types

# Parse the clock times with an explicit format; this avoids the format-inference warning.
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S')
# From the DataFrame summary above, the remaining object columns are all categorical.
for col in df.columns:
  if df[col].dtype == 'object':
    df[col] = df[col].astype('category')
df['Number_of_casualties'] = df['Number_of_casualties'].astype('int32')               # saving memory
df['Number_of_vehicles_involved'] = df['Number_of_vehicles_involved'].astype('int32') # saving memory

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12316 entries, 0 to 12315
Data columns (total 32 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   Time                         12316 non-null  datetime64[ns]
 1   Day_of_week                  12316 non-null  category      
 2   Age_band_of_driver           12316 non-null  category      
 3   Sex_of_driver                12316 non-null  category      
 4   Educational_level            11575 non-null  category      
 5   Vehicle_driver_relation      11737 non-null  category      
 6   Driving_experience           11487 non-null  category      
 7   Type_of_vehicle              11366 non-null  category      
 8   Owner_of_vehicle             11834 non-null  category      
 9   Service_year_of_vehicle      8388 non-null   category      
 10  Defect_of_vehicle            7889 non-null   category      
 11  Area_accident_occured        12077 non-nu


In [571]:
df.isna().sum()

Unnamed: 0,0
Time,0
Day_of_week,0
Age_band_of_driver,0
Sex_of_driver,0
Educational_level,741
Vehicle_driver_relation,579
Driving_experience,829
Type_of_vehicle,950
Owner_of_vehicle,482
Service_year_of_vehicle,3928


In [572]:
df['Age_band_of_casualty'].value_counts() # 'na' is a placeholder string here (other columns use 'Unknown'); such values must be treated as missing

Unnamed: 0_level_0,count
Age_band_of_casualty,Unnamed: 1_level_1
na,4443
18-30,3145
31-50,2455
Under 18,1035
Over 51,994
5,244


In [573]:
# Before imputing, check each column's mode/mean to make sure the fill value is not itself a placeholder.
# Findings: the mode of Age_band_of_casualty is 'na' and the mode of Service_year_of_vehicle is 'Unknown'.
for col in df.columns:
  if df[col].dtype == 'category':
    print(col,':',df[col].mode()[0])
    print('-------------------------')
  elif df[col].dtype == 'int32':
    print(col,':',df[col].mean())
    print('-------------------------')


Day_of_week : Friday
-------------------------
Age_band_of_driver : 18-30
-------------------------
Sex_of_driver : Male
-------------------------
Educational_level : Junior high school
-------------------------
Vehicle_driver_relation : Employee
-------------------------
Driving_experience : 5-10yr
-------------------------
Type_of_vehicle : Automobile
-------------------------
Owner_of_vehicle : Owner
-------------------------
Service_year_of_vehicle : Unknown
-------------------------
Defect_of_vehicle : No defect
-------------------------
Area_accident_occured : Other
-------------------------
Lanes_or_Medians : Two-way (divided with broken lines road marking)
-------------------------
Road_allignment : Tangent road with flat terrain
-------------------------
Types_of_Junction : Y Shape
-------------------------
Road_surface_type : Asphalt roads
-------------------------
Road_surface_conditions : Dry
-------------------------
Light_conditions : Daylight
-------------------------
We

In [574]:
# For the two columns whose mode is a placeholder, use the most frequent real value instead.
real_mode = df[df['Age_band_of_casualty'] != 'na']['Age_band_of_casualty'].mode()[0]    # mode excluding 'na'
df['Age_band_of_casualty'] = df['Age_band_of_casualty'].replace('na', real_mode)
real_mode = df[df['Service_year_of_vehicle'] != 'Unknown']['Service_year_of_vehicle'].mode()[0]   # mode excluding 'Unknown'
df['Service_year_of_vehicle'] = df['Service_year_of_vehicle'].replace('Unknown', real_mode)


for col in df.columns:
  if df[col].dtype == 'category':
    df[col] = df[col].fillna(df[col].mode()[0])
    # 'Unknown' is treated as missing because a separate 'Other' value already exists.
    df[col] = df[col].replace('Unknown', df[col].mode()[0])
    df[col] = df[col].replace('unknown', df[col].mode()[0])
    df[col] = df[col].replace('na', df[col].mode()[0])
  elif df[col].dtype == 'int32':
    df[col] = df[col].fillna(df[col].mean())
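
An alternative way to express the same idea, shown here only as a sketch on hypothetical data: convert every placeholder string to a real `NaN` first, then fill each column with its most frequent remaining value. Because the placeholders are gone before the mode is computed, a placeholder can never end up as the fill value. The names `toy`, `placeholders`, `cleaned` and `filled` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values mirror ones visible in the notebook outputs.
toy = pd.DataFrame({
    'Age_band_of_casualty': ['na', '18-30', 'na', '31-50', '18-30', 'na'],
    'Service_year_of_vehicle': ['Unknown', '2-5yrs', 'Unknown', '5-10yrs', '2-5yrs', np.nan],
})

# Assumed list of strings that stand in for missing values.
placeholders = ['na', 'Unknown', 'unknown']

# Step 1: turn placeholders into real missing values.
cleaned = toy.replace(placeholders, np.nan)

# Step 2: DataFrame.mode() ignores NaN by default; row 0 holds the first mode of each column.
fill_values = cleaned.mode().iloc[0]
filled = cleaned.fillna(fill_values)
print(filled)
```

Note that in the toy column `Age_band_of_casualty` the raw mode is `'na'`, yet the fill value becomes `'18-30'`, which is the situation the cell above handles by hand.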


In [575]:
df['hour']= df['Time'].dt.hour                # split the time of day into hour, minute and second
df['minute']= df['Time'].dt.minute
df['second']= df['Time'].dt.second
df['hour'] = df['hour'].astype('category')
df['minute'] = df['minute'].astype('category')
df['second'] = df['second'].astype('category')
df.drop('Time', axis=1, inplace=True)

In [576]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12316 entries, 0 to 12315
Data columns (total 34 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   Day_of_week                  12316 non-null  category
 1   Age_band_of_driver           12316 non-null  category
 2   Sex_of_driver                12316 non-null  category
 3   Educational_level            12316 non-null  category
 4   Vehicle_driver_relation      12316 non-null  category
 5   Driving_experience           12316 non-null  category
 6   Type_of_vehicle              12316 non-null  category
 7   Owner_of_vehicle             12316 non-null  category
 8   Service_year_of_vehicle      12316 non-null  category
 9   Defect_of_vehicle            12316 non-null  category
 10  Area_accident_occured        12316 non-null  category
 11  Lanes_or_Medians             12316 non-null  category
 12  Road_allignment              12316 non-null  category
 13  T

In [577]:
df.isna().sum().sum()

0

In [580]:
#Pipelines
# LabelEncoder is meant for 1-D target labels and fails inside a ColumnTransformer,
# so OrdinalEncoder is used for the ordered categorical features.
from sklearn.preprocessing import OrdinalEncoder

num = ['Number_of_vehicles_involved','Number_of_casualties']
cat = df.drop(num, axis=1).columns
# Note: 'Accident_severity' is the target column; drop it from this list when fitting on features only.
cat_ordinal = ['Age_band_of_driver', 'Educational_level', 'Driving_experience', 'Service_year_of_vehicle', 'Age_band_of_casualty', 'Casualty_severity', 'Accident_severity']
cat_nominal = cat.drop(cat_ordinal)
num_transformer = Pipeline([
    ('scaler', StandardScaler())
])

cat_nominal_transformer = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

cat_ordinal_transformer = Pipeline([
    ('ordinal', OrdinalEncoder())
])
col_transformer = ColumnTransformer(
    transformers=[
    ('num', num_transformer, num),
    ('cat_nominal', cat_nominal_transformer, cat_nominal),
    ('cat_ordinal', cat_ordinal_transformer, cat_ordinal)
]
)

## Data Visualization
Create various plots to visualize the relationships in the data. Consider using the following to show different aspects of the data:

* Heatmap of Correlation Matrix.
* Line plots.
* Scatter plots.
* Histograms.
* Boxplots.

Use more if needed!
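
Since almost every column in this dataset is categorical, a correlation heatmap has little to work with; a stacked bar chart of class shares is one option that suits categorical features. The sketch below is an assumption-laden example on a hypothetical frame `toy`, not a plot from the original notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy frame using column names and values from the dataset.
toy = pd.DataFrame({
    'Day_of_week': ['Monday', 'Monday', 'Sunday', 'Sunday', 'Sunday', 'Friday'],
    'Accident_severity': ['Slight Injury', 'Serious Injury', 'Slight Injury',
                          'Slight Injury', 'Serious Injury', 'Slight Injury'],
})

# Row-normalised cross-tabulation: share of each severity class within each day.
shares = pd.crosstab(toy['Day_of_week'], toy['Accident_severity'], normalize='index')

# One stacked bar per day; each bar sums to 1.
ax = shares.plot(kind='bar', stacked=True, figsize=(6, 4))
ax.set_ylabel('Share of accidents')
ax.set_title('Accident severity by day of week (toy data)')
plt.tight_layout()
```

Swapping `toy` for `df` and `Day_of_week` for any other categorical column gives the same view for the real data.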

## Feature Selection
- Choose features that you believe will most influence the outcome based on your analysis and the insights from your visualizations. Focus on those that appear most impactful to include in your modeling.
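
One possible way to back up visual impressions with a number is mutual information between each encoded feature and the target. The sketch below is a minimal example under stated assumptions: the frame `toy` and the column names `informative_feature` and `noise_feature` are hypothetical, constructed so that one feature determines the target and the other is independent of it.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: one feature tracks the target exactly, the other is unrelated.
toy = pd.DataFrame({
    'informative_feature': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'noise_feature': ['x', 'y', 'x', 'y', 'x', 'y', 'x', 'y'],
    'Accident_severity': ['Slight Injury'] * 4 + ['Serious Injury'] * 4,
})

feature_cols = ['informative_feature', 'noise_feature']

# Encode the categories as integer codes; the codes are used only as discrete labels here.
X = OrdinalEncoder().fit_transform(toy[feature_cols]).astype(int)
y = toy['Accident_severity']

# discrete_features=True tells the estimator that every column is categorical.
scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
ranking = pd.Series(scores, index=feature_cols).sort_values(ascending=False)
print(ranking)
```

A score near zero suggests a feature carries little information about severity on its own, though it may still matter in combination with others.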

## Train-Test Split
* Divide the dataset into training and testing sets to evaluate the performance of your models.
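
Severity classes in accident data tend to be imbalanced, so a stratified split is a reasonable default: it keeps the class proportions the same in both sets. The following is a sketch on a hypothetical frame `toy`; the 75/25 ratio and `random_state=42` are arbitrary choices, not values taken from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with an 80/20 class imbalance.
toy = pd.DataFrame({
    'Number_of_vehicles_involved': [1, 2] * 10,
    'Accident_severity': ['Slight Injury'] * 16 + ['Serious Injury'] * 4,
})

# Separate the features from the target before splitting.
X = toy.drop(columns='Accident_severity')
y = toy['Accident_severity']

# stratify=y keeps the 80/20 class ratio in both the training and the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(y_train.value_counts())
print(y_test.value_counts())
```

Any fitted preprocessing (scaling, encoding) should then be fitted on `X_train` only, so that no information from the test set leaks into training.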

## Modeling

Once the data is split into training and testing sets, the next step is to build models to make predictions. Here, we will explore several machine learning algorithms, each with its unique characteristics and suitability for different types of data and problems. You will implement the following models:

### 1. Logistic Regression

### 2. Decision Tree Classifier

### 3. Support Vector Machine (SVM)

### 4. K-Neighbors Classifier

### Implementing the Models
- For each model, use the training data you have prepared to train the model.
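
The four models can be trained in one loop by wrapping each in a pipeline with its own copy of the preprocessing step. This is a minimal sketch, assuming synthetic stand-in data: the frame `toy`, the rule that generates its labels, and all hyperparameters are hypothetical and chosen only so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical): the label depends only on the casualty count.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'Number_of_casualties': rng.integers(1, 9, size=n),
    'Day_of_week': rng.choice(['Monday', 'Friday', 'Sunday'], size=n),
})
toy['Accident_severity'] = np.where(
    toy['Number_of_casualties'] > 4, 'Serious Injury', 'Slight Injury'
)

X = toy.drop(columns='Accident_severity')
y = toy['Accident_severity']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scale the numeric column and one-hot encode the nominal one.
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['Number_of_casualties']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Day_of_week']),
])

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'SVM': SVC(random_state=42),
    'K-Neighbors': KNeighborsClassifier(n_neighbors=5),
}

fitted = {}
for name, model in models.items():
    # clone() gives every pipeline an independent, unfitted copy of the preprocessor.
    pipe = Pipeline([('prep', clone(preprocessor)), ('model', model)])
    fitted[name] = pipe.fit(X_train, y_train)
    print(name, round(fitted[name].score(X_test, y_test), 3))
```

Keeping the preprocessing inside each pipeline means it is fitted on the training split only, which avoids leaking test-set statistics into the models.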

#### Logistic Regression

#### Decision Tree Classifier

#### Support Vector Machine (SVM)

#### K-Neighbors Classifier

## Model Evaluation

After training your models, it's crucial to evaluate their performance to understand their effectiveness and limitations. This section outlines various techniques and metrics to assess the performance of each model you have implemented.

### Evaluation Techniques
1. **Confusion Matrix**

2. **Accuracy**

3. **Precision and Recall**

4. **F1 Score**

5. **ROC Curve and AUC**

### Implementing Evaluation
- Calculate the metrics listed above using your test data.
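
With more than two severity classes, `precision_score`, `recall_score` and `f1_score` need an explicit `average` argument, since their default (`'binary'`) only applies to two-class problems. The sketch below uses hand-written labels rather than real model output; the three class names, including `'Fatal injury'`, are assumptions about the label set.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true and predicted labels for eight accidents.
y_true = ['Slight Injury', 'Slight Injury', 'Slight Injury', 'Slight Injury',
          'Serious Injury', 'Serious Injury', 'Fatal injury', 'Slight Injury']
y_pred = ['Slight Injury', 'Slight Injury', 'Slight Injury', 'Serious Injury',
          'Serious Injury', 'Slight Injury', 'Slight Injury', 'Slight Injury']

# Fixing the label order makes the rows and columns of the matrix easy to read.
labels = ['Fatal injury', 'Serious Injury', 'Slight Injury']
cm = confusion_matrix(y_true, y_pred, labels=labels)

accuracy = accuracy_score(y_true, y_pred)
# average='macro' weights every class equally, so rare classes count as much as common ones.
# zero_division=0 avoids a warning for classes that are never predicted.
precision = precision_score(y_true, y_pred, average='macro', zero_division=0)
recall = recall_score(y_true, y_pred, average='macro', zero_division=0)
f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)

print(cm)
print(accuracy, precision, recall, f1)
```

In this toy case accuracy is 0.625 while macro recall is lower, because the single fatal case is missed entirely; that gap is the kind of effect accuracy alone tends to hide on imbalanced data. For ROC/AUC in the multiclass setting, `roc_auc_score` accepts `multi_class='ovr'` but needs predicted probabilities rather than hard labels.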

## Project Questions:

### Comparative Analysis

- **Compare Metrics**: Examine the performance metrics (such as accuracy, precision, and recall) of each model. Document your observations on which model performs best for your dataset and the problem you're addressing.
- **Evaluate Trade-offs**: Discuss the trade-offs you encountered when choosing between models. Consider factors like computational efficiency, ease of implementation, and model interpretability.
- **Justify Your Choice**: After comparing and evaluating, explain why you believe one model is the best choice. Provide a clear rationale based on the performance metrics and trade-offs discussed.
- **Feature Importance**: Identify and discuss the most important features for the best-performing model. How do these features impact the predictions? Use the visualizations you have created to justify your answer if necessary.
- **Model Limitations**: Discuss any limitations you encountered with the models you used. Are there any aspects of the data or the problem that these models do not handle well?
- **Future Improvements**: Suggest potential improvements or further steps you could take to enhance model performance. This could include trying different algorithms, feature engineering techniques, or tuning hyperparameters.

### Answer Here:

In [None]:
#making pipelines
num_cols = ['Number_of_vehicles_involved','Number_of_casualties']
cat_cols = df.drop(num_cols, axis=1).columns
cat_ordinal = ['Age_band_of_driver', 'Educational_level', 'Driving_experience', 'Service_year_of_vehicle', 'Age_band_of_casualty', 'Casualty_severity', 'Accident_severity']
cat_nominal = cat_cols.drop(cat_ordinal)

num_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())      #Normalizing/Scaling
])
ordinal_transformer = Pipeline(steps=[
    ('Ordinal', OrdinalEncoder())     # LabelEncoder is meant for target labels and does not work inside a Pipeline, so OrdinalEncoder is used instead
])
nominal_transformer = Pipeline(steps=[
    ('OneHot', OneHotEncoder(handle_unknown='ignore'))  #OneHotEncoder
])

col_transformer = ColumnTransformer(transformers=[
    ('num', num_transformer, num_cols),
    ('ordinal', ordinal_transformer, cat_ordinal),
    ('nominal', nominal_transformer, cat_nominal)
])

col_transformer.fit(df)
