# Weekly Project 2!

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Introduction to Road Traffic Accidents (RTA) Dataset

### Dataset Overview
The RTA Dataset provides a detailed snapshot of road traffic accidents, capturing a range of data from accident conditions to casualty details. This dataset is essential for analyzing patterns and causes of accidents to improve road safety.

### Data Characteristics
- **Entries**: The dataset contains 12,316 entries.
- **Features**: There are 32 features in the dataset, which include:
  - `Time`: Time when the accident occurred.
  - `Day_of_week`: Day of the week.
  - `Age_band_of_driver`: Age group of the driver involved.
  - `Sex_of_driver`: Gender of the driver.
  - `Educational_level`: Educational level of the driver.
  - `Type_of_vehicle`: Type of vehicle involved in the accident.
  - `Cause_of_accident`: Reported cause of the accident.
  - `Accident_severity`: Severity of the accident.
- **Target Column**: `Accident_severity` is used as the target column for modeling. This feature classifies the severity of each accident.

### Objective
Students will use this dataset to apply various data visualization, modeling, and evaluation techniques learned in class. The primary goal is to build models that can accurately predict the severity of accidents and to identify the key factors that contribute to severe accidents.

## Import Libraries
Import all the necessary libraries here. Include libraries for handling data (like pandas), visualization (like matplotlib and seaborn), and modeling (like scikit-learn).

In [4]:
import pandas as pd

## Load Data
Load the dataset from the provided CSV file into a DataFrame.

In [6]:
df = pd.read_csv('/content/drive/MyDrive/RTA_Dataset.csv')

## Exploratory Data Analysis (EDA)
Perform EDA to understand the data better. This involves several steps to summarize the main characteristics, uncover patterns, and establish relationships:
* Find the dataset information and observe the datatypes.
* Check the shape of the data to understand its structure.
* View the data with various functions to get an initial sense of it.
* Perform summary statistics on the dataset to grasp central tendencies and variability.
* Check for duplicated data.
* Check for null values.

And apply more if needed!




In [7]:
# Find the dataset information and observe the datatypes
df.dtypes

Unnamed: 0,0
Time,object
Day_of_week,object
Age_band_of_driver,object
Sex_of_driver,object
Educational_level,object
Vehicle_driver_relation,object
Driving_experience,object
Type_of_vehicle,object
Owner_of_vehicle,object
Service_year_of_vehicle,object


In [8]:
# Check the shape of the data to understand its structure
df.shape

(12316, 32)

In [9]:
# View the data with various functions to get an initial sense of it
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12316 entries, 0 to 12315
Data columns (total 32 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Time                         12316 non-null  object
 1   Day_of_week                  12316 non-null  object
 2   Age_band_of_driver           12316 non-null  object
 3   Sex_of_driver                12316 non-null  object
 4   Educational_level            11575 non-null  object
 5   Vehicle_driver_relation      11737 non-null  object
 6   Driving_experience           11487 non-null  object
 7   Type_of_vehicle              11366 non-null  object
 8   Owner_of_vehicle             11834 non-null  object
 9   Service_year_of_vehicle      8388 non-null   object
 10  Defect_of_vehicle            7889 non-null   object
 11  Area_accident_occured        12077 non-null  object
 12  Lanes_or_Medians             11931 non-null  object
 13  Road_allignment              12

In [22]:
# Count the non-null values in each column
df.count()

Unnamed: 0,0
Time,12316
Day_of_week,12316
Age_band_of_driver,12316
Sex_of_driver,12316
Educational_level,11575
Vehicle_driver_relation,11737
Driving_experience,11487
Type_of_vehicle,11366
Owner_of_vehicle,11834
Service_year_of_vehicle,8388


In [26]:
# Perform summary statistics on the dataset to grasp central tendencies and variability
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Number_of_vehicles_involved | 12316.0 | 2.040679 | 0.68879 | 1.0 | 2.0 | 2.0 | 2.0 | 7.0 |
| Number_of_casualties | 12316.0 | 1.548149 | 1.007179 | 1.0 | 1.0 | 1.0 | 2.0 | 8.0 |


In [40]:
# Check for duplicated data
df.duplicated().sum()

0

In [44]:
# Check for null values
df.isnull().sum()

Unnamed: 0,0
Time,0
Day_of_week,0
Age_band_of_driver,0
Sex_of_driver,0
Educational_level,741
Vehicle_driver_relation,579
Driving_experience,829
Type_of_vehicle,950
Owner_of_vehicle,482
Service_year_of_vehicle,3928


In [55]:
df.sample(15)

Unnamed: 0,Time,Day_of_week,Age_band_of_driver,Sex_of_driver,Educational_level,Vehicle_driver_relation,Driving_experience,Type_of_vehicle,Owner_of_vehicle,Service_year_of_vehicle,...,Vehicle_movement,Casualty_class,Sex_of_casualty,Age_band_of_casualty,Casualty_severity,Work_of_casuality,Fitness_of_casuality,Pedestrian_movement,Cause_of_accident,Accident_severity
8140,12:49:00,Sunday,Over 51,Male,High school,Employee,1-2yr,,Owner,,...,Going straight,Passenger,Male,31-50,3,Driver,Normal,Not a Pedestrian,No priority to vehicle,Slight Injury
9694,15:50:00,Monday,18-30,Male,Junior high school,Employee,Above 10yr,Public (13?45 seats),Owner,,...,Going straight,Driver or rider,Female,31-50,3,Self-employed,Normal,Not a Pedestrian,No priority to vehicle,Slight Injury
4724,22:35:00,Monday,18-30,Male,Junior high school,Employee,5-10yr,Public (13?45 seats),Owner,Above 10yr,...,Entering a junction,Pedestrian,Male,31-50,3,Driver,Normal,Crossing from nearside - masked by parked or s...,No distancing,Slight Injury
11529,15:31:00,Monday,Over 51,Male,Junior high school,Employee,2-5yr,Public (12 seats),Owner,Unknown,...,Going straight,Driver or rider,Female,31-50,3,Driver,Normal,Not a Pedestrian,Moving Backward,Slight Injury
10445,8:55:00,Thursday,Over 51,Male,,,,Long lorry,Owner,,...,Other,na,na,na,na,,,Not a Pedestrian,Changing lane to the left,Slight Injury
3438,7:45:00,Wednesday,Unknown,Male,High school,Owner,5-10yr,Pick up upto 10Q,Owner,Unknown,...,Going straight,na,na,na,na,Driver,,Not a Pedestrian,No priority to pedestrian,Slight Injury
5102,19:00:00,Monday,18-30,Male,Junior high school,Employee,2-5yr,Taxi,Owner,5-10yrs,...,Going straight,Passenger,Female,18-30,3,Self-employed,Normal,Not a Pedestrian,No distancing,Slight Injury
12200,12:00:00,Saturday,Over 51,Male,Elementary school,Owner,2-5yr,Lorry (41?100Q),Owner,,...,Going straight,Driver or rider,Female,18-30,3,Driver,Normal,Not a Pedestrian,Changing lane to the right,Serious Injury
8282,13:05:00,Sunday,Over 51,Male,Junior high school,Employee,1-2yr,Lorry (41?100Q),Owner,2-5yrs,...,Going straight,Driver or rider,Male,Under 18,3,,,Not a Pedestrian,No priority to pedestrian,Slight Injury
3281,10:37:00,Thursday,31-50,Male,Elementary school,Employee,5-10yr,Automobile,Organization,2-5yrs,...,Moving Backward,Driver or rider,Male,18-30,3,Self-employed,Normal,Not a Pedestrian,Driving carelessly,Slight Injury


In [122]:
# The dataset contains null values, so fill them column by column.

# Categorical columns: fill with the mode (the most frequent value).
# Assigning the result back avoids chained `inplace=True` calls, which recent pandas versions warn about.
mode_cols = [
    'Educational_level', 'Vehicle_driver_relation', 'Type_of_vehicle',
    'Owner_of_vehicle', 'Defect_of_vehicle', 'Area_accident_occured',
    'Lanes_or_Medians', 'Road_allignment', 'Types_of_Junction',
    'Road_surface_type', 'Type_of_collision', 'Vehicle_movement',
    'Work_of_casuality', 'Fitness_of_casuality',
]
for col in mode_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Driving_experience: it might be better to convert the ranges to numeric (float) values
# and fill with the mean; the mode is used for now.
df['Driving_experience'] = df['Driving_experience'].fillna(df['Driving_experience'].mode()[0])

# Service_year_of_vehicle: filling with the existing "Unknown" category may be a better choice;
# the mode is used for now.
df['Service_year_of_vehicle'] = df['Service_year_of_vehicle'].fillna(df['Service_year_of_vehicle'].mode()[0])
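
The comment about `Driving_experience` suggests converting the ranges to numbers and filling with the mean. A minimal sketch of that idea follows, on a toy Series rather than the real column. The labels are the ones visible in the sample rows above; the midpoint values in `years_map` are assumptions for illustration, and the value chosen for `Above 10yr` is arbitrary because that range has no upper bound.

```python
import pandas as pd

# Toy stand-in for df['Driving_experience'] (labels taken from the sample rows above)
experience = pd.Series(['1-2yr', 'Above 10yr', '5-10yr', '2-5yr', None, '5-10yr'])

# Hypothetical midpoints in years; 12.0 for 'Above 10yr' is an arbitrary assumption
years_map = {'1-2yr': 1.5, '2-5yr': 3.5, '5-10yr': 7.5, 'Above 10yr': 12.0}

# Labels missing from the map (and existing nulls) become NaN
experience_years = experience.map(years_map)

# Fill the remaining gaps with the mean of the mapped values
experience_years = experience_years.fillna(experience_years.mean())
print(experience_years)
```

Any label in the real column that is not in the map would also become NaN and be filled with the mean, so it is worth checking `df['Driving_experience'].unique()` before applying this.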

In [135]:
# Re-check the null counts after filling
df.isnull().sum()

Unnamed: 0,0
Time,0
Day_of_week,0
Age_band_of_driver,0
Sex_of_driver,0
Educational_level,0
Vehicle_driver_relation,0
Driving_experience,0
Type_of_vehicle,0
Owner_of_vehicle,0
Service_year_of_vehicle,0


## Data Preprocessing
Data preprocessing is essential for transforming raw data into a format suitable for further analysis and modeling. Follow these steps to ensure your data is ready for predictive modeling or advanced analytics:
- **Handling Missing Values**: Replace missing values with appropriate statistics (mean, median, mode) or use more complex imputation techniques.
- **Normalization/Scaling**: Scale data to a small, specified range like 0 to 1, or transform it to have a mean of zero and a standard deviation of one.
- **Label Encoding**: Convert categorical text data into model-understandable numbers where the labels are ordered.
- **One-Hot Encoding**: Use for nominal categorical data where no ordinal relationship exists to transform the data into a binary column for each category. (Be careful not to increase the dimensionality significantly)
- **Detection and Treatment of Outliers**: Use statistical tests, box plots, or scatter plots to identify outliers and then cap, trim, or use robust methods to reduce the effect of outliers, depending on the context.
- **Feature Engineering**: Enhance your dataset by creating new features and transforming existing ones. This might involve combining data from different columns, applying transformations, or reducing dimensionality with techniques like PCA to improve model performance.

Consider these steps as a foundation, and feel free to introduce additional preprocessing techniques as needed to address specific characteristics of your dataset.

In [None]:
# Handling Missing Values: Replace missing values with appropriate statistics (mean, median, mode) or use more complex imputation techniques.

# Already handled above in the EDA section (mode imputation).

In [None]:
# Normalization/Scaling:
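
One possible way to fill this cell, sketched on a toy frame instead of `df`: the two numeric columns reported by `df.describe()` are scaled with `MinMaxScaler` (range 0 to 1) and with `StandardScaler` (mean 0, standard deviation 1). The toy values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

numeric_cols = ['Number_of_vehicles_involved', 'Number_of_casualties']

# Toy stand-in for df[numeric_cols]; the values are illustrative only
toy = pd.DataFrame({
    'Number_of_vehicles_involved': [1, 2, 2, 7],
    'Number_of_casualties': [1, 1, 2, 8],
})

# Min-max scaling: each column is mapped to the range [0, 1]
minmax_scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy[numeric_cols]), columns=numeric_cols
)

# Standardization: each column gets mean 0 and standard deviation 1
standard_scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy[numeric_cols]), columns=numeric_cols
)

print(minmax_scaled)
print(standard_scaled)
```

In a real workflow the scaler is usually fitted on the training split only and then applied to the test split, so that test statistics do not leak into training.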




In [None]:
# Label Encoding:
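
A hedged sketch for this cell, again on toy data. Ordered categories such as the driver age bands can be mapped to integer codes with an explicit order, while the target can be encoded with `LabelEncoder`. The order in `age_order` is an assumption based on the labels seen in the sample rows; values outside that list (for example `Unknown`) receive the code -1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for df['Age_band_of_driver']
age = pd.Series(['Over 51', '18-30', '31-50', 'Under 18', 'Unknown'])

# Assumed ordering of the age bands; 'Unknown' is deliberately left out
age_order = ['Under 18', '18-30', '31-50', 'Over 51']
age_codes = pd.Categorical(age, categories=age_order, ordered=True).codes

# Toy stand-in for the target column df['Accident_severity']
severity = pd.Series(['Slight Injury', 'Serious Injury', 'Slight Injury'])
encoder = LabelEncoder()
severity_codes = encoder.fit_transform(severity)  # classes are sorted alphabetically

print(list(age_codes))
print(list(encoder.classes_), list(severity_codes))
```

`LabelEncoder` assigns codes in alphabetical order, which does not necessarily reflect severity, so an explicit mapping may be preferable when the order matters.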




In [None]:
# One-Hot Encoding:
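
For nominal columns such as `Day_of_week` and `Sex_of_driver`, `pd.get_dummies` is one straightforward option. The sketch below uses a small toy frame; on the real `df`, columns with many distinct values (such as `Time`) would produce a very large number of dummy columns, so they are better handled separately.

```python
import pandas as pd

# Toy stand-in for a few columns of df
toy = pd.DataFrame({
    'Day_of_week': ['Sunday', 'Monday', 'Monday'],
    'Sex_of_driver': ['Male', 'Male', 'Female'],
    'Number_of_casualties': [1, 2, 1],
})

# One binary column per category; numeric columns are left untouched
encoded = pd.get_dummies(toy, columns=['Day_of_week', 'Sex_of_driver'], dtype=int)
print(encoded)
```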



In [None]:
# Detection and Treatment of Outliers: Use statistical tests, box plots, or scatter plots to identify outliers and then cap, trim, or use robust methods to reduce the effect of outliers, depending on the context.
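
A minimal sketch of the IQR rule with capping, shown on a toy Series standing in for `df['Number_of_casualties']`. The 1.5 multiplier is the conventional choice, not something prescribed by the assignment, and whether capping genuine casualty counts is appropriate depends on the modeling goal.

```python
import pandas as pd

# Toy stand-in for df['Number_of_casualties']
casualties = pd.Series([1, 1, 1, 2, 2, 3, 8])

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = casualties.quantile(0.25), casualties.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences, then cap them at the fences
outlier_mask = (casualties < lower) | (casualties > upper)
capped = casualties.clip(lower=lower, upper=upper)

print(outlier_mask.sum(), 'outlier(s) found')
print(capped.tolist())
```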




In [None]:
# Feature Engineering: Enhance your dataset by creating new features and transforming existing ones. This might involve combining data from different columns, applying transformations, or reducing dimensionality with techniques like PCA to improve model performance.
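
One feature that could be derived here is the hour of the accident, extracted from the `Time` column, plus a night-time flag. The sketch below works on a toy Series in the `H:MM:SS` format seen in the sample rows; the night window (20:00 to 05:59) is an assumption chosen for illustration.

```python
import pandas as pd

# Toy stand-in for df['Time']
time = pd.Series(['12:49:00', '22:35:00', '8:55:00'])

# Take the part before the first colon as the hour
hour = time.str.split(':').str[0].astype(int)

# Hypothetical definition of night: 20:00 up to (but not including) 06:00
is_night = (hour >= 20) | (hour < 6)

print(hour.tolist())
print(is_night.tolist())
```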




## Data Visualization
Create various plots to visualize the relationships in the data. Consider using the following to show different aspects of the data:

* Heatmap of Correlation Matrix.
* Line plots.
* Scatter plots.
* Histograms.
* Boxplots.

Use more if needed!
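
As an illustration of the first item in the list, here is a sketch of a correlation heatmap drawn with matplotlib only. It uses a toy frame with the two numeric column names from `df.describe()`; on the real data the same code could be applied to `df[numeric_cols]`. Note that with only two numeric columns the heatmap is a 2x2 grid, so it becomes more informative once categorical columns have been encoded.

```python
import pandas as pd
import matplotlib.pyplot as plt

numeric_cols = ['Number_of_vehicles_involved', 'Number_of_casualties']

# Toy stand-in for df[numeric_cols]; the values are illustrative only
toy = pd.DataFrame({
    'Number_of_vehicles_involved': [1, 2, 2, 3, 7],
    'Number_of_casualties': [1, 1, 2, 2, 8],
})

# Pearson correlation between the numeric columns
corr = toy[numeric_cols].corr()

fig, ax = plt.subplots(figsize=(5, 4))
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap='coolwarm')
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha='right')
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax)
ax.set_title('Correlation matrix')
fig.tight_layout()
plt.show()
```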

In [144]:
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot of casualties against vehicles involved, colored by accident severity
plt.figure(figsize=(8, 4))
sns.scatterplot(data=df, x='Number_of_vehicles_involved', y='Number_of_casualties', hue='Accident_severity')
plt.title('Accident Severity')
plt.xlabel('Number of vehicles involved')
plt.ylabel('Number of casualties')
plt.legend()

plt.show()

## Feature Selection
- Choose features that you believe will most influence the outcome based on your analysis and the insights from your visualizations. Focus on those that appear most impactful to include in your modeling.

## Train-Test Split
* Divide the dataset into training and testing sets to evaluate the performance of your models.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


In [None]:
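# Features: the 31 columns that remain after dropping the target
# Target column: Accident_severity
# Note: most feature columns are still strings and must be encoded
# (see Data Preprocessing) before fitting scikit-learn models.

X = df.drop(['Accident_severity'], axis=1)
Y = df['Accident_severity']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)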
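
When the target classes are imbalanced, passing `stratify` to `train_test_split` keeps the class proportions similar in both splits. The sketch below demonstrates this on toy data; the 8-to-2 class ratio and the 50% test size are made up so the effect is easy to see.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and an imbalanced toy target (8 slight, 2 serious)
toy_X = pd.DataFrame({'Number_of_casualties': [1, 1, 2, 1, 3, 1, 2, 1, 4, 1]})
toy_y = pd.Series(['Slight Injury'] * 8 + ['Serious Injury'] * 2)

# stratify=toy_y preserves the 8:2 ratio in both halves
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.5, random_state=42, stratify=toy_y
)

print(toy_y_train.value_counts())
print(toy_y_test.value_counts())
```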

## Modeling

Once the data is split into training and testing sets, the next step is to build models to make predictions. Here, we will explore several machine learning algorithms, each with its unique characteristics and suitability for different types of data and problems. You will implement the following models:

### 1. Logistic Regression

### 2. Decision Tree Classifier

### 3. Support Vector Machine (SVM)

### 4. K-Neighbors Classifier

### Implementing the Models
- For each model, use the training data you have prepared to train the model.
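
The four models listed above can be trained in one loop. The sketch below is one possible arrangement, not the required solution: it wraps each estimator in a `Pipeline` with a `OneHotEncoder` so that string columns are encoded before fitting. The toy data, the `max_iter` and `n_neighbors` settings, and the variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins for x_train / y_train
toy_X = pd.DataFrame({
    'Day_of_week': ['Sunday', 'Monday', 'Monday', 'Thursday',
                    'Saturday', 'Sunday', 'Monday', 'Thursday'],
    'Number_of_casualties': [1, 2, 1, 3, 1, 2, 1, 4],
})
toy_y = pd.Series(['Slight Injury', 'Slight Injury', 'Slight Injury', 'Serious Injury',
                   'Slight Injury', 'Slight Injury', 'Slight Injury', 'Serious Injury'])

# One-hot encode the categorical column, pass numeric columns through unchanged
preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['Day_of_week'])],
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense array
)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'SVM': SVC(),
    'K-Neighbors': KNeighborsClassifier(n_neighbors=3),
}

# Fit one independent pipeline per model
fitted = {}
for name, estimator in models.items():
    pipeline = Pipeline([('preprocess', clone(preprocess)), ('model', estimator)])
    fitted[name] = pipeline.fit(toy_X, toy_y)
    print(name, 'training accuracy:', fitted[name].score(toy_X, toy_y))
```

Keeping the encoder inside the pipeline means it is fitted on the training data only, and `handle_unknown='ignore'` prevents errors when the test split contains a category that was not seen during training.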

#### Logistic Regression

In [None]:
model = LogisticRegression()

model.fit(x_train, y_train)
prd = model.predict(x_test)

#### Decision Tree Classifier

In [None]:
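from sklearn.tree import DecisionTreeClassifier

# Decision tree with a fixed seed for reproducible results
model = DecisionTreeClassifier(random_state=42)

model.fit(x_train, y_train)
prd = model.predict(x_test)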

#### Support Vector Machine (SVM)

In [None]:
from sklearn.svm import SVC

# Support vector classifier with default settings
model = SVC()

model.fit(x_train, y_train)
prd = model.predict(x_test)

#### K-Neighbors Classifier

In [None]:
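from sklearn.neighbors import KNeighborsClassifier

# K-nearest neighbors classifier with default settings
model = KNeighborsClassifier()

model.fit(x_train, y_train)
prd = model.predict(x_test)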

## Model Evaluation

After training your models, it's crucial to evaluate their performance to understand their effectiveness and limitations. This section outlines various techniques and metrics to assess the performance of each model you have implemented.

### Evaluation Techniques
1. **Confusion Matrix**

2. **Accuracy**

3. **Precision and Recall**

4. **F1 Score**

5. **ROC Curve and AUC**

### Implementing Evaluation
- Calculate the metrics listed above using your test data.
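
The metrics listed above can be computed with `sklearn.metrics`. The sketch below uses small hand-made label lists in place of `y_test` and `prd`, and made-up scores in place of predicted probabilities, so every number in it is illustrative. Macro averaging is used here as one reasonable choice for a multi-class target; other averaging modes exist.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

labels = ['Serious Injury', 'Slight Injury']

# Toy stand-ins for y_test and the model predictions
y_true = ['Slight Injury', 'Slight Injury', 'Slight Injury',
          'Serious Injury', 'Serious Injury', 'Slight Injury']
y_pred = ['Slight Injury', 'Slight Injury', 'Serious Injury',
          'Serious Injury', 'Slight Injury', 'Slight Injury']

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=labels)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='macro')
recall = recall_score(y_true, y_pred, average='macro')
f1 = f1_score(y_true, y_pred, average='macro')

# ROC AUC needs scores rather than hard labels; these scores are hypothetical
is_serious = [1 if label == 'Serious Injury' else 0 for label in y_true]
serious_scores = [0.1, 0.2, 0.6, 0.9, 0.4, 0.3]
auc = roc_auc_score(is_serious, serious_scores)

print(cm)
print(f'accuracy={accuracy:.3f} precision={precision:.3f} '
      f'recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}')
```

With a real model, the scores would typically come from `predict_proba` (or `decision_function` for an SVM), and a multi-class AUC would need the `multi_class` argument of `roc_auc_score`.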

## Project Questions:

### Comparative Analysis

- **Compare Metrics**: Examine the performance metrics (such as accuracy, precision, and recall) of each model. Document your observations on which model performs best for your dataset and the problem you're addressing.
- **Evaluate Trade-offs**: Discuss the trade-offs you encountered when choosing between models. Consider factors like computational efficiency, ease of implementation, and model interpretability.
- **Justify Your Choice**: After comparing and evaluating, explain why you believe one model is the best choice. Provide a clear rationale based on the performance metrics and trade-offs discussed.
- **Feature Importance**: Identify and discuss the most important features for the best-performing model. How do these features impact the predictions? Use the visualizations you have created to justify your answer if necessary.
- **Model Limitations**: Discuss any limitations you encountered with the models you used. Are there any aspects of the data or the problem that these models do not handle well?
- **Future Improvements**: Suggest potential improvements or further steps you could take to enhance model performance. This could include trying different algorithms, feature engineering techniques, or tuning hyperparameters.

### Answer Here: