# Capstone Project - Predicting Traffic Accident Severity
            
### Applied Data Science Capstone by Makram B. Amor

## Steps
1. [Introduction](#introduction)
2. [Data](#data)
3. [Data Cleaning](#clean)
4. [EDA](#eda)
5. [Data Preparation](#prep)
6. [Model Development](#modeling)
7. [Results](#results)

# 1. Introduction <a id='introduction'></a>
* Every year, car accidents cause more than a million deaths worldwide.
    * According to research conducted by the World Health Organization (WHO), there were 1.35 million road traffic deaths globally in 2016, with millions more people sustaining serious injuries and living with long-term adverse health consequences.
* Road traffic crashes are a leading cause of death among young people, and the main cause of death among those aged 15-29 years.
* Road traffic injuries are currently estimated to be the eighth leading cause of death across all age groups globally and are predicted to become the seventh leading cause of death by 2030 [1].
    
* With the tools and information available today, an extensive analysis that predicts traffic accidents and their severity could help reduce the death toll. By analysing a wide range of factors, including weather conditions, locality, type of road and lighting, among others, the severity of an accident can be predicted.
* Trends that commonly lead to severe traffic incidents can help identify highly severe accidents. Emergency services could use this information to send exactly the staff and equipment required to the scene, leaving more resources available for accidents occurring at the same time. Moreover, nearby hospitals could be warned of a severe accident in advance, so that the equipment for a serious intervention is ready.


    

### Who would be interested in this study?

* Governments
* Local authorities
* Private companies investing in technologies that improve overall driver safety
* Insurance companies
* Car manufacturers / sellers

# 2. Data <a id='data'></a>
    
The original data for this project comes from the Canadian government's open data portal, [https://open.canada.ca/](https://opendatatc.blob.core.windows.net/opendatatc/NCDB_1999_to_2017.csv).
The features of the resulting dataset are the following:

In the *characteristics* dataset, I will keep the features: "lighting", "localisation"(agg), "type of intersection", "atmospheric conditions", "type of collisions", "department", "address", "time" and the coordinates. I added two new features derived from this original dataset: "date", and "weekend", which indicates whether the accident occurred during the weekend.
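
The "date" and "weekend" features described above could be derived along the following lines. This is only a minimal sketch on a toy frame: the column names `year`, `month` and `day` are hypothetical stand-ins, not the names used in the original dataset.

```python
import pandas as pd

# Toy stand-in for the characteristics table; column names are hypothetical.
chars = pd.DataFrame({
    "year": [2020, 2020, 2020],
    "month": [9, 9, 9],
    "day": [11, 12, 14],  # a Friday, a Saturday and a Monday
})

# Assemble a single datetime column from the separate date parts.
chars["date"] = pd.to_datetime(chars[["year", "month", "day"]])

# dayofweek runs from Monday=0 to Sunday=6, so 5 and 6 mark the weekend.
chars["weekend"] = (chars["date"].dt.dayofweek >= 5).astype(int)

print(chars[["date", "weekend"]])
```

Encoding the flag as 0/1 rather than a boolean keeps it directly usable as a numeric model input.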

In the *places* dataset, I will keep only the features: "road category", "traffic regime", "number of traffic lanes", "road profile", "road shape", "surface condition", "situation", "school nearby" and "infrastructure".

From the *users* dataset, I have created the following features: 
+ num_us: total number of users involved in the accident.
+ ped: whether any pedestrians are involved.
+ critic_age: whether any user is between 17 and 31 years old.
+ sev: maximum severity suffered by any user involved in the accident:
    + 0 = Unscathed or light injury
    + 1 = Hospitalized or death
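
The per-accident features listed above amount to a group-by aggregation over the users table. The sketch below illustrates the idea on toy data. All column names (`acc_id`, `is_ped`, `age`, `grav`) and the 1-4 severity coding are assumptions made for illustration, and the age range is treated as inclusive at both ends.

```python
import pandas as pd

# Toy stand-in for the users table; every column name here is hypothetical.
# Assumed coding of grav: 1 = unscathed, 2 = light injury,
#                         3 = hospitalized, 4 = death.
users = pd.DataFrame({
    "acc_id": [1, 1, 2, 2, 2],
    "is_ped": [0, 0, 0, 1, 0],
    "age":    [45, 52, 23, 60, 38],
    "grav":   [1, 2, 1, 4, 2],
})

per_accident = users.groupby("acc_id").agg(
    # number of users involved in the accident
    num_us=("age", "size"),
    # 1 if at least one user is a pedestrian
    ped=("is_ped", "max"),
    # 1 if any user is aged 17 to 31 (inclusive, by assumption)
    critic_age=("age", lambda a: int(a.between(17, 31).any())),
    # 1 if the worst outcome is hospitalization or death
    sev=("grav", lambda g: int((g >= 3).any())),
)

print(per_accident)
```

Collapsing the severity to a binary target at this stage means the later model only has to separate light outcomes from severe ones.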

I used the *holiday* dataset to craft a new feature indicating whether the accident occurred during a holiday.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt

import datetime as dt

import warnings
warnings.filterwarnings('ignore')

In [2]:
!wget -O Data-Collisions.csv https://opendatatc.blob.core.windows.net/opendatatc/NCDB_1999_to_2017.csv

--2020-09-12 15:49:49--  https://opendatatc.blob.core.windows.net/opendatatc/NCDB_1999_to_2017.csv
Resolving opendatatc.blob.core.windows.net (opendatatc.blob.core.windows.net)... 40.86.232.64
Connecting to opendatatc.blob.core.windows.net (opendatatc.blob.core.windows.net)|40.86.232.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 535032631 (510M) [application/vnd.ms-excel]
Saving to: ‘Data-Collisions.csv’


2020-09-12 15:50:32 (12.1 MB/s) - ‘Data-Collisions.csv’ saved [535032631/535032631]



In [3]:
df = pd.read_csv("Data-Collisions.csv")

In [4]:
df.head()

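|  | C_YEAR | C_MNTH | C_WDAY | C_HOUR | C_SEV | C_VEHS | C_CONF | C_RCFG | C_WTHR | C_RSUR | ... | V_TYPE | V_YEAR | P_ID | P_SEX | P_AGE | P_PSN | P_ISEV | P_SAFE | P_USER | C_CASE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1999 | 1 | 1 | 20 | 2 | 2 | 34 | UU | 1 | 5 | ... | 06 | 1990 | 1 | M | 41 | 11 | 1 | UU | 1 | 752 |
| 1 | 1999 | 1 | 1 | 20 | 2 | 2 | 34 | UU | 1 | 5 | ... | 01 | 1987 | 1 | M | 19 | 11 | 1 | UU | 1 | 752 |
| 2 | 1999 | 1 | 1 | 20 | 2 | 2 | 34 | UU | 1 | 5 | ... | 01 | 1987 | 2 | F | 20 | 13 | 2 | 02 | 2 | 752 |
| 3 | 1999 | 1 | 1 | 8 | 2 | 1 | 1 | UU | 5 | 3 | ... | 01 | 1986 | 1 | M | 46 | 11 | 1 | UU | 1 | 753 |
| 4 | 1999 | 1 | 1 | 8 | 2 | 1 | 1 | UU | 5 | 3 | ... | NN | NNNN | 1 | M | 5 | 99 | 2 | UU | 3 | 753 |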


In [5]:
df.shape

(6772563, 23)

# 3. Data Cleaning <a id='clean'></a>
    
Before any ML algorithm is run, the data must be preprocessed. In this step the data are cleaned so that no missing or unusual values remain. The goal is to give the algorithms the best possible input.

### Missing values

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6772563 entries, 0 to 6772562
Data columns (total 23 columns):
C_YEAR    int64
C_MNTH    object
C_WDAY    object
C_HOUR    object
C_SEV     int64
C_VEHS    object
C_CONF    object
C_RCFG    object
C_WTHR    object
C_RSUR    object
C_RALN    object
C_TRAF    object
V_ID      object
V_TYPE    object
V_YEAR    object
P_ID      object
P_SEX     object
P_AGE     object
P_PSN     object
P_ISEV    object
P_SAFE    object
P_USER    object
C_CASE    int64
dtypes: int64(3), object(20)
memory usage: 1.2+ GB
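
Most columns load as `object` even though they hold numeric codes. A likely reason, suggested by the `UU`, `NN` and `NNNN` entries visible in the `df.head()` output, is that letter codes share those columns with digits. The sketch below lists the non-numeric tokens per column on a toy frame. The helper name `non_numeric_tokens` is hypothetical, and on the full 6.7-million-row frame it would probably be worth trying on a sample first.

```python
import pandas as pd

def non_numeric_tokens(frame):
    """Return {column: sorted unique values that do not parse as numbers}."""
    found = {}
    for col in frame.columns:
        as_text = frame[col].astype(str).str.strip()
        # Values that cannot be parsed become NaN instead of raising.
        parsed = pd.to_numeric(as_text, errors="coerce")
        tokens = sorted(as_text[parsed.isna()].unique())
        if tokens:
            found[col] = tokens
    return found

# Toy rows modelled on the df.head() output.
sample = pd.DataFrame({
    "C_RCFG": ["UU", "UU", "02"],
    "V_YEAR": ["1990", "NNNN", "1987"],
    "P_AGE": ["41", "19", "05"],
})

print(non_numeric_tokens(sample))
```

Genuinely categorical columns such as `P_SEX` would also show up in this listing, so the output needs to be read column by column rather than treated as a list of errors.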


In [7]:
print('Missing values in C_SEV:', df["C_SEV"].isna().sum(),'\n'
    'Missing values in sex:', df["P_SEX"].isna().sum(), '\n'
    'Missing values in age:', df["P_AGE"].isna().sum(),'\n'
    'Missing values in V_YEAR:', df["V_YEAR"].isna().sum())

Missing values in C_SEV: 0 
Missing values in sex: 0 
Missing values in age: 0 
Missing values in V_YEAR: 0
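
A count of zero here does not necessarily mean the data are complete: `isna()` only detects true `NaN` values, whereas the `df.head()` output contains entries such as `UU`, `NN` and `NNNN`. Assuming that values made up solely of repeated `U` or repeated `N` are placeholders for unknown or not-applicable data (an assumption that should be checked against the dataset's data dictionary), they could be counted and masked roughly as follows, shown here on a toy frame:

```python
import pandas as pd

# Assumption: a value consisting only of "U"s or only of "N"s
# (e.g. "UU", "NN", "NNNN") is a placeholder, not a real measurement.
PLACEHOLDER = r"U+|N+"

# Toy rows modelled on the df.head() output.
sample = pd.DataFrame({
    "P_SEX": ["M", "M", "F", "M"],
    "P_AGE": ["41", "19", "NN", "UU"],
    "V_YEAR": ["1990", "NNNN", "1987", "UUUU"],
})

# Boolean frame: True where the whole cell matches the placeholder pattern.
is_placeholder = sample.apply(
    lambda col: col.astype(str).str.fullmatch(PLACEHOLDER)
)
print(is_placeholder.sum())

# Turn the placeholders into real NaN so isna() can see them.
cleaned = sample.mask(is_placeholder)
print(cleaned.isna().sum())
```

Once the placeholders are real `NaN` values, the usual pandas tools for dropping or imputing missing data apply.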
