# Applied Data Science Capstone Project - Car accident severity (Week 2)

## The Business Problem 

According to the WHO, every year the lives of approximately 1.35 million people are cut short as a result of a road traffic crash. Between 20 and 50 million more people suffer non-fatal injuries, with many incurring a disability as a result of their injury. For children and young adults aged 5-29 years this is the leading cause of death.
Due to the severity of this situation, the 2030 Agenda for Sustainable Development set an ambitious target of halving the global number of deaths and injuries from road traffic crashes by 2020.
By developing an algorithm to predict the severity of an accident given the current weather, road, and visibility conditions, it becomes possible to alert drivers to hazardous conditions so that they can drive more carefully. Such warnings could, in turn, help reduce the frequency of car accidents.

## Data

The data, collected by the Seattle Police Department and the Accident Traffic Records Department from 2004 to the present, consists of 37 independent variables and 194,673 rows. The target variable, “SEVERITYCODE”, classifies the severity of an accident as:<br> 

0: Little to no Probability (Clear Conditions) <br>
1: Very Low Probability - Chance of Property Damage<br>
2: Low Probability - Chance of Injury<br>
3: Mild Probability - Chance of Serious Injury<br>
4: High Probability - Chance of Fatality
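
As a small illustration, the scale above can be kept in a lookup table so that numeric codes are translated into readable labels when reporting results. This is only a sketch: the names `SEVERITY_LABELS` and `describe_severity` are hypothetical and not part of the original notebook.

```python
# Labels copied from the severity scale listed above.
SEVERITY_LABELS = {
    0: "Little to no probability (clear conditions)",
    1: "Very low probability: chance of property damage",
    2: "Low probability: chance of injury",
    3: "Mild probability: chance of serious injury",
    4: "High probability: chance of fatality",
}


def describe_severity(code):
    """Return the label for a severity code, or 'Unknown' if it is not on the scale."""
    return SEVERITY_LABELS.get(code, "Unknown")


print(describe_severity(2))
```

With pandas, the same dictionary could be passed to `Series.map` to add a label column next to "SEVERITYCODE".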

Preparing the data takes two steps: <br>
<ul>
<li>Remove unnecessary columns</li>
<li>Balance the dataset (class 1 of the variable "SEVERITYCODE" has roughly 2.3 times as many rows as class 2)</li>
</ul>

In [41]:
#Importing the dataset and the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.utils import resample
data = pd.read_csv("Data-Collisions.csv")

In [42]:
# Inspect the columns to identify the unnecessary ones
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 38 columns):
SEVERITYCODE      194673 non-null int64
X                 189339 non-null float64
Y                 189339 non-null float64
OBJECTID          194673 non-null int64
INCKEY            194673 non-null int64
COLDETKEY         194673 non-null int64
REPORTNO          194673 non-null object
STATUS            194673 non-null object
ADDRTYPE          192747 non-null object
INTKEY            65070 non-null float64
LOCATION          191996 non-null object
EXCEPTRSNCODE     84811 non-null object
EXCEPTRSNDESC     5638 non-null object
SEVERITYCODE.1    194673 non-null int64
SEVERITYDESC      194673 non-null object
COLLISIONTYPE     189769 non-null object
PERSONCOUNT       194673 non-null int64
PEDCOUNT          194673 non-null int64
PEDCYLCOUNT       194673 non-null int64
VEHCOUNT          194673 non-null int64
INCDATE           194673 non-null object
INCDTTM           194673 non-null obje

In [43]:
#removing unnecessary columns
data.drop(columns=["HITPARKEDCAR","CROSSWALKKEY","SEGLANEKEY","ST_COLDESC","ST_COLCODE","SPEEDING","SDOTCOLNUM","PEDROWNOTGRNT","UNDERINFL","INATTENTIONIND","SDOT_COLCODE","SDOT_COLDESC","JUNCTIONTYPE","INCDTTM","INCDATE","VEHCOUNT","X","Y","OBJECTID","INCKEY","COLDETKEY","REPORTNO","STATUS","ADDRTYPE","INTKEY","LOCATION","EXCEPTRSNCODE","EXCEPTRSNDESC","SEVERITYCODE.1","SEVERITYDESC","COLLISIONTYPE","PERSONCOUNT","PEDCOUNT","PEDCYLCOUNT"], inplace=True)
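
Since only four columns survive the drop, the same result could arguably be reached by selecting the columns to keep rather than listing every column to remove. The sketch below shows the idea on a hypothetical miniature frame (`toy`, `keep`, and `subset` are illustrative names, not from the original notebook), assuming the four remaining columns are the ones intended for modelling.

```python
import pandas as pd

# Hypothetical miniature frame reusing a few of the original column names.
toy = pd.DataFrame({
    "SEVERITYCODE": [2, 1],
    "OBJECTID": [1, 2],
    "WEATHER": ["Overcast", "Raining"],
    "ROADCOND": ["Wet", "Wet"],
    "LIGHTCOND": ["Daylight", "Dark - Street Lights On"],
    "VEHCOUNT": [2, 2],
})

# Columns kept for modelling: the target plus the three condition features.
keep = ["SEVERITYCODE", "WEATHER", "ROADCOND", "LIGHTCOND"]

# Select by name and copy, so later edits do not touch the original frame.
subset = toy[keep].copy()
print(list(subset.columns))
```

Selecting by name also fails loudly with a `KeyError` if an expected column is missing from the file, which a long drop list would not reveal.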

In [44]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 4 columns):
SEVERITYCODE    194673 non-null int64
WEATHER         189592 non-null object
ROADCOND        189661 non-null object
LIGHTCOND       189503 non-null object
dtypes: int64(1), object(3)
memory usage: 5.9+ MB


In [45]:
#Checking the unbalanced dataset
data["SEVERITYCODE"].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64
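
To quantify the imbalance, the two counts above can be turned into a ratio and a class share. The snippet is a plain-Python sketch that hard-codes the counts from the output above instead of reading the CSV; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above.
counts = {1: 136485, 2: 58188}

total = sum(counts.values())
ratio = counts[1] / counts[2]      # majority rows per minority row
share_class_1 = counts[1] / total  # fraction of all rows in class 1

print(f"total rows: {total}")
print(f"class 1 / class 2 ratio: {ratio:.2f}")
print(f"share of class 1: {share_class_1:.1%}")
```

On the full DataFrame, `data["SEVERITYCODE"].value_counts(normalize=True)` should give the same shares directly.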

In [49]:
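# Balance the dataset by downsampling the majority class
data_max = data[data.SEVERITYCODE==1]  # majority class
data_min = data[data.SEVERITYCODE==2]  # minority class

# Draw as many majority rows, without replacement, as there are minority rows
data_maxsample = resample(data_max, replace=False, n_samples=len(data_min), random_state=123)

balanced_data = pd.concat([data_maxsample, data_min])
balanced_data.SEVERITYCODE.value_counts()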


2    58188
1    58188
Name: SEVERITYCODE, dtype: int64
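
The downsampling step above can also be written with pandas alone. The following is a minimal sketch on synthetic data, not the notebook's method: it swaps `sklearn.utils.resample` for `DataFrame.sample`, and the names `toy`, `n_min`, and `balanced_toy` are hypothetical. It also shuffles the result, on the assumption that a later train/test split should not see the classes stored in contiguous blocks.

```python
import pandas as pd

# Synthetic, deliberately imbalanced frame (6 rows of class 1, 3 of class 2).
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "WEATHER": ["Clear"] * 6 + ["Raining"] * 3,
})

# Size of the smallest class; every class is cut down to this many rows.
n_min = int(toy["SEVERITYCODE"].value_counts().min())

# Sample n_min rows without replacement from each class.
parts = [
    group.sample(n=n_min, random_state=123)
    for _, group in toy.groupby("SEVERITYCODE")
]

# Concatenate, then shuffle the rows and rebuild a clean 0..n-1 index.
balanced_toy = (
    pd.concat(parts)
    .sample(frac=1, random_state=123)
    .reset_index(drop=True)
)
print(balanced_toy["SEVERITYCODE"].value_counts().to_dict())
```

Because the loop iterates over whatever classes are present, the same sketch would still apply if more than two severity codes appeared in the data.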

In [50]:
# Preview the first rows after column removal (this shows data, not balanced_data)
data.head(5)

| | SEVERITYCODE | WEATHER | ROADCOND | LIGHTCOND |
|---|---|---|---|---|
| 0 | 2 | Overcast | Wet | Daylight |
| 1 | 1 | Raining | Wet | Dark - Street Lights On |
| 2 | 1 | Overcast | Dry | Daylight |
| 3 | 1 | Clear | Dry | Daylight |
| 4 | 2 | Raining | Wet | Daylight |
