# Predicting Traffic Accident Severity

## 1. Introduction/Business Problem

Road traffic injuries are currently estimated to be the eighth leading cause of death across all age groups globally, and are predicted to become the seventh leading cause of death by 2030.

By analysing a wide range of factors, including weather conditions, special events, roadworks and traffic jams, among others, it is possible to predict the severity of an accident.

These insights could allow law enforcement bodies to allocate their resources more effectively in advance of potential accidents, anticipating when and where a severe accident is likely to occur and saving both time and money. In addition, drivers could be warned when conditions favour a severe accident, so that they drive more carefully or change their route where possible, and hospitals could be alerted so that they can prepare in advance for a serious intervention.

Governments should be highly interested in accurate predictions of accident severity, since these could reduce response times and thus save a significant number of lives each year. Private companies investing in technologies that aim to improve road safety could also be interested.

## 2. Data

For an accurate prediction of the severity of damage caused by accidents, we require a large number of traffic accident reports with accurate data to train prediction models. The data set provided for this work contains close to 200,000 accident records (194,673 rows once loaded) from the city of Seattle, covering 2004 to the date of issue. Each record has 37 attributes or variables, and the type of accident is coded using 84 codes.

The data will be used so that we can determine which attributes are most common in traffic accidents in order to target prevention at these high-incidence points.

Data Source: These data were collected and shared by the Seattle Police Department (Traffic Records).

#### Extract Dataset and Convert

In [13]:
import numpy as np
import pandas as pd
from sklearn.utils import resample

In [14]:
df = pd.read_csv("Data-Collisions.csv")
df.head()

| Unnamed: 0 | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | -122.323148 | 47.70314 | 1 | 1307 | 1307 | 3502005 | Matched | Intersection | 37475.0 | ... | Wet | Daylight | | | | 10 | Entering at angle | 0 | 0 | N |
| 1 | 1 | -122.347294 | 47.647172 | 2 | 52200 | 52200 | 2607959 | Matched | Block | | ... | Wet | Dark - Street Lights On | | 6354039.0 | | 11 | From same direction - both going straight - bo... | 0 | 0 | N |
| 2 | 1 | -122.33454 | 47.607871 | 3 | 26700 | 26700 | 1482393 | Matched | Block | | ... | Dry | Daylight | | 4323031.0 | | 32 | One parked--one moving | 0 | 0 | N |
| 3 | 1 | -122.334803 | 47.604803 | 4 | 1144 | 1144 | 3503937 | Matched | Block | | ... | Dry | Daylight | | | | 23 | From same direction - all others | 0 | 0 | N |
| 4 | 2 | -122.306426 | 47.545739 | 5 | 17700 | 17700 | 1807429 | Matched | Intersection | 34387.0 | ... | Wet | Daylight | | 4028032.0 | | 10 | Entering at angle | 0 | 0 | N |


In [15]:
df.shape

(194673, 38)

#### Balancing the dataset

Our target variable SEVERITYCODE is imbalanced: class 1 has about 2.3 times as many observations as class 2, so the minority class is only about 43% the size of the majority class.

In [12]:
df['SEVERITYCODE'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64
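As a quick check of the imbalance, the class shares and the majority-to-minority ratio can be computed from the counts above. This is only a sketch: the counts are typed in by hand from the output, and the names `counts`, `shares` and `ratio` are illustrative rather than part of the original notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above (entered by hand).
counts = pd.Series({1: 136485, 2: 58188}, name="SEVERITYCODE")

# Share of each class in the full dataset.
shares = counts / counts.sum()

# How many majority-class rows there are per minority-class row.
ratio = counts.max() / counts.min()

print(shares.round(3))   # class 1 is about 0.701, class 2 about 0.299
print(round(ratio, 2))   # about 2.35
```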

We can fix this by downsampling the majority class.

Down-sampling involves randomly removing observations from the majority class to prevent its signal from dominating the learning algorithm.

The most common heuristic for doing so is resampling without replacement.

1. First, we'll separate observations from each class into different DataFrames.
2. Next, we'll resample the majority class without replacement, setting the number of samples to match that of the minority class.
3. Finally, we'll combine the down-sampled majority class DataFrame with the original minority class DataFrame.

In [16]:
# Separate majority and minority classes
df_majority = df[df.SEVERITYCODE==1]
df_minority = df[df.SEVERITYCODE==2]

# Downsample the majority class
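df_majority_downsampled = resample(df_majority,
                                  replace=False,                # sample without replacement
                                  n_samples=len(df_minority),   # match the minority class size (58188)
                                  random_state=123)             # fixed seed for reproducible results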

# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])

In [17]:
df_downsampled.SEVERITYCODE.value_counts()

2    58188
1    58188
Name: SEVERITYCODE, dtype: int64

In [19]:
df_downsampled.shape

(116376, 38)

The new DataFrame has fewer observations than the original, and the ratio of the two classes is now 1:1, so the dataset is balanced.
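The same three steps can also be wrapped in a small helper that reads the minority class size from the data instead of relying on a typed-in number. The following is a minimal sketch on a toy DataFrame; the function name `downsample_to_minority` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.utils import resample


def downsample_to_minority(frame, target, random_state=123):
    """Return a copy of `frame` in which every class of `target`
    has as many rows as the smallest class."""
    # Size of the smallest class, read from the data itself.
    n_min = int(frame[target].value_counts().min())

    # Sample n_min rows per class without replacement. The smallest
    # class keeps all of its rows (only their order changes).
    parts = [
        resample(group, replace=False, n_samples=n_min,
                 random_state=random_state)
        for _, group in frame.groupby(target)
    ]
    return pd.concat(parts)


# Toy data: 7 rows of class 1 and 3 rows of class 2 (made-up values).
toy = pd.DataFrame({
    "SEVERITYCODE": [1] * 7 + [2] * 3,
    "VEHCOUNT": list(range(10)),
})

balanced = downsample_to_minority(toy, "SEVERITYCODE")
print(balanced["SEVERITYCODE"].value_counts())
```

On the toy data each class ends up with three rows. Because the helper never names a specific class, it would also work if the majority and minority labels were swapped.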

#### Converting categorical variables into numerical type

Our target variable will be 'SEVERITYCODE', because it measures the severity of an accident within the dataset. The attributes used as predictors of severity are 'LOCATION','WEATHER','ROADCOND','JUNCTIONTYPE','SPEEDING','LIGHTCOND','VEHCOUNT' and 'PERSONCOUNT'.

In [21]:
features=df_downsampled[['LOCATION','WEATHER','ROADCOND','JUNCTIONTYPE','SPEEDING','LIGHTCOND','VEHCOUNT',
             'PERSONCOUNT']]

In its original form, this data is not fit for analysis. Most of the features are of type object, when they should be of a numerical type.

In [22]:
features.dtypes

LOCATION        object
WEATHER         object
ROADCOND        object
JUNCTIONTYPE    object
SPEEDING        object
LIGHTCOND       object
VEHCOUNT         int64
PERSONCOUNT      int64
dtype: object

We must use label encoding to convert the features to our desired data type.
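As an illustration of what label encoding does, the sketch below applies scikit-learn's `LabelEncoder` to a small made-up DataFrame with two of the object-typed column names from above. The toy values, the `"Unknown"` placeholder used for missing entries, and the names `toy`, `encoded` and `encoders` are all assumptions for the example; how missing values should really be handled in this dataset is a separate decision.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Raining", "Overcast", None, "Clear"],
    "ROADCOND": ["Dry", "Wet", "Wet", "Dry", None],
    "VEHCOUNT": [2, 2, 1, 3, 2],
})

categorical_cols = ["WEATHER", "ROADCOND"]

encoded = toy.copy()
encoders = {}
for col in categorical_cols:
    # LabelEncoder cannot sort a mix of strings and missing values,
    # so missing entries get a placeholder label first (an assumption).
    filled = encoded[col].fillna("Unknown")
    encoder = LabelEncoder()
    # Each distinct label is mapped to an integer, in sorted label order.
    encoded[col] = encoder.fit_transform(filled)
    # Keep the fitted encoder so the codes can be mapped back later.
    encoders[col] = encoder

print(encoded)
print(list(encoders["WEATHER"].classes_))
```

The integer codes follow the alphabetical order of the labels, so they carry no real ranking. The stored encoders allow the codes to be turned back into labels with `inverse_transform`.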