# Applied Data Science Capstone Project


## Introduction

Every year, roughly 1.3 million people die in car accidents worldwide, an average of 3,287 deaths per day. Road traffic crashes also cause up to 50 million injuries globally each year. At the same time, the use of smart cars is increasing every year.
Wouldn't it be great if your car could warn you, given the weather and road conditions, about the possibility of getting into an accident and how severe it would be, so that you could drive more carefully or even change your travel plans if you are able to?
That is what this project is about.

## Business Problem

Our problem is to develop a machine learning model in an effort to reduce the number of severe car accidents. The model should predict the severity of a car accident, given the current weather, road and visibility conditions. If the predicted severity is high, it should warn the driver to drive more carefully or even to change their travel plans.

## The Data

The dataset we will use is the shared collision data for the city of Seattle.

Let's start by loading our dataset.

In [111]:
import pandas as pd
import numpy as np
import os

In [112]:
# Load the dataset as a pandas DataFrame
cwd = os.getcwd()
# Build the path with os.path.join so the separator is correct on every platform
path = os.path.join(cwd, "Data-Collisions.csv")
df = pd.read_csv(path, low_memory=False)
df.columns

Index(['SEVERITYCODE', 'X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',
       'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
       'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'COLLISIONTYPE',
       'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE',
       'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC',
       'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
       'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC',
       'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],
      dtype='object')

Let's keep only the columns we are going to use.

In [113]:
# Keep only the target column and the three condition columns
df = df[['SEVERITYCODE', 'WEATHER', 'ROADCOND', 'LIGHTCOND']].copy()
df.head()

| | SEVERITYCODE | WEATHER | ROADCOND | LIGHTCOND |
|---|---|---|---|---|
| 0 | 2 | Overcast | Wet | Daylight |
| 1 | 1 | Raining | Wet | Dark - Street Lights On |
| 2 | 1 | Overcast | Dry | Daylight |
| 3 | 1 | Clear | Dry | Daylight |
| 4 | 2 | Raining | Wet | Daylight |
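As an alternative to loading every column and then discarding most of them, the selection can be made at read time. The following is a minimal sketch, assuming the `usecols` parameter of `pd.read_csv` is acceptable here; the in-memory CSV and the name `small_df` are hypothetical stand-ins for the real file, not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical miniature of the collisions file, invented for illustration only.
sample_csv = io.StringIO(
    "SEVERITYCODE,X,Y,WEATHER,ROADCOND,LIGHTCOND\n"
    "2,-122.32,47.70,Overcast,Wet,Daylight\n"
    "1,-122.34,47.64,Raining,Wet,Dark - Street Lights On\n"
)

keep_cols = ["SEVERITYCODE", "WEATHER", "ROADCOND", "LIGHTCOND"]

# usecols tells read_csv to parse only the listed columns.
small_df = pd.read_csv(sample_csv, usecols=keep_cols)
print(small_df.columns.tolist())
```

With the real file, the same call would take the path to `Data-Collisions.csv` in place of `sample_csv`.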


As you can see, we keep the severity code and the weather, road and light conditions. The severity code will be used as the dependent variable y, and the rest will be used as the independent variables X.

Next we are going to convert our categorical values to numerical ones. As shown below, WEATHER, ROADCOND and LIGHTCOND are object types. First, however, we check for NULL values (entries recorded as 'Unknown' are treated as missing) and drop the rows that contain them.

In [114]:
df.dtypes


SEVERITYCODE     int64
WEATHER         object
ROADCOND        object
LIGHTCOND       object
dtype: object

In [115]:
# Treat 'Unknown' entries as missing values
df.replace('Unknown', np.nan, regex=True, inplace=True)
nanvalues = df.isnull().sum().sum()
print("The total NaN Values are: " + str(nanvalues))
# Drop rows with missing values and renumber the index
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
nanvalues = df.isnull().sum().sum()
print("After dropping the rows with NaN Values: ", str(nanvalues))

The total NaN Values are: 58916
After dropping the rows with NaN Values:  0
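To see where the missing values come from, it can help to count them per column before dropping anything. The sketch below runs on a small invented frame (`toy` is a hypothetical name, and its values are not taken from the Seattle data); it uses an exact-match replacement of the string "Unknown", which is an assumption and slightly stricter than the regex replacement used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Unknown", "Raining", None],
    "ROADCOND": ["Dry", "Wet", "Unknown", "Wet"],
})

# Treat the literal string "Unknown" as missing (exact match only).
toy = toy.replace("Unknown", np.nan)

# Missing entries per column, then in total.
missing_per_column = toy.isnull().sum()
total_missing = int(missing_per_column.sum())

# Drop incomplete rows and renumber the index.
clean = toy.dropna().reset_index(drop=True)

print(missing_per_column)
print(total_missing, len(clean))
```

The per-column counts show which feature contributes most of the missing rows, which the single total does not reveal.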


Now we can check all the unique values in each column and then convert them to numeric types.

In [116]:
weather = df.WEATHER.unique()
roadcond = df.ROADCOND.unique()
lightcond = df.LIGHTCOND.unique()
print(weather)
print(roadcond)
print(lightcond)

['Overcast' 'Raining' 'Clear' 'Snowing' 'Other' 'Fog/Smog/Smoke'
 'Sleet/Hail/Freezing Rain' 'Blowing Sand/Dirt' 'Severe Crosswind'
 'Partly Cloudy']
['Wet' 'Dry' 'Snow/Slush' 'Ice' 'Other' 'Sand/Mud/Dirt' 'Standing Water'
 'Oil']
['Daylight' 'Dark - Street Lights On' 'Dark - No Street Lights' 'Dusk'
 'Dawn' 'Dark - Street Lights Off' 'Other']


In [117]:
# Map each category label to an integer code
weather_codes = {'Clear': 1, 'Partly Cloudy': 2, 'Overcast': 3, 'Raining': 4, 'Fog/Smog/Smoke': 5,
                 'Snowing': 6, 'Sleet/Hail/Freezing Rain': 7, 'Blowing Sand/Dirt': 8,
                 'Severe Crosswind': 9, 'Other': 10}
roadcond_codes = {'Dry': 1, 'Wet': 2, 'Sand/Mud/Dirt': 3, 'Snow/Slush': 4, 'Ice': 5,
                  'Standing Water': 6, 'Oil': 7, 'Other': 8}
lightcond_codes = {'Daylight': 1, 'Dark - Street Lights On': 2, 'Dusk': 3, 'Dawn': 4,
                   'Dark - No Street Lights': 5, 'Dark - Street Lights Off': 6, 'Other': 7}
# Assign the result back; replace(..., inplace=True) on a selected column may not update df
df['WEATHER'] = df['WEATHER'].map(weather_codes)
df['ROADCOND'] = df['ROADCOND'].map(roadcond_codes)
df['LIGHTCOND'] = df['LIGHTCOND'].map(lightcond_codes)
df.head()

| | SEVERITYCODE | WEATHER | ROADCOND | LIGHTCOND |
|---|---|---|---|---|
| 0 | 2 | 3 | 2 | 1 |
| 1 | 1 | 4 | 2 | 2 |
| 2 | 1 | 3 | 1 | 1 |
| 3 | 1 | 1 | 1 | 1 |
| 4 | 2 | 4 | 2 | 1 |


In [118]:
df.dtypes

SEVERITYCODE    int64
WEATHER         int64
ROADCOND        int64
LIGHTCOND       int64
dtype: object
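A hand-written label-to-code mapping silently breaks if a label is misspelled or a new category appears in the data. The following is a rough sketch of a guard against that, assuming a dictionary-based mapping; `encode_column` and `road_codes` are hypothetical names introduced for illustration, and the three codes shown are copied from the road-condition mapping in the cell above.

```python
import pandas as pd

def encode_column(series, mapping):
    """Map category labels to integer codes, failing on unmapped labels."""
    unmapped = set(series.unique()) - set(mapping)
    if unmapped:
        # Convert to str before sorting so mixed types cannot break the message.
        raise ValueError(f"No code defined for: {sorted(map(str, unmapped))}")
    return series.map(mapping).astype("int64")

# Three of the road-condition codes used in this notebook.
road_codes = {"Dry": 1, "Wet": 2, "Ice": 5}

roads = pd.Series(["Wet", "Dry", "Ice", "Wet"])
encoded = encode_column(roads, road_codes)
print(encoded.tolist())
```

Calling the helper on a column that contains a label missing from the dictionary, for example "Oil" with `road_codes` above, raises a `ValueError` instead of leaving the value unconverted.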

Now we are going to check whether the classes of our target variable are balanced.

In [121]:
df['SEVERITYCODE'].value_counts()

1    114654
2     55847
Name: SEVERITYCODE, dtype: int64

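As we can see from the result above, our target variable is not balanced: class 1 has roughly twice as many rows as class 2. We therefore balance it by randomly sampling 55,847 rows from each class, which downsamples class 1 to the size of class 2.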

In [125]:
df = (df.groupby('SEVERITYCODE', as_index=False)
        .apply(lambda x: x.sample(n=55847))
        .reset_index(drop=True))
df['SEVERITYCODE'].value_counts()

2    55847
1    55847
Name: SEVERITYCODE, dtype: int64
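Because the sampling above is random, each run keeps a different subset of class 1. A possible variation is sketched below on invented data: it computes the minority class size instead of hard-coding it and fixes a seed so the draw can be repeated. The use of `groupby(...).sample(...)`, the seed value `0`, and the names `toy`, `n_minority` and `balanced` are assumptions for illustration, not the notebook's own approach.

```python
import pandas as pd

# Toy imbalanced data, invented for illustration only.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 1, 1, 1, 1, 2, 2],
    "WEATHER": [1, 3, 4, 1, 1, 4, 1],
})

# Size of the smallest class, computed rather than hard-coded.
n_minority = int(toy["SEVERITYCODE"].value_counts().min())

# Draw n_minority rows from each class; random_state makes the draw repeatable.
balanced = (
    toy.groupby("SEVERITYCODE")
       .sample(n=n_minority, random_state=0)
       .reset_index(drop=True)
)
print(balanced["SEVERITYCODE"].value_counts())
```

Fixing the seed matters mostly for comparing models later, since every model is then trained on the same downsampled rows.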

In [129]:
df

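| | SEVERITYCODE | WEATHER | ROADCOND | LIGHTCOND |
|---|---|---|---|---|
| 0 | 1 | 3 | 2 | 1 |
| 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 1 | 1 |
| 3 | 1 | 4 | 2 | 1 |
| 4 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... |
| 111689 | 2 | 1 | 1 | 1 |
| 111690 | 2 | 1 | 1 | 1 |
| 111691 | 2 | 1 | 1 | 1 |
| 111692 | 2 | 1 | 1 | 1 |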


Now our dataset is ready to use. If needed we wi