#  IBM Data Science Capstone: Car Accident Severity Report

## Introduction | Business Understanding

To help reduce the frequency of vehicular accidents in a city, we want an algorithm that predicts the severity and likelihood of a road accident given the current weather, road, and visibility conditions. When conditions are bad, the model can alert drivers to be more careful or to take a different route to their destination.

In most cases, inattentive driving, driving under the influence, or speeding are the main causes of serious accidents, many of which might be prevented by stricter regulation. Factors outside the driver's control, such as weather, visibility, and road conditions, also contribute to a number of road accidents. Revealing hidden patterns in the data can help here: if these patterns are discovered early, local government can alert the police and the drivers traveling on the affected roads, asking them to drive more carefully or to avoid those roads entirely.

The target audience of the project is the local Seattle government, police, rescue groups, and car insurance institutes. The model and its results are intended to give this audience insights for making decisions that reduce the number of road accidents occurring in the city.

## Data Understanding

The data used in this project is taken from collision and accident reports in Seattle from 2004 to the present. It was collected by the Seattle Police Department (SPD) and the Traffic Records department of Seattle. The data has 37 independent variables and 194,673 records.

We will use this data to identify the key variables that may contribute to car accidents. For example, the “WEATHER” column can show the types and number of accidents that occur under different weather conditions. Furthermore, grouping by the “INTKEY” column gives the total number of accidents at each intersection. Sorting these totals in descending order identifies the most dangerous intersections, which the respective authorities can use to provide better facilities and monitoring there. Finally, a supervised learning model will be used to predict the severity of an accident from the inputs.
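
The intersection ranking described above could be sketched as follows. This is a minimal illustration on a made-up frame rather than the dataset itself: the `INTKEY` values and the names `toy` and `intersection_counts` are assumptions for the example.

```python
import pandas as pd

# Toy stand-in for the collision data; these INTKEY values are invented
toy = pd.DataFrame({"INTKEY": [101, 101, 205, 101, 205, 309, None]})

# Count collisions per intersection, then sort from most to fewest.
# Rows with a missing INTKEY (collisions not at an intersection) are dropped by groupby.
intersection_counts = (
    toy.groupby("INTKEY")
       .size()
       .sort_values(ascending=False)
)

print(intersection_counts.head(3))
```

On the real data, the same pattern would be applied to `df` before the `INTKEY` column is dropped.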

Our target variable will be 'SEVERITYCODE' because it is used to measure the severity of an accident within the dataset. The attributes used to weigh the severity of an accident are 'WEATHER', 'ROADCOND' and 'LIGHTCOND'. 'SEVERITYCODE' contains numbers from 0 to 4 that correspond to different levels of severity, as follows (only codes 1 and 2 appear in this dataset):

0. Little to no Probability (Clear Conditions)
1. Very Low Probability: Chance of Property Damage
2. Low Probability: Chance of Injury
3. Mild Probability: Chance of Serious Injury
4. High Probability: Chance of Fatality

## Extracting Data and Pre-Processing

In its original form, this data is not fit for analysis. For one, there are many columns that we will not use for this model. Also, most of the features are of type object, when they should be of a numerical type.

We must use label encoding to convert the features to our desired data type.
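
As a small illustration of what label encoding does here, the sketch below applies the `astype('category')` and `.cat.codes` pattern to a made-up series (the values are assumptions chosen for the example). Codes follow the sorted category order, and missing values are encoded as -1.

```python
import pandas as pd

# Invented weather values, including one missing entry
toy_weather = pd.Series(["Raining", "Clear", None, "Overcast", "Clear"])
as_category = toy_weather.astype("category")

# Categories are sorted alphabetically: Clear=0, Overcast=1, Raining=2
print(list(as_category.cat.categories))

# Missing values have no category and receive the code -1
codes = as_category.cat.codes
print(codes.tolist())
```

The integer codes imply an order that nominal categories such as weather do not really have, so distance-based models may treat some categories as "closer" than others purely because of alphabetical order. One-hot encoding is a common alternative when that matters.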

In [1]:
import pandas as pd
import numpy as np

file_name='https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv'
df=pd.read_csv(file_name)

# Drop all columns with no predictive value for the context of this project
colData = df.drop(columns = ['OBJECTID', 'SEVERITYCODE.1', 'REPORTNO', 'INCKEY', 'COLDETKEY', 
              'X', 'Y', 'STATUS','ADDRTYPE',
              'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
              'EXCEPTRSNDESC', 'SEVERITYDESC', 'INCDATE',
              'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE',
              'SDOT_COLDESC', 'PEDROWNOTGRNT', 'SDOTCOLNUM',
              'ST_COLCODE', 'ST_COLDESC', 'SEGLANEKEY',
              'CROSSWALKKEY', 'HITPARKEDCAR', 'PEDCOUNT', 'PEDCYLCOUNT',
              'PERSONCOUNT', 'VEHCOUNT', 'COLLISIONTYPE',
              'SPEEDING', 'UNDERINFL', 'INATTENTIONIND'])

# Label Encoding
# Convert column to category
colData["WEATHER"] = colData["WEATHER"].astype('category')
colData["ROADCOND"] = colData["ROADCOND"].astype('category')
colData["LIGHTCOND"] = colData["LIGHTCOND"].astype('category')

# Assign variable to new column for analysis
colData["WEATHER_CAT"] = colData["WEATHER"].cat.codes
colData["ROADCOND_CAT"] = colData["ROADCOND"].cat.codes
colData["LIGHTCOND_CAT"] = colData["LIGHTCOND"].cat.codes

colData.head(5)



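| | SEVERITYCODE | WEATHER | ROADCOND | LIGHTCOND | WEATHER_CAT | ROADCOND_CAT | LIGHTCOND_CAT |
|---|---|---|---|---|---|---|---|
| 0 | 2 | Overcast | Wet | Daylight | 4 | 8 | 5 |
| 1 | 1 | Raining | Wet | Dark - Street Lights On | 6 | 8 | 2 |
| 2 | 1 | Overcast | Dry | Daylight | 4 | 0 | 5 |
| 3 | 1 | Clear | Dry | Daylight | 1 | 0 | 5 |
| 4 | 2 | Raining | Wet | Daylight | 6 | 8 | 5 |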


With the new columns, we can now use this data in our analysis and ML models!

Now let's check the data types of the new columns in our dataframe. Moving forward, we will only use the new columns for our analysis.

In [3]:
colData.dtypes

SEVERITYCODE        int64
WEATHER          category
ROADCOND         category
LIGHTCOND        category
WEATHER_CAT          int8
ROADCOND_CAT         int8
LIGHTCOND_CAT        int8
dtype: object

In [4]:
colData.describe()

| | SEVERITYCODE | WEATHER_CAT | ROADCOND_CAT | LIGHTCOND_CAT |
|---|---|---|---|---|
| count | 194673.0 | 194673.0 | 194673.0 | 194673.0 |
| mean | 1.298901 | 2.977254 | 2.507122 | 4.25642 |
| std | 0.457778 | 2.892011 | 3.64866 | 1.900722 |
| min | 1.0 | -1.0 | -1.0 | -1.0 |
| 25% | 1.0 | 1.0 | 0.0 | 2.0 |
| 50% | 1.0 | 1.0 | 0.0 | 5.0 |
| 75% | 2.0 | 6.0 | 7.0 | 5.0 |
| max | 2.0 | 10.0 | 8.0 | 8.0 |


### Balancing Dataset

#### Analyzing Value Counts

In [5]:
colData["SEVERITYCODE"].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64

Our target variable SEVERITYCODE is imbalanced: class 2 has only about 43% as many records as class 1. In other words, class 1 is roughly 2.3 times the size of class 2.
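
The imbalance can be checked with a few lines of arithmetic on the counts printed by `value_counts()` above. The variable names are only for illustration.

```python
# Class counts as reported by value_counts() above
n_class_1 = 136485
n_class_2 = 58188

minority_to_majority = n_class_2 / n_class_1            # about 0.43
majority_to_minority = n_class_1 / n_class_2            # about 2.35
minority_share = n_class_2 / (n_class_1 + n_class_2)    # about 0.30 of all records

print(round(minority_to_majority, 3),
      round(majority_to_minority, 2),
      round(minority_share, 3))
```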

We can fix this by downsampling the majority class.

In [7]:
from sklearn.utils import resample

# Separate majority and minority classes
colData_majority = colData[colData.SEVERITYCODE==1]
colData_minority = colData[colData.SEVERITYCODE==2]

# Downsample majority class (without replacement) to the size of the minority class
colData_majority_downsampled = resample(colData_majority,
                                        replace=False,
                                        n_samples=len(colData_minority),
                                        random_state=123)

# Combine minority class with downsampled majority class
colData_balanced = pd.concat([colData_majority_downsampled, colData_minority])

# Display new class counts
colData_balanced.SEVERITYCODE.value_counts()



2    58188
1    58188
Name: SEVERITYCODE, dtype: int64

Now it's balanced!

We use only three independent variables:
- Weather type (Clear, raining, etc.)
- Road Condition (Dry, wet, etc.)
- Light Condition (Daylight, dark, etc.)

Let's find out the total number of accidents for each of these individual parameters.

In [8]:
colData["WEATHER"].value_counts()

Clear                       111135
Raining                      33145
Overcast                     27714
Unknown                      15091
Snowing                        907
Other                          832
Fog/Smog/Smoke                 569
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               56
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64

This shows that most accidents took place in clear weather.

In [9]:
colData["ROADCOND"].value_counts()

Dry               124510
Wet                47474
Unknown            15078
Ice                 1209
Snow/Slush          1004
Other                132
Standing Water       115
Sand/Mud/Dirt         75
Oil                   64
Name: ROADCOND, dtype: int64

Here, we can see that most accidents took place on dry roads, followed by wet roads. This may look surprising, but dry roads are also the most common driving condition.

In [10]:
colData["LIGHTCOND"].value_counts()

Daylight                    116137
Dark - Street Lights On      48507
Unknown                      13473
Dusk                          5902
Dawn                          2502
Dark - No Street Lights       1537
Dark - Street Lights Off      1199
Other                          235
Dark - Unknown Lighting         11
Name: LIGHTCOND, dtype: int64

From this data, it is clear that most car accidents took place in Daylight.
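
Raw counts mostly reflect how often each condition occurs, so on their own they do not show which conditions are associated with more severe accidents. One way to look at that, sketched here on invented rows rather than the real data, is the share of each severity class within a condition, using `pd.crosstab` with `normalize="index"`. The frame `toy` and the name `severity_share` are assumptions for the example.

```python
import pandas as pd

# Invented example rows; the real analysis would use colData instead
toy = pd.DataFrame({
    "ROADCOND": ["Dry", "Dry", "Dry", "Dry", "Wet", "Wet", "Ice", "Ice"],
    "SEVERITYCODE": [1, 1, 1, 2, 1, 2, 2, 2],
})

# Each row sums to 1: the fraction of severity 1 and 2 within a road condition
severity_share = pd.crosstab(toy["ROADCOND"], toy["SEVERITYCODE"], normalize="index")
print(severity_share)
```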

## Methodology

Our data is now ready to be fed into machine learning models. Statistical testing was not performed because the data revolved around categorical variables, not numerical ones.

Key variables such as pedestrian right of way, inattentive drivers, and whether the car was speeding had a majority of null values. Therefore, they were dropped and not part of the analysis. However, it is likely that these variables play a key factor in vehicle accidents.

We will use the following models:

 - K-Nearest Neighbor (KNN)
KNN predicts the severity code of a data point from the classes of the k most similar data points in the training set.

 - Decision Tree
A decision tree lays out the possible outcomes as a sequence of splits so that we can analyze the consequences of each decision. In this context, the tree considers the possible outcomes under different weather, road, and light conditions.

 - Logistic Regression
Because our dataset only contains two severity codes, the model only has to predict one of two classes. This makes the problem binary, which is a good fit for logistic regression.
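
To make the KNN description above concrete, here is a minimal from-scratch sketch of majority voting among the k nearest points. It is an illustration only, not the scikit-learn implementation used later; the function `knn_predict` and the sample points are hypothetical.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k closest training points
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Majority vote over their labels
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Invented encoded points: (WEATHER_CAT, ROADCOND_CAT, LIGHTCOND_CAT)
points = [(1, 0, 5), (1, 0, 5), (6, 8, 2), (6, 8, 5), (4, 8, 5)]
labels = [1, 1, 2, 2, 2]

prediction = knn_predict(points, labels, query=(6, 8, 3), k=3)
print(prediction)
```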

## Initialization

In [11]:
X = np.asarray(colData_balanced[['WEATHER_CAT', 'ROADCOND_CAT', 'LIGHTCOND_CAT']])
X[0:5]

array([[ 6,  8,  2],
       [ 1,  0,  5],
       [10,  7,  8],
       [ 1,  0,  5],
       [ 1,  0,  5]], dtype=int8)

In [12]:
y = np.asarray(colData_balanced['SEVERITYCODE'])
y [0:5]

array([1, 1, 1, 1, 1])

#### Normalize the dataset

In [13]:
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]



array([[ 1.15236718,  1.52797946, -1.21648407],
       [-0.67488   , -0.67084969,  0.42978835],
       [ 2.61416492,  1.25312582,  2.07606076],
       [-0.67488   , -0.67084969,  0.42978835],
       [-0.67488   , -0.67084969,  0.42978835]])
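
`StandardScaler` rescales each column to zero mean and unit variance. As a rough sketch of the same computation in plain NumPy, applied to three encoded rows only (so the numbers differ from the output above, which uses the full balanced dataset):

```python
import numpy as np

# Three encoded rows, used here only as a small example
X_toy = np.array([[6, 8, 2], [1, 0, 5], [10, 7, 8]], dtype=float)

# Column-wise z-score: subtract the mean, divide by the population standard deviation (ddof=0)
X_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# After scaling, every column has mean 0 and standard deviation 1
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```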

#### Train/Test Split
We will use 30% of our data for testing and 70% for training

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (81463, 3) (81463,)
Test set: (34913, 3) (34913,)


### K-Nearest Neighbors (KNN)

In [17]:
# Building the KNN Model
from sklearn.neighbors import KNeighborsClassifier

k = 25

In [18]:
#Train Model & Predict  
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh

Kyhat = neigh.predict(X_test)
Kyhat[0:5]

array([2, 2, 1, 1, 2])

### Decision Tree

In [19]:
# Building the Decision Tree
from sklearn.tree import DecisionTreeClassifier
colDataTree = DecisionTreeClassifier(criterion="entropy", max_depth = 7)
colDataTree
colDataTree.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=7,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
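
The tree above is built with `criterion="entropy"`, meaning splits are chosen by how much they reduce entropy. The following is a small sketch of that calculation on invented class counts; the helper names `entropy` and `information_gain` are assumptions for illustration, not scikit-learn functions.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child nodes."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Invented counts of [class 1, class 2] before and after a candidate split
parent = [50, 50]
children = [[40, 10], [10, 40]]

gain = information_gain(parent, children)
print(round(entropy(parent), 3), round(gain, 3))
```

A perfectly balanced node has an entropy of 1 bit, and a split is more useful the more it lowers the weighted entropy of its children.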

In [20]:
# Predict on the test set; print the first five true labels for reference
DTyhat = colDataTree.predict(X_test)
print (y_test [0:5])

[2 2 1 1 1]


### Logistic Regression

In [21]:
# Building the LR Model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=6, solver='liblinear').fit(X_train,y_train)
LR

LogisticRegression(C=6, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [22]:
# Predict on the test set
LRyhat = LR.predict(X_test)
LRyhat

array([1, 2, 1, ..., 2, 2, 2])

In [23]:
yhat_prob = LR.predict_proba(X_test)
yhat_prob

array([[0.57295252, 0.42704748],
       [0.47065071, 0.52934929],
       [0.67630201, 0.32369799],
       ...,
       [0.46929132, 0.53070868],
       [0.47065071, 0.52934929],
       [0.46929132, 0.53070868]])
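
Each row of `predict_proba` holds the estimated probability of each class, with columns in the order of `LR.classes_`. The sketch below shows how the first three rows printed above map to class predictions; the class order `[1, 2]` is assumed from the sorted SEVERITYCODE labels.

```python
import numpy as np

# The first three probability rows printed above
probabilities = np.array([[0.57295252, 0.42704748],
                          [0.47065071, 0.52934929],
                          [0.67630201, 0.32369799]])
classes = np.array([1, 2])  # assumed column order (sorted class labels)

# For each row, pick the class whose column has the highest probability
predicted = classes[probabilities.argmax(axis=1)]
print(predicted)
```

Many of the probabilities are close to 0.5, which suggests the model separates the two classes only weakly.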

##  Results & Evaluation

In [33]:
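# jaccard_similarity_score was removed in scikit-learn 0.23; for binary and multiclass labels it equals accuracy_score
try:
    from sklearn.metrics import jaccard_similarity_score
except ImportError:
    from sklearn.metrics import accuracy_score as jaccard_similarity_score
from sklearn.metrics import f1_score, log_loss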

#### K-Nearest Neighbor

In [34]:
# Jaccard Similarity Score
jaccard_similarity_score(y_test, Kyhat)

0.564001947698565
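
In the scikit-learn versions that provide `jaccard_similarity_score`, the function is equivalent to `accuracy_score` for binary and multiclass labels, so the value above is best read as plain accuracy. The sketch below contrasts accuracy with the Jaccard index of a single class on invented labels; both helper functions are hypothetical and written only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def jaccard_for_class(y_true, y_pred, positive):
    """Jaccard index for one class: TP / (TP + FP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fp + fn)

# Invented labels for illustration
y_true = [1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 2, 2, 2, 1]

acc = accuracy(y_true, y_pred)
jac = jaccard_for_class(y_true, y_pred, positive=2)
print(round(acc, 3), jac)
```

The two numbers differ because the Jaccard index ignores true negatives, while accuracy counts every correct prediction.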

In [35]:
# F1-SCORE
f1_score(y_test, Kyhat, average='macro')

0.5401775308974308
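
With `average='macro'`, the F1 score is computed per class and then averaged without weighting. A from-scratch sketch on invented labels (the helpers `f1_for_class` and `macro_f1` are hypothetical, not scikit-learn functions):

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Invented labels for illustration
y_true = [1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 2, 2, 2, 1]

score = macro_f1(y_true, y_pred)
print(round(score, 3))
```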

#### Decision Tree

In [36]:
# Jaccard Similarity Score
jaccard_similarity_score(y_test, DTyhat)

0.5664365709048206

In [37]:
# F1-SCORE
f1_score(y_test, DTyhat, average='macro')

0.5450597937389444

#### Logistic Regression

In [38]:
# Jaccard Similarity Score
jaccard_similarity_score(y_test, LRyhat)

0.5260218256809784

In [39]:
# F1-SCORE
f1_score(y_test, LRyhat, average='macro')

0.511602093963383

In [40]:
# LOGLOSS
yhat_prob = LR.predict_proba(X_test)
log_loss(y_test, yhat_prob)

0.6849535383198887
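
Log loss is the mean negative log-probability that the model assigned to the true class. As a reference point, a model that always outputs 0.5 scores ln 2, about 0.6931, so the value above is only slightly better than that baseline. A minimal sketch, with a hypothetical helper `binary_log_loss`:

```python
import math

def binary_log_loss(y_true, prob_positive, positive=2):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for label, p in zip(y_true, prob_positive):
        # Probability the model assigned to the label that actually occurred
        p_true = p if label == positive else 1.0 - p
        total += -math.log(p_true)
    return total / len(y_true)

# A model that always outputs 0.5 scores ln(2), regardless of the labels
baseline = binary_log_loss([1, 2, 2, 1], [0.5, 0.5, 0.5, 0.5])
print(round(baseline, 4))  # 0.6931
```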


### K-Nearest Neighbor

- Jaccard Similarity Score = 0.564001947698565
- F1-SCORE                 = 0.5401775308974308

The model was most accurate with k = 25.

### Decision Tree

- Jaccard Similarity Score = 0.5664365709048206
- F1-SCORE                 = 0.5450597937389444

The model was most accurate with a max depth of 7.

### Logistic Regression

- Jaccard Similarity Score = 0.5260218256809784
- F1-SCORE                 = 0.511602093963383
- LOGLOSS                  = 0.6849535383198887

The model was most accurate with the hyperparameter C set to 6.

## Discussion



At the beginning of this notebook, we had categorical data of type 'object'. This is not a data type that we could feed to an algorithm, so label encoding was used to create new columns of type int8, a numerical data type.

After solving that issue we were presented with another: imbalanced data. As mentioned earlier, class 1 was roughly 2.3 times larger than class 2. The solution was to downsample the majority class with sklearn's resample tool. We downsampled to match the minority class exactly, with 58,188 records each.

Once we had analyzed and cleaned the data, it was fed through three ML models: K-Nearest Neighbor, Decision Tree and Logistic Regression. Although the first two are well suited to this project, logistic regression made the most sense conceptually because of the binary target, even though it scored slightly lower than the other two in our evaluation.

The evaluation metrics used to test the accuracy of our models were the Jaccard index, the F1 score and, for logistic regression, log loss. Trying different values of k, max depth and the hyperparameter C helped to improve accuracy.
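
The notebook does not show how the different values of k were compared. One common approach, sketched here on synthetic data, is to score each candidate with cross-validation and keep the best one. The data from `make_classification` and the candidate list are assumptions for the example; on the real data, `X_train` and `y_train` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train (three features, two classes)
X_demo, y_demo = make_classification(n_samples=300, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)

# Mean 5-fold cross-validated accuracy for each candidate k
scores = {}
for k in [1, 5, 15, 25]:
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The same loop could be reused for the tree's `max_depth` or the logistic regression's `C` by swapping the estimator and the candidate values.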


## Conclusion

Based on the dataset provided for this capstone and the models built on weather, road, and light conditions, we can conclude that these conditions have some impact on whether a collision results in property damage (class 1) or injury (class 2). However, the effect is weak: with accuracy between roughly 0.53 and 0.57 on a balanced test set, the current conditions alone barely describe the severity of an accident. Going by the raw numbers, most accidents occurred on dry roads (124,510 cases), followed by wet roads (47,474 cases), but this may simply be because people travel more when it is dry and because these two conditions prevail for most of the year.