# CAR ACCIDENT SEVERITY REPORT
## IBM - APPLIED DATA SCIENCE CAPSTONE

## Table of Contents
* Business Understanding
* Data Understanding
* Data Preprocessing
* Methodology
* Results & Evaluation
* Discussion
* Conclusion

[Data_Collisions.csv](https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv)

## BUSINESS UNDERSTANDING
To reduce the frequency of car collisions at a location, current weather, road, and visibility conditions should be taken into account, and an algorithm should be developed to estimate the likely severity of an accident in terms of both property damage and human harm. In an application, drivers would be alerted to the severity level whenever conditions correspond to a code above 0.

In [18]:
import pandas as pd
import numpy as np

# load the dataset
path='https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv'
df = pd.read_csv(path)
df.head()



| | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | -122.323148 | 47.70314 | 1 | 1307 | 1307 | 3502005 | Matched | Intersection | 37475.0 | ... | Wet | Daylight | | | | 10 | Entering at angle | 0 | 0 | N |
| 1 | 1 | -122.347294 | 47.647172 | 2 | 52200 | 52200 | 2607959 | Matched | Block | | ... | Wet | Dark - Street Lights On | | 6354039.0 | | 11 | From same direction - both going straight - bo... | 0 | 0 | N |
| 2 | 1 | -122.33454 | 47.607871 | 3 | 26700 | 26700 | 1482393 | Matched | Block | | ... | Dry | Daylight | | 4323031.0 | | 32 | One parked--one moving | 0 | 0 | N |
| 3 | 1 | -122.334803 | 47.604803 | 4 | 1144 | 1144 | 3503937 | Matched | Block | | ... | Dry | Daylight | | | | 23 | From same direction - all others | 0 | 0 | N |
| 4 | 2 | -122.306426 | 47.545739 | 5 | 17700 | 17700 | 1807429 | Matched | Intersection | 34387.0 | ... | Wet | Daylight | | 4028032.0 | | 10 | Entering at angle | 0 | 0 | N |


In [19]:
df.columns

Index(['SEVERITYCODE', 'X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',
       'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
       'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'COLLISIONTYPE',
       'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE',
       'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC',
       'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
       'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC',
       'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],
      dtype='object')

---

## DATA UNDERSTANDING
The data was collected by the Seattle Police Department and Accident Traffic Records Department from 2004 to present. The data consists of 37 independent variables and 194,673 rows. The dependent variable, “SEVERITYCODE”, contains numbers that correspond to different levels of severity caused by an accident from 0 to 4.

**Severity codes are as follows:**
* 0: Little to no Probability (Clear Conditions)
* 1: Very Low Probability - Chance of Property Damage
* 2: Low Probability - Chance of Injury
* 3: Mild Probability - Chance of Serious Injury
* 4: High Probability - Chance of Fatality

In [20]:
# drop all columns with no predictive value
colData = df.drop(columns = ['OBJECTID', 'SEVERITYCODE.1', 'REPORTNO', 'INCKEY', 'COLDETKEY', 
              'X', 'Y', 'STATUS','ADDRTYPE',
              'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
              'EXCEPTRSNDESC', 'SEVERITYDESC', 'INCDATE',
              'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE',
              'SDOT_COLDESC', 'PEDROWNOTGRNT', 'SDOTCOLNUM',
              'ST_COLCODE', 'ST_COLDESC', 'SEGLANEKEY',
              'CROSSWALKKEY', 'HITPARKEDCAR', 'PEDCOUNT', 'PEDCYLCOUNT',
              'PERSONCOUNT', 'VEHCOUNT', 'COLLISIONTYPE',
              'SPEEDING', 'UNDERINFL', 'INATTENTIONIND'])

# Label Encoding
# Convert column to category
colData["WEATHER"] = colData["WEATHER"].astype('category')
colData["ROADCOND"] = colData["ROADCOND"].astype('category')
colData["LIGHTCOND"] = colData["LIGHTCOND"].astype('category')

# Assign variable to new column for analysis
colData["WEATHER_CAT"] = colData["WEATHER"].cat.codes
colData["ROADCOND_CAT"] = colData["ROADCOND"].cat.codes
colData["LIGHTCOND_CAT"] = colData["LIGHTCOND"].cat.codes

colData.head(5)

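| | SEVERITYCODE | WEATHER | ROADCOND | LIGHTCOND | WEATHER_CAT | ROADCOND_CAT | LIGHTCOND_CAT |
|---|---|---|---|---|---|---|---|
| 0 | 2 | Overcast | Wet | Daylight | 4 | 8 | 5 |
| 1 | 1 | Raining | Wet | Dark - Street Lights On | 6 | 8 | 2 |
| 2 | 1 | Overcast | Dry | Daylight | 4 | 0 | 5 |
| 3 | 1 | Clear | Dry | Daylight | 1 | 0 | 5 |
| 4 | 2 | Raining | Wet | Daylight | 6 | 8 | 5 |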


In [21]:
colData.dtypes

SEVERITYCODE        int64
WEATHER          category
ROADCOND         category
LIGHTCOND        category
WEATHER_CAT          int8
ROADCOND_CAT         int8
LIGHTCOND_CAT        int8
dtype: object
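
The integer codes produced by `.cat.codes` follow the sorted order of the category labels, and missing values are coded as `-1`. The sketch below shows one way to recover that mapping; the small series is made up for illustration and is not the collision data.

```python
import pandas as pd

# Toy series standing in for a column such as WEATHER (illustrative values).
weather = pd.Series(["Raining", "Clear", None, "Overcast", "Clear"]).astype("category")

# Categories are sorted lexically; each code is an index into that sorted list.
code_to_label = dict(enumerate(weather.cat.categories))
codes = weather.cat.codes.tolist()

print(code_to_label)  # {0: 'Clear', 1: 'Overcast', 2: 'Raining'}
print(codes)          # [2, 0, -1, 1, 0]
```

Because the codes only reflect alphabetical order, their numeric spacing carries no real meaning, which is worth keeping in mind for distance-based models such as KNN.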

In [22]:
colData["SEVERITYCODE"].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64

In [23]:
colData["WEATHER"].value_counts()

Clear                       111135
Raining                      33145
Overcast                     27714
Unknown                      15091
Snowing                        907
Other                          832
Fog/Smog/Smoke                 569
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               56
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64

In [24]:
colData["ROADCOND"].value_counts()

Dry               124510
Wet                47474
Unknown            15078
Ice                 1209
Snow/Slush          1004
Other                132
Standing Water       115
Sand/Mud/Dirt         75
Oil                   64
Name: ROADCOND, dtype: int64

In [25]:
colData["LIGHTCOND"].value_counts()

Daylight                    116137
Dark - Street Lights On      48507
Unknown                      13473
Dusk                          5902
Dawn                          2502
Dark - No Street Lights       1537
Dark - Street Lights Off      1199
Other                          235
Dark - Unknown Lighting         11
Name: LIGHTCOND, dtype: int64

---

## DATA PREPROCESSING

In [26]:
from sklearn.utils import resample

In [27]:
# Separate majority and minority classes
colData_majority = colData[colData.SEVERITYCODE==1]
colData_minority = colData[colData.SEVERITYCODE==2]

#Downsample majority class
colData_majority_downsampled = resample(colData_majority,
                                        replace=False,
                                        n_samples=58188,
                                        random_state=123)

# Combine minority class with downsampled majority class
colData_balanced = pd.concat([colData_majority_downsampled, colData_minority])

# Display new class counts
colData_balanced.SEVERITYCODE.value_counts()

2    58188
1    58188
Name: SEVERITYCODE, dtype: int64
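
The same downsampling step can be written without hard-coding the minority count. This is a minimal sketch on a toy frame (the `toy` DataFrame and its `feature` column are assumptions for illustration), taking `n_samples` from the size of the minority class.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame (illustrative data, not the collision set).
toy = pd.DataFrame({"SEVERITYCODE": [1] * 7 + [2] * 3, "feature": range(10)})

majority = toy[toy.SEVERITYCODE == 1]
minority = toy[toy.SEVERITYCODE == 2]

# Sample the majority class without replacement down to the minority size.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=123)

balanced = pd.concat([majority_down, minority])
counts = balanced.SEVERITYCODE.value_counts().to_dict()
print(counts)
```

Deriving the sample size from `len(minority)` keeps the cell correct if the dataset is refreshed and the class counts change.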

---

## METHODOLOGY

In [28]:
import numpy as np
X = np.asarray(colData_balanced[['WEATHER_CAT', 'ROADCOND_CAT', 'LIGHTCOND_CAT']])
X[0:5]

array([[ 6,  8,  2],
       [ 1,  0,  5],
       [10,  7,  8],
       [ 1,  0,  5],
       [ 1,  0,  5]], dtype=int8)

In [31]:
y = np.asarray(colData_balanced['SEVERITYCODE'])
y [0:5]

array([1, 1, 1, 1, 1])

In [32]:
# preprocessing
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]



array([[ 1.15236718,  1.52797946, -1.21648407],
       [-0.67488   , -0.67084969,  0.42978835],
       [ 2.61416492,  1.25312582,  2.07606076],
       [-0.67488   , -0.67084969,  0.42978835],
       [-0.67488   , -0.67084969,  0.42978835]])
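
To make the scaling step concrete, the sketch below checks on a few toy rows that `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The `X_toy` values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy integer-coded features (illustrative values).
X_toy = np.array([[6, 8, 2], [1, 0, 5], [10, 7, 8], [1, 0, 5]], dtype=float)

scaled = StandardScaler().fit_transform(X_toy)

# Same result by hand: subtract the column mean, divide by the population std.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))  # True
```

In a stricter setup the scaler would be fit on the training split only and then applied to the test split, so that test statistics do not influence the transformation.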

In [33]:
# train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (81463, 3) (81463,)
Test set: (34913, 3) (34913,)


**Building the KNN Model**

KNN predicts the severity code of a data point by finding its k most similar (nearest) training points and assigning the majority class among them.

In [37]:
from sklearn.neighbors import KNeighborsClassifier

k = 25

#Train Model & Predict  
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)

Kpred = neigh.predict(X_test)
Kpred[0:5]

array([2, 2, 1, 1, 2])
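
The voting rule behind KNN can be sketched from scratch in a few lines. This is an illustration only, assuming Euclidean distance and unweighted votes (the defaults of `KNeighborsClassifier`); the helper `knn_predict` and the toy points are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    """Predict the label of x by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]               # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data: two clusters labelled 1 and 2 (illustrative values).
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]])
y_toy = np.array([1, 1, 1, 2, 2, 2])

print(knn_predict(X_toy, y_toy, np.array([0.05, 0.1]), k=3))  # 1
print(knn_predict(X_toy, y_toy, np.array([3.0, 3.1]), k=3))   # 2
```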

**Building the Decision Tree**

A decision tree model lays out the possible outcomes of a sequence of decisions so that the consequences of each can be analyzed. In this context, the decision tree examines the possible outcomes under different weather, road, and lighting conditions.

In [44]:
from sklearn.tree import DecisionTreeClassifier
colDataTree = DecisionTreeClassifier(criterion="entropy", max_depth = 7).fit(X_train,y_train)
colDataTree

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=7,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [45]:
# Train Model & Predict
DTpred = colDataTree.predict(X_test)
print (DTpred [0:5])
print (y_test [0:5])

[2 2 1 1 2]
[2 2 1 1 1]
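
With `criterion="entropy"`, the tree picks splits that reduce the entropy of the class labels the most. The sketch below works through that calculation on a toy label array; the helpers `entropy` and `information_gain` are written for illustration and are not part of scikit-learn.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    """Entropy reduction from splitting parent into left and right."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = np.array([1, 1, 1, 1, 2, 2, 2, 2])   # balanced classes
left, right = parent[:4], parent[4:]          # a perfect split

print(entropy(parent))                        # 1.0
print(information_gain(parent, left, right))  # 1.0
```

A balanced two-class node has the maximum entropy of 1 bit, so after downsampling the root of the tree starts from that value.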


**Building the LR Model**

Because our dataset only provides two severity code outcomes, our model will only predict one of those two classes. This makes the problem binary, which is well suited to logistic regression.

In [46]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=6, solver='liblinear').fit(X_train,y_train)
LR

LogisticRegression(C=6, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [47]:
# Train Model & Predict
LRpred = LR.predict(X_test)
LRpred

array([1, 2, 1, ..., 2, 2, 2])

In [48]:
ypred_prob = LR.predict_proba(X_test)
ypred_prob

array([[0.57295252, 0.42704748],
       [0.47065071, 0.52934929],
       [0.67630201, 0.32369799],
       ...,
       [0.46929132, 0.53070868],
       [0.47065071, 0.52934929],
       [0.46929132, 0.53070868]])
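
The two columns of `predict_proba` follow the order of the model's `classes_` attribute, and `predict` returns the class with the higher probability. The sketch below illustrates this on synthetic data from `make_classification` (an assumption for the example), with labels shifted to 1 and 2 to mirror the severity codes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data; labels shifted to 1/2 to mirror the severity codes.
X_toy, y_toy = make_classification(n_samples=200, n_features=3, n_informative=3,
                                   n_redundant=0, random_state=0)
y_toy = y_toy + 1

model = LogisticRegression(C=6, solver="liblinear").fit(X_toy, y_toy)

proba = model.predict_proba(X_toy)                 # columns follow model.classes_
from_proba = model.classes_[proba.argmax(axis=1)]  # most probable class per row

print(model.classes_)                                    # [1 2]
print(np.array_equal(from_proba, model.predict(X_toy)))  # True
```

Read this way, the first row of the output above (0.573, 0.427) is a prediction of class 1 with modest confidence.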

---

## RESULTS & EVALUATION
Now we will check the accuracy of our models.

In [49]:
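# Note: jaccard_similarity_score was removed in scikit-learn 0.23; for
# binary and multiclass labels it is equivalent to accuracy_score.
from sklearn.metrics import jaccard_similarity_score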
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

In [56]:
#for KNN
knn_f1 = f1_score(y_test, Kpred, average='macro')
knn_jc = jaccard_similarity_score(y_test, Kpred)

print("F1-score of KNN is :", knn_f1)
print("Jaccard-score of KNN is :", knn_jc)

F1-score of KNN is : 0.5401775308974308
Jaccard-score of KNN is : 0.564001947698565


In [57]:
# for Decision Tree
dt_f1 = f1_score(y_test, DTpred, average='macro')
dt_jc = jaccard_similarity_score(y_test, DTpred)

print("F1-score of Decision Tree is :", dt_f1)
print("Jaccard-score of Decision Tree is :", dt_jc)

F1-score of Decision Tree is : 0.5450597937389444
Jaccard-score of Decision Tree is : 0.5664365709048206


In [59]:
# for Logistic Regression
lr_f1 = f1_score(y_test, LRpred, average='macro')
lr_jc = jaccard_similarity_score(y_test, LRpred)
lr_ll = log_loss(y_test, ypred_prob)

print("F1-score of Logistic Regression is :", lr_f1)
print("Jaccard-score of Logistic Regression is :", lr_jc)
print("Logloss of Logistic Regression is :", lr_ll)

F1-score of Logistic Regression is : 0.511602093963383
Jaccard-score of Logistic Regression is : 0.5260218256809784
Logloss of Logistic Regression is : 0.6849535383198887


In [60]:
list_f1score = [knn_f1, dt_f1, lr_f1]
list_jaccard = [knn_jc, dt_jc, lr_jc]
list_logloss = ['NA', 'NA', lr_ll]

import pandas as pd

df = pd.DataFrame(list_f1score, index=['KNN','Decision Tree','Logistic Regression'])
df.columns = ['F1-Score']
df.insert(loc=1, column='Jaccard', value=list_jaccard)
df.insert(loc=2, column='LogLoss', value=list_logloss)
df.columns.name = 'Algorithm'
df

| Algorithm | F1-Score | Jaccard | LogLoss |
|---|---|---|---|
| KNN | 0.540178 | 0.564002 | NA |
| Decision Tree | 0.54506 | 0.566437 | NA |
| Logistic Regression | 0.511602 | 0.526022 | 0.684954 |
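
On recent scikit-learn releases `jaccard_similarity_score` is no longer available. For binary and multiclass labels it computed the same value as `accuracy_score`, whereas the newer `jaccard_score` computes the set-based intersection-over-union. The sketch below compares the metrics on a few made-up labels, as one possible way to port the evaluation cells.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Toy labels (illustrative), using the same 1/2 coding as the severity classes.
y_true = np.array([1, 1, 2, 2, 1, 2])
y_pred = np.array([1, 2, 2, 2, 1, 1])

acc = accuracy_score(y_true, y_pred)                  # fraction of exact matches
jac = jaccard_score(y_true, y_pred, average="macro")  # mean per-class intersection over union
f1 = f1_score(y_true, y_pred, average="macro")        # mean per-class F1

print(round(acc, 3), round(jac, 3), round(f1, 3))  # 0.667 0.5 0.667
```

Under this reading, the "Jaccard" column in the table above corresponds to plain accuracy, and `jaccard_score` would give different (generally lower) numbers.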


---

## DISCUSSION

* First, the categorical features were of type 'object'. This is not a data type that we could feed through an algorithm, so label encoding was used to create new columns of type int8, a numerical data type.

* Second, after resolving this issue we faced another one: imbalanced data. Class 1 was roughly 2.3 times larger than class 2 (136,485 versus 58,188 rows). We downsampled the majority class with sklearn's resample tool so that both classes had exactly 58,188 rows.

* Third, once the data had been analyzed and cleaned, it was fed through three ML models: K-Nearest Neighbor, Decision Tree, and Logistic Regression.

* Finally, the models were evaluated with the Jaccard index and F1-score, plus log loss for logistic regression. Trying different values of k, max depth, and the hyperparameter C helped improve the scores; the values reported here (k = 25, max depth = 7, C = 6) are the ones we settled on.
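
The manual search over hyperparameter values described above could also be automated. The following is a sketch using `GridSearchCV` with cross-validation; the synthetic data, the candidate depths, and the choice of macro F1 as the scoring metric are assumptions for illustration, not the procedure actually used in this report.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the three encoded features (not the collision data).
X_toy, y_toy = make_classification(n_samples=300, n_features=3, n_informative=3,
                                   n_redundant=0, random_state=0)

# Candidate depths are illustrative; each is scored by 5-fold cross-validated macro F1.
search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    param_grid={"max_depth": [3, 5, 7, 9]},
    scoring="f1_macro",
    cv=5,
)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to `n_neighbors` for KNN or `C` for logistic regression by swapping the estimator and the parameter grid.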

---

## CONCLUSION

As a result, we can conclude that certain weather conditions have some effect on whether a trip will result in property damage (class 1) or injury (class 2). Although rainy weather affects the probability of injury, other factors may also play a role, such as particular parts of the city, blind spots, and lighting conditions. It may be wise to change speed limits or to raise drivers' awareness of dangerous driving conditions, and to improve the models accordingly.