# Predicting Whether a Crash Will Be Fatal

Here we predict whether a crash will be fatal.

Steps:
1. Create two new DataFrames:
    - one with only the fatal crashes
    - one that samples from all non-fatal crashes in the original DataFrame
2. Concatenate both DataFrames
3. Add a new column holding 0 (non-fatal crash) or 1 (fatal crash)
4. Clean up the data and create dummy variables for the categorical columns
5. Try ML
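
Steps 1 to 3 could be sketched as follows. This is a minimal illustration on a made-up toy frame, not the notebook's actual implementation: the `Weather` values and the names `toy`, `fatal_df`, `non_fatal_df`, and `balanced` are hypothetical, and it assumes the non-fatal sample is drawn to match the number of fatal crashes.

```python
import pandas as pd

# Toy stand-in for the crash data (values are invented for illustration).
toy = pd.DataFrame({
    "ACRS Report Type": ["Property Damage Crash"] * 6
                        + ["Injury Crash"] * 3
                        + ["Fatal Crash"] * 2,
    "Weather": ["CLEAR", "RAINING"] * 5 + ["CLEAR"],
})

# Step 1: one frame with the fatal crashes, one sampled from the non-fatal ones.
is_fatal = toy["ACRS Report Type"] == "Fatal Crash"
fatal_df = toy[is_fatal]
# Assumption: sample as many non-fatal rows as there are fatal rows.
non_fatal_df = toy[~is_fatal].sample(n=len(fatal_df), random_state=0)

# Step 2: stack the two frames.
balanced = pd.concat([fatal_df, non_fatal_df], ignore_index=True)

# Step 3: binary target, 1 for a fatal crash and 0 otherwise.
balanced["Fatal"] = (balanced["ACRS Report Type"] == "Fatal Crash").astype(int)

print(balanced["Fatal"].value_counts())
```

Matching the sample size to the fatal count gives a balanced set; a larger non-fatal sample would be an equally valid choice if more training rows are wanted.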

### Columns to use for ML
- 'Road Grade', 
- 'Collision Type', 
- 'Weather', 
- 'Surface Condition', 
- 'Driver Substance Abuse', 
- 'Road Condition', 
- 'Fixed Objec Struck', 
- 'Junction', 
- 'Intersection Type', 
- 'Road Alignment', 
- 'Road Division',
- 'Traffic Control',
- 'Number of Lanes'

### Columns to delete
- 'Report Number', 
- 'Local Case Number', 
- 'Agency Name', 
- 'Mile Point', 
- 'Mile Point Direction', 
- 'Lane Direction', 
- 'Direction', 
- 'Distance', 
- 'Distance Unit', 
- 'Road Name', 
- 'Off-Road Description', 
- 'First Harmful Event', 
- 'Latitude', 
- 'Longitude', 
- 'Location', 
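
Dropping the unwanted columns and building dummy variables (step 4) might look like the sketch below. The column names are taken from the lists above, but the rows are invented and only a few columns are shown; `toy`, `cols_to_delete`, and `ml_cols` are hypothetical names.

```python
import pandas as pd

# Invented rows; only a handful of the listed columns are included.
toy = pd.DataFrame({
    "Report Number": ["A1", "A2", "A3"],
    "Latitude": [39.0, 39.1, 39.2],
    "Weather": ["CLEAR", "RAINING", "CLEAR"],
    "Surface Condition": ["DRY", "WET", "DRY"],
})

cols_to_delete = ["Report Number", "Latitude"]
ml_cols = ["Weather", "Surface Condition"]

# errors="ignore" skips any listed column that is not present in the frame.
cleaned = toy.drop(columns=cols_to_delete, errors="ignore")

# One 0/1 indicator column per category value.
dummies = pd.get_dummies(cleaned[ml_cols], dtype=int)

print(dummies.columns.tolist())
```

Each row ends up with exactly one indicator set per original column, so no artificial ordering is imposed on the categories.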

In [70]:
# import libraries 

import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt 
plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)

In [71]:
# reading in data 
data = pd.read_csv('data/00-crash_reporting_incidents.csv', low_memory = False)
data.shape

(47592, 44)

In [72]:
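# create the target column 'Fatal' by encoding 'ACRS Report Type' as integers;
# factorize() numbers categories in order of first appearance, so this yields
# three codes (one per report type), not a binary fatal / non-fatal flag
data['Fatal'] = data['ACRS Report Type'].factorize()[0]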

In [73]:
print(data['Fatal'].value_counts())
print(data['ACRS Report Type'].value_counts())

0    30442
1    17031
2      119
Name: Fatal, dtype: int64
Property Damage Crash    30442
Injury Crash             17031
Fatal Crash                119
Name: ACRS Report Type, dtype: int64


In [74]:
data['Road Condition'] = data['Road Condition'].factorize()[0]
data['Road Alignment'] = data['Road Alignment'].factorize()[0]
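
One thing to keep in mind with `factorize()`: the integer codes follow the order in which values first appear, and missing values are coded as -1. A small sketch on made-up values (not necessarily categories from this dataset):

```python
import pandas as pd

# Hypothetical category values, with one missing entry.
s = pd.Series(["NO DEFECTS", None, "HOLES RUTS ETC", "NO DEFECTS"])

codes, uniques = s.factorize()

print(codes.tolist())    # [0, -1, 1, 0]
print(uniques.tolist())  # ['NO DEFECTS', 'HOLES RUTS ETC']
```

The codes carry no meaningful order, yet a tree splits on them as if they did; the dummy variables planned in step 4 avoid that.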

# Decision Tree Classifier Using 2 Features

In [75]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [76]:
feature_cols = ['Road Condition', 'Road Alignment']

X = data[feature_cols]
y = data.Fatal

In [77]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    stratify = y,
                                                    random_state = 3)

In [78]:
dt = DecisionTreeClassifier(criterion = 'gini', random_state = 1)

In [79]:
dt.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=1, splitter='best')

In [80]:
y_pred = dt.predict(X_test)

In [81]:
accuracy_score(y_test, y_pred)

0.6400882445635045
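
For context, this accuracy can be compared with the share of the most common class. A minimal sketch using only the class counts printed earlier (the names `counts` and `majority_share` are illustrative, not from the notebook):

```python
import pandas as pd

# Class counts copied from the value_counts() output earlier in the notebook.
counts = pd.Series({0: 30442, 1: 17031, 2: 119}, name="Fatal")

# Accuracy of a model that always predicts the most frequent class.
majority_share = counts.max() / counts.sum()

print(round(majority_share, 4))  # 0.6396
```

Always predicting class 0 (property damage) would score about 0.6396 on the full data set, so the tree's 0.6401 is barely above that baseline. Because the split is stratified, the test set has roughly the same class proportions.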

In [82]:
# importance of each feature, in the same order as feature_cols
ft_importance = dt.feature_importances_
ft_importance

array([0.05353844, 0.94646156])

In [83]:
# pair each feature name with its importance score
list(zip(feature_cols, dt.feature_importances_))

[('Road Condition', 0.05353843927955279),
 ('Road Alignment', 0.9464615607204472)]