# Car accident severity
## IBM Applied Data Science Capstone Project
### By LN-Coursera

## Table of Contents:
* Introduction/Business Problem
* Data
* Methodology
* Results
* Discussion
* Conclusion

## Introduction/Business Problem:
To help reduce the frequency of car collisions, I use the collision data provided for the Capstone. In this project I predict the severity of car accidents from the current weather, road and visibility conditions. The goal is to alert drivers when the current conditions are bad ('SEVERITYCODE' > 0), so that they can drive more carefully.

## Data
### Data requirements
For this problem we need a specific set of data. The dataset should be large, so that we can train our models as well as possible and predict the severity of car accidents as precisely as possible, helping to prevent accidents in the future.
### Data description
The target variable for the Capstone is 'SEVERITYCODE'. It measures the severity of a car accident on a scale from 0 to 4 within the dataset. To weigh the severity of a car accident, the following attributes are used: 'WEATHER', 'ROADCOND' and 'LIGHTCOND'.


Severity codes are as follows:


0 : Little to no Probability (Clear Conditions)

1 : Very Low Probability - Chance of Property Damage

2 : Low Probability - Chance of Injury

3 : Mild Probability - Chance of Serious Injury

4 : High Probability - Chance of Fatality


Other important attributes are:


OBJECTID: ESRI unique identifier

ADDRTYPE: Collision address type: Alley, Block, Intersection

LOCATION: Description of the general location of the collision

COLLISIONTYPE: Collision type

PERSONCOUNT: The total number of people involved in the collision

PEDCOUNT: The number of pedestrians involved in the collision

PEDCYLCOUNT: The number of bicycles involved in the collision

VEHCOUNT: The number of vehicles involved in the collision

INCDTTM: The date and time of the incident

PEDROWNOTGRNT: Whether or not the pedestrian right of way was not granted

SPEEDING: Whether or not speeding was a factor in the collision


In its original form, this data is not fit for analysis. For one, there are many columns that we will not use for this model. Also, most of the features are of type object, when they should be of a numerical type.

We must use label encoding to convert the features to our desired data type.
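
As a minimal sketch of what label encoding via pandas categories does, the toy column below stands in for 'ROADCOND' (the values are illustrative, not taken from the dataset). It assumes the same `astype('category')` and `.cat.codes` approach used later in this notebook: categories are sorted alphabetically and missing values receive the code -1.

```python
import pandas as pd

# Toy column standing in for 'ROADCOND'; the values are illustrative only.
road = pd.Series(["Wet", "Dry", None, "Ice", "Dry"], dtype="category")

# Categories are sorted alphabetically; each label's position is its code.
mapping = dict(enumerate(road.cat.categories))
codes = road.cat.codes.tolist()

print(mapping)  # {0: 'Dry', 1: 'Ice', 2: 'Wet'}
print(codes)    # [2, 0, -1, 1, 0]  (the missing value becomes -1)
```

Note that the resulting integers carry an alphabetical order that has no physical meaning, which distance-based models may still interpret as a numeric scale.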

### Let's start!

In [44]:
# The code was removed by Watson Studio for sharing.



Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [2]:
# Drop all unnecessary columns for this project
colData = df.drop(columns = ['OBJECTID', 'SEVERITYCODE.1', 'REPORTNO', 'INCKEY', 'COLDETKEY', 
              'X', 'Y', 'STATUS','ADDRTYPE',
              'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
              'EXCEPTRSNDESC', 'SEVERITYDESC', 'INCDATE',
              'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE',
              'SDOT_COLDESC', 'PEDROWNOTGRNT', 'SDOTCOLNUM',
              'ST_COLCODE', 'ST_COLDESC', 'SEGLANEKEY',
              'CROSSWALKKEY', 'HITPARKEDCAR', 'PEDCOUNT', 'PEDCYLCOUNT',
              'PERSONCOUNT', 'VEHCOUNT', 'COLLISIONTYPE',
              'SPEEDING', 'UNDERINFL', 'INATTENTIONIND'])

# Label Encoding
# Convert column to category
colData["WEATHER"] = colData["WEATHER"].astype('category')
colData["ROADCOND"] = colData["ROADCOND"].astype('category')
colData["LIGHTCOND"] = colData["LIGHTCOND"].astype('category')

# Assign variable to new column for analysis
colData["WEATHER_CAT"] = colData["WEATHER"].cat.codes
colData["ROADCOND_CAT"] = colData["ROADCOND"].cat.codes
colData["LIGHTCOND_CAT"] = colData["LIGHTCOND"].cat.codes

colData.head(5)

Unnamed: 0,SEVERITYCODE,WEATHER,ROADCOND,LIGHTCOND,WEATHER_CAT,ROADCOND_CAT,LIGHTCOND_CAT
0,2,Overcast,Wet,Daylight,4,8,5
1,1,Raining,Wet,Dark - Street Lights On,6,8,2
2,1,Overcast,Dry,Daylight,4,0,5
3,1,Clear,Dry,Daylight,1,0,5
4,2,Raining,Wet,Daylight,6,8,5


In [3]:
colData.dtypes

SEVERITYCODE        int64
WEATHER          category
ROADCOND         category
LIGHTCOND        category
WEATHER_CAT          int8
ROADCOND_CAT         int8
LIGHTCOND_CAT        int8
dtype: object

In [4]:
# Analyze Value Counts
colData["SEVERITYCODE"].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64

In [5]:
colData["WEATHER"].value_counts()

Clear                       111135
Raining                      33145
Overcast                     27714
Unknown                      15091
Snowing                        907
Other                          832
Fog/Smog/Smoke                 569
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               56
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64

In [6]:
colData["ROADCOND"].value_counts()

Dry               124510
Wet                47474
Unknown            15078
Ice                 1209
Snow/Slush          1004
Other                132
Standing Water       115
Sand/Mud/Dirt         75
Oil                   64
Name: ROADCOND, dtype: int64

In [7]:
colData["LIGHTCOND"].value_counts()

Daylight                    116137
Dark - Street Lights On      48507
Unknown                      13473
Dusk                          5902
Dawn                          2502
Dark - No Street Lights       1537
Dark - Street Lights Off      1199
Other                          235
Dark - Unknown Lighting         11
Name: LIGHTCOND, dtype: int64

#### Balance the Dataset
As the value counts above show, the dataset is unbalanced: class 1 has more than twice as many records as class 2 (136,485 vs. 58,188).
To balance the data, we downsample the majority class to the size of the minority class.
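
The arithmetic behind this step can be sketched in a few lines. The counts are the ones printed by `value_counts()` above; the variable names are introduced here for illustration only.

```python
# Class counts as reported by value_counts() on SEVERITYCODE.
counts = {1: 136485, 2: 58188}

majority = max(counts, key=counts.get)
minority = min(counts, key=counts.get)
ratio = counts[majority] / counts[minority]

# Downsampling keeps every minority row and an equal number of majority rows.
balanced_total = 2 * counts[minority]
discarded = counts[majority] - counts[minority]

print(f"imbalance ratio: {ratio:.2f}")           # imbalance ratio: 2.35
print(f"rows after balancing: {balanced_total}")  # rows after balancing: 116376
print(f"majority rows discarded: {discarded}")    # majority rows discarded: 78297
```

Downsampling is simple, but it throws away a large share of the majority class; that trade-off is worth keeping in mind when reading the results.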

In [9]:
from sklearn.utils import resample

In [43]:
# Separate the different classes
colData_majority = colData[colData.SEVERITYCODE == 1]
colData_minority = colData[colData.SEVERITYCODE == 2]

# Downsample majority class
colData_majority_downsampled = resample(colData_majority,
                                        replace = False,
                                        n_samples = 58188,
                                        random_state = 123)

# Combine minority class with downsampled majority class
colData_balanced = pd.concat([colData_majority_downsampled, colData_minority])

# Display new class counts
colData_balanced.SEVERITYCODE.value_counts()

2    58188
1    58188
Name: SEVERITYCODE, dtype: int64

### Perfectly balanced - as all things should be ;)

## Methodology
Now that the dataset has been analyzed and cleaned, it is ready to be fed into ML models.
We will use the models we already know from the previous chapters:

* K-Nearest Neighbor -> predicts the severity code of a data point from the most common class among its k nearest (most similar) data points
* Decision Tree -> splits the data on feature values into a tree of decisions whose leaves assign a severity code
* Logistic Regression -> estimates the probability of each of our two severity codes (1 or 2) and predicts the more likely one
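
To make the K-Nearest Neighbor idea concrete, here is a small pure-Python sketch of majority voting among the k closest points. The function `knn_predict` and the toy rows are hypothetical illustrations; the notebook itself uses scikit-learn's `KNeighborsClassifier`.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k):
    """Return the majority label among the k training points closest to query."""
    # Rank training points by Euclidean distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical encoded (weather, road, light) rows with severity labels.
train_X = [(1, 0, 5), (1, 0, 5), (6, 8, 2), (6, 8, 5), (4, 8, 2)]
train_y = [1, 1, 2, 2, 2]

print(knn_predict(train_X, train_y, query=(6, 8, 3), k=3))  # 2
```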

In [13]:
# Define X 
X = np.asarray(colData_balanced[['WEATHER_CAT', 'ROADCOND_CAT', 'LIGHTCOND_CAT']])
X[0:5]

array([[ 6,  8,  2],
       [ 1,  0,  5],
       [10,  7,  8],
       [ 1,  0,  5],
       [ 1,  0,  5]], dtype=int8)

In [14]:
# Define Y
y = np.asarray(colData_balanced['SEVERITYCODE'])
y [0:5]

array([1, 1, 1, 1, 1])

In [15]:
# Normalize Dataset
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]



array([[ 1.15236718,  1.52797946, -1.21648407],
       [-0.67488   , -0.67084969,  0.42978835],
       [ 2.61416492,  1.25312582,  2.07606076],
       [-0.67488   , -0.67084969,  0.42978835],
       [-0.67488   , -0.67084969,  0.42978835]])

In [42]:
# Train/Test Split
# 70% Train Data; 30% Test Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (81463, 3) (81463,)
Test set: (34913, 3) (34913,)
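
The cells above normalize the full feature matrix before splitting it. A common alternative, shown here only as a sketch on synthetic data (not what this notebook does), is to fit the scaler on the training rows alone and reuse it for the test rows, so that no test-set statistics influence training. All names ending in `_raw`, `_tr` and `_te` are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three encoded feature columns (hypothetical data).
rng = np.random.default_rng(0)
X_raw = rng.integers(0, 9, size=(200, 3)).astype(float)
y_raw = rng.integers(1, 3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.3, random_state=4)

# Fit the scaler on the training rows only, then apply it to both splits.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Training columns end up with mean ~0 and standard deviation ~1.
print(X_tr_scaled.mean(axis=0).round(6))
print(X_tr_scaled.std(axis=0).round(6))
```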


#### KNN

In [17]:
from sklearn.neighbors import KNeighborsClassifier
k = 25

In [41]:
# Train and predict model
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train, y_train)
neigh

Kyhat = neigh.predict(X_test)
Kyhat[0:5]

array([2, 2, 1, 1, 2])

#### Decision Tree

In [40]:
from sklearn.tree import DecisionTreeClassifier
colDataTree = DecisionTreeClassifier(criterion = "entropy", max_depth = 7)
colDataTree
colDataTree.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=7,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [21]:
# Predict with the fitted tree; DTyhat is reused in the Results section
DTyhat = colDataTree.predict(X_test)
print (DTyhat [0:5])
print (y_test [0:5])

[2 2 1 1 2]
[2 2 1 1 1]


#### Logistic Regression

In [36]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C = 6, solver = 'liblinear').fit(X_train, y_train)
LR

LogisticRegression(C=6, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [23]:
# Predict on the test set with the fitted model
LRyhat = LR.predict(X_test)
LRyhat

array([1, 2, 1, ..., 2, 2, 2])

In [24]:
yhat_prob = LR.predict_proba(X_test)
yhat_prob

array([[0.57295252, 0.42704748],
       [0.47065071, 0.52934929],
       [0.67630201, 0.32369799],
       ...,
       [0.46929132, 0.53070868],
       [0.47065071, 0.52934929],
       [0.46929132, 0.53070868]])

## Results
We check the accuracy of our models using the Jaccard Similarity Score and the F1-Score. For Logistic Regression we also report the LogLoss.
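
As a rough illustration of what these metrics compute, the sketch below implements accuracy and macro-averaged F1 in plain Python and applies them to the five test labels and decision-tree predictions printed earlier. The helper functions are written for this example; the notebook itself relies on `sklearn.metrics`. For binary and multiclass labels, the legacy `jaccard_similarity_score` is documented as equivalent to accuracy, which is why accuracy is shown here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# First five test labels and decision-tree predictions shown earlier.
y_true = [2, 2, 1, 1, 1]
y_pred = [2, 2, 1, 1, 2]

print(accuracy(y_true, y_pred))  # 0.8
print(f1_macro(y_true, y_pred))
```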

In [25]:
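# Note: jaccard_similarity_score was removed in scikit-learn 0.23; for these labels it equals accuracy_score
from sklearn.metrics import jaccard_similarity_score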
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

#### KNN

In [28]:
# Jaccard
jaccard_similarity_score(y_test, Kyhat)

0.564001947698565

In [38]:
# F1
f1_score(y_test, Kyhat, average = 'macro')

0.5401775308974308

Of the values tried, k = 25 gave the most accurate result.

#### Decision Tree

In [31]:
# Jaccard
jaccard_similarity_score(y_test, DTyhat)

0.5664365709048206

In [37]:
# F1
f1_score(y_test, DTyhat, average = 'macro')

0.5450597937389444

Of the values tried, max_depth = 7 gave the most accurate result.

#### Logistic Regression

In [33]:
# Jaccard
jaccard_similarity_score(y_test, LRyhat)

0.5260218256809784

In [39]:
# F1
f1_score(y_test, LRyhat, average = 'macro')

0.511602093963383

In [35]:
# LogLoss
yhat_prob = LR.predict_proba(X_test)
log_loss(y_test, yhat_prob)

0.6849535383198887

Of the values tried, C = 6 gave the most accurate result.
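
To put the LogLoss value in context, the sketch below computes binary log loss by hand. The function `binary_log_loss` is a hypothetical helper, not part of scikit-learn. A model that always predicts a probability of 0.5 scores ln(2), about 0.693, so the 0.685 reported above is only slightly better than that reference.

```python
import math

def binary_log_loss(y_true, prob_positive, positive_label=2):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, prob_positive):
        # Use p for the positive class and 1 - p for the other class.
        total += -math.log(p if y == positive_label else 1.0 - p)
    return total / len(y_true)

# A constant 0.5 prediction gives ln(2), the chance-level reference.
baseline = binary_log_loss([1, 2, 1, 2], [0.5, 0.5, 0.5, 0.5])
print(round(baseline, 4))  # 0.6931
```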

## Discussion
At the start, we converted the features we needed from type 'object' to 'int8', a numerical data type, so that the algorithms could use them.
Next, we had to deal with the unbalanced data. We downsampled the majority class (class 1) until both classes had the same number of records.
We then prepared the data (feature selection, normalization and a train/test split) for the three machine learning models: K-Nearest Neighbor, Decision Tree and Logistic Regression.
Finally, to see how accurate our models were, we used the Jaccard Similarity Index and the F1-Score, plus LogLoss for Logistic Regression. To improve accuracy, the main parameter of each model (k, max_depth, C) was adjusted several times to find the best possible result.
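
The parameter adjustment described above can also be done in a loop. The following is a minimal sketch for k, run on synthetic data from `make_classification` because the notebook's own `X` and `y` are not reproduced here; the candidate range and all variable names are assumptions. Ideally the comparison uses a validation split that is separate from the final test set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with three features (hypothetical, for illustration).
X_demo, y_demo = make_classification(n_samples=600, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.3, random_state=4)

scores = {}
for k in range(1, 31, 2):  # candidate values of k (arbitrary range)
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = accuracy_score(y_val, model.predict(X_val))

# Pick the k with the highest validation accuracy.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The same loop works for `max_depth` or `C` by swapping in the corresponding estimator and parameter.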

## Conclusion
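Based on the historical collision data, weather, road and light conditions appear to have some impact on whether a car accident results in property damage (class 1) or injury (class 2). The effect captured by our models is modest, however: on the balanced test set the Jaccard scores range from roughly 0.53 to 0.57, and the LogLoss of 0.685 is close to the value of a constant 50/50 prediction.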