## Business Problem

The objective of this project is to predict the severity of an accident from the driver's traveling conditions and to examine how strongly the two are related. It aims to determine whether accident severity is associated with the weather, the condition of the road, or speeding.



## Data 

The data is provided by Coursera for the IBM Data Science course. It records accidents/collisions in Seattle from 2004 to the present. The dataset consists of about 194,670 rows (observations) and 37 attributes.

## Methodology

Two key variables are selected as predictors for the model: road conditions [ROADCOND] and speeding [SPEEDING]. The target variable is the severity code [SEVERITYCODE].

The libraries used are pandas, NumPy, and scikit-learn.

The dataset is split into a training set (80%) and a testing set (20%).

Models are built and evaluated using a decision tree, logistic regression, and the k-nearest neighbors (KNN) method.


In [255]:
import pandas as pd
import numpy as np

In [256]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

### Import Dataset

In [257]:
Data = pd.read_csv("Data-Collisions.csv")
Data



Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.703140,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.334540,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194668,2,-122.290826,47.565408,219543,309534,310814,E871089,Matched,Block,,...,Dry,Daylight,,,,24,From opposite direction - both moving - head-on,0,0,N
194669,1,-122.344526,47.690924,219544,309085,310365,E876731,Matched,Block,,...,Wet,Daylight,,,,13,From same direction - both going straight - bo...,0,0,N
194670,2,-122.306689,47.683047,219545,311280,312640,3809984,Matched,Intersection,24760.0,...,Dry,Daylight,,,,28,From opposite direction - one left turn - one ...,0,0,N
194671,2,-122.355317,47.678734,219546,309514,310794,3810083,Matched,Intersection,24349.0,...,Dry,Dusk,,,,5,Vehicle Strikes Pedalcyclist,4308,0,N


### Variable Selection

In [258]:
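# Keep only the two predictors and the target. .copy() returns an independent
# DataFrame, so later assignments do not write to a slice of Data.
cols = ['ROADCOND','SPEEDING','SEVERITYCODE']
df = Data[cols].copy()
df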

Unnamed: 0,ROADCOND,SPEEDING,SEVERITYCODE
0,Wet,,2
1,Wet,,1
2,Dry,,1
3,Dry,,1
4,Wet,,2
...,...,...,...
194668,Dry,,2
194669,Wet,,1
194670,Dry,,2
194671,Dry,,2
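
Before filling the gaps, it can help to see how many values are missing and which categories occur. The following is a minimal sketch on a small made-up frame; `toy`, `missing_counts`, and `roadcond_counts` are hypothetical names, and the values are illustrative rather than taken from the Seattle data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the three selected columns (illustrative values only).
toy = pd.DataFrame({
    "ROADCOND": ["Wet", "Dry", np.nan, "Dry", "Ice"],
    "SPEEDING": [np.nan, "Y", np.nan, np.nan, "Y"],
    "SEVERITYCODE": [2, 1, 1, 1, 2],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()
print(missing_counts)

# Frequency of every road-condition category, with NaN kept as its own bucket.
roadcond_counts = toy["ROADCOND"].value_counts(dropna=False)
print(roadcond_counts)
```

Running the same two calls on `df` would show how much of `SPEEDING` is empty before it is filled with `'N'`.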


In [259]:
df['ROADCOND'] = df['ROADCOND'].fillna('Unknown')
df['SPEEDING'] = df['SPEEDING'].fillna('N')
df



Unnamed: 0,ROADCOND,SPEEDING,SEVERITYCODE
0,Wet,N,2
1,Wet,N,1
2,Dry,N,1
3,Dry,N,1
4,Wet,N,2
...,...,...,...
194668,Dry,N,2
194669,Wet,N,1
194670,Dry,N,2
194671,Dry,N,2
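
The two fills above can also be written as a single call by passing a column-to-value dictionary to `DataFrame.fillna`. This is a sketch on made-up data, not the notebook's own cell; `toy` and `filled` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the selected columns (illustrative values only).
toy = pd.DataFrame({
    "ROADCOND": ["Wet", np.nan, "Dry"],
    "SPEEDING": [np.nan, "Y", np.nan],
    "SEVERITYCODE": [2, 1, 1],
})

# One default per column; columns not listed in the dict are left untouched.
filled = toy.fillna({"ROADCOND": "Unknown", "SPEEDING": "N"})
print(filled)
```

Because `fillna` returns a new frame here, the original `toy` keeps its missing values, which makes it easy to compare before and after.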


### Assign New Values

In [260]:
# Collapse the road-condition categories into two groups.
roadcond_groups = {'Wet':'Dangerous', 'Dry':'Safe', 'Unknown':'Safe', 'Snow/Slush':'Dangerous', 'Ice':'Dangerous',
                   'Other':'Safe', 'Sand/Mud/Dirt':'Dangerous', 'Standing Water':'Dangerous', 'Oil':'Dangerous'}
df['ROADCOND'] = df['ROADCOND'].replace(roadcond_groups)
df



Unnamed: 0,ROADCOND,SPEEDING,SEVERITYCODE
0,Dangerous,N,2
1,Dangerous,N,1
2,Safe,N,1
3,Safe,N,1
4,Dangerous,N,2
...,...,...,...
194668,Safe,N,2
194669,Dangerous,N,1
194670,Safe,N,2
194671,Safe,N,2
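
A grouping like the one above silently passes through any category that is not listed. One way to catch that, sketched here under the assumption that the grouping is held in a dictionary, is to use `Series.map`, which yields NaN for unlisted values. `ROADCOND_GROUPS`, `toy_roadcond`, and the `"Gravel"` category are hypothetical and only serve the illustration.

```python
import pandas as pd

# Same grouping as in the cell above, written as a dictionary.
ROADCOND_GROUPS = {
    "Wet": "Dangerous", "Dry": "Safe", "Unknown": "Safe",
    "Snow/Slush": "Dangerous", "Ice": "Dangerous", "Other": "Safe",
    "Sand/Mud/Dirt": "Dangerous", "Standing Water": "Dangerous",
    "Oil": "Dangerous",
}

# "Gravel" is a made-up category that the dictionary does not cover.
toy_roadcond = pd.Series(["Wet", "Dry", "Ice", "Unknown", "Gravel"])

# map() returns NaN for any value missing from the dictionary.
grouped = toy_roadcond.map(ROADCOND_GROUPS)

# List the original categories that were left without a group.
unmapped = toy_roadcond[grouped.isna()].unique().tolist()
print(unmapped)
```

An empty `unmapped` list would confirm that every category in the column has been assigned to a group.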


In [261]:
# Encode both predictors as integers: SPEEDING N/Y -> 0/1, ROADCOND Dangerous/Safe -> 0/1.
df['SPEEDING'] = df['SPEEDING'].replace({'N': 0, 'Y': 1}).astype(int)
df['ROADCOND'] = df['ROADCOND'].replace({'Dangerous': 0, 'Safe': 1}).astype(int)
testdf = df[['SPEEDING','ROADCOND']]
testdf.head()

Unnamed: 0,SPEEDING,ROADCOND
0,0,0
1,0,0
2,0,1
3,0,1
4,0,0


### Training model 

In [266]:
x = testdf
y = df['SEVERITYCODE'].values.astype(str)
x = preprocessing.StandardScaler().fit(x).transform(x)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1234)

print("Training set: ", x_train.shape, y_train.shape)
print("Testing set: ", x_test.shape, y_test.shape)

Training set:  (155738, 2) (155738,)
Testing set:  (38935, 2) (38935,)
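
In the cell above, the scaler is fitted on the full dataset before the split. A common alternative, sketched here on synthetic data, is to split first and let a scikit-learn pipeline fit the scaler on the training portion only, with `stratify` keeping the class proportions equal in both parts. `X_toy`, `y_toy`, and `pipe` are hypothetical names, and the random data is not meant to resemble the collision records.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: two binary features and an imbalanced two-class target.
rng = np.random.default_rng(1234)
X_toy = rng.integers(0, 2, size=(1000, 2)).astype(float)
y_toy = np.where(rng.random(1000) < 0.3, "2", "1")

# Split first; stratify keeps the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1234, stratify=y_toy
)

# The pipeline fits the scaler on the training data only, then the classifier.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.01, solver="liblinear"),
)
pipe.fit(X_tr, y_tr)

# score() applies the already-fitted scaler to the test data before predicting.
toy_accuracy = pipe.score(X_te, y_te)
print(X_tr.shape, X_te.shape, toy_accuracy)
```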


#### Tree Model

In [267]:
Tree_model = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
Tree_model.fit(x_train, y_train)
predicted = Tree_model.predict(x_test)
Tree_f1 = f1_score(y_test, predicted, average='weighted')
Tree_acc = accuracy_score(y_test, predicted)



#### Logistic Regression

In [268]:
LR_model = LogisticRegression(C=0.01, solver='liblinear').fit(x_train, y_train)
predicted = LR_model.predict(x_test)
LR_f1 = f1_score(y_test, predicted, average='weighted')
LR_acc = accuracy_score(y_test, predicted)



#### KNN model

In [269]:
KNN_model = KNeighborsClassifier(n_neighbors = 4).fit(x_train, y_train)
predicted = KNN_model.predict(x_test)
KNN_f1 = f1_score(y_test, predicted, average='weighted')
KNN_acc = accuracy_score(y_test, predicted)
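
Accuracy and a weighted F1-score alone do not show how each severity class is handled. A confusion matrix and the already imported `classification_report` make that visible. The sketch below uses made-up labels in which the model predicts class `"1"` every time; `y_true_toy` and `y_pred_toy` are hypothetical, and `zero_division=0` assumes a scikit-learn version that supports that parameter.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels: the "model" predicts severity "1" for every case.
y_true_toy = ["1", "1", "1", "2", "2", "1", "2", "1"]
y_pred_toy = ["1"] * 8

# Rows are true classes, columns are predicted classes, in the order of labels.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=["1", "2"])
print(cm)

# Per-class precision, recall, and F1; zero_division=0 reports 0 instead of
# warning when a class is never predicted.
report = classification_report(
    y_true_toy, y_pred_toy, labels=["1", "2"], zero_division=0
)
print(report)
```

Replacing the toy lists with `y_test` and a model's `predicted` array would show whether severity class 2 is ever predicted.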

## Result

In [271]:
results = {
    "Method of Analysis": ["Decision Tree", "LogisticRegression","KNN"],
    "F1-score": [Tree_f1, LR_f1, KNN_f1],
    "Accuracy": [Tree_acc, LR_acc, KNN_acc,]
}

results = pd.DataFrame(results)
results

| | Method of Analysis | F1-score | Accuracy |
|---|---|---|---|
| 0 | Decision Tree | 0.576051 | 0.699679 |
| 1 | LogisticRegression | 0.576051 | 0.699679 |
| 2 | KNN | 0.591378 | 0.696751 |
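
The identical decision tree and logistic regression scores can be checked against a simple reference point. If a model assigns every test case to the most common class, and that class makes up a share p of the test set, then accuracy is p and the weighted F1-score is p · 2p / (1 + p), because the never-predicted class contributes an F1 of 0. The helper below is a hypothetical illustration of that arithmetic, not part of the original notebook.

```python
def weighted_f1_constant_prediction(p):
    """Weighted F1 when every case is predicted as the majority class.

    p is the share of the majority class, which is also the accuracy.
    """
    # Majority class: precision = p, recall = 1, so F1 = 2p / (1 + p).
    f1_majority = 2 * p / (1 + p)
    # The other class is never predicted, so its F1 is 0 and it adds nothing.
    return p * f1_majority


# Accuracy reported for the decision tree and logistic regression.
expected_f1 = weighted_f1_constant_prediction(0.699679)
print(expected_f1)
```

The value agrees with the reported F1-score of 0.576051 to within rounding, which is consistent with both models predicting a single severity class for the whole test set.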


## Discussion

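The three models perform almost identically. KNN has the highest weighted F1-score (0.591 versus 0.576), but its accuracy (0.6968) is slightly lower than that of the decision tree and logistic regression (0.6997 each). The decision tree and logistic regression produce identical scores, and a weighted F1-score of 0.576 at 69.97% accuracy is what a model obtains by assigning every test case to the most common severity class. These two models therefore add little beyond a majority-class baseline.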

## Conclusion

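Based on these results, road conditions and speeding carry only limited information about accident severity: none of the three models reaches more than about 70% accuracy, and KNN improves the weighted F1-score only slightly. Weather was not included as a predictor, so no conclusion can be drawn about it here. Because the dataset contains only recorded collisions, the analysis speaks to how severe an accident is once it happens, not to how likely an accident is to occur.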
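
One direct way to inspect the relationship between a condition and severity, independent of any model, is a row-normalized cross-tabulation. The sketch below uses a small made-up frame; `toy` and `severity_share` are hypothetical names, and the proportions are illustrative only.

```python
import pandas as pd

# Made-up data: SPEEDING encoded as 0/1, SEVERITYCODE as 1/2.
toy = pd.DataFrame({
    "SPEEDING": [0, 0, 0, 0, 1, 1, 1, 1],
    "SEVERITYCODE": [1, 1, 1, 2, 1, 2, 2, 2],
})

# normalize="index" turns each row into shares that sum to 1, so the table
# shows the share of each severity class within each SPEEDING group.
severity_share = pd.crosstab(
    toy["SPEEDING"], toy["SEVERITYCODE"], normalize="index"
)
print(severity_share)
```

Applied to `df`, a noticeably higher share of severity 2 in one group than in the other would point to an association worth modeling.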