This is some simple code that tries to predict the severity of an accident from a basic set of two features.

As a first step we will import some required libraries. We will limit this to Pandas and to a bunch of scikit-learn functions. 

In [104]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc,roc_auc_score,roc_curve

We can read the data in one step. 

In [110]:
data = pd.read_csv("cleaned.csv",encoding = "iso8859_2")
data.head(3)

Unnamed: 0.1,Unnamed: 0,Reference Number,Easting,Northing,Number of Vehicles,Accident Date,Time (24hr),Hour of Day,Day of Week,1st Road Class,Road Surface,Lighting Conditions,Weather Conditions,Casualty Class,Casualty Severity,Sex of Casualty,Age of Casualty,Type of Vehicle
0,0,202609,421937,443972,2,14-Mar-09,2330,23,6,Unclassified,Dry,Darkness: no street lighting,Fine with high winds,Driver/Rider,Slight,Male,30,Car
1,1,202609,421937,443972,2,14-Mar-09,2330,23,6,Unclassified,Dry,Darkness: no street lighting,Fine with high winds,Driver/Rider,Slight,Female,20,Car
2,2,810209,441193,448825,1,03-Oct-09,630,6,6,A(M),Dry,Darkness: no street lighting,Fine with high winds,Driver/Rider,Slight,Male,29,Car


Wow, quite a lot of features. But say that we are convinced that we might need just a couple of them and a simple model...

In [157]:
target = data["Casualty Severity"]
X =  data[["Hour of Day","Road Surface"]]

from collections import Counter
Counter(target)

Counter({'Serious': 2192, 'Slight': 16691})

The vast majority are "Slight" severity incidents. Our objective is to predict whether an incident will be "Serious" or not. We have chosen to use only two features: the condition of the road surface and the hour of the day.
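To put a number on that imbalance, here is a minimal sketch that uses only the two counts printed above. The variable names (`counts`, `shares`, `baseline_accuracy`) are mine and purely illustrative. It shows that a "model" which always answers with the majority class would already be right about 88% of the time on the full dataset.

```python
from collections import Counter

# Class counts copied from the Counter output above.
counts = Counter({"Serious": 2192, "Slight": 16691})

total = sum(counts.values())
majority_label, majority_count = counts.most_common(1)[0]

# Share of each class in the full dataset.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = majority_count / total

print(shares)
print(majority_label, round(baseline_accuracy, 3))
```

Any accuracy figure reported later is worth comparing against this baseline before celebrating.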

If there is snow and it's late at night, chances are an accident is more likely to be serious, right?

But before jumping to the modelling part I need to convert "Road Surface" from a categorical feature to a set of binary dummy features.

In [158]:
dummies  = pd.get_dummies(X["Road Surface"])
X = pd.concat([X["Hour of Day"], dummies], axis=1)
X.head(3)

Unnamed: 0,Hour of Day,Dry,Flood (surface water over 3cm deep),Frost / Ice,Snow,Wet / Damp
0,23,1,0,0,0,0
1,23,1,0,0,0,0
2,6,1,0,0,0,0


Looks cool! Now I'm all set to train my model. Since I have a categorical response variable - "Slight"/"Serious" - I'll use the scikit-learn implementation of Logistic Regression.

In [159]:
# Create train test split first
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.15)

lr = LogisticRegression()
lr_fit = lr.fit(X_train,y_train)

preds = lr.predict(X_test)

All looks fine, so let's see what the accuracy of my model is!

In [160]:
print("Accuracy Score is, ",lr.score(X_test,y_test))

Accuracy Score is,  0.886339569361


Awesome!!! But wait...let me check something.

In [161]:
print("Predictions Distribution: ",Counter(preds))
print("-------------------")
print("Test Set Label Distribution: ",Counter(y_test))

Predictions Distribution:  Counter({'Slight': 2833})
-------------------
Test Set Label Distribution:  Counter({'Slight': 2511, 'Serious': 322})


Arghh, the accuracy is completely misleading: the model always predicts the same thing, and the accuracy does nothing more than reflect my class distribution.

In fact 2511/(2511+322) is approximately 0.886, which is exactly my accuracy score.
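The same arithmetic can be checked by hand from the counts printed in the test-set output. This is only a sketch: the lists below are rebuilt from those counts rather than taken from the real `y_test` and `preds`, and `recall_serious` is a name I made up for the fraction of "Serious" cases the model actually catches.

```python
# Labels rebuilt from the counts in the output above (not the real arrays).
y_true = ["Slight"] * 2511 + ["Serious"] * 322
y_pred = ["Slight"] * len(y_true)  # the model predicted "Slight" every time

# Accuracy: fraction of predictions that match the true label.
correct = sum(p == t for p, t in zip(y_pred, y_true))
accuracy = correct / len(y_true)

# Recall for "Serious": fraction of truly serious cases predicted as serious.
serious_total = sum(t == "Serious" for t in y_true)
serious_found = sum(
    p == "Serious" and t == "Serious" for p, t in zip(y_pred, y_true)
)
recall_serious = serious_found / serious_total

print(f"accuracy: {accuracy:.6f}")
print(f"recall for 'Serious': {recall_serious:.3f}")
```

The accuracy matches the score reported earlier, while the recall for the class I actually care about is zero, which is the real story here.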

I'll have to try much harder than that!