In [None]:
%load_ext autoreload
%autoreload 2

Below I import the Python libraries used in the solution:

In [None]:
import numpy as np
import pandas as pd
import logging
import os, sys
from functools import partial
import yaml
from sklearn.model_selection import train_test_split

I define the logger, which I will use to print information about the execution:

In [None]:
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%d/%m/%Y %I:%M:%S%p")

logger = logging.getLogger(__name__)

# Machine Learning Challenge - Solution

This notebook contains the solution to the challenge, with some explanation.

In [None]:
from source.featureExtractor import FeatureExtraxtor, plotHistogram
from source.model import MLmodel

I am loading the dataset from the web:

In [None]:
url = "http://data.dft.gov.uk/road-accidents-safety-data/DfTRoadSafety_Accidents_2014.zip"
dataset = pd.read_csv(url)

In the featureExtractor.py module I defined a class that contains all the methods I used to preprocess the data. Now I am instantiating an object of this class. Its constructor takes the path of the CSV file with the data (here, the URL of the dataset).

In [None]:
myFE = FeatureExtraxtor(url)

Now I will generate the features. I used a different method for each column or group of columns; see the docstrings and comments in featureExtractor.py for the details:

In [None]:
features = myFE.getFeatures()
features.head()

Generating the ground truth:

In [None]:
groundTruth = myFE.getGroundTruth()

Merging everything together:

In [None]:
train_set = pd.merge(groundTruth, features).set_index('Accident_Index')
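
The merge above relies on pandas joining on the columns the two frames have in common. A minimal sketch of a more explicit variant is shown here on toy data: the `Latitude` and `Longitude` column names and all values are assumptions made for illustration, not the real output of `getFeatures()`.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration).
toy_truth = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A3"],
    "GROUND_TRUTH": [1, 0, 1],
})
toy_features = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A3"],
    "Latitude": [51.5, 53.4, 55.9],      # hypothetical feature columns
    "Longitude": [-0.1, -2.2, -3.2],
})

# Naming the key and validating the relationship makes the join fail loudly
# if an accident appears more than once in either table.
toy_train = pd.merge(
    toy_truth,
    toy_features,
    on="Accident_Index",
    how="inner",
    validate="one_to_one",
).set_index("Accident_Index")

print(toy_train)
```

Passing `on` and `validate` does not change the result when the key is unique in both frames, but it guards against silent row duplication.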

In [None]:
for feature in features.set_index('Accident_Index').columns:
    plotHistogram(feature, train_set).show()

## Model

In this section I will define and train the model.

I am defining X and y:

In [None]:
X = train_set.drop(columns='GROUND_TRUTH')
y = train_set['GROUND_TRUTH']

I split both the features and the ground truth into a training set and a test set, keeping 10% of the rows for testing and stratifying on the target:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.1,
                                                    stratify=y,
                                                    shuffle=True)
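
As a rough illustration of what `stratify=y` does, the sketch below splits a made-up imbalanced target and compares the share of positives on both sides. The toy data and the `random_state` value are assumptions; the split above does not fix a seed, so it changes from run to run.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 80% ones, 20% zeros (made-up proportions).
toy_y = pd.Series([1] * 800 + [0] * 200, name="GROUND_TRUTH")
toy_X = pd.DataFrame({"x": np.arange(len(toy_y))})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X,
    toy_y,
    test_size=0.1,
    stratify=toy_y,
    random_state=42,  # assumed seed, added only to make the split reproducible
)

# With stratification, the positive share should be nearly identical
# in the training and test parts.
print(y_tr.mean(), y_te.mean())
```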

For this exercise, I tested several tree-based models. I defined them in a configuration file called model_settings.yaml, which I now load:

In [None]:
with open(os.path.join('source', 'model_settings.yaml'), 'r', encoding="utf-8") as handler:
    model_params = yaml.safe_load(handler)  # safe_load only builds plain Python objects
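
The contents of model_settings.yaml are not shown in this notebook. The sketch below is a purely hypothetical layout, assuming only that the top-level keys are model names (as suggested by the `model_params['LightGBM']` lookup used later); the `search` and `param_grid` keys and all values are invented for illustration.

```python
import yaml

# Hypothetical configuration: every key below the model names is an assumption.
hypothetical_settings = """
LightGBM:
  search: grid
  param_grid:
    num_leaves: [15, 31]
    learning_rate: [0.05, 0.1]
RandomForest:
  search: random
  param_grid:
    n_estimators: [100, 200]
"""

params = yaml.safe_load(hypothetical_settings)

# Each model gets its own block, so switching models is a one-key change.
print(params["LightGBM"]["param_grid"])
```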

Now I am creating an instance of MLmodel, a class defined in the model.py module (in the source folder) that contains several useful methods for working with tree-based classification models.
I chose this class of models because they offer a good compromise between simplicity and quality of the predictions.

In [None]:
myClassificationModel = MLmodel(model_params['LightGBM'])

Now I am launching the optimization routine, based on either grid search or random search (the choice is made in the configuration file).

In [None]:
myClassificationModel.optimize(X_train, y_train)
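
The internals of `optimize` live in model.py and are not shown here. As a sketch of the two strategies it can choose between, the example below uses scikit-learn's `GridSearchCV` and `RandomizedSearchCV` directly. It swaps in `GradientBoostingClassifier` for LightGBM and uses synthetic data and an invented parameter space, so it illustrates the idea rather than the notebook's actual routine.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data standing in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Invented search space: 2 x 2 x 2 = 8 combinations.
param_space = {
    "n_estimators": [20, 40],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# Grid search evaluates every combination in the space.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_space,
    scoring="roc_auc",
    cv=3,
)
grid.fit(X_demo, y_demo)

# Random search evaluates only n_iter sampled combinations.
rand = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_space,
    n_iter=4,
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
rand.fit(X_demo, y_demo)

print(grid.best_params_, rand.best_params_)
```

Random search trades completeness for speed, which matters more as the hyperparameter space grows.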

Finally, I am fitting the model:

In [None]:
myClassificationModel.fit(X_train, y_train)

## KPIs

Now I am assembling the predictions:

In [None]:
results_train = y_train.reset_index()
results_test = y_test.reset_index()

results_train['Score'] = myClassificationModel.predict(X_train)
results_train['Prediction'] = (results_train['Score'] > 0.5).astype(int)

results_test['Score'] = myClassificationModel.predict(X_test)
results_test['Prediction'] = (results_test['Score'] > 0.5).astype(int)

In the cell below, I print some metrics to check the quality of the model on both the training and the test set:

In [None]:
print("\n\nTrain set:")
myClassificationModel.compute_model_gof_kpis(predictions=results_train,
                                             true_class_name='GROUND_TRUTH',
                                             pred_class_name='Prediction',
                                             pred_score_name='Score')

print("\n\nTest set:")
myClassificationModel.compute_model_gof_kpis(predictions=results_test,
                                             true_class_name='GROUND_TRUTH',
                                             pred_class_name='Prediction',
                                             pred_score_name='Score')
print("")
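
What `compute_model_gof_kpis` reports is defined in model.py and not shown here. The sketch below computes a few common classification metrics with scikit-learn on a made-up results table that uses the same column names and the same 0.5 threshold as above; the choice of metrics is an assumption.

```python
import pandas as pd
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up scores and labels, laid out like results_train / results_test.
toy_results = pd.DataFrame({
    "GROUND_TRUTH": [1, 1, 1, 1, 0, 0, 1, 0],
    "Score": [0.9, 0.8, 0.7, 0.4, 0.6, 0.2, 0.3, 0.1],
})
toy_results["Prediction"] = (toy_results["Score"] > 0.5).astype(int)

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(
    toy_results["GROUND_TRUTH"], toy_results["Prediction"]
).ravel()

precision = precision_score(toy_results["GROUND_TRUTH"], toy_results["Prediction"])
recall = recall_score(toy_results["GROUND_TRUTH"], toy_results["Prediction"])
# AUC is computed from the raw scores, so it does not depend on the threshold.
auc = roc_auc_score(toy_results["GROUND_TRUTH"], toy_results["Score"])

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} auc={auc:.3f}")
```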

Let's plot the feature importances. Tree-based models offer a simple way to check the relative importance of each feature, so they are a good choice for getting extra insight into the quality of the chosen features. In this case, I made several attempts and removed the features whose importance was always 0 or close to 0. scikit-learn also provides automatic routines for this task (for example sklearn.feature_selection.RFE), but they take a long time to run.

In [None]:
myClassificationModel.plotFeaturesImportances(X.columns)
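
The manual pruning described above can be sketched as follows. This is an illustration on synthetic data, using scikit-learn's `GradientBoostingClassifier` as a stand-in for the LightGBM model; the feature names and the 0.01 cut-off are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic features plus one constant column that carries no information.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])
X_demo["constant"] = 1.0  # no tree can split on a constant column

model = GradientBoostingClassifier(n_estimators=30, random_state=0)
model.fit(X_demo, y_demo)

# Rank the features by importance (the importances sum to 1).
importances = pd.Series(
    model.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)

# Keep only the features above an arbitrary, assumed threshold.
kept = importances[importances > 0.01].index.tolist()
print(kept)
```

Unlike RFE, this needs a single fit, at the cost of judging every feature from one trained model.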

# Answers

### a. What was the approach taken (e.g. algorithms and tools)?

I used pandas for data manipulation, plotly for visual data exploration, and LightGBM plus some scikit-learn utilities for the model. I chose a gradient boosting algorithm because such models are fairly easy to optimize, give good predictions (especially on medium-sized datasets, as in this case) and make it easy to obtain a feature importance ranking.

### b. What were the main challenges?

Understanding the dataset, because I was unfamiliar with the problem. Another challenge was the short time available, which forced me to take many shortcuts and to rely heavily on existing code (at the expense of customization).

### c. What insight did you gain from working with the data?

I found that the location of the accident (expressed in my feature set as the latitude/longitude pair) is by far the best predictor of our ground truth. My guess is that police officers are simply present more often in some places than in others.

### d. How useful is the model?

The model is fairly useful for predicting the presence of a police officer, but less useful for predicting their absence: on the test set, there are twice as many false negatives as true negatives.
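
To make the last point concrete: if false negatives are twice the true negatives, then only about one "absent" prediction in three is correct. A small sketch of that arithmetic follows; the counts are hypothetical and chosen only to respect the 2:1 ratio, they are not the notebook's real numbers.

```python
# Hypothetical counts satisfying FN = 2 * TN (not the real test-set figures).
tn = 100
fn = 2 * tn

# Negative predictive value: share of "absent" predictions that are correct.
npv = tn / (tn + fn)
print(f"{npv:.3f}")  # prints 0.333
```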

### e. What might you do differently if you had more time/resource? 

Here are some ideas for improving the model with more time available:

1. I could extract different features from the latitude/longitude pair, using a geocoding API.
2. I would explore the categorical features, which are not used in this example, in more detail.
3. I would try to increase the number of features by using other data sources (the website http://data.dft.gov.uk/road-accidents-safety-data/ hosts other files that I could explore).
4. I would try to increase the size of the dataset by adding data from previous years.
5. I would try different models.
6. I would perform a better hyperparameter optimization (using a grid search over a larger hyperparameter space).