# Model Predicting
by Prince Joseph Erneszer Javier

## Executive Summary

In this notebook, we evaluate the performance of our trained machine learning classifiers on test data for the Grab AI for SEA challenge under the Safety category. Grab provided telematics data, from which we developed models that predict whether a driver is driving safely. The raw features dataset contains 16 million samples and 11 columns, including `bookingID`. There are 20,000 unique `bookingID` values, each labeled 0 (safe driving) or 1 (unsafe driving). The data were preprocessed in `grab-ai-preprocessing-eda`, and the models were trained in `grab-ai-training`. In this notebook, we develop a pipeline for processing test data that look like the raw data provided by Grab and for evaluating our trained models on them. Finally, `grab-ai-predicting` contains the pipeline for predicting on a new dataset.

To improve the generalizability of the predictions, the predictions of the top five models were averaged. Each model was applied with both min-max scaling and standard scaling, giving a total of 10 predictions. The rounded average was used as the final prediction. The models used for predicting are Gradient Boosting Machines, Nonlinear Support Vector Machines, Linear Support Vector Machines with L1 Norm, Linear Support Vector Machines with L2 Norm, and a Neural Network. These were all trained in `grab-ai-training`.

Using this ensemble of ML models, we achieved an accuracy of around 71%, which is higher than the proportional chance criterion (the accuracy expected by random chance) of 62%. About 75% of safe trips and 60% of unsafe trips were predicted correctly.

## Introduction

The Grab AI for SEA challenge is a hackathon organized by Grab. Grab offers three challenges that can be solved using AI: Traffic Management, Computer Vision, and Safety. We tackle the Safety challenge. Grab provided the `Ride Safety` dataset, which contains telematics data (acceleration, gyroscope data, speed, etc.), `bookingID`, and labels (0 for safe driving, 1 for unsafe driving). The raw dataset was prepared in `grab-ai-preprocessing-eda`, and the output of that notebook was used as input for training the machine learning classifiers. The models trained in `grab-ai-training` are evaluated on the 5% test set in this notebook.

## About the Data

The `Ride Safety` dataset contains two folders: `features` and `labels`. `features` contains 10 CSV files which contain a total of 16 million telematics data samples. The columns in the `features` dataset as described in `data_dictionary.xlsx` are:

|Column Name|Description|
|:--|:--|
|`bookingID`|trip id|
|`Accuracy`|accuracy inferred by GPS in meters|
|`Bearing`|GPS bearing|
|`acceleration_x`|accelerometer reading in x axis (m/s²)|
|`acceleration_y`|accelerometer reading in y axis (m/s²)|
|`acceleration_z`|accelerometer reading in z axis (m/s²)|
|`gyro_x`|gyroscope reading in x axis (rad/s)|
|`gyro_y`|gyroscope reading in y axis (rad/s)|
|`gyro_z`|gyroscope reading in z axis (rad/s)|
|`second`|time of the record by number of seconds|
|`Speed`|speed measured by GPS in m/s|

In `grab-ai-preprocessing-eda`, the samples were aggregated per `bookingID` and features were engineered. `bookingID` and `second` were not included as features. The following measures were calculated for each remaining column: min, max, range, mean, standard deviation, skewness, kurtosis, dominant frequency (from the Fourier transform periodogram), and maximum power (from the Fourier transform periodogram). Trip length was added as an additional feature.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.pipeline import Pipeline

from scipy.stats import kurtosis, skew
from scipy.signal import periodogram
from collections import Counter

from tensorflow import keras
from keras.models import load_model

seed = 42
np.random.seed(seed)

Using TensorFlow backend.


## Loading Data

In [2]:
# paths of features and labels
path_feats = "data/processed/df_test.csv"
path_labels = "data/safety/labels/part-00000-e9445087-aa0a-433b-a7f6-7f4c19d78ad6-c000.csv"

df_feats = pd.read_csv(path_feats).drop_duplicates()
df_labels = pd.read_csv(path_labels).drop_duplicates()

## Data Preprocessing

### Feature Engineering

In [3]:
def dominant_f(y):
    """Given time series y, get frequency of maximum power
    from periodogram"""
    f, p = periodogram(y, scaling='spectrum')
    ind = np.argsort(p)
    f_max = f[ind[-1]]
    return f_max

def max_power(y):
    """Given time series y, get maximum power"""
    f, p = periodogram(y, scaling='spectrum')
    return p.max()
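
As a quick sanity check of the two helpers above, the following sketch applies the same `periodogram` call to a synthetic sine wave. The signal, its length, and its frequency are illustrative assumptions, not part of the Grab data. With the default sampling frequency of 1.0, a unit-amplitude sine at 0.1 cycles per sample should peak at a frequency of about 0.1 with a power close to 0.5.

```python
import numpy as np
from scipy.signal import periodogram

# Synthetic signal (assumption for illustration): unit-amplitude sine,
# 0.1 cycles per sample, 100 samples, i.e. exactly 10 full periods.
n = np.arange(100)
y = np.sin(2 * np.pi * 0.1 * n)

# Same call as in dominant_f and max_power; fs defaults to 1.0,
# so frequencies are expressed in cycles per sample.
f, p = periodogram(y, scaling="spectrum")

peak_freq = f[np.argmax(p)]  # frequency of the strongest component
peak_power = p.max()         # power of that component

print(peak_freq, peak_power)
```

Because the telematics rows are treated as evenly spaced samples with the default `fs`, the dominant frequencies in the engineered features are in cycles per record rather than in Hz.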

In [4]:
# we engineer features by aggregating the raw values per bookingID:
# min, max, range, mean, std, skewness, kurtosis,
# dominant frequency, and maximum power

df_engg_feats = df_feats.drop("second", axis=1)
df_engg_feats = df_engg_feats.groupby(by="bookingID", as_index=True).agg([np.min, np.max, np.ptp, np.mean, np.std, skew, kurtosis, dominant_f, max_power])




In [5]:
# flatten column names
cols = ["_".join(col) for col in df_engg_feats.columns]

In [6]:
df_engg_feats.columns = cols
df_engg_feats.reset_index(inplace=True)
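
The flattening step can be illustrated on a toy frame. This is a minimal sketch with hypothetical values. It passes the aggregations as strings, so the suffixes come out as `min` and `max` rather than the `amin` and `amax` seen in this notebook's output.

```python
import pandas as pd

# Toy frame (hypothetical values) with two trips and one sensor column.
df = pd.DataFrame({
    "bookingID": [1, 1, 2, 2],
    "Speed": [0.0, 4.0, 2.0, 6.0],
})

# Aggregating with a list of functions yields two-level column names,
# e.g. ("Speed", "min") and ("Speed", "max").
agg = df.groupby("bookingID").agg(["min", "max"])

# Join the two levels with an underscore, then restore bookingID as a column.
agg.columns = ["_".join(col) for col in agg.columns]
agg = agg.reset_index()

print(agg.columns.tolist())
```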

In [7]:
# add length of each trip
df_len = df_feats.groupby(by="bookingID", as_index=True).agg(len).iloc[:, 0:1]
df_len.columns = ['trip_len']
df_len.reset_index(inplace=True)

In [8]:
# merge along bookingID
df_engg_feats_2 = pd.merge(df_engg_feats, df_len, how="inner", on="bookingID")
df_engg_feats_2.head()

| |bookingID|Accuracy_amin|Accuracy_amax|Accuracy_ptp|Accuracy_mean|Accuracy_std|Accuracy_skew|Accuracy_kurtosis|Accuracy_dominant_f|Accuracy_max_power|...|Speed_amin|Speed_amax|Speed_ptp|Speed_mean|Speed_std|Speed_skew|Speed_kurtosis|Speed_dominant_f|Speed_max_power|trip_len|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0|8|3.0|18.204|15.204|7.008253|3.153024|1.070632|0.602108|0.002584|2.271603|...|-1.0|18.27|19.27|5.351266|5.661732|0.804261|-0.753928|0.002584|12.302757|387.0|
|1|13|3.0|1251.564|1248.564|11.157522|67.183017|15.394113|241.992297|0.000814|58.380261|...|-1.0|26.152094|27.152094|15.521918|9.09648|-0.561274|-1.259059|0.000814|58.821424|1228.0|
|2|33|3.0|5.1|2.1|3.537573|0.451916|-0.071964|-1.263237|0.000813|0.057636|...|0.0|19.625328|19.625328|6.496606|6.343458|0.435514|-1.305701|0.000813|18.016253|1230.0|
|3|35|3.198|5.8|2.602|5.223068|0.723952|-2.177536|2.978317|0.000601|0.135486|...|0.0|26.021782|26.021782|16.619441|8.364577|-0.92044|-0.561377|0.000601|39.768221|1665.0|
|4|91|3.0|8.0|5.0|3.763156|0.931148|2.333683|6.639845|0.022222|0.197087|...|0.0|17.478025|17.478025|4.900992|4.673188|0.838455|-0.097581|0.022222|8.685996|180.0|


In [9]:
# inner join of the aggregated, engineered features with the labels
df_engg_feats_labels = pd.merge(df_engg_feats_2, df_labels, how="inner", on="bookingID")
df_engg_feats_labels.head()

| |bookingID|Accuracy_amin|Accuracy_amax|Accuracy_ptp|Accuracy_mean|Accuracy_std|Accuracy_skew|Accuracy_kurtosis|Accuracy_dominant_f|Accuracy_max_power|...|Speed_amax|Speed_ptp|Speed_mean|Speed_std|Speed_skew|Speed_kurtosis|Speed_dominant_f|Speed_max_power|trip_len|label|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0|8|3.0|18.204|15.204|7.008253|3.153024|1.070632|0.602108|0.002584|2.271603|...|18.27|19.27|5.351266|5.661732|0.804261|-0.753928|0.002584|12.302757|387.0|0|
|1|13|3.0|1251.564|1248.564|11.157522|67.183017|15.394113|241.992297|0.000814|58.380261|...|26.152094|27.152094|15.521918|9.09648|-0.561274|-1.259059|0.000814|58.821424|1228.0|0|
|2|13|3.0|1251.564|1248.564|11.157522|67.183017|15.394113|241.992297|0.000814|58.380261|...|26.152094|27.152094|15.521918|9.09648|-0.561274|-1.259059|0.000814|58.821424|1228.0|1|
|3|33|3.0|5.1|2.1|3.537573|0.451916|-0.071964|-1.263237|0.000813|0.057636|...|19.625328|19.625328|6.496606|6.343458|0.435514|-1.305701|0.000813|18.016253|1230.0|0|
|4|35|3.198|5.8|2.602|5.223068|0.723952|-2.177536|2.978317|0.000601|0.135486|...|26.021782|26.021782|16.619441|8.364577|-0.92044|-0.561377|0.000601|39.768221|1665.0|1|


In [10]:
df_engg_feats_labels.drop_duplicates(subset="bookingID", inplace=True)
df_engg_feats_labels.shape

(1000, 84)
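
The merge output shows `bookingID` 13 twice, once with label 0 and once with label 1, which is why duplicates are dropped afterwards. The sketch below reproduces that behavior on hypothetical toy frames. It illustrates how `pd.merge` and `drop_duplicates` interact and does not describe how the labels file was produced.

```python
import pandas as pd

# Hypothetical features and labels; bookingID 13 has two conflicting labels,
# mirroring the duplicated rows visible in the merge output.
feats = pd.DataFrame({"bookingID": [8, 13, 33], "trip_len": [387, 1228, 1230]})
labels = pd.DataFrame({"bookingID": [8, 13, 13, 33], "label": [0, 0, 1, 0]})

# The inner join emits one row per matching label, so bookingID 13 appears twice.
merged = pd.merge(feats, labels, how="inner", on="bookingID")
print(len(merged))

# drop_duplicates keeps the first occurrence by default (keep="first"),
# so the label that survives depends on row order in the labels frame.
deduped = merged.drop_duplicates(subset="bookingID")
print(deduped["label"].tolist())
```

Keeping the first row is an arbitrary tie-break for trips with conflicting labels. Dropping such trips entirely would be an alternative, at the cost of a slightly smaller test set.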

In [11]:
# get feature set from the deduplicated, labeled frame so that rows align with y
X0 = df_engg_feats_labels.drop(["bookingID", "label"], axis=1)
cols = X0.columns

In [12]:
y = df_engg_feats_labels.label

In [13]:
# 0s and 1s
Counter(y)

Counter({0: 748, 1: 252})

In [14]:
# proportional chance criterion
state_counts = Counter(y)
df_state = pd.DataFrame.from_dict(state_counts, orient='index')
num = (df_state[0] / df_state[0].sum())**2
pcc = num.sum()
pcc

0.623008
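
The proportional chance criterion is the sum of squared class proportions, $\mathrm{PCC} = \sum_k (n_k / N)^2$. As a minimal cross-check that does not rely on pandas, the sketch below recomputes it from the class counts reported above. The function name `proportional_chance_criterion` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def proportional_chance_criterion(labels):
    """Sum of squared class proportions."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum((n / total) ** 2 for n in counts.values())

# Class counts taken from the Counter output: 748 safe, 252 unsafe.
y_demo = [0] * 748 + [1] * 252
pcc_demo = proportional_chance_criterion(y_demo)
print(round(pcc_demo, 6))  # 0.623008
```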

## Model Predicting

In [15]:
# ensemble of top models

def predict(scaler="minmax"):
    """
    Given a scaler name, scale the dataset and predict using pretrained models.
    Uses all features (globals X0 and cols) and prints each model's
    accuracy against the global labels y.
    Returns a list with one array of 0s and 1s per model.
    """

    # load the fitted scaler
    path = f"scalers/{scaler}.sav"
    with open(path, 'rb') as f:
        sc = pickle.load(f)

    y_preds = []

    # scale X
    X = pd.DataFrame(sc.transform(X0), columns=cols)

    model_names = [f"models/{model}_{scaler}.sav" for model in ['gbm', 'svc', 'linear_svc_l1', 'linear_svc_l2']]

    for model_name in model_names:
        # load the model from disk (machine learning)
        with open(model_name, 'rb') as f:
            model = pickle.load(f)
        y_pred = model.predict(X)
        print(accuracy_score(y, y_pred), model_name)

        y_preds.append(y_pred)

    # deep learning: round the network's output to 0 or 1
    filename = f"models/mlp_{scaler}_relu_adam_dropout-0.5_cols-82.hdf5"
    model = load_model(filename)
    y_pred_2 = np.round(model.predict(X))
    print(accuracy_score(y, y_pred_2), f"nn_{scaler}")

    y_preds.append(y_pred_2)

    return y_preds


In [16]:
y_pred_minmax = predict("minmax")
y_pred_std = predict("std")

# concatenate results from std and minmax models
y_pred_all = y_pred_std + y_pred_minmax

# average the 0/1 predictions across all models and round (majority vote)
y_pred = np.round(np.array([np.array(i).flatten() for i in y_pred_all]).mean(axis=0), 0)

0.703 models/gbm_minmax.sav
0.694 models/svc_minmax.sav
0.674 models/linear_svc_l1_minmax.sav
0.701 models/linear_svc_l2_minmax.sav
0.724 nn_minmax
0.701 models/gbm_std.sav
0.675 models/svc_std.sav
0.689 models/linear_svc_l1_std.sav
0.693 models/linear_svc_l2_std.sav
0.707 nn_std
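
The averaging step is effectively a majority vote over the individual 0/1 predictions. The sketch below reproduces it on hypothetical predictions from four models on five trips, including one Keras-style output of shape `(n, 1)`. The arrays are made up for illustration and are not results from the notebook.

```python
import numpy as np

# Hypothetical predictions from four models on five trips.
# The third array mimics a Keras output of shape (n, 1), hence the flatten below.
preds = [
    np.array([0, 1, 1, 0, 1]),
    np.array([0, 1, 0, 0, 1]),
    np.array([[0.0], [1.0], [1.0], [1.0], [0.0]]),
    np.array([1, 1, 0, 1, 0]),
]

stacked = np.array([np.asarray(p).flatten() for p in preds])
vote_share = stacked.mean(axis=0)  # fraction of models voting "unsafe"
final = np.round(vote_share, 0)    # same rounding as the notebook cell

print(vote_share)  # [0.25 1.   0.5  0.5  0.5 ]
print(final)       # [0. 1. 0. 0. 0.]
```

Note that `np.round` rounds halves to the nearest even value, so an exact tie (a vote share of 0.5) resolves to 0, the safe class. Ties are possible whenever the number of models is even.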
