# Model Predicting
by Prince Joseph Erneszer Javier

**Notes on Virtual Machine and Environment**
<br>Virtual Machine: AWS c5.9xlarge
<br>Operating System: Deep Learning AMI (Ubuntu) Version 23.0
<br>Environment: tensorflow_p36
<br>Storage Size: 85 GB

## Executive Summary

In this notebook, we use trained machine learning classifiers to predict whether the driving in new data is safe or unsafe. This is part of my entry for the Grab AI for SEA challenge under the Safety category. We were provided telematics data, from which we developed models that predict whether a driver is driving safely. The raw features dataset contains 16 million samples and 11 columns, including `bookingID`. There are 20,000 unique `bookingID`s, each labeled 0 (safe) or 1 (unsafe). The data were preprocessed in `grab-ai-preprocessing-eda`, and the models were trained in `grab-ai-training`. This notebook contains the pipeline for predicting on a new dataset. The predictions are saved as a CSV under the folder `prediction`. The CSV contains two columns: `bookingID` and `label`, the predicted class (0 or 1).

## Introduction

The Grab AI for SEA challenge is a hackathon organized by Grab. Grab offers three challenges that can be solved using AI: Traffic Management, Computer Vision, and Safety. We tackle the Safety challenge. Grab provided the `Ride Safety` dataset, which contains telematics data (acceleration, gyroscope data, speed, etc.), `bookingID`, and labels (0 or 1 for safe or unsafe driving). The raw dataset was prepared in `grab-ai-preprocessing-eda`. The output of that notebook was used as input for training the machine learning classifiers. This notebook contains the pipeline for predicting on a new dataset.

## About the Data

The `Ride Safety` dataset contains two folders: `features` and `labels`. `features` contains 10 CSV files with a total of 16 million telematics samples. The columns of the `features` dataset, as described in `data_dictionary.xlsx`, are:

|Column Name|Description|
|:--|:--|
|`bookingID`|trip id|
|`Accuracy`|accuracy inferred by GPS in meters|
|`Bearing`|GPS bearing|
|`acceleration_x`|accelerometer reading in x axis (m/s²)|
|`acceleration_y`|accelerometer reading in y axis (m/s²)|
|`acceleration_z`|accelerometer reading in z axis (m/s²)|
|`gyro_x`|gyroscope reading in x axis (rad/s)|
|`gyro_y`|gyroscope reading in y axis (rad/s)|
|`gyro_z`|gyroscope reading in z axis (rad/s)|
|`second`|time of the record by number of seconds|
|`Speed`|speed measured by GPS in m/s|

In `grab-ai-preprocessing-eda`, the samples were aggregated per `bookingID` and features were engineered. `bookingID` and `second` were not included as features. The following measures were calculated for each of the remaining 9 columns: min, max, range, mean, standard deviation, skewness, kurtosis, dominant frequency (from the Fourier transform periodogram), and maximum power (from the Fourier transform periodogram). The trip length (the number of records in the trip) was added as one more feature.
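As a quick check on the size of the engineered feature set, the sketch below enumerates the feature names implied by the description above: 9 sensor columns times 9 measures, plus the trip length. The measure suffixes are illustrative assumptions; the actual suffixes depend on how pandas names each aggregation function.

```python
# Sensor columns that remain after excluding bookingID and second
sensor_cols = [
    "Accuracy", "Bearing",
    "acceleration_x", "acceleration_y", "acceleration_z",
    "gyro_x", "gyro_y", "gyro_z",
    "Speed",
]

# Hypothetical suffixes for the nine measures (assumed for illustration)
measures = [
    "min", "max", "range", "mean", "std",
    "skew", "kurtosis", "dominant_f", "max_power",
]

# One feature per (column, measure) pair, plus the trip length
feature_names = [f"{col}_{m}" for col in sensor_cols for m in measures]
feature_names.append("trip_len")

print(len(feature_names))  # 9 * 9 + 1 = 82
```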

## How to Use This?

The features dataset should be in CSV format. The CSV files should follow the same format as the CSV files in the `features` folder in the training set. The columns are the 11 columns in the table above. Save all features CSVs under one folder. The folder's default path is `data/for_prediction/` and the default filepath for the predictions is `prediction/predictions.csv`, but you can change them as needed below:

In [1]:
# paths of features
path_feats = "data/for_prediction/"

# path of predictions csv file
filepath = "prediction/predictions.csv"

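Once the CSVs are in the features folder, restart the kernel and run the entire Jupyter notebook. Every file in that folder is read, so it should contain only the feature CSVs. The predicted classes per `bookingID` will be saved as a CSV in the folder `prediction`, which must already exist because `to_csv` does not create missing directories.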
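Before running the pipeline, it may help to confirm that each CSV has the expected 11 columns. The following is a minimal sketch, not part of the original pipeline; `EXPECTED_COLUMNS` and `check_columns` are hypothetical names introduced here for illustration.

```python
# The 11 columns listed in the data dictionary table
EXPECTED_COLUMNS = [
    "bookingID", "Accuracy", "Bearing",
    "acceleration_x", "acceleration_y", "acceleration_z",
    "gyro_x", "gyro_y", "gyro_z",
    "second", "Speed",
]

def check_columns(columns):
    """Return (missing, unexpected) column names relative to the expected schema."""
    present = set(columns)
    expected = set(EXPECTED_COLUMNS)
    missing = sorted(expected - present)
    unexpected = sorted(present - expected)
    return missing, unexpected

# Example: a header that lacks Speed and carries a stray index column
header = [c for c in EXPECTED_COLUMNS if c != "Speed"] + ["Unnamed: 0"]
missing, unexpected = check_columns(header)
print(missing, unexpected)
```

With pandas, the header of a file can be obtained without loading the data through `pd.read_csv(path, nrows=0).columns` and passed to the function above.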

## Predicting Pipeline

In [2]:
import pandas as pd
import numpy as np
import pickle
import glob

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.pipeline import Pipeline

from scipy.stats import kurtosis, skew
from scipy.signal import periodogram

from tensorflow import keras
from keras.models import load_model

seed = 42
np.random.seed(seed)

Using TensorFlow backend.


## Loading Data

In [3]:
# contents of dataset folder
paths = glob.glob(path_feats+'*')

# let's combine all features into one pandas dataframe
# (a single concat avoids re-copying the growing dataframe on every file)
df_feats = pd.concat((pd.read_csv(path) for path in paths), ignore_index=True)

# drop rows that are exact duplicates
df_feats = df_feats.drop_duplicates()

## Data Preprocessing

In [4]:
def dominant_f(y):
    """Given time series y, get frequency of maximum power
    from periodogram"""
    f, p = periodogram(y, scaling='spectrum')
    ind = np.argsort(p)
    f_max = f[ind[-1]]
    return f_max

def max_power(y):
    """Given time series y, get maximum power"""
    f, p = periodogram(y, scaling='spectrum')
    return p.max()
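To illustrate what the two helpers return, the sketch below applies them to a synthetic sine wave. The signal is made up for this example. `periodogram` is called with its default sampling frequency of 1.0, so frequencies are in cycles per sample; they correspond to Hz only if the records of a trip are ordered in time and spaced one second apart, which this notebook does not enforce.

```python
import numpy as np
from scipy.signal import periodogram

def dominant_f(y):
    """Given time series y, get frequency of maximum power from periodogram"""
    f, p = periodogram(y, scaling='spectrum')
    ind = np.argsort(p)
    return f[ind[-1]]

def max_power(y):
    """Given time series y, get maximum power"""
    f, p = periodogram(y, scaling='spectrum')
    return p.max()

# Synthetic signal: unit-amplitude sine completing one cycle every 10 samples
n = np.arange(100)
y = np.sin(2 * np.pi * 0.1 * n)

f_dom = dominant_f(y)   # about 0.1 cycles per sample
p_max = max_power(y)    # about 0.5, i.e. amplitude**2 / 2 for a pure sine
print(f_dom, p_max)
```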

In [5]:
# we engineer features, aggregating feature values per bookingID
# getting min, max, range, mean, std, skewness, kurtosis,
# dominant frequency, and maximum power

df_engg_feats = df_feats.drop("second", axis=1)
df_engg_feats = df_engg_feats.groupby(by="bookingID", as_index=True).agg([np.min, np.max, np.ptp, np.mean, np.std, skew, kurtosis, dominant_f, max_power])

# flatten column names
cols = ["_".join(col) for col in df_engg_feats.columns]

df_engg_feats.columns = cols
df_engg_feats.reset_index(inplace=True)

# add length of each trip
df_len = df_feats.groupby(by="bookingID", as_index=True).agg(len).iloc[:, 0:1]
df_len.columns = ['trip_len']
df_len.reset_index(inplace=True)

# merge along bookingID
df_engg_feats_2 = pd.merge(df_engg_feats, df_len, how="inner", on="bookingID")

# get booking ID
bookingID = df_engg_feats_2.bookingID

# get feature set
X0 = df_engg_feats_2.drop(["bookingID"], axis=1)
cols = X0.columns
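The aggregation above can be hard to picture on 16 million rows, so here is a minimal sketch of the same steps on a toy dataframe. The values are made up, only three measures are computed, and string names such as `"min"` are used instead of NumPy functions; this is an assumption for illustration and it changes the resulting column suffixes.

```python
import pandas as pd

# Toy telematics records for two trips (values are made up for illustration)
df = pd.DataFrame({
    "bookingID": [1, 1, 1, 2, 2],
    "second": [0, 1, 2, 0, 1],
    "Speed": [0.0, 2.0, 4.0, 5.0, 5.0],
})

# Aggregate per trip; string names select pandas' built-in reductions
agg = df.drop("second", axis=1).groupby("bookingID").agg(["min", "max", "mean"])

# Flatten the (column, measure) MultiIndex into names like "Speed_mean"
agg.columns = ["_".join(col) for col in agg.columns]
agg = agg.reset_index()

# Trip length is the number of records per bookingID
trip_len = df.groupby("bookingID").size().rename("trip_len").reset_index()

# One row per trip with all engineered features
features = agg.merge(trip_len, how="inner", on="bookingID")
print(features)
```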



## Model Predicting

In [6]:
# ensemble of top models

def predict(scaler="minmax"):
    
    """
    Given scaler name, scale dataset and predict using pretrained models
    Uses all features
    Return a list with one array of predicted 0s and 1s per model
    """

    # load the fitted scaler from disk
    path = f"scalers/{scaler}.sav"
    with open(path, 'rb') as f:
        sc = pickle.load(f)
    
    y_preds = []

    # scale X
    X = pd.DataFrame(sc.transform(X0))
    X.columns = cols

    model_names = [f"models/{model}_{scaler}.sav" for model in ['gbm', 'svc']]

    for model_name in model_names:
        # load the model from disk (machine learning)
        with open(model_name, 'rb') as f:
            model = pickle.load(f)
        y_pred = model.predict(X)
        y_preds.append(y_pred)
    
    return y_preds


In [7]:
y_pred_minmax = predict("minmax")
y_pred_std = predict("std")

# concatenate results from std and minmax models
y_pred_all = y_pred_std + y_pred_minmax

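# average the votes of all models and round to the nearest class
# np.round rounds halves to the nearest even value, so a 2-2 tie gives 0
y_pred = np.round(np.array([np.array(i).flatten() for i in y_pred_all]).mean(axis=0), 0)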

# append bookingID with y_pred
predictions = pd.DataFrame(bookingID)
predictions['label'] = y_pred
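The averaging step amounts to a majority vote over the four models (two model types for each of two scalers). The sketch below uses hypothetical predictions to show how ties are resolved: `np.round` rounds 0.5 to 0, so a 2-2 split is labeled safe. The alternative threshold at the end is only an illustration of a different tie rule, not what this notebook does.

```python
import numpy as np

# Hypothetical predictions from four models for five trips
y_pred_all = [
    np.array([0, 1, 1, 1, 0]),
    np.array([0, 1, 0, 1, 0]),
    np.array([0, 1, 1, 1, 1]),
    np.array([0, 1, 0, 0, 0]),
]

# Fraction of models voting "unsafe" for each trip
votes = np.array(y_pred_all).mean(axis=0)

# Same rule as the notebook: round the average vote
y_pred = np.round(votes, 0)

# Alternative tie rule (assumption): label a 2-2 split as unsafe
ties_unsafe = (votes >= 0.5).astype(int)

print(votes)        # [0.   1.   0.5  0.75 0.25]
print(y_pred)       # [0. 1. 0. 1. 0.]
print(ties_unsafe)  # [0 1 1 1 0]
```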

In [8]:
# save predictions to CSV
predictions.to_csv(filepath, index=False)
predictions.head()

| |bookingID|label|
|:--|:--|:--|
|0|8|0.0|
|1|13|1.0|
|2|33|1.0|
|3|35|1.0|
|4|91|0.0|
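Saving fails if the output folder is missing, because `to_csv` does not create directories. A minimal sketch of a guard that could run before saving is shown below; `ensure_parent_dir` is a hypothetical helper, not part of the original notebook, and the demonstration writes only inside a temporary directory.

```python
import os
import tempfile

def ensure_parent_dir(filepath):
    """Create the folder that will hold filepath if it does not exist yet."""
    parent = os.path.dirname(filepath)
    if parent:  # an empty string means the current directory
        os.makedirs(parent, exist_ok=True)
    return parent

# Demonstrate inside a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "prediction", "predictions.csv")
    created = ensure_parent_dir(target)
    folder_exists = os.path.isdir(created)

print(folder_exists)
```

In this notebook, the call would be `ensure_parent_dir(filepath)` placed just before `predictions.to_csv(filepath, index=False)`.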
