# Safety Dataset Preprocessing and EDA
by Prince Joseph Erneszer Javier

## Executive Summary

In this notebook, we load and process data for the Grab AI for SEA challenge under the Safety category. We were provided telematics data, from which we develop models that predict whether a driver is driving safely. The raw features dataset contains 16 million samples and 11 columns, including the `bookingID`. There are 20,000 unique `bookingID`s, each labeled 0 or 1, corresponding to safe or unsafe driving, respectively. In this notebook, we loaded the data, engineered features, sampled the data, and saved the processed data for training the machine learning classifiers in the notebook `grab-ai-training`. `grab-ai-testing` contains the trained models predicting on test data. Finally, `grab-ai-predicting` contains the pipeline for predicting on a new dataset.

The following preprocessing steps were performed in this notebook: the features data, stored in multiple CSV files, were concatenated into one dataframe; 5% of the bookings were set aside as test data, leaving 95% for training; features were engineered and samples were aggregated per `bookingID`; and the class label (0 or 1) of each `bookingID` was joined to the aggregated and engineered feature set. Finally, since the classes were imbalanced, the number of samples per class was equalized to a 50-50 split for the machine learning models. The final processed datasets were saved as CSV files.

## Introduction

The Grab AI for SEA challenge is a hackathon organized by Grab. Grab offers three challenges that can be solved using AI: Traffic Management, Computer Vision, and Safety. We tackle the Safety challenge. The `Ride Safety` dataset provided by Grab contains telematics data (acceleration, gyroscope data, speed, etc.), `bookingID`, and labels (0 or 1 for safe or unsafe driving). In this notebook, the raw dataset is prepared for the machine learning models in the `grab-ai-training` notebook.

## About the Data

The `Ride Safety` dataset contains two folders: `features` and `labels`. `features` contains 10 CSV files which contain a total of 16 million telematics data samples. The columns in the `features` dataset as described in `data_dictionary.xlsx` are:

|Column Name|Description|
|:--|:--|
|`bookingID`|trip id|
|`Accuracy`|accuracy inferred by GPS in meters|
|`Bearing`|GPS bearing|
|`acceleration_x`|accelerometer reading in x axis (m/s2)|
|`acceleration_y`|accelerometer reading in y axis (m/s2)|
|`acceleration_z`|accelerometer reading in z axis (m/s2)|
|`gyro_x`|gyroscope reading in x axis (rad/s)|
|`gyro_y`|gyroscope reading in y axis (rad/s)|
|`gyro_z`|gyroscope reading in z axis (rad/s)|
|`second`|time of the record by number of seconds|
|`Speed`|speed measured by GPS in m/s|

## Preprocessing

In [1]:
# loading packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import glob

from collections import Counter
from scipy.stats import kurtosis, skew
from scipy.signal import periodogram
import multiprocessing
from sklearn.utils import shuffle

import warnings
warnings.simplefilter('ignore')

n_jobs = multiprocessing.cpu_count()

### Loading the Data

There are two folders, `features` and `labels`, and one Excel file, `data_dictionary.xlsx`.

In [2]:
# check the dataset folders
!ls data/safety/

data_dictionary.xlsx  features	labels


We read `data_dictionary.xlsx` and check the contents.

In [3]:
# read the data dictionary
xl = pd.ExcelFile("data/safety/data_dictionary.xlsx")

In [4]:
# sheet names
xl.sheet_names

['telematics data', 'labels']

In [5]:
# let's see the first sheet
xl.parse('telematics data')

Unnamed: 0.1,Unnamed: 0,bookingID,Accuracy,Bearing,acceleration_x,acceleration_y,acceleration_z,gyro_x,gyro_y,gyro_z,second,Speed
0,description,trip id,accuracy inferred by GPS in meters,GPS bearing,accelerometer reading in x axis (m/s2),accelerometer reading in y axis (m/s2),accelerometer reading in z axis (m/s2),gyroscope reading in x axis (rad/s),gyroscope reading in y axis (rad/s),gyroscope reading in z axis (rad/s),time of the record by number of seconds,speed measured by GPS in m/s
1,samples,1,5,303.695,-0.00636292,-0.393829,-0.922379,"-0.020000606102604086,0.03205247529964867,-0.0...",,,0,0.57
2,,1,10,325.39,0.183914,-0.355026,-0.92041,"-0.028598887998033916,0.025720543491876274,-0....",,,1,0.28
3,,1,5,303.695,-0.00636292,-0.392944,-0.922226,"-0.01894040167264354,0.030980020328673762,-0.0...",,,2,0.57
4,,1,10,324.23,0.165924,-0.332092,-0.920578,"-0.0577245492596855,0.002558232543130116,0.014...",,,3,0.28
5,,1,5,303.695,-0.00642395,-0.392166,-0.924164,"-0.017865283540578553,0.03203915949419828,-0.0...",,,4,0.57
6,,1,10,324.23,0.169724,-0.333694,-0.939575,"-0.04030587783391324,0.03112276576310201,-0.00...",,,5,0.28
7,,1,5,303.695,-0.00480652,-0.391861,-0.923065,"-0.020006731373111267,0.03205726898961082,-0.0...",,,6,0.57
8,,2,10,322.99,0.174759,-0.344498,-0.918839,"-0.00411618178082647,0.02562493600874243,0.004...",,,0,0.28
9,,2,5,303.695,-0.00646973,-0.391953,-0.923889,"-0.0189390700920985,0.032050611086885616,-0.00...",,,1,0.57


In [6]:
# let's see the second sheet
xl.parse('labels')

Unnamed: 0,bookingID,label
0,1,1
1,2,0
2,3,0
3,4,0
4,5,1
5,6,1
6,7,1
7,8,1
8,9,0
9,10,0


We then checked the contents of `features`. There are 10 CSV files.

In [7]:
# let's see the contents of the features folder
!ls data/safety/features

# there are many csvs, we either want to run this in Pyspark 
# or we can combine them into just one Pandas dataset

part-00000-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv
part-00001-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv
part-00002-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv
part-00003-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv
part-00004-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv
part-00005-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv
part-00006-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv
part-00007-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv
part-00008-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv
part-00009-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv


`labels` contains only one CSV file.

In [8]:
!ls data/safety/labels

part-00000-e9445087-aa0a-433b-a7f6-7f4c19d78ad6-c000.csv


Below are the first few rows of one CSV file in `features`.

In [9]:
# let's load one features csv to inspect its contents
_ = pd.read_csv("data/safety/features/part-00000-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv")
_.head()

Unnamed: 0,bookingID,Accuracy,Bearing,acceleration_x,acceleration_y,acceleration_z,gyro_x,gyro_y,gyro_z,second,Speed
0,1202590843006,3.0,353.0,1.228867,8.9001,3.986968,0.008221,0.002269,-0.009966,1362.0,0.0
1,274877907034,9.293,17.0,0.032775,8.659933,4.7373,0.024629,0.004028,-0.010858,257.0,0.19
2,884763263056,3.0,189.0,1.139675,9.545974,1.951334,-0.006899,-0.01508,0.001122,973.0,0.667059
3,1073741824054,3.9,126.0,3.871543,10.386364,-0.136474,0.001344,-0.339601,-0.017956,902.0,7.913285
4,1056561954943,3.9,50.0,-0.112882,10.55096,-1.56011,0.130568,-0.061697,0.16153,820.0,20.419409


Below are some of the contents in the CSV file inside `labels`.

In [10]:
# let's load the labels csv
labels = pd.read_csv("data/safety/labels/part-00000-e9445087-aa0a-433b-a7f6-7f4c19d78ad6-c000.csv")

# drop rows with missing values (duplicates are not removed here)
labels = labels.dropna()

# sort
labels.sort_values(by="bookingID", inplace=True)
labels.reset_index(drop=True, inplace=True)

labels.head()

Unnamed: 0,bookingID,label
0,0,0
1,1,1
2,2,1
3,4,1
4,6,0


There are no null values in the `labels` dataset.

In [11]:
# how many null
labels.isnull().sum()

bookingID    0
label        0
dtype: int64

There are 20,018 rows in the `labels` dataset but there are only 20,000 unique `bookingID`. We will remove the duplicates later.

In [12]:
# how many labels are there?
len(labels)

20018

In [13]:
# how many unique bookings are there?
len(labels.bookingID.unique())

20000
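
The 18 extra rows could be exact copies or bookings that carry two different labels; the notebook does not show which. A minimal sketch of how one might check, using a toy table with made-up values (`labels_toy` and the "discard ambiguous bookings" policy are assumptions for illustration, not the method used in this project):

```python
import pandas as pd

# Toy labels table (hypothetical values): bookingID 1 has an exact
# duplicate row, bookingID 2 has two conflicting labels.
labels_toy = pd.DataFrame(
    {"bookingID": [0, 1, 1, 2, 2, 4], "label": [0, 1, 1, 1, 0, 1]}
)

# Exact duplicate rows carry no extra information and can be dropped.
labels_dedup = labels_toy.drop_duplicates()

# Any bookingID that still appears more than once has conflicting labels.
conflict_mask = labels_dedup.duplicated(subset="bookingID", keep=False)
conflicting_ids = labels_dedup.loc[conflict_mask, "bookingID"].unique()

# One possible policy (an assumption): discard ambiguous bookings entirely.
labels_clean = labels_dedup[~conflict_mask].reset_index(drop=True)

print("conflicting bookingIDs:", list(conflicting_ids))
print(labels_clean)
```

Resolving this before joining matters because a `bookingID` that appears twice in `labels` produces two rows in any merge on `bookingID`.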

We then combined the 10 CSV files under `features` into one Pandas dataframe.

In [None]:
# load all paths into a list (sorted for a deterministic order)
paths = sorted(glob.glob("data/safety/features/*.csv"))

# combine all feature CSVs into one pandas dataframe;
# reading into a list and concatenating once avoids repeated copying
df_feats0 = pd.concat(
    [pd.read_csv(path, header="infer") for path in paths],
    ignore_index=True,
)

In [None]:
df_feats0.head()

In [None]:
df_feats0.shape

We dropped any duplicates in the combined `features` dataset.

In [None]:
# drop duplicates
df_feats0 = df_feats0.drop_duplicates()

In [None]:
# count null values
df_feats0.isnull().sum()

In [None]:
# since there is only one label per bookingID, we engineer features per booking trip
# (min, max, mean, std, skewness, kurtosis, length of trip)

# sorting by booking number and seconds
df_feats0 = df_feats0.sort_values(by=["bookingID", "second"]).reset_index(drop=True)
df_feats0.head()

# # save to csv df_feats0
# df_feats0.to_csv("data/processed/df_feats0.csv", index=False)

In [None]:
# # load csv
# df_feats0 = pd.read_csv("data/processed/df_feats0.csv")
# df_feats0.head()

In [None]:
# setting aside test data 5% of all booking IDs
num_test = int(len(df_feats0.bookingID.unique()) * 0.05)
print(num_test)

# select random booking IDs
test_bookingIDs = np.random.choice(df_feats0.bookingID.unique(), num_test, replace=False)
train_bookingIDs = np.setdiff1d(df_feats0.bookingID.unique(), test_bookingIDs)
df_test = df_feats0[df_feats0.bookingID.isin(test_bookingIDs)]
df_train = df_feats0[df_feats0.bookingID.isin(train_bookingIDs)]
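
The split above draws booking IDs without a fixed seed, so each run selects a different test set. A minimal sketch of a seeded variant follows; the helper name `split_by_booking`, the seed value, and the toy data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def split_by_booking(df, test_frac=0.05, seed=42):
    """Split rows into (train, test) so no bookingID appears in both."""
    rng = np.random.default_rng(seed)
    booking_ids = df["bookingID"].unique()
    num_test = int(len(booking_ids) * test_frac)
    test_ids = rng.choice(booking_ids, size=num_test, replace=False)
    is_test = df["bookingID"].isin(test_ids)
    return df[~is_test], df[is_test]


# Toy data: 40 bookings with 3 rows each (hypothetical values).
toy = pd.DataFrame(
    {"bookingID": np.repeat(np.arange(40), 3), "Speed": np.arange(120) * 0.1}
)

train_toy, test_toy = split_by_booking(toy, test_frac=0.05, seed=42)
print(train_toy["bookingID"].nunique(), test_toy["bookingID"].nunique())
```

Splitting on `bookingID` rather than on rows keeps every sample of a trip on the same side of the split, which is what the per-booking aggregation later relies on.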

In [None]:
# save the test and train splits to csv
df_test.to_csv("data/processed/df_test.csv", index=False)
df_train.to_csv("data/processed/df_train.csv", index=False)

In [None]:
# load csv
df_test = pd.read_csv("data/processed/df_test.csv")
df_train = pd.read_csv("data/processed/df_train.csv")

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
def dominant_f(y):
    """Given time series y, get frequency of maximum power
    from periodogram"""
    f, p = periodogram(y, scaling='spectrum')
    ind = np.argsort(p)
    f_max = f[ind[-1]]
    return f_max

def max_power(y):
    """Given time series y, get maximum power"""
    f, p = periodogram(y, scaling='spectrum')
    return p.max()
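
To see what these two helpers return, the sketch below applies them to a synthetic sine wave. It restates the helpers so that it runs on its own; the signal parameters are arbitrary choices for illustration. Because `periodogram` is called without a sampling frequency, it uses its default of 1.0, so the returned frequency is in cycles per sample.

```python
import numpy as np
from scipy.signal import periodogram


def dominant_f(y):
    """Frequency (cycles per sample) with the largest periodogram power."""
    f, p = periodogram(y, scaling="spectrum")
    return f[np.argmax(p)]


def max_power(y):
    """Largest value of the periodogram power spectrum."""
    _, p = periodogram(y, scaling="spectrum")
    return p.max()


# Synthetic signal: amplitude 2, frequency 0.1 cycles per sample,
# 100 samples, so the signal completes exactly 10 full cycles.
n = np.arange(100)
y = 2.0 * np.sin(2 * np.pi * 0.1 * n)

f_dom = dominant_f(y)
p_max = max_power(y)
print(f_dom, p_max)
```

For this signal the dominant frequency is 0.1 and the peak power is the squared RMS amplitude, 2² / 2 = 2.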

In [None]:
# we engineer features by aggregating feature values per bookingID:
# min, max, range, mean, std, skewness, kurtosis,
# dominant periodogram frequency, and max periodogram power

df_engg_feats = df_train.drop("second", axis=1)
df_engg_feats = df_engg_feats.groupby(by="bookingID", as_index=True).agg([np.min, np.max, np.ptp, np.mean, np.std, skew, kurtosis, dominant_f, max_power])
df_engg_feats.head()


In [None]:
# flatten column names
cols = ["_".join(col) for col in df_engg_feats.columns]
cols[:5]

In [None]:
df_engg_feats.columns = cols
df_engg_feats.reset_index(inplace=True)
df_engg_feats.head()
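
The aggregate-then-flatten pattern is easier to follow on a small table. The sketch below uses made-up values and only a few of the statistics, passed as string names so the resulting column names do not depend on the pandas version; it also shows `groupby(...).size()` as one way to get the trip length.

```python
import pandas as pd

# Toy telematics data (hypothetical values): two bookings.
toy = pd.DataFrame(
    {
        "bookingID": [1, 1, 1, 2, 2],
        "Speed": [0.0, 2.0, 4.0, 1.0, 3.0],
        "gyro_x": [0.1, 0.2, 0.3, -0.1, 0.1],
    }
)

# One row per bookingID, with a two-level column index: (feature, statistic).
agg_toy = toy.groupby("bookingID").agg(["min", "max", "mean", "std"])

# Flatten ("Speed", "min") into "Speed_min", and so on.
agg_toy.columns = ["_".join(col) for col in agg_toy.columns]

# Number of samples per trip, aligned on the bookingID index.
agg_toy["trip_len"] = toy.groupby("bookingID").size()

agg_toy = agg_toy.reset_index()
print(agg_toy)
```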

In [None]:
# add length of each trip
df_len = df_train.groupby(by="bookingID", as_index=True).agg(len).iloc[:, 0:1]
df_len.columns = ['trip_len']
df_len.reset_index(inplace=True)
df_len.head()

In [None]:
# merge along bookingID
df_engg_feats_2 = pd.merge(df_engg_feats, df_len, how="inner", on="bookingID")
df_engg_feats_2.head()

In [None]:
# inner join with labels (after aggregating and feature engineering)
df_engg_feats_labels = pd.merge(df_engg_feats_2, labels, how="inner", on="bookingID")

# save to CSV
save_path = "data/processed/engg_feats_labels.csv"
df_engg_feats_labels.to_csv(save_path, index=False)

In [None]:
df_engg_feats_labels.shape

In [None]:
df_engg_feats_labels.head()

In [None]:
df_engg_feats_labels.info()

In [None]:
# load saved combined dataset
save_path = "data/processed/engg_feats_labels.csv"
df_engg_feats_labels0 = pd.read_csv(save_path)
df_engg_feats_labels0.drop("bookingID", axis=1, inplace=True)
df_engg_feats_labels0.head()

In [None]:
df_feats = df_engg_feats_labels0.drop("label", axis=1)
df_feats.head()

## EDA

In [None]:
# we look at the correlations between the features using a correlation matrix
plt.figure(figsize=(20, 18))

df_corr_matrix = pd.DataFrame(np.corrcoef(df_engg_feats_labels0.values.T), 
                              index=df_engg_feats_labels0.columns, columns=df_engg_feats_labels0.columns)
# heatmap
sns.heatmap(df_corr_matrix, annot=True, fmt='0.1f', cmap="viridis")

In [None]:
plt.figure(figsize=(7, 15))

# barplot
y = df_corr_matrix.iloc[:, -1].drop("label").values
x = df_corr_matrix.iloc[:, [-1]].drop("label").index
inds = np.argsort(y)
y = y[inds]
x = x[inds]
plt.barh(x, y)
plt.title("Correlations of features with label")
plt.xlabel("correlation with label")
plt.ylabel("feature")
plt.margins(0.02)
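
If only the feature-to-label correlations are needed, the full matrix can be skipped. A minimal sketch using `DataFrame.corrwith` on a toy table follows; the column names and values are made up for illustration.

```python
import pandas as pd

# Toy engineered features with a binary label (hypothetical values).
toy = pd.DataFrame(
    {
        "Speed_max": [1.0, 2.0, 3.0, 4.0],
        "gyro_x_std": [4.0, 3.0, 2.0, 1.0],
        "label": [0, 0, 1, 1],
    }
)

# Pearson correlation of every feature column with the label,
# sorted from most negative to most positive.
corr_with_label = (
    toy.drop(columns="label").corrwith(toy["label"]).sort_values()
)
print(corr_with_label)
```

The resulting series can be passed to `plt.barh(corr_with_label.index, corr_with_label.values)` to draw the same kind of bar chart.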

In [None]:
# number of samples per class
print(Counter(df_engg_feats_labels0.label))
y = Counter(df_engg_feats_labels0.label).values()
x = Counter(df_engg_feats_labels0.label).keys()

plt.bar(x, y)
plt.xlabel("class")
plt.ylabel("counts")

## Preprocessing for Models

In [None]:
# equalizing the number of samples per class by undersampling
# getting number of samples for class 1 (the minority class)
num_per_class = len(df_engg_feats_labels0[df_engg_feats_labels0.label == 1])
print(num_per_class)

# getting a sample for class 1
df_labels_1 = df_engg_feats_labels0[df_engg_feats_labels0.label == 1].sample(n=num_per_class, replace=False, random_state=42)

# getting a sample for class 0
df_labels_0 = df_engg_feats_labels0[df_engg_feats_labels0.label == 0].sample(n=num_per_class, replace=False, random_state=42)
print(len(df_labels_0))

In [None]:
# concatenate and shuffle
df_for_ml = shuffle(pd.concat([df_labels_1, df_labels_0]), random_state=42).reset_index(drop=True)

# shape
df_for_ml.shape
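
The same undersampling can be written without naming each class explicitly. The sketch below is one possible compact form, assuming a pandas version that provides `DataFrameGroupBy.sample` (1.1 or later); the toy data are made up.

```python
import pandas as pd

# Toy imbalanced dataset (hypothetical values): 7 of class 0, 3 of class 1.
toy = pd.DataFrame({"feat": range(10), "label": [0] * 7 + [1] * 3})

# Size of the smallest class.
n_minority = toy["label"].value_counts().min()

balanced = (
    toy.groupby("label", group_keys=False)
    .sample(n=n_minority, random_state=42)  # undersample every class
    .sample(frac=1.0, random_state=42)      # shuffle the rows
    .reset_index(drop=True)
)
print(balanced["label"].value_counts())
```

Undersampling discards majority-class rows, so the balanced set is only twice the size of the minority class.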

In [None]:
# save
df_for_ml.to_csv("data/processed/df_for_ml.csv", index=False)