# Zindi UmojaHack South Africa: Yassir ETA Prediction Challenge by UmojaHack Africa
**Team**
- Taahir Bhorat
- Stuart Mesham
- Mahmood-Ali Parker

**Date:** 25 July 2020
 
**Time Constraint:** 9 hours
 
**URL:** https://zindi.africa/hackathons/umojahack-south-africa-yassir-eta-prediction-challenge

## Spec
Ride-hailing apps like Uber and Yassir rely on real-time data and machine learning algorithms to automate their services. Accurately predicting the estimated time of arrival (ETA) for Yassir trips will make Yassir’s services more reliable and attractive; this will have a direct and indirect impact on both customers and business partners. The solution would help the company save money and allocate more resources to other parts of the business.

The objective of this hackathon is to predict the estimated time of arrival at the dropoff point for a single Yassir journey.

## The Data
The data contains details for 119,549 trips (train and test are split by date). Each row contains a start location and end location (reported as latitude and longitude to within approximately 100m) and the travel distance along the fastest route. Each trip also has a timestamp, which can be used to pull the weather for that day from the Weather.csv file. The weather data includes temperature, rainfall and wind speed for the time period during which the trip data was collected.
 
For confidentiality reasons, the data itself isn't present in the repository, but the notebook should provide an adequate overview of our modelling process.

## Package Imports

In [38]:
import pandas as pd
from xgboost.sklearn import XGBRegressor
from catboost import CatBoostRegressor

## Dataframe Population and Feature Engineering
We also wrote methods to load the weather data and join it to our main dataframe by date. However, we found that using the weather variables severely decreased performance, so we ended up omitting weather entirely.
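
For reference, a date-based join of this kind might look like the minimal sketch below. It is not the helper we used during the hackathon: the weather column names (`date`, `mean_temp`, `rain_mm`) and all the toy rows are hypothetical placeholders rather than the actual Weather.csv schema.

```python
import pandas as pd

# Toy stand-ins for the trip and weather tables.
# All column names and values here are illustrative assumptions.
trips = pd.DataFrame({
    'ID': ['a', 'b', 'c'],
    'Timestamp': ['2019-12-01T08:15:00Z', '2019-12-01T17:40:00Z', '2019-12-02T09:05:00Z'],
})
weather = pd.DataFrame({
    'date': ['2019-12-01', '2019-12-02'],
    'mean_temp': [14.2, 12.9],
    'rain_mm': [0.0, 3.1],
})

# Reduce each trip timestamp to a calendar date so it can be matched to a daily weather row.
trips['date'] = pd.to_datetime(trips['Timestamp']).dt.date
weather['date'] = pd.to_datetime(weather['date']).dt.date

# A left join keeps every trip, even if no weather row exists for its date.
joined = trips.merge(weather, on='date', how='left')
print(joined[['ID', 'date', 'mean_temp', 'rain_mm']])
```

A left join is used here so that trips without a matching weather row are kept (with missing values) rather than silently dropped.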
 
The pre_process function parses our dataframe's Timestamp column and encodes components of the time into their own columns. We only used day of week and hour because we found the other components detracted from our model performance.

In [39]:
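def pre_process(df):
    # Parse the raw timestamp strings into pandas datetimes.
    start_time = pd.to_datetime(df['Timestamp'])

    # Work on a copy so the caller's dataframe is left untouched.
    df = df.copy()

    # Keep only the two time components that helped the model.
    df['Day_in_week'] = start_time.dt.dayofweek  # Monday = 0 ... Sunday = 6
    df['Hour_in_Day'] = start_time.dt.hour       # 0-23
    df = df.drop('Timestamp', axis=1)

    return df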
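
As a quick illustration of the two encoded components, the sketch below extracts them from a single made-up timestamp (the hackathon date, chosen only as an example). It uses the same pandas `.dt` accessors as pre_process.

```python
import pandas as pd

# One illustrative timestamp: 25 July 2020 (a Saturday) at 14:30.
stamps = pd.to_datetime(pd.Series(['2020-07-25 14:30:00']))

day_in_week = int(stamps.dt.dayofweek.iloc[0])  # pandas counts Monday as 0, so Saturday is 5
hour_in_day = int(stamps.dt.hour.iloc[0])       # hour on a 24-hour clock

print(day_in_week, hour_in_day)  # prints "5 14"
```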

The clean_training_set method removes rows with unreasonably high average speeds (faster than 200 km/h).

In [40]:
def clean_training_set(trips_df):
    return trips_df[(trips_df['Trip_distance'] / 1000) / (trips_df['ETA'] / (60 * 60)) <= 200]
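
To make the unit conversion concrete, here is a small sketch of the same filter on toy data. The rows are invented; the assumption that Trip_distance is in metres and ETA is in seconds is inferred from the divisions by 1000 and 60 * 60 in clean_training_set.

```python
import pandas as pd

# Toy trips; Trip_distance is assumed to be in metres and ETA in seconds,
# which is what the unit conversions in clean_training_set imply.
toy = pd.DataFrame({
    'Trip_distance': [10_000, 50_000, 3_000],
    'ETA': [600, 600, 300],
})

# metres -> km and seconds -> hours gives the average speed in km/h.
speed_kmh = (toy['Trip_distance'] / 1000) / (toy['ETA'] / 3600)

# Keep only rows at or below the 200 km/h cut-off.
kept = toy[speed_kmh <= 200]
print(kept)
```

The middle row covers 50 km in 10 minutes, an average of 300 km/h, so it is the only one removed.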

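The split_X_y method splits a dataframe into features and target: it returns the dataframe with the ETA and ID columns dropped (X) together with the ETA column on its own (y).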

In [41]:
def split_X_y(df):
    return df.drop(['ETA', 'ID'], axis=1), df['ETA']

We read 'Train.csv', which contains all the labelled data we are given for the hackathon, into a dataframe. We then sort it by timestamp (most recent first), clean it and pre-process it.

In [42]:
data = pd.read_csv('Train.csv')

data = data.sort_values('Timestamp', ascending=False)
data = clean_training_set(data)
data = pre_process(data)

In [43]:
data.shape

(83904, 9)

Next we split our data into training and validation sets. We don't have a test set because the hackathon submission fulfills that role. Because the data is sorted with the most recent trips first, the first 8000 rows (the most recent trips) form the validation set. This is about a 90:10 split, and 8000 items provides adequate data for accurate validation.

In [44]:
train = data.iloc[8000:]
val = data.iloc[:8000]
X_train, y_train = split_X_y(train)
X_val, y_val = split_X_y(val)
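
The effect of sorting newest-first and then slicing can be checked on a toy frame, as in the sketch below. The rows are invented, `n_val` is a stand-in for the 8000 used above, and the sketch assumes ISO-formatted timestamp strings so that string order matches chronological order.

```python
import pandas as pd

# Ten toy rows with ISO-format timestamps (illustrative values only).
toy = pd.DataFrame({
    'Timestamp': pd.date_range('2020-01-01', periods=10, freq='D').astype(str),
    'ETA': range(10),
})

n_val = 2  # stand-in for the 8000 rows held out in the notebook

# Newest first, as in the notebook's sort_values(..., ascending=False).
toy = toy.sort_values('Timestamp', ascending=False)

val = toy.iloc[:n_val]    # the most recent trips
train = toy.iloc[n_val:]  # everything earlier

# Every validation trip is later than every training trip.
assert val['Timestamp'].min() > train['Timestamp'].max()
```

Holding out the latest trips mirrors the way the competition's train and test sets are split by date.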

## Modelling
Initially, we used both XGBoost and CatBoost, but after extensive experimentation we got better performance with CatBoost. We did, however, use XGBoost's feature importance functionality for some invaluable insights into variable selection for what would eventually be our 4th place model.
 
Our performance metric of choice is root mean squared error (RMSE) as that's what the hackathon spec required.
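
For reference, RMSE is the square root of the mean of the squared differences between predicted and actual values. The helper below is a minimal NumPy sketch of that definition (the name `rmse` and the toy ETAs are ours, for illustration only); both libraries compute the metric internally during training.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy ETAs in seconds (illustrative values only): errors of -30, +30 and 0.
print(rmse([600, 900, 1200], [630, 870, 1200]))
```

Because the errors are squared before averaging, a few badly mispredicted trips raise the score more than many small misses.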
### XGBoost
We create our XGBoost regression model first. These hyperparameters were the most effective we found for XGBoost after extensive testing.

In [34]:
xgb1 = XGBRegressor(
    learning_rate=0.1,
    objective='reg:squarederror',
    tree_method='hist',
    n_estimators=7000,
    max_depth=20,
    max_leaves=120,
    min_child_weight=2,
    gamma=0,
    subsample=0.7,
    colsample_bytree=0.8,
    scale_pos_weight=1)

[0]	validation_0-rmse:1126.08	validation_1-rmse:1120.14
Multiple eval metrics have been passed: 'validation_1-rmse' will be used for early stopping.

Will train until validation_1-rmse hasn't improved in 15 rounds.
[200]	validation_0-rmse:124.003	validation_1-rmse:159.849
[400]	validation_0-rmse:103.003	validation_1-rmse:153.334
Stopping. Best iteration:
[534]	validation_0-rmse:94.4361	validation_1-rmse:151.855



XGBRegressor(colsample_bytree=0.8, max_depth=20, max_leaves=120,
             min_child_weight=2, n_estimators=7000,
             objective='reg:squarederror', subsample=0.7, tree_method='hist')

Next we fit the model on our training data. early_stopping_rounds is set to 15, so training stops once the validation RMSE has not improved for 15 rounds, which helps prevent overfitting.

In [None]:
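# Newer XGBoost releases take eval_metric and early_stopping_rounds
# in the XGBRegressor constructor rather than in fit().
xgb1.fit(
    X_train,
    y_train,
    eval_metric="rmse",
    eval_set=[(X_train, y_train), (X_val, y_val)],
    verbose=200,
    early_stopping_rounds=15)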
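
The stopping rule itself is simple. The sketch below is a library-free illustration of "stop when the validation metric has not improved for `patience` rounds". It is not XGBoost's internal implementation, and the function name and toy scores are hypothetical.

```python
def best_round_with_patience(val_scores, patience):
    """Return (best_round, stop_round) for a lower-is-better validation metric."""
    best_round = 0
    for i, score in enumerate(val_scores):
        if score < val_scores[best_round]:
            best_round = i
        elif i - best_round >= patience:
            return best_round, i  # no improvement for `patience` rounds: stop here
    return best_round, len(val_scores) - 1  # ran out of rounds without triggering

# Toy validation RMSE curve (illustrative values only): improves, then drifts upward.
scores = [200.0, 170.0, 160.0, 155.0, 156.0, 157.0, 158.0, 159.0]
print(best_round_with_patience(scores, patience=3))  # prints "(3, 6)"
```

With a patience of 3, the best score is at round 3 and training stops at round 6, three rounds later.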

We can see our best validation RMSE is approximately 151.9, reached at iteration 534. While this could possibly be improved, we'll show that the CatBoost model provided slightly better predictive performance.
 
Next we compute feature importances using our xgb1 model to see which variables are more useful for predicting ETA. This was useful in identifying that weather variables were detrimental to performance, which led us to ultimately omit weather variables and certain time components. The lower the value, the less useful the variable.

In [36]:
pd.DataFrame({'Variable':X_train.columns,
              'Importance':xgb1.feature_importances_}).sort_values('Importance', ascending=False)

| | Variable | Importance |
|---|---|---|
| 4 | Trip_distance | 0.717411 |
| 3 | Destination_lon | 0.100044 |
| 1 | Origin_lon | 0.062338 |
| 2 | Destination_lat | 0.046074 |
| 0 | Origin_lat | 0.036616 |
| 6 | Hour_in_Day | 0.025214 |
| 5 | Day_in_week | 0.012304 |
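
To read the table at a glance, the importances can be grouped by feature type. The sketch below re-enters the values from the table above by hand and sums them per group. The grouping into "location" and "time" is our own illustration, not part of the original notebook.

```python
import pandas as pd

# Importances copied from the table above.
importance = pd.DataFrame({
    'Variable': ['Trip_distance', 'Destination_lon', 'Origin_lon', 'Destination_lat',
                 'Origin_lat', 'Hour_in_Day', 'Day_in_week'],
    'Importance': [0.717411, 0.100044, 0.062338, 0.046074,
                   0.036616, 0.025214, 0.012304],
})

# Running total, to see how quickly the scores add up towards 1.
importance['Cumulative'] = importance['Importance'].cumsum()

# Group the coordinate and time features to compare them with trip distance.
is_location = importance['Variable'].str.contains('lat|lon')
is_time = importance['Variable'].str.contains('Day|Hour')
location = importance.loc[is_location, 'Importance'].sum()
time_of_trip = importance.loc[is_time, 'Importance'].sum()

print(importance)
print(round(location, 6), round(time_of_trip, 6))
```

On these numbers, trip distance alone accounts for about 72% of the total importance, the four coordinates for about 25%, and the two time features for under 4%.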


### CatBoost
We create our CatBoost Regression model structure next. This model structure was the most effective overall.

In [45]:
cb1 = CatBoostRegressor(
    loss_function='RMSE',
    iterations=7000,
    grow_policy='Lossguide',
    bootstrap_type='Bayesian',
    max_leaves=120,
    task_type='CPU'
)

Next we fit the model on our training data. We didn't use early stopping or any other measure to prevent overfitting because our model was most accurate when it completed all the iterations.

In [46]:
cb1.fit(
    X_train, y_train,
    eval_set=(X_val, y_val),
    verbose=200
)

Learning rate set to 0.031071
0:	learn: 550.2847104	test: 537.1941735	best: 537.1941735 (0)	total: 84.6ms	remaining: 9m 52s
200:	learn: 175.3634796	test: 181.0933405	best: 181.0933405 (200)	total: 11.4s	remaining: 6m 24s
400:	learn: 157.2134543	test: 168.7192825	best: 168.7192825 (400)	total: 22.3s	remaining: 6m 7s
600:	learn: 146.5637327	test: 162.3385801	best: 162.3385801 (600)	total: 33.4s	remaining: 5m 56s
800:	learn: 139.5812420	test: 158.9032908	best: 158.9032908 (800)	total: 44.4s	remaining: 5m 43s
1000:	learn: 133.9608654	test: 156.5615393	best: 156.5615393 (1000)	total: 55.4s	remaining: 5m 31s
1200:	learn: 129.4984824	test: 155.0104922	best: 155.0053389 (1198)	total: 1m 5s	remaining: 5m 18s
1400:	learn: 125.5763762	test: 153.7655379	best: 153.7655379 (1400)	total: 1m 15s	remaining: 5m
1600:	learn: 122.2885239	test: 152.7685718	best: 152.7685718 (1600)	total: 1m 24s	remaining: 4m 45s
1800:	learn: 119.5511852	test: 151.9945630	best: 151.9945630 (1800)	total: 1m 32s	remaining: 4m

<catboost.core.CatBoostRegressor at 0x258ebf6efc8>

Our best validation RMSE is 146. This is the final model that got our team 4th place. We got this result less than an hour before the deadline, so it's fair to assume that there's room for at least a little improvement.
