### Model building

As we now have our features and targets, we start model building.

We start by splitting the data into training and test sets. I found that for time series data it is better to split at a cutoff date/time than to split randomly, because a random split lets the model train on observations that come after the ones it is tested on.

We start with baseline models to gauge model performance, then iterate further.

We use mean absolute error (MAE) to evaluate model performance.

In this notebook we build our baseline models.

### Baseline model 1

Here, we just want a quick and easy approximate estimate of what a good or bad value of our error metric looks like.

This baseline model simply takes the observed demand for the last hour and uses it as the predicted demand for the next hour.

Loading the training data.

In [2]:
import sys
sys.path.append(r"C:\Users\User\capstone_project")

In [3]:
import pandas as pd
from src.paths import TRANSFORMED_DATA_DIR

df = pd.read_parquet(TRANSFORMED_DATA_DIR / 'tabular_data.parquet')
df

| | rides_previous_672_hour | rides_previous_671_hour | rides_previous_670_hour | rides_previous_669_hour | rides_previous_668_hour | rides_previous_667_hour | rides_previous_666_hour | rides_previous_665_hour | rides_previous_664_hour | rides_previous_663_hour | ... | rides_previous_7_hour | rides_previous_6_hour | rides_previous_5_hour | rides_previous_4_hour | rides_previous_3_hour | rides_previous_2_hour | rides_previous_1_hour | pickup_hour | pickup_location_id | target_rides_next_hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | ... | 2.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2022-01-29 | 1 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 1.0 | 2.0 | 1.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2022-01-30 | 1 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2022-01-31 | 1 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 2.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 2022-02-01 | 1 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2022-02-02 | 1 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 89300 | 3.0 | 0.0 | 2.0 | 3.0 | 2.0 | 3.0 | 13.0 | 8.0 | 9.0 | 9.0 | ... | 6.0 | 5.0 | 3.0 | 1.0 | 6.0 | 1.0 | 3.0 | 2022-12-27 | 265 | 3.0 |
| 89301 | 6.0 | 4.0 | 0.0 | 0.0 | 2.0 | 0.0 | 14.0 | 7.0 | 8.0 | 4.0 | ... | 4.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 8.0 | 2022-12-28 | 265 | 1.0 |
| 89302 | 7.0 | 2.0 | 3.0 | 4.0 | 7.0 | 4.0 | 10.0 | 9.0 | 7.0 | 11.0 | ... | 2.0 | 3.0 | 5.0 | 1.0 | 1.0 | 0.0 | 8.0 | 2022-12-29 | 265 | 3.0 |
| 89303 | 6.0 | 5.0 | 4.0 | 3.0 | 0.0 | 3.0 | 11.0 | 12.0 | 9.0 | 10.0 | ... | 3.0 | 3.0 | 1.0 | 2.0 | 0.0 | 1.0 | 2.0 | 2022-12-30 | 265 | 7.0 |


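Next we split the data into train and test sets. Here we use the `train_test_split` function from `src.data_split`, a project helper that splits on a cutoff date (not scikit-learn's function of the same name).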

In [4]:
from datetime import datetime
from src.data_split import train_test_split

X_train, y_train, X_test, y_test = train_test_split(
    df,
    cutoff_date=datetime(2022, 6, 1, 0, 0, 0), # cutoff date is 1 June 2022; this ensures that both datasets are large enough
    target_column_name='target_rides_next_hour'
)

print(f'{X_train.shape=}')
print(f'{y_train.shape=}')
print(f'{X_test.shape=}')
print(f'{y_test.shape=}')

X_train.shape=(32595, 674)
y_train.shape=(32595,)
X_test.shape=(56710, 674)
y_test.shape=(56710,)
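
The implementation of `train_test_split` in `src.data_split` is not shown in this notebook. As a rough illustration only, a cutoff-based split could look like the sketch below. The function name `split_by_cutoff`, the `time_column_name` parameter, and the toy data are assumptions made for this example, not the project's actual code.

```python
from datetime import datetime

import pandas as pd


def split_by_cutoff(
    df: pd.DataFrame,
    cutoff_date: datetime,
    target_column_name: str,
    time_column_name: str = "pickup_hour",  # assumed name of the timestamp column
):
    """Hypothetical helper: rows before the cutoff go to train, the rest to test."""
    train = df[df[time_column_name] < cutoff_date].reset_index(drop=True)
    test = df[df[time_column_name] >= cutoff_date].reset_index(drop=True)

    # Separate the features from the target in each part.
    X_train = train.drop(columns=[target_column_name])
    y_train = train[target_column_name]
    X_test = test.drop(columns=[target_column_name])
    y_test = test[target_column_name]
    return X_train, y_train, X_test, y_test


# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "pickup_hour": pd.to_datetime(
        ["2022-05-30", "2022-05-31", "2022-06-01", "2022-06-02"]
    ),
    "rides_previous_1_hour": [1.0, 2.0, 3.0, 4.0],
    "target_rides_next_hour": [2.0, 3.0, 4.0, 5.0],
})

toy_X_train, toy_y_train, toy_X_test, toy_y_test = split_by_cutoff(
    toy, datetime(2022, 6, 1), "target_rides_next_hour"
)
print(len(toy_X_train), len(toy_X_test))
```

Every training row is strictly earlier than every test row, which is the property a random split would not guarantee.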


In [5]:
# Creating a python class for the baseline model 1

import numpy as np

class BaselineModelPreviousHour:
    """
    Prediction = actual demand observed in the last hour
    """
    def fit(self, X_train: pd.DataFrame, y_train: pd.Series) -> None:
        # Nothing to learn: this baseline has no parameters.
        pass

    def predict(self, X_test: pd.DataFrame) -> pd.Series:
        # Inference: the prediction is the ride count observed in the previous hour.
        return X_test['rides_previous_1_hour']

Now let us create an object to use the above class.

In [6]:
model = BaselineModelPreviousHour() # instantiating the model
predictions = model.predict(X_test) # making predictions
predictions

0        0.0
1        0.0
2        0.0
3        0.0
4        0.0
        ... 
56705    3.0
56706    8.0
56707    8.0
56708    2.0
56709    7.0
Name: rides_previous_1_hour, Length: 56710, dtype: float32

### Evaluating baseline model 1

We are predicting a value (the target, the number of taxi rides in the next hour) that can be any non-negative number, so this is a regression problem.

A standard metric for this is `Mean Absolute Error (MAE)`, the average absolute difference between the predicted values and the actual values. It is expressed in the same units as the target (rides per hour), and lower is better.
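
To make the definition concrete, here is a small hand computation of MAE, i.e. the mean of |actual - predicted|. The four values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up actual and predicted ride counts for four hours.
y_true = np.array([0.0, 3.0, 1.0, 7.0])
y_pred = np.array([0.0, 8.0, 2.0, 2.0])

# Absolute error per observation: 0, 5, 1, 5.
abs_errors = np.abs(y_true - y_pred)

# MAE is their mean: (0 + 5 + 1 + 5) / 4 = 2.75.
mae = abs_errors.mean()
print(f'{mae=:.2f}')
```

An MAE of 2.75 means the predictions are off by 2.75 rides per hour on average.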

In [7]:
from sklearn.metrics import mean_absolute_error

test_mae = mean_absolute_error(y_test, predictions)
print(f'{test_mae=:.4f}')

test_mae=6.0558


### Baseline model 2

As noted earlier, our data has weekly seasonality, meaning that at any given point we can use the observation from exactly one week ago to predict the number of rides in the next hour.
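
One week is 7 * 24 = 168 hours, so the relevant feature is `rides_previous_168_hour`. The quick check below assumes that `rides_previous_k_hour` holds the ride count k hours before the hour being predicted; the timestamp is an arbitrary example.

```python
from datetime import datetime, timedelta

HOURS_PER_WEEK = 7 * 24  # 168

# Arbitrary hour to predict, chosen for illustration only.
prediction_time = datetime(2022, 6, 1, 10)
one_week_earlier = prediction_time - timedelta(hours=HOURS_PER_WEEK)

# Name of the lag feature that holds the observation from one week ago.
lag_column = f'rides_previous_{HOURS_PER_WEEK}_hour'

# A 168-hour lag lands on the same weekday and the same hour of the day.
same_slot = (
    one_week_earlier.weekday() == prediction_time.weekday()
    and one_week_earlier.hour == prediction_time.hour
)
print(lag_column, one_week_earlier, same_slot)
```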

We implement this logic here.

In [8]:
# Creating a python class for the baseline model 2
class BaselineModelPreviousWeek:
    """
    Prediction = actual demand observed at t - 7 days
    """
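    def fit(self, X_train: pd.DataFrame, y_train: pd.Series) -> None:
        # Nothing to learn: this baseline has no parameters.
        pass

    def predict(self, X_test: pd.DataFrame) -> pd.Series:
        # The prediction is the ride count observed 7 * 24 = 168 hours ago.
        return X_test[f'rides_previous_{7*24}_hour']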

In [9]:
# Creating an object to use the above class.
model = BaselineModelPreviousWeek()
predictions = model.predict(X_test)

### Evaluating baseline model 2

In [10]:
test_mae = mean_absolute_error(y_test, predictions)
print(f'{test_mae=:.4f}')

test_mae=3.6811


Predicting from weekly seasonality works well: this model performs better than baseline model 1 on the observed MAE.

baseline_model_1_mae = 6.06 vs. baseline_model_2_mae = 3.68

### Baseline model 3

Now, we take the above logic further. We leverage the weekly seasonality but take the average estimate over the last four weeks.

In [11]:
# Creating a python class for the baseline model 3
class BaselineModelLast4Weeks:
    """
    Prediction = average of the actual demand observed at t - 7 days, t - 14 days, t - 21 days and t - 28 days
    """
    def fit(self, X_train: pd.DataFrame, y_train: pd.Series) -> None:
        # Nothing to learn: this baseline has no parameters.
        pass

    def predict(self, X_test: pd.DataFrame) -> pd.Series:
        # Average the ride counts observed 1, 2, 3 and 4 weeks ago.
        return 0.25 * (
            X_test[f'rides_previous_{7*24}_hour']
            + X_test[f'rides_previous_{2*7*24}_hour']
            + X_test[f'rides_previous_{3*7*24}_hour']
            + X_test[f'rides_previous_{4*7*24}_hour']
        )

In [12]:
# Creating an object to use the above class and making predictions.
model = BaselineModelLast4Weeks()
predictions = model.predict(X_test)

### Evaluating baseline model 3

In [13]:
test_mae = mean_absolute_error(y_test, predictions)
print(f'{test_mae=:.4f}')

test_mae=3.1963


As observed, the model improved further: the MAE dropped from 3.68 to 3.20.
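
To put the three results side by side, the sketch below computes each baseline's MAE reduction relative to baseline model 1. The MAE values are the ones reported above; the dictionary labels are just descriptive names chosen for this example.

```python
# Test MAE reported above for each baseline.
maes = {
    'baseline 1 (previous hour)': 6.0558,
    'baseline 2 (previous week)': 3.6811,
    'baseline 3 (last 4 weeks average)': 3.1963,
}

reference = maes['baseline 1 (previous hour)']

# Percentage reduction in MAE relative to baseline model 1.
reductions = {
    name: 100 * (reference - mae) / reference
    for name, mae in maes.items()
}

for name, mae in maes.items():
    print(f'{name}: MAE={mae:.4f}, reduction vs. baseline 1={reductions[name]:.1f}%')
```

Relative to baseline model 1, baseline model 2 cuts the error by roughly 39% and baseline model 3 by roughly 47%.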

Now that the baseline models give us a reference point for model performance, let us move on to building machine learning models.