# Model Building

With the data ready, I'll test different prediction models. In this case, I will use three regression models: linear, lasso, and random forest. I will also evaluate their results through cross-validation.

In [12]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from mlxtend.regressor import StackingCVRegressor

In [2]:
df = pd.read_csv("eda_data.csv")

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,travel_time,distance,traffic_rating,car_or_bus,day,hour
0,0,8.282933,2.652,3,1,5,7
1,1,10.289083,5.29,3,1,5,13
2,2,4.061917,0.918,3,2,2,5
3,3,23.372667,7.7,3,2,2,5
4,4,9.288033,3.995,2,1,4,15


In [4]:
df = df.drop("Unnamed: 0", axis=1)

In [5]:
df_dum = pd.get_dummies(df) #One-hot encode categorical columns (numeric columns pass through unchanged)

X = df_dum.drop('travel_time', axis=1)
y = df_dum.travel_time.values

#Data split, 20% for test and 80% for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 
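
One caveat about `pd.get_dummies` in the cell above: by default it only encodes object, string, and category columns, so integer-coded fields such as `car_or_bus` or `day` pass through unchanged. The sketch below uses a tiny made-up frame (not the real data) to show the difference when the column is named explicitly. Whether one-hot encoding would actually help these models is an assumption that would need testing.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the notebook's data
toy = pd.DataFrame({
    "distance": [2.6, 5.3, 0.9],
    "car_or_bus": [1, 1, 2],   # integer-coded category
})

# Default call: no object/category columns, so nothing is encoded
default_dummies = pd.get_dummies(toy)

# Naming the column forces one-hot encoding of the integer codes
explicit_dummies = pd.get_dummies(toy, columns=["car_or_bus"])

print(list(default_dummies.columns))
print(list(explicit_dummies.columns))
```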

In [6]:
#Linear regression model from the sklearn package
lr = LinearRegression()
lr.fit(X_train, y_train)

np.mean(cross_val_score(lr,X_train,y_train, scoring = 'r2', cv= 5))

0.6394889269480691
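
For reference, `cross_val_score` returns one score per fold, and the cell above averages them. A minimal sketch on synthetic data (generated with `make_regression`, not the travel-time data) shows the per-fold values and their spread, which can be as informative as the mean.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the real notebook uses X_train / y_train
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# One R^2 value per fold (cv=5 gives five values)
fold_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                              scoring="r2", cv=5)

print(fold_scores)
print("mean:", np.mean(fold_scores), "std:", np.std(fold_scores))
```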

In [7]:
#Just wanted to try a Random Forest, since I heard it works really well for regression
rf = RandomForestRegressor()
rf.fit(X_train, y_train)

np.mean(cross_val_score(rf,X_train,y_train,scoring = 'r2', cv= 5))

0.5496188108166062

In [17]:
#Lasso regression from sklearn
lasso = Lasso(alpha=.13)
lasso.fit(X_train,y_train)

np.mean(cross_val_score(lasso,X_train,y_train, scoring = 'r2', cv= 5))

0.6409531664227399

In [14]:
pred_lr = lr.predict(X_test)
mean_absolute_error(y_test,pred_lr)

6.9488136343796265

In [15]:
pred_rf = rf.predict(X_test)
mean_absolute_error(y_test,pred_rf)

7.6264008955938705

In [18]:
pred_lasso = lasso.predict(X_test)
mean_absolute_error(y_test,pred_lasso)

6.864456453404545

The linear and lasso regression models clearly outperformed the random forest on both cross-validated R² and test MAE. That does not surprise me: lasso regression, for example, handles weakly correlated variables well (like the ones in this dataset) because its L1 penalty shrinks the coefficients of uninformative features toward zero.
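
To illustrate the point about lasso, here is a small sketch on synthetic data where only two of six features carry signal. It is an illustration under made-up assumptions (the data, the feature count, and the coefficients are all hypothetical; only `alpha=0.13` is borrowed from the cell above), not a reproduction of the results in this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 6))
# Only the first two columns influence the target; the rest are noise
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=500)

ols = LinearRegression().fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.13).fit(X_demo, y_demo)

# OLS keeps small non-zero weights on the noise features;
# lasso's L1 penalty typically sets them exactly to zero
print("OLS coefficients:  ", np.round(ols.coef_, 3))
print("Lasso coefficients:", np.round(lasso_demo.coef_, 3))
```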

## Stacking Models

A good practice for improving predictions is "model stacking", which is nothing more than using the predictions of the initial models as new features for a second model (the meta-learner). Following Casper Hansen's article on this topic, I used the mlxtend package, as it integrates well with scikit-learn.
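
The idea can be sketched with plain scikit-learn: out-of-fold predictions from the base models become extra columns for the meta-learner. This is a simplified illustration on synthetic data, not mlxtend's actual implementation, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data, not the travel-time dataset
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=15.0, random_state=0)

base_models = [LinearRegression(), Lasso(alpha=0.13)]

# Out-of-fold predictions: each row is predicted by a model that never saw it
meta_features = np.column_stack([
    cross_val_predict(model, X_demo, y_demo, cv=5) for model in base_models
])

# Meta-learner trained on the original features plus the base predictions
# (the analogue of use_features_in_secondary=True)
stacked_input = np.hstack([X_demo, meta_features])
meta_learner = LinearRegression().fit(stacked_input, y_demo)

print(meta_features.shape, stacked_input.shape)
```

Using out-of-fold predictions matters here: if the base models predicted rows they were trained on, the meta-learner would see overly optimistic inputs.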

In [19]:
#the linear model gave very good results as a meta-regressor
stack = StackingCVRegressor(regressors=(lr, lasso),
                            meta_regressor=lr, cv=5,
                            use_features_in_secondary=True,
                            store_train_meta_features=True,
                            shuffle=False,
                            random_state=42)

stack.fit(X_train, y_train)

np.mean(cross_val_score(stack,X_train,y_train, scoring = 'r2', cv= 5))

0.6554018302367147

*Cross-validated R² went from 0.639 to 0.655, about 1.6 points over the first model, hehe*

In [22]:
pred_stack = stack.predict(X_test)
mean_absolute_error(y_test,pred_stack)

6.510461639639896

In [24]:
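#Predict on the full dataset (train + test); this MAE is optimistic because 80% of the rows were seen during training
total_pred = stack.predict(X)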
mean_absolute_error(y,total_pred)

5.407176932827016

In [38]:
#Save the predictions from the stacked model in a new dataframe
results = df[["distance","traffic_rating","car_or_bus","day","hour","travel_time"]].copy()
results = results.rename({"travel_time":"real_travel_time"}, axis=1)
results["pred_travel_time"] = total_pred

In [39]:
results.head()

Unnamed: 0,distance,traffic_rating,car_or_bus,day,hour,real_travel_time,pred_travel_time
0,2.652,3,1,5,7,8.282933,9.016553
1,5.29,3,1,5,13,10.289083,14.624929
2,0.918,3,2,2,5,4.061917,6.092793
3,7.7,3,2,2,5,23.372667,23.044601
4,3.995,2,1,4,15,9.288033,13.691697
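
As a sanity check on the metric, the MAE for just the five rows displayed above can be computed by hand as the mean of the absolute differences. The values are copied from the table; such a small sample is only illustrative and says nothing about the overall error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Values copied from the five rows of results.head()
real = np.array([8.282933, 10.289083, 4.061917, 23.372667, 9.288033])
pred = np.array([9.016553, 14.624929, 6.092793, 23.044601, 13.691697])

# MAE = mean(|real - pred|)
mae_manual = np.mean(np.abs(real - pred))
mae_sklearn = mean_absolute_error(real, pred)

print(mae_manual, mae_sklearn)
```

Both values come out to about 2.37 for these five rows.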


In [40]:
results.to_csv('results.csv', index=False)