# Block 6 Exercise 2: finding the best parameters for predicting the fare of taxi rides
We return to our Random Forest Regression and want to automatically optimize all free parameters ...

In [1]:
import pandas as pd
import numpy as np
import folium
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from skopt import BayesSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from skopt.space import Categorical, Integer, Real

In [2]:
# we load the data we have saved after wrangling and pre-processing in block I
X=pd.read_csv('../../DATA/train_cleaned.csv')
drop_columns=['Unnamed: 0','Unnamed: 0.1','Unnamed: 0.1.1','key','pickup_datetime','pickup_date','pickup_latitude_round3','pickup_longitude_round3','dropoff_latitude_round3','dropoff_longitude_round3']
X=X.drop(drop_columns,axis=1)
X=pd.get_dummies(X)  # one-hot encode the categorical columns
#generate labels
y=X['fare_amount']
X=X.drop(['fare_amount'],axis=1)
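
As a small illustration of the one-hot encoding and label split in the cell above, the sketch below applies `pd.get_dummies` to a toy frame. The column `weekday` and all values are made up for this example and are not taken from `train_cleaned.csv`.

```python
import pandas as pd

# Toy frame; 'weekday' is a hypothetical categorical column (assumption).
toy = pd.DataFrame({
    'fare_amount': [7.5, 12.0, 5.3],
    'weekday': ['Mon', 'Tue', 'Mon'],
})

# Each category becomes its own indicator column; numeric columns pass through.
toy_encoded = pd.get_dummies(toy)

# Separate the label from the features, as in the cell above.
toy_y = toy_encoded['fare_amount']
toy_X = toy_encoded.drop(['fare_amount'], axis=1)
print(list(toy_X.columns))
```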

### Scikit Optimize
Scikit Optimize (https://scikit-optimize.github.io/stable/index.html) is an AutoML toolbox wrapped around Scikit-Learn. It allows us to run state-of-the-art automatic hyper-parameter optimization on top of our learning algorithms.



In [3]:
# install 
!pip install scikit-optimize

Collecting scikit-optimize
  Downloading scikit_optimize-0.8.1-py2.py3-none-any.whl (101 kB)
Collecting pyaml>=16.9
  Downloading pyaml-20.4.0-py2.py3-none-any.whl (17 kB)
Installing collected packages: pyaml, scikit-optimize
Successfully installed pyaml-20.4.0 scikit-optimize-0.8.1


### E 2.1 Bayesian Optimization of a Random Forest Regression Model
Use Bayesian Optimization with Cross-Validation (https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html#skopt.BayesSearchCV) to find the best regression model. Compare
* linear regression (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) 
* Random Forest regression (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor)
* and SVM regression (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR)

NOTES: this can become quite compute-intensive! Hence,
* use a smaller subset of the training data to run the experiments
* think about the range of your parameters (e.g. a larger number of trees in RF or high C-values in SVM will make models expensive)
* optimize only the following parameters per model type:
    * linear: no parameters to optimize
    * RF: #trees and depth
    * SVM: C and gamma (use RBF kernel)
* parallelize -> n_jobs
* use Colab to run the job for up to 12h


## Linear Regression

In [3]:
X_train_LR, X_test_LR, Y_train_LR, Y_test_LR = train_test_split(X, y, train_size = 300000)

In [4]:
pipe = make_pipeline(StandardScaler(), LinearRegression(n_jobs = -2))

In [5]:
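# Linear regression has no real hyper-parameter to tune; fit_intercept is a
# boolean flag and Integer(0,1) stands in for False/True here.
# Categorical([True, False]) would express the same choice more directly.
Bayes_LR = BayesSearchCV(pipe, {'linearregression__fit_intercept':Integer(0,1)})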

In [6]:
Bayes_LR.fit(X_train_LR, Y_train_LR)



BayesSearchCV(estimator=Pipeline(steps=[('standardscaler', StandardScaler()),
                                        ('linearregression',
                                         LinearRegression(n_jobs=-2))]),
              search_spaces={'linearregression__fit_intercept': Integer(low=0, high=1, prior='uniform', transform='identity')})

In [7]:
Bayes_LR.best_params_

OrderedDict([('linearregression__fit_intercept', 1)])

In [8]:
Bayes_LR.score(X_test_LR, Y_test_LR)

0.7324761899308405

## Random Forest

In [27]:
X_train_RF, X_test_RF, Y_train_RF, Y_test_RF = train_test_split(X, y, train_size = 100000)

In [28]:
pipe = make_pipeline(StandardScaler(), RandomForestRegressor(n_jobs = -2))

In [35]:
Bayes_RF = BayesSearchCV(pipe, {'randomforestregressor__n_estimators':Integer(25,40), 'randomforestregressor__max_depth':Integer(5,15)})

In [36]:
Bayes_RF.fit(X_train_RF, Y_train_RF)



BayesSearchCV(estimator=Pipeline(steps=[('standardscaler', StandardScaler()),
                                        ('randomforestregressor',
                                         RandomForestRegressor(n_jobs=-2))]),
              search_spaces={'randomforestregressor__max_depth': Integer(low=5, high=15, prior='uniform', transform='identity'),
                             'randomforestregressor__n_estimators': Integer(low=25, high=40, prior='uniform', transform='identity')})

In [37]:
Bayes_RF.best_params_

OrderedDict([('randomforestregressor__max_depth', 13),
             ('randomforestregressor__n_estimators', 40)])

In [38]:
Bayes_RF.score(X_test_RF, Y_test_RF)

0.8036131022766586

## SVM regression

In [15]:
X_train_SVM, X_test_SVM, Y_train_SVM, Y_test_SVM = train_test_split(X, y, train_size = 10000)

In [16]:
pipe = make_pipeline(StandardScaler(), SVR(kernel = 'rbf'))

In [17]:
Bayes_SVM = BayesSearchCV(pipe, {'svr__C':Integer(3,6), 'svr__gamma':Categorical(['auto', 'scale'])})

In [18]:
Bayes_SVM.fit(X_train_SVM, Y_train_SVM)



BayesSearchCV(estimator=Pipeline(steps=[('standardscaler', StandardScaler()),
                                        ('svr', SVR())]),
              search_spaces={'svr__C': Integer(low=3, high=6, prior='uniform', transform='identity'),
                             'svr__gamma': Categorical(categories=('auto', 'scale'), prior=None)})

In [19]:
Bayes_SVM.best_params_

OrderedDict([('svr__C', 6), ('svr__gamma', 'auto')])

In [20]:
Bayes_SVM.score(X_test_SVM, Y_test_SVM)

0.7465752407614337

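The Random Forest regressor is the best model in this case when weighing score against computation time: it reaches a test score of about 0.80, compared with about 0.75 for the SVM and about 0.73 for linear regression.

The SVM scores lower than the Random Forest, but this is partly due to the small number of training samples (10,000 instead of 100,000). The reduction was necessary to limit the computation time: even so, the SVM search took almost one hour.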
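
Because the three models above were trained and tested on different splits of different sizes, their scores are not strictly comparable. One way to check the ranking is to fit all three on the same split and record both the score and the fit time. The sketch below does this on synthetic data (an assumption; the data and sample sizes are placeholders) with the best parameters reported above. Since the synthetic target is linear by construction, the ranking it prints says nothing about the taxi data; only the procedure carries over.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the taxi data (assumption: not the real data set).
X_demo, y_demo = make_regression(n_samples=1500, n_features=8,
                                 noise=10.0, random_state=0)

# One shared split, so every model sees the same training and test rows.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          train_size=1000, random_state=0)

models = {
    'linear': make_pipeline(StandardScaler(), LinearRegression()),
    'random forest': make_pipeline(
        StandardScaler(),
        RandomForestRegressor(n_estimators=40, max_depth=13, random_state=0)),
    'svm': make_pipeline(StandardScaler(),
                         SVR(kernel='rbf', C=6, gamma='auto')),
}

scores = {}
fit_seconds = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    fit_seconds[name] = time.perf_counter() - start
    # For regressors, score() returns the coefficient of determination R^2.
    scores[name] = model.score(X_te, y_te)
    print(f'{name}: R^2 = {scores[name]:.3f}, fit time = {fit_seconds[name]:.2f} s')
```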