# Block 6 Exercise 2: finding the best parameters for predicting the fare of taxi rides
We return to our Random Forest regression and now want to optimize its free parameters automatically.

In [5]:
!pip install folium

Collecting folium
  Downloading folium-0.12.1-py2.py3-none-any.whl (94 kB)
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.1


In [6]:
import pandas as pd
import numpy as np
import folium


In [7]:
# Load the data saved after wrangling and pre-processing in block I
X = pd.read_csv('../../DATA/train_cleaned.csv')
drop_columns = ['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1', 'key',
                'pickup_datetime', 'pickup_date',
                'pickup_latitude_round3', 'pickup_longitude_round3',
                'dropoff_latitude_round3', 'dropoff_longitude_round3']
X = X.drop(drop_columns, axis=1)
X = pd.get_dummies(X)  # one-hot encode the categorical columns
# Separate the labels (the fare) from the features
y = X['fare_amount']
X = X.drop(['fare_amount'], axis=1)
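
To illustrate what the one-hot encoding step does, here is a minimal sketch on a toy frame. The column names (`passenger_count`, `weekday`) are hypothetical and are not taken from `train_cleaned.csv`.

```python
import pandas as pd

# Toy frame with hypothetical columns (not the real taxi data)
toy = pd.DataFrame({
    "passenger_count": [1, 2, 1],
    "weekday": ["Mon", "Tue", "Mon"],
})

# get_dummies leaves numeric columns untouched and replaces each
# string column with one indicator column per category
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
```

Because the indicator columns are created from the categories present in the data, a test set should be encoded together with, or aligned to, the training columns.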

### Scikit Optimize
Scikit Optimize (https://scikit-optimize.github.io/stable/index.html) is an AutoML toolbox wrapped around Scikit-Learn. It allows us to run state-of-the-art automatic hyper-parameter optimization on top of our learning algorithms.



In [2]:
# install 
!pip install scikit-optimize

Collecting scikit-optimize
  Downloading scikit_optimize-0.8.1-py2.py3-none-any.whl (101 kB)
Collecting pyaml>=16.9
  Downloading pyaml-20.4.0-py2.py3-none-any.whl (17 kB)
Installing collected packages: pyaml, scikit-optimize
Successfully installed pyaml-20.4.0 scikit-optimize-0.8.1


### E 2.1 Bayesian Optimization of a Random Forest Regression Model
Use Bayesian Optimization with Cross-Validation (https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html#skopt.BayesSearchCV) to find the best regression model. Compare:
* linear regression (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) 
* Random Forest regression (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor)
* and SVM regression (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR)

NOTES: this can become quite compute-intensive! Hence,
* use a smaller subset of the training data to run the experiments
* think about the range of your parameters (e.g. a large number of trees in RF or high C values in SVM will make models expensive)
* optimize only the following parameters per model type:
    * linear: no parameters to optimize
    * RF: number of trees and depth
    * SVM: C and gamma (use the RBF kernel)
* parallelize (set n_jobs)
* use Colab to run the job for up to 12h
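
For the SVM case, the notes above can be sketched as follows. This is a minimal sketch and not the exercise solution. It runs on synthetic data from `make_regression`, and it uses scikit-learn's `RandomizedSearchCV` as a stand-in for `BayesSearchCV`, so it runs without scikit-optimize. Both classes expose `fit`, `best_params_` and `best_score_`. The ranges chosen for C and gamma are assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for a small subset of the training data
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=10.0, random_state=0)

# SVR is sensitive to feature scale, so scaling happens inside the pipeline
svr_pl = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# Assumed ranges; C and gamma span orders of magnitude,
# so they are sampled on a log scale
param_distributions = {
    "svr__C": loguniform(1e-1, 1e3),
    "svr__gamma": loguniform(1e-4, 1e0),
}

search = RandomizedSearchCV(
    svr_pl,
    param_distributions,
    n_iter=10,       # number of sampled parameter settings
    cv=3,
    n_jobs=1,        # raise to -1 to use all available cores
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

With scikit-optimize, the same pipeline can be passed to `BayesSearchCV`, with the distributions replaced by `Real(low, high, prior='log-uniform')` dimensions from `skopt.space`.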


In [32]:
from skopt import BayesSearchCV as BSCV
from skopt.space.space import Integer as skopt_int
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression as LR
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.svm import SVR 
from sklearn.metrics import mean_squared_error as mse
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [9]:
np.shape(X)

(400000, 31)

In [10]:
y_less = y[:10000]
X_less=X[:10000]
print(np.shape(X_less), np.shape(y_less))

(10000, 31) (10000,)


In [11]:
X_l_train,X_l_test,y_l_train,y_l_test = train_test_split(X_less,y_less, test_size=0.1, random_state = 1)

In [23]:
print(type(y_l_train), type(y_l_test))

<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>


***Linear regression.***
- There are no parameters to be optimized, so BayesSearchCV is not used.

In [25]:
LRm = LR().fit(X_l_train,y_l_train)

In [26]:
res_LR_y_l_test = LRm.predict(X_l_test)

In [27]:
mse_LR_test = mse(y_l_test,res_LR_y_l_test)
mse_LR_test

25.88405237662758
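
The MSE above is in squared fare units, so its square root is easier to interpret. Note also that a cross-validated search without a `scoring` argument reports the estimator's default score, which for regressors is R², so those scores are not directly comparable to an MSE. The following sketch, which uses synthetic data as an assumption, converts the reported MSE to an RMSE and computes both metrics for the same predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# RMSE corresponding to the test MSE printed above (about 5.09)
rmse_lr = np.sqrt(25.88405237662758)
print(round(rmse_lr, 2))

# Synthetic regression data (not the taxi data)
X_demo, y_demo = make_regression(n_samples=500, n_features=5,
                                 noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.1, random_state=1)

model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

mse_demo = mean_squared_error(y_te, pred)  # squared target units
r2_demo = r2_score(y_te, pred)             # unitless, 1.0 is a perfect fit
# model.score() returns the same R^2 that a search uses by default
print(mse_demo, r2_demo, model.score(X_te, y_te))
```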

***Random Forest regression***

In [29]:
RFR_pl = make_pipeline(StandardScaler(),RFR(random_state = 0))

In [43]:
RFR_pl.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'standardscaler', 'randomforestregressor', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'randomforestregressor__bootstrap', 'randomforestregressor__ccp_alpha', 'randomforestregressor__criterion', 'randomforestregressor__max_depth', 'randomforestregressor__max_features', 'randomforestregressor__max_leaf_nodes', 'randomforestregressor__max_samples', 'randomforestregressor__min_impurity_decrease', 'randomforestregressor__min_impurity_split', 'randomforestregressor__min_samples_leaf', 'randomforestregressor__min_samples_split', 'randomforestregressor__min_weight_fraction_leaf', 'randomforestregressor__n_estimators', 'randomforestregressor__n_jobs', 'randomforestregressor__oob_score', 'randomforestregressor__random_state', 'randomforestregressor__verbose', 'randomforestregressor__warm_start'])

In [44]:
search_spaces = {'randomforestregressor__n_estimators': skopt_int(1,1000),
                  'randomforestregressor__max_depth' : skopt_int(2,100)}

In [45]:
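# n_iter=2 evaluates only two parameter settings (kept small to limit run time).
# Without a `scoring` argument, the search uses the estimator's default score (R^2 for regressors).
RFR_BSCV = BSCV(estimator=RFR_pl, search_spaces=search_spaces, n_iter=2, n_jobs=-1, random_state=0)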

In [46]:
RFR_BSCV.fit(X_l_train,y_l_train)

BayesSearchCV(cv=None, error_score='raise',
              estimator=Pipeline(memory=None,
                                 steps=[('standardscaler',
                                         StandardScaler(copy=True,
                                                        with_mean=True,
                                                        with_std=True)),
                                        ('randomforestregressor',
                                         RandomForestRegressor(bootstrap=True,
                                                               ccp_alpha=0.0,
                                                               criterion='mse',
                                                               max_depth=None,
                                                               max_features='auto',
                                                               max_leaf_nodes=None,
                                                               max_samples=None,
        

In [47]:
RFR_BSCV.best_estimator_

Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('randomforestregressor',
                 RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                       criterion='mse', max_depth=27,
                                       max_features='auto', max_leaf_nodes=None,
                                       max_samples=None,
                                       min_impurity_decrease=0.0,
                                       min_impurity_split=None,
                                       min_samples_leaf=1, min_samples_split=2,
                                       min_weight_fraction_leaf=0.0,
                                       n_estimators=979, n_jobs=None,
                                       oob_score=False, random_state=0,
                                       verbose=0, warm_start=False))],
         verbose=False)

In [50]:
RFR_BSCV.cv_results_

defaultdict(list,
            {'split0_test_score': [0.8356214852066786, 0.83668392818056],
             'split1_test_score': [0.6682446028326302, 0.6760126203359915],
             'split2_test_score': [0.7236574772221898, 0.7251167362670168],
             'split3_test_score': [0.7887024315955146, 0.790589601939754],
             'split4_test_score': [0.8567592825592136, 0.8569528801573842],
             'mean_test_score': [0.7745970558832452, 0.7770711533761413],
             'std_test_score': [0.07011313576589957, 0.06786694365061458],
             'rank_test_score': [2, 1],
             'mean_fit_time': [24.963960361480712, 35.80827255249024],
             'std_fit_time': [1.082761511067307, 0.9936755870596244],
             'mean_score_time': [0.20319738388061523, 0.3004201889038086],
             'std_score_time': [0.005120540610088816, 0.0070816043766760775],
             'param_randomforestregressor__max_depth': [54, 27],
             'param_randomforestregressor__n_estimators':