# Block 6 Exercise 2: finding the best parameters for predicting the fare of taxi rides
We return to our Random Forest Regression and want to automatically optimize all free parameters ...

In [0]:
import pandas as pd
import numpy as np
import folium

In [7]:
#check if notebook runs in colab
import sys
IN_COLAB = 'google.colab' in sys.modules
print('running in Colab:',IN_COLAB)
path='..'
if IN_COLAB:
  #in colab, we need to clone the data from the repo
  !git clone https://github.com/keuperj/DataScienceSS20.git
  path='DataScienceSS20'

running in Colab: True
fatal: destination path 'DataScienceSS20' already exists and is not an empty directory.


In [0]:
# we load the data we have saved after wrangling and pre-processing in block I
X=pd.read_csv(path+'/DATA/train_cleaned.csv')
drop_columns=['Unnamed: 0','Unnamed: 0.1','Unnamed: 0.1.1','key','pickup_datetime','pickup_date','pickup_latitude_round3','pickup_longitude_round3','dropoff_latitude_round3','dropoff_longitude_round3']
X=X.drop(drop_columns,axis=1)
X=pd.get_dummies(X)# one hot coding
#generate labels
y=X['fare_amount']
X=X.drop(['fare_amount'],axis=1)

### Scikit Optimize
Scikit Optimize (https://scikit-optimize.github.io/stable/index.html) is an AutoML toolbox built on top of Scikit-Learn. It allows us to run state-of-the-art automatic hyper-parameter optimization on top of our learning algorithms.



In [9]:
# install 
!pip install scikit-optimize



### E 2.1 Bayesian Optimization of a Random Forest Regression Model
Use Bayesian Optimization with Cross-Validation (https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html#skopt.BayesSearchCV) to find the best regression model. Compare
* linear regression (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) 
* Random Forest regression (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor)
* and SVM regression (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR)

NOTES: this can become quite compute intensive! Hence,
* use a smaller subset of the training data to run the experiments 
* think about the range of your parameters (e.g. a larger number of trees in RF or high C-values in SVM will make models expensive)
* optimize only the following parameters per model type:
    * linear: no parameters to optimize
    * RF: #trees and depth
    * SVM: C and gamma (use RBF kernel)
* parallelize -> n_jobs
* use CoLab to run the job for up to 12h
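
To get a feeling for why the search is compute intensive, the number of model fits can be estimated before starting a job. The following is a minimal sketch: the helper `count_model_fits` is hypothetical (not part of skopt or scikit-learn), and the values it is called with are assumptions, namely `n_iter=50` as the `BayesSearchCV` default and 5 cross-validation folds as the default in recent scikit-learn versions.

```python
def count_model_fits(n_iter, cv_folds, refit=True):
    """Estimate how many times a model is trained during one search.

    Each of the n_iter sampled parameter settings is trained once per
    cross-validation fold; with refit=True the best setting is trained
    one more time on the full training data.
    """
    return n_iter * cv_folds + (1 if refit else 0)


# Assumed defaults: 50 iterations, 5 folds (check your installed versions)
default_fits = count_model_fits(n_iter=50, cv_folds=5)

# A smaller budget, e.g. 32 iterations with the same 5 folds
reduced_fits = count_model_fits(n_iter=32, cv_folds=5)

print("fits with assumed defaults:", default_fits)
print("fits with reduced budget:  ", reduced_fits)
```

Multiplying this count by the training time of a single model on the chosen subset gives a rough lower bound for the total runtime, which helps when deciding on the subset size and the parameter ranges.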


In [10]:
print (np.shape(X), " and ", np.shape(y))

(400000, 31)  and  (400000,)


In [0]:
X_subset = X[0:30000].astype(np.float64)
y_subset = y[0:30000].astype(np.float64)

X_test = X[30000:40000].astype(np.float64)
y_test = y[30000:40000].astype(np.float64)

In [12]:
print(np.shape(X_subset), " and ", np.shape(y_subset))

(30000, 31)  and  (30000,)
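
The subset above takes the first 30,000 rows for training and the next 10,000 for testing. If the rows of the cleaned file happen to be ordered (for example by date), a random sample may be more representative. The following is a sketch using only the standard library; `split_indices` is a hypothetical helper, and the row counts are taken from the shapes printed above.

```python
import random


def split_indices(n_rows, n_train, n_test, seed=0):
    """Draw disjoint random row indices for a training and a test subset."""
    if n_train + n_test > n_rows:
        raise ValueError("n_train + n_test must not exceed n_rows")
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    # sample() draws without replacement, so no index appears twice
    picked = rng.sample(range(n_rows), n_train + n_test)
    return picked[:n_train], picked[n_train:]


train_idx, test_idx = split_indices(n_rows=400_000, n_train=30_000, n_test=10_000)
print(len(train_idx), len(test_idx))

# With the notebook's DataFrames, the indices would be applied like this:
#   X_subset, y_subset = X.iloc[train_idx], y.iloc[train_idx]
#   X_test,   y_test   = X.iloc[test_idx],  y.iloc[test_idx]
```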


In [0]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_subset, y_subset)

In [14]:
reg.score(X_test, y_test)

0.7655512903352881
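
For scikit-learn regressors, `score` returns the coefficient of determination $R^2 = 1 - SS_{res}/SS_{tot}$, so the value above is not an error in dollars. As a minimal sketch of what this number means, the function below recomputes it in plain Python; `r_squared` is a hypothetical helper and the fare values are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # squared error of always predicting the mean of the true values
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


fares = [5.0, 7.5, 10.0, 12.5]            # made-up true fares
mean_fare = sum(fares) / len(fares)

perfect = r_squared(fares, fares)          # predictions match exactly
baseline = r_squared(fares, [mean_fare] * len(fares))  # always predict the mean

print("perfect model: ", perfect)
print("mean predictor:", baseline)
```

A score of 1 means perfect predictions and a score of 0 means the model is no better than always predicting the mean fare; negative values are possible for models that do worse than that.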

In [0]:
from skopt import BayesSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

In [0]:
# lists are treated by skopt as categorical choices: only these values are tried
rf_space = {'n_estimators': [int(x) for x in np.linspace(start=20, stop=200, num=10)],
            'max_depth': [int(x) for x in np.linspace(10, 80, num=8)]}
optRF = BayesSearchCV(RandomForestRegressor(), rf_space, n_jobs=16)
optRF.fit(X_subset, y_subset)

In [0]:
optRF.best_params_

OrderedDict([('max_depth', 80), ('n_estimators', 120)])

In [0]:
rfStandard = RandomForestRegressor()
rfOpt = RandomForestRegressor(max_depth=80, n_estimators = 120)

In [0]:
rfStandard.fit(X_subset, y_subset)
rfOpt.fit(X_subset, y_subset)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=80, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=120, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

In [0]:
print("Standard Random Forest Score: ", rfStandard.score(X_test, y_test), "  Random Forest with optimal parameters Score: ", rfOpt.score(X_test, y_test))

Standard Random Forest Score:  0.815680624144585   Random Forest with optimal parameters Score:  0.816800691151756


In [12]:
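# lists are categorical choices: 4 x 3 = 12 distinct combinations for n_iter=32 iterations
svr_space = {'C': [0.1, 1, 10, 100], 'gamma': [0.1, 0.01, 0.001]}
optSVR = BayesSearchCV(SVR(kernel='rbf'), svr_space, n_jobs=16, n_iter=32, random_state=0)
optSVR.fit(X_subset, y_subset)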

NameError: ignored

In [0]:
optSVR.best_params_

In [0]:
svrOpt = SVR(kernel='rbf', C=100, gamma=0.001)
svrStandard = SVR()

NameError: ignored

In [0]:
# fit on the training subset, not on the test data that is used for scoring
svrOpt.fit(X_subset, y_subset)
svrStandard.fit(X_subset, y_subset)

In [0]:
print("Standard SVR Score: ", svrStandard.score(X_test, y_test), "  SVR with optimal parameters Score: ", svrOpt.score(X_test, y_test))