# **Optimize models: Tuning Hyperparameters**
---
> This notebook improves the performance of the models from the previous notebook by tuning their **hyperparameters**
  

> The _estimators_ used in this learning path so far have a wide range of parameters that control how machine learning models are trained.
>> + **Parameters** - Values that are learned from the data during training  
>> + **Hyperparameters** - Values that you specify before training to control the behavior of the training algorithm

> Models can have many hyperparameters, and finding the best combination of hyperparameters can be treated as a search problem.  

> Often, you don't immediately know what the optimal model architecture should be for a given model, and thus you'd like to be able to explore a range of possibilities.

> In a true machine learning fashion, you'll ideally ask the machine to perform this exploration and select the optimal model architecture automatically

> Fortunately, [`scikit-learn`](https://scikit-learn.org/stable/index.html) and a few companion libraries provide ways to tune hyperparameters by trying multiple combinations and finding the best result for a given performance metric. This can be achieved through:
>> + [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) - Exhaustive search over specified parameter values for an estimator.  
>> + [`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) - Randomized search over hyperparameters.  
>> + [`Optuna`](https://optuna.readthedocs.io/en/stable/index.html) - An automatic hyperparameter optimization software framework, particularly designed for machine learning (not part of scikit-learn)
>> + [`BayesSearchCV`](https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html) - Bayesian optimization over hyperparameters (provided by scikit-optimize, not scikit-learn)  

## `GridSearchCV`
> `GridSearchCV` performs an exhaustive search over a specified grid of hyperparameters. For each combination of hyperparameters specified in the grid, GridSearchCV trains a new model and evaluates its performance using cross-validation.

> For example, if you have two hyperparameters, each with three possible values, `GridSearchCV` will try all 3x3=9 combinations of these hyperparameters.
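
To make the count concrete, here is a minimal sketch that enumerates such a grid with the standard library only. The hyperparameter names and values are hypothetical and serve purely as an illustration of how the combinations multiply.

```python
from itertools import product

# Hypothetical grid: two hyperparameters with three candidate values each.
param_grid = {
    "max_depth": [3, 5, 7],
    "n_estimators": [50, 100, 200],
}

# The Cartesian product of the value lists yields every combination to evaluate.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combinations))  # 9
print(combinations[0])    # {'max_depth': 3, 'n_estimators': 50}
```

Adding a third hyperparameter with three values would raise the count to 27, which is why exhaustive search becomes expensive quickly.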

> Parameters (a few of the most important):
+ **estimator** - estimator object  
+ **param_grid: dict or list of dictionaries** - Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries     
+ **n_jobs: int, default=None** - Number of jobs to run in parallel. `None` means `1` unless in a `joblib.parallel_backend` context. `-1` means using all processors  
+ **cv: int, cross-validation generator or an iterable, default=None** - Cross-validation splitting strategy. `None` uses the default 5-fold cross-validation  
+ **scoring: str, callable, list, tuple or dict, default=None** - Strategy to evaluate the performance of the cross-validated model on the test set.  
+ **return_train_score: bool, default=False** - If `False`, the `cv_results_` attribute will not include training scores. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. However computing the scores on the training set can be computationally expensive  

> Attributes are listed and explained [`here`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) on the documentation page
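
The parameters listed above can be put together as in the following minimal sketch. It assumes synthetic data from `make_regression` and a `DecisionTreeRegressor`, neither of which is used elsewhere in this notebook; the names `X_demo`, `y_demo` and `demo_grid` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only to keep the sketch self-contained.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 3 x 3 = 9 candidate settings; with cv=5 this means 45 fits plus one final refit.
demo_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 5, 10],
}

grid_search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid=demo_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",  # negated RMSE: higher is better
    return_train_score=True,                # also store training scores in cv_results_
)
grid_search.fit(X_demo, y_demo)

print(len(grid_search.cv_results_["params"]))  # 9 candidates were evaluated
print(grid_search.best_params_)
print(-grid_search.best_score_)                # best mean cross-validated RMSE
```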



## `RandomizedSearchCV`  
> While `GridSearchCV` exhaustively searches through all combinations of hyperparameters specified in a grid, `RandomizedSearchCV` samples a fixed number of hyperparameter settings from specified probability distributions.  


+ `RandomizedSearchCV` is more efficient than `GridSearchCV` when the hyperparameter search space is large. Since it doesn't try every single combination, it can explore a broader range of values in a shorter amount of time.
+ When the search space is huge and it's impractical to try every possible combination, `RandomizedSearchCV` provides a more feasible alternative.
+ The `n_iter` parameter specifies the number of parameter settings that are sampled  
+ It is highly recommended to use **continuous distributions** for continuous parameters.

> Parameters (a few of the most important):
+ **estimator** - estimator object  
+ **param_distributions: dict or list of dicts** - Dictionary with parameters names (`str`) as keys and distributions or lists of parameters to try. Distributions _must_ provide a `rvs` method for sampling (such as those from `scipy.stats.distributions`)   
+ **n_jobs: int, default=None** - Number of jobs to run in parallel. `None` means `1` unless in a `joblib.parallel_backend` context. `-1` means using all processors  
+ **n_iter: int, default=10** - Number of parameter settings that are sampled. `n_iter` trades off runtime vs quality of the solution.  
+ **cv: int, cross-validation generator or an iterable, default=None** - Cross-validation splitting strategy. `None` uses the default 5-fold cross-validation  
+ **scoring: str, callable, list, tuple or dict, default=None** - Strategy to evaluate the performance of the cross-validated model on the test set.  
+ **random_state: int, RandomState instance or None, default=None** - Pass an int for reproducible output across multiple function calls.
+ **return_train_score: bool, default=False** - If `False`, the `cv_results_` attribute will not include training scores. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. However computing the scores on the training set can be computationally expensive  

> Attributes are listed and explained [`here`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) on the documentation page
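
How `param_distributions` is sampled can be inspected without fitting any model. The sketch below assumes hypothetical hyperparameter names and uses scikit-learn's `ParameterSampler` to draw a few settings. Note that `scipy.stats.uniform(loc, scale)` samples from `[loc, loc + scale]`, and that the upper bound of `randint(low, high)` is exclusive.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# Hypothetical search space mixing distributions and a plain list.
demo_dist = {
    "max_depth": randint(3, 10),          # integers 3..9 (upper bound is exclusive)
    "learning_rate": uniform(0.01, 0.5),  # floats in [0.01, 0.51]: uniform(loc, scale)
    "max_features": ["sqrt", "log2"],     # list entries are sampled uniformly
}

# Draw five candidate settings, as a randomized search with n_iter=5 would.
samples = list(ParameterSampler(demo_dist, n_iter=5, random_state=0))
for sample in samples:
    print(sample)
```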

## [`Optuna`](https://optuna.readthedocs.io/en/stable/index.html)  
> Steps for hyperparameter tuning using optuna are as follows:
+ Define an `objective` function that `optuna` will `optimize` - This function takes a set of hyper-parameters as input and returns an evaluation metric that `optuna` aims to `minimize` or `maximize`  
+ Using [`optuna.create_study()`](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.create_study.html#optuna.create_study), create a [`study`](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.study.Study.html#optuna.study.Study) object that represents an optimization task - It contains multiple `trials`, each corresponding to a single run of the objective function with a specified set of hyper-parameters
+ `optimize` the `objective` function using [`study.optimize(objective, n_trials = n)`](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.study.Study.html#optuna.study.Study.optimize)

> **Note:**
> + The `direction` parameter passed to `optuna.create_study()` will affect the results, as it specifies whether the objective of the optimization is to `maximize` or `minimize` the value returned by the objective function
>> * `direction='minimize'` - finds the set of hyperparameters that result in the lowest value of the objective function
>> * `direction='maximize'` - finds the set of hyperparameters that result in the highest value of the objective function

> + [`study.best_params`](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.study.Study.html#optuna.study.Study.best_params) / `study.best_trial.params` returns only best hyperparameters that had a search space defined using the `trial.suggest...` methods. The study object does not contain hard coded parameters with fixed values throughout the trials - Therefore, find a way to combine them both when training the final model
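
One simple way to combine the two sets of parameters is a dictionary merge. The sketch below uses hypothetical values: `fixed_params` stands in for the hard coded settings and `tuned_params` for what `study.best_params` would return.

```python
# Fixed settings that were hard coded inside the objective (hypothetical values).
fixed_params = {"boosting_type": "gbdt", "metric": "rmse"}

# Stand-in for study.best_params: only values suggested through trial.suggest_* calls.
tuned_params = {"num_leaves": 5, "learning_rate": 0.065}

# Merge both dictionaries; on a key clash the right-hand dictionary wins.
final_params = {**fixed_params, **tuned_params}
print(final_params)
# {'boosting_type': 'gbdt', 'metric': 'rmse', 'num_leaves': 5, 'learning_rate': 0.065}
```

The merged dictionary can then be unpacked into the estimator's constructor or passed to `set_params`.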

## Load the data

In [None]:
# Import pandas
import pandas as pd

# load the training dataset
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/daily-bike-share.csv
bike_data = pd.read_csv('daily-bike-share.csv')

In [None]:
# Extract features / inputs (X) and label (y)
# features/inputs (X):
cols = ['season','mnth', 'holiday','weekday','workingday','weathersit','temp', 'atemp', 'hum', 'windspeed']
X = bike_data[cols].copy()

# label (y):
y = bike_data['rentals']

numeric_features = [6,7,8,9]
categorical_features = [0,1,2,3,4,5]

In [None]:
# import train_test_split
from sklearn.model_selection import train_test_split

# Split the data into training and validation / test sets 65% - 35%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)

# Keep in mind that X_train and X_test are DataFrames, while y_train and y_test are Series
type(X_train) # to confirm

In [None]:
# define a function that evaluates a fitted model on the held-out test / validation dataset
def evaluate_model(model):
  from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
  import numpy as np

  # Get predictions from the model passed
  y_pred = model.predict(X_test)

  # MSE
  mse = mean_squared_error(y_test, y_pred)
  # RMSE
  rmse = np.sqrt(mse)
  # R-squared
  r2 = r2_score(y_test, y_pred)
  # MAE
  mae = mean_absolute_error(y_test, y_pred)

  print(f"\nMSE: {round(mse, 2)}")
  print(f"MAE: {round(mae, 2)}")
  print(f"RMSE: {round(rmse, 2)}")
  print(f"R Squared: {round(r2, 3)}")
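
For reference, the four metrics reported by `evaluate_model` can be computed by hand. The following sketch uses a small set of hypothetical values and the standard library only, to show what each number measures.

```python
import math

# Hypothetical true values and predictions.
y_true = [3.0, 5.0, 2.0, 7.0]
y_hat = [2.0, 5.0, 4.0, 8.0]
n = len(y_true)

residuals = [t - p for t, p in zip(y_true, y_hat)]

mse = sum(r ** 2 for r in residuals) / n   # mean of squared errors
rmse = math.sqrt(mse)                      # same units as the label
mae = sum(abs(r) for r in residuals) / n   # mean of absolute errors

# R squared: 1 minus the ratio of residual to total sum of squares.
mean_true = sum(y_true) / n
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mse, mae)                      # 1.5 1.0
print(round(rmse, 3), round(r2, 3))  # 1.225 0.593
```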

## Pre-processing the data

In [None]:
# import libraries for pre-processing
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

In [None]:
# Encode categorical features
cat_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")), #1. handle missing values
    ("ohe", OneHotEncoder(handle_unknown="ignore")) # 2. encode
])

# Scale numerical features
num_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")), #1. handle missing values
    ("std_sc", StandardScaler()) # scale
])

# Now combine the two transformers (num_transformer & cat_transformer) into one
preprocessor = ColumnTransformer(transformers=[
    ("cat", cat_transformer, categorical_features),
    ("num", num_transformer, numeric_features)
])
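
When a transformer like this is applied outside a `Pipeline`, it is generally fitted on the training rows only and then reused unchanged on held-out rows. Here is a minimal sketch of that pattern, assuming hypothetical toy frames (`train_df`, `test_df`) and a simplified `demo_preprocessor` without imputation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frames; the held-out frame contains a category ("c") unseen in training.
train_df = pd.DataFrame({"kind": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]})
test_df = pd.DataFrame({"kind": ["a", "c"], "value": [2.5, 10.0]})

demo_preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["kind"]),
    ("num", StandardScaler(), ["value"]),
])

# Learn categories, means and variances from the training rows only ...
train_t = demo_preprocessor.fit_transform(train_df)
# ... then reuse those learned statistics on the held-out rows.
test_t = demo_preprocessor.transform(test_df)

print(train_t.shape, test_t.shape)  # (4, 3) (2, 3)
```

Because `handle_unknown="ignore"` is set, the unseen category is encoded as all zeros instead of raising an error, and both outputs keep the same number of columns.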

## [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn-ensemble-randomforestregressor) algorithm  
+ `RandomForestRegressor` parameters, their explanations and best practices for tuning are listed in its documentation linked [`here`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn-ensemble-randomforestregressor)

### `RandomizedSearchCV` approach

In [None]:
# Define randomized search distributions for RandomForestRegressor hyperparameters:
from scipy.stats import randint, uniform

param_dist = {
    "reg__max_depth": randint(3, 10),
    "reg__min_samples_split": randint(2, 20),
    "reg__min_samples_leaf": randint(2, 20),
    "reg__max_features": ["sqrt", 1, 0.5],
    # the third positional argument of randint is `loc` (a shift), not a step:
    # this samples integers from 60 to 349
    "reg__n_estimators": randint(10, 300, 50)
}

In [None]:
# import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

# create a pipeline with preprocessor + RandomForestRegressor
forest_pipeline = Pipeline(steps=[
    ("preprocess", preprocessor), # preprocess
    ("reg", RandomForestRegressor(random_state=42)) # model
])

In [None]:
# import RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Create RandomizedSearchCV with the RandomForestRegressor pipeline as estimator
random_search = RandomizedSearchCV(forest_pipeline,
                                   n_iter = 1000,
                                   n_jobs = -1,
                                   verbose = True,
                                   random_state=0,
                                   param_distributions=param_dist)

In [None]:
%%time
# Fit the data to the random_search
random_search.fit(X_train, y_train)

# Save best_params to a variable
params = random_search.best_estimator_.named_steps["reg"].get_params()

print(f"Best Params: {params}")
print(f"Best Score: {random_search.best_score_}")

Fitting 5 folds for each of 1000 candidates, totalling 5000 fits




Best Params: {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': 9, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 2, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 327, 'n_jobs': None, 'oob_score': False, 'random_state': 42, 'verbose': 0, 'warm_start': False}
Best Score: 0.751224119860589
CPU times: user 37.9 s, sys: 4 s, total: 41.9 s
Wall time: 30min 20s


In [None]:
%%time
# create a pipeline with preprocessor + best_estimator_ found after search
pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("estimator", RandomForestRegressor().set_params(**params))
])

# train model
forest_model = pipeline.fit(X_train, y_train)



CPU times: user 1.52 s, sys: 0 ns, total: 1.52 s
Wall time: 1.51 s


In [None]:
# Evaluate the model on test / validation data
evaluate_model(forest_model)


MSE: 76480.54
MAE: 196.3
RMSE: 276.55
R Squared: 0.811


## [`GradientBoostingRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn-ensemble-gradientboostingregressor) algorithm   
+ `GradientBoostingRegressor` parameters, their explanations and best practices for tuning are listed in its documentation linked [`here`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor)

### `RandomizedSearchCV` approach

In [None]:
# Define randomized search grid with GradientBoostingRegressor hyperparameters:
from scipy.stats import uniform, randint

# Define the parameter distributions to search
param_dist = {
    'reg__n_estimators': randint(50, 500),
    'reg__max_depth': randint(3, 10),
    'reg__learning_rate': uniform(0.01, 0.5),  # uniform(loc, scale): samples from [0.01, 0.51]
    'reg__min_samples_split': randint(2, 20),
    'reg__min_samples_leaf': randint(1, 20),
    # None (all features) replaces 'auto', which newer scikit-learn versions no longer accept
    'reg__max_features': [None, 'sqrt', 'log2']
}

In [None]:
# import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Create pipeline with preprocessor + GradientBoostingRegressor
gb_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor), # Preprocess data
    ("reg", GradientBoostingRegressor(random_state = 42)) # Fit model
])

In [None]:
# import RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

# RandomizedSearchCV object with necessary parameters set
random_search = RandomizedSearchCV(gb_pipeline,
                                   param_distributions = param_dist,
                                   n_jobs = -1,
                                   n_iter = 1000,
                                   cv = 3,
                                   verbose = True,
                                   random_state=42)

In [None]:
%%time
# Fit the data to the random_search
random_search.fit(X_train, y_train)

# set best_params_ to a variable
params = random_search.best_estimator_.named_steps["reg"].get_params()

print(f"Best Params: {params}")
print(f"Best Score: {random_search.best_score_}")

Fitting 3 folds for each of 1000 candidates, totalling 3000 fits
Best Params: {'alpha': 0.9, 'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.12336519146175533, 'loss': 'squared_error', 'max_depth': 7, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 3, 'min_samples_split': 15, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 133, 'n_iter_no_change': None, 'random_state': 42, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}
Best Score: 0.7577032939530337
CPU times: user 20.3 s, sys: 2.03 s, total: 22.3 s
Wall time: 18min 3s


In [None]:
%%time
# create a pipeline with preprocessor + best_estimator_ found after search
pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("estimator", GradientBoostingRegressor().set_params(**params))
])

# train model
gb_model = pipeline.fit(X_train, y_train)

CPU times: user 182 ms, sys: 0 ns, total: 182 ms
Wall time: 183 ms


In [None]:
# Evaluate the model on test / validation data
evaluate_model(gb_model)


MSE: 79684.62
MAE: 199.58
RMSE: 282.28
R Squared: 0.803


## [`LGBM`](https://lightgbm.readthedocs.io/en/stable/index.html) algorithm

In [None]:
# install / upgrade LightGBM, then import it
!pip install --upgrade lightgbm
import lightgbm
from lightgbm import LGBMRegressor

### [`Optuna`](https://optuna.readthedocs.io/en/stable/index.html) approach

In [None]:
# install and import optuna
!pip install optuna
import optuna

# disable logging
optuna.logging.set_verbosity(optuna.logging.WARNING)

# import cross_val_score
from sklearn.model_selection import cross_val_score

In [None]:
# Define an objective function to be maximized (it returns the negative RMSE)
def objective(trial):

  # define search space for the hyperparameters:
  params = {
      'verbosity': 0,
      'metric': 'rmse',
      'boosting_type': 'gbdt',
      'early_stopping': 10,
      'max_bin': trial.suggest_int('max_bin', 255, 300),
      'num_leaves': trial.suggest_int('num_leaves', 2, 50),
      'max_depth': trial.suggest_int('max_depth', 1, 15),
      'n_estimators': trial.suggest_int('n_estimators', 10, 1000),
      'learning_rate': trial.suggest_float('learning_rate', 1e-5, 1e-1),
      'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 2, 30),
      'path_smooth': trial.suggest_int('path_smooth', 2, 10),
      'min_gain_to_split': trial.suggest_float('min_gain_to_split', 0.1, 0.5),
      'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 1.0),
      'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 1.0)
  }

  # model
  model = LGBMRegressor()
  model.set_params(**params)

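  # fit the preprocessor on the training data only, then reuse it on the test data
  X_train_t = preprocessor.fit_transform(X_train)
  X_test_t = preprocessor.transform(X_test)

  # parameters to pass to the fit method of LGBMRegressor (eval_set drives early stopping)
  fit_kwargs = {
      'eval_set': [(X_test_t, y_test)]
  }

  # mean cross-validation score (negative RMSE, so higher is better)
  # note: recent scikit-learn releases rename `fit_params` to `params`
  score = cross_val_score(model, X_train_t, y_train,
                          fit_params = fit_kwargs, cv=3, scoring='neg_root_mean_squared_error',
                          n_jobs = -1).mean()
  return score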

In [None]:
%%time
# create a new study
study = optuna.create_study(direction='maximize')

# optimize the objective function
study.optimize(objective, n_trials = 200)

# save best_params to a variable
best_params = study.best_params
print(f"Best Params: {best_params}")
print(f"Best Value: {study.best_value}")

Best Params: {'max_bin': 259, 'num_leaves': 5, 'max_depth': 7, 'n_estimators': 946, 'learning_rate': 0.06515796583328963, 'min_data_in_leaf': 2, 'path_smooth': 2, 'min_gain_to_split': 0.14106204260940144, 'reg_lambda': 0.031535305046180864, 'reg_alpha': 0.6083194758643262}
Best Value: -342.82891838049113
CPU times: user 24.1 s, sys: 382 ms, total: 24.5 s
Wall time: 58.2 s
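
The best value above is negative because scikit-learn negates error metrics so that a higher score is always better, which is also why the study uses `direction='maximize'`. A small sketch of converting the reported value back to an RMSE (the variable names are hypothetical):

```python
# Value printed by the study above (negated RMSE averaged over the CV folds).
best_value = -342.82891838049113

# Flipping the sign recovers the RMSE in the original units (rentals).
best_cv_rmse = -best_value
print(round(best_cv_rmse, 2))  # 342.83
```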


In [None]:
# hard coded parameters (fixed across all trials, so not part of study.best_params)
hard_coded = {
    'metric': 'rmse',
    'early_stopping': 10,
    'boosting_type': 'gbdt',
}

# create LGBMRegressor from the fixed and the tuned parameters
estimator = LGBMRegressor(**hard_coded, **best_params)

# define pipeline with pre-processor + LGBMRegressor
lgbm_pipeline = Pipeline(steps=[
    ("preprocess", preprocessor), # preprocess
    ("lgb", estimator) # estimator
])

In [None]:
# transform the evaluation set with a preprocessor fitted on the training data only
lgbm_model = lgbm_pipeline.fit(X_train, y_train,
                               lgb__eval_set = [(preprocessor.fit(X_train).transform(X_test), y_test)])

In [None]:
# evaluate the model
evaluate_model(lgbm_model)


MSE: 87546.81
MAE: 214.84
RMSE: 295.88
R Squared: 0.781


## [`XGBRegressor`](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor) algorithm

### `RandomizedSearchCV` approach

In [None]:
# import r2 score
from sklearn.metrics import r2_score
# import XGBRegressor
from xgboost import XGBRegressor
# validation dataset for early stopping
eval_set = [(X_train, y_train), (X_test, y_test)]

# Create pipeline with preprocessor + XGBRegressor
xgb_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor), # pre-process
    ("xgb", XGBRegressor(booster = "gbtree",
                         random_state = 42,
                         #eval_set = eval_set,
                         #early_stopping_rounds = 10,
                         objective = "reg:squarederror",
                         eval_metric = r2_score)) # estimator
])

In [None]:
from scipy.stats import randint, uniform

# Hyperparameter distributions
# note: uniform(loc, scale) samples from [loc, loc + scale]
param_dist = {
    "xgb__n_estimators": randint(100, 1000),
    "xgb__learning_rate": uniform(0.01, 0.1),  # samples from [0.01, 0.11]
    "xgb__max_depth": randint(6, 10),
    #"xgb__max_leaves":[0],
    "xgb__min_child_weight": randint(1,6),
    "xgb__gamma": randint(0,5),
    "xgb__subsample": uniform(0, 1),
    "xgb__colsample_bytree": uniform(0, 1)
}

In [None]:
# import RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(xgb_pipeline,
                                   param_distributions = param_dist,
                                   n_jobs = -1,
                                   cv = 4,
                                   n_iter = 1000,
                                   verbose = True,
                                   random_state=0)

In [None]:
%%time
# Fit the data to the random_search
random_search.fit(X_train, y_train)

# save best_params_ to a variable
params = random_search.best_estimator_.named_steps["xgb"].get_params()

print(f"Best Params: {params}")
print(f"Best Score: {random_search.best_score_}")

Fitting 4 folds for each of 1000 candidates, totalling 4000 fits
Best Params: {'objective': 'reg:squarederror', 'base_score': None, 'booster': 'gbtree', 'callbacks': None, 'colsample_bylevel': None, 'colsample_bynode': None, 'colsample_bytree': 0.9365537663271488, 'device': None, 'early_stopping_rounds': None, 'enable_categorical': False, 'eval_metric': <function r2_score at 0x7b89a4ba1b40>, 'feature_types': None, 'gamma': 0, 'grow_policy': None, 'importance_type': None, 'interaction_constraints': None, 'learning_rate': 0.0289613969338456, 'max_bin': None, 'max_cat_threshold': None, 'max_cat_to_onehot': None, 'max_delta_step': None, 'max_depth': 7, 'max_leaves': None, 'min_child_weight': 3, 'missing': nan, 'monotone_constraints': None, 'multi_strategy': None, 'n_estimators': 572, 'n_jobs': None, 'num_parallel_tree': None, 'random_state': 42, 'reg_alpha': None, 'reg_lambda': None, 'sampling_method': None, 'scale_pos_weight': None, 'subsample': 0.0799077058806561, 'tree_method': None, 'v

In [None]:
# create pipeline with best_params_ model + preprocessor
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor), # preprocess
    ("estimator", XGBRegressor().set_params(**params)) # model
])

xgb_model = pipeline.fit(X_train, y_train)

In [None]:
# Evaluate the XGBRegressor model
evaluate_model(xgb_model)


MSE: 76080.42
MAE: 201.18
RMSE: 275.83
R Squared: 0.812


### [`Optuna`](https://optuna.readthedocs.io/en/stable/index.html) approach

In [None]:
# install & import optuna
!pip install optuna
import optuna

# disable logging
optuna.logging.set_verbosity(optuna.logging.WARNING)

In [None]:
# import cross_val_score
from sklearn.model_selection import cross_val_score
# import xgbregressor
from xgboost import XGBRegressor

In [None]:
# define objective function
def objective(trial):
  # define parameter search space
  params = {
      'early_stopping_rounds': 10,
      'eval_metric': 'rmse',
      'booster': 'gbtree',
      'objective': 'reg:squarederror',
      'max_depth': 4,
      'gamma': trial.suggest_int('gamma', 1, 5),
      'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
      'eta': trial.suggest_float('eta', 0.1, 1.5),
      'min_child_weight': trial.suggest_float('min_child_weight', 1, 6),
      #'min_child_weight': trial.suggest_int('min_child_weight', 1, 6),
      'max_delta_step': trial.suggest_int('max_delta_step', 1, 5),
      'subsample': trial.suggest_float('subsample', 0.5, 1.0),
      'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
      'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.5, 1.0),
      'colsample_bynode': trial.suggest_float('colsample_bynode', 0.5, 1.0),
      'reg_lambda': trial.suggest_int('reg_lambda', 1, 5),
      'reg_alpha': trial.suggest_int('reg_alpha', 1, 5)

  }

  # eval_set for early stopping should be passed to the
  # `fit` method of the XGBRegressor through the fit_params parameter of
  # cross_val_score (renamed to `params` in recent scikit-learn releases).
  # Fit the preprocessor on the training data only, then reuse it on the test data
  X_train_t = preprocessor.fit_transform(X_train)
  X_test_t = preprocessor.transform(X_test)
  fit_kwargs = {
      'eval_set': [(X_test_t, y_test)]
  }

  model = XGBRegressor()
  model.set_params(**params)

  # perform cross validation on the model with chosen parameters
  # return the mean score since cross_val_score returns an array of scores
  score = cross_val_score(model, X_train_t, y_train,
                          fit_params = fit_kwargs, cv = 3, n_jobs = -1,
                          scoring = 'neg_root_mean_squared_error').mean()
  return score

In [None]:
%%time
# create a study object
study = optuna.create_study(direction = 'maximize')

# optimize objective function
study.optimize(objective, n_trials = 300)

CPU times: user 47 s, sys: 1.04 s, total: 48.1 s
Wall time: 5min 24s


In [None]:
# print values
best_params = study.best_params
print(f"Best Params: {best_params}")
print(f"Best Score: {study.best_value}")

Best Params: {'gamma': 3, 'n_estimators': 575, 'eta': 1.2480357679620102, 'min_child_weight': 3.515868717621868, 'max_delta_step': 5, 'subsample': 0.7921692143279411, 'colsample_bytree': 0.7541006783737176, 'colsample_bylevel': 0.5585674343982423, 'colsample_bynode': 0.9797116368333302, 'reg_lambda': 2, 'reg_alpha': 5}
Best Score: -339.3877151097754


In [None]:
# create an estimator with the best parameters
estimator = XGBRegressor(**best_params)

hard_coded_params = {
    'early_stopping_rounds': 20,
    'eval_metric': 'rmse',
    'booster': 'gbtree',
    'objective': 'reg:squarederror',
    'max_depth': 6,
}

estimator.set_params(**hard_coded_params)

# define pipeline with pre-processor + XGBRegressor
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor), # preprocessor
    ("xgb", estimator) # estimator
])

In [None]:
%%time
# train the model with best parameters
xgb_final = pipeline.fit(X_train, y_train,
                         xgb__eval_set = [(preprocessor.fit(X_train).transform(X_test), y_test)])

[0]	validation_0-rmse:639.02968
[1]	validation_0-rmse:635.19697
[2]	validation_0-rmse:630.97007
[3]	validation_0-rmse:626.73179
[4]	validation_0-rmse:623.17146
[5]	validation_0-rmse:619.55389
[6]	validation_0-rmse:615.43307
[7]	validation_0-rmse:611.62158
[8]	validation_0-rmse:607.57178
[9]	validation_0-rmse:603.59193
[10]	validation_0-rmse:599.64707
[11]	validation_0-rmse:595.64339
[12]	validation_0-rmse:591.66602
[13]	validation_0-rmse:588.55004
[14]	validation_0-rmse:584.71384
[15]	validation_0-rmse:580.95556
[16]	validation_0-rmse:577.02155
[17]	validation_0-rmse:573.21030
[18]	validation_0-rmse:569.30037
[19]	validation_0-rmse:566.04236
[20]	validation_0-rmse:562.15241
[21]	validation_0-rmse:558.42992
[22]	validation_0-rmse:554.59874
[23]	validation_0-rmse:551.21408
[24]	validation_0-rmse:547.75932
[25]	validation_0-rmse:544.11829
[26]	validation_0-rmse:540.40893
[27]	validation_0-rmse:536.92881
[28]	validation_0-rmse:533.74603
[29]	validation_0-rmse:530.10366
[30]	validation_0-rm

In [None]:
# evaluate model
evaluate_model(xgb_final)


MSE: 73362.68
MAE: 189.36
RMSE: 270.86
R Squared: 0.816


## **Save the model**
> Here, I will select the model that achieved the lowest Root Mean Squared Error (RMSE) on the test data (270.86), thanks to hyperparameter tuning using [`Optuna`](https://optuna.readthedocs.io/en/stable/index.html). This was the XGBRegressor: `xgb_final`

In [None]:
import joblib

# Save the model as a pickle file
# (will appear on the project file section on colab)
filename = './bike-share.pkl'
joblib.dump(xgb_final, filename)

['./bike-share.pkl']

> Now, the model can be loaded whenever needed and used to predict labels for new data. This is often called _scoring_ or _inferencing._  

> The model's `predict` method accepts an `array` of observations, so it can generate multiple predictions as a batch. The cells below first predict for a single observation and then for a five-day weather forecast:

### Predicting for a single feature array

In [None]:
import joblib
import numpy as np
import pandas as pd

# filename
filename = 'bike-share.pkl'
# load the saved model
loaded_model = joblib.load(filename)

# Create a DataFrame containing a new observation
# (the column names must match the features used during training)
X_new = pd.DataFrame(data = np.array([[1,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869]]),
                     columns = cols,
                     dtype='float64')

# predict (returns a np.ndarray)
results = loaded_model.predict(X_new)

for result in results:
  print(round(result))

# alternatively: print(round(results[0]))

115


### Predicting for multiple feature arrays

In [None]:
# An array of features based on five-day weather forecast
X_new = pd.DataFrame(data = np.array([[0,1,1,0,0,1,0.344167,0.363625,0.805833,0.160446],
                  [0,1,0,1,0,1,0.363478,0.353739,0.696087,0.248539],
                  [0,1,0,2,0,1,0.196364,0.189405,0.437273,0.248309],
                  [0,1,0,3,0,1,0.2,0.212122,0.590435,0.160296],
                  [0,1,0,4,0,1,0.226957,0.22927,0.436957,0.1869]]),
                     columns = cols)

# Use the model to predict rentals (returns an np.ndarray)
results = loaded_model.predict(X_new)

print('5-day rental predictions:')
for prediction in results:
    print(round(prediction))

5-day rental predictions:
520
570
301
227
346


> **Note:**
>> The [`pandas.Series.apply()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html) method below is used to invoke a function (such as a lambda function in our case) on values of `Series` (pandas columns are returned as `Series`)

In [None]:
# Add the predicted values as a new column to X_new
X_new['y_hat'] = results

# Round off all the values in the y_hat column using a lambda function,
# and Series.apply()
X_new['y_hat'] = X_new['y_hat'].apply(lambda x: round(x))

In [None]:
# Now, view features and y_hat as a DataFrame
X_new

| | season | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | y_hat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 520 |
| 1 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 570 |
| 2 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 301 |
| 3 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 | 1.0 | 0.2 | 0.212122 | 0.590435 | 0.160296 | 227 |
| 4 | 0.0 | 1.0 | 0.0 | 4.0 | 0.0 | 1.0 | 0.226957 | 0.22927 | 0.436957 | 0.1869 | 346 |
