This notebook explains the concept of hyperparameter optimization. It builds on the previous classes' notebooks on feature engineering and cross-validation; you can check the GitHub repository for those contents.
I'll skip the explanation of the data preprocessing and feature engineering steps.

In [32]:
#import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from scipy.stats import skew

pd.set_option('display.max_columns', None)

In [33]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [34]:
data = pd.concat([train.drop("SalePrice", axis=1),test], axis=0) #Join train and test dataset
y = train[['SalePrice']]

In [35]:
#Drop features with high missing values(over 80%)
high_missing_cols = ['PoolQC', 'MiscFeature', 'Alley', 'Fence',]
data = data.drop(high_missing_cols, axis=1)

In [36]:
# I'll drop some features due to multicollinearity, low class representation, etc
to_drop = ['Id', 'YrSold', 'MoSold', 'Utilities', 'Street', 'Condition2', 'RoofMatl', 'Heating',
           'LowQualFinSF', '3SsnPorch', 'PoolArea', 'MiscVal']
data = data.drop(to_drop, axis=1)


In [37]:
#Get list of categorical features
categorical_cols = data.select_dtypes(include=['object']).columns
categorical_cols

Index(['MSZoning', 'LotShape', 'LandContour', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'BldgType', 'HouseStyle', 'RoofStyle',
       'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond',
       'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
       'BsmtFinType2', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'SaleType', 'SaleCondition'],
      dtype='object')

In [38]:
# Get list of numeric columns
numeric_cols = data.select_dtypes(include=np.number).columns
numeric_cols

Index(['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea',
       'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',
       'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', 'ScreenPorch'],
      dtype='object')

In [39]:
#Replace NaN with none.
none_cols = ['FireplaceQu', 'GarageType','GarageFinish', 'GarageQual', 'GarageCond', 'BsmtQual',
             'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'MasVnrType']
for col in none_cols:
    data[col] = data[col].fillna('None')  # assign back instead of a chained inplace replace

In [40]:
#Fill missing categorical columns with the mode
data[categorical_cols] = data[categorical_cols].fillna(data[categorical_cols].mode().iloc[0])

In [41]:
# Handle missing values
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer()
#Handle numeric missing values
data[numeric_cols] = imputer.fit_transform(data[numeric_cols])

In [42]:
# Label encode categorical features
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
for col in categorical_cols:
    data[col] = encoder.fit_transform(data[col])

In [43]:
data.isna().sum().any() #Check if missing value(s) exists

False

In [44]:
# Handling skewed features using log transformation
skew_features = np.abs(data[numeric_cols].apply(lambda x: skew(x)).sort_values(ascending=False))
skew_features[:10] # Displaying top ten skewed features

LotArea          12.822431
KitchenAbvGr      4.302254
BsmtFinSF2        4.146034
EnclosedPorch     4.003891
ScreenPorch       3.946694
BsmtHalfBath      3.931148
MasVnrArea        2.602112
OpenPorchSF       2.535114
WoodDeckSF        1.842433
LotFrontage       1.563371
dtype: float64

In [45]:
# Filtering skewed features.
high_skew = skew_features[skew_features > 1]
# Taking indexes of high skew.
skew_index = high_skew.index
#Applying log transformation
for i in skew_index:
    data[i] = np.log1p(data[i])

In [46]:
# Creating new features  based on previous observations...
data['TotalSF'] = data['BsmtFinSF1'] + data['BsmtFinSF2'] + data['1stFlrSF'] + data['2ndFlrSF']
data['TotalBathrooms'] = data['FullBath'] + (0.5*data['HalfBath']) + data['BsmtFullBath'] + (0.5*data['BsmtHalfBath'])
data['TotalPorchSF'] = data['OpenPorchSF'] +  data['EnclosedPorch'] + data['ScreenPorch'] + data['WoodDeckSF']
data['YearBlRm'] = data['YearBuilt'] + data['YearRemodAdd']

# Merging quality and conditions.
data['TotalExtQual'] = data['ExterQual'] + data['ExterCond']
data['TotalBsmQual'] = data['BsmtQual'] + data['BsmtCond'] + data['BsmtFinType1'] + data['BsmtFinType2']
data['TotalGrgQual'] = data['GarageQual'] + data['GarageCond']
data['TotalQual'] = data['OverallQual'] + data['TotalExtQual'] + data['TotalBsmQual'] + data['TotalGrgQual'] + data['KitchenQual'] + data['HeatingQC']

# Creating new features by using new quality indicators.
data['QualGr'] = data['TotalQual'] * data['GrLivArea']
data['QualBsm'] = data['TotalBsmQual'] * (data['BsmtFinSF1'] + data['BsmtFinSF2'])
data['QualPorch'] = data['TotalExtQual'] * data['TotalPorchSF']
data['QualExt'] = data['TotalExtQual'] * data['MasVnrArea']
data['QualGrg'] = data['TotalGrgQual'] * data['GarageArea']
data['QlLivArea'] = (data['GrLivArea']  * data['TotalQual'])
data['QualSFNg'] = data['QualGr'] * data['Neighborhood']

#create binary columns
binary_column = ['2ndFlrSF', 'QualGrg', 'Fireplaces', 'QualBsm', 'QualPorch','TotalPorchSF']
for col in binary_column:
    col_name = 'has_'+ col
    data[col_name] = data[col].apply(lambda x: 1 if x > 0 else 0)    

In [47]:
X = data.iloc[:1460,:].copy()  # explicit copies avoid pandas' SettingWithCopyWarning when scaling
X_test = data.iloc[1460:, :].copy()

In [48]:
# Scale the dataset
from sklearn.preprocessing import RobustScaler

cols = X.select_dtypes(np.number).columns
scaler = RobustScaler().fit(X[cols])
X[cols] = scaler.transform(X[cols])
X_test[cols] = scaler.transform(X_test[cols])

In [49]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=.3, random_state=42)

#### What are hyperparameters?
Hyperparameters are settings that are chosen before training, rather than learned from the data, and that can be adjusted to improve the predictive performance of a model. Examples are *colsample_bytree* in XGBoost and *min_samples_leaf* in RandomForest.
Different models usually have different hyperparameters, so you will have to read up on the hyperparameters of the model you want to use.
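
To make the distinction concrete, here is a minimal sketch on made-up data, using scikit-learn's `Ridge` as a stand-in (the data and the choice of model are assumptions for illustration, not part of this notebook's pipeline). `alpha` is a hyperparameter we choose before training, while `coef_` holds the parameters the model learns from the data.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data (illustrative only): y depends linearly on two features plus noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=100)

# alpha is a hyperparameter: we set it before calling fit()
weak_reg = Ridge(alpha=0.1).fit(X_toy, y_toy)
strong_reg = Ridge(alpha=1000.0).fit(X_toy, y_toy)

# coef_ holds learned parameters: they come out of fit() and depend on alpha
print("alpha=0.1   ->", weak_reg.coef_)
print("alpha=1000  ->", strong_reg.coef_)
```

The stronger regularization shrinks the learned coefficients, which shows how a setting chosen up front changes what the model ends up learning.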

#### What is hyperparameter optimization?
Hyperparameter tuning, or optimization, is the process of finding the hyperparameter values that give the best model performance. In a nutshell, we want to find the optimal set of hyperparameters.
Hyperparameter tuning is an important part of any modelling exercise, and it is worth doing well.

This notebook will use XGBoost as the baseline model.

In [50]:
from xgboost import XGBRegressor
xgb_model = XGBRegressor(random_state=42)

Every scikit-learn-compatible model provides a way to get its parameter names and their current values, which are the defaults until you change them. They are returned as a Python dictionary by **.get_params()**. In XGBoost, this can be done with **.get_params()** or **.get_xgb_params()**.

In [51]:
xgb_model.get_params()

{'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 1,
 'gamma': 0,
 'importance_type': 'gain',
 'learning_rate': 0.1,
 'max_delta_step': 0,
 'max_depth': 3,
 'min_child_weight': 1,
 'missing': None,
 'n_estimators': 100,
 'n_jobs': 1,
 'nthread': None,
 'objective': 'reg:linear',
 'random_state': 42,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'seed': None,
 'silent': None,
 'subsample': 1,
 'verbosity': 1}
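
The same dictionary interface also works in the other direction. As a hedged sketch, the snippet below uses scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost (an assumption, so that it runs without xgboost installed). `set_params` is the standard scikit-learn estimator method, which `XGBRegressor` also exposes.

```python
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(random_state=42)

# Read the defaults as a plain dictionary
defaults = gbr.get_params()
print(defaults["n_estimators"], defaults["learning_rate"])

# Change selected hyperparameters in place; set_params returns the estimator
gbr.set_params(n_estimators=500, subsample=0.5)
print(gbr.get_params()["n_estimators"], gbr.get_params()["subsample"])
```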

Model evaluation with the default parameter values

In [52]:
xgb_model.fit(X_train, y_train) 



XGBRegressor(random_state=42)

In [53]:
from sklearn.metrics import mean_absolute_error
xgb_preds = xgb_model.predict(X_val)
print(mean_absolute_error(y_val, xgb_preds))

16401.73058468893


*You will have to read the documentation to understand what each parameter does. For example, these are some of the parameters of XGBoost and their functions:*


* **booster:** Select the type of model to run at each iteration
    * gbtree: tree-based models
    * gblinear: linear models
* **nthread:** Number of parallel threads; defaults to the maximum number of threads available if not set
* **objective:** This defines the loss function to be minimized

**Parameters for controlling speed**

* **subsample:** Denotes the fraction of observations to be randomly sampled for each tree
* **colsample_bytree:** Subsample ratio of columns when constructing each tree.
* **n_estimators:**  Number of trees to fit.

**Important parameters which control overfitting**

* **learning_rate:** Makes the model more robust by shrinking the weights on each step
* **max_depth:** The maximum depth of a tree.
* **min_child_weight:** Defines the minimum sum of weights of all observations required in a child.
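
To see why a parameter such as `max_depth` is tied to overfitting, here is a small sketch on synthetic data, with a single scikit-learn decision tree standing in for one boosted tree (the data, the model, and the depth values are all illustrative assumptions). Deeper trees fit the training set more closely, which does not necessarily carry over to held-out data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic regression problem (illustrative only)
rng = np.random.default_rng(42)
X_toy = rng.uniform(-3, 3, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.3, size=300)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

depth_errors = {}
for depth in [2, 5, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    depth_errors[depth] = (
        mean_absolute_error(y_tr, tree.predict(X_tr)),  # training error
        mean_absolute_error(y_va, tree.predict(X_va)),  # validation error
    )
    print(depth, depth_errors[depth])
```

Comparing the two numbers per depth is the point: a large gap between training and validation error is the usual sign that the model is too flexible for the data.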

**Techniques for hyperparameter optimization**
* Manual Search
* Grid Search
* Randomized Search
* Bayesian Optimization

Let's address each technique in turn.

##### Manual Search
In manual search, we change the value of a parameter, retrain the model, and check whether the score improves, choosing each new value by hand rather than automating the selection. This is repeated until we are satisfied with the best-performing parameters found.
This method is entirely manual and tedious.

In [54]:
#Let's try to change values of some parameters
xgb_model_m_tuned = XGBRegressor(random_state=42, n_estimators=500)
xgb_model_m_tuned.fit(X_train, y_train) 

xgb_preds_m_tuned = xgb_model_m_tuned.predict(X_val)
print(mean_absolute_error(y_val, xgb_preds_m_tuned))

16027.002630921803


Just by changing the number of estimators from 100 to 500, the mean absolute error decreased from 16401 to 16027. Let's change the subsample from 1 to 0.5 to observe the influence on the model's error.

In [55]:
#Let's try to change values of some parameters
xgb_model_m_tuned2 = XGBRegressor(random_state=42, n_estimators=500, subsample=0.5)
xgb_model_m_tuned2.fit(X_train, y_train) 
xgb_preds_m_tuned2 = xgb_model_m_tuned2.predict(X_val)
print(mean_absolute_error(y_val, xgb_preds_m_tuned2))

15670.421848244863
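
Manual search quickly becomes hard to track after a few attempts. One way to keep it organized, sketched below on synthetic data with scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost (both are assumptions for illustration), is to log every hand-picked setting together with its validation error.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustrative only)
X_toy, y_toy = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

# Hand-picked settings to try, one after another
trials = [
    {"n_estimators": 100},
    {"n_estimators": 300},
    {"n_estimators": 300, "subsample": 0.5},
]

results = []
for params in trials:
    model = GradientBoostingRegressor(random_state=42, **params).fit(X_tr, y_tr)
    mae = mean_absolute_error(y_va, model.predict(X_va))
    results.append((mae, params))
    print(f"{params} -> MAE {mae:.2f}")

# The setting with the lowest validation error wins
best_mae, best_params = min(results, key=lambda r: r[0])
print("Best:", best_params)
```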


##### Grid Search
The approach of grid search is quite simple: it is a brute-force, exhaustive search method where we specify a list of values for each hyperparameter, and the computer evaluates the model performance for every combination of those values to find the best set. We perform grid search by initializing a GridSearchCV object from the sklearn.model_selection module and then fitting it on our model or pipeline.
Grid search covers every combination of the values listed in the grid.

**NB**: If a hyperparameter value is not specified in your grid, it will never be tried during the search phase.

In [56]:
from sklearn.model_selection import GridSearchCV

In [57]:
param_tuning = {
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5, 7, 10],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.5, 0.7],
    'colsample_bytree': [0.5, 0.7],
    'n_estimators' : [100, 200, 500, 1000],
    'objective': ['reg:squarederror']
}


In [58]:
# #If you want to you can uncomment it.
# grid_search = GridSearchCV(estimator = xgb_model,
#                        param_grid = param_tuning,
#                        cv = 5,
#                        n_jobs = -1,
#                        verbose = 1)


# grid_search.fit(X_train,y_train)

In [59]:
# #Get the best values for specified parameters.
# grid_search.best_params_

In [60]:
# Replace xgboost default parameter value with the value gotten from grid_search,
# then train your model again.
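
For reference, here is a small self-contained version of the same workflow that finishes quickly. It is a sketch under assumptions: synthetic data from `make_regression`, scikit-learn's `GradientBoostingRegressor` in place of XGBoost, and a deliberately tiny grid. The `scoring='neg_mean_absolute_error'` argument makes the search optimize the same metric used for evaluation in this notebook. Without it, `GridSearchCV` falls back to the estimator's default `score` method, which is R^2 for regressors.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (illustrative only)
X_toy, y_toy = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

small_grid = {
    "max_depth": [2, 3],
    "n_estimators": [50, 100],
}

search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_grid=small_grid,
    scoring="neg_mean_absolute_error",  # higher is better, so MAE is negated
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated MAE of the best candidate

# With refit=True (the default), best_estimator_ is already retrained on all the data
best_model = search.best_estimator_
```

Because of the default refit, there is no need to copy the best values into a new model by hand; `best_estimator_` can be used for prediction directly.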

Although grid search is a powerful approach for finding the optimal set of parameters, evaluating every possible parameter combination is computationally very expensive, which is its major drawback.
As the number of hyperparameters and candidate values increases, the number of combinations multiplies and processing time grows accordingly. Also, when defining the grid, we may leave out a combination of values that would in fact be optimal.
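
To get a feel for the cost, the sketch below counts the candidates in a grid like `param_tuning` above, using only the standard library. The grid is repeated here so the snippet runs on its own.

```python
from itertools import product

grid = {
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5, 7, 10],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.5, 0.7],
    'colsample_bytree': [0.5, 0.7],
    'n_estimators': [100, 200, 500, 1000],
    'objective': ['reg:squarederror'],
}

# Every combination of one value per hyperparameter is a candidate
combinations = list(product(*grid.values()))
n_candidates = len(combinations)
n_fits = n_candidates * 5  # cv=5 trains each candidate five times

print(n_candidates, n_fits)  # 384 1920
```

Adding just one more value to any list multiplies the total again, which is why grids grow out of hand so quickly.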

##### Randomized Search 
An alternative approach to sampling different parameter combinations using scikit-learn is randomized search. Using the RandomizedSearchCV class in scikit-learn, we can draw random parameter combinations from sampling distributions within a specified budget (the number of combinations to try, set with n_iter).

We define the set of hyperparameters to tune just as in grid search, but the combinations to evaluate are selected at random from the search space we defined.

Random search is based on sampling from distributions of hyperparameters, which allows you to widen the search range and gives you a chance to discover a good solution that you might miss with grid or manual search.
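
As a hedged illustration of sampling from distributions rather than fixed lists, the sketch below passes `scipy.stats` distributions to `RandomizedSearchCV`. The synthetic data, the use of scikit-learn's `GradientBoostingRegressor` in place of XGBoost, and the chosen ranges are all assumptions made so the example runs quickly on its own.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (illustrative only)
X_toy, y_toy = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Distributions instead of fixed lists: each candidate is drawn at random
param_distributions = {
    "learning_rate": uniform(loc=0.01, scale=0.19),  # uniform on [0.01, 0.20]
    "max_depth": randint(2, 6),                      # integers 2, 3, 4, 5
    "subsample": uniform(loc=0.5, scale=0.5),        # uniform on [0.5, 1.0]
}

random_search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(n_estimators=50, random_state=42),
    param_distributions=param_distributions,
    n_iter=8,          # the budget: number of sampled candidates
    scoring="neg_mean_absolute_error",
    cv=3,
    random_state=42,   # makes the sampled candidates reproducible
)
random_search.fit(X_toy, y_toy)

print(random_search.best_params_)
```

The cost is fixed by `n_iter` regardless of how wide the ranges are, which is the main practical difference from a grid.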

In [61]:
from sklearn.model_selection import RandomizedSearchCV

In [62]:
param_tuning2 = {
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5, 7, 10],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.5, 0.7],
    'colsample_bytree': [0.5, 0.7],
    'n_estimators' : [100, 200, 500, 1000]
}


In [63]:
# #If you want to you can uncomment it.
# random_search = RandomizedSearchCV(estimator = xgb_model,
#                        param_distributions = param_tuning2,
#                        cv = 3,
#                        n_jobs = -1)

In [64]:
# random_search.fit(X_train,y_train)

In [65]:
# # Get the best values for specified parameters.
# random_search.best_params_

In [66]:
# Replace xgboost default parameter value with the value gotten from random_search,
# then train your model again.

In randomized search, there is a chance that some important regions of the search space will not be explored.

##### Bayesian Optimization

Bayesian optimization typically starts with a few random evaluations of the search space, builds a probabilistic model of how the score depends on the hyperparameters, and uses that model to decide which values to try next, gradually narrowing the search to promising regions. Popular libraries for Bayesian optimization are scikit-optimize (skopt) and hyperopt. You will have to check the resources online for their implementation.
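
To illustrate the idea without committing to a particular library, the sketch below hand-rolls a very small Bayesian optimization loop using scikit-learn's `GaussianProcessRegressor` as the surrogate model. Everything in it is an assumption made for illustration: the one-dimensional toy `objective` stands in for "validation error as a function of one hyperparameter", and the lower-confidence-bound rule is only one of several possible ways to pick the next point. It is not a description of how skopt or hyperopt work internally.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Toy stand-in for "validation error at hyperparameter value x"
    return (x - 0.3) ** 2 + 0.1 * np.sin(15 * x)

rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 1.0, 201).reshape(-1, 1)  # discretized search space
visited = np.zeros(len(candidates), dtype=bool)

# Step 1: start with a few random evaluations
n_init, n_iter = 4, 10
X_seen = rng.uniform(0.0, 1.0, size=(n_init, 1))
y_seen = objective(X_seen[:, 0])

for _ in range(n_iter):
    # Step 2: fit a surrogate model to everything evaluated so far
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-4, normalize_y=True)
    gp.fit(X_seen, y_seen)

    # Step 3: choose the unvisited candidate with the lowest optimistic estimate
    mean, std = gp.predict(candidates, return_std=True)
    lcb = mean - 2.0 * std
    lcb[visited] = np.inf  # do not re-evaluate a grid point we already tried
    idx = int(np.argmin(lcb))
    visited[idx] = True

    # Step 4: evaluate the objective there and add the result to the history
    next_x = candidates[idx, 0]
    X_seen = np.vstack([X_seen, [[next_x]]])
    y_seen = np.append(y_seen, objective(next_x))

best_idx = int(np.argmin(y_seen))
print("best x:", X_seen[best_idx, 0], "best value:", y_seen[best_idx])
```

In a real tuning run, `objective` would train a model with the proposed hyperparameters and return its cross-validated error, which is what makes each evaluation expensive and the model-guided choice of the next point worthwhile.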