# Hyperparameter Tuning

In this notebook we'll close the cycle and find the best parameters for the model that we built in the previous notebook. This process is called **Hyperparameter Optimization** and involves the following steps:

1. Choose an optimization metric.
2. Create an objective function to minimize/maximize.
3. Define a parameter search space.
4. Execute the parameter search.
5. Check the results. If the parameters found give a satisfactory model performance, then we are done and can proceed to use the model. Otherwise, go back to step 3 and repeat until you find a model that best suits your needs.
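
The five steps above can be sketched without any library at all. The block below is a minimal illustration using plain random search on a toy score; the function `toy_validation_error` and the parameter names `alpha` and `beta` are hypothetical stand-ins for "train a model and measure its error", not part of this project.

```python
import random

# Step 1: the metric. A toy error with a known minimum at alpha=3, beta=1
# (hypothetical stand-in for a real validation error).
def toy_validation_error(alpha, beta):
    return (alpha - 3.0) ** 2 + (beta - 1.0) ** 2

# Step 2: the objective maps one parameter combination to the metric value.
def objective(params):
    return toy_validation_error(params["alpha"], params["beta"])

# Step 3: the search space gives a (low, high) range for each parameter.
search_space = {"alpha": (0.0, 6.0), "beta": (-2.0, 4.0)}

# Step 4: run the search (plain random sampling in this sketch).
rng = random.Random(42)
best_params, best_value = None, float("inf")
for _ in range(500):
    candidate = {
        name: rng.uniform(low, high) for name, (low, high) in search_space.items()
    }
    value = objective(candidate)
    if value < best_value:
        best_params, best_value = candidate, value

# Step 5: check the result and decide whether it is good enough.
print(best_params, best_value)
```

A dedicated framework replaces the random loop in step 4 with a smarter sampling strategy, but the overall shape stays the same.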

In this project we'll use the [Optuna Framework](https://optuna.org/), one of the best Hyperparameter Optimization frameworks out there. Personally, I like Optuna for its ease of use (you'll see it in a bit), but there are other great libraries that serve the same purpose very well. Some examples are:
- [Hyperopt](http://hyperopt.github.io/hyperopt/)
- [Ray Tune](https://docs.ray.io/en/latest/tune/index.html)
- [Scikit-Optimize](https://scikit-optimize.github.io/stable/)
- Scikit-learn itself, through [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) and [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html).

With some adjustments, the concepts presented here are easily extended to other frameworks as well. So, feel free to try what I do here with any other framework you want.
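
To give an idea of how the same steps look in another tool, here is a minimal sketch using scikit-learn's `RandomizedSearchCV`. The synthetic dataset, the model choice, and the parameter ranges are assumptions made only for illustration; they are not the ones used in this project.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem (illustrative only).
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=42)

# Search space: scipy's randint(low, high) samples integers in [low, high).
param_distributions = {
    "n_estimators": randint(10, 31),  # integers from 10 to 30
    "max_depth": randint(1, 7),  # integers from 1 to 6
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,  # number of parameter combinations to try
    cv=3,  # 3-fold cross-validation for each combination
    scoring="neg_mean_squared_error",  # scikit-learn maximizes, hence the sign
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # flip the sign back to a mean squared error
```

Here the metric, objective, and search loop are bundled into one object, whereas Optuna asks you to write the objective function yourself.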

In [1]:
import catboost as cb
import sklearn
import optuna
import pandas as pd
import pickle
import os.path as P
from sklearn.metrics import mean_squared_log_error, make_scorer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor


In [2]:
sklearn.set_config(transform_output="pandas")

In [3]:
random_state = 42

# Data Loading

## Preprocessing Pipeline
Let's first load our preprocessing pipeline:

In [4]:
artifacts_root_dir = P.join(P.dirname(P.abspath("")), "artifacts")

In [5]:
preprocessing_pipeline_path = P.join(
    artifacts_root_dir, "preprocessing_pipeline.pickle"
)

with open(preprocessing_pipeline_path, "rb") as f:
    preprocessing_pipeline = pickle.load(f)

preprocessing_pipeline


In [6]:
target_transform_path = P.join(artifacts_root_dir, "target_transform.pickle")

with open(target_transform_path, "rb") as f:
    target_transform = pickle.load(f)

target_transform

## Dataset

Let's also load the dataset.

In [7]:
preprocessed_dataset_root_dir = P.join(P.dirname(P.abspath("")), "data", "processed")

In [8]:
df_file = P.join(preprocessed_dataset_root_dir, "sp_sales_data.parquet")

features = pd.read_parquet(df_file)
target = features.pop("sale_price")

display(features)
display(target)

| | neighborhood | property_type | usable_area | bathrooms | suites | bedrooms | parking_spots | ad_date | condominium_fee | annual_iptu_tax |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Jardim da Saude | Two-story House | 388.0 | 3.0 | 1.0 | 4.0 | 6.0 | 2017-02-07 | | |
| 1 | Vila Santa Teresa (Zona Sul) | House | 129.0 | 2.0 | 1.0 | 3.0 | 2.0 | 2016-03-21 | | |
| 2 | Vila Olimpia | Apartament | 80.0 | 2.0 | 1.0 | 3.0 | 2.0 | 2018-10-26 | 686.0 | 1610.0 |
| 3 | Pinheiros | Apartament | 94.0 | 1.0 | 0.0 | 3.0 | 2.0 | 2018-05-29 | 1120.0 | 489.0 |
| 4 | Vila Santa Clara | Condominium | 110.0 | 1.0 | 1.0 | 3.0 | 2.0 | 2018-04-16 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 88742 | Vila Carmosina | Apartament | 48.0 | 1.0 | 0.0 | 2.0 | 1.0 | 2017-10-07 | 244.0 | 0.0 |
| 88743 | Bela Vista | Apartament | 60.0 | 1.0 | | 1.0 | 1.0 | 2017-12-13 | 273.0 | 86.0 |
| 88744 | Liberdade | Apartament | 53.0 | 2.0 | 1.0 | 2.0 | 1.0 | 2018-11-28 | 210.0 | 0.0 |
| 88745 | Vila Lageado | Apartament | 20.0 | 3.0 | 2.0 | 3.0 | 2.0 | 2019-02-06 | | |


0         700000
1         336000
2         739643
3         630700
4         385000
          ...   
88742     171150
88743     251999
88744     249782
88745     623000
88746    1820000
Name: sale_price, Length: 88747, dtype: int64

Let's also apply the data transformation pipeline.

In [9]:
transformed_features = preprocessing_pipeline.transform(features)
transformed_target = target_transform.transform(target)

display(transformed_features)
display(transformed_target)

| | property_type_Apartament | property_type_Condominium | property_type_Flat | property_type_House | property_type_Penthouse | property_type_Residential Building | property_type_Studio Apartament | property_type_Two-story House | usable_area | condominium_fee | annual_iptu_tax | condominium_per_area | iptu_per_area | neighborhood_condominium_per_area | neighborhood_iptu_per_area | suites | parking_spots | bedrooms | bathrooms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.039322 | 4.616727e-18 | 0.000000 | -0.031827 | -0.015712 | 3.180273 | 0.073782 | 0.166667 | 0.857143 | 0.8 | 0.428571 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.229577 | 4.616727e-18 | 0.000000 | -0.010645 | -0.000007 | -0.216482 | 0.135286 | 0.166667 | 0.285714 | 0.6 | 0.285714 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.491186 | -1.225603e-02 | 0.019209 | -0.006842 | 0.059671 | 1.011119 | -0.096446 | 0.166667 | 0.285714 | 0.6 | 0.285714 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.285254 | 5.368350e-03 | -0.005375 | 0.006999 | -0.002027 | 0.616780 | -0.083665 | 0.000000 | 0.285714 | 0.6 | 0.142857 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.049903 | -4.011391e-02 | -0.016099 | -0.042378 | -0.023535 | -0.325976 | -0.147722 | 0.166667 | 0.285714 | 0.6 | 0.142857 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 88742 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.961888 | -3.020527e-02 | -0.016099 | -0.021312 | -0.023535 | -0.338419 | -0.162822 | 0.000000 | 0.142857 | 0.4 | 0.142857 |
| 88743 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.785375 | -2.902761e-02 | -0.014213 | -0.023522 | -0.017609 | -0.139360 | -0.105350 | 0.177136 | 0.142857 | 0.2 | 0.142857 |
| 88744 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.888341 | -3.158599e-02 | -0.016099 | -0.025958 | -0.023535 | -0.189627 | -0.062737 | 0.166667 | 0.142857 | 0.4 | 0.285714 |
| 88745 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.373752 | 4.616727e-18 | 0.000000 | 0.162300 | 0.128216 | -0.181446 | -0.077305 | 0.333333 | 0.285714 | 0.6 | 0.428571 |


0        13.458836
1        12.724866
2        13.513923
3        13.354586
4        12.860999
           ...    
88742    12.050296
88743    12.437180
88744    12.428344
88745    13.342302
88746    14.414347
Name: sale_price, Length: 88747, dtype: float64

## Train/Test Split

As in the last notebook, we'll split the data into train (80% of the original data) and test (20% of the original data) sets.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(
    transformed_features, transformed_target, test_size=0.2, random_state=random_state
)

print([t.shape for t in (X_train, X_test, y_train, y_test)])

[(70997, 19), (17750, 19), (70997,), (17750,)]


# Hyperparameter Optimization with Optuna

By default, Optuna uses a technique in the family of [Bayesian Optimization](https://en.wikipedia.org/wiki/Bayesian_optimization) to find the best parameters for your model. To put it simply, it **learns** which parameter combinations are promising: first, it computes the value of an **objective function** for some random combinations of parameters; then, by observing how the objective function varies across those trials, it proposes new combinations that are more likely to be close to an optimal objective function value. This process is repeated until a stopping criterion is met (for example, a maximum number of trials).

Usually, the **objective function** is the value of the evaluation metric (in our case, the Mean Squared Logarithmic Error) of our model trained with the given set of hyperparameters.
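
As a quick reminder of what this metric computes, the sketch below evaluates the Mean Squared Logarithmic Error by hand, as the mean of the squared differences between `log(1 + y_true)` and `log(1 + y_pred)`, and compares it with scikit-learn's implementation. The sample values are arbitrary.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Arbitrary non-negative values, used only to illustrate the formula.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# MSLE = mean((log(1 + y_true) - log(1 + y_pred)) ** 2)
manual_msle = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)
sklearn_msle = mean_squared_log_error(y_true, y_pred)

print(manual_msle, sklearn_msle)
```

Because of the logarithm, the metric penalizes relative differences rather than absolute ones, and it requires non-negative targets and predictions.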

Now, we'll find the best parameters for each of the models that compose the **voting regressor** that we trained in the last notebook.

## CatBoost Regressor

First, we need to choose which hyperparameters we want to optimize. Here's a description of the hyperparameters that I've chosen.

### Hyperparameter Descriptions

- **Number of Trees** (`iterations`): The maximum number of trees that can be built within the model. It represents the steps (or rounds of refinement) the algorithm takes to create a more accurate model that learns from the data. We'll use a fixed number, as we'll also be tuning the `learning_rate`.
- **Learning Rate** (`learning_rate`): Technically, it is used for reducing the gradient step. In other words, it scales the contribution of each decision tree to manage the overall balance and accuracy of the model. A range of 0.001 to 0.1 is a good starting point.
- **Tree Depth** (`depth`): The depth of each tree. You can think of it as the complexity or “height” of the decision trees in your CatBoost model. It’s a good idea to try out values between 1 and 10.
- **Subsample** (`subsample`): The fraction of the dataset that is randomly chosen when constructing each tree, which promotes diversity among the trees and helps to reduce overfitting. We'll search values from 0.05 to 1.
- **Feature Sampling by Level** (`colsample_bylevel`): The percentage of features to use at each split selection. The idea is the same as with `subsample`, but this time, we’re sampling features instead of rows. We'll use values between 0.05 and 1.0.
- **Minimum Data in Leaf** (`min_data_in_leaf`): The minimum number of training samples in a leaf, effectively controlling the split creation process. We'll go for values between 1 and 100.

### Objective function

With the hyperparameters and search space chosen, it's time to define the objective function.

In [22]:
def catboost_objective(trial):
    params = {
        "iterations": 1000,
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.1, log=True),
        "depth": trial.suggest_int("depth", 1, 10),
        "subsample": trial.suggest_float("subsample", 0.05, 1.0),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.05, 1.0),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 1, 100),
    }

    model = cb.CatBoostRegressor(
        **params, silent=True, allow_writing_files=False, random_seed=random_state
    )

    scorer = make_scorer(mean_squared_log_error)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring=scorer)

    avg_msle = scores.mean()
    return avg_msle

Let's take a minute to analyze this code.

```python
def catboost_objective(trial):
```
Every objective function in Optuna receives a **Trial** object, which holds all the relevant information for a specific run of the function, e.g., the hyperparameter combination, the timestamp of the function call, the id of the call, etc.

```python
    params = {
        "iterations": 1000,
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.1, log=True),
        "depth": trial.suggest_int("depth", 1, 10),
        "subsample": trial.suggest_float("subsample", 0.05, 1.0),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.05, 1.0),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 1, 100),
    }
```
The search spaces in Optuna are defined inside the objective function. The **Trial** object takes care of "suggesting" the hyperparameter combination for each function call.

```python
    model = cb.CatBoostRegressor(
        **params, silent=True, allow_writing_files=False, random_seed=random_state
    )
```
Here the hyperparameter combination is supplied to the model (the `**` unpacks the `dict` items as keyword arguments). We also supply `silent=True` (which suppresses the training log), `allow_writing_files=False` (which instructs the CatBoost regressor not to generate the training output files, since we don't need them), and the `random_seed` as well, to ensure the reproducibility of the results.

```python
    scorer = make_scorer(mean_squared_log_error)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring=scorer)

    avg_msle = scores.mean()
    return avg_msle
```
Next, we compute the 5-fold cross-validation scores on the training set, average them, and return the result.
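
To show what `cross_val_score` is doing under the hood, here is a minimal sketch that reproduces it with an explicit loop over `KFold` splits. The synthetic data and the `LinearRegression` model are assumptions chosen to keep the example fast; the targets are kept positive because the MSLE requires non-negative values.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_log_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data with strictly positive targets (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 10.0 + X @ np.array([1.0, 0.5, -0.5]) + rng.normal(scale=0.1, size=200)

model = LinearRegression()
scorer = make_scorer(mean_squared_log_error)
scores = cross_val_score(model, X, y, cv=5, scoring=scorer)

# The same computation written out: for a regressor, cv=5 means an
# unshuffled 5-fold split, with a fresh copy of the model fitted per fold.
manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X):
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    predictions = fold_model.predict(X[val_idx])
    manual_scores.append(mean_squared_log_error(y[val_idx], predictions))

print(scores.mean(), np.mean(manual_scores))
```

Each trial therefore trains the model five times, which is why a single trial takes several seconds on the full training set.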

### Optimization

With the objective function defined, we'll proceed to create the **Study**. In Optuna, the **Study** represents an optimization session, which comprises, among other things, the set of **Trial** objects.

The **Study** object also receives the `direction` parameter, which indicates the type of optimization we want: minimization or maximization. In our case, we want to minimize the error.

The **Study** also runs the optimization process when we call its `optimize` method, passing the objective function and the number of trials.

In [17]:

catboost_study = optuna.create_study(direction="minimize")
catboost_study.optimize(catboost_objective, n_trials=30)

[I 2023-07-11 19:22:45,257] A new study created in memory with name: no-name-16148528-d483-40bb-b2e8-f3f528a40aa5
[I 2023-07-11 19:22:52,052] Trial 0 finished with value: 0.0004762874517409778 and parameters: {'learning_rate': 0.010939529799571657, 'depth': 1, 'subsample': 0.8203932474610771, 'colsample_bylevel': 0.8017866713035482, 'min_data_in_leaf': 51}. Best is trial 0 with value: 0.0004762874517409778.
[I 2023-07-11 19:23:06,571] Trial 1 finished with value: 0.00024754260692170924 and parameters: {'learning_rate': 0.03526378691431389, 'depth': 7, 'subsample': 0.6502480712402945, 'colsample_bylevel': 0.27709693845832384, 'min_data_in_leaf': 62}. Best is trial 1 with value: 0.00024754260692170924.
[I 2023-07-11 19:23:24,654] Trial 2 finished with value: 0.0003645644835250124 and parameters: {'learning_rate': 0.0035193848219394305, 'depth': 8, 'subsample': 0.2029087330554441, 'colsample_bylevel': 0.8849167500038724, 'min_data_in_leaf': 48}. Best is trial 1 with value: 0.0002475426069

With the process finished, we can retrieve the best set of parameters found, as well as the optimal value of the objective function.

In [18]:
print("Best hyperparameters: ", catboost_study.best_params)
print("Best MSLE", catboost_study.best_value)

Best hyperparameters:  {'learning_rate': 0.09365069861247115, 'depth': 10, 'subsample': 0.6417003804735676, 'colsample_bylevel': 0.5214210141622353, 'min_data_in_leaf': 94}
Best MSLE 0.00020581933883343423


Now we can train the CatBoost regressor with the best hyperparameters and evaluate the resulting model error using the test set.

In [20]:
catboost_model = cb.CatBoostRegressor(
    **catboost_study.best_params, silent=True, random_seed=random_state
)
catboost_model.fit(X_train, y_train)
catboost_predictions = catboost_model.predict(X_test)
mean_squared_log_error(y_test, catboost_predictions)

0.00019927099196797581

This is already an improvement over the voting regressor that we trained in the last notebook.

Next, we'll proceed to execute the same parameter optimization process for the **XGBoost**, **Random Forest** and **LightGBM** regression models.

## XGBoost Regressor

As we did with the CatBoost regressor, we'll describe the hyperparameters that we want to optimize.

### Hyperparameter Descriptions

- **Number of estimators** (`n_estimators`): Similar to CatBoost's `iterations`, it is the maximum number of trees that can be built within the model. This time we'll search values between 100 and 500.
- **Max depth** (`max_depth`): Also similar to CatBoost. We'll use values between 1 and 10.
- **Subsample** (`subsample`): Also similar to CatBoost. We'll use values between 0.05 and 1.
- **Learning rate** (`learning_rate`): Also similar to CatBoost. We'll use values between 0.001 and 0.1.
- **ColSample by level** (`colsample_bylevel`): Also similar to CatBoost. We'll use values between 0.05 and 1.
- **Min child weight** (`min_child_weight`): This one differs the most from the hyperparameters we've discussed so far. It defines the minimum sum of weights of all observations required in a child, and is used to control overfitting. Higher values prevent a model from learning relations that might be highly specific to the particular sample selected for a tree. We'll use values between 1 and 10.

### Objective function

In [36]:
def xgboost_objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.1, log=True),
        "max_depth": trial.suggest_int("max_depth", 1, 10),
        "subsample": trial.suggest_float("subsample", 0.05, 1.0),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.05, 1.0),
    }

    model = XGBRegressor(
        **params, random_state=random_state
    )

    scorer = make_scorer(mean_squared_log_error)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring=scorer)

    avg_msle = scores.mean()
    return avg_msle

### Optimization

In [37]:
xgboost_study = optuna.create_study(direction="minimize")
xgboost_study.optimize(xgboost_objective, n_trials=30)

[I 2023-07-12 21:52:58,791] A new study created in memory with name: no-name-1959d5e3-3e56-4b79-b399-15c4c502b343
[I 2023-07-12 21:53:04,995] Trial 0 finished with value: 0.016088573906070977 and parameters: {'n_estimators': 393, 'learning_rate': 0.005158606265529596, 'max_depth': 1, 'subsample': 0.8133215684130123, 'min_child_weight': 5, 'colsample_bylevel': 0.6324014618497815}. Best is trial 0 with value: 0.016088573906070977.
[I 2023-07-12 21:53:15,661] Trial 1 finished with value: 0.00032644413528221176 and parameters: {'n_estimators': 289, 'learning_rate': 0.02060201512003209, 'max_depth': 5, 'subsample': 0.25720542212839775, 'min_child_weight': 8, 'colsample_bylevel': 0.7833911374052128}. Best is trial 1 with value: 0.00032644413528221176.
[I 2023-07-12 21:53:24,060] Trial 2 finished with value: 0.00026919178659423707 and parameters: {'n_estimators': 249, 'learning_rate': 0.0922355413739139, 'max_depth': 4, 'subsample': 0.32250429513896295, 'min_child_weight': 10, 'colsample_byle

In [38]:
print("Best hyperparameters: ", xgboost_study.best_params)
print("Best MSLE", xgboost_study.best_value)

Best hyperparameters:  {'n_estimators': 452, 'learning_rate': 0.03215735426888183, 'max_depth': 9, 'subsample': 0.6340800688863539, 'min_child_weight': 4, 'colsample_bylevel': 0.45440483434810286}
Best MSLE 0.00021568294536994324


In [39]:
xgboost_model = XGBRegressor(**xgboost_study.best_params, random_state=random_state)
xgboost_model.fit(X_train, y_train)
xgboost_predictions = xgboost_model.predict(X_test)
mean_squared_log_error(y_test, xgboost_predictions)

0.00020921901142607533

Wonderful! Now the XGBoost regressor is also better than the voting regressor.

## Random Forest Regressor

### Hyperparameter Descriptions

For this model, I could not find any other combination of hyperparameters that improved upon the vanilla model, so I'll just use the following two.

- **Number of estimators** (`n_estimators`): Similar to CatBoost's `iterations`, it is the number of trees built within the model. We'll use values between 70 and 120.
- **Depth** (`max_depth`): Also similar to CatBoost. We'll use values between 1 and 100.

### Objective Function

In [107]:
def random_forest_objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 70, 120),
        "max_depth": trial.suggest_int("max_depth", 1, 100),
        # "max_features": trial.suggest_float("max_features", 0.7, 1.0),
    }

    model = RandomForestRegressor(**params, n_jobs=-1, random_state=random_state)

    scorer = make_scorer(mean_squared_log_error)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring=scorer)

    avg_msle = scores.mean()
    return avg_msle

### Optimization

In [108]:
random_forest_study = optuna.create_study(direction="minimize")
random_forest_study.optimize(random_forest_objective, n_trials=30)

[I 2023-07-13 22:17:42,040] A new study created in memory with name: no-name-85ebd399-80e1-4ba3-9610-e1cf62dce214
[I 2023-07-13 22:17:54,706] Trial 0 finished with value: 0.0002405071476581513 and parameters: {'n_estimators': 76, 'max_depth': 43}. Best is trial 0 with value: 0.0002405071476581513.
[I 2023-07-13 22:18:06,941] Trial 1 finished with value: 0.00024063044368127616 and parameters: {'n_estimators': 73, 'max_depth': 55}. Best is trial 0 with value: 0.0002405071476581513.
[I 2023-07-13 22:18:21,324] Trial 2 finished with value: 0.00024002142435783114 and parameters: {'n_estimators': 87, 'max_depth': 38}. Best is trial 2 with value: 0.00024002142435783114.
[I 2023-07-13 22:18:37,427] Trial 3 finished with value: 0.0002396708158978298 and parameters: {'n_estimators': 99, 'max_depth': 100}. Best is trial 3 with value: 0.0002396708158978298.
[I 2023-07-13 22:18:53,563] Trial 4 finished with value: 0.0002450791908380493 and parameters: {'n_estimators': 118, 'max_depth': 17}. Best is

In [109]:
print("Best hyperparameters: ", random_forest_study.best_params)
print("Best MSLE", random_forest_study.best_value)

Best hyperparameters:  {'n_estimators': 119, 'max_depth': 86}
Best MSLE 0.00023915227192271814


In [110]:
random_forest_model = RandomForestRegressor(
    **random_forest_study.best_params, random_state=random_state
)
random_forest_model.fit(X_train, y_train)
random_forest_model_predictions = random_forest_model.predict(X_test)
mean_squared_log_error(y_test, random_forest_model_predictions)

0.0002276042588944481