## 1.0 Introduction

The main objective of this feature is to bring a hyperparameter tuning process into the development of Machine Learning models. This tutorial uses the iris and boston datasets from sklearn datasets and shows how Data Scientists can choose the parameters to be tested by a method that wraps the optuna optimization framework.

The parameters to be chosen are:
- df: pandas DataFrame
- target: Name of the target variable
- parameters: List of dicts, one per hyperparameter, each giving the name, type, and low/high bounds to be tested during optimization
- algorithm: Machine Learning algorithm used to fit the model (e.g. RandomForestClassifier, RandomForestRegressor)
- metric: Metric used to evaluate the trials (e.g. accuracy_score, r2)
- metric_type: Guides how the performance results are handled, depending on whether the model is a classifier or a regressor
- scoring_option: Whether the objective is maximized or minimized
- n_trials: The number of trials that the framework must perform
- n_splits: The number of splits to be made in the dataset
- shuffle: True or False flag that controls whether the data is shuffled before execution
- random_state: The number chosen as the seed
- metric_goal: Set to True if scoring_option is maximize. Otherwise, set to False
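To make the shape of the `parameters` argument concrete, here is a minimal sketch of how a list of `{"name", "type", "low", "high"}` entries could be turned into one candidate per trial. The helper `sample_params` is hypothetical and is not part of gumly; it only illustrates the idea of drawing an integer inside each inclusive range (the logs later in this tutorial show both bounds, 50 and 24, being sampled).

```python
import random

# Search-space spec in the same shape as the `parameters` argument used in this tutorial.
parameters = [
    {"name": "min_samples_leaf", "type": "Integer", "low": 50, "high": 75},
    {"name": "max_depth", "type": "Integer", "low": 12, "high": 24},
]


def sample_params(spec, rng):
    """Draw one value per hyperparameter (illustrative helper, not gumly's code)."""
    candidate = {}
    for entry in spec:
        if entry["type"] != "Integer":
            raise ValueError(f"unsupported type: {entry['type']}")
        # randint is inclusive on both ends, so low and high can both be drawn.
        candidate[entry["name"]] = rng.randint(entry["low"], entry["high"])
    return candidate


rng = random.Random(42)  # plays the role of random_state
trials = [sample_params(parameters, rng) for _ in range(5)]  # plays the role of n_trials
for params in trials:
    print(params)
```

Each printed dict corresponds to the hyperparameters that one trial would pass to the chosen algorithm.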

## 1.1 Import modules

In [1]:
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from gumly.hyperparameter_tuning import hyperparameter_tuning

import warnings
warnings.filterwarnings('ignore')

## 1.2 Gathering the datasets

In [2]:
# for classification 
iris_data = datasets.load_iris()
df_iris = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)
df_iris["target"] = iris_data.target
df_iris.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [3]:
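# for regression
# note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn release
boston_data = datasets.load_boston()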
df_boston = pd.DataFrame(boston_data.data, columns=boston_data.feature_names)
df_boston["target"] = boston_data.target
df_boston.head()

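| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |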
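On scikit-learn releases where `load_boston` is not available, a bundled regression dataset can stand in for it. The sketch below swaps in `load_diabetes`, which is a different dataset chosen here only for illustration; the name `df_diabetes` is an assumption, and the results shown in the rest of this tutorial would not be reproduced with it.

```python
import pandas as pd
from sklearn import datasets

# load_diabetes ships with scikit-learn, so no download is needed.
diabetes_data = datasets.load_diabetes()

# Build the DataFrame the same way as for iris and boston above.
df_diabetes = pd.DataFrame(diabetes_data.data, columns=diabetes_data.feature_names)
df_diabetes["target"] = diabetes_data.target

print(df_diabetes.shape)
print(df_diabetes.head())
```

The resulting frame has the same layout as `df_boston` (feature columns plus a `target` column), so it could be passed as `df` in the same way.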


## 1.3 Running hyperparameter tuning for classification model

In [4]:
hyperparameter_tuning(
    df=df_iris,
    target="target",
    parameters=[
        {"name": "min_samples_leaf", "type": "Integer", "low": 50, "high": 75},
        {"name": "max_depth", "type": "Integer", "low": 12, "high": 24},
    ],
    algorithm=RandomForestClassifier,
    metric=accuracy_score,
    metric_type="Classification",
    scoring_option="maximize",
    n_trials=20,
    n_splits=2,
    shuffle=True,
    random_state=42,
    metric_goal=True,
)

[I 2022-01-31 16:01:47,460] A new study created in memory with name: no-name-21c237ef-71bc-4284-be67-f89f1c56dbcc
[I 2022-01-31 16:01:47,638] Trial 0 finished with value: 0.7066666666666667 and parameters: {'min_samples_leaf': 59, 'max_depth': 24}. Best is trial 0 with value: 0.7066666666666667.
[I 2022-01-31 16:01:47,812] Trial 1 finished with value: 0.7066666666666667 and parameters: {'min_samples_leaf': 69, 'max_depth': 19}. Best is trial 0 with value: 0.7066666666666667.
[I 2022-01-31 16:01:47,991] Trial 2 finished with value: 0.7066666666666667 and parameters: {'min_samples_leaf': 54, 'max_depth': 14}. Best is trial 0 with value: 0.7066666666666667.
[I 2022-01-31 16:01:48,169] Trial 3 finished with value: 0.7066666666666667 and parameters: {'min_samples_leaf': 51, 'max_depth': 23}. Best is trial 0 with value: 0.7066666666666667.
[I 2022-01-31 16:01:48,349] Trial 4 finished with value: 0.7066666666666667 and 

(    number     value  params_max_depth  params_min_samples_leaf     state
 0        0  0.706667                24                       59  COMPLETE
 1        1  0.706667                19                       69  COMPLETE
 2        2  0.706667                14                       54  COMPLETE
 3        3  0.706667                23                       51  COMPLETE
 4        4  0.706667                21                       65  COMPLETE
 5        5  0.706667                24                       50  COMPLETE
 6        6  0.706667                14                       71  COMPLETE
 7        7  0.706667                14                       54  COMPLETE
 8        8  0.706667                18                       57  COMPLETE
 9        9  0.706667                15                       61  COMPLETE
 10      10  0.706667                13                       65  COMPLETE
 11      11  0.706667                16                       57  COMPLETE
 12      12  0.706667    

## 1.4 Running hyperparameter tuning for regression model

In [5]:
hyperparameter_tuning(
    df=df_boston,
    target="target",
    parameters=[
        {"name": "min_samples_leaf", "type": "Integer", "low": 50, "high": 75},
        {"name": "max_depth", "type": "Integer", "low": 12, "high": 24},
    ],
    algorithm=RandomForestRegressor,
    metric=mean_absolute_error,
    metric_type="Regression",
    scoring_option="minimize",
    n_trials=30,
    n_splits=2,
    shuffle=True,
    random_state=42,
    metric_goal=True,
)

[I 2022-01-31 16:01:51,106] A new study created in memory with name: no-name-74371ade-9ed4-47a9-913b-6285f8866799
[I 2022-01-31 16:01:51,283] Trial 0 finished with value: 5.197361870604654 and parameters: {'min_samples_leaf': 59, 'max_depth': 24}. Best is trial 0 with value: 5.197361870604654.
[I 2022-01-31 16:01:51,458] Trial 1 finished with value: 5.186109811030348 and parameters: {'min_samples_leaf': 69, 'max_depth': 19}. Best is trial 1 with value: 5.186109811030348.
[I 2022-01-31 16:01:51,625] Trial 2 finished with value: 4.960935898198869 and parameters: {'min_samples_leaf': 54, 'max_depth': 14}. Best is trial 2 with value: 4.960935898198869.
[I 2022-01-31 16:01:51,802] Trial 3 finished with value: 4.6986915646178 and parameters: {'min_samples_leaf': 51, 'max_depth': 23}. Best is trial 3 with value: 4.6986915646178.
[I 2022-01-31 16:01:51,992] Trial 4 finished with value: 5.249720828626785 and parameters: {

(    number     value  params_max_depth  params_min_samples_leaf     state
 0        0  5.197362                24                       59  COMPLETE
 1        1  5.186110                19                       69  COMPLETE
 2        2  4.960936                14                       54  COMPLETE
 3        3  4.698692                23                       51  COMPLETE
 4        4  5.249721                21                       65  COMPLETE
 5        5  4.727287                24                       50  COMPLETE
 6        6  5.143868                14                       71  COMPLETE
 7        7  4.994898                14                       54  COMPLETE
 8        8  5.158882                18                       57  COMPLETE
 9        9  5.227005                15                       61  COMPLETE
 10      10  5.281458                13                       65  COMPLETE
 11      11  5.196886                16                       57  COMPLETE
 12      12  5.245874    
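The trials table can be queried like any other pandas DataFrame. The sketch below rebuilds the twelve complete rows shown above by hand (trial 12 onward is cut off in the output, so those rows are left out) and picks the trial with the lowest mean absolute error. That the first element of the returned tuple is a DataFrame of this shape is an assumption based on the printed output.

```python
import pandas as pd

# The twelve complete rows of the regression trials table printed above.
trials_df = pd.DataFrame(
    {
        "number": list(range(12)),
        "value": [
            5.197362, 5.186110, 4.960936, 4.698692, 5.249721, 4.727287,
            5.143868, 4.994898, 5.158882, 5.227005, 5.281458, 5.196886,
        ],
        "params_max_depth": [24, 19, 14, 23, 21, 24, 14, 14, 18, 15, 13, 16],
        "params_min_samples_leaf": [59, 69, 54, 51, 65, 50, 71, 54, 57, 61, 65, 57],
        "state": ["COMPLETE"] * 12,
    }
)

# MAE is an error metric, so the best trial is the one with the smallest value.
completed = trials_df[trials_df["state"] == "COMPLETE"]
best = completed.loc[completed["value"].idxmin()]
print(best)

# The three best trials among the rows shown.
print(completed.sort_values("value").head(3))
```

Among the rows shown, trial 3 (`min_samples_leaf=51`, `max_depth=23`) has the lowest error; the remaining trials of the 30-trial run are not visible in the output and may contain a better one.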

## 2.0 Conclusion and library advantages

This implementation benefits the hyperparameter tuning process during the development of ML models because it standardizes that process. You can tune different algorithms by specifying only the search space, a few basic settings, and the DataFrame to be used. This makes it easy to run many trials in order to find the best set of hyperparameters for the train/test phase.

## References

[cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)


[KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html?highlight=kfold#sklearn.model_selection.KFold)


[optuna](https://optuna.readthedocs.io/en/v0.19.0/)


[RandomSampler](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.samplers.RandomSampler.html)


[make_scorer](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html)