## 1.0 Introduction

The main objective of this feature is to provide a hyperparameter tuning process for the development of Machine Learning models. This tutorial uses the iris and boston datasets from sklearn datasets. Data Scientists choose the parameters to be tested and pass them to the method, which is a wrapper around the Optuna optimization framework.

The parameters to be chosen are:
- df: pandas DataFrame
- target: name of the target variable
- parameters: list of dicts, one per hyperparameter, giving the name, type, and low/high bounds to be tested during optimization
- algorithm: Machine Learning algorithm used to fit the model (e.g. RandomForestClassifier, RandomForestRegressor)
- metric: metric used to evaluate the trials (e.g. accuracy_score, r2)
- scoring_option: whether to maximize or minimize the objective
- n_trials: the number of trials that the framework must perform

## 1.1 Import modules

In [1]:
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from mlutils.hyperparameter_tuning import hyperparameter_tuning

import warnings
warnings.filterwarnings('ignore')

## 1.2 Gathering the datasets

In [2]:
# for classification 
iris_data = datasets.load_iris()
df_iris = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)
df_iris["target"] = iris_data.target
df_iris.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [3]:
# for regression (load_boston was removed in scikit-learn 1.2; this cell needs an older version)
boston_data = datasets.load_boston()
df_boston = pd.DataFrame(boston_data.data, columns=boston_data.feature_names)
df_boston["target"] = boston_data.target
df_boston.head()

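| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |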
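
On scikit-learn 1.2 and later, `datasets.load_boston()` is no longer available. One possible substitute, which the original notebook does not use, is the diabetes regression dataset that ships with scikit-learn. The sketch below builds it with the same pattern as the cells above; the name `df_diabetes` is chosen here for illustration only.

```python
import pandas as pd
from sklearn import datasets

# load_diabetes is bundled with scikit-learn, so no download is needed.
diabetes_data = datasets.load_diabetes()
df_diabetes = pd.DataFrame(diabetes_data.data, columns=diabetes_data.feature_names)
df_diabetes["target"] = diabetes_data.target  # continuous regression target
df_diabetes.head()
```

The resulting DataFrame has ten feature columns plus `target`, so it can be passed as `df` with `target="target"` in the same way as the other datasets.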


## 1.3 Running hyperparameter tuning for classification model

In [4]:
hyperparameter_tuning(
    df=df_iris,
    target="target",
    parameters=[
        {"name": "min_samples_leaf", "type": "Integer", "low": 1, "high": 1000},
        {"name": "max_depth", "type": "Integer", "low": 12, "high": 1000},
        {"name": "n_estimators", "type": "Integer", "low": 12, "high": 1000},
    ],
    algorithm=RandomForestClassifier,
    metric=accuracy_score,
    scoring_option="maximize",
    n_trials=10,
    n_splits=2,
    suffle=True,
    random_state=42,
    metric_goal='True',
)

[I 2022-01-26 22:40:14,227] A new study created in memory with name: no-name-0170e4b5-3464-44bf-a423-9a296e51398a
[I 2022-01-26 22:40:15,282] Trial 0 finished with value: 0.29333333333333333 and parameters: {'min_samples_leaf': 375, 'max_depth': 952, 'n_estimators': 735}. Best is trial 0 with value: 0.29333333333333333.
[I 2022-01-26 22:40:15,511] Trial 1 finished with value: 0.29333333333333333 and parameters: {'min_samples_leaf': 599, 'max_depth': 166, 'n_estimators': 166}. Best is trial 0 with value: 0.29333333333333333.
[I 2022-01-26 22:40:16,359] Trial 2 finished with value: 0.29333333333333333 and parameters: {'min_samples_leaf': 59, 'max_depth': 868, 'n_estimators': 606}. Best is trial 0 with value: 0.29333333333333333.
[I 2022-01-26 22:40:17,694] Trial 3 finished with value: 0.29333333333333333 and parameters: {'min_samples_leaf': 709, 'max_depth': 32, 'n_estimators': 971}. Best is trial 0 with value: 0.29333333333333

({'min_samples_leaf': 375, 'max_depth': 952, 'n_estimators': 735},
 0.29333333333333333)
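
Every trial logged above reports the same accuracy. A likely explanation, offered here as an assumption rather than something the original notebook states, is that the sampled `min_samples_leaf` values are too large for the data: iris has 150 rows, so a 2-fold split leaves 75 training rows, and a tree cannot split when `min_samples_leaf` exceeds half of that. The sketch below checks this on a single fold, assuming a `KFold` split with the same `n_splits`, shuffle and `random_state` settings as the call above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
train_idx, test_idx = next(KFold(n_splits=2, shuffle=True, random_state=42).split(X))

# min_samples_leaf=375 (trial 0 above) is larger than the 75 training rows,
# so no split is possible and every tree stays a single leaf.
model = RandomForestClassifier(n_estimators=50, min_samples_leaf=375, random_state=42)
model.fit(X[train_idx], y[train_idx])

predictions = model.predict(X[test_idx])
n_leaves = {tree.get_n_leaves() for tree in model.estimators_}
print(set(predictions), n_leaves)  # a single predicted class, one leaf per tree
```

If that is what happened, an upper bound for `min_samples_leaf` well below the fold size would let the trials differ from one another.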

## 1.4 Running hyperparameter tuning for regression model

In [5]:
# Note: df_iris is passed here, which is what produced the output shown below;
# pass df_boston instead to tune on the regression dataset loaded in In [3].
hyperparameter_tuning(
    df=df_iris,
    target="target",
    parameters=[
        {"name": "min_samples_leaf", "type": "Integer", "low": 1, "high": 1000},
        {"name": "max_depth", "type": "Integer", "low": 12, "high": 1000},
        {"name": "n_estimators", "type": "Integer", "low": 12, "high": 1000},
    ],
    algorithm=RandomForestRegressor,
    metric=mean_absolute_error,
    scoring_option="minimize",
    n_trials=10,
    n_splits=2,
    suffle=True,
    random_state=42,
    metric_goal='False',
)

[I 2022-01-26 22:40:20,481] A new study created in memory with name: no-name-93a5ba1b-93da-4e6f-a195-136889889196
[I 2022-01-26 22:40:21,377] Trial 0 finished with value: 0.6983031292517003 and parameters: {'min_samples_leaf': 375, 'max_depth': 952, 'n_estimators': 735}. Best is trial 0 with value: 0.6983031292517003.
[I 2022-01-26 22:40:21,595] Trial 1 finished with value: 0.7019469879518074 and parameters: {'min_samples_leaf': 599, 'max_depth': 166, 'n_estimators': 166}. Best is trial 0 with value: 0.6983031292517003.
[I 2022-01-26 22:40:22,332] Trial 2 finished with value: 0.6988843417675097 and parameters: {'min_samples_leaf': 59, 'max_depth': 868, 'n_estimators': 606}. Best is trial 0 with value: 0.6983031292517003.
[I 2022-01-26 22:40:23,513] Trial 3 finished with value: 0.6999597207918522 and parameters: {'min_samples_leaf': 709, 'max_depth': 32, 'n_estimators': 971}. Best is trial 0 with value: 0.6983031292517003.

({'min_samples_leaf': 833, 'max_depth': 222, 'n_estimators': 191},
 0.6968898196625948)
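
As the output above shows, the call returns a tuple holding the best parameter dictionary and the best metric value. A minimal sketch of how those values might be reused to fit a final model follows. The names `best_params`, `best_score` and `final_model` are illustrative, and the iris data is used only because it is the DataFrame passed in the call above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor

# Best parameters and score copied from the output above.
best_params, best_score = (
    {"min_samples_leaf": 833, "max_depth": 222, "n_estimators": 191},
    0.6968898196625948,
)

X, y = load_iris(return_X_y=True)

# Unpack the tuned values into the estimator and refit on the full data.
final_model = RandomForestRegressor(**best_params, random_state=42)
final_model.fit(X, y)
```

Unpacking the dictionary with `**best_params` works because its keys match the estimator's constructor arguments.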

## 2.0 Conclusion and library advantages

This implementation benefits the hyperparameter tuning process during the development of ML models because it standardizes it. You can test many hyperparameter combinations by specifying only the search ranges, the algorithm, the metric and the DataFrame to be used. This makes it very easy to run a lot of tests in order to get the best set of hyperparameters for the train/test phase.

## References

- [cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)
- [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html?highlight=kfold#sklearn.model_selection.KFold)
- [optuna](https://optuna.readthedocs.io/en/v0.19.0/)
- [RandomSampler](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.samplers.RandomSampler.html)
- [make_scorer](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html)