# 📝 Exercise M3.01

The goal is to write an exhaustive search to find the parameter
combination that maximizes the model's generalization performance.

Here we use a small subset of the Adult Census dataset to make the code
faster to execute. Once your code works on the small subset, try
changing `train_size` to a larger value (e.g. 0.8 for 80% instead of
20%).

In [None]:
import pandas as pd

from sklearn.model_selection import train_test_split

adult_census = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

# Small subset (20% of the data) for fast iteration:
# data_train, data_test, target_train, target_test = train_test_split(
#     data, target, train_size=0.2, random_state=42)

# Larger training set (80% of the data):
data_train, data_test, target_train, target_test = train_test_split(
    data, target, train_size=0.8, random_state=42)

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Encode categorical columns as integers; categories unseen during fit map to -1.
categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value",
                                          unknown_value=-1)
# Apply the encoder to object-dtype columns and pass the others through unchanged.
preprocessor = ColumnTransformer(
    [("cat_preprocessor", categorical_preprocessor,
      selector(dtype_include=object))],
    remainder="passthrough", sparse_threshold=0)

model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", HistGradientBoostingClassifier(random_state=42))
])
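
The step names given to `Pipeline` become prefixes for the parameters of each step, joined with a double underscore. The sketch below illustrates this on a deliberately simpler, hypothetical pipeline (`toy_model`, built from `StandardScaler` and `LogisticRegression`, neither of which is used in this exercise), so that it runs without the dataset:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy pipeline, used only to illustrate parameter naming.
toy_model = Pipeline([
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# "<step name>__<parameter name>" addresses a parameter of a nested step.
toy_model.set_params(classifier__C=0.5)

c_value = toy_model.get_params()["classifier__C"]
param_names = [name for name in toy_model.get_params()
               if name.startswith("classifier__")]
print(c_value)
print(param_names[:5])
```

The same naming scheme should apply to `model` above, where the parameters of the gradient boosting step are reachable under the `classifier__` prefix.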

In [22]:
HistGradientBoostingClassifier?


Using the previously defined model (called `model`) and two nested `for`
loops, search for the best combination of the `learning_rate` and
`max_leaf_nodes` parameters. To do so, you will need to set the parameters,
then train and evaluate the model for each combination. The evaluation of
the model should be performed using `cross_val_score` on the training set.
We will search over the following parameter values:
- `learning_rate` for the values 0.01, 0.1, 1 and 10. This parameter controls
  the ability of a new tree to correct the error of the previous sequence of
  trees.
- `max_leaf_nodes` for the values 3, 10, 30. This parameter sets the maximum
  number of leaves per tree, which limits how deep and complex each tree can
  grow.
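
An exhaustive search over these values evaluates every pairing, so 4 learning rates and 3 leaf counts give 12 combinations. As a minimal sketch using only the standard library, the pairs visited by two nested `for` loops can be listed with `itertools.product` (the variable names here are illustrative):

```python
from itertools import product

learning_rates = [0.01, 0.1, 1, 10]
max_leaf_nodes_values = [3, 10, 30]

# product() yields the same pairs, in the same order, as two nested loops
# with learning_rates as the outer loop.
grid = list(product(learning_rates, max_leaf_nodes_values))

print(len(grid))   # 12
print(grid[:3])    # [(0.01, 3), (0.01, 10), (0.01, 30)]
```

The cost of the search grows multiplicatively with each added parameter, which is one reason to start on a small subset of the data.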

In [18]:
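import numpy as np
from sklearn.model_selection import cross_val_score

# Four values evenly spaced on a log scale: 0.01, 0.1, 1 and 10.
learning_rate = np.geomspace(.01, 10, 4)
max_leaf_nodes = [3, 10, 30]

for rate in learning_rate:
    for node in max_leaf_nodes:
        # Set both hyperparameters of the "classifier" step in a single call.
        model.set_params(classifier__learning_rate=rate,
                         classifier__max_leaf_nodes=node)
        # By default: 5-fold cross-validation, scored with accuracy
        # for a classifier.
        scores = cross_val_score(model, data_train, target_train)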
        print(f"""
        Accuracy score via cross-validation with learning_rate={rate}
        and max_leaf_nodes={node}:\n"""
              f"{scores.mean():.3f} +/- {scores.std():.3f}")


        Accuracy score via cross-validation with learning_rate=0.01
        and max_leaf_nodes=3:
0.797 +/- 0.002

        Accuracy score via cross-validation with learning_rate=0.01
        and max_leaf_nodes=10:
0.819 +/- 0.003

        Accuracy score via cross-validation with learning_rate=0.01
        and max_leaf_nodes=30:
0.848 +/- 0.004

        Accuracy score via cross-validation with learning_rate=0.1
        and max_leaf_nodes=3:
0.855 +/- 0.003

        Accuracy score via cross-validation with learning_rate=0.1
        and max_leaf_nodes=10:
0.869 +/- 0.003

        Accuracy score via cross-validation with learning_rate=0.1
        and max_leaf_nodes=30:
0.871 +/- 0.003

        Accuracy score via cross-validation with learning_rate=1.0
        and max_leaf_nodes=3:
0.862 +/- 0.007

        Accuracy score via cross-validation with learning_rate=1.0
        and max_leaf_nodes=10:
0.862 +/- 0.009

        Accuracy score via cross-validation with learning_rate=1.0
        and 
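
Reading the best combination off a long printout is error-prone. One possible alternative, sketched below, is to store each mean score in a dictionary keyed by the parameter pair and take the maximum. Here `evaluate` is a hypothetical stand-in for `cross_val_score(...).mean()`: it returns made-up numbers from a toy formula whose peak is placed arbitrarily, so the sketch runs without the dataset and says nothing about the real results.

```python
import math
from itertools import product


def evaluate(learning_rate, max_leaf_nodes):
    """Hypothetical stand-in for cross_val_score(...).mean().

    Toy formula with an arbitrary peak at (0.1, 10); not real results.
    """
    return -abs(math.log10(learning_rate) + 1) - abs(max_leaf_nodes - 10) / 10


results = {}
for rate, nodes in product([0.01, 0.1, 1, 10], [3, 10, 30]):
    results[(rate, nodes)] = evaluate(rate, nodes)

# The key with the highest stored score is the best combination.
best_params = max(results, key=results.get)
best_score = results[best_params]
print(best_params)
```

In the exercise itself, the body of `evaluate` would be replaced by the `set_params` and `cross_val_score` calls from the loop above.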


Now score the model on the test set, using the best parameters found by
cross-validation on the training set.

In [21]:
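# NOTE: among the cross-validation results shown above, learning_rate=0.1
# with max_leaf_nodes=30 has the highest mean accuracy (0.871); the values
# set below scored 0.862.
model.set_params(classifier__learning_rate=1,
                 classifier__max_leaf_nodes=3)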

model.fit(data_train, target_train)
model.score(data_test, target_test)

0.8714300337803256

In [23]:
model.get_params()['classifier__learning_rate']

1