# 📝 Exercise M3.01

The goal is to write an exhaustive search to find the combination of
parameters that maximizes the model's generalization performance.

Here we use a small subset of the Adult Census dataset to make the code faster
to execute. Once your code works on the small subset, try to change
`train_size` to a larger value (e.g. 0.8 for 80% instead of 20%).

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split

adult_census = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

data_train, data_test, target_train, target_test = train_test_split(
    data, target, train_size=0.2, random_state=42
)

In [5]:
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OrdinalEncoder

categorical_preprocessor = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
preprocessor = make_column_transformer(
    (categorical_preprocessor, selector(dtype_include=object)),
    remainder="passthrough",
)

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

model = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("classifier", HistGradientBoostingClassifier(random_state=42)),
    ]
)

model.get_params()

{'memory': None,
 'steps': [('preprocessor',
   ColumnTransformer(remainder='passthrough',
                     transformers=[('ordinalencoder',
                                    OrdinalEncoder(handle_unknown='use_encoded_value',
                                                   unknown_value=-1),
                                    <sklearn.compose._column_transformer.make_column_selector object at 0x7f2ecf3cda90>)])),
  ('classifier', HistGradientBoostingClassifier(random_state=42))],
 'transform_input': None,
 'verbose': False,
 'preprocessor': ColumnTransformer(remainder='passthrough',
                   transformers=[('ordinalencoder',
                                  OrdinalEncoder(handle_unknown='use_encoded_value',
                                                 unknown_value=-1),
                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f2ecf3cda90>)]),
 'classifier': HistGradientBoostingClassifier(random_state=42),
 'preproc

Use the previously defined model (called `model`) and two nested `for` loops
to search for the best combination of the `learning_rate` and
`max_leaf_nodes` parameters. For each combination, set the parameters on the
model and evaluate it using `cross_val_score` on the training set. Use the
following parameter values:
- `learning_rate`: 0.01, 0.1, 1 and 10. This parameter controls the ability
  of a new tree to correct the error of the previous sequence of trees.
- `max_leaf_nodes`: 3, 10 and 30. This parameter caps the number of leaves of
  each tree, and therefore limits its depth and complexity.

In [13]:
# Write your code here.
from sklearn.model_selection import cross_val_score

for learning_rate in [0.01, 0.1, 1, 10]:
    for max_leaf_nodes in [3, 10, 30]:
        # Pipeline step parameters are addressed as "<step>__<parameter>".
        model.set_params(
            classifier__learning_rate=learning_rate,
            classifier__max_leaf_nodes=max_leaf_nodes,
        )
        # Mean score over the default 5-fold cross-validation on the training set.
        mean_score = cross_val_score(model, data_train, target_train).mean()
        print(
            f"CrossVal score for learning_rate= {learning_rate} and "
            f"max_leaf_nodes= {max_leaf_nodes} is: {mean_score:.3f}"
        )

CrossVal score for learning_rate= 0.01 and max_leaf_nodes= 3 is: 0.790
CrossVal score for learning_rate= 0.01 and max_leaf_nodes= 10 is: 0.814
CrossVal score for learning_rate= 0.01 and max_leaf_nodes= 30 is: 0.842
CrossVal score for learning_rate= 0.1 and max_leaf_nodes= 3 is: 0.849
CrossVal score for learning_rate= 0.1 and max_leaf_nodes= 10 is: 0.863
CrossVal score for learning_rate= 0.1 and max_leaf_nodes= 30 is: 0.861
CrossVal score for learning_rate= 1 and max_leaf_nodes= 3 is: 0.854
CrossVal score for learning_rate= 1 and max_leaf_nodes= 10 is: 0.837
CrossVal score for learning_rate= 1 and max_leaf_nodes= 30 is: 0.824
CrossVal score for learning_rate= 10 and max_leaf_nodes= 3 is: 0.288
CrossVal score for learning_rate= 10 and max_leaf_nodes= 10 is: 0.646
CrossVal score for learning_rate= 10 and max_leaf_nodes= 30 is: 0.534


Now use the test set to score the model using the best parameters that we
found using cross-validation. You will have to refit the model over the full
training set.

In [None]:
# Write your code here.
# Refit the model with the best parameters on the full training set,
# then score it once on the held-out test set.
model.set_params(
    classifier__learning_rate=0.1, classifier__max_leaf_nodes=10
)
model.fit(data_train, target_train)
model.score(data_test, target_test)