# 📝 Exercise M3.02

The goal is to find the set of hyperparameters that maximizes the
generalization performance, as estimated by cross-validation on a training set.

In this exercise, we progressively define the regression pipeline and later
tune its hyperparameters.

Start by defining a pipeline that:
* uses a `StandardScaler` to normalize the numerical data;
* uses a `sklearn.neighbors.KNeighborsRegressor` as a predictive model.
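
A minimal sketch of the pipeline described above, assuming `make_pipeline` is used to chain the two steps. The variable name `knn_pipeline` is hypothetical and chosen only for illustration.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline names each step after its lowercased class name.
# `knn_pipeline` is a hypothetical name used only in this sketch.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

step_names = [name for name, _ in knn_pipeline.steps]
print(step_names)  # ['standardscaler', 'kneighborsregressor']
```

These automatically generated step names are the prefixes needed later to address hyperparameters inside the pipeline.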

In [4]:
# Write your code here.
import pandas as pd

from sklearn.model_selection import train_test_split

adult_census = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

data_train, data_test, target_train, target_test = train_test_split(
    data, target, train_size=0.2, random_state=42
)

In [5]:
adult_census.head()

| | age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |


In [6]:
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OrdinalEncoder

categorical_preprocessor = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
preprocessor = make_column_transformer(
    (categorical_preprocessor, selector(dtype_include=object)),
    remainder="passthrough",
)

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

model = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("classifier", HistGradientBoostingClassifier(random_state=42)),
    ]
)

Use `RandomizedSearchCV` with `n_iter=20` and
`scoring="neg_mean_absolute_error"` to tune the following hyperparameters
of the `model`:

- the parameter `n_neighbors` of the `KNeighborsRegressor` with values
  `np.logspace(0, 3, num=10).astype(np.int32)`;
- the parameter `with_mean` of the `StandardScaler` with possible values
  `True` or `False`;
- the parameter `with_std` of the `StandardScaler` with possible values `True`
  or `False`.
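
One possible way to write this search space is shown below. It assumes the pipeline was built with `make_pipeline`, so each hyperparameter is addressed as `<step name>__<parameter name>`. The names `knn_pipeline` and `missing` are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical pipeline matching the description in this exercise.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

# Keys follow the "<step name>__<parameter name>" convention.
param_distributions = {
    "kneighborsregressor__n_neighbors": np.logspace(0, 3, num=10).astype(np.int32),
    "standardscaler__with_mean": [True, False],
    "standardscaler__with_std": [True, False],
}

# Sanity check: every key should be a parameter the pipeline actually exposes.
valid_names = knn_pipeline.get_params().keys()
missing = [name for name in param_distributions if name not in valid_names]
print(missing)  # [] when all names are valid
```

Checking the keys against `get_params()` catches misspelled parameter names before the search is launched.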

The `scoring` function is expected to return higher values for better models,
since grid/random search objects **maximize** it. Because of that, error
metrics like `mean_absolute_error` must be negated (using the `neg_` prefix)
to work correctly (remember lower errors represent better models).
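
The sign convention can be checked with a small sketch. It assumes a synthetic dataset from `make_regression` as a stand-in for the real data, and the variable names are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data, used here only as a stand-in for a real dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# The scorer returns the negated error, so all values are <= 0.
scores = cross_val_score(
    KNeighborsRegressor(), X, y, cv=5, scoring="neg_mean_absolute_error"
)

# Flip the sign to recover the usual mean absolute error for each fold.
mae_per_fold = -scores
print(f"Mean absolute error: {mae_per_fold.mean():.2f}")
```

A score closer to zero therefore means a smaller error, which is why the search can keep maximizing.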

Notice that in the notebook "Hyperparameter tuning by randomized-search" we
pass distributions to be sampled by the `RandomizedSearchCV`. In this case we
define a fixed grid of hyperparameters to be explored. Using a `GridSearchCV`
instead would explore all the possible combinations on the grid, which can be
costly to compute for large grids, whereas the parameter `n_iter` of the
`RandomizedSearchCV` controls the number of different random combinations that
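are evaluated. Notice that setting `n_iter` larger than the number of possible
combinations in a grid (in this case 10 x 2 x 2 = 40) brings no benefit: when
every hyperparameter is given as a list, scikit-learn samples combinations
without replacement, so it can evaluate at most 40 distinct candidates.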
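
The size of the grid can be verified with `ParameterGrid`, and `ParameterSampler` shows what a draw of 20 candidates looks like. This is a sketch of the counting argument only. The dictionary keys assume a pipeline built with `make_pipeline`, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same fixed grid as described in the exercise.
grid = {
    "kneighborsregressor__n_neighbors": np.logspace(0, 3, num=10).astype(np.int32),
    "standardscaler__with_mean": [True, False],
    "standardscaler__with_std": [True, False],
}

# An exhaustive grid search would evaluate every combination.
n_combinations = len(ParameterGrid(grid))
print(n_combinations)  # 40

# A randomized search with n_iter=20 evaluates only a random subset.
sampled = list(ParameterSampler(grid, n_iter=20, random_state=0))
n_distinct = len({tuple(sorted(candidate.items())) for candidate in sampled})
print(len(sampled), n_distinct)
```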

Once the computation has completed, print the best combination of parameters
stored in the `best_params_` attribute.
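
Putting the pieces together, the search could look like the sketch below. It runs on synthetic data from `make_regression` rather than the real dataset, so the printed values carry no meaning beyond illustration. The names `knn_pipeline`, `search` and `best_mae` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; 2000 samples so that each training fold
# holds more than the largest n_neighbors value (1000).
X, y = make_regression(n_samples=2000, n_features=5, noise=10.0, random_state=0)

knn_pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())
param_distributions = {
    "kneighborsregressor__n_neighbors": np.logspace(0, 3, num=10).astype(np.int32),
    "standardscaler__with_mean": [True, False],
    "standardscaler__with_std": [True, False],
}

search = RandomizedSearchCV(
    knn_pipeline,
    param_distributions=param_distributions,
    n_iter=20,  # evaluate 20 of the 40 possible combinations
    scoring="neg_mean_absolute_error",
    cv=5,
    random_state=0,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)

# best_score_ is a negated error; flip the sign to report the MAE.
best_mae = -search.best_score_
print(f"Best cross-validated MAE: {best_mae:.2f}")
```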

Here I used the same exercise as in M3.01 and the same dataset, but instead of an exhaustive search I used `RandomizedSearchCV`.

In [9]:
# Write your code here.
# from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
param_distributions = {
    "classifier__learning_rate": [0.01, 0.1, 1, 10],
    "classifier__max_leaf_nodes": [3, 10, 30],
}


search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_distributions,
    n_iter=6,  # tries 6 random combinations out of the 4 * 3 = 12 possible
    cv=2,
    scoring="accuracy",
    random_state=42,
    n_jobs=-1,
)

search.fit(data_train, target_train)

print(f"The best accuracy obtained is {search.best_score_}")
print(
    "The best parameters found are:",
    {
        "learning_rate": search.best_params_["classifier__learning_rate"],
        "max_leaf_nodes": search.best_params_["classifier__max_leaf_nodes"],
    },
)


The best accuracy obtained is 0.8574938574938575
The best parameters found are: {'learning_rate': 0.1, 'max_leaf_nodes': 30}


In [10]:
test_score = search.score(data_test, target_test)
print(f"Test score after the parameter tuning: {test_score}")

Test score after the parameter tuning: 0.8684547269283923
