# 📝 Exercise M3.02

The goal is to find the best set of hyperparameters which maximize the
generalization performance on a training set.

Here again with limit the size of the training set to make computation
run faster. Feel free to increase the `train_size` value if your computer
is powerful enough.

In [1]:

import numpy as np
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, train_size=0.2, random_state=42)

In this exercise, we will progressively define the classification pipeline
and later tune its hyperparameters.

Our pipeline should:
* preprocess the categorical columns using a `OneHotEncoder` and use a
  `StandardScaler` to normalize the numerical data.
* use a `LogisticRegression` as a predictive model.

Start by defining the columns and the preprocessing pipelines to be applied
on each group of columns.

In [3]:
from sklearn.compose import make_column_selector as selector

# Write your code here.
categorical_selector = selector(dtype_include=object)
numerical_selector = selector(dtype_exclude=object)

categorical_columns = categorical_selector(data)
numerical_columns = numerical_selector(data)

In [4]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Write your code here.
cat_processor = OneHotEncoder(handle_unknown='ignore')
num_processor = StandardScaler()

Subsequently, create a `ColumnTransformer` to redirect the specific columns
a preprocessing pipeline.

In [7]:
from sklearn.compose import ColumnTransformer

# Write your code here.
preprocessor = ColumnTransformer(
[
    ('cat_process', cat_processor, categorical_columns),
    ('num_process', num_processor, numerical_columns)
])

Assemble the final pipeline by combining the above preprocessor
with a logistic regression classifier. Force the maximum number of
iterations to `10_000` to ensure that the model will converge.

In [8]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# Write your code here.
model = make_pipeline(preprocessor, LogisticRegression(max_iter=11_000))

Use `RandomizedSearchCV` with `n_iter=20` to find the best set of
hyperparameters by tuning the following parameters of the `model`:

- the parameter `C` of the `LogisticRegression` with values ranging from
  0.001 to 10. You can use a log-uniform distribution
  (i.e. `scipy.stats.loguniform`);
- the parameter `with_mean` of the `StandardScaler` with possible values
  `True` or `False`;
- the parameter `with_std` of the `StandardScaler` with possible values
  `True` or `False`.

Once the computation has completed, print the best combination of parameters
stored in the `best_params_` attribute.

In [9]:
model.get_params()

{'memory': None,
 'steps': [('columntransformer',
   ColumnTransformer(transformers=[('cat_process',
                                    OneHotEncoder(handle_unknown='ignore'),
                                    ['workclass', 'education', 'marital-status',
                                     'occupation', 'relationship', 'race', 'sex',
                                     'native-country']),
                                   ('num_process', StandardScaler(),
                                    ['age', 'capital-gain', 'capital-loss',
                                     'hours-per-week'])])),
  ('logisticregression', LogisticRegression(max_iter=11000))],
 'verbose': False,
 'columntransformer': ColumnTransformer(transformers=[('cat_process',
                                  OneHotEncoder(handle_unknown='ignore'),
                                  ['workclass', 'education', 'marital-status',
                                   'occupation', 'relationship', 'race', 'sex',
             

In [10]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

# Write your code here.
params_dict = {
    'columntransformer__num_process__with_mean': [True, False],
    'columntransformer__num_process__with_std': [True, False],
    'logisticregression__C': loguniform(1e-3, 10)
}

model_random_search = RandomizedSearchCV(model,
                                        param_distributions= params_dict,
                                        n_iter=20, error_score='raise',
                                        n_jobs=-1, verbose=1)
model_random_search.fit(data_train, target_train)
model_random_search.best_params_

Fitting 5 folds for each of 20 candidates, totalling 100 fits


{'columntransformer__num_process__with_mean': False,
 'columntransformer__num_process__with_std': False,
 'logisticregression__C': 1.4789056063421475}