# Hyperparameter tuning
Hyperparameter tuning is a blend of manual and automatic techniques for adjusting the parameters that control the learning process of a machine learning model. These parameters affect the model's performance, so their default values are not necessarily the best choice.

In [3]:
import pandas as pd

adult_census = pd.read_csv("../Module One/adult_census.csv")

target_name = "class"
numerical_columns = [
    "age", "capital-gain", "capital-loss", "hours-per-week"]

target = adult_census[target_name]
data = adult_census[numerical_columns]

In [4]:
data.head()

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week
0,25,0,0,40
1,38,0,0,50
2,28,0,0,40
3,44,7688,0,40
4,18,0,0,30


In [6]:
adult_census.sample(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
39222,44,Private,85440,HS-grad,9,Married-civ-spouse,Adm-clerical,Husband,White,Male,0,0,45,United-States,<=50K
43102,22,Private,279901,HS-grad,9,Married-civ-spouse,Other-service,Own-child,Black,Male,0,0,40,United-States,<=50K
3268,25,Private,198512,Bachelors,13,Never-married,Exec-managerial,Not-in-family,White,Female,0,0,40,United-States,<=50K
37879,24,Private,291407,HS-grad,9,Married-civ-spouse,Transport-moving,Husband,Black,Male,0,0,40,United-States,<=50K
42244,61,Self-emp-not-inc,221884,Some-college,10,Married-civ-spouse,Adm-clerical,Husband,White,Male,0,0,50,United-States,>50K


Many models, including the logistic regression used below, train more reliably when all features are on a similar scale, so we use `StandardScaler` to standardize each feature to zero mean and unit variance.

In [8]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression()),
])
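
To make the effect of the scaling step concrete, here is a minimal sketch that applies `StandardScaler` to the `age` and `capital-gain` values from the five rows displayed by `data.head()` above. The array `X` is typed in by hand for illustration and is not loaded from the CSV file.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# "age" and "capital-gain" from the five rows shown by data.head(),
# copied by hand so the snippet runs without the CSV file.
X = np.array([
    [25.0, 0.0],
    [38.0, 0.0],
    [28.0, 0.0],
    [44.0, 7688.0],
    [18.0, 0.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has (approximately) zero mean and unit
# standard deviation, even though the raw ranges differ widely.
print("means:", X_scaled.mean(axis=0))
print("stds: ", X_scaled.std(axis=0))
```

Inside the pipeline, the same transformation is learned on the training folds only and then applied to the test fold, which avoids leaking test statistics into training.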

Next, we estimate the generalization performance of the model using cross-validation.

In [10]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target)
scores = cv_results["test_score"]

print(f"Accuracy score via cross-validation: \n"
        f"{scores.mean():.3f} +/- {scores.std():.3f}")

Accuracy score via cross-validation: 
0.800 +/- 0.003
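
As a rough illustration of what `cross_validate` returns, the sketch below runs the same kind of pipeline on a small synthetic dataset. The data from `make_classification`, the variable names, and the explicit `cv=5` are assumptions made for this example and are not part of the notebook above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the census data (illustration only).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

pipe = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# cv=5 is written out here to make the number of folds explicit.
results = cross_validate(pipe, X, y, cv=5)

# The result is a dict of arrays with one entry per fold.
for key, values in results.items():
    print(key, values.shape)
```

The `test_score` entry holds one score per fold, which is why the notebook reports its mean and standard deviation.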


We can change a parameter of a model after it has been created by using the `set_params` method, which is available on all scikit-learn estimators, and then re-evaluate the model.

In [12]:
model.set_params(classifier__C=1e-3)
cv_results = cross_validate(model, data, target)
scores = cv_results["test_score"]

print(f"Accuracy score via cross-validation: \n"
        f"{scores.mean():.3f} +/- {scores.std():.3f}")

Accuracy score via cross-validation: 
0.787 +/- 0.002


When the model of interest is a `Pipeline`, parameter names take the form `<step_name>__<parameter_name>` (note the double underscore in the middle). In our case, `classifier` is the step name from the `Pipeline` definition and `C` is a parameter of `LogisticRegression`.

In [16]:
for parameter in model.get_params(): 
    print(parameter)

memory
steps
verbose
preprocessor
classifier
preprocessor__copy
preprocessor__with_mean
preprocessor__with_std
classifier__C
classifier__class_weight
classifier__dual
classifier__fit_intercept
classifier__intercept_scaling
classifier__l1_ratio
classifier__max_iter
classifier__multi_class
classifier__n_jobs
classifier__penalty
classifier__random_state
classifier__solver
classifier__tol
classifier__verbose
classifier__warm_start


In [17]:
model.get_params()["classifier__C"]

0.001
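
The double-underscore naming can also be checked from the other direction. The following sketch, which builds a fresh pipeline named `pipe` purely for illustration, sets parameters through the pipeline and then reads them back from the individual steps via `named_steps`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# "<step name>__<parameter name>" routes each value to the right step.
pipe.set_params(classifier__C=1e-3, preprocessor__with_mean=False)

clf = pipe.named_steps["classifier"]
scaler = pipe.named_steps["preprocessor"]

print(clf.C)             # value set through the pipeline
print(scaler.with_mean)  # value set through the pipeline

# The pipeline-level view and the step-level attribute agree.
print(pipe.get_params()["classifier__C"] == clf.C)
```

No fitting is needed here, because `set_params` only updates the estimators' configuration.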

In [18]:
# We can systematically vary the value of C to see if there is an optimal value
for C in [1e-3, 1e-2, 1e-1, 1, 10]:
    model.set_params(classifier__C=C)
    cv_results = cross_validate(model, data, target)
    scores = cv_results["test_score"]
    
    print(f"Accuracy score via cross-validation: \n"
        f"{scores.mean():.3f} +/- {scores.std():.3f}")

Accuracy score via cross-validation: 
0.787 +/- 0.002
Accuracy score via cross-validation: 
0.799 +/- 0.003
Accuracy score via cross-validation: 
0.800 +/- 0.003
Accuracy score via cross-validation: 
0.800 +/- 0.003
Accuracy score via cross-validation: 
0.800 +/- 0.003


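The output shows that accuracy is lowest at `C=1e-3` (0.787) and levels off at 0.800 once `C` reaches 0.1. In `LogisticRegression`, `C` is the inverse of the regularization strength, so small values regularize heavily; as long as `C` is high enough, the model performs well.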
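
The manual loop over `C` can also be automated. One possible approach, not used in the notebook itself, is scikit-learn's `GridSearchCV`, which cross-validates every candidate value and keeps the best one. The sketch below runs on synthetic data from `make_classification` rather than the census data, so its scores are not comparable to those above; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the census data (illustration only).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

pipe = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Same candidate values as the manual loop, addressed with the
# "<step name>__<parameter name>" convention.
param_grid = {"classifier__C": [1e-3, 1e-2, 1e-1, 1, 10]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean cross-validated accuracy: {search.best_score_:.3f}")
```

Because the search selects `C` using the cross-validation scores, `best_score_` tends to be an optimistic estimate; a separate held-out test set gives a fairer picture of the tuned model.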