# Grid search with sklearn


Grid search is a brute-force method for finding the best hyperparameters for a given dataset and model: it evaluates every combination in a predefined grid of candidate values. It can be combined with cross-validation and pipelines to automate the whole task. The heatmap below is a visual representation of the idea, showing the accuracy obtained for several hyperparameter combinations.

<img src="https://mlr.mlr-org.com/articles/tutorial/hyperpar_tuning_effects_files/figure-html/unnamed-chunk-11-1.png" alt="Grid Search" style="width: 500px;"/>

scikit-learn provides a complete toolbox for this. In this notebook, we concentrate on the `GridSearchCV` class.
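As a rough illustration of the heatmap idea, the sketch below runs a small grid search on synthetic data and arranges the mean cross-validated accuracy into a table with one row per `C` value and one column per `tol` value. The data, the step names `scaler` and `clf`, and the candidate values are assumptions chosen for the example; they are not part of the exercise.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the mortgages dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Keys follow the "<step name>__<parameter>" convention
demo_grid = {"clf__C": [0.1, 1.0, 10.0], "clf__tol": [1e-2, 1e-4]}

demo_search = GridSearchCV(demo_pipe, demo_grid, cv=5, scoring="accuracy")
demo_search.fit(X_demo, y_demo)

# One row per evaluated combination, then pivot into a C-by-tol table
rows = [
    {"C": params["clf__C"], "tol": params["clf__tol"], "mean_accuracy": score}
    for params, score in zip(demo_search.cv_results_["params"],
                             demo_search.cv_results_["mean_test_score"])
]
heat = pd.DataFrame(rows).pivot(index="C", columns="tol", values="mean_accuracy")
print(heat)
```

Each cell of `heat` corresponds to one square of a heatmap like the one shown above.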

## Exercise 1

For this exercise we are going to use a data file called **mortgages**. This file contains information about clients who wish to buy a house, and whether they were granted a mortgage or not. The aim of this exercise is to prepare the dataset, train a logistic regression baseline, and tune its hyperparameters with grid search.

You can download it from the following link:

https://github.com/jnin/information-systems/raw/main/data/mortgages.csv


The data is provided in CSV format. The dataset contains the following attributes:


|Variable|Definition|Type / values|
|---------|------------|-----------------|
|income|Family income|Integer|
|regular_bills|General expenses|Integer|
|car_bills|Car-related expenses|Integer|
|other_bills|Extraordinary expenses|Integer|
|savings|Down payment amount|Integer|
|house_price|House value|Integer|
|marital_status|Marital status|divorced, married or single|
|number_of_children|Number of children|0, 1, 3, 4|
|mortgage|Credit given|0 = No, 1 = Yes|

<div class="alert alert-info"><b>Exercise 1.1</b> 

Create a dataframe called ```df``` that contains the provided data. Extract the features matrix and target array from ```df``` and store them in two new variables called ```X``` and ```y```, respectively.

Print the label distribution to check how imbalanced the classes are.
</div>
<div class="alert alert-warning">

Use the last column `mortgage` as the target variable.

</div>
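The pattern for splitting features from the target and inspecting the label distribution can be sketched on a toy frame. The values below are made up, and only some of the column names from the table above are reused; for the real file, `pd.read_csv` on the link given earlier would take the place of the hand-built frame.

```python
import pandas as pd

# Toy stand-in frame (assumption: made-up values, not the real data)
toy_df = pd.DataFrame({
    "income": [3000, 4200, 2500, 5100, 3900],
    "house_price": [150000, 220000, 180000, 300000, 210000],
    "mortgage": [0, 0, 1, 0, 0],
})

# Features: every column except the target; target: the `mortgage` column
X_toy = toy_df.drop(columns="mortgage")
y_toy = toy_df["mortgage"]

# Relative class frequencies show how imbalanced the labels are
label_share = y_toy.value_counts(normalize=True)
print(label_share)
```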

In [2]:
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_score

# Use the first 150 samples of the diabetes dataset
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
lasso = linear_model.Lasso()

# cross_val_score fits clones of the estimator, so `lasso` itself stays unfitted
print(cross_val_score(lasso, X, y, cv=3))

# Fit explicitly before scoring; otherwise score() raises NotFittedError
lasso.fit(X, y)
lasso.score(X, y)

[0.3315057  0.08022103 0.03531816]

In [None]:
<FILL IN>

<div class="alert alert-info"><b>Exercise 1.2</b> 
    
Write the code to normalize the data using a ```StandardScaler``` and train a ```LogisticRegression```. Then, check the score of your model. Remember to split your data into train and test datasets.

We will use this model as a baseline for the remaining exercises.

</div>
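A minimal sketch of such a baseline on synthetic data is shown here. The data and the step names `scaler` and `lg` are assumptions made for the example. Putting the scaler inside the pipeline means it is fitted on the training split only, so no information from the test split leaks into the scaling.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the mortgages dataset)
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=42)

# Hold out a stratified test set (25% of the samples)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=42
)

baseline = Pipeline([
    ("scaler", StandardScaler()),
    ("lg", LogisticRegression()),
])
baseline.fit(X_tr, y_tr)

# For classifiers, score() returns the mean accuracy on the given data
test_accuracy = baseline.score(X_te, y_te)
print(test_accuracy)
```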

<div class="alert alert-warning">

To save time, we have provided code for a possible column transformer that imputes missing values in the `'marital_status'` and `'regular_bills'` columns, along with an encoder for the `'marital_status'` column.

</div>


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Categorical branch: impute the most frequent value, then one-hot encode
imputer_cat = ('imputer_cat', SimpleImputer(strategy='most_frequent'))
# Note: the sparse_output argument requires scikit-learn >= 1.2
ohe = ('encoder', OneHotEncoder(sparse_output=False))

marital_status_pipe = Pipeline([imputer_cat, ohe])

# Impute the median of 'regular_bills'; pass the remaining columns through unchanged
transformer = ColumnTransformer([('marital_status_transformations', marital_status_pipe, ['marital_status']),
                                 ('imputer_num', SimpleImputer(strategy='median'), ['regular_bills'])],
                                remainder='passthrough')


<FILL IN>

<div class="alert alert-info"><b>Exercise 1.3</b> 
    
Write code to print the `classification_report` for the previous pipeline, then examine the output. Do you notice any issues related to the class imbalance in the dataset?

</div>
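To see why accuracy alone can hide imbalance problems, consider this small sketch with hand-written labels. Both lists are hypothetical and exist only to make the arithmetic easy to follow: the classifier gets 10 of 12 samples right, yet it finds only 1 of the 3 positives.

```python
from sklearn.metrics import accuracy_score, classification_report, recall_score

# Hypothetical ground truth: 9 negatives followed by 3 positives
y_true = [0] * 9 + [1] * 3
# Hypothetical predictions: every negative is right, only one positive is caught
y_pred = [0] * 9 + [1, 0, 0]

# Per-class precision, recall and F1, plus overall accuracy
print(classification_report(y_true, y_pred))

acc = accuracy_score(y_true, y_pred)  # 10 correct out of 12
rec = recall_score(y_true, y_pred)    # 1 of 3 positives recovered
print(acc, rec)
```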

In [None]:
<FILL IN>

<div class="alert alert-info"><b>Exercise 1.4</b> 
    
Write code to create a second `Pipeline`, this time setting the `class_weight='balanced'` parameter in `LogisticRegression`. Print the `classification_report` for this model and compare it with the previous one. Did setting the class weight address the imbalance-related issues?

</div>
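With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`, so errors on the rarer class cost more during training. The sketch below computes these weights for a hypothetical label vector with 90 negatives and 10 positives, using the `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y_imb = np.array([0] * 90 + [1] * 10)
classes = np.array([0, 1])

# Balanced weights: n_samples / (n_classes * count of each class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_imb)
print(dict(zip(classes.tolist(), weights.tolist())))
```

Here the minority class receives a weight of 100 / (2 * 10) = 5, while the majority class receives 100 / (2 * 90), roughly 0.56.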

In [None]:
<FILL IN>


<div class="alert alert-info"><b>Exercise 1.5</b> 


To further improve the recall of our logistic regression model, we need to try different hyperparameters. To do this, we can use scikit-learn's ```GridSearchCV``` class. Write the code to find the best hyperparameters for our mortgage dataset using recall as the main scoring metric.

</div>

<div class="alert alert-warning">

To save time, we’ve provided a sample `param_grid` dictionary with a predefined search space. Before using it, make sure to adjust the dictionary keys to match your pipeline. Additionally, refer to the scikit-learn documentation to understand the different hyperparameters and make any necessary adjustments.

The execution time of this exercise may vary based on your laptop's performance.

</div>
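Before launching a search, it can help to check that every key in the grid matches a parameter of the pipeline and to count how many candidates will be evaluated. The sketch below assumes a pipeline whose steps are named `transformer` and `lg`, with an `imputer_num` entry inside the column transformer. These names are assumptions, so adapt them to your own pipeline. The grid mirrors the sample search space provided for this exercise.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline

# Hypothetical pipeline; the step names 'transformer' and 'lg' are assumptions
sketch_pipe = Pipeline([
    ("transformer", ColumnTransformer(
        [("imputer_num", SimpleImputer(strategy="median"), ["regular_bills"])],
        remainder="passthrough",
    )),
    ("lg", LogisticRegression()),
])

sketch_grid = {
    "transformer__imputer_num__strategy": ["median", "mean", "most_frequent"],
    "lg__tol": [0.1, 0.01, 0.001, 0.0001],
    "lg__class_weight": ["balanced", None],
    "lg__C": [0.8, 1.0, 1.2],
    "lg__solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
    "lg__max_iter": [100, 200, 300],
}

# Every grid key must be a valid parameter name of the pipeline
valid_names = sketch_pipe.get_params()
unknown_keys = [key for key in sketch_grid if key not in valid_names]
print(unknown_keys)

# 3 * 4 * 2 * 3 * 5 * 3 = 1080 candidates; with 5 folds that is 5400 fits
n_candidates = len(ParameterGrid(sketch_grid))
print(n_candidates, n_candidates * 5)
```

The number of fits grows multiplicatively with each added hyperparameter, which is why the execution time can be noticeable.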

In [None]:
param_grid = {'transformer__imputer_num__strategy' : ['median','mean','most_frequent'],
              'lg__tol' : [0.1, 0.01, 0.001, 0.0001],
              'lg__class_weight' : ['balanced', None],
              'lg__C' : [0.8, 1.0, 1.2],
              'lg__solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
              'lg__max_iter' : [100, 200, 300]
}

<FILL IN>

<div class="alert alert-info"><b>Exercise 1.6</b> 
    
Write code to print the best hyperparameters found. Then, verify that the best estimator is fitted with these hyperparameters. Finally, print the `classification_report` of the best estimator on the test set to estimate its generalization performance, and check whether we have achieved a better recall.

</div>
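The attributes involved can be sketched on synthetic data as follows. The dataset, the step names `scaler` and `lg`, and the small grid are assumptions for illustration only. After fitting, `best_params_` holds the winning combination and `best_estimator_` is a pipeline refitted with it on the whole training set (the default `refit=True` behaviour).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (assumption: not the mortgages dataset)
X_s, y_s = make_classification(
    n_samples=500, n_features=6, weights=[0.85, 0.15], random_state=1
)
X_a, X_b, y_a, y_b = train_test_split(X_s, y_s, stratify=y_s, random_state=1)

pipe = Pipeline([("scaler", StandardScaler()), ("lg", LogisticRegression())])
grid = {"lg__C": [0.1, 1.0], "lg__class_weight": [None, "balanced"]}

search = GridSearchCV(pipe, grid, scoring="recall", cv=3)
search.fit(X_a, y_a)

# Best combination found by cross-validation
print(search.best_params_)

# Check that the refitted estimator really carries those values
fitted_params = search.best_estimator_.get_params()
params_match = all(fitted_params[k] == v for k, v in search.best_params_.items())
print(params_match)

# Generalization estimate on the held-out test split
print(classification_report(y_b, search.best_estimator_.predict(X_b)))
```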

In [None]:
<FILL IN>

In [None]:
<FILL IN>

In [None]:
<FILL IN>