# What are we doing?

## Objectives

+ Construct a cross-validation pipeline.
+ Use cross-validation to evaluate different hyperparameter performance.
+ Perform grid search for systematic evaluation.
+ Store and manage results.

## Procedure

The diagram below, taken from scikit-learn's documentation, shows the procedure that we will follow:

![](../images/05_grid_search_workflow.png)


+ System requirements:
    
    - Automation: the system should operate automatically, with minimal supervision.
    - Replicability: changes to code and (arguably) data should be logged and controlled. Randomness should also be controlled (random seeds, etc.).
    - Persistence: persist results for later analysis.
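
The replicability and persistence requirements can be sketched with the standard library alone. The function `run_trial`, the file name `results.json`, and the record layout are hypothetical stand-ins, not part of the course code:

```python
import json
import random
import tempfile
from pathlib import Path


def run_trial(seed: int, n: int = 5) -> list:
    """Stand-in for a training run: draws n pseudo-random scores from a seeded generator."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return [round(rng.random(), 6) for _ in range(n)]


# Replicability: the same seed reproduces the same "results".
first = run_trial(seed=42)
second = run_trial(seed=42)
same_seed_matches = first == second

# Persistence: store the seed next to the scores so the run can be repeated later.
out_file = Path(tempfile.mkdtemp()) / "results.json"
out_file.write_text(json.dumps({"seed": 42, "scores": first}))

reloaded = json.loads(out_file.read_text())
print(same_seed_matches, reloaded["seed"], len(reloaded["scores"]))
```

Storing the seed together with the results is the design choice that matters here: a result without the seed that produced it cannot be reproduced.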


## What is a Hyperparameter?

+ Generally speaking, hyperparameters are parameters that control the learning process and are set before training rather than learned from the data: regularization weights, learning rate, split criteria (entropy or Gini), etc.
+ Hyperparameters drive the behaviour and performance of a model, so model selection is closely tied to hyperparameter tuning.
+ Selection criteria are based on performance evaluation and, to get more reliable performance estimates, we use cross-validation.
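
To make the cross-validation idea concrete, here is a minimal sketch of how k-fold splitting partitions the sample indices. The helper `k_fold_indices` is hypothetical and written only for illustration; in practice scikit-learn's `KFold` handles this (including shuffling):

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_idx, valid_idx) pairs; each sample is used for validation exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, valid_idx in folds:
    print("train:", train_idx, "valid:", valid_idx)
```

The model is trained and scored once per fold, and the per-fold scores are averaged. That average is the performance estimate used to compare hyperparameter values.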

## Searching the Hyperparameter Grid

+ To address the automation requirement, we can use `GridSearchCV`, a self-contained scikit-learn class that performs a grid search over a hyperparameter space.
+ To search the hyperparameter grid exhaustively means that we consider every possible combination of hyperparameter values in the search space and evaluate the model under each one. For example, if we explore two parameters, `kernel` (with values "rbf" and "poly") and `C` (with values 1.0 and 0.5), then the grid consists of the combinations:

    + (rbf, 1.0)
    + (rbf, 0.5)
    + (poly, 1.0)
    + (poly, 0.5)

+ Under each combination, we perform CV and evaluate the model's performance.
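
The grid above is simply the Cartesian product of the value lists. A minimal sketch with the standard library (the variable names `search_space` and `grid` are illustrative):

```python
from itertools import product

search_space = {
    "kernel": ["rbf", "poly"],
    "C": [1.0, 0.5],
}

# One dictionary per combination of values, in the order the lists are given.
names = list(search_space)
grid = [dict(zip(names, values)) for values in product(*search_space.values())]

for combo in grid:
    print(combo)
```

The number of candidates is the product of the list lengths (here 2 x 2 = 4), and each candidate is fit once per fold, so a 5-fold search over this grid would run 20 fits.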

# Setup

We start with the [Give Me Some Credit](https://www.kaggle.com/c/GiveMeSomeCredit) data that we used in the previous session.

In [1]:
%load_ext dotenv
%dotenv 
import os
import sys
sys.path.append(os.path.join('../',os.getenv('SRC_DIR')))
import pandas as pd
import numpy as np

data_file = os.path.join('../', os.getenv('CREDIT_DATA'))
credit_dt = pd.read_csv( data_file )

In [2]:
df = credit_dt.drop(columns = ["Unnamed: 0"]).rename( columns = {
       'SeriousDlqin2yrs': 'delinq_2y',
       'RevolvingUtilizationOfUnsecuredLines': 'revolv_util_unsec',        
       'NumberOfTime30-59DaysPastDueNotWorse': 'num_30_59_days_later',
       'DebtRatio': 'debt_ratio',
       'MonthlyIncome': 'month_inc',
       'NumberOfOpenCreditLinesAndLoans': 'num_open_credit',
       'NumberOfTimes90DaysLate': 'num_90_days_late',
       'NumberRealEstateLoansOrLines': 'num_real_estate_loans', 
       'NumberOfTime60-89DaysPastDueNotWorse': 'num_60-89_days_late',
       'NumberOfDependents': 'num_depend'
}).assign( high_debt_ratio = lambda x: (x['debt_ratio'] > 1) *1,
          miss_month_inc = lambda x: x['month_inc'].isna() *1,
          miss_num_depend = lambda x: x['num_depend'].isna() *1)
df

Unnamed: 0,delinq_2y,revolv_util_unsec,age,num_30_59_days_later,debt_ratio,month_inc,num_open_credit,num_90_days_late,num_real_estate_loans,num_60-89_days_late,num_depend,high_debt_ratio,miss_month_inc,miss_num_depend
0,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0,0,0,0
1,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0,0,0,0
2,0,0.658180,38,1,0.085113,3042.0,2,1,0,0,0.0,0,0,0
3,0,0.233810,30,0,0.036050,3300.0,5,0,0,0,0.0,0,0,0
4,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149995,0,0.040674,74,0,0.225131,2100.0,4,0,1,0,0.0,0,0,0
149996,0,0.299745,44,0,0.716562,5584.0,4,0,1,0,2.0,0,0,0
149997,0,0.246044,58,0,3870.000000,,18,0,1,0,0.0,1,1,0
149998,0,0.000000,30,0,0.000000,5716.0,4,0,0,0,0.0,0,0,0


Use a simple pipeline composed of:

+ Preprocessing steps.
+ Logistic Regression classifier.

We will explore the hyperparameter space by evaluating different regularization strategies and parameters.

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

In [5]:
num_cols = ['revolv_util_unsec', 
         'age', 
         'num_30_59_days_later',
         'debt_ratio', 
         'month_inc', 
         'num_open_credit', 
         'num_90_days_late',
         'num_real_estate_loans', 
         'num_60-89_days_late', 
         'num_depend']

pipe_simple_num = Pipeline([
    ('imputer',SimpleImputer(strategy='mean')),
    ('standartiz', StandardScaler())
])

col_transf = ColumnTransformer([
    ('col_transform', pipe_simple_num, num_cols)
], remainder='passthrough')

pipe_lr = Pipeline([
    ('preproc', col_transf),
    ('model', LogisticRegression())
])

pipe_lr.get_params()

{'memory': None,
 'steps': [('preproc',
   ColumnTransformer(remainder='passthrough',
                     transformers=[('col_transform',
                                    Pipeline(steps=[('imputer', SimpleImputer()),
                                                    ('standartiz',
                                                     StandardScaler())]),
                                    ['revolv_util_unsec', 'age',
                                     'num_30_59_days_later', 'debt_ratio',
                                     'month_inc', 'num_open_credit',
                                     'num_90_days_late', 'num_real_estate_loans',
                                     'num_60-89_days_late', 'num_depend'])])),
  ('model', LogisticRegression())],
 'verbose': False,
 'preproc': ColumnTransformer(remainder='passthrough',
                   transformers=[('col_transform',
                                  Pipeline(steps=[('imputer', SimpleImputer()),
                           

Obtain the parameters of the pipeline with `.get_params()`.
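
The names returned by `.get_params()` are the ones a parameter grid must use. A small sketch on a toy pipeline (`toy_pipe` and its step names are assumptions made for this example, not the pipeline defined above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# Nested parameters are addressed as <step name>__<parameter name>.
model_params = sorted(k for k in toy_pipe.get_params() if k.startswith("model__"))
print(model_params)

# set_params accepts the same names, which is what a grid search relies on.
toy_pipe.set_params(model__C=0.5)
print(toy_pipe.get_params()["model__C"])
```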

## Set Up the Splitting Strategy

In [6]:
X = df.drop(columns = 'delinq_2y')
Y = df['delinq_2y']

X_train, X_test, Y_train, Y_test = train_test_split( X, Y, test_size=0.2, random_state=0)

scoring = ['neg_log_loss', 'roc_auc', 'f1', 'accuracy', 'precision', 'recall']

To perform the Grid Search we need to define a parameter grid:

- A parameter grid defines all of the combinations of parameters that we need to explore.
- The class `GridSearchCV` performs an exhaustive search over the parameter combinations.
- The parameter grid is defined as a dictionary of lists:

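    * Each entry's key is the name of the parameter. For a step inside a pipeline, the name is prefixed with the step name and a double underscore, as in `model__C`.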
    * Each entry's value is the list of values that we would like to explore.

In [18]:
param_grid_lr = {
    'model__C': [0.01, 0.5, 0.9],
    'model__penalty': ['l1', 'l2'],
    'model__solver': ['sag']
}
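
A grid can also be written as a list of dictionaries, in which case each dictionary is expanded on its own. This is useful when some values are only valid together, as with `solver` and `penalty` in `LogisticRegression`: `sag` handles only the `l2` penalty, while `liblinear` handles both `l1` and `l2`. The sketch below uses `ParameterGrid` to list the candidates; the name `param_grid_compatible` and the choice of `liblinear` are assumptions for illustration, not the grid used in this session:

```python
from sklearn.model_selection import ParameterGrid

# Assumed pairing for illustration: `sag` with l2 only, `liblinear` with l1 or l2.
param_grid_compatible = [
    {
        "model__solver": ["sag"],
        "model__penalty": ["l2"],
        "model__C": [0.01, 0.5, 0.9],
    },
    {
        "model__solver": ["liblinear"],
        "model__penalty": ["l1", "l2"],
        "model__C": [0.01, 0.5, 0.9],
    },
]

# ParameterGrid expands each dictionary separately and concatenates the results.
candidates = list(ParameterGrid(param_grid_compatible))
print(len(candidates))
for params in candidates:
    print(params)
```

A list like this can be passed to `GridSearchCV` as `param_grid` in place of a single dictionary.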

Some key inputs to [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) are:

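+ `estimator`: the pipeline or classifier that we are tuning.
+ `param_grid`: the parameter grid, a dictionary of lists as described above.
+ `scoring`: the metric, or list of metrics, computed on each validation fold.
+ `cv`: the cross-validation strategy; an integer sets the number of folds.
+ `n_jobs`: the number of jobs to run in parallel (`-1` uses all processors).
+ `refit`: whether to refit the best-performing configuration on the whole training set; with several metrics, it names the metric used to select the best configuration.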

In [19]:
perf_grid_cv = GridSearchCV(
    estimator = pipe_lr,
    param_grid = param_grid_lr,
    scoring = scoring,
    cv = 5,
    refit = "neg_log_loss")

perf_grid_cv.fit(X_train, Y_train)

15 fits failed out of a total of 30.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\Igor\.conda\envs\dsi_participant\lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\Igor\.conda\envs\dsi_participant\lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "c:\Users\Igor\.conda\envs\dsi_participant\lib\site-packages\sklearn\pipeline.py", line 475, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "c:\Users\Igor\.conda\envs\dsi_participant\lib\site-packages\sk

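The 15 failed fits are the three `l1` candidates (five folds each): the `sag` solver does not support the `l1` penalty, so their scores are set to `nan` and they share the last rank in the results below.

Access the cross-validation results using the attribute `.cv_results_`: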

In [20]:
results = perf_grid_cv.cv_results_
results = pd.DataFrame(results)
results.columns

results[['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time',
       'param_model__C', 'param_model__penalty', 'param_model__solver',
       'params', 'mean_test_neg_log_loss',
       'std_test_neg_log_loss', 'rank_test_neg_log_loss', 'mean_test_roc_auc',
       'std_test_roc_auc', 'rank_test_roc_auc', 'mean_test_accuracy', 
       'std_test_accuracy', 'rank_test_accuracy']].sort_values('rank_test_neg_log_loss')

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__C,param_model__penalty,param_model__solver,params,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
5,4.053498,0.183764,0.088337,0.007439,0.9,l2,sag,"{'model__C': 0.9, 'model__penalty': 'l2', 'mod...",-0.226722,0.001667,1,0.700515,0.005965,1,0.933475,0.000217,2
3,3.846732,0.830224,0.103631,0.037312,0.5,l2,sag,"{'model__C': 0.5, 'model__penalty': 'l2', 'mod...",-0.226736,0.001665,2,0.700496,0.005973,2,0.933467,0.00022,3
1,3.821815,0.211567,0.091279,0.004554,0.01,l2,sag,"{'model__C': 0.01, 'model__penalty': 'l2', 'mo...",-0.228387,0.00136,3,0.697085,0.006059,3,0.933525,0.000251,1
0,0.073645,0.006384,0.0,0.0,0.01,l1,sag,"{'model__C': 0.01, 'model__penalty': 'l1', 'mo...",,,4,,,4,,,4
2,0.071506,0.00641,0.0,0.0,0.5,l1,sag,"{'model__C': 0.5, 'model__penalty': 'l1', 'mod...",,,4,,,4,,,4
4,0.070439,0.005923,0.0,0.0,0.9,l1,sag,"{'model__C': 0.9, 'model__penalty': 'l1', 'mod...",,,4,,,4,,,4
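
The same inspection can be tried end to end on a small synthetic data set. This is a sketch under stated assumptions: `make_classification` stands in for the credit data, and the names `toy_search`, `toy_results`, and `summary` are hypothetical:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the credit data.
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)

toy_search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.5, 0.9]},
    scoring=["neg_log_loss", "roc_auc"],
    cv=5,
    refit="neg_log_loss",  # with several metrics, refit names the one used for selection
)
toy_search.fit(X_toy, y_toy)

# cv_results_ is a dictionary of arrays, one entry per candidate.
toy_results = pd.DataFrame(toy_search.cv_results_)
summary = (
    toy_results[["param_C", "mean_test_neg_log_loss", "mean_test_roc_auc",
                 "rank_test_neg_log_loss"]]
    .sort_values("rank_test_neg_log_loss")
    .reset_index(drop=True)
)
print(summary)
print(toy_search.best_params_)
```

Each metric in `scoring` gets its own `mean_test_*`, `std_test_*`, and `rank_test_*` columns, so candidates can be ranked by any of them after the fact.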


Access the best-performing configuration:

In [15]:
perf_grid_cv.best_params_

{'model__C': 0.9, 'model__penalty': 'l2', 'model__solver': 'sag'}

In [23]:
pipe_lr.set_params(**perf_grid_cv.best_params_)

The best-performing classifier (pipeline), refit on the complete training set, is available as `.best_estimator_`:

In [25]:
best_pipe = perf_grid_cv.best_estimator_
best_pipe
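
Because the best estimator has already been refit on the whole training set, it can be scored directly on a held-out test split. A minimal sketch on synthetic data (the data and the names `toy_search`, `best_toy`, and `test_auc` are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the credit data, split the same way as above.
X_toy, y_toy = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

toy_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.5, 0.9]},
    scoring="roc_auc",
    cv=5,
)  # refit=True by default, so best_estimator_ is trained on all of X_tr
toy_search.fit(X_tr, y_tr)

best_toy = toy_search.best_estimator_
test_auc = roc_auc_score(y_te, best_toy.predict_proba(X_te)[:, 1])

# Cross-validated score of the winner versus its score on unseen data.
print(round(toy_search.best_score_, 3), round(test_auc, 3))
```

The test split plays no part in the search, so `test_auc` is an estimate that was not used to pick the hyperparameters.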

# Tracking GridSearchCV Experiments

+ We can expand our infrastructure for hyperparameter tuning across various models.
+ The plan:

    - Create a model ingredient to obtain the classifier object.
    - Create experiment parameter grids in JSON files to keep them organized.
    - Schedule the experiments.
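
Keeping parameter grids in JSON could look like the sketch below. The file name `param_grids.json`, its layout (one grid per model name), and the helper `load_param_grid` are hypothetical and may differ from the files in `./05_src/`:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical file content: one grid per model name.
grids = {
    "LogisticRegression": {"model__C": [0.01, 0.5, 0.9], "model__penalty": ["l2"]},
}

# Written to a temporary directory so the sketch is self-contained.
grid_file = Path(tempfile.mkdtemp()) / "param_grids.json"
grid_file.write_text(json.dumps(grids, indent=2))


def load_param_grid(path: Path, model_name: str) -> dict:
    """Read the JSON file and return the grid stored under model_name."""
    with open(path) as fh:
        return json.load(fh)[model_name]


param_grid = load_param_grid(grid_file, "LogisticRegression")
print(param_grid)
```

JSON holds only strings, numbers, booleans, nulls, lists, and objects, so this approach suits grids of plain values; grids that contain Python objects would need another format.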


## The Design

<div>
<img src="../images/05_experiment_setup.png" width="75%">
</div>

Explore the code in `./05_src/credit_experiment.py` and `./05_src/credit_model_ingredient.py`:

+ `credit_model_ingredient.py` implements a function that returns a model given a string. This way, we can parametrize models in the experiment.
+ `credit_experiment.py` is a modularized version of our previous file, `credit_experiment_nb.py`, which only worked with a Naive Bayes classifier.
+ The experiment is now further *modularized*: there are ingredients for most components and it can be broken down even more depending on the evolution of the model.
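
The idea of returning a model from a string can be sketched with a small registry. This is not the content of `credit_model_ingredient.py`: the names `MODEL_REGISTRY` and `get_model`, the registry keys, and the use of `GaussianNB` as the Naive Bayes variant are all assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Hypothetical registry: maps a configuration string to a classifier factory.
MODEL_REGISTRY = {
    "LogisticRegression": lambda: LogisticRegression(max_iter=1000),
    "NaiveBayes": lambda: GaussianNB(),
}


def get_model(name: str):
    """Return a fresh, unfitted classifier for the given name."""
    try:
        return MODEL_REGISTRY[name]()
    except KeyError:
        known = ", ".join(sorted(MODEL_REGISTRY))
        raise ValueError(f"Unknown model '{name}'. Known models: {known}") from None


clf = get_model("LogisticRegression")
print(type(clf).__name__)
```

Using factories rather than shared instances means every experiment run receives a new, unfitted object, so state cannot leak between runs.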

## Running Experiments from the Command Line

Access the experiment through the [Command Line Interface](https://sacred.readthedocs.io/en/stable/command_line.html).

```
cd src  # if required
python credit_experiment.py
```

We can also change the parameters of the experiment. For instance, using the same code, we can run an experiment with a logistic regression classifier using a basic (not power) preprocessing pipeline:

```
python .\credit_experiment.py with 'preprocessing="basic"' 'model="LogisticRegression"'
```

# A Few Notes About Sacred

+ Sacred is a powerful tool, but it is only a starting point.
+ Sacred is useful for keeping track of experiments within a limited scope: it is not a project management tool.
+ It works well in SQL environments, but handling hyperparameters can be painful.
+ Its natural backend is MongoDB; however, not all workplaces have a running instance.


## Experiment Schema

The database schema implemented by Sacred is shown below. The schema is a useful representation of the code and setup of an experiment. The package offers a [metrics API](https://sacred.readthedocs.io/en/stable/collected_information.html#metrics-api), but we have decided to extend the framework with a few ad hoc tables that store performance metrics.

The database backend is a database like any other: you can query it with Python, R, or PowerBI.

+ The server runs on localhost, port 5432.
+ The user and password are in the `.env` file in `./05_src/db/`.

<div>
<img src="../images/05_sacred_sql_schema.png" width="40%">
</div>