# Test Classification Using HistGBT

This notebook tests the use of `HistGBT` for classification on test data generated by [mk_test_data.py](./mk_test_data.py).


## Pre-amble

The following code cells import the required libraries and set up the notebook.

In [None]:
import os
#TEST_REGISTRATION = os.environ.get('test_registration', False)

In [None]:
import sys
sys.path.append('../')

In [None]:
# Jupyter notebook Specific imports
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

# Imports injecting into namespace
from tqdm.auto import tqdm
tqdm.pandas()

# General imports
import os
import json
import pickle
from pathlib import Path

import pandas as pd
import numpy as np
from getpass import getpass
import argparse

from sklearn.preprocessing import StandardScaler
from sklearn.exceptions import NotFittedError

from lightsaber import constants as C
import lightsaber.data_utils.utils as du
from lightsaber.data_utils.pt_dataset import (filter_preprocessor)
from lightsaber.data_utils import sk_dataloader as skd
from lightsaber.trainers import sk_trainer as skr

from sklearn.ensemble import HistGradientBoostingClassifier

import logging
log = logging.getLogger()

In [None]:
import io

data_dir = Path('./data')
assert data_dir.is_dir()

expt_conf = f"""
tgt_col: treatment

idx_cols: 
    - id
    - time
time_order_col: 
    - time_history

feat_cols: 
    - prev_cov1
    - prev_treat

train:
    tgt_file: '{data_dir}/easiest_sim_shifted_TGT_train.csv'
    feat_file: '{data_dir}/easiest_sim_shifted_FEAT_train.csv'

val:
    tgt_file: '{data_dir}/easiest_sim_shifted_TGT_val.csv'
    feat_file: '{data_dir}/easiest_sim_shifted_FEAT_val.csv'
    
test:
    tgt_file: '{data_dir}/easiest_sim_shifted_TGT_test.csv'
    feat_file: '{data_dir}/easiest_sim_shifted_FEAT_test.csv'

category_map:
    prev_treat: [0, 1]
    
numerical: 
    - prev_cov1

normal_values:
    prev_cov1: 0.
    prev_treat: 0
"""
expt_conf = du.yaml.load(io.StringIO(expt_conf), Loader=du._Loader)
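
As a rough illustration of what the parsed configuration looks like, the sketch below loads a shortened copy of the YAML with `yaml.safe_load` from PyYAML. This stands in for `du.yaml.load(..., Loader=du._Loader)`; that both yield the same nested dictionary is an assumption, not something this notebook confirms.

```python
import io
from pathlib import Path

import yaml  # PyYAML, used here in place of the lightsaber loader (assumption)

data_dir = Path('./data')

# Shortened copy of the experiment configuration, for illustration only
conf_text = f"""
tgt_col: treatment
idx_cols:
    - id
    - time
feat_cols:
    - prev_cov1
    - prev_treat
train:
    tgt_file: '{data_dir}/easiest_sim_shifted_TGT_train.csv'
category_map:
    prev_treat: [0, 1]
"""

conf = yaml.safe_load(io.StringIO(conf_text))

# Top-level keys become dictionary keys; YAML lists become Python lists
print(conf['tgt_col'])
print(conf['idx_cols'])
print(conf['train']['tgt_file'])
print(conf['category_map']['prev_treat'])
```

Because the configuration is an f-string, `{data_dir}` is substituted before the YAML is parsed, so the file paths in the resulting dictionary are plain strings.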

## Model Training

In general, a user needs to follow these steps to train a `HistGBT` classification model:

* _Data Ingestion_: The first step sets up the pre-processors used to train a classification model. In this example, we flatten the temporal data using pre-defined pipelines and then apply `StandardScaler` from `scikit-learn` through filters defined within `lightsaber`.

  - We then read the train, validation, and test datasets. In some cases, users may also want to define a calibration dataset.

* _Model Definition_: Next, we define a base model for classification. In this example, we use the `HistGradientBoostingClassifier` from `scikit-learn` (referred to here as `HistGBT`).

* _Model Training_: Once the model is defined, we can use `lightsaber` to train it via the pre-packaged `SKModel` and the corresponding trainer code. This step also generates the relevant `metrics` for this problem.

  - We also show how to train a single hyper-parameter setting as well as a grid search over a pre-specified hyper-parameter space.



### Data ingestion

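We first read the extracted cohort data and apply a `StandardScaler`, demonstrating the proper usage of a pre-processor: it is fit on the training data (`refit=True`) and reused unchanged on the validation and test data (`refit=False`).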

In [None]:
flatten = 'mean'
preprocessor = StandardScaler()
train_filter = [filter_preprocessor(cols=expt_conf['numerical'], 
                                    preprocessor=preprocessor,
                                    refit=True),
               ]

train_dataloader = skd.SKDataLoader(tgt_file=expt_conf['train']['tgt_file'],
                                    feat_file=expt_conf['train']['feat_file'],
                                    idx_col=expt_conf['idx_cols'],
                                    tgt_col=expt_conf['tgt_col'],
                                    feat_columns=expt_conf['feat_cols'],
                                    time_order_col=expt_conf['time_order_col'],
                                    category_map=expt_conf['category_map'],
                                    filter=train_filter,
                                    fill_value=expt_conf['normal_values'],
                                    flatten=flatten,
                                   )
print(train_dataloader.shape, len(train_dataloader))

# For other datasets use fitted preprocessors
fitted_filter = [filter_preprocessor(cols=expt_conf['numerical'], 
                                     preprocessor=preprocessor, refit=False),
                 ]
val_dataloader = skd.SKDataLoader(tgt_file=expt_conf['val']['tgt_file'],
                                  feat_file=expt_conf['val']['feat_file'],
                                  idx_col=expt_conf['idx_cols'],
                                  tgt_col=expt_conf['tgt_col'],
                                  feat_columns=expt_conf['feat_cols'],
                                  time_order_col=expt_conf['time_order_col'],
                                  category_map=expt_conf['category_map'],
                                  filter=fitted_filter,
                                  fill_value=expt_conf['normal_values'],
                                  flatten=flatten,
                                )

test_dataloader = skd.SKDataLoader(tgt_file=expt_conf['test']['tgt_file'],
                                  feat_file=expt_conf['test']['feat_file'],
                                  idx_col=expt_conf['idx_cols'],
                                  tgt_col=expt_conf['tgt_col'],
                                  feat_columns=expt_conf['feat_cols'],
                                  time_order_col=expt_conf['time_order_col'],
                                  category_map=expt_conf['category_map'],
                                  filter=fitted_filter,
                                  fill_value=expt_conf['normal_values'],
                                  flatten=flatten,
                                )

print(val_dataloader.shape, len(val_dataloader))
print(test_dataloader.shape, len(test_dataloader))
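
To make the `refit=True` / `refit=False` distinction concrete, here is a minimal sketch in plain pandas and scikit-learn on a toy frame. It assumes that `flatten='mean'` averages each `id`'s rows over time and that `refit=False` reuses the statistics learned on the training data. Both are assumptions about `lightsaber` internals, the order of flattening and scaling is a choice made for this sketch, and the toy values are invented.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy temporal data: two time steps per id (values are invented)
train = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'time': [0, 1, 0, 1],
    'prev_cov1': [1.0, 3.0, 5.0, 7.0],
})
val = pd.DataFrame({
    'id': [3, 3],
    'time': [0, 1],
    'prev_cov1': [2.0, 4.0],
})

# Assumed meaning of flatten='mean': one row per id, averaged over time
train_flat = train.groupby('id')[['prev_cov1']].mean()
val_flat = val.groupby('id')[['prev_cov1']].mean()

# Counterpart of refit=True: learn the scaling statistics on the training split
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_flat)

# Counterpart of refit=False: reuse those statistics, without fitting again
val_scaled = scaler.transform(val_flat)

print(scaler.mean_)
print(train_scaled.ravel())
print(val_scaled.ravel())
```

Fitting the scaler again on the validation or test split would leak information about those splits into the pre-processing, which is why the fitted instance is reused.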

### Training a single model

#### Model definition

We can define a base classification model using the standard `scikit-learn` workflow, as below:

In [None]:
model_name = 'HistGBT'
hparams = argparse.Namespace(learning_rate=0.01,
                             max_iter=100,
                             l2_regularization=0.01
                             )

base_model = HistGradientBoostingClassifier(learning_rate=hparams.learning_rate, 
                                            l2_regularization=hparams.l2_regularization, 
                                            max_iter=hparams.max_iter)

wrapped_model = skr.SKModel(base_model, hparams, name=model_name)

#### Model training with in-built model tracking and evaluation

In [None]:
mlflow_conf = dict(experiment_name='classifier_test')
artifacts = dict(preprocessor=preprocessor)
experiment_tags = dict(model=model_name, 
                       tune=False)

(run_id, metrics, 
 val_y, val_yhat, val_pred_proba, 
 test_y, test_yhat, test_pred_proba) = skr.run_training_with_mlflow(mlflow_conf, 
                                                                    wrapped_model,
                                                                    train_dataloader=train_dataloader,
                                                                    val_dataloader=val_dataloader,
                                                                    test_dataloader=test_dataloader,
                                                                    artifacts=artifacts,
                                                                    **experiment_tags)

print(f"MLFlow Experiment: {mlflow_conf['experiment_name']} \t | Run ID: {run_id}")
print(metrics)
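
The returned labels and scores can be checked independently of the logged `metrics`. The sketch below shows one way to do that with `sklearn.metrics` on invented stand-in arrays. It assumes the probabilities have one column per class, with the positive class in the second column. The actual shape of `val_pred_proba` and the set of metrics that `lightsaber` reports are not stated in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score

# Invented stand-ins for val_y and val_pred_proba
val_y = np.array([0, 0, 1, 1, 0, 1])
val_pred_proba = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.30, 0.70],
    [0.20, 0.80],
    [0.45, 0.55],
    [0.35, 0.65],
])

# Score of the positive class (assumed to be the second column)
pos_score = val_pred_proba[:, 1]

# Hard predictions at a 0.5 threshold
val_yhat = (pos_score >= 0.5).astype(int)

summary = {
    'accuracy': accuracy_score(val_y, val_yhat),
    'roc_auc': roc_auc_score(val_y, pos_score),
    'average_precision': average_precision_score(val_y, pos_score),
}
print(summary)
```

Ranking metrics such as ROC AUC use the scores directly, while accuracy depends on the chosen threshold, so the two can disagree on the same predictions.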

### Hyper-parameter Search

`lightsaber` also supports hyper-parameter search to find the best model w.r.t. a pre-defined metric, following the same steps as above.

To conduct a grid search, we follow two steps:

* We define a grid `h_search` over the model parameter space.
* We pass an experiment tag `tune` set to `True`, along with the grid `h_search`, to the trainer code.

In [None]:
model_name = 'HistGBT'
hparams = argparse.Namespace(learning_rate=0.01,
                             max_iter=100,
                             l2_regularization=0.01
                             )
h_search = dict(
    learning_rate=[0.01, 0.1, 0.02],
    max_iter=[50, 100]
)

base_model = HistGradientBoostingClassifier(**vars(hparams))

wrapped_model = skr.SKModel(base_model, hparams, name=model_name)

In [None]:
mlflow_conf = dict(experiment_name='classifier_test')
artifacts = dict(preprocessor=preprocessor)
experiment_tags = dict(model=model_name, 
                       tune=True)

(run_id, metrics, 
 val_y, val_yhat, val_pred_proba, 
 test_y, test_yhat, test_pred_proba) = skr.run_training_with_mlflow(mlflow_conf, 
                                                                    wrapped_model,
                                                                    train_dataloader=train_dataloader,
                                                                    val_dataloader=val_dataloader,
                                                                    test_dataloader=test_dataloader,
                                                                    artifacts=artifacts,
                                                                    h_search=h_search,
                                                                    **experiment_tags)

print(f"MLFlow Experiment: {mlflow_conf['experiment_name']} \t | Run ID: {run_id}")
print(metrics)