## Training the Machine Learning Models 

### Primary Goal: Train an accurate machine learning model for each individual severe weather hazard.
### Secondary Goal: Develop the models so that they outperform the baseline models.

In this notebook, I'll provide a brief tutorial on how to train a machine learning model. It is not only helpful but crucial to have a simpler baseline model against which to evaluate the skill of the machine learning model.
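To make "skill relative to a baseline" concrete, here is a minimal sketch of a Brier skill score (BSS) computed on synthetic data. It uses scikit-learn's `brier_score_loss` and is not the `brier_skill_score` function from `ml_workflow`, whose implementation may differ. The helper name `bss_vs_baseline`, the constant climatology baseline, and the toy forecasts are all assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import brier_score_loss


def bss_vs_baseline(y_true, model_probs, baseline_probs):
    """Hypothetical helper: Brier skill score of a model relative to a baseline.

    BSS = 1 - BS_model / BS_baseline. Positive values mean the model
    outperforms the baseline; zero means no improvement.
    """
    bs_model = brier_score_loss(y_true, model_probs)
    bs_baseline = brier_score_loss(y_true, baseline_probs)
    return 1.0 - bs_model / bs_baseline


# Synthetic, illustrative data (not from the severe weather dataset).
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.1).astype(int)           # roughly 10% events
baseline_probs = np.full(y_true.shape, y_true.mean())   # constant climatology forecast
model_probs = np.clip(0.7 * y_true + 0.05 + 0.1 * rng.random(1000), 0, 1)  # toy forecast

bss = bss_vs_baseline(y_true, model_probs, baseline_probs)
print(f"BSS relative to climatology: {bss:.3f}")
```

A model that only matches the baseline scores zero, which is why a baseline is needed before any skill number can be interpreted.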

In [1]:
# Import packages 
import pandas as pd
import numpy as np

# Plotting code imports 
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()

# We add the GitHub repo to our system path so we can import Python scripts from that repo.
import sys
sys.path.append('/home/monte.flora/python_packages/2to6_hr_severe_wx/')
from main.io import load_ml_data

from master.ml_workflow.ml_workflow.calibrated_pipeline_hyperopt_cv import CalibratedPipelineHyperOptCV

from os.path import join
from ml_workflow.ml_workflow.ml_methods import norm_aupdc, brier_skill_score
from sklearn.metrics import roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from hyperopt import hp
from glob import glob

In [2]:
# Configuration variables (you'll need to change these based on where you store your data)
base_path = '/work/mflora/ML_2TO6HR/data'

In [3]:
X, y, metadata = load_ml_data(base_path=base_path, 
                            mode='train', 
                            target_col='hail_severe__36km')

## CalibratedPipelineHyperOptCV

This is a scikit-learn style model I've developed. As the name suggests, it handles the following: 
#### 1. Creating an ML pipeline. 
Data often requires pre-processing, such as imputing missing values, scaling the features, or re-sampling to fix class imbalances, prior to fitting the ML model. CalibratedPipelineHyperOptCV has 3 options for the pipeline: `scaler`, `imputer`, and `resample`. You'll likely keep the options at `scaler = 'standard'`, `imputer='simple'`, and `resample='under'` for the REU, but feel free to explore different options! 
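The three pre-processing steps can be sketched with scikit-learn and NumPy. This is not the internals of CalibratedPipelineHyperOptCV. Mapping `imputer='simple'` to `SimpleImputer`, `scaler='standard'` to `StandardScaler`, and `resample='under'` to random under-sampling is an assumption, and `random_undersample` is a hypothetical helper written for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def random_undersample(X, y, rng):
    """Hypothetical helper: randomly drop majority-class rows until classes are balanced."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_keep = min(len(pos_idx), len(neg_idx))
    keep = np.concatenate([
        rng.choice(pos_idx, size=n_keep, replace=False),
        rng.choice(neg_idx, size=n_keep, replace=False),
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]


# Synthetic, imbalanced data with a few missing values (illustrative only).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(2000, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=2000) > 1.5).astype(int)
X_demo[rng.random(X_demo.shape) < 0.02] = np.nan

# Re-sample the training data only, then impute and scale inside the pipeline.
X_bal, y_bal = random_undersample(X_demo, y_demo, rng)

pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),   # fill missing values with the column mean
    StandardScaler(),                 # zero mean, unit variance per feature
    LogisticRegression(max_iter=300),
)
pipeline.fit(X_bal, y_bal)

print(f"Event frequency before: {y_demo.mean():.3f}, after: {y_bal.mean():.3f}")
```

Note that the under-sampling is applied only to the data used for fitting. Predictions are still made on the original, imbalanced distribution, which is one reason calibration matters afterwards.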
 
#### 2. Performing Hyperparameter Optimization 
Many machine learning models have tunable knobs, known as hyperparameters. For example, in a random forest we can set the number of trees. How do we know we have chosen the best-performing options? The simplest, but most time-consuming, approach is brute force: we try different hyperparameter combinations and evaluate each one on a validation dataset. CalibratedPipelineHyperOptCV uses the [hyperopt](http://hyperopt.github.io/hyperopt/) package and cross-validation to determine the best-performing hyperparameters. For the hyperparameter optimization, you set the number of iterations with `max_iter`. 

To use the hyperparameter optimization, you'll need to set a parameter grid (`param_grid`). An example grid for logistic regression is provided below. Feel free to ask questions about setting up a grid for other models. 
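For comparison, the brute-force strategy described above can be sketched with plain scikit-learn: loop over every combination in a small grid, score each with cross-validation, and keep the best. This is a stand-in for illustration, not what CalibratedPipelineHyperOptCV does internally. The grid values, the synthetic data, and the choice of AUC as the score are assumptions.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# A small, hypothetical grid: 3 x 3 = 9 combinations.
grid = {"C": [0.01, 0.1, 1.0], "l1_ratio": [0.1, 0.5, 1.0]}

results = []
for C, l1_ratio in product(grid["C"], grid["l1_ratio"]):
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(solver="saga", penalty="elasticnet", C=C,
                           l1_ratio=l1_ratio, max_iter=2000, random_state=0),
    )
    # 5-fold cross-validated AUC for this hyperparameter combination.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc")
    results.append({"C": C, "l1_ratio": l1_ratio, "mean_auc": scores.mean()})

best = max(results, key=lambda r: r["mean_auc"])
print(best)
```

The cost of this loop grows with the product of the grid sizes, which is why a search capped at a fixed number of iterations is attractive for larger grids.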


#### 3. Calibration 
Machine learning models, especially those trained on resampled data, tend to be uncalibrated. We perform cross-validation-based calibration using isotonic regression. 
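As a rough illustration of cross-validation-based isotonic calibration, the sketch below uses scikit-learn's `CalibratedClassifierCV` on synthetic data. It is an analogue, not the calibration code inside CalibratedPipelineHyperOptCV. Here `class_weight="balanced"` stands in for re-sampling as the source of inflated probabilities, which is an assumption made to keep the example short.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% events (illustrative only).
X_demo, y_demo = make_classification(n_samples=3000, n_features=8,
                                     weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

# Re-weighting the classes tends to inflate the predicted probabilities.
base = LogisticRegression(class_weight="balanced", max_iter=1000)

# Isotonic regression is fit on held-out folds of the training data.
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)

base.fit(X_train, y_train)
calibrated.fit(X_train, y_train)

raw_mean = base.predict_proba(X_test)[:, 1].mean()
cal_mean = calibrated.predict_proba(X_test)[:, 1].mean()

print(f"Observed event frequency:    {y_test.mean():.3f}")
print(f"Mean raw probability:        {raw_mean:.3f}")
print(f"Mean calibrated probability: {cal_mean:.3f}")
```

Comparing the mean forecast probability with the observed event frequency is only a coarse check. A reliability diagram gives a fuller picture.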
 


In [4]:
# Uncomment this and run it to see the different CalibratedPipelineHyperOptCV options. 
#help(CalibratedPipelineHyperOptCV)

## Example

This example shows how to train a simple logistic regression model with elastic-net regularization using the CalibratedPipelineHyperOptCV class. Since logistic regression requires the different features to have similar scales, we set `scaler = 'standard'`. There is significant class imbalance (most of the examples are non-severe), so we set `resample = 'under'`. 

For the cross-validation argument (`cv_kwargs`), we pass in the training dates (`dates`) along with the number of splits (`n_splits`) and the validation size (`valid_size`).
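One plausible reason to hand dates to the cross-validation is to keep all examples from the same run date on the same side of each split, so the validation fold never shares a date with the training fold. The sketch below shows that idea with scikit-learn's `GroupKFold` on synthetic dates. How CalibratedPipelineHyperOptCV actually uses `dates`, `n_splits`, and `valid_size` is not shown in this notebook, so treat this as an assumption-laden analogue.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Synthetic run dates: 20 dates with 50 examples each (illustrative only).
unique_dates = pd.date_range("2019-05-01", periods=20).strftime("%Y%m%d").to_numpy()
dates = np.repeat(unique_dates, 50)

X_demo = np.random.default_rng(0).normal(size=(len(dates), 3))
y_demo = np.zeros(len(dates), dtype=int)

cv = GroupKFold(n_splits=5)
shared_dates = []
for fold, (train_idx, valid_idx) in enumerate(cv.split(X_demo, y_demo, groups=dates)):
    train_dates_fold = set(dates[train_idx])
    valid_dates_fold = set(dates[valid_idx])
    # Record any date that appears on both sides of the split (there should be none).
    shared_dates.append(train_dates_fold & valid_dates_fold)
    print(f"Fold {fold}: {len(train_dates_fold)} training dates, "
          f"{len(valid_dates_fold)} validation dates")
```

Splitting by date rather than by row avoids scoring the model on examples that are nearly identical to ones it was trained on.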

In [None]:
# Pipeline options: standardize the features and under-sample the majority (non-severe) class.
scaler = 'standard'
resample = 'under'

# Logistic regression with an elastic-net penalty (requires the 'saga' solver).
base_estimator = LogisticRegression(solver='saga', penalty='elasticnet', max_iter=300, random_state=42)

# Hyperparameter search space for hyperopt.
param_grid = {
    'l1_ratio': hp.choice('l1_ratio', [0.0001, 0.001, 0.01, 0.1, 0.5, 0.6, 0.8, 1.0]),
    'C': hp.choice('C', [0.0001, 0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1.0]),
}

# Run dates (as strings) passed to the cross-validation.
train_dates = metadata['Run Date'].apply(str)

clf = CalibratedPipelineHyperOptCV(base_estimator=base_estimator,
                                   param_grid=param_grid,
                                   scaler=scaler,
                                   resample=resample,
                                   max_iter=10,
                                   cv_kwargs={'dates': train_dates, 'n_splits': 5, 'valid_size': 20})

clf.fit(X, y)

# Save the fitted model to disk.
save_name = 'hail_model.joblib'
clf.save(save_name)

  0%|          | 0/100 [00:00<?, ?trial/s, best loss=?]

  1%|          | 1/100 [09:43<16:03:12, 583.76s/trial, best loss: 0.8459230868377026]



