## Training the Machine Learning Models 

### Primary Goal: Train an accurate machine learning model for the individual severe weather hazards. 
### Secondary Goal: Develop the models such that they outperform the baseline models. 

In this notebook, I'll provide a brief tutorial on how to train and evaluate a machine learning model. It is not only helpful but crucial to develop a simpler baseline model against which to evaluate the skill of the machine learning model.
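
To make the idea of "skill relative to a baseline" concrete, here is a minimal sketch. It is an illustration only and not part of this repository: the data are synthetic, the baseline is simply the event frequency, and the Brier skill score is swapped in as one convenient skill measure. The helper names `brier_score` and `brier_skill_score` are hypothetical.

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared error between forecast probabilities and binary outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

def brier_skill_score(y_true, model_prob, baseline_prob):
    """1 - BS_model / BS_baseline; positive values mean the model beats the baseline."""
    return 1.0 - brier_score(y_true, model_prob) / brier_score(y_true, baseline_prob)

# Synthetic labels (made up): events occur roughly 10% of the time
rng = np.random.default_rng(0)
y_true = (rng.random(5000) < 0.1).astype(int)

# Baseline: always forecast the observed event frequency
baseline_prob = np.full(y_true.shape, y_true.mean())

# Hypothetical "model": noisy probabilities that lean toward the true label
model_prob = np.clip(0.1 + 0.5 * (y_true - 0.1) + rng.normal(0, 0.05, y_true.size), 0, 1)

bss = brier_skill_score(y_true, model_prob, baseline_prob)
print(f"Brier skill score vs. baseline: {bss:.3f}")
```

A score of 0 means the model is no better than the baseline, and a score of 1 means a perfect forecast.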

In [1]:
# Always keep this import at the top of your script. It uses the Intel extension 
# for scikit-learn, which improves the training speed of machine learning algorithms
# in scikit-learn. 

# We add the GitHub packages to our system path so we can import Python scripts from those repos. 
import sys
sys.path.append('/home/samuel.varga/projects/2to6_hr_severe_wx/')
sys.path.append('/home/samuel.varga/python_packages/ml_workflow/')
from main.io import load_ml_data
from ml_workflow.calibrated_pipeline_hyperopt_cv import CalibratedPipelineHyperOptCV

# Import packages 
import pandas as pd
import numpy as np
import sklearn
import sklearn.ensemble  # Needed for the RandomForest / HistGradientBoosting classifiers used below
from os.path import join
from sklearn.linear_model import LogisticRegression
from hyperopt import hp

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [2]:
# Configuration variables (You'll need to change based on where you store your data)
FRAMEWORK='POTVIN'
TIMESCALE='0to3'
base_path = f'/work/samuel.varga/data/{TIMESCALE}_hr_severe_wx/{FRAMEWORK}'

In [8]:
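target_col='wind_severe__36km'
scale='36'
All=True
if All: # Use any severe hazard (wind, hail, or tornado) as the target
    X,y,metadata = load_ml_data(base_path=base_path, 
                            mode='train', 
                            target_col=target_col,
                           FRAMEWORK=FRAMEWORK,
                           TIMESCALE=TIMESCALE)
    print(len(y[y>0])) # Number of examples with severe wind
    for hazard in ['hail','tornado']:
        # NOTE: this overwrites target_col, which is reused later when building save_name
        target_col='{}_severe__{}km'.format(hazard, scale)
        _, y1, _  = load_ml_data(base_path=base_path, mode='train', target_col=target_col, FRAMEWORK=FRAMEWORK, TIMESCALE=TIMESCALE) 
        y +=y1 # Add this hazard's labels to the running total
        print(len(y[y>0])) # Running number of examples with at least one hazard
       
    y[y > 0] = 1 # Collapse the summed labels to a binary target
else: # Use only the single hazard named in target_col
    X,y,metadata = load_ml_data(base_path=base_path, 
                            mode='train', 
                            target_col=target_col)
#list(X.columns)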



139737
243973
255611


## CalibratedPipelineHyperOptCV

This is a scikit-learn style model I've developed. As the name suggests, it handles the following: 
#### 1. Creating an ML pipeline. 
Data often require preprocessing, such as imputing missing values, scaling the data, or resampling the data to fix class imbalances, before the ML model is fit. CalibratedPipelineHyperOptCV has 3 options for the pipeline: `scaler`, `imputer`, and `resample`. You'll likely keep the options at `scaler = 'standard'`, `imputer='simple'`, and `resample='under'` for the REU, but feel free to explore different options! 
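
The three preprocessing steps can be sketched with plain scikit-learn and NumPy. This is an illustration under assumptions, not the internals of CalibratedPipelineHyperOptCV: the data are synthetic, mean imputation is assumed for the "simple" imputer, and `random_undersample` is a hypothetical helper written for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def random_undersample(X, y, rng):
    """Randomly drop majority-class rows until both classes have equal counts."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_keep = min(len(pos_idx), len(neg_idx))
    keep = np.concatenate([
        rng.choice(pos_idx, size=n_keep, replace=False),
        rng.choice(neg_idx, size=n_keep, replace=False),
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Synthetic, imbalanced data with a few missing values (made up for illustration)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(2000, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=2000) > 1.5).astype(int)
X_demo[rng.random(X_demo.shape) < 0.02] = np.nan

# Resample first, then impute and scale inside the pipeline
X_bal, y_bal = random_undersample(X_demo, y_demo, rng)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # fill missing values (assumed strategy)
    ("scaler", StandardScaler()),                 # zero mean, unit variance
    ("model", LogisticRegression(max_iter=300)),
])
pipe.fit(X_bal, y_bal)
print(f"Positive fraction before/after resampling: {y_demo.mean():.3f} / {y_bal.mean():.3f}")
```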
 
#### 2. Performing Hyperparameter Optimization 
Many machine learning models have tunable knobs (hyperparameters). For example, in a random forest we can set the number of trees. How do we know we have chosen the best-performing options? The simplest but most time-consuming option is brute force: try different hyperparameters and evaluate the performance of each on a validation dataset. CalibratedPipelineHyperOptCV uses the [hyperopt](http://hyperopt.github.io/hyperopt/) package and cross-validation to determine the best-performing hyperparameters. For the hyperparameter optimization, you set the number of iterations (`max_iter`). 

To use the hyperparameter optimization, you'll need to set a parameter grid (`param_grid`). An example grid for logistic regression is provided below. Feel free to ask questions about setting up a grid for other models. 
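
For intuition, the brute-force approach described above can be sketched in a few lines. This is a stand-in that uses scikit-learn's `cross_val_score` on synthetic data, not hyperopt and not the search that CalibratedPipelineHyperOptCV runs; the grid values and the ROC AUC scoring are arbitrary choices for the example.

```python
import itertools
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data (made up for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 5))
y_demo = (X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=600) > 0).astype(int)

# A small, arbitrary grid over two random forest hyperparameters
grid = {"n_estimators": [10, 50], "max_depth": [2, 4, 8]}

results = []
for n_estimators, max_depth in itertools.product(grid["n_estimators"], grid["max_depth"]):
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth,
                                   random_state=0)
    # Mean validation score over 5 cross-validation folds
    score = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc").mean()
    results.append(({"n_estimators": n_estimators, "max_depth": max_depth}, score))

# Keep the combination with the highest mean validation score
best_params, best_score = max(results, key=lambda item: item[1])
print(best_params, round(best_score, 3))
```

The cost grows with the product of the grid sizes, which is why a guided search with a fixed number of trials is attractive for larger grids.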


#### 3. Calibration 
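Machine learning models, especially those trained on resampled data, tend to be poorly calibrated. For example, under-sampling the majority class inflates the predicted probabilities of the rare class. We perform cross-validation-based calibration using isotonic regression. 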
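
The effect of resampling on calibration can be seen in a small experiment. The sketch below is an assumption-laden illustration: the data are synthetic, and it calibrates on a single held-out set with scikit-learn's `IsotonicRegression` rather than the cross-validation-based procedure used by CalibratedPipelineHyperOptCV.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_data(n):
    """Synthetic rare-event data (made up): positives are well under 10% of examples."""
    X = rng.normal(size=(n, 3))
    logits = 1.5 * X[:, 0] - 3.5
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)
    return X, y

X_train, y_train = make_data(20000)
X_cal, y_cal = make_data(20000)
X_test, y_test = make_data(20000)

# Under-sample the negatives so the training set is balanced
pos = np.flatnonzero(y_train == 1)
neg = rng.choice(np.flatnonzero(y_train == 0), size=len(pos), replace=False)
idx = np.concatenate([pos, neg])
model = LogisticRegression(max_iter=300).fit(X_train[idx], y_train[idx])

# Fit the isotonic map on held-out predictions, then apply it to new data
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(model.predict_proba(X_cal)[:, 1], y_cal)

raw_prob = model.predict_proba(X_test)[:, 1]
cal_prob = iso.predict(raw_prob)

print(f"Observed event frequency:    {y_test.mean():.3f}")
print(f"Mean raw probability:        {raw_prob.mean():.3f}")
print(f"Mean calibrated probability: {cal_prob.mean():.3f}")
```

The raw probabilities from the balanced training set overstate how often events occur, while the calibrated probabilities average out close to the observed frequency.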
 


In [4]:
# Run this to see the different CalibratedPipelineHyperOptCV options. 
help(CalibratedPipelineHyperOptCV)

Help on class CalibratedPipelineHyperOptCV in module ml_workflow.calibrated_pipeline_hyperopt_cv:

class CalibratedPipelineHyperOptCV(sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin, sklearn.base.MetaEstimatorMixin)
 |  CalibratedPipelineHyperOptCV(base_estimator, param_grid, cal_method='isotonic', imputer='simple', scaler=None, resample=None, local_dir='/home/samuel.varga/projects/2to6_hr_severe_wx/notebooks', n_jobs=1, max_iter=15, scorer=<function norm_csi_scorer at 0x7f5acc893eb0>, cv='kfolds', cv_kwargs={'n_splits': 5}, hyperopt='atpe', scorer_kwargs={})
 |  
 |  This class takes X,y as inputs and returns 
 |  a ML pipeline with optimized hyperparameters (through k-folds cross-validation)  
 |  that is also calibrated using isotonic regression. 
 |  
 |  Parameters:
 |  ---------------------------
 |      base_estimator : unfit callable classifier or regressor (likely from scikit-learn) 
 |                  that implements a ``fit`` method. 
 |      
 |      param_grid : 

## Example. 

This example shows how to train a simple logistic regression with elastic-net regularization using the CalibratedPipelineHyperOptCV class. Since logistic regression requires the different features to have similar scales, we set `scaler = 'standard'`. There is significant class imbalance (most of the examples are non-severe), so we set `resample = 'under'`. 

For the cross-validation argument (`cv_kwargs`), we pass in the training dates along with the number of splits. 
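
One plausible reason to hand dates to the cross-validation is to keep all examples from the same run date on the same side of each split, so that nearly identical examples do not leak between training and validation. How CalibratedPipelineHyperOptCV actually uses `dates` and `valid_size` is not shown in this notebook, so the sketch below is only an assumption about date-grouped folds; `date_based_folds` is a hypothetical helper and the dates are made up.

```python
import numpy as np

def date_based_folds(dates, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs in which no date appears on both sides."""
    dates = np.asarray(dates)
    unique_dates = np.unique(dates)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique_dates)
    # Each chunk of dates becomes the validation set for one fold
    for fold_dates in np.array_split(unique_dates, n_splits):
        valid_mask = np.isin(dates, fold_dates)
        yield np.flatnonzero(~valid_mask), np.flatnonzero(valid_mask)

# Made-up run dates: 10 dates with 4 examples each
demo_dates = np.repeat([f"202105{day:02d}" for day in range(1, 11)], 4)

folds = list(date_based_folds(demo_dates, n_splits=5))
for train_idx, valid_idx in folds:
    shared = set(demo_dates[train_idx]) & set(demo_dates[valid_idx])
    print(len(train_idx), len(valid_idx), "shared dates:", len(shared))
```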

In [9]:
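# Drop the NX and NY columns that I added (this happens regardless of the flags below).
# `original` and `envOnly` control the additional column filtering further down.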
X=X.drop(['NX','NY'], axis=1)
original=False
envOnly=False

vardic={'ENS_VARS':['uh_2to5_instant',
                     'uh_0to2_instant',
                     'wz_0to2_instant',
                     'comp_dz',
                     'ws_80',
                     'hailcast',
                     'w_up',
                     'okubo_weiss',
                    ],
        'ENV_VARS':['mid_level_lapse_rate', 
                    'low_level_lapse_rate', 
                    'shear_u_0to1', 
                    'shear_v_0to1', 
                    'shear_u_0to6', 
                    'shear_v_0to6',
                    'shear_u_3to6', 
                    'shear_v_3to6',
                    'srh_0to3',
                    'cape_ml', 
                    'cin_ml', 
                    'stp',
                    'scp',]}
if original:
    # Drop every column tagged 'IQR', '2nd', or '16th'
    X=X[[col for col in X.columns if not any(tag in col for tag in ('IQR','2nd','16th'))]]
    # Drop the mean of every intrastorm variable
    stormcols=[col for col in X.columns
               if 'mean' in col and any(strmvar in col for strmvar in vardic['ENS_VARS'])]
    X=X.drop(stormcols, axis=1)
if envOnly:
    # Drop every column derived from an intrastorm (IS) variable
    stormcols=[col for col in X.columns
               if any(strmvar in col for strmvar in vardic['ENS_VARS'])]
    X=X.drop(stormcols, axis=1)
list(X.columns)

['uh_2to5_instant__time_max__45km__ens_90th',
 'uh_0to2_instant__time_max__45km__ens_90th',
 'wz_0to2_instant__time_max__45km__ens_90th',
 'comp_dz__time_max__45km__ens_90th',
 'ws_80__time_max__45km__ens_90th',
 'hailcast__time_max__45km__ens_90th',
 'w_up__time_max__45km__ens_90th',
 'okubo_weiss__time_max__45km__ens_90th',
 'uh_2to5_instant__time_max__45km__ens_mean',
 'uh_0to2_instant__time_max__45km__ens_mean',
 'wz_0to2_instant__time_max__45km__ens_mean',
 'comp_dz__time_max__45km__ens_mean',
 'ws_80__time_max__45km__ens_mean',
 'hailcast__time_max__45km__ens_mean',
 'w_up__time_max__45km__ens_mean',
 'okubo_weiss__time_max__45km__ens_mean',
 'uh_2to5_instant__time_max__45km__ens_2nd',
 'uh_0to2_instant__time_max__45km__ens_2nd',
 'wz_0to2_instant__time_max__45km__ens_2nd',
 'comp_dz__time_max__45km__ens_2nd',
 'ws_80__time_max__45km__ens_2nd',
 'hailcast__time_max__45km__ens_2nd',
 'w_up__time_max__45km__ens_2nd',
 'okubo_weiss__time_max__45km__ens_2nd',
 'uh_2to5_instant__time_

In [10]:
print(np.shape(X))

(1117741, 198)


In [13]:
scaler = 'standard'
resample = 'under'
name='ADAM'

if name=='logistic':
    base_estimator = LogisticRegression(solver='saga', penalty='elasticnet', max_iter=300, random_state=42)
    #Param grid for LogReg
    param_grid = {
                'l1_ratio': hp.choice('l1_ratio', [0.0001, 0.001, 0.01, 0.1, 0.15, 0.2, 0.25, 0.3, 0.5, 0.6, 0.8, 1.0]),
                'C': hp.choice('C', [0.0001, 0.001, 0.01, 0.1, 0.15, 0.2, 0.25, 0.35, 0.5, 0.62, 0.75, 0.87, 1.0]),
            }
elif name=='random':
    base_estimator=sklearn.ensemble.RandomForestClassifier(random_state=42)
    #Param Grid for RF
    param_grid = {
               'n_estimators' : hp.choice('n_estimators',[10, 25, 50, 100,150,300,400,500]), 
               'max_depth' : hp.choice('max_depth',[4, 6,8,10,15,20]),
               'max_features' : hp.choice('max_features',[4,6,8,10,15,20,25,30]),
               'min_samples_split' : hp.choice('min_samples_split',[4,6,8,10,15,20,25,50]),
               'min_samples_leaf' : hp.choice('min_samples_leaf',[4,6,8,10,15,20,25,50]),
            }
elif name=='hist':
    base_estimator=sklearn.ensemble.HistGradientBoostingClassifier(random_state=42, max_iter=150)
    #Check performance of different loss functions?
    #Param Grid for HGB
    param_grid= {
    'learning_rate': hp.choice('learning_rate',[0.0001, 0.001, 0.01, 0.1]),
    'max_leaf_nodes': hp.choice('max_leaf_nodes',[5, 10, 20, 30, 40, 50]),
    'max_depth': hp.choice('max_depth', [4, 6, 8, 10]),
    'min_samples_leaf': hp.choice('min_samples_leaf',[5,10,15,20,30, 40, 50]),
    'l2_regularization': hp.choice('l2_regularization',[0.001, 0.01, 0.1]), #This one causes problems
    'max_bins': hp.choice('max_bins',[15, 31, 63, 127])
    
            }

elif name=='ADAM':
    base_estimator=sklearn.ensemble.RandomForestClassifier(random_state=42)
    # Fixed configuration: every choice has a single value, so no search actually takes place
    param_grid = {
               'n_estimators' : hp.choice('n_estimators',[200]), 
               'criterion' : hp.choice('criterion',['entropy']),
               'max_depth' : hp.choice('max_depth',[15]),
               'max_features' : hp.choice('max_features',["sqrt"]),
               #'min_samples_split' : hp.choice('min_samples_split',[4,6,8,10,15,20,25,50]),
               'min_samples_leaf' : hp.choice('min_samples_leaf',[20]),
            }





train_dates = metadata['Run Date'].apply(str)

clf = CalibratedPipelineHyperOptCV(base_estimator=base_estimator, 
                                   param_grid=param_grid, 
                                   scaler=scaler, 
                                   resample=resample, max_iter=50, 
                                   cv_kwargs = {'dates': train_dates, 'n_splits': 5, 'valid_size' : 20}, 
                                   hyperopt='tpe')

clf.fit(X, y)


save_name = f'Varga_{target_col}_model_{name}_original.joblib'
clf.save(save_name) # <- need to make sure this points to my directory
#target_col='wind_severe_36km
#names=['hist','logistic','random']

  0%|                                                                                                                                             | 0/100 [02:39<?, ?trial/s, best loss=?]


KeyboardInterrupt: 

In [None]:
#Supposed to use hyperopt='tke'