# Training the Baseline Model 

### Primary Goal: Train an accurate baseline model for the individual severe weather hazards. 

In this notebook, I'll provide a brief tutorial on how to train and evaluate a baseline model. A simpler baseline model is not just helpful but crucial: it is the reference against which the skill of the machine learning model is evaluated. 

In [3]:
# Import packages 
import pandas as pd
import numpy as np
import joblib

# Plotting code imports 
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()

# Add the GitHub package to the system path so we can import Python scripts from that repo. 
import sys
sys.path.append('/home/monte.flora/python_packages/2to6_hr_severe_wx/')
from main.io import load_bl_data
from main.evaluator import baseline_cv_scorer  # used in Step 1 below
from sklearn.isotonic import IsotonicRegression

In [4]:
# Configuration variables (You'll need to change based on where you store your data)
base_path = '/work/mflora/ML_2TO6HR/data'

### Neighborhood Maximum Ensemble Probability (NMEP)

For the baseline system, we use a single-variable approach in which we compute the ensemble probability and extract the maximum value within a track. To compute the ensemble probability ($EP$), we compare a variable $f$ against a threshold $t$ for each of the $N$ ensemble members and compute the fraction of members exceeding that threshold:
\begin{equation}
	        EP = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\left[f_i > t\right]
\end{equation}
where $\mathbf{1}[\cdot]$ equals 1 when the condition holds and 0 otherwise.

The variable used for each hazard is as follows: 
   * Tornado $\rightarrow$ Updraft Helicity (`uh_2to5_instant`)
   * Severe Hail $\rightarrow$ HAILCAST (`hailcast`)
   * Severe Wind $\rightarrow$ 80-m wind speed (`ws_80`)
   
   
For the NMEP, we apply a local maximum value filter to each ensemble member prior to computing the $EP$. Taking the maximum value within a neighborhood accounts for spatial uncertainty in the forecast. 
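
As a rough illustration of the two steps described above, the sketch below computes the $EP$ and the NMEP for a toy three-member ensemble using only NumPy. The function names (`ensemble_probability`, `neighborhood_max`, `nmep`), the square window, and the toy grid are assumptions made for illustration; this is not the code that produced the dataset loaded later in this notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def ensemble_probability(fields, threshold):
    """Fraction of members exceeding `threshold` at each grid point.

    fields : array of shape (n_members, ny, nx)
    """
    return (fields > threshold).mean(axis=0)


def neighborhood_max(field, size):
    """Maximum over a size x size square window centered on each grid point.

    `size` is assumed to be odd. Edges are padded with -inf so that the
    padding never wins the maximum.
    """
    pad = size // 2
    padded = np.pad(field, pad, mode="constant", constant_values=-np.inf)
    windows = sliding_window_view(padded, (size, size))
    return windows.max(axis=(-2, -1))


def nmep(fields, threshold, size):
    """Apply the maximum filter to each member, then compute the EP."""
    filtered = np.stack([neighborhood_max(member, size) for member in fields])
    return ensemble_probability(filtered, threshold)


# Toy ensemble: 3 members on a 5 x 5 grid (illustrative values only).
fields = np.zeros((3, 5, 5))
fields[0, 2, 2] = 2.0  # member 0 exceeds the threshold at the grid center
fields[1, 0, 0] = 2.0  # member 1 exceeds the threshold in one corner

ep = ensemble_probability(fields, threshold=1.0)
nm = nmep(fields, threshold=1.0, size=3)
```

At grid point (1, 1) no member exceeds the threshold, so the $EP$ there is zero, but both exceedances fall within its 3 x 3 neighborhood, so the NMEP is 2/3. That difference is the spatial tolerance the neighborhood provides.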

<div class="alert alert-block alert-warning"> <b>Task:</b> Per hazard and target variable, determine the most skillful threshold and scale   </div>

To improve the probabilistic guidance provided by the baseline system, we used [isotonic regression](https://scikit-learn.org/stable/modules/isotonic.html) to calibrate the probabilities. We used the cross-validation approach from [Platt 1999](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.1639) to train the calibration model: the predictions and target values from each validation set are concatenated, and isotonic regression is then fit on that combined dataset. 

## Step 1. Evaluate the NMEP at different thresholds and scales. 

As a first example, I've provided some code that evaluates the NMEP using cross-validation on the training dataset. Your goal will be to determine the best threshold and scale per hazard. You'll want to create a figure similar to Fig. 1 in [Loken et al. 2020](https://journals.ametsoc.org/view/journals/wefo/35/4/wafD190258.xml), which will be a great addition to your paper. 


In [5]:
# Uncomment and run this command to learn about the input args for the load_bl_data function.
#help(load_bl_data)

In [6]:
df, y, dates = load_bl_data(mode='train', 
                            target_col = 'hail_severe__36km', 
                            base_path = base_path)

In [7]:
df

Unnamed: 0,hailcast__nmep_>0_5_45km,hailcast__nmep_>0_75_45km,hailcast__nmep_>1_0_45km,hailcast__nmep_>1_25_45km,hailcast__nmep_>1_5_45km,uh_2to5_instant__nmep_>50_45km,uh_2to5_instant__nmep_>75_45km,uh_2to5_instant__nmep_>100_45km,uh_2to5_instant__nmep_>125_45km,uh_2to5_instant__nmep_>150_45km,...,tornado_severe__18km,hail_sig_severe__18km,wind_sig_severe__18km,tornado_sig_severe__18km,hail_severe__9km,wind_severe__9km,tornado_severe__9km,hail_sig_severe__9km,wind_sig_severe__9km,tornado_sig_severe__9km
0,0.500000,0.444444,0.388889,0.388889,0.333333,0.444444,0.444444,0.444444,0.444444,0.444444,...,0,0,0,0,0,0,0,0,0,0
1,1.000000,1.000000,1.000000,0.944444,0.888889,1.000000,1.000000,1.000000,0.888889,0.722222,...,0,0,0,0,0,0,0,0,0,0
2,0.611111,0.444444,0.277778,0.111111,0.055556,0.555556,0.333333,0.333333,0.222222,0.166667,...,0,0,0,0,0,0,0,0,0,0
3,0.055556,0.000000,0.000000,0.000000,0.000000,0.055556,0.000000,0.000000,0.000000,0.000000,...,0,0,0,0,0,0,0,0,0,0
4,1.000000,0.833333,0.611111,0.444444,0.166667,1.000000,0.666667,0.333333,0.222222,0.111111,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2703126,0.000000,0.000000,0.000000,0.000000,0.000000,0.166667,0.000000,0.000000,0.000000,0.000000,...,0,0,0,0,0,0,0,0,0,0
2703127,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0,0,0,0,0,0,0,0,0,0
2703128,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0,0,0,0,0,0,0,0,0,0
2703129,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0,0,0,0,0,0,0,0,0,0


In [5]:
# Uncomment and run to see the full list of features. 
#list(df.columns)

# Here is the breakdown for the naming convention 
# [hailcast|ws_80|uh_2to5_instant]__nmep_>[threshold]_[9|27|45]km

Here is an example of how to use the `baseline_cv_scorer` function. Write a for loop that iterates over the different thresholds and scales, and keep the mean of each set of `cv_scores`. Since there are two parameters to vary (threshold and scale), the final figure will be a heatmap. For an easy plotting example, see the [seaborn heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html). 

In [6]:
X = df['hailcast__nmep_>1_0_45km']
cv_scores = baseline_cv_scorer(X, y, dates)

In [7]:
cv_scores

[0.08014548479021788,
 0.09350032993473578,
 0.09938423947315389,
 0.08826115901914233,
 0.07402634165939381]

<div class="alert alert-block alert-info"> <b>Tip</b> It may be useful to save the results so you can plot them later. </div>
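
One possible way to organize that loop is sketched below. It assumes the column naming convention shown earlier and takes the scorer as an argument, so `baseline_cv_scorer` can be passed in directly. The helper name `grid_mean_scores`, the toy data, and `toy_scorer` are hypothetical stand-ins so the example runs on its own; `toy_scorer` is not the metric that `baseline_cv_scorer` computes.

```python
import numpy as np
import pandas as pd


def grid_mean_scores(df, y, dates, scorer, var, thresholds, scales):
    """Mean CV score for every (threshold, scale) combination of `var`.

    Returns a DataFrame with thresholds as rows and scales as columns.
    Combinations whose column is missing from `df` are left as NaN.
    """
    results = pd.DataFrame(index=thresholds, columns=scales, dtype=float)
    for thr in thresholds:
        for scale in scales:
            col = f"{var}__nmep_>{thr}_{scale}km"
            if col not in df.columns:
                continue
            results.loc[thr, scale] = np.mean(scorer(df[col], y, dates))
    results.index.name = "threshold"
    results.columns.name = "scale_km"
    return results


def toy_scorer(X, y, dates):
    """Hypothetical placeholder: mean squared error per unique date."""
    X = np.asarray(X)
    return [float(np.mean((X[dates == d] - y[dates == d]) ** 2))
            for d in np.unique(dates)]


# Toy inputs (illustrative values only).
toy_df = pd.DataFrame({
    "hailcast__nmep_>0_5_9km": [0.2, 0.8, 0.1, 0.9],
    "hailcast__nmep_>1_0_9km": [0.0, 0.6, 0.0, 0.7],
})
toy_y = np.array([0, 1, 0, 1])
toy_dates = np.array(["20180501", "20180501", "20180502", "20180502"])

scores = grid_mean_scores(toy_df, toy_y, toy_dates, toy_scorer,
                          var="hailcast",
                          thresholds=["0_5", "1_0"],
                          scales=[9, 27])
```

With the real data you would call `grid_mean_scores(df, y, dates, baseline_cv_scorer, ...)`, store the table with `scores.to_csv(...)`, and plot it later with `sns.heatmap(scores, annot=True)`.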

## Step 2. Training the final, calibrated model 

Once you are confident about the best threshold and scale, you can train the final baseline model. 


In [8]:
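def train_baseline(X, y, dates, save_name):
    """Fit an isotonic regression mapping the raw NMEP to a calibrated probability, then save it.

    `dates` is currently unused.
    """
    # Clip inputs outside the training range and keep outputs within [0, 1].
    clf = IsotonicRegression(out_of_bounds='clip', y_min=0, y_max=1)
    clf.fit(X, y)
    joblib.dump(clf, save_name, compress=4)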

In [9]:
# Create a save path for the baseline model. 
save_name = 'hail_baseline_model.joblib'
train_baseline(X,y,dates, save_name)
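
Once a model has been saved this way, it can be reloaded with `joblib.load` and applied to new NMEP values with `predict`. The sketch below shows that round trip on synthetic data; the random inputs, the temporary file name, and the assumed 0.3 scaling between raw probability and event frequency are made up for illustration and say nothing about the real dataset.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Synthetic, over-forecasting "raw" probabilities (illustrative only):
# events occur with probability 0.3 * raw.
rng = np.random.default_rng(42)
raw = rng.uniform(0, 1, 500)
obs = (rng.uniform(0, 1, 500) < 0.3 * raw).astype(int)

# Same estimator settings as train_baseline above.
clf = IsotonicRegression(out_of_bounds='clip', y_min=0, y_max=1)
clf.fit(raw, obs)

# Save and reload, mirroring the joblib.dump call in train_baseline.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_baseline_model.joblib')
    joblib.dump(clf, path, compress=4)
    loaded = joblib.load(path)

# Calibrate new raw values; 1.5 lies outside the training range and is clipped.
new_raw = np.array([0.0, 0.25, 0.5, 1.0, 1.5])
calibrated = loaded.predict(new_raw)
```

Because isotonic regression fits a non-decreasing step function, the calibrated values preserve the ranking of the raw NMEP while adjusting its magnitude toward the observed event frequency.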