# Advanced Model Training
---------


### Author Information
**Author:** PJ Gibson  
**Email:** Peter.Gibson@doh.wa.gov  
**Github:**   https://github.com/DOH-PJG1303

### Project Information
**Created Date:** 2023-05-23  
**Last Updated:** 2023-05-23  
**Version:** 1  

### Description
This notebook demonstrates a more complex model training workflow, with features such as:
- mlflow monitoring
- neural network model

### Notes

## 1. Import libraries

In [None]:
# Data analysis libs
import pandas as pd
import numpy as np
import random

# Supporting libs
import cloudpickle
import matplotlib.pyplot as plt
import time

# MLFlow specific libs
import mlflow
import mlflow.pyfunc
from mlflow.models.signature import infer_signature
from mlflow.utils.environment import _mlflow_conda_env
import mlflow.sklearn

# Sci-kit learn specific libs
import sklearn
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

## 2. Prep Data

### 2.1 Read Data

In [None]:
df = pd.read_csv('./Data/advanced_synthetic_training_data.csv',index_col = [0,1])

### 2.2 Test Train Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('label',axis=1), df['label'], test_size=0.2, random_state=42)

## 3. ML Model

### 3.1 Define Param Grid Search Space

In [None]:
# Define the possible sizes for the hidden layers in the neural network. Each tuple represents a
# different network architecture: the tuple's length is the number of hidden layers, and each entry
# is the number of neurons in that layer. For example, (100,) means one hidden layer with 100
# neurons, (50,50) means two hidden layers with 50 neurons each, etc.
hidden_layer_sizes = [(100,), (50,50), (100,100), (50,50,50), (10,10,10,10,10)]

# Define the possible activation functions for the neurons in the network. 'tanh' and 'relu' are
# two common choices.
activation = ['tanh', 'relu']

# Define the possible solvers for weight optimization in the network. 'sgd' stands for Stochastic 
# Gradient Descent, and 'adam' is a method that computes adaptive learning rates for each weight.
solver = ['sgd', 'adam']

# Define the possible values for alpha, the regularization parameter in the MLPClassifier. This
# parameter helps prevent overfitting by constraining the size of the weights.
alpha = [0.0001, 0.05]

# Define the possible learning rate schedules for weight updates. 'constant' means the learning
# rate stays the same throughout training, and 'adaptive' means the learning rate is reduced
# whenever the training loss stops improving. sklearn only uses this setting when solver='sgd'.
learning_rate = ['constant','adaptive']

# Consolidate all these lists into one list, which forms a grid of hyperparameters to be explored.
# This will be used later to randomly select a set of hyperparameters for each run of the model.
param_grid = [ hidden_layer_sizes , activation, solver, alpha, learning_rate ]
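
The five lists above define 5 × 2 × 2 × 2 × 2 = 80 possible combinations. As a minimal sketch (not the approach used in this notebook), the grid can also be enumerated with `itertools.product` and sampled without replacement, so that no configuration is trained twice and a fixed seed makes the selection repeatable. The names `search_space`, `all_combinations`, and `sampled` are illustrative, the seed is arbitrary, and the hidden layer tuples are placeholders.

```python
import itertools
import random

# Illustrative search space keyed by MLPClassifier argument name.
# The hidden layer tuples are placeholders; substitute your own.
search_space = {
    "hidden_layer_sizes": [(100,), (50, 50), (100, 100), (50, 50, 50), (10, 10, 10, 10, 10)],
    "activation": ["tanh", "relu"],
    "solver": ["sgd", "adam"],
    "alpha": [0.0001, 0.05],
    "learning_rate": ["constant", "adaptive"],
}

# Enumerate every combination as a dict of keyword arguments.
all_combinations = [
    dict(zip(search_space.keys(), values))
    for values in itertools.product(*search_space.values())
]
print(len(all_combinations))  # 80

# Draw 10 distinct configurations; the (arbitrary) seed makes the draw repeatable.
rng = random.Random(42)
sampled = rng.sample(all_combinations, k=10)
for params in sampled:
    print(params)
```

Each sampled dict could then be unpacked into the model constructor, for example `MLPClassifier(**params)`.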

### 3.2 Begin MLFlow runs

The class below may look complex, but it is not.
Its only purpose is to let us log a sklearn model with MLflow so that predictions are returned as probabilities rather than labels.
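
One practical benefit of returning probabilities (an assumption about the motivation, not something stated above) is that the decision threshold can be chosen after training. The sketch below uses made-up probabilities and labels, plus a hypothetical helper `precision_at`, to show how raising the threshold from 0.5 to 0.9 trades predicted matches for precision.

```python
# Made-up predicted match probabilities and true labels (1 = match), for illustration only.
probabilities = [0.97, 0.92, 0.85, 0.60, 0.55, 0.30, 0.10]
labels =        [1,    1,    0,    1,    0,    0,    0]

def precision_at(threshold, probabilities, labels):
    """Return (precision, number of predicted matches) at the given threshold."""
    predicted = [label for p, label in zip(probabilities, labels) if p >= threshold]
    if not predicted:
        return None, 0  # precision is undefined when nothing is predicted as a match
    return sum(predicted) / len(predicted), len(predicted)

for threshold in (0.5, 0.9):
    precision, n_predicted = precision_at(threshold, probabilities, labels)
    print(f"threshold={threshold}: precision={precision:.2f}, predicted matches={n_predicted}")
# threshold=0.5: precision=0.60, predicted matches=5
# threshold=0.9: precision=1.00, predicted matches=2
```

A model that only returns hard labels fixes this cut-off at training time, so the trade-off cannot be tuned later.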

In [None]:
# Define a wrapper class for sklearn models that implements the mlflow.pyfunc.PythonModel interface.
# This allows the model to be used with MLflow's pyfunc model flavor, which allows it to be loaded
# as a Python function for inference.
class SklearnModelWrapper(mlflow.pyfunc.PythonModel):
  def __init__(self, model):
    # Store the model object (fitted sklearn model).
    self.model = model

  # Define the predict method that the model will use for inference.  
  def predict(self, context, model_input):
    # For classification problems, sklearn's predict_proba method returns a 2D array with one row
    # per input and one column per class. The column at index 1 represents the probability of the
    # positive class, so we select it with [:,1].
    return self.model.predict_proba(model_input)[:,1]

# Enable automatic logging of hyperparameters, metrics, and the trained model.
mlflow.sklearn.autolog()

# Run 49 training trials: np.arange(1,50) yields the values 1 through 49.
for i in np.arange(1,50):

  # For each parameter in the parameter grid, select a random value and add it to the current
  # run's parameters.
  cur_run_params = [random.choice(options) for options in param_grid]

  # Start a new run in MLflow, giving it a specific name.
  with mlflow.start_run(run_name='RecordLinkage_NeuralNetwork'):

    # Initialize the MLPClassifier model with the current run's parameters.
    model = MLPClassifier(hidden_layer_sizes =  cur_run_params[0],
                          activation =  cur_run_params[1],
                          solver =  cur_run_params[2],
                          alpha =  cur_run_params[3],
                          learning_rate =  cur_run_params[4])

    # Train the model and log the time it took to train.
    fit_start = time.time()
    model.fit(X_train,y_train)
    fit_end = time.time()
    mlflow.log_metric('fit_time', fit_end - fit_start) 

    # Predict on the test set and log the time it took to predict.
    predict_start = time.time()
    pred_proba_test = model.predict_proba(X_test)[:,1]
    predict_end = time.time()
    mlflow.log_metric('predict_time', predict_end - predict_start)

    # Log various performance metrics.
    test_auc_score = roc_auc_score(y_test, pred_proba_test)
    mlflow.log_metric('auc', test_auc_score)
    predict_binary_test = model.predict(X_test)
    test_f1_score = f1_score(y_test, predict_binary_test)
    mlflow.log_metric('f1', test_f1_score)
    test_accuracy = accuracy_score(y_test, predict_binary_test)
    mlflow.log_metric('accuracy', test_accuracy)
    test_precision = precision_score(y_test, predict_binary_test)
    mlflow.log_metric('precision', test_precision)
    test_recall = recall_score(y_test, predict_binary_test)
    mlflow.log_metric('recall', test_recall)

    # Wrap the trained model using the SklearnModelWrapper class defined earlier.
    wrappedModel = SklearnModelWrapper(model)

    # Infer the signature of the model. The signature defines the input and output schema of the
    # model. When the model is deployed, this signature will be used to validate inputs.
    signature = infer_signature(X_train, wrappedModel.predict(None, X_train))

    # Log the model. This includes the model object, the wrapper class, and the inferred signature
    mlflow.pyfunc.log_model("RecordLinkage_NeuralNetwork", python_model=wrappedModel, signature=signature)  


## 4. Manually inspect output

Open a command prompt.  I'm using Visual Studio Code, so this is as easy as clicking the `Terminal` dropdown in the top menu, selecting `New Terminal`, and then opening a command prompt there.

Next, make sure that you are working in the proper virtual environment.
Activating it should look something like this:

```cmd
virtual_environment_name\Scripts\activate
```

From that point, navigate to the MachineLearning directory:

```cmd
cd MachineLearning
```

Then, you can open the MLFlow user interface by typing:

```cmd
mlflow ui
```

After a moment, it should print the address of a locally hosted web server, which you can `CTRL+click` to open the MLFlow UI in your browser and compare runs.
This tool is very useful for choosing a model.
I like to pick a model with high precision (few false positives), high accuracy (usually right), high AUC (true matches tend to receive higher predicted probabilities than non-matches), and low predict time (fast model).  In general, high precision is likely the most important for record linkage, because a false positive can create further transitive links, which causes a cascade of issues.  False positives are more costly than false negatives in this context.
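
To illustrate why a single false positive can cascade, the sketch below groups records into entities by following links transitively, using a small union-find structure. The record IDs, the links, and the helper names (`clusters`, `find`, `union`) are all made up for illustration; this is not the linkage method used in this project.

```python
def clusters(records, links):
    """Group records into entities by following links transitively (union-find)."""
    parent = {r: r for r in records}

    def find(r):
        # Walk up to the root representative, compressing the path as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in links:
        union(a, b)

    groups = {}
    for r in records:
        groups.setdefault(find(r), set()).add(r)
    return sorted(sorted(g) for g in groups.values())

records = ["A1", "A2", "A3", "B1", "B2"]
true_links = [("A1", "A2"), ("A2", "A3"), ("B1", "B2")]

print(clusters(records, true_links))
# [['A1', 'A2', 'A3'], ['B1', 'B2']]

# One false positive between A3 and B1 merges two distinct people into one entity.
print(clusters(records, true_links + [("A3", "B1")]))
# [['A1', 'A2', 'A3', 'B1', 'B2']]
```

In this toy case one wrong link implies six wrong record pairs (each of the three A records paired with each of the two B records), whereas a missed link would only leave one entity split in two.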

-----------

Once you choose a model you like, you can register it and transition it to staging or production.  This way, you can track models and versions.
As you diagnose issues with your model, you can modify your training data to improve upon it and train better models.
Go MLOps!!!