# Sklearn
---------

### Author Information
**Author:** PJ Gibson  
**Email:** Peter.Gibson@doh.wa.gov  
**Github:**   https://github.com/DOH-PJG1303

### Project Information
**Created Date:** 2023-08-09  
**Last Updated:** 2023-08-09  
**Version:** 1  

### Description

In this notebook, we train an ML model for record linkage (RL).
We'll use sklearn's multilayer perceptron (MLP) classifier.
This is a neural network model that performed well on one of my previous RL tasks.



### Notes

*\*If you are unfamiliar with the origins of this synthetic data, please see the [Synthetic-Gold](https://github.com/DOH-PJG1303/Synthetic-Gold) GitHub project. We ran the simulation for the state of Nebraska, so all data is relevant to that state.
To manage the size of the data we store publicly on GitHub, we only captured the relevant data in each table for the population living in the years 2019-2022.*


*\*\*Annotation improved with the help of ChatGPT*

## 1. Import Libraries

In [1]:
# Data analysis libs
import pandas as pd
import numpy as np
import random

# Supporting libs
import time

# MLflow-specific libs
import mlflow
import mlflow.pyfunc
from mlflow.models.signature import infer_signature
import mlflow.sklearn

# scikit-learn-specific libs
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

In [2]:
# Seed Python's random module so the random hyperparameter picks are reproducible
random.seed(42)
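
As a quick illustration of what this seed does and does not control: `random.seed` fixes the sequence produced by Python's `random` module, which is what the experiment loop uses (via `random.choice`) to pick hyperparameters. It does not seed NumPy or scikit-learn; `MLPClassifier` takes its own `random_state` argument for weight initialization. The sketch below uses a hypothetical `draw_choices` helper that is not part of this notebook.

```python
import random


def draw_choices(seed, options, n_draws):
    """Return n_draws random picks from options after seeding Python's RNG.

    Hypothetical helper for illustration only.
    """
    random.seed(seed)
    return [random.choice(options) for _ in range(n_draws)]


options = ['tanh', 'relu']
first = draw_choices(42, options, 5)
second = draw_choices(42, options, 5)

# Same seed, same sequence of picks
print(first == second)
```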

## 2. Prep Data

### 2.1 Read Data

In [3]:
df = pd.read_parquet('../../Data/Training/04. Training Data Hep C.parquet')

### 2.2 Test Train Split

In [4]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('label',axis=1), df['label'], test_size=0.2, random_state=42)

## 3. ML Model

### 3.1 Define Param Grid Search Space

Since we'll be using a sklearn.neural_network.MLPClassifier, please review [sklearn's documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) to understand the various parameters we can tweak for each model.

In [5]:
# Define the possible sizes for the hidden layers in the neural network. Each tuple represents a
# different network architecture, with one entry per hidden layer giving that layer's neuron count.
# For example, (100,) means one hidden layer with 100 neurons, and (50,2) means two hidden layers
# with 50 and 2 neurons respectively. (Two layers of 50 neurons each would be written (50,50).)
hidden_layer_sizes = [(100,), (50,2), (100,2), (50,3), (10,5)]

# Define the possible activation functions for the neurons in the network. 'tanh' and 'relu' are
# two common choices.
activation = ['tanh', 'relu']

# Define the possible solvers for weight optimization in the network. 'sgd' stands for Stochastic 
# Gradient Descent, and 'adam' is a method that computes adaptive learning rates for each weight.
solver = ['sgd', 'adam']

# Define the possible values for alpha, the regularization parameter in the MLPClassifier. This
# parameter helps prevent overfitting by constraining the size of the weights.
alpha = [0.0001, 0.05]

# Define the possible learning rate schedules for weight updates. 'constant' means the learning
# rate stays the same throughout training, and 'adaptive' means the learning rate is reduced
# whenever progress on the training loss stalls. This setting only takes effect when solver='sgd'.
learning_rate = ['constant','adaptive']

# Consolidate all these lists into one list, which forms a grid of hyperparameters to be explored.
# This will be used later to randomly select a set of hyperparameters for each run of the model.
param_grid = [ hidden_layer_sizes , activation, solver, alpha, learning_rate ]
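
For a sense of scale, the grid above can be enumerated with `itertools.product`. The sketch below is an illustration rather than part of the training code; `all_combos` and `sampled` are hypothetical names. It counts the distinct combinations and shows one way to draw several of them without repeats using `random.sample`.

```python
import itertools
import random

hidden_layer_sizes = [(100,), (50, 2), (100, 2), (50, 3), (10, 5)]
activation = ['tanh', 'relu']
solver = ['sgd', 'adam']
alpha = [0.0001, 0.05]
learning_rate = ['constant', 'adaptive']
param_grid = [hidden_layer_sizes, activation, solver, alpha, learning_rate]

# Every distinct combination: 5 * 2 * 2 * 2 * 2 = 80
all_combos = list(itertools.product(*param_grid))
print(len(all_combos))  # 80

# Draw 20 distinct combinations (sampling without replacement)
random.seed(42)
sampled = random.sample(all_combos, k=20)
print(len(set(sampled)))  # 20
```

Picking each value independently with `random.choice` is simpler, but it can select the same combination more than once.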

### 3.2 MLFlow config

The wrapper class defined in this section may look like extra machinery, but it has a single purpose.
It lets us log a sklearn model with MLflow so that the logged model outputs a prediction as a probability, not a label.

#### 3.2.1 Create Experiment

In [7]:
# Define experiment name
experiment_name = 'Hep C Model Training'

# Set the active experiment. mlflow.set_experiment creates the experiment if it does not
# exist yet, so no separate create_experiment call (or bare except) is needed.
mlflow.set_experiment(experiment_name)

<Experiment: artifact_location='file:///c:/Users/pjg1303/Documents2/Python-Record-Linkage/Scripts/Train%20Model/mlruns/345556906167296312', creation_time=1695401782783, experiment_id='345556906167296312', last_update_time=1695401782783, lifecycle_stage='active', name='Hep C Model Training', tags={}>

#### 3.2.2 Define Run

In [8]:
# Naming it Iter1 (Iteration 1), 10 linking fields (to capture number of raw data fields required)
# [fname, lname, dob, add, zip, city, county, state, phone, ssn]
run_name = 'Iter1, 10 linking fields'

#### 3.2.3 Define wrapper/signature

The `SklearnModelWrapper` class is a special piece of code that we use to make our machine learning model work well with a tool called MLflow. Let's break down what it does and why we use it:

<b><u>What It Does</u></b>

- **Wraps Around Our Model**: Think of this class as a special box that holds our machine learning model. It adds some extra features that make the model compatible with MLflow.

- **Customizes Predictions**: Inside this box, we can change how the model makes predictions. In our case, we want the probability of the positive class rather than a hard label, so the `predict` method returns the second column of `predict_proba`.

<b><u>Why We Use It</u></b>

- **Works with MLflow**: MLflow is a tool that helps manage machine learning projects. By using this wrapper, our model can easily be used with MLflow, making things like tracking and sharing the model much easier.

- **Makes Things Flexible**: By putting our model in this special box, we can quickly change how it works without affecting the rest of our code. If we want to use a different model or make a different prediction, we can do so easily.

- **Keeps It Simple**: Even though it might look a bit complex, this wrapper actually makes our code simpler. It takes care of some technical details, so we don't have to worry about them in other parts of our code.


In short, `SklearnModelWrapper` is an adapter: it makes the model loggable through MLflow's `pyfunc` interface and makes `predict` return the probability of the positive class.

In [9]:
class SklearnModelWrapper(mlflow.pyfunc.PythonModel):
    """A wrapper class for sklearn models that implements the mlflow.pyfunc.PythonModel interface."""

    def __init__(self, model):
        self.model = model

    def predict(self, context, model_input):
        return self.model.predict_proba(model_input)[:, 1]
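
Because the wrapper returns probabilities rather than labels, whoever consumes the model decides where to cut. A minimal sketch, assuming a hypothetical `to_labels` helper and illustrative thresholds (neither appears in this notebook):

```python
def to_labels(probabilities, threshold=0.5):
    """Convert probabilities to 0/1 labels at the given threshold.

    Hypothetical helper for illustration only.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


# Made-up probabilities standing in for the wrapper's output
example_probs = [0.02, 0.48, 0.51, 0.97]

print(to_labels(example_probs))                 # [0, 0, 1, 1]
print(to_labels(example_probs, threshold=0.9))  # [0, 0, 0, 1]
```

Raising the threshold generally trades recall for precision, which is one reason to log probabilities and keep that choice open.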

#### 3.2.4 Define Metrics Log

We created the `log_metrics()` function to keep metric logging in one place. It computes AUC from the predicted probabilities and F1, accuracy, precision, and recall from the binary predictions, then logs each value to the active MLflow run. Keeping this in a separate function makes the experiment loop easier to read and the set of metrics easier to change.

In [10]:
def log_metrics(y_test, pred_proba_test, predict_binary_test):
    """Log various performance metrics."""

    test_auc_score = roc_auc_score(y_test, pred_proba_test)
    mlflow.log_metric('auc', test_auc_score)
    test_f1_score = f1_score(y_test, predict_binary_test)
    mlflow.log_metric('f1', test_f1_score)
    test_accuracy = accuracy_score(y_test, predict_binary_test)
    mlflow.log_metric('accuracy', test_accuracy)
    test_precision = precision_score(y_test, predict_binary_test)
    mlflow.log_metric('precision', test_precision)
    test_recall = recall_score(y_test, predict_binary_test)
    mlflow.log_metric('recall', test_recall)
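
To make the label-based metrics concrete, here is a small worked example that computes them by hand from confusion-matrix counts. The labels and predictions are made up for illustration, and the variable names are not from this notebook. AUC is left out because it is computed from the probabilities rather than the binary predictions.

```python
# Made-up ground truth and binary predictions (1 = positive class)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(pairs)

print(tp, fp, fn, tn)               # 3 2 1 2
print(precision, recall, accuracy)  # 0.6 0.75 0.625
print(round(f1, 3))                 # 0.667
```

For binary labels with 1 as the positive class, these are the same quantities that `precision_score`, `recall_score`, `f1_score`, and `accuracy_score` report.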

### 3.3 Conduct Experiment

In [11]:
def perform_run(run_name, param_grid, n_runs, X_train, y_train, X_test, y_test):
    """
    Perform a series of runs in MLflow, sampling each run's hyperparameters at random from the hyperparameter grid.

    Args:
        run_name (str): The name of the run, to be displayed in MLflow.
        param_grid (list of lists): The hyperparameter grid containing lists of possible values for each hyperparameter.
        n_runs (int): The number of runs to be performed.
        X_train (array-like): The training feature data.
        y_train (array-like): The training target data.
        X_test (array-like): The testing feature data.
        y_test (array-like): The testing target data.

    Returns:
        None

    This function performs n_runs runs. For each run it picks one value at random for every hyperparameter, trains the model, and logs metrics to MLflow. It also wraps the model using the SklearnModelWrapper class, then logs the wrapped model to MLflow.
    """

    # With MLflow autologging, hyperparameters and the trained model are automatically logged to MLflow.
    mlflow.sklearn.autolog()

    # Perform n_runs runs, each with randomly chosen hyperparameters
    for _ in range(n_runs):

        # Choose current run parameters randomly
        cur_run_params = [random.choice(param) for param in param_grid]

        # Start a new run in MLflow, giving it a specific name.
        with mlflow.start_run(run_name=run_name):

            # Build the model with this run's hyperparameters
            model = MLPClassifier(hidden_layer_sizes=cur_run_params[0],
                                  activation=cur_run_params[1],
                                  solver=cur_run_params[2],
                                  alpha=cur_run_params[3],
                                  learning_rate=cur_run_params[4])
            
            # Log the model type as a run parameter
            mlflow.log_param('Model Type', 'Sklearn-MLPClassifier')

            # Fit model to training data, track time it took
            fit_start = time.time()
            model.fit(X_train, y_train)
            fit_end = time.time()
            mlflow.log_metric('fit_time', fit_end - fit_start)

            # Predict on testing data, track time it took
            predict_start = time.time()
            pred_proba_test = model.predict_proba(X_test)[:, 1]
            predict_end = time.time()
            mlflow.log_metric('predict_time', predict_end - predict_start)

            # Log various metrics
            log_metrics(y_test, pred_proba_test, model.predict(X_test))

            # Wrap and log the model
            wrapped_model = SklearnModelWrapper(model)
            signature = infer_signature(X_train, wrapped_model.predict(None, X_train))
            mlflow.pyfunc.log_model(run_name, python_model=wrapped_model, signature=signature)

In [12]:
# Define how many runs we want. The grid has 5*2*2*2*2 = 80 combinations. Because each run draws
# its parameters at random, the same combination can be picked more than once.
n_runs = 20

# Perform the runs!
perform_run(run_name, param_grid, n_runs, X_train, y_train, X_test, y_test)

  inputs = _infer_schema(model_input)


## 4. Inspect Output

Open a CMD terminal.  

First, you'll need to activate your virtual environment for access to the `mlflow` command.  Do this by typing:

```cmd
.venv\Scripts\activate
```

You can tell that the virtual environment is active because `(.venv)` appears at the start of your terminal prompt.  
Now navigate to the directory where this code lives. If you're starting in the root repo folder, this will look like:

```cmd
cd "Scripts\Train Model"
```

Then launch the MLflow UI by typing:

```cmd
mlflow ui
```

By default the UI is served at http://127.0.0.1:5000; open that address in a browser. There you can inspect runs, compare different models, and select one to register.