# Operationalization of machine learning models

In this notebook, we cover some of the important themes around model operationalization. This is an extensive topic, and we do not try to be comprehensive here. Instead we learn about some essentials and look at an example of a library that makes this kind of work very easy for us: the `mlflow` library. To introduce you to the library, we go over their [own example](https://www.mlflow.org/docs/latest/tutorials-and-examples/tutorial.html) for running an experiment. But first some vocabulary:

- A **script** is some Python code we want to run, stored as a `.py` or `.ipynb` file. Usually, the script has a set of required or optional inputs we provide (just like a Python function). In `mlflow`, we refer to these inputs as **parameters**, but do NOT confuse this term with model parameters in ML.
- A **run** is what we get when we fix the inputs of a script to some values and execute the script. In the context of ML, the script could be a training script, its "parameters" could be the hyper-parameters of the model we wish to train, and a run is when we train a model with the hyper-parameters set to some fixed values.
- As part of a run we can log the **parameters** we used, the **metrics** we calculated such as training and test accuracy, and **artifacts** such as plots, tables, or trained models we save externally for reuse later. We can refer to these as run meta-data. In addition to the meta-data we log explicitly in the code, `mlflow` also logs some of its own meta-data such as run ID or run time.
- An **experiment** is a collection of related runs. To continue with the above example, if we execute the script several times, each time using another set of values for the hyper-parameters, then the experiment is the collection of all such runs. After executing all the runs, we can go to our experiment to compare them in terms of accuracy, run time, or whatever **metric** is of interest.

Note that the example we provide above is a "typical" example, and this is what we show in this notebook. But in general we can be flexible in what exactly we define as an experiment. The general idea is that from run to run, we change things and later we want to see what worked and what didn't by looking at metrics or artifacts generated by the model. A machine learning project can consist of one or several experiments. It all depends on the complexity of the project and how granularly we define individual runs. This is to some extent a matter of preference and can even be driven by business needs.
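To make the vocabulary concrete, here is a toy stand-in written with plain Python data classes. It is only a sketch of the concepts, not how `mlflow` stores or represents runs; the class names `Run` and `Experiment`, the `best_run` helper, and all parameter and metric values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One execution of a script with its inputs fixed to specific values."""
    params: dict                                    # inputs to the script (here: hyper-parameters)
    metrics: dict = field(default_factory=dict)     # numbers computed during the run
    artifacts: list = field(default_factory=list)   # paths to saved plots, tables, models


@dataclass
class Experiment:
    """A named collection of related runs."""
    name: str
    runs: list = field(default_factory=list)

    def best_run(self, metric, lower_is_better=True):
        # compare the runs on a single metric of interest
        chooser = min if lower_is_better else max
        return chooser(self.runs, key=lambda run: run.metrics[metric])


# made-up values, for illustration only
experiment = Experiment("toy_experiment")
experiment.runs.append(Run(params={"alpha": 0.1}, metrics={"rmse": 0.80}))
experiment.runs.append(Run(params={"alpha": 0.5}, metrics={"rmse": 0.75}))

best = experiment.best_run("rmse")
print(best.params)
```

Everything here lives in memory and disappears when the session ends; persisting runs, assigning run IDs, and browsing results are the parts a tracking library takes care of.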

Finally, of course, we can do a lot of this manually. After all, we know how to run scripts with different inputs and how to save plots or models to disk. Using a **version control** tool like Git, we can also track changes to the code. So why do we need `mlflow`? The answer is simple: it takes away most of the hassle that comes with doing such things manually, and on top of that it provides us with a UI where we can find all our runs and quickly compare them. There are other concepts in `mlflow` that we do not cover here, but we invite you to check out [their website](https://mlflow.org/).

To begin with, we create a folder to save not only the code, but also the meta-data generated by our runs. Once we begin to log runs, the project folder will be populated by such meta-data. You are advised against deleting the meta-data directly (the better way is to use the UI).

In [1]:
!pip install mlflow



In [2]:
import mlflow
import pandas as pd
import os

experiment_name = "predict_wine_quality"
project_folder = 'wine'

os.makedirs(project_folder, exist_ok = True)
os.makedirs(project_folder + '/code', exist_ok = True)
os.makedirs(project_folder + '/config', exist_ok = True)
os.makedirs(project_folder + '/data', exist_ok = True) # used later for the sample JSON payload

try:
    experiment_id = mlflow.create_experiment(experiment_name)
except Exception:
    # the experiment already exists, so look it up by name instead
    experiment = mlflow.get_experiment_by_name(experiment_name)
    experiment_id = experiment.experiment_id
    
mlflow.set_experiment(experiment_name)

<Experiment: artifact_location='file:///C:/Users/mam_0/Desktop/UW/Machine%20Learning/MLEARN%20520/Labs/mlruns/783477743455773257', creation_time=1687125924373, experiment_id='783477743455773257', last_update_time=1687125924373, lifecycle_stage='active', name='predict_wine_quality', tags={}>

### Exercise

Below is the script we wish to execute. A lot of the code should look familiar. Examine this script and try to point out the pieces that are new. What is the purpose of `sys.argv`? Notice how and where the `mlflow` library is used in the code. Finally, execute the script to make sure it works. There are several ways to execute a script:

- from the **command line** navigate to its folder and run `python train.py`
- from this **notebook** create a new cell and paste this `!python $project_folder/code/train.py`
- from this **notebook** create a new cell and paste this `%run $project_folder/code/train.py`

In order to execute the script, make sure you first run the cell below. Note that if you changed the name of the experiment in the cell above, you will also need to change it in the script in the cell below.

### End of exercise

In [3]:
%%writefile $project_folder/code/train.py
# The data set used in this example is from http://archive.ics.uci.edu/ml/datasets/Wine+Quality
# P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
# Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

import os
import warnings
import sys

import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from urllib.parse import urlparse
import mlflow
import mlflow.sklearn

import logging

logging.basicConfig(level = logging.WARN)
logger = logging.getLogger(__name__)


def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2


if __name__ == "__main__":
    warnings.filterwarnings("ignore")
    np.random.seed(40)

    # read the wine-quality csv file from the URL
    csv_url = (
        "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
    )
    try:
        data = pd.read_csv(csv_url, sep = ";")
    except Exception as e:
        logger.exception(
            "Unable to download training & test CSV, check your internet connection. Error: %s", e
        )
        raise  # without the data the rest of the script cannot run

    # split the data into training and test sets. (0.75, 0.25) split.
    train, test = train_test_split(data)

    # the predicted column is "quality" which is a scalar from [3, 9]
    train_x = train.drop(["quality"], axis = 1)
    test_x = test.drop(["quality"], axis = 1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]

    alpha = float(sys.argv[1]) if len(sys.argv) > 1 else 0.5
    l1_ratio = float(sys.argv[2]) if len(sys.argv) > 2 else 0.5

    mlflow.set_experiment("predict_wine_quality")
    # mlflow.autolog()
    with mlflow.start_run():
        
        run = mlflow.active_run()
        experiment = mlflow.get_experiment(run.info.experiment_id)
        print("Experiment ID: \"{}\"".format(run.info.experiment_id))
        print("Experiment name: \"{}\"".format(experiment.name))
        print("Run ID: \"{}\"".format(run.info.run_id))

        lr = ElasticNet(alpha = alpha, l1_ratio = l1_ratio, random_state = 42)
        lr.fit(train_x, train_y)

        predicted_qualities = lr.predict(test_x)

        (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

        print("Using alpha = {:0.2f}, l1_ratio = {:0.2f} we get the following metrics:".format(alpha, l1_ratio))
        print("  metric RMSE: {:6.2f}".format(rmse))
        print("  metric MAE: {:6.2f}".format(mae))
        print("  metric R-squared: {:0.2f}".format(r2))

        mlflow.log_param("alpha", alpha)
        mlflow.log_param("l1_ratio", l1_ratio)
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)
        mlflow.log_metric("mae", mae)

        tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme

        # model registry does not work with file store
        if tracking_url_type_store != "file":

            # register the model
            mlflow.sklearn.log_model(lr, "model", registered_model_name = "ElasticnetWineModel")
        else:
            mlflow.sklearn.log_model(lr, "model")

Writing wine/code/train.py
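The script reads its two inputs positionally from `sys.argv`, falling back to a default when an argument is missing. The sketch below isolates that pattern in a small function so it can be tried without running the whole script; `parse_inputs` is a hypothetical helper name and is not part of `train.py`.

```python
def parse_inputs(argv, default_alpha=0.5, default_l1_ratio=0.5):
    """Mimic how train.py reads alpha and l1_ratio from the command line.

    In the script, argv is sys.argv. argv[0] is the script name, so the
    first user-supplied value is argv[1].
    """
    alpha = float(argv[1]) if len(argv) > 1 else default_alpha
    l1_ratio = float(argv[2]) if len(argv) > 2 else default_l1_ratio
    return alpha, l1_ratio


# equivalent to: python train.py
no_args = parse_inputs(["train.py"])
# equivalent to: python train.py 0.25
one_arg = parse_inputs(["train.py", "0.25"])
# equivalent to: python train.py 0.25 0.1
two_args = parse_inputs(["train.py", "0.25", "0.1"])

print(no_args, one_arg, two_args)
```

Because the arguments are positional, there is no way to set `l1_ratio` without also passing `alpha`.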


Since we defined the above script with two inputs (what `mlflow` calls "parameters"), we can now change them to new values and execute the script again.

In [4]:
!python $project_folder/code/train.py 0.25 0.50

Experiment ID: "783477743455773257"
Experiment name: "predict_wine_quality"
Run ID: "69ee9d1e8b724cceb896384d4620d619"
Using alpha = 0.25, l1_ratio = 0.50 we get the following metrics:
  metric RMSE:   0.75
  metric MAE:   0.58
  metric R-squared: 0.21


The git executable must be specified in one of the following ways:
    - be included in your $PATH
    - be set via $GIT_PYTHON_GIT_EXECUTABLE
    - explicitly set via git.refresh()

All git commands will error until this is rectified.

$GIT_PYTHON_REFRESH environment variable. Use one of the following values:
    - error|e|raise|r|2: for a raised exception

Example:
    export GIT_PYTHON_REFRESH=quiet
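Since an experiment is a collection of runs over different inputs, one natural next step is to sweep a small grid of values. The sketch below only builds the command lines, one per run, and leaves executing them to the reader; the grid values are arbitrary, and the script path assumes the `wine/code/train.py` layout used above.

```python
from itertools import product

script = "wine/code/train.py"
alphas = [0.1, 0.5, 1.0]       # arbitrary example values
l1_ratios = [0.25, 0.75]       # arbitrary example values

# one command per (alpha, l1_ratio) combination, i.e. one run each
commands = [
    ["python", script, str(alpha), str(l1_ratio)]
    for alpha, l1_ratio in product(alphas, l1_ratios)
]

for command in commands:
    print(" ".join(command))
    # to actually launch the run, add `import subprocess` and uncomment:
    # subprocess.run(command, check=True)
```

Each command, once executed, would show up as a separate run under the same experiment.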



Let's now define an `mlflow` experiment and formalize what we did above. We create a file below that defines an `mlflow` project with its parameters and the command to be executed. Note that file paths are specified relative to the project directory.

In [5]:
%%writefile $project_folder/MLproject
name: Wine Quality Prediction

conda_env: config/conda.yaml

entry_points:
  main:
    parameters:
      alpha: float
      l1_ratio: {type: float, default: 0.1}
    command: "python code/train.py {alpha} {l1_ratio}"

Writing wine/MLproject
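The `command` entry is a template: placeholders in braces are replaced by the parameter values supplied at run time, and a parameter that is not supplied falls back to its declared default. The sketch below imitates that substitution with Python's `str.format`. It illustrates the idea only and is not `mlflow`'s actual implementation; `resolve_command` is a hypothetical helper.

```python
def resolve_command(template, declared, supplied):
    """Fill a command template from supplied values and declared defaults.

    declared maps each parameter name to its default, or None if it has none.
    """
    values = {}
    for name, default in declared.items():
        if name in supplied:
            values[name] = supplied[name]
        elif default is not None:
            values[name] = default
        else:
            raise ValueError(f"no value supplied for required parameter '{name}'")
    return template.format(**values)


template = "python code/train.py {alpha} {l1_ratio}"
declared = {"alpha": None, "l1_ratio": 0.1}   # mirrors the MLproject file above

command = resolve_command(template, declared, {"alpha": 0.42})
print(command)
```

In this sketch `alpha` has no default, so leaving it out raises an error, while `l1_ratio` can be omitted.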


The above file also points to a conda environment file which we create below. This file defines the Python runtime used by the experiment. So for example, as part of the experiment, we can update one of the packages listed below and execute a new run to see if the update breaks our script.

In [6]:
%%writefile $project_folder/config/conda.yaml
channels:
  - defaults
dependencies:
  - numpy=1.14.3
  - pandas=0.22.0
  - pip:
    - mlflow
    - scikit-learn==0.24.1

Writing wine/config/conda.yaml


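To execute our experiment, we use the `mlflow` command. This is very similar to the way we executed the script earlier, but instead of pointing to the script, we point to the project folder and provide the experiment name. Parameters are passed with `-P`; since we do not pass `l1_ratio` here, it falls back to its default of 0.1.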

In [7]:
!mlflow run $project_folder --experiment-name $experiment_name -P alpha=0.42

Experiment ID: "783477743455773257"
Experiment name: "predict_wine_quality"
Run ID: "4320de8fba754a47a81c1e52dfca8376"
Using alpha = 0.42, l1_ratio = 0.10 we get the following metrics:
  metric RMSE:   0.74
  metric MAE:   0.57
  metric R-squared: 0.22


The git executable must be specified in one of the following ways:
    - be included in your $PATH
    - be set via $GIT_PYTHON_GIT_EXECUTABLE
    - explicitly set via git.refresh()

All git commands will error until this is rectified.

$GIT_PYTHON_REFRESH environment variable. Use one of the following values:
    - error|e|raise|r|2: for a raised exception

Example:
    export GIT_PYTHON_REFRESH=quiet


We can also run the above command from the **command line**, as we will see in the following exercise. Next, we load the model that was logged by the run above and use it to predict on a few rows of data.

In [11]:
logged_model = './mlruns/783477743455773257/4320de8fba754a47a81c1e52dfca8376/artifacts/model'
loaded_model = mlflow.pyfunc.load_model(logged_model) # load model as a PyFuncModel.
df_wine_sample = pd.read_csv('../data/wine.csv').drop(columns = ['quality', 'Class']).head() # load some data
loaded_model.predict(df_wine_sample) # predict on a pandas.DataFrame

 - cloudpickle (current: 2.0.0, required: cloudpickle==2.2.1)
 - scikit-learn (current: 1.2.1, required: scikit-learn==0.24.1)
To fix the mismatches, call `mlflow.pyfunc.get_model_dependencies(model_uri)` to fetch the model's environment and install dependencies using the resulting environment file.


array([5.367283  , 5.4342527 , 5.41406653, 5.57125136, 5.367283  ])

Based on the business need, we can also go one step further and serve the model over HTTP as a **scoring service**. This makes the model behave like an application. To do so, run the next cell, then copy its **output** and run it from the command line. Note that you can only run `mlflow` commands from the **Anaconda prompt** after activating the environment that `mlflow` is installed in.

In [12]:
!echo mlflow models serve -m $logged_model -p 1234

mlflow models serve -m ./mlruns/783477743455773257/4320de8fba754a47a81c1e52dfca8376/artifacts/model -p 1234


Examine the output as you run the above command. We should see the conda environment being created before the model is served. Once the model is ready, the HTTP URL is shown as well.

The data we send to the model must be in JSON format, which is one of the most common formats that applications use to send data to each other. In this context, the data is sometimes referred to as the **payload**. Here is an example of what the data should look like in our case:

In [14]:
%%writefile $project_folder/data/input_sample.json
{"columns":["fixed acidity", "volatile acidity", "citric acid", "residual sugar", "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density", "pH", "sulphates", "alcohol"], 
 "index":[0, 1, 2, 3, 4], 
 "data":[
     [7.4,  0.7,  0.0,  1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4], 
     [7.8,  0.88, 0.0,  2.6, 0.098, 25.0, 67.0, 0.9968, 3.2,  0.68, 9.8], 
     [7.8,  0.76, 0.04, 2.3, 0.092, 15.0, 54.0, 0.997,  3.26, 0.65, 9.8], 
     [11.2, 0.28, 0.56, 1.9, 0.075, 17.0, 60.0, 0.998,  3.16, 0.58, 9.8], 
     [7.4,  0.7,  0.0,  1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4]]
}

Writing wine/data/input_sample.json
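In this split-oriented layout, `columns` holds the column names once, and every entry in `data` is one row whose values line up with those names. Here is a minimal sketch of a sanity check for such a payload, using only the standard library and a shortened two-column payload with values taken from the first two rows above; `check_split_payload` is a hypothetical helper.

```python
import json


def check_split_payload(payload):
    """Return the rows as dictionaries, or raise if a row does not match the columns."""
    columns = payload["columns"]
    rows = []
    for position, values in enumerate(payload["data"]):
        if len(values) != len(columns):
            raise ValueError(
                f"row {position} has {len(values)} values, expected {len(columns)}"
            )
        rows.append(dict(zip(columns, values)))
    return rows


# shortened payload for illustration; the real file has 11 columns
text = '{"columns": ["pH", "alcohol"], "index": [0, 1], "data": [[3.51, 9.4], [3.2, 9.8]]}'
rows = check_split_payload(json.loads(text))
print(rows)
```

A check like this catches a missing or extra value in a row before the request is ever sent.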


To send a request to the model, we can use the `curl` command, or any REST API client like [Postman](https://www.postman.com/). Here is what the `curl` command looks like, which you can run on Linux or on Windows using [WSL](https://docs.microsoft.com/en-us/windows/wsl/install-win10).

In [15]:
!echo curl -X POST -H "Content-Type:application/json; format=pandas-split" --data @$project_folder/data/input_sample.json http://127.0.0.1:1234/invocations

curl -X POST -H "Content-Type:application/json; format=pandas-split" --data @wine/data/input_sample.json http://127.0.0.1:1234/invocations


Here's the output you should see by running the above command.

    [5.422102809496764, 5.448114600770513, 5.444533999028288, 5.513957675441143, 5.422102809496764]
    
If we get errors, two possible reasons are:
- We need to first run `conda activate <environment-name>` to activate the Conda environment in which `mlflow` is installed.
- We need to navigate to the folder where the notebook is running. This is because we set up the code so that paths are specified relative to this folder. You can run `print(os.getcwd())` to see the path, and then `cd` into it.
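The same request can also be sent from Python. The sketch below uses only the standard library and assumes the model is being served at `http://127.0.0.1:1234` as above. One caveat, which depends on the installed version: MLflow 2.x scoring servers expect the split-oriented data wrapped under a `dataframe_split` key with a plain `application/json` content type, whereas the `format=pandas-split` header in the `curl` command is the MLflow 1.x convention. `build_request` and `score` are hypothetical helper names.

```python
import json
import urllib.request


def build_request(payload, url="http://127.0.0.1:1234/invocations", wrap_for_mlflow_2=True):
    """Build (but do not send) a POST request for the scoring service."""
    if wrap_for_mlflow_2:
        # assumption: MLflow 2.x style body
        body = {"dataframe_split": {"columns": payload["columns"], "data": payload["data"]}}
        content_type = "application/json"
    else:
        # MLflow 1.x style, matching the curl command above
        body = payload
        content_type = "application/json; format=pandas-split"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": content_type},
        method="POST",
    )


def score(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# shortened payload for illustration; use the contents of input_sample.json in practice
payload = {"columns": ["pH", "alcohol"], "index": [0], "data": [[3.51, 9.4]]}
request = build_request(payload)
print(request.get_method(), request.full_url)
# with the server running: predictions = score(request)
```

Building the request separately from sending it makes it easy to inspect the body and headers when the server rejects a payload.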

Let's finish by pointing out two important aspects about `mlflow` here:
- Everything we did here is "local", meaning that all meta-data is being saved to a local file path, but in most production systems we use the cloud both for storage and for serving such models in production. For example, look [here](https://mlflow.org/docs/latest/models.html#deploy-a-python-function-model-on-microsoft-azure-ml) for an example of deployment in Azure. There are similar "plug-ins" for other cloud providers.
- As we saw, there are three ways to interact with `mlflow`: through the Python library, through the command line, and through the UI. Which one we use depends to some extent on what we want to do. For example, to log metrics, it makes sense to use the Python library and embed `mlflow` in the code. To run experiments and serve models we used the command line, and to see and compare runs we used the UI. In most cases we can also use the Python library instead, so it is partly a matter of preference. As an example, take a look at the next cell, which returns a `DataFrame` with meta-data for runs under our experiment.

In [17]:
mlflow.search_runs(experiment_id).head()

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.rmse,metrics.r2,metrics.mae,params.l1_ratio,params.alpha,tags.mlflow.project.entryPoint,tags.mlflow.runName,tags.mlflow.log-model.history,tags.mlflow.source.type,tags.mlflow.user,tags.mlflow.project.env,tags.mlflow.source.name,tags.mlflow.project.backend
0,4320de8fba754a47a81c1e52dfca8376,783477743455773257,FINISHED,file:///C:/Users/mam_0/Desktop/UW/Machine%20Le...,2023-06-18 22:05:46.540000+00:00,2023-06-18 22:06:04.135000+00:00,0.742062,0.219785,0.572285,0.1,0.42,main,fun-dolphin-822,"[{""run_id"": ""4320de8fba754a47a81c1e52dfca8376""...",PROJECT,mam_0,conda,C:\Users\mam_0\Desktop\UW\Machine Learning\MLE...,local
1,69ee9d1e8b724cceb896384d4620d619,783477743455773257,FINISHED,file:///C:/Users/mam_0/Desktop/UW/Machine%20Le...,2023-06-18 22:05:32.044000+00:00,2023-06-18 22:05:40.888000+00:00,0.748931,0.205275,0.580695,0.5,0.25,,enthused-loon-668,"[{""run_id"": ""69ee9d1e8b724cceb896384d4620d619""...",LOCAL,mam_0,,wine/code/train.py,


In [18]:
!mlflow ui

^C
