# Algorithm-Agnostic Model Deployment with Mlflow

One common challenge in MLOps is the need to migrate between various estimators or algorithms to achieve the optimal solution for a business problem.

Consider a scenario where we had a scikit-learn (sklearn) model deployed in production for a specific use case. Later, we discovered that a deep learning model performed even better. In such a scenario, if the sklearn model was deployed in its native flavour, the switch to deep learning model could be a hassle because the two model artifacts are very different.

## MLflow pyfunc flavor

The `mlflow.pyfunc` model flavor offers a generic way of model building. It can serve as a unified, default model interface for all MLflow Python models, regardless of which persistence library, module or framework was used to produce the model. With pyfunc, we can deploy a python function without worrying about the underlying format of the model. Thanks to its unified model representations, pyfunc massively reduces the complexity of model deployment, redeployment and downstream scoring. 

What's more, this means that, not only the model, but also the full pipeline, encompassing elements such as pre- and post-processing steps or any arbitrary code we would like to execute during model loading, can all be encapsulated within the pyfunc object that works seamlessly with the rest of the mlflow ecosystem. 

Last but not least, pyfunc enable us to package the trained model pipeline in a platform-agnostic manner, which provides optimal flexibility in deployment options and facilitates model reuse across diverse platforms. 

Below is a demo of `mlflow.pyfunc`, where it is used to define a simple model pipeline class that encompass a random forest model with a particular preprocessing step. To get a feel of how this pyfunc object is then integrated into the rest of the mlflow ecosystem, let's also go through the steps of training and serving the model including passing on custom configuration, capture model signature and python environment, and finally load and apply the model to make predictions. 

In [10]:
import mlflow

class RFWithPreprocess(mlflow.pyfunc.PythonModel):

    def __init__(self, params):
        self.params = params
        self.rf_model = None
        self.config = None

    def load_context(self, context=None, config_path=None):
        """
        When loading a pyfunc, this method runs automatically with the related
        context.  This method is designed to perform the same functionality when
        run in a notebook or a downstream operation (like a REST endpoint).

        If the context object is provided, it will load the path to a config from
        that object (this happens with mlflow.pyfunc.load_model() is called).
        If the config_path argument is provided instead, it uses this argument
        in order to load in the config.
        """
        if context: # This block executes for server run
            config_path = context.artifacts['config_path']
        else: # This block executes for notebook run
            pass

        self.config = json.load(open(config_path))
    
    def preprocess_input(self, model_input):
        """
        return preprocessed model input. 
        """

        processed_input = model_input.copy()
        # put any desired logic here
        processed_input.drop(processed_input.columns[0], axis=1, inplace=True)

        return processed_input
    
    def fit(self, X_train, y_train):

        from sklearn.ensemble import RandomForestRegressor
        processed_model_input = self.preprocess_input(X_train.copy())
        rf_model = RandomForestRegressor(**self.params)
        rf_model.fit(processed_model_input, y_train)

        self.rf_model = rf_model

    def predict(self, context, model_input):
        processed_model_input = self.preprocess_input(model_input.copy())
        return self.rf_model.predict(processed_model_input)

# Train and Log the Model

In [11]:
# Data for demo
import pandas as pd
import sklearn
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Load the diabetes dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
X = pd.DataFrame(X)

# Create a DataFrame for visualization (optional)
import pandas as pd
df = pd.DataFrame(data=X, columns=diabetes.feature_names)
df['target'] = y

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Utilizing context

The `context` parameter is provided automatically by mlflow in downstream tools. This can be used to add custom dependent objecs such as models that are not easily serialized (e.g., `keras` models) or custom configurations files. 

Steps to provide a config file:
* save out any file we want to load into the class
* Create an artifact dictionary of key/value pairs where the value is the path to that object
* When saving the model, all artifacts will be copied over into the same directory for downstream use


In the example below, let's pass on some model hyperparameters into the config. 

In [12]:
params = {
    'n_estimators': 15,
    'max_depth': 5
}

In [13]:
with mlflow.start_run(run_name = 'test') as run:
    
    model = RFWithPreprocess(params)
    model.fit(X_train, y_train)

    mlflow.pyfunc.log_model(
        'test_pyfunc',
        python_model = model,
    )



In [6]:
import json
import os

config_path = "mlruns/data.json"
with open(config_path, "w") as f:
    json.dump(params, f)

artifacts = {'config_path': config_path}
print(artifacts)

{'config_path': 'mlruns/data.json'}


In [7]:
# This happens automatically in serving integrations
model.load_context(config_path = config_path)
print("model config:", model.config)
predictions = model.predict(context = None, model_input = X_test)

model config: {'n_estimators': 15, 'max_depth': 5}


## Generate model signature

In [8]:
from mlflow.models.signature import infer_signature

signature = infer_signature(X_test, predictions)
signature

inputs: 
  [0: double (required), 1: double (required), 2: double (required), 3: double (required), 4: double (required), 5: double (required), 6: double (required), 7: double (required), 8: double (required), 9: double (required)]
outputs: 
  [Tensor('float64', (-1,))]
params: 
  None

## Capture conda environment

This is necessary because when we use `mlflow.sklearn`, we automatically log the apprropriate version of `sklearn`. With a `pyfunc`, we must manually construct our deployment environment. 

In [9]:
from sys import version_info
import sklearn

conda_env = {
    "channels": ["defaults"],
    "dependencies": [
        f"python={version_info.major}.{version_info.minor}.{version_info.micro}",
        "pip",
        {"pip": ["mlflow",
                 f"sciket-learn=={sklearn.__version__}"]
        },
    ],
    "name": "sklearn_env"
}

conda_env

{'channels': ['defaults'],
 'dependencies': ['python=3.8.15',
  'pip',
  {'pip': ['mlflow', 'sciket-learn==1.2.0']}],
 'name': 'sklearn_env'}

## Log the model

We can log the model with rich info such as artifacts, conda_env, signature and input_example. 

In [10]:
# double-check env in respond to warning in the next cell, see backlog
import sklearn
print(sklearn.__version__)

1.2.0


In [11]:
with mlflow.start_run(run_name = 'test') as run:
    mlflow.pyfunc.log_model(
        'rf_preprocessed_model',
        python_model = model,
        artifacts = artifacts,
        conda_env=conda_env,
        signature=signature,
        input_example=X_test[:3]
    )

Downloading artifacts:   0%|          | 0/1 [00:00<?, ?it/s]

 - sciket-learn (current: uninstalled, required: sciket-learn==1.2.0)
To fix the mismatches, call `mlflow.pyfunc.get_model_dependencies(model_uri)` to fetch the model's environment and install dependencies using the resulting environment file.


# Load and Utilize the Model

In [12]:
mlflow_pyfunc_model_path = f"runs:/{run.info.run_id}/rf_preprocessed_model"
loaded_preprocess_model = mlflow.pyfunc.load_model(mlflow_pyfunc_model_path)

 - sciket-learn (current: uninstalled, required: sciket-learn==1.2.0)
To fix the mismatches, call `mlflow.pyfunc.get_model_dependencies(model_uri)` to fetch the model's environment and install dependencies using the resulting environment file.


## Apply the model

to make predictions

In [13]:
loaded_preprocess_model.predict(X_test)

array([158.48896884, 184.32338111, 157.3486947 , 250.67946537,
       116.10953609, 129.25752347, 248.23706751, 218.31714944,
       145.43777053, 175.50507834, 109.77701145, 172.44423645,
        90.35806064, 231.02801247, 106.28503907, 154.19536245,
       229.13414973, 247.54295669, 181.75428992, 211.87610517,
       187.58071778, 111.24837101,  82.73119571, 192.2444602 ,
       148.18506314, 179.78699311, 172.81527653, 110.4966332 ,
        82.73119571, 114.28501395, 168.95470798,  99.35855525,
       162.76867799, 202.87625786, 152.05677875, 220.14202136,
       120.39505122, 122.23399749, 170.92243273,  86.01504187,
        82.73119571,  81.51166825, 152.6721596 , 145.65301484,
       154.08324524,  92.94443125,  82.70940559, 112.48905089,
        78.58980894, 177.53117506, 132.73170066,  86.0199772 ,
       182.63527763, 100.08397662, 175.22945738, 162.00083005,
       105.52205377, 205.63997098, 110.33266442, 111.82741378,
       176.60377189, 178.64678837, 155.34659724,  96.90

## Access Model's Metadata

One really cool thing about `pyfunc` object is that it is automatically loaded with its metadata. 

In [15]:
run_id = loaded_preprocess_model.metadata.run_id
path = mlflow.artifacts.download_artifacts(run_id = run_id)
params = json.load(open(f"{path}/rf_preprocessed_model/artifacts/data.json"))
print("params:", params)
input_example = json.load(open(f"{path}/rf_preprocessed_model/input_example.json"))
print("input example:", input_example)

params: {'n_estimators': 15, 'max_depth': 5}
input example: {'data': [[0.04534098333546186, -0.044641636506989144, -0.006205954135807083, -0.015998975220305175, 0.12501870313429186, 0.1251981011367534, 0.019186997017453092, 0.03430885887772673, 0.03243232415655107, -0.005219804415300423], [0.09256398319871433, -0.044641636506989144, 0.0369065288194249, 0.0218723855140367, -0.0249601584096303, -0.016658152053905938, 0.0007788079970183853, -0.03949338287409329, -0.022516528376302174, -0.021788232074638245], [0.06350367559055897, 0.05068011873981862, -0.004050329988045492, -0.012556124244455912, 0.10300345740307394, 0.04878987646010685, 0.05600337505832251, -0.002592261998183278, 0.08449153066204618, -0.01764612515980379]]}


# Backlog

When logging the model with its conda environment, there is a warning saying sklearn is not installed in the current envionrment, which is not true. 

What's reassuring is that the conda_env did capture the correct module versions for the model, but the warnings are nonetheless annoying. Could this due to the notebook running in a virtual environment (the same happens when I am using either an anoconda or pipenv virtual env)? Note that this is not the case when similar commands are running in Databrick clusters. Will keep exploring and update any learning back here. 