# Algorithm-Agnostic Model Building with MLflow

One common challenge in MLOps is the hassle of migrating between algorithms or frameworks. This beginner-friendly article shows how to tackle that challenge with algorithm-agnostic model building using `mlflow.pyfunc`.

Consider this scenario: we have an sklearn model deployed in production for a particular use case. Later on, we find that a deep learning model performs even better. If the sklearn model was deployed in its native format, transitioning to the deep learning model could be a hassle 🤪 because the two model artifacts are very different.

## MLflow pyfunc flavor

To address this challenge, the `mlflow.pyfunc` model flavor provides a versatile, generic approach to building and deploying machine learning models in Python. 😎

1. **Generic Model Building:**
The pyfunc model flavor offers a generic way to build models, regardless of the framework or library used to create the model.
2. **Unified Model Representation:**
We can deploy a model, or any Python function, built with pyfunc without worrying about the model's underlying format. Such a unified representation simplifies model deployment, redeployment, and downstream scoring.
3. **Encapsulation of the ML Pipeline:**
pyfunc allows us to encapsulate the model with its pre- and post-processing steps or any custom logic desirable during model consumption.
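
The three points above rest on one shared `predict(context, model_input)` signature. As a minimal, framework-free sketch of that idea (it deliberately does not import MLflow, and `PyfuncLike`, `MeanBaseline`, `LinearRule`, and `score` are hypothetical names introduced only for illustration), two unrelated models can be swapped behind the same scoring code:

```python
from typing import Any, List, Optional, Protocol


class PyfuncLike(Protocol):
    """Hypothetical stand-in for the predict signature used by pyfunc models."""

    def predict(self, context: Optional[Any], model_input: List[float]) -> List[float]:
        ...


class MeanBaseline:
    """Always predicts the mean of the training targets."""

    def __init__(self, y_train: List[float]) -> None:
        self.mean = sum(y_train) / len(y_train)

    def predict(self, context: Optional[Any], model_input: List[float]) -> List[float]:
        return [self.mean for _ in model_input]


class LinearRule:
    """Predicts slope * x + intercept for each input value."""

    def __init__(self, slope: float, intercept: float) -> None:
        self.slope = slope
        self.intercept = intercept

    def predict(self, context: Optional[Any], model_input: List[float]) -> List[float]:
        return [self.slope * x + self.intercept for x in model_input]


def score(model: PyfuncLike, batch: List[float]) -> List[float]:
    """Downstream scoring code: it depends only on the shared signature."""
    return model.predict(None, batch)


batch = [1.0, 2.0, 3.0]
baseline_preds = score(MeanBaseline([2.0, 4.0, 6.0]), batch)
linear_preds = score(LinearRule(slope=2.0, intercept=0.5), batch)
print(baseline_preds)  # [4.0, 4.0, 4.0]
print(linear_preds)    # [2.5, 4.5, 6.5]
```

In a real project each class would subclass `mlflow.pyfunc.PythonModel`, so that replacing one model with another leaves `score` and any other consumer code untouched.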

## Demo
Below is an `mlflow.pyfunc` demo. Please refer to the [medium article published at Towards Data Science](https://medium.com/towards-data-science/algorithm-agnostic-model-building-with-mlflow-b106a5a29535) for more detailed explanations.


In [14]:
import mlflow
import pandas as pd

### `pyfunc` Simplest Toy Model

In [15]:
class ToyModel(mlflow.pyfunc.PythonModel):
    """
    ToyModel is a simple example implementation of an MLflow Python model.
    """
    
    def predict(self, context, model_input):
        """
        A basic predict function that takes a model_input list and returns a new list where each element is incremented by one.

        Parameters:
        - context (Any): An optional context parameter provided by MLflow.
        - model_input (list of int or float): A list of numerical values that the model will use for prediction.

        Returns:
        - list of int or float: A list with each element in model_input incremented by one.
        """
        return [x + 1 for x in model_input]


In [16]:
# log this toy model as an MLflow run
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=ToyModel(),
    )
    run_id = mlflow.active_run().info.run_id

In [7]:
# load the model and perform inference
model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")
# dummy new data
x_new = [1,2,3]
# model inference for the new data
print(model.predict(x_new))


[2, 3, 4]


### `pyfunc` Encapsulated ML Pipeline

In [9]:
import xgboost as xgb
import pandas as pd


class XGB_PIPELINE(mlflow.pyfunc.PythonModel):
    """
    XGB_PIPELINE is an example implementation of an MLflow Python model that wraps XGBoost together with its preprocessing.
    """
    
    def __init__(self, params):
        """
        Initialize the model with given parameters.

        Parameters:
        - params (Dict[str, Union[str, int, float]]): Parameters for the XGBoost model.
        """
        self.params = params
        self.xgb_model = None  # set by fit(); predict() reads this attribute

    def preprocess_input(self, model_input):
        """
        Preprocess the input data.

        Parameters:
        - model_input (pd.DataFrame): The input data to preprocess.

        Returns:
        - pd.DataFrame: The preprocessed input data.
        """
        processed_input = model_input.copy()
        # Put any desired preprocessing logic here
        processed_input.drop(processed_input.columns[0], axis=1, inplace=True)

        return processed_input

    def fit(self, X_train, y_train):
        """
        Train the XGBoost model.

        Parameters:
        - X_train (pd.DataFrame): The training input data.
        - y_train (pd.Series): The target values.
        """
        processed_model_input = self.preprocess_input(X_train.copy())
        dtrain = xgb.DMatrix(processed_model_input, label=y_train)
        self.xgb_model = xgb.train(self.params, dtrain)

    def predict(self, context, model_input):
        """
        Predict using the trained XGBoost model.

        Parameters:
        - context (Any): The context object provided by MLflow.
        - model_input (pd.DataFrame): The input data for making predictions.

        Returns:
        - Any: The prediction results.
        """
        processed_model_input = self.preprocess_input(model_input.copy())
        dmatrix = xgb.DMatrix(processed_model_input)
        return self.xgb_model.predict(dmatrix) 
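
The class above implements only the preprocessing half of the pipeline; the same pattern extends to post-processing. The following plain-Python sketch shows both steps wrapped around the model call. It makes several assumptions: a hypothetical `inner_predict` callable stands in for the trained XGBoost booster, the clipping bounds are invented for illustration, and rows are plain lists rather than a `pd.DataFrame`.

```python
from typing import Any, Callable, List, Optional

Row = List[float]


class PipelineSketch:
    """Hypothetical wrapper: preprocessing -> model -> post-processing."""

    def __init__(
        self,
        inner_predict: Callable[[List[Row]], List[float]],
        lower: float,
        upper: float,
    ) -> None:
        self.inner_predict = inner_predict
        self.lower = lower
        self.upper = upper

    def preprocess_input(self, rows: List[Row]) -> List[Row]:
        # Mirror the demo: drop the first column of every row.
        return [row[1:] for row in rows]

    def postprocess_output(self, preds: List[float]) -> List[float]:
        # Clip each prediction into the [lower, upper] range.
        return [min(max(p, self.lower), self.upper) for p in preds]

    def predict(self, context: Optional[Any], model_input: List[Row]) -> List[float]:
        features = self.preprocess_input(model_input)
        raw_preds = self.inner_predict(features)
        return self.postprocess_output(raw_preds)


def sum_features(rows: List[Row]) -> List[float]:
    """Stand-in "model": the prediction is the sum of the remaining features."""
    return [sum(row) for row in rows]


pipeline = PipelineSketch(sum_features, lower=0.0, upper=10.0)
clipped = pipeline.predict(None, [[99.0, 1.0, 2.0], [7.0, 10.0, 20.0]])
print(clipped)  # [3.0, 10.0]
```

Because both steps live inside `predict`, a consumer of the model never needs to know that a column is dropped or that outputs are clipped.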

In [10]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import pandas as pd

# Generate synthetic datasets for demo
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
# train and log the model
with mlflow.start_run(run_name = 'xgb_demo') as run:

    # Create an instance of XGB_PIPELINE
    params = {
        'objective': 'reg:squarederror',  
        'max_depth': 3,  
        'learning_rate': 0.1,
    }
    model = XGB_PIPELINE(params)

    # Fit the model
    model.fit(X_train=pd.DataFrame(X_train), y_train=y_train)

    # Log the model
    model_info = mlflow.pyfunc.log_model(
        artifact_path = 'model',
        python_model = model,
    )

    run_id = mlflow.active_run().info.run_id

### Deep Dive into the `mlflow.pyfunc` Object

In [36]:
print(model_info.utc_time_created)
print(model_info.run_id)
print(model_info.model_uri)
print(model_info.mlflow_version)


2024-07-20 12:42:35.994182
38a617d0f30645e8ae95eea4642a03c2
runs:/38a617d0f30645e8ae95eea4642a03c2/model
2.14.3


In [37]:

print(model_info.model_uri)
print(run_id)

runs:/38a617d0f30645e8ae95eea4642a03c2/model
38a617d0f30645e8ae95eea4642a03c2


In [12]:
loaded_model = mlflow.pyfunc.load_model(model_uri=model_info.model_uri) 
loaded_model.predict(pd.DataFrame(X_test))

array([ 4.11692047e+00,  7.30551958e+00, -2.36042137e+01, -1.31888123e+02,
       -3.59597740e+01,  1.54358311e+01, -8.11709061e+01, -7.20540428e+00,
       -1.30709152e+01,  6.39794998e+01, -9.27280197e+01, -2.26579022e+00,
        3.47381516e+01, -6.19107590e+01,  3.34622955e+01, -1.14708206e+02,
       -6.09642944e+01, -1.33605647e+00,  6.94177933e+01, -6.88880005e+01,
        1.56064920e+01,  7.54498520e+01,  4.88155556e+01,  1.72110510e+00,
        1.13513260e+02,  3.03793182e+01, -1.31428665e+02,  7.69099426e+01,
        6.46578903e+01,  1.00327553e+02,  5.51982651e+01, -4.73030014e+01,
       -5.56093407e+01,  3.03793182e+01, -4.19037476e+01, -5.03655167e+01,
       -1.31638784e+01, -1.20837240e+01,  3.34622955e+01, -2.36042137e+01,
        2.69025154e+01, -2.87856426e+01,  2.63542976e+01,  4.22503738e+01,
       -4.69940872e+01, -1.99425220e-02, -5.63392944e+01,  3.39756622e+01,
        2.17549591e+01, -4.07176819e+01,  2.15980854e+01,  4.64821968e+01,
        9.28865204e+01, -

In [13]:
unwrapped_model = loaded_model.unwrap_python_model()
# need to provide context manually when performing inference with the unwrapped model
unwrapped_model.predict(context=None, model_input=pd.DataFrame(X_test))


array([ 4.11692047e+00,  7.30551958e+00, -2.36042137e+01, -1.31888123e+02,
       -3.59597740e+01,  1.54358311e+01, -8.11709061e+01, -7.20540428e+00,
       -1.30709152e+01,  6.39794998e+01, -9.27280197e+01, -2.26579022e+00,
        3.47381516e+01, -6.19107590e+01,  3.34622955e+01, -1.14708206e+02,
       -6.09642944e+01, -1.33605647e+00,  6.94177933e+01, -6.88880005e+01,
        1.56064920e+01,  7.54498520e+01,  4.88155556e+01,  1.72110510e+00,
        1.13513260e+02,  3.03793182e+01, -1.31428665e+02,  7.69099426e+01,
        6.46578903e+01,  1.00327553e+02,  5.51982651e+01, -4.73030014e+01,
       -5.56093407e+01,  3.03793182e+01, -4.19037476e+01, -5.03655167e+01,
       -1.31638784e+01, -1.20837240e+01,  3.34622955e+01, -2.36042137e+01,
        2.69025154e+01, -2.87856426e+01,  2.63542976e+01,  4.22503738e+01,
       -4.69940872e+01, -1.99425220e-02, -5.63392944e+01,  3.39756622e+01,
        2.17549591e+01, -4.07176819e+01,  2.15980854e+01,  4.64821968e+01,
        9.28865204e+01, -
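
The loaded model accepts only the data, while the unwrapped model also needs `context`. A simplified mental model may help explain the difference. The sketch below is not MLflow's actual implementation: `LoadedModelSketch` is a hypothetical stand-in written in plain Python, assuming only that the loaded object forwards calls to the user-defined `predict` and fills in the context itself.

```python
from typing import Any, List, Optional


class ToyModel:
    """Same logic as the toy model earlier, without the MLflow base class."""

    def predict(self, context: Optional[Any], model_input: List[float]) -> List[float]:
        return [x + 1 for x in model_input]


class LoadedModelSketch:
    """Hypothetical, simplified stand-in for the object returned by load_model."""

    def __init__(self, python_model: ToyModel, context: Optional[Any] = None) -> None:
        self._python_model = python_model
        self._context = context

    def predict(self, data: List[float]) -> List[float]:
        # The wrapper supplies the context, so callers pass only the data.
        return self._python_model.predict(self._context, data)

    def unwrap_python_model(self) -> ToyModel:
        # Hand back the user-defined object, which still expects a context.
        return self._python_model


loaded = LoadedModelSketch(ToyModel())
wrapped_preds = loaded.predict([1, 2, 3])
unwrapped_preds = loaded.unwrap_python_model().predict(context=None, model_input=[1, 2, 3])
print(wrapped_preds)    # [2, 3, 4]
print(unwrapped_preds)  # [2, 3, 4]
```

Under this reading, unwrapping is mainly useful for reaching custom attributes and methods such as `params`, while routine inference can stay on the loaded model.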

In [14]:
print(unwrapped_model.params)

{'objective': 'reg:squarederror', 'max_depth': 3, 'learning_rate': 0.1}


In [15]:
print(loaded_model.metadata.artifact_path)
print(loaded_model.metadata.run_id)

model
848e2039df78422a95ece7deb3b0fef9


In [16]:
mlflow.pyfunc.get_model_dependencies(model_info.model_uri)

2024/07/21 16:21:10 INFO mlflow.pyfunc: To install the dependencies that were used to train the model, run the following command: '%pip install -r C:\Users\ningw\Desktop\Repo\mlflow-demo\mlruns\0\848e2039df78422a95ece7deb3b0fef9\artifacts\model\requirements.txt'.


'C:\\Users\\ningw\\Desktop\\Repo\\mlflow-demo\\mlruns\\0\\848e2039df78422a95ece7deb3b0fef9\\artifacts\\model\\requirements.txt'