When a new model arrives, we want to ask a few questions. What has changed from the previous version of the model to the new one? Is any preprocessing needed? Which extra libraries do we need to run the new model?

And what if we hit issues while running the new model in production and have to roll back to the old model? Then we need to know where the old model is stored.

During an ML task, we use the MLflow Tracking Server to log parameters, metrics, artifacts, and many different model versions.

Once we believe a model is fit for production, we "register" it in the MLflow Model Registry.

The MLflow Model Registry is where we store the production-ready models. Whenever a deployment engineer wants to update the deployed models, they can look in the Model Registry to find the new production-ready ones.

The MLflow Model Registry component is a centralized model store, set of APIs, and UI, to collaboratively manage the full lifecycle of an MLflow Model. It provides model lineage (which MLflow experiment and run produced the model), model versioning, model aliasing, model tagging, and annotations.

The Model Registry does not deploy models; it only stores the models that are production ready.

In [39]:
import pickle
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_extraction import DictVectorizer

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import LinearSVR
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.model_selection import GridSearchCV

from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

from sklearn.metrics import root_mean_squared_error

from sklearn.model_selection import cross_val_score, KFold

In [3]:
import mlflow

# Set our tracking server uri for logging
mlflow.set_tracking_uri(uri="http://127.0.0.1:5000")

# Create a new MLflow Experiment - Inside an experiment, there will be Runs
mlflow.set_experiment("taxi-model-registry")

2024/08/31 14:42:21 INFO mlflow.tracking.fluent: Experiment with name 'taxi-model-registry' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/478746432289998830', creation_time=1725095541014, experiment_id='478746432289998830', last_update_time=1725095541014, lifecycle_stage='active', name='taxi-model-registry', tags={}>

In [4]:
# a function to read the data, preprocess it and return it
def read_and_preprocess(filename):
    data = pd.read_parquet(filename)
    
    # create the target variable: ride duration in minutes
    data['ride_duration'] = (
        data['tpep_dropoff_datetime'] - data['tpep_pickup_datetime']
    ).dt.total_seconds() / 60

    # keep only rides that lasted between 1 and 60 minutes
    data = data[(data['ride_duration'] >= 1) & (data['ride_duration'] <= 60)]

    # # sample the data to 70k rows
    # if len(data) > 70000:
    #     sampled_data = data.iloc[:70000,:].copy()
    # else:
    #     sampled_data = data.copy()
    sampled_data = data.copy()
    
    # choose the categorical features
    categorical = ['PULocationID', 'DOLocationID']

    # convert these numerical categorical features to string categorical features
    sampled_data[categorical] = sampled_data[categorical].astype(str)

    return sampled_data

In [5]:
df_train = read_and_preprocess('../01-intro/data/yellow_tripdata_2021-01.parquet')
df_valid = read_and_preprocess('../01-intro/data/yellow_tripdata_2021-02.parquet')

In [6]:
# choose the categorical and numerical features
categorical = ['PULocationID', 'DOLocationID']
numerical = ['trip_distance']

# to use the DictVectorizer, we need to convert the dataframe to dict
train_dicts = df_train[categorical + numerical].to_dict(orient='records')
val_dicts = df_valid[categorical + numerical].to_dict(orient='records')


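dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)
# reuse the mapping learned on the training data; refitting on the validation
# data would build a different feature-to-column mapping
X_valid = dv.transform(val_dicts)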

# storing our target variable
target = 'ride_duration'
y_train = df_train[target].values
y_val = df_valid[target].values

MLflow autologging records the following:

- Metrics - MLflow pre-selects a set of metrics to log, based on what model and library you use

- Parameters - hyperparameters specified for the training, plus default values provided by the library if not explicitly set

- Model Signature - logs Model signature instance, which describes input and output schema of the model

- Artifacts - e.g. model checkpoints

- Dataset - dataset object used for training (if applicable), such as tensorflow.data.Dataset



In [21]:
# Linear Regression Model
mlflow.autolog()

# With autologging enabled, the `with` context manager is optional, but without it
# we must call mlflow.end_run() after each run.
# This cell is a single run, so without the context manager we would call
# mlflow.end_run() at the end of the cell.
with mlflow.start_run():
    # train a LinearRegression Model
    lr = LinearRegression()

    lr.fit(X_train, y_train)

    # make predictions on the validation data
    y_pred = lr.predict(X_valid)

    # calculate RMSE on the validation data
    rmse = root_mean_squared_error(y_val, y_pred)

    # log the validation metric
    mlflow.log_metric('test_root_mean_squared_error', rmse)

    # logging model name - Logging it as Param, so I can see a graph of models vs RMSE
    mlflow.log_param('model','Linear Regression')

# if not using with context manager, uncomment
# mlflow.end_run()

2024/08/31 15:08:11 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2024/08/31 15:08:11 INFO mlflow.tracking.fluent: Autologging successfully enabled for lightgbm.
2024/08/31 15:08:11 INFO mlflow.tracking.fluent: Autologging successfully enabled for xgboost.
2024/08/31 15:08:21 INFO mlflow.tracking._tracking_service.client: 🏃 View run agreeable-skink-171 at: http://127.0.0.1:5000/#/experiments/478746432289998830/runs/47ec2cd3efd744f398c9381cfab20f6d.
2024/08/31 15:08:21 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://127.0.0.1:5000/#/experiments/478746432289998830.


In [22]:
# Linear Regression Model with LASSO Regularization
with mlflow.start_run():
    lr = Lasso()

    lr.fit(X_train, y_train)

    # make predictions on the validation data
    y_pred = lr.predict(X_valid)

    # calculate RMSE on the validation data
    rmse = root_mean_squared_error(y_val, y_pred)

    # log the validation metric
    mlflow.log_metric('test_root_mean_squared_error', rmse)

    # logging model name - Logging it as Param, so I can see a graph of models vs RMSE
    mlflow.log_param('model','LASSO')

2024/08/31 15:08:37 INFO mlflow.tracking._tracking_service.client: 🏃 View run loud-lark-824 at: http://127.0.0.1:5000/#/experiments/478746432289998830/runs/4aa4fd9eb7454abda7b05bacc5707262.
2024/08/31 15:08:37 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://127.0.0.1:5000/#/experiments/478746432289998830.


In [23]:
# Linear Regression Model with Ridge Regularization
with mlflow.start_run():
    # train a Ridge regression model
    lr = Ridge()

    lr.fit(X_train, y_train)

    # make predictions on the validation data
    y_pred = lr.predict(X_valid)

    # calculate RMSE on the validation data
    rmse = root_mean_squared_error(y_val, y_pred)

    # log the validation metric
    mlflow.log_metric('test_root_mean_squared_error', rmse)

    # logging model name - Logging it as Param, so I can see a graph of models vs RMSE
    mlflow.log_param('model','Ridge')

2024/08/31 15:08:53 INFO mlflow.tracking._tracking_service.client: 🏃 View run overjoyed-hare-3 at: http://127.0.0.1:5000/#/experiments/478746432289998830/runs/6abd28401d724ef8be3282d8942d8a5a.
2024/08/31 15:08:53 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://127.0.0.1:5000/#/experiments/478746432289998830.
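The metric logged in these cells is the root mean squared error, that is, the square root of the mean squared error. A small sketch with made-up durations (in minutes) checks a manual computation against scikit-learn's `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted ride durations, in minutes.
y_true = np.array([10.0, 12.0, 8.0, 15.0])
y_hat = np.array([11.0, 10.0, 8.0, 18.0])

# RMSE by hand: square the errors, average them, take the square root.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# The same value via scikit-learn's MSE followed by a square root.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```

Because of the square root, the result is in the same unit as the target, so it reads as a typical error in minutes.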


XGBoost Regressor

- n_estimators: The number of trees in the ensemble, often increased until no further improvements are seen.
- max_depth: The maximum depth of each tree, often values are between 1 and 10.
- eta: The learning rate used to weight each model, often set to small values such as 0.3, 0.1, 0.01, or smaller.
- subsample: The fraction of samples (rows) used in each tree, set to a value between 0 and 1, often 1.0 to use all samples.
- colsample_bytree: The fraction of features (columns) used in each tree, set to a value between 0 and 1, often 1.0 to use all features.

In [26]:
# XGBoost Regressor
with mlflow.start_run():
    boost = XGBRegressor(n_estimators=1000, max_depth=7, eta=0.1, subsample=0.7, colsample_bytree=0.8)

    boost.fit(X_train, y_train)

    # make predictions on the validation data
    y_pred = boost.predict(X_valid)

    # calculate RMSE on the validation data
    rmse = root_mean_squared_error(y_val, y_pred)

    # log the validation metric
    mlflow.log_metric('test_root_mean_squared_error', rmse)

    # logging model name - Logging it as Param, so I can see a graph of models vs RMSE
    mlflow.log_param('model','XGBoost')

2024/08/31 15:14:55 INFO mlflow.tracking._tracking_service.client: 🏃 View run fearless-fawn-503 at: http://127.0.0.1:5000/#/experiments/478746432289998830/runs/b838e9d27ae04a2f9fc8b8b34c767d57.
2024/08/31 15:14:55 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://127.0.0.1:5000/#/experiments/478746432289998830.


In [25]:
# to get the information on last active run
autolog_run = mlflow.last_active_run()
print(autolog_run)

<Run: data=<RunData: metrics={'test_root_mean_squared_error': 4.517709979174692}, params={'base_score': 'None',
 'booster': 'None',
 'colsample_bylevel': 'None',
 'colsample_bynode': 'None',
 'colsample_bytree': '0.8',
 'custom_metric': 'None',
 'device': 'None',
 'early_stopping_rounds': 'None',
 'eta': '0.1',
 'eval_metric': 'None',
 'gamma': 'None',
 'grow_policy': 'None',
 'interaction_constraints': 'None',
 'learning_rate': 'None',
 'max_bin': 'None',
 'max_cat_threshold': 'None',
 'max_cat_to_onehot': 'None',
 'max_delta_step': 'None',
 'max_depth': '7',
 'max_leaves': 'None',
 'maximize': 'None',
 'min_child_weight': 'None',
 'model': 'Linear Regression',
 'monotone_constraints': 'None',
 'multi_strategy': 'None',
 'n_jobs': 'None',
 'num_boost_round': '1000',
 'num_parallel_tree': 'None',
 'objective': 'reg:squarederror',
 'random_state': 'None',
 'reg_alpha': 'None',
 'reg_lambda': 'None',
 'sampling_method': 'None',
 'scale_pos_weight': 'None',
 'subsample': '0.7',
 'tree_met

In [16]:
X_train.shape

(1343254, 519)

In [27]:
# LightGBM Regressor
with mlflow.start_run():
    boost = LGBMRegressor()

    boost.fit(X_train, y_train)

    # make predictions on the validation data
    y_pred = boost.predict(X_valid)

    # calculate RMSE on the validation data
    rmse = root_mean_squared_error(y_val, y_pred)

    # log the validation metric
    mlflow.log_metric('test_root_mean_squared_error', rmse)

    # logging model name - Logging it as Param, so I can see a graph of models vs RMSE
    mlflow.log_param('model','LGBMRegressor')

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.009066 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1245
[LightGBM] [Info] Number of data points in the train set: 1343254, number of used features: 496
[LightGBM] [Info] Start training from score 11.644064


2024/08/31 15:20:23 INFO mlflow.tracking._tracking_service.client: 🏃 View run intelligent-pig-197 at: http://127.0.0.1:5000/#/experiments/478746432289998830/runs/d1cc89c1a3064fb690b2e76058f06d57.
2024/08/31 15:20:23 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://127.0.0.1:5000/#/experiments/478746432289998830.


In [35]:
from rich import print as rprint

def print_auto_logged_info(run):
    tags = {k: v for k, v in run.data.tags.items() if not k.startswith("mlflow.")}
    artifacts = [
        f.path for f in mlflow.MlflowClient().list_artifacts(run.info.run_id, "model")
    ]
    feature_importances = [
        f.path
        for f in mlflow.MlflowClient().list_artifacts(run.info.run_id)
        if f.path != "model"
    ]
    rprint(f"run_id: {run.info.run_id}")
    rprint(f"artifacts: {artifacts}")
    rprint(f"feature_importances: {feature_importances}")
    rprint(f"params: {run.data.params}")
    rprint(f"metrics: {run.data.metrics}")
    rprint(f"tags: {tags}")

In [36]:
# fetch the auto logged parameters and metrics
autolog_run = mlflow.last_active_run()
# print_auto_logged_info(mlflow.get_run(run_id=autolog_run.info.run_id))
print_auto_logged_info(autolog_run)

From the above, we see that autolog logs all the parameters of the model, along with a set of metrics. For LightGBM it does not log any metrics, but for the sklearn models it logs metrics such as RMSE and R2, all computed on the training data.

We also see all the artifacts that are saved in the artifacts folder. The model is saved in pkl format, along with the YAML and requirements files needed to run it.

Feature importance data is stored as well, for the models that provide it.

### Load the Models from MLFlow

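There are two flavours (methods) for loading a model: the native flavour of the library that trained it (here `mlflow.lightgbm`) and the generic `mlflow.pyfunc` flavour.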

In [37]:
# Load model for inference

# take the last active run; LightGBM was the last model trained
last_active_run = mlflow.last_active_run()

# get the run id for this run
run_id = last_active_run.info.run_id

# to load any model we need a model URI, which is built from the run_id
model_uri = f"runs:/{run_id}/model"

# we need the run_id because the model is stored in the
# mlartifacts/experiment_id/run_id/model folder. This folder contains the pkl file
# and the other files listed in the "artifacts" line printed above

# load the model with the native LightGBM flavour, using the URI
loaded_model = mlflow.lightgbm.load_model(model_uri)

y_pred = loaded_model.predict(X_valid)

root_mean_squared_error(y_val, y_pred)

Downloading artifacts: 100%|██████████| 5/5 [00:00<00:00, 859.45it/s]  


np.float64(4.555352288314773)

In [48]:
# Load the same model as a generic PyFunc model instead of the native LightGBM flavour
loaded_model = mlflow.pyfunc.load_model(model_uri)

y_pred = loaded_model.predict(X_valid)

root_mean_squared_error(y_val, y_pred)

Downloading artifacts: 100%|██████████| 5/5 [00:00<00:00, 179.23it/s]


np.float64(4.555352288314773)

### Parent and Child Runs

Autologging for estimators (e.g. LinearRegression, Lasso) and meta estimators (e.g. Pipeline) creates a single run and logs to it. Autologging for parameter search estimators (e.g. GridSearchCV) creates a single parent run and nested child runs.

As we are using GridSearchCV, we will get multiple runs here, and the parameters and scores of each child run will be logged. The model with the best score is also logged in the artifacts.

In [44]:
# XGBoost Regressor with GridSearch
with mlflow.start_run():
    XGBR = XGBRegressor(colsample_bytree=0.8)

    parameters = {'eta': [0.1,0.05],
                  'subsample'    : [0.9, 0.5],
                  'n_estimators' : [500,1000],
                  'max_depth'    : [4,7]
                 }

    grid_XGBR = GridSearchCV(estimator=XGBR, param_grid = parameters, cv = 2, n_jobs=-1)

    # fitting the search
    grid_XGBR.fit(X_train, y_train)

    # make predictions on the validation data (uses the best estimator found)
    y_pred = grid_XGBR.predict(X_valid)

    # calculate RMSE on the validation data
    rmse = root_mean_squared_error(y_val, y_pred)

    # log the validation metric
    mlflow.log_metric('test_root_mean_squared_error', rmse)

    # logging model name - Logging it as Param, so I can see a graph of models vs RMSE
    mlflow.log_param('model','XGBoost_with_GridSearch')

2024/08/31 16:58:39 INFO mlflow.sklearn.utils: Logging the 5 best runs, 11 runs will be omitted.
2024/08/31 16:58:39 INFO mlflow.tracking._tracking_service.client: 🏃 View run amusing-ant-749 at: http://127.0.0.1:5000/#/experiments/478746432289998830/runs/c7fac24d7e914aeb8239e40b11675ecb.
2024/08/31 16:58:39 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://127.0.0.1:5000/#/experiments/478746432289998830.
2024/08/31 16:58:39 INFO mlflow.tracking._tracking_service.client: 🏃 View run amusing-boar-204 at: http://127.0.0.1:5000/#/experiments/478746432289998830/runs/9bdce9b8a1004e03a2589a53d23920d4.
2024/08/31 16:58:39 INFO mlflow.tracking._tracking_service.client: 🏃 View run masked-crane-442 at: http://127.0.0.1:5000/#/experiments/478746432289998830/runs/ea3cef248bcb431e86c5d94a5d8d1fbd.
2024/08/31 16:58:39 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://127.0.0.1:5000/#/experiments/478746432289998830.
2024/08/31 16:58:39 INFO mlflow.tra

In [45]:
# Let's check the logged info

# fetch the auto logged parameters and metrics
autolog_run = mlflow.last_active_run()
# print_auto_logged_info(mlflow.get_run(run_id=autolog_run.info.run_id))
print_auto_logged_info(autolog_run)

In the UI, if we open this run and check the artifacts, we see a folder called `best_estimator` that stores the best model. The `model` folder holds the fitted search estimator, which predicts with that same best model. In the `feature_importances` list printed above, we can also see a file called `cv_results.csv`. It contains results such as the parameters, training error, and CV error for all 16 model configurations that GridSearchCV evaluated (4 parameters with 2 candidate values each give 2^4 = 16 combinations).

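MLflow does not create a child run for every configuration. The log output above reports that only the 5 best runs were logged and the remaining 11 were omitted, so the child runs under the parent run are the top-ranked configurations out of the 16.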
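The counts above can be checked with scikit-learn's `ParameterGrid`, using the same grid as the search cell. The value `max_tuning_runs = 5` is an assumption about the autologging default, chosen because it matches the "5 best runs" log message.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the GridSearchCV cell.
parameters = {
    "eta": [0.1, 0.05],
    "subsample": [0.9, 0.5],
    "n_estimators": [500, 1000],
    "max_depth": [4, 7],
}

n_configurations = len(ParameterGrid(parameters))  # 2 * 2 * 2 * 2

# With cv=2, every configuration is fitted twice, plus one final refit of the best.
cv_folds = 2
n_fits = n_configurations * cv_folds + 1

# Assumption: autologging keeps only the best few configurations as child runs.
max_tuning_runs = 5
n_omitted = n_configurations - max_tuning_runs

print(n_configurations, n_fits, n_omitted)
```

This also explains why the search takes so much longer than a single XGBoost fit: it trains the model dozens of times.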

### Model Signature

**Model Signature**
The Model Signature in MLflow is integral to the clear and accurate operation of models. It defines the expected format for model inputs and outputs, including any additional parameters needed for inference. This specification acts as a definitive guide, ensuring seamless model integration with MLflow’s tools and external services.

**Model Input Example**
Complementing the Model Signature, the Model Input Example gives a concrete instance of what valid model input looks like.

MLflow's `autolog` automatically infers the model signature.

![](https://mlflow.org/docs/latest/_images/signature-vs-no-signature.png)

Model signatures and input examples are foundational to robust ML workflows, offering a blueprint for model interactions that ensures consistency, accuracy, and ease of use. They act as a contract between the model and its users, providing a definitive guide to the expected data format, thus preventing miscommunication and errors that can arise from incorrect or unexpected inputs.

