When a new model arrives, we want to ask some questions: What has changed from the previous version of the model to the new one? Is any preprocessing needed? What extra libraries do we need to run the new model?

And what if we hit issues while running the new model in production and have to roll back to the old one? Then we need to know where the old model is stored.

When working on an ML task, we use the MLflow Tracking Server to log parameters, metrics, artifacts, and many different model versions.

Once we believe a model is fit for production, we "register" it in the MLflow Model Registry.

The MLflow Model Registry is where we store the production-ready models. So whenever a deployment engineer wants to update the models, they can look at the Model Registry to find the new production-ready ones.

The MLflow Model Registry component is a centralized model store, set of APIs, and UI, to collaboratively manage the full lifecycle of an MLflow Model. It provides model lineage (which MLflow experiment and run produced the model), model versioning, model aliasing, model tagging, and annotations.

The Model Registry does not deploy models; it only stores the models that are production-ready.

In [2]:
import pickle
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_extraction import DictVectorizer

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import LinearSVR


from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

from sklearn.metrics import root_mean_squared_error

from sklearn.model_selection import cross_val_score, KFold

In [3]:
import mlflow

# Set our tracking server uri for logging
mlflow.set_tracking_uri(uri="http://127.0.0.1:5000")

# Create a new MLflow Experiment - Inside an experiment, there will be Runs
mlflow.set_experiment("taxi-model-registry")

2024/08/31 14:42:21 INFO mlflow.tracking.fluent: Experiment with name 'taxi-model-registry' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/478746432289998830', creation_time=1725095541014, experiment_id='478746432289998830', last_update_time=1725095541014, lifecycle_stage='active', name='taxi-model-registry', tags={}>

In [4]:
# a function to read the data, preprocess it and return it
def read_and_preprocess(filename):
    data = pd.read_parquet(filename)
    
    # create the target variable: ride duration in minutes
    data['ride_duration'] = data['tpep_dropoff_datetime'] - data['tpep_pickup_datetime']
    data['ride_duration'] = data['ride_duration'].dt.total_seconds() / 60

    # keep only rides lasting between 1 and 60 minutes
    data = data[(data['ride_duration'] >= 1) & (data['ride_duration'] <= 60)]

    # # sample the data to 70k rows
    # if len(data) > 70000:
    #     sampled_data = data.iloc[:70000,:].copy()
    # else:
    #     sampled_data = data.copy()
    sampled_data = data.copy()
    
    # choose the categorical features
    categorical = ['PULocationID', 'DOLocationID']

    # convert these numerical categorical features to string categorical features
    sampled_data[categorical] = sampled_data[categorical].astype(str)

    return sampled_data

In [5]:
df_train = read_and_preprocess('../01-intro/data/yellow_tripdata_2021-01.parquet')
df_valid = read_and_preprocess('../01-intro/data/yellow_tripdata_2021-02.parquet')

In [6]:
# choose the categorical and numerical features
categorical = ['PULocationID', 'DOLocationID']
numerical = ['trip_distance']

# to use the DictVectorizer, we need to convert the dataframe to dict
train_dicts = df_train[categorical + numerical].to_dict(orient='records')
val_dicts = df_valid[categorical + numerical].to_dict(orient='records')


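dv = DictVectorizer()
# fit the vectorizer on the training data only
X_train = dv.fit_transform(train_dicts)
# reuse the fitted feature mapping for validation (transform, not fit_transform)
X_valid = dv.transform(val_dicts)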

# storing our target variable
target = 'ride_duration'
y_train = df_train[target].values
y_val = df_valid[target].values