# Tutorial: introduction to MLflow

This first application introduces the basic concepts of `MLflow`. The goal is to predict the income class of individuals using sample data from the US Census and a Random Forest classifier from the popular [scikit-learn](https://scikit-learn.org/stable/) machine learning library for `Python`.

We first illustrate how we would perform the training and fine-tuning of the model in a traditional way. Then, we show how we can integrate it as an **MLflow experiment**, so as to **log** relevant parameters and metrics in `MLflow`'s **tracking server** and visualize them in the UI. Finally, we illustrate how selected models can transition from the tracking server to the **model registry**, and how they can then be used from there to perform inference on new data points.

In [1]:
import os
from pprint import pprint
import json

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, LabelEncoder
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import joblib
import mlflow
import mlflow.sklearn
import mlflow.pyfunc




In [2]:
SEED = 0

## Data preprocessing

For this application, we'll use a classical dataset extracted from the 1994 US Census bureau data. The goal is to determine whether a person makes over $50K a year ('>50K') or less ('<=50K') using sociodemographic characteristics on the selected individuals. As the available variables are generally self-explanatory, we won't describe the data much, but more information on them can be found in the original [Kaggle challenge](https://www.kaggle.com/datasets/uciml/adult-census-income).

In [3]:
DATA_URL = "https://minio.lab.sspcloud.fr/projet-formation/diffusion/mlops/data/adult-census-us.csv"
df_census = pd.read_csv(DATA_URL)

In [4]:
df_census.head()

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capitalgain | capitalloss | hoursperweek | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 1 | 0 | 2 | United-States | <=50K |
| 1 | 3 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 0 | United-States | <=50K |
| 2 | 2 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 2 | United-States | <=50K |
| 3 | 3 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 2 | United-States | <=50K |
| 4 | 1 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 2 | Cuba | <=50K |


The goal is to predict the income class, so we must first set it aside from the training data. As this variable consists of string-encoded categories, we must encode it in a numerical format to be able to feed it to a machine learning model. A common practice for ordinal data (data for which an order exists, such as income class) is to encode labels as consecutive integers starting at 0, a technique known as **label encoding**.

In [5]:
le = LabelEncoder()

X = df_census.drop(columns="class")
y = le.fit_transform(df_census["class"].values)

These new integer-encoded categories can naturally be mapped to the original values of the variable, and conversely.

In [6]:
# The encoded classes
le.classes_

array(['<=50K', '>50K'], dtype=object)

In [7]:
# The corresponding original classes
print(y)
print(np.array([le.classes_[i] for i in y]))

[0 0 0 ... 0 0 1]
['<=50K' '<=50K' '<=50K' ... '<=50K' '<=50K' '>50K']
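
As a minimal sketch of the reverse mapping, `LabelEncoder` also exposes an `inverse_transform` method, which avoids indexing `classes_` by hand. The toy labels and the `toy_` variable names below are illustrative only and are not part of the tutorial data.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only (not the census data)
toy_labels = np.array(["<=50K", ">50K", "<=50K", ">50K", ">50K"])

toy_le = LabelEncoder()
# Classes are sorted before being numbered, so '<=50K' gets the code 0
toy_encoded = toy_le.fit_transform(toy_labels)
# Map the integer codes back to the original string categories
toy_decoded = toy_le.inverse_transform(toy_encoded)

print(toy_encoded)
print(toy_decoded)
```

The same call applied to the fitted `le` object can turn model predictions back into readable income classes.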


A common practice in machine learning projects is to start by setting a fraction of the data aside as a **test dataset**. This data will be used at the very end of the project in order to properly evaluate the generalization performance of our selected algorithm, i.e. how it would perform on new unseen data. The rest of the data (**training dataset**) will be used to train the algorithms and compare their performance. Without this step, we are at risk of overfitting our models on the available data so that our evaluation metrics would no longer properly estimate the generalization error.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

In [9]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 39073 entries, 22729 to 2732
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             39073 non-null  int64 
 1   workclass       36830 non-null  object
 2   fnlwgt          39073 non-null  int64 
 3   education       39073 non-null  object
 4   education-num   39073 non-null  int64 
 5   marital-status  39073 non-null  object
 6   occupation      36821 non-null  object
 7   relationship    39073 non-null  object
 8   race            39073 non-null  object
 9   sex             39073 non-null  object
 10  capitalgain     39073 non-null  int64 
 11  capitalloss     39073 non-null  int64 
 12  hoursperweek    39073 non-null  int64 
 13  native-country  38404 non-null  object
dtypes: int64(6), object(8)
memory usage: 4.5+ MB


This summary shows that thousands of observations are missing for some variables. To avoid discarding data, and since these values might not be missing at random, we'll impute the missing ones:
- for numerical variables, we'll impute the median of the variable
- for categorical variables, we'll impute the mode, i.e. the most frequent category in the data

As before, string-encoded categorical variables must also be converted to some form of numerical data. We'll use the same integer-encoding strategy as for the target variable, this time through `OrdinalEncoder`, which is designed for features rather than labels.

To make all these steps as reproducible as possible, we formalize them as a `scikit-learn` `Pipeline` object. More information on the rationale for pipelines and the way they are used can be found in the [documentation](https://scikit-learn.org/stable/modules/compose.html).

In [10]:
median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
mode_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
ordinal_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

categorical_transformer = make_pipeline(mode_imputer, ordinal_encoder)

preprocessor = ColumnTransformer(
    transformers=[
        ("numerical", median_imputer, make_column_selector(dtype_include=np.int64)),
        ("categorical", categorical_transformer, make_column_selector(dtype_include=object))
    ], remainder="passthrough"
)

Like most `scikit-learn` objects, the pipeline must first be `fit` on the data (e.g. to compute the most frequent value or the median). Then, we can use it to `transform` the data. The resulting object is a `NumPy` array with the same number of rows and columns as the original data, although the columns are reordered: the numerical ones come first, followed by the categorical ones. It is not very useful per se, but it shows us that the categorical variables are indeed transformed into numerical values. This numerical array can then be fed to a machine learning model to train it.

In [11]:
preprocessor.fit_transform(X_train)

array([[0.00000e+00, 1.17372e+05, 7.00000e+00, ..., 4.00000e+00,
        1.00000e+00, 3.80000e+01],
       [2.00000e+00, 3.57720e+05, 1.10000e+01, ..., 4.00000e+00,
        0.00000e+00, 3.80000e+01],
       [4.00000e+00, 2.02242e+05, 9.00000e+00, ..., 4.00000e+00,
        1.00000e+00, 3.80000e+01],
       ...,
       [2.00000e+00, 3.44624e+05, 1.00000e+01, ..., 4.00000e+00,
        1.00000e+00, 3.80000e+01],
       [3.00000e+00, 1.04489e+05, 1.30000e+01, ..., 4.00000e+00,
        1.00000e+00, 3.80000e+01],
       [0.00000e+00, 1.86925e+05, 1.00000e+01, ..., 4.00000e+00,
        1.00000e+00, 3.80000e+01]])
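
To make the effect of each preprocessing step easier to see, here is a minimal sketch of the same kind of preprocessor applied to a tiny hand-made table. The `toy_` names and values are made up for illustration. Because a missing value forces the toy numerical column to be stored as floats, this sketch selects numerical columns with `np.number` rather than `np.int64` as in the tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hand-made data: one categorical and one numerical column, each with a gap
toy_df = pd.DataFrame({
    "workclass": pd.Series(["Private", np.nan, "Private", "State-gov"], dtype=object),
    "age": [1.0, 3.0, np.nan, 2.0],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # Median imputation for numerical columns
        ("numerical", SimpleImputer(strategy="median"),
         make_column_selector(dtype_include=np.number)),
        # Mode imputation, then integer encoding, for categorical columns
        ("categorical",
         make_pipeline(
             SimpleImputer(strategy="most_frequent"),
             OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
         ),
         make_column_selector(dtype_include=object)),
    ]
)

# Output columns follow the transformer order: age first, then workclass
toy_transformed = toy_preprocessor.fit_transform(toy_df)
print(toy_transformed)

# A category never seen during fit is encoded as -1 instead of raising an error
toy_new = pd.DataFrame({
    "workclass": pd.Series(["Never-worked"], dtype=object),
    "age": [np.nan],
})
toy_new_transformed = toy_preprocessor.transform(toy_new)
print(toy_new_transformed)
```

The missing age is replaced by the median learned during `fit`, the missing `workclass` by the most frequent category, and the unseen category falls back to the `unknown_value` we configured.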

## Tracking machine learning experiments: the classical way

In order to really understand why the MLOps approach is desirable, we must first get an idea of how we would train our model without it, i.e. the "classical" way. The workflow we are trying to achieve is best described by the following figure from the [scikit-learn documentation](https://scikit-learn.org/stable/).

<img src="img/grid_search_workflow.png" alt="Drawing" style="width: 400px;"/>

Using the training data, we want to train a model so as to get the best generalization performance, i.e. minimize the prediction error on unseen data. To do so, we have to **fine-tune** the **hyperparameters** of our model, i.e. find the combination of hyperparameters that provides the best performance. In order to avoid **overfitting** when doing so, we use a procedure called **cross-validation** (described in detail [here](https://scikit-learn.org/stable/modules/cross_validation.html)). Once we have found the optimal set of hyperparameters, we use the model trained with those for a final evaluation on the test set.

In this example, we train a *Random Forest* to discriminate the two income classes. First, we build a `Pipeline` object that chains the preprocessing step and the model, so that the exact same transformations are applied at training and prediction time, which improves the reproducibility of the results.

In [12]:
rf_clf = RandomForestClassifier(random_state=SEED)

pipe_rf = Pipeline([
    ('preprocessor', preprocessor), 
    ('classifier', rf_clf)
])
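
Before tuning anything, a single cross-validated evaluation can be sketched with `cross_validate`. The example below is a minimal illustration on synthetic data generated with `make_classification`. The `_demo` names, the sample size and the forest size are arbitrary assumptions rather than the tutorial's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic binary classification data (a stand-in for the census data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

clf_demo = RandomForestClassifier(n_estimators=20, random_state=0)

# 5-fold cross-validation: the model is fitted 5 times, each time
# evaluated on the fold that was held out from training
cv_scores = cross_validate(clf_demo, X_demo, y_demo, cv=5, scoring=["accuracy", "f1"])

for metric in ("test_accuracy", "test_f1"):
    print(metric, cv_scores[metric].mean().round(3), cv_scores[metric].std().round(3))
```

Each entry of `cv_scores` holds one value per fold, so the mean and the standard deviation summarize both the level and the stability of the performance.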

Although the hyperparameters provided natively by `scikit-learn` are usually good defaults, we will of course want to check whether we can improve the performance further by **fine-tuning** the relevant hyperparameters. To do so, we use the *grid search* method, which amounts to testing all the possible combinations of hyperparameters over given sets of values (the *grid*). For each combination, the performance is evaluated using a 5-fold *cross-validation*. As accuracy alone can be misleading for classification problems when classes are imbalanced, we also request the precision, the recall and the F1-score.

In [13]:
param_grid = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__max_leaf_nodes": [5, 10, 50]
}

pipe_gscv = GridSearchCV(pipe_rf, 
                         param_grid=param_grid, 
                         scoring=["accuracy", "precision", "recall", "f1"],
                         refit="f1",
                         cv=5, 
                         n_jobs=5, 
                         verbose=1)
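
To see why accuracy alone can be misleading under class imbalance, consider the following minimal sketch. The labels and predictions are hand-made for illustration and are not census results. A "model" that always predicts the majority class reaches the same accuracy as one that actually detects some positives, while its F1-score drops to zero.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hand-made imbalanced ground truth: 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Always predicts the majority class
y_majority = [0] * 10
# Finds one of the two positives, at the cost of one false positive
y_model = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

for name, y_pred in [("majority", y_majority), ("model", y_model)]:
    print(
        name,
        accuracy_score(y_true, y_pred),
        # zero_division=0 avoids a warning when no positive is predicted
        precision_score(y_true, y_pred, zero_division=0),
        recall_score(y_true, y_pred, zero_division=0),
        f1_score(y_true, y_pred, zero_division=0),
    )
```

This is the reason why the grid search above refits the best model according to `f1` rather than `accuracy`.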

**Question**: can you guess the total number of `fit` steps that will be performed when calling the `fit` method on the `pipe_gscv` object?

<details>
<summary>
    <font size="3" color="darkgreen"><b>Click to see the answer </b></font>
</summary>

From the grid search only, there are 3 * 3 = 9 candidate models to train. However, for each combination, a 5-fold cross-validation is performed, which involves 5 training steps (*fits*). So altogether, there will be 9 * 5 = 45 cross-validation *fits* to compute, plus one final refit of the best candidate on the whole training set because `refit` is set.
</details>

In [14]:
pipe_gscv.fit(X_train, y_train)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
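
The count reported above can be checked by hand with a short sketch that enumerates the grid using only the standard library. The grid is the one defined earlier, and `n_folds` and `candidates` are names introduced here for illustration.

```python
from itertools import product

param_grid = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__max_leaf_nodes": [5, 10, 50],
}
n_folds = 5

# Every combination of one value per hyperparameter is a candidate model
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

# Each candidate is fitted once per cross-validation fold
n_cv_fits = len(candidates) * n_folds

print(f"{len(candidates)} candidates x {n_folds} folds = {n_cv_fits} fits")
```

Adding a third hyperparameter with 4 values would multiply the number of candidates by 4, which is why grid search quickly becomes expensive.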


We can get detailed results for each candidate model in a `Pandas DataFrame`. This enables us to compare the models and select the best candidate based on their respective performance. 

In [15]:
gscv_results = pd.DataFrame(pipe_gscv.cv_results_)
gscv_results.head()

|   | `mean_fit_time` | `std_fit_time` | `mean_score_time` | `std_score_time` | `param_classifier__max_leaf_nodes` | `param_classifier__n_estimators` | `params` | `split0_test_accuracy` | `split1_test_accuracy` | `split2_test_accuracy` | ... | `std_test_recall` | `rank_test_recall` | `split0_test_f1` | `split1_test_f1` | `split2_test_f1` | `split3_test_f1` | `split4_test_f1` | `mean_test_f1` | `std_test_f1` | `rank_test_f1` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.37276 | 0.084143 | 0.113134 | 0.003589 | 5 | 50 | `{'classifier__max_leaf_nodes': 5, 'classifier_...` | 0.827639 | 0.833781 | 0.832118 | ... | 0.013073 | 7 | 0.541993 | 0.545963 | 0.543811 | 0.557421 | 0.528157 | 0.543469 | 0.009356 | 7 |
| 1 | 2.45571 | 0.180711 | 0.159425 | 0.021745 | 5 | 100 | `{'classifier__max_leaf_nodes': 5, 'classifier_...` | 0.82508 | 0.831734 | 0.827895 | ... | 0.007155 | 8 | 0.514732 | 0.536809 | 0.523557 | 0.530239 | 0.517448 | 0.524557 | 0.00813 | 8 |
| 2 | 4.766344 | 0.360433 | 0.233709 | 0.027991 | 5 | 200 | `{'classifier__max_leaf_nodes': 5, 'classifier_...` | 0.8238 | 0.828279 | 0.825848 | ... | 0.00581 | 9 | 0.503067 | 0.520714 | 0.514449 | 0.513224 | 0.512221 | 0.512735 | 0.005667 | 9 |
| 3 | 1.642686 | 0.094695 | 0.114007 | 0.012026 | 10 | 50 | `{'classifier__max_leaf_nodes': 10, 'classifier...` | 0.831862 | 0.841331 | 0.837876 | ... | 0.010626 | 4 | 0.567763 | 0.595828 | 0.590894 | 0.574122 | 0.583145 | 0.582351 | 0.010351 | 4 |
| 4 | 2.904217 | 0.272134 | 0.178663 | 0.037922 | 10 | 100 | `{'classifier__max_leaf_nodes': 10, 'classifier...` | 0.831094 | 0.840691 | 0.837236 | ... | 0.009021 | 5 | 0.564931 | 0.591401 | 0.587281 | 0.575057 | 0.577098 | 0.579154 | 0.009374 | 5 |
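
Since the full results table is wide, it can help to keep only the columns needed for model selection and to sort the candidates by rank. The sketch below illustrates this on a small grid search fitted on synthetic data. The `demo_gscv` and `summary` names, the reduced grid and the data are assumptions made for the example, not the tutorial's objects.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a deliberately tiny grid to keep the example fast
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_gscv = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [5, 10], "max_leaf_nodes": [5, 10]},
    scoring=["accuracy", "f1"],
    refit="f1",
    cv=3,
)
demo_gscv.fit(X_demo, y_demo)

# Keep the selection-relevant columns and put the best-ranked candidate first
summary = (
    pd.DataFrame(demo_gscv.cv_results_)
    [["params", "mean_test_f1", "std_test_f1", "rank_test_f1"]]
    .sort_values("rank_test_f1", kind="stable")
    .reset_index(drop=True)
)
print(summary)
```

Applied to `gscv_results`, the same selection and sorting would show the nine candidates of the tutorial ordered by mean F1-score.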


The fitted `GridSearchCV` object actually keeps track of the best model for us. We can thus show the best performing set of hyperparameters, and use the model trained with these hyperparameters to compute the final score on the test set (which has not been used until now).

In [16]:
print(pipe_gscv.best_params_)

best_model = pipe_gscv.best_estimator_

{'classifier__max_leaf_nodes': 50, 'classifier__n_estimators': 50}


In [17]:
y_test_pred = best_model.predict(X_test)
f1_test = f1_score(y_test, y_test_pred)

print(f"Final F1-score on test data : {f1_test}")

Final F1-score on test data : 0.646433990895296


In order for this analysis to be reproducible, we must find a way to export the results. Fortunately, `scikit-learn` models are serializable. One way to persist them is to use `joblib` (see the [documentation on model persistence](https://scikit-learn.org/stable/model_persistence.html) for a more detailed discussion of the possible ways to export models).

In [18]:
# Create the output directory if it does not exist yet
os.makedirs("models/", exist_ok=True)
joblib.dump(pipe_gscv, 'models/pipeline_train_model_20230118.joblib')

['models/pipeline_train_model_20230118.joblib']
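
A serialized model is only useful if it can be loaded back and produces the same predictions. The following minimal sketch checks this round trip with `joblib.load` on a small model trained on synthetic data. The file name `demo_model.joblib`, the temporary directory and the `_demo` names are illustrative assumptions.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small model trained on synthetic data (a stand-in for the tutorial's pipeline)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Write the model to disk, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.joblib"
    joblib.dump(model_demo, model_path)
    reloaded_demo = joblib.load(model_path)

# The reloaded model should predict exactly like the original one
same_predictions = np.array_equal(
    model_demo.predict(X_demo), reloaded_demo.predict(X_demo)
)
print(same_predictions)
```

Note that loading such a file generally requires a `Python` environment compatible with the one used for saving, which is one of the things a plain file on disk does not record.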

This is convenient for the development phase, but it is also very clear that **this way of persisting models is neither production-grade nor scalable**:
- first and foremost, we lack a proper way to **track experiments** (data used for training, environment configuration, metrics...)
- we cannot easily visualize the various metrics so as to compare and select the best model
- with `n_jobs`, we can parallelize the fits on a single machine, but we can't readily distribute the evaluation of the hyperparameter combinations across several machines
- there is no easy and standardized way to distribute the serialized models, so the collaboration of several team members on a given experiment is complicated

The MLOps principles were precisely devised to solve these various problems. Let's see now how MLflow enables us to implement them easily.

## Tracking machine learning experiments: the MLflow way

### Configuration

The main component of `MLflow` is the *Tracking Server*, which tracks experiments and saves the relevant data and metadata. More precisely, for each *run* ("execution of some piece of data science code"), the *Tracking Server* records:
- **experiments data** (parameters, metrics, tags, notes, metadata, ...) in a **backend store** (in our case, a `PostgreSQL` database)
- **artifacts** (models, files, images, ...) in an **artifact store** (in our case, `S3`-like storage)

As a user, we communicate with the *Tracking Server* through a client (in our case, a `Jupyter` notebook with a `Python` kernel).

Fortunately, in a properly set up environment, these three communication levels can be pre-configured so that they are relatively transparent to the user. This enables the data scientist to focus on the business task at hand.

<img src="img/mlflow-tracking.png" alt="Drawing" style="width: 800px;"/>

The client must know the URI of the *tracking server*. If an `MLflow` instance was launched on the SSP Cloud before the client, the client will automatically discover the URI. If not, it must be set manually. Opening this URI in a browser displays the `MLflow` UI, which we will be using later in the tutorial.

In [19]:
# Automatic discovery: if MLflow has been launched before Jupyter/VSCode
if "MLFLOW_TRACKING_URI" in os.environ:
    print(os.environ["MLFLOW_TRACKING_URI"])
else:
    print("MLflow was not automatically discovered, a tracking URI must be provided manually.")

https://user-jguay-754955.user.lab.sspcloud.fr


In [None]:
# Manual configuration: if MLflow has been launched after Jupyter/VSCode
# os.environ["MLFLOW_TRACKING_URI"] = "copy_uri_from_mlflow_service_README_here"

### Tracking experiments

In the previous steps, we fine-tuned our model, i.e. we trained the same model with several different combinations of hyper-parameters in order to ultimately select the one with the best performance according to a given metric. In comparison with the "traditional way" we saw above, `MLflow` enables us to track these experiments in a much more refined way, compatible with the *MLOps* principles.

The function `log_gsvc_to_mlflow` below enables us to convert the data contained in our `GridSearchCV` object (hyperparameters, metrics, artifacts...) into an `MLflow` experiment with one run per hyperparameter combination, which can then be queried using the API.

In [20]:
def log_gsvc_to_mlflow(gscv, mlflow_experiment_name):
    """Log a scikit-learn trained GridSearchCV object as an MLflow experiment."""
    # Set up the MLflow context
    mlflow.set_experiment(experiment_name=mlflow_experiment_name)

    for run_idx in range(len(gscv.cv_results_["params"])):
        # For each hyperparameter combination we trained the model with, we log a run in MLflow
        run_name = f"run {run_idx}"
        with mlflow.start_run(run_name=run_name):
            # Log hyperparameters
            params = gscv.cv_results_["params"][run_idx]
            for param in params:
                mlflow.log_param(param, params[param])

            # Log fit metrics
            scores = [score for score in gscv.cv_results_ if "mean_test" in score or "std_test" in score]
            for score in scores:
                mlflow.log_metric(score, gscv.cv_results_[score][run_idx])

            # Log model as an artifact
            mlflow.sklearn.log_model(gscv, "gscv_model")

            # Log training data URL
            mlflow.log_param("data_url", DATA_URL)

In [21]:
log_gsvc_to_mlflow(gscv=pipe_gscv, mlflow_experiment_name="tutorial-mlflow-intro")

2023/11/22 14:10:19 INFO mlflow.tracking.fluent: Experiment with name 'tutorial-mlflow-intro' does not exist. Creating a new experiment.


### Application 1.1: Tracking models with MLflow

If the previous cell executed correctly, the experiment and all the data we wanted `MLflow` to log are now available in the *tracking server*. In order to interact with these data and try to select the best model, we'll learn to use the UI. Please follow these steps:
1. Open the UI using the URI we printed above
2. In the *Experiments* tab, open the "tutorial-mlflow-intro" experiment
3. Verify that 9 runs have indeed been recorded, one for each hyperparameter combination
4. Open a given run and verify that you can retrieve the various information we wanted to log (hyperparameters, evaluation metrics, training data URL), check that you can download the serialized `scikit-learn` model, and check the `requirements.txt` file to understand how `MLflow` automatically inferred the required `Python` environment
5. Go back to the list of runs by clicking again on the "tutorial-mlflow-intro" experiment
6. Add additional columns using the *Columns* drop-down menu in the *Table* panel
7. Sort the models in descending order according to the **mean test F1-score** 

### Application 1.2: Registering a model with MLflow
In the previous section, we properly logged our experiment in the `MLflow` tracking server, which enables us to compare our models and see the best performing ones in a visual way. Now, we want to be able to select a model, put it in production, and allow the other members of the project to query it. For this to be possible, we have to **move the model from the tracking server to the model registry**.
1. Consider both the high mean test F1-score and low standard deviation of test F1-score to make a decision on the best model.
2. Click on the run corresponding to your chosen best model.
3. Click on "Register Model"
4. Create a New Model and give it a relevant name (e.g. "rf-census")
5. Move to the model registry by clicking on the "Models" tab
6. If everything worked correctly, you should see your model in the list of the registered models. Click on it to get the list of the registered versions of the model. For now, there is only one version as we pushed it only one time.
7. Click on "Version 1" to get the information on this specific version of the model. Two things are especially interesting to note :

   **a.** The "Stage" section. Here you can indicate to all members of the project what the stage of this specific model is. Let's transition it to "Production" to indicate that it is our reference model, which we want to deploy

   **b.** The "Source run" section. Here you can retrieve the run that corresponds to this model. If you click on the run id, you retrieve all the information we logged (environment, metrics, artifacts location...). 

### Querying a model

As above, let's perform the final evaluation on the test set using the model we put in production. We can retrieve the model either by its version or by its stage, should it have one. We fetch the model using the `mlflow.pyfunc.load_model` function, which returns a generic `pyfunc` wrapper around our `scikit-learn` model. Its `predict` method can directly be used for prediction. Let's check that we find the same final F1-score on the test data.

#### Using the version number

In [22]:
# Fetch the model
model_name = "rf-census"
version = 1

model = mlflow.pyfunc.load_model(model_uri=f"models:/{model_name}/{version}")


In [23]:
# Final evaluation
y_test_pred = model.predict(X_test)
f1_test = f1_score(y_test, y_test_pred)

print(f"Final F1-score on test data : {f1_test}")

Final F1-score on test data : 0.646433990895296


This score indeed corresponds to the one we found in the "classical way" section using the best trained model!

#### Using the stage

Equivalently, we can use the stage of the model, which should produce the same score.

In [25]:
# Fetch the model
model_name = "rf-census"
stage = 'Production'

model = mlflow.pyfunc.load_model(model_uri=f"models:/{model_name}/{stage}")


In [26]:
# Final evaluation
y_test_pred = model.predict(X_test)
f1_test = f1_score(y_test, y_test_pred)

print(f"Final F1-score on test data : {f1_test}")

Final F1-score on test data : 0.646433990895296


## Conclusion

`MLflow` enables us to **set our machine learning experiments to the standards of the `MLOps` approach** in a very user-friendly way :
- data scientists can very easily decide what data they want to log for each experiment so as to **keep a detailed track of those experiments**
- other members of the team (e.g. data engineers who might be in charge of deploying the model) can very easily fetch the model and use it for prediction in an application. To do so, they only need to know the name of the model as well as its version or its stage, but not the actual location of the artifact in the storage, as this layer is abstracted away by `MLflow`. **This makes collaboration on machine learning projects very convenient and efficient**.