# Chapter 3: Tracking Models, Parameters, and Metrics

Given that MLflow can support multiple scenarios through the life cycle of DL models, it is common to use MLflow's capabilities incrementally. Usually, people start with MLflow tracking since it is easy to use and can handle many scenarios for reproducibility, provenance tracking, and auditing purposes.

In this chapter, we will take a deep dive into how we can track a model, along with its parameters and metrics, using MLflow's tracking and registry APIs. By the end of this chapter, you should feel comfortable using these APIs for various reproducibility and auditing purposes.

## **Setting up a full-fledged local MLflow tracking server**

The benefit of having a model registry is that we can register the model, version control the model, and prepare for model deployment into production. Therefore, this model registry will bridge the gap between offline experimentation and an online deployment production scenario. Thus, we need a full-fledged MLflow tracking server with the following stores to track the complete life cycle of a model:
* **Backend store**: A **relational database backend is needed to support MLflow's storage of metadata** (metrics, parameters, and many others) about the experiment. This also allows the query capability of the experiment to be used. **We will use a MySQL database as a local backend store**.
* **Artifact store**: An object store that can store arbitrary types of objects, such as serialized models, vocabulary files, figures, and many others. In a production environment, a popular choice is the AWS S3 store. We will use [**MinIO**](https://min.io/), a multi-cloud object store, as a local artifact store, which is fully compatible with the AWS S3 store API but can run on your laptop without you needing to access the cloud.
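
As a rough sketch of how these two stores fit together, the snippet below assembles the kind of `mlflow server` command that a Docker service could run. The MySQL user, password, host, and database name are hypothetical placeholders, and the `mlflow` bucket name is taken from the MinIO URL used later in this chapter. The repository's actual docker-compose configuration may differ.

```python
def build_server_command(db_user, db_password, db_host, db_name, bucket):
    """Assemble an `mlflow server` command line for a MySQL backend store
    and an S3-compatible artifact store."""
    # SQLAlchemy-style URI for the backend (metadata) store.
    backend_uri = f"mysql+pymysql://{db_user}:{db_password}@{db_host}:3306/{db_name}"
    # S3-style URI for the artifact store (MinIO implements the S3 API).
    artifact_root = f"s3://{bucket}/"
    return [
        "mlflow", "server",
        "--backend-store-uri", backend_uri,
        "--default-artifact-root", artifact_root,
        "--host", "0.0.0.0",
    ]


# Hypothetical credentials, for illustration only.
command = build_server_command("mlflow_user", "mlflow_password", "db", "mlflow_db", "mlflow")
print(" ".join(command))
```

Metadata such as parameters and metrics goes to the backend store URI, while files such as serialized models go to the artifact root. The MinIO endpoint itself is supplied separately through the `MLFLOW_S3_ENDPOINT_URL` environment variable.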

To make this local setup as easy as possible, we will use the [docker-compose](https://docs.docker.com/compose/) tool, which lets us start and stop a full-fledged local MLflow tracking server with a single command. The following steps launch the tracking server inside local Docker containers:

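1. Clone the book's GitHub repository, https://github.com/PacktPublishing/Practical-Deep-Learning-at-Scale-with-MLflow. This chapter's files are at https://github.com/PacktPublishing/Practical-Deep-Learning-at-Scale-with-MLflow/tree/main/chapter03.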
2. Change directory to the `mlflow_docker_setup` subfolder, which can be found under the chapter03 folder.
3. Run the following command:
```bash
bash start_mlflow.sh
```
4. Go to `http://localhost/` to see the MLflow UI web page. Then, click the Models tab in the UI. Note that this tab would not work if you only had a local filesystem as the backend store for the MLflow tracking server. It works here because the MLflow UI's backend is now the Docker container service you just started, not a local filesystem.
5. Go to `http://localhost:9000/`, and the login screen of the **MinIO** artifact store web UI should appear. Enter `minio` for Access Key and `minio123` for Secret Key. These are defined in the `.env` file, under the `mlflow_docker_setup` folder.

At this point, you should have a full-fledged local MLflow tracking server running successfully! If you want to stop the server, simply type the following command:
```bash
bash stop_mlflow.sh
```
The Docker-based MLflow tracking server will stop. We are now ready to use this local MLflow server to track model provenance, parameters, and metrics.

## **Tracking model provenance**

**Provenance** tracking for digital artifacts has long been studied in the literature. For example, when **you're using a piece of patient diagnosis data in the biomedical industry**, people usually want to know **where it comes from, what kind of processing and cleaning has been done to the data, who owns the data, and other history and lineage information** about the data. **The rise of ML/DL models for industrial and business scenarios in production makes provenance tracking a required functionality**. The different granularities of provenance tracking are critical for operationalizing and managing not just offline data science experimentation, but also the model before, during, and after its deployment in production. So, what needs to be tracked for provenance?

### **Understanding the open provenance tracking framework**

Let's look at a general provenance tracking framework to understand the big picture of why provenance tracking is a major effort. The following diagram is based on the [Open Provenance Model Vocabulary Specification](http://open-biomed.sourceforge.net/opmv/ns.html):

![Text](open_provenance_model_vocab_spec.jpg)

In the preceding diagram, there are three important items:

* **Artifacts**: Things that are produced or used by processes (**A1** and **A2**).
* **Processes**: Actions that are performed by using or producing artifacts (**P1** and **P2**).
* **Causal relationships**: Edges or relationships between artifacts and processes, such as used, ***wasGeneratedBy***, and ***wasDerivedFrom*** in the preceding diagram (**R1**, **R2**, and **R3**).
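
To make these three items concrete, here is a minimal sketch that models artifacts, processes, and causal relationships as plain Python objects. The class names, the example artifact and process names, and the `lineage` helper are all hypothetical and not part of the OPM specification or of MLflow.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Artifact:
    name: str


@dataclass(frozen=True)
class Process:
    name: str


@dataclass
class ProvenanceGraph:
    # Each edge is a (source, relation, target) triple.
    edges: list = field(default_factory=list)

    def add(self, source, relation, target):
        self.edges.append((source, relation, target))

    def lineage(self, artifact):
        """Return every artifact that this artifact was (transitively) derived from."""
        ancestors = []
        frontier = [artifact]
        while frontier:
            current = frontier.pop()
            for source, relation, target in self.edges:
                if source == current and relation == "wasDerivedFrom" and target not in ancestors:
                    ancestors.append(target)
                    frontier.append(target)
        return ancestors


# Hypothetical example: a fine-tuning process that consumes raw data and produces a model.
raw_data = Artifact("raw_reviews")
model = Artifact("fine_tuned_model")
training = Process("fine_tuning")

graph = ProvenanceGraph()
graph.add(training, "used", raw_data)          # a process used an artifact
graph.add(model, "wasGeneratedBy", training)   # an artifact was generated by a process
graph.add(model, "wasDerivedFrom", raw_data)   # an artifact was derived from another artifact

print([a.name for a in graph.lineage(model)])
```

Walking the `wasDerivedFrom` edges is what lets us answer "where did this model come from?" for any artifact in the graph.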

Intuitively, this open provenance model (OPM) framework allows us to ask the following 5W1H (five Ws and one H) questions:

![Alt text](types_prov_questions.png)

Having a systematic provenance framework and a set of questions will help us learn how to track model provenance and provide answers to these questions. This will motivate us when we implement MLflow model tracking in the next section.

## **Implementing MLflow model tracking**

We can use an MLflow tracking server to answer most of these types of provenance questions if we implement both **MLflow logging** and **registry** for the DL model we use. First, let's review what MLflow provides in terms of model provenance tracking. MLflow provides two sets of APIs for model provenance:
* **Logging API**: This **allows each run of the experiment or a model pipeline to log** the model artifact into the artifact store.
* **Registry API**: This **allows a centralized location to track the version** of the model and the stages of the model's life cycle (**None, Archived, Staging**, or **Production**).

### ***DIFFERENCE BETWEEN MODEL LOGGING AND MODEL REGISTRY***

Although every run of the experiment needs to be logged and the model needs to be saved in the artifact store, **not every instance of the model needs to be registered in the model registry**. That's because, for many early exploratory model experimentations, the model might not be good. Thus, it is not necessarily registered to track the version. **Only when a model has good offline performance and becomes a candidate for promoting to production do we need to register it in the model registry to go through the model promotion process**.
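
The "log everything, register selectively" policy described above can be sketched as a small gate in front of the registry call. The metric name, the 0.80 threshold, and both helper functions are hypothetical choices for illustration; `mlflow.register_model` is the same API used later in this chapter.

```python
def should_register(val_metrics, metric_name="accuracy", threshold=0.80):
    """Decide whether a logged model is good enough to be registered."""
    return val_metrics.get(metric_name, 0.0) >= threshold


def register_if_good(run_id, val_metrics, registered_name):
    """Register the run's logged model only when it clears the threshold."""
    if not should_register(val_metrics):
        return None
    # Imported lazily so the decision logic above can run without MLflow installed.
    import mlflow
    return mlflow.register_model(f"runs:/{run_id}/model", registered_name)


print(should_register({"accuracy": 0.62}))
print(should_register({"accuracy": 0.81}))
```

Every run is still logged to the tracking server; only runs that pass the gate create a new version in the model registry.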

Although MLflow's official API documentation **separates logging and registry into two components**, we will **refer to them together as model tracking functionality in MLflow**.

We already saw MLflow's auto-logging for the DL model build previously. Although auto-logging is powerful, there are two issues with the current version:

* It does not automatically register the model to the model registry.
* The logged model does not work out of the box with the original input data (in our case, an English sentence) if you just follow MLflow's suggestion to load it with the `mlflow.pyfunc.load_model` API. This limitation is probably due to the experimental nature of the current auto-logging APIs in MLflow.

Let's walk through an example to review MLflow's capabilities and auto-logging's limitations and how we can solve them:

1. Set up the following environment variables in your Bash terminal, where your MinIO and MySQL-based Docker component is running:
```bash
export MLFLOW_S3_ENDPOINT_URL=http://localhost:9000
export AWS_ACCESS_KEY_ID=minio
export AWS_SECRET_ACCESS_KEY=minio123
```
2. To follow along with this model tracking implementation, open the `dl_model_tracking.ipynb` notebook file in VS Code. You can find it in [this chapter's GitHub repository](https://github.com/PacktPublishing/Practical-Deep-Learning-at-Scale-with-MLFlow/blob/main/chapter03/dl_model_tracking.ipynb).

Note that, in the fourth cell of the `dl_model_tracking.ipynb` notebook, we need to point MLflow to the new tracking URI that we just set up in Docker and define a new experiment, as follows:
```py
EXPERIMENT_NAME = "mlflow_dl_model_chapter_03"
mlflow.set_tracking_uri('http://localhost')
mlflow.set_experiment(EXPERIMENT_NAME)
experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
```

3. We will still use the auto-logging capabilities provided by MLflow, but we will assign the run to a variable named `dl_model_tracking_run`:
```py
mlflow.pytorch.autolog()
with mlflow.start_run(experiment_id=experiment.experiment_id, run_name="chapter03") as dl_model_tracking_run:
    params = {
        "model_name": model_name,
        "num_epochs": num_epochs,
        "num_classes": num_classes,
        "train_size": len(train_dataset),
        "val_size": len(val_dataset),
        "metric_function" : "accuracy",
        "device": str(device),
        "num_gpus": torch.cuda.device_count(),
        "batch_size": batch_size,
        "learning_rate": 2e-5,
        "max_token_length": max_token_length
    }
    mlflow.log_params(params)
    print(f"Run ID: {dl_model_tracking_run.info.run_id}")

    with open("model_summary.txt", "w") as f:
        f.write(str(summary(model)))
    mlflow.log_artifact("model_summary.txt")


    print("\nStarting training...")
    for epoch in range(num_epochs):
        print(f"---⌛ Epoch {epoch+1}/{num_epochs} ---")
        train_loss = train_epoch(model, train_dataloader, optimizer, device)
        val_loss, val_accuracy, val_precision, val_recall, val_f1 = evaluate_epoch(model, val_dataloader, device)

        print(f"Train Loss: {train_loss:.4f}")
        print(f"Validation Loss: {val_loss:.4f}, Accuracy: {val_accuracy:.4f}, "
                f"Precision: {val_precision:.4f}, Recall: {val_recall:.4f}, F1: {val_f1:.4f}")

    # Save the trained model to MLflow.
    mlflow.pytorch.log_model(model, artifact_path="model")

    # --- 5. Testing ---
    # Once the fine-tuning step is completed, we measure the model's
    # performance on the held-out test set:
    print("\nStarting testing...")
    test_loss, test_accuracy, test_precision, test_recall, test_f1 = evaluate_epoch(model, test_dataloader, device)
    print(f"Test Loss: {test_loss:.4f}, Accuracy: {test_accuracy:.4f}, "
            f"Precision: {test_precision:.4f}, Recall: {test_recall:.4f}, F1: {test_f1:.4f}")

    # --- 6. Prediction Example ---
    print("\nExample Prediction on new data:")
    model.eval()
    new_reviews = [
        "This movie was absolutely fantastic! I loved every moment of it.",
        "Utterly disappointing. A complete waste of time.",
        "It was okay, nothing special.",
        "The acting was superb, but the plot was a bit weak."
    ]

    for review in new_reviews:
        encoding = tokenizer.encode_plus(
            review,
            add_special_tokens=True,
            max_length=max_token_length,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        input_ids = encoding['input_ids'].to(device)
        attention_mask = encoding['attention_mask'].to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            predicted_label_id = torch.argmax(logits, dim=-1).item()
            predicted_sentiment = "positive" if predicted_label_id == 1 else "negative"
            print(f"Review: '{review}' -> Predicted Sentiment: {predicted_sentiment}")
```
`dl_model_tracking_run` allows us to get the `run_id` parameter and other metadata about this run programmatically, as we will see in the next step. Once this code cell has been executed, we will have a trained model logged in the MLflow tracking server with all the required parameters and metrics. However, the model hasn't been registered yet. We can find the logged experiment in the MLflow web UI, along with all the relevant parameters and metrics, at http://localhost/#/experiments/1/runs/37a3fe9b6faf41d89001eca13ad6ca47. You can find the model artifacts in the MinIO storage backend. Go to http://localhost:9000/minio/mlflow/1/37a3fe9b6faf41d89001eca13ad6ca47/artifacts/model/ to see the storage UI.
4. Retrieve the run_id parameter from dl_model_tracking_run, as well as other metadata, as follows:
```py
run_id = dl_model_tracking_run.info.run_id
print("run_id: {}; lifecycle_stage: {}".format(run_id,
    mlflow.get_run(run_id).info.lifecycle_stage))
```
This will print out something like the following:
```bash
run_id: 37a3fe9b6faf41d89001eca13ad6ca47; lifecycle_stage: active
```
5. Retrieve the logged model by defining the logged model URI. This will allow us to reload the logged model at this specific location:
```py
logged_model = f'runs:/{run_id}/model'
```
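6. Use `mlflow.pytorch.load_model` with the `logged_model` URI to load the model back into memory, then tokenize a new input sentence and make a prediction, as follows:
```py
model = mlflow.pytorch.load_model(logged_model)
inputs = tokenizer('This is great news', return_tensors='pt').to(device)
with torch.no_grad():
    print('positive' if model(**inputs).logits.argmax(dim=-1).item() == 1 else 'negative')
```
This prints the predicted label. For a clearly positive sentence such as this one, we would expect the following:
```bash
positive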
```
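
The `runs:/<run_id>/<artifact_path>` URI used in steps 5 and 6 ties a logged model to the run that produced it, which is what makes it useful for provenance. The two helper functions below are hypothetical and only illustrate the structure of the URI. The artifact path must match the `artifact_path` passed to `mlflow.pytorch.log_model`.

```python
def logged_model_uri(run_id, artifact_path="model"):
    """Build the URI of a model logged under a given run."""
    return f"runs:/{run_id}/{artifact_path}"


def parse_logged_model_uri(uri):
    """Split a runs:/ URI back into (run_id, artifact_path)."""
    prefix = "runs:/"
    if not uri.startswith(prefix):
        raise ValueError(f"Not a runs:/ URI: {uri!r}")
    run_id, _, artifact_path = uri[len(prefix):].partition("/")
    return run_id, artifact_path


# The example run ID printed in step 4.
uri = logged_model_uri("37a3fe9b6faf41d89001eca13ad6ca47")
print(uri)
print(parse_logged_model_uri(uri))
```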

Let's put everything together. The complete notebook code and its output follow:

```py
import mlflow
import torch
from torchinfo import summary
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.optim import AdamW
from torch.nn.functional import cross_entropy
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import pandas as pd
import os
import zipfile
import requests
from tqdm.auto import tqdm

# import python dotenv
from dotenv import load_dotenv
# Load environment variables
load_dotenv('./mlflow_docker_setup/.env')
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")

os.environ['MLFLOW_S3_ENDPOINT_URL'] = "http://localhost:9000"
os.environ['AWS_ACCESS_KEY_ID'] = AWS_ACCESS_KEY_ID
os.environ['AWS_SECRET_ACCESS_KEY'] = AWS_SECRET_ACCESS_KEY
```
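
If the `.env` file is missing or incomplete, `os.getenv` returns `None`, and assigning `None` into `os.environ` fails with a `TypeError` that does not name the variable at fault. A small guard such as the following sketch can report the problem more clearly. The `require_env` helper is hypothetical and not part of the chapter's notebook.

```python
import os


def require_env(names, env=None):
    """Return the requested settings, or raise a clear error listing the missing ones."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}


REQUIRED = ["MLFLOW_S3_ENDPOINT_URL", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]

# Example values matching the local MinIO setup described earlier in this chapter.
example_env = {
    "MLFLOW_S3_ENDPOINT_URL": "http://localhost:9000",
    "AWS_ACCESS_KEY_ID": "minio",
    "AWS_SECRET_ACCESS_KEY": "minio123",
}
print(sorted(require_env(REQUIRED, example_env)))
```

In the notebook, calling `require_env(REQUIRED)` without the second argument would check the real process environment instead of the example dictionary.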

```py
# --- 1. Data Download Utility (similar to Flash's download_data) ---
def download_and_extract_zip(url, path):
    if not os.path.exists(path):
        os.makedirs(path)
    zip_file_name = os.path.join(path, "downloaded_data.zip")

    print(f"Downloading data from {url}...")
    response = requests.get(url, stream=True)
    total_size_in_bytes = int(response.headers.get('content-length', 0))
    block_size = 1024 # 1 Kibibyte
    progress_bar = tqdm(total=total_size_in_bytes, unit='iB', unit_scale=True)
    with open(zip_file_name, 'wb') as file:
        for data in response.iter_content(block_size):
            progress_bar.update(len(data))
            file.write(data)
    progress_bar.close()

    if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
        print("ERROR, something went wrong during download")
        return

    print(f"Extracting {zip_file_name} to {path}...")
    with zipfile.ZipFile(zip_file_name, 'r') as zip_ref:
        zip_ref.extractall(path)
    os.remove(zip_file_name) # Clean up the zip file
    print("Download and extraction complete.")

# --- 2. Custom Dataset for IMDB (replaces TextClassificationData) ---
class IMDBDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length

        # Ensure 'sentiment' is mapped to numerical labels
        # 0 for negative, 1 for positive
        self.label_map = {'negative': 0, 'positive': 1}
        self.dataframe['sentiment_label'] = self.dataframe['sentiment'].map(self.label_map)

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        review = str(self.dataframe.loc[idx, 'review'])
        label = self.dataframe.loc[idx, 'sentiment_label']

        encoding = self.tokenizer.encode_plus(
            review,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt', # Return PyTorch tensors
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }
```



```py
# --- 3. Training Function ---
# Once we have the data, we can fine-tune a foundation model. Our model is based on
# prajjwal1/bert-tiny, a much smaller BERT-like pretrained model located in the
# Hugging Face model repository: https://huggingface.co/prajjwal1/bert-tiny.
def train_epoch(model, dataloader, optimizer, device):
    model.train()
    total_loss = 0
    for batch_idx, batch in enumerate(tqdm(dataloader, desc="Training")):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()

        loss.backward()
        optimizer.step()

    return total_loss / len(dataloader)

# --- 4. Evaluation Function ---
def evaluate_epoch(model, dataloader, device):
    model.eval()
    total_loss = 0
    all_predictions = []
    all_true_labels = []

    with torch.no_grad():
        for batch_idx, batch in enumerate(tqdm(dataloader, desc="Evaluating")):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_loss += loss.item()

            logits = outputs.logits
            predictions = torch.argmax(logits, dim=-1).cpu().numpy()
            all_predictions.extend(predictions)
            all_true_labels.extend(labels.cpu().numpy())

    avg_loss = total_loss / len(dataloader)
    accuracy = accuracy_score(all_true_labels, all_predictions)
    # Get precision, recall, f1-score for positive class (label 1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        all_true_labels, all_predictions, average='binary', pos_label=1
    )

    return avg_loss, accuracy, precision, recall, f1
```
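
To make the binary metrics returned by `evaluate_epoch` less of a black box, the following pure-Python sketch recomputes precision, recall, and F1 for the positive class from raw label lists. The helper names and the toy labels are hypothetical. The last line checks the F1 definition against the test-set precision and recall reported in this section's run output.

```python
def binary_precision_recall_f1(y_true, y_pred, pos_label=1):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = f1_from(precision, recall) if precision + recall else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy labels, for illustration only.
print(binary_precision_recall_f1([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]))

# Test-set precision and recall from the run output in this section.
print(round(f1_from(0.7820, 0.8811), 4))
```

The second value agrees with the test F1 of 0.8286 reported in the run output, which confirms that the logged F1 is simply the harmonic mean of the logged precision and recall.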

```py
# --- 1. & 2. Data Download and Preparation ---
data_url = "https://pl-flash-data.s3.amazonaws.com/imdb.zip"
data_path = "./data/"
download_and_extract_zip(data_url, data_path)
```
Output:
```bash
Downloading data from https://pl-flash-data.s3.amazonaws.com/imdb.zip...
Extracting ./data/downloaded_data.zip to ./data/...
Download and extraction complete.
```


```py
train_df = pd.read_csv(os.path.join(data_path, "imdb/train.csv"))
val_df = pd.read_csv(os.path.join(data_path, "imdb/valid.csv"))
test_df = pd.read_csv(os.path.join(data_path, "imdb/test.csv"))

# Determine number of classes from the sentiment column
num_classes = train_df['sentiment'].nunique()
print(f"Number of classes: {num_classes}")

# Initialize Tokenizer
model_name = "prajjwal1/bert-tiny" # As used in Flash example
tokenizer = AutoTokenizer.from_pretrained(model_name)
max_token_length = 128 # A common max length for BERT-tiny, adjust as needed

# Create custom datasets
train_dataset = IMDBDataset(train_df, tokenizer, max_token_length)
val_dataset = IMDBDataset(val_df, tokenizer, max_token_length)
test_dataset = IMDBDataset(test_df, tokenizer, max_token_length)

# Create DataLoaders
batch_size = 128
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size)
```
Output:
```bash
Number of classes: 2
```


```py
# --- 3. Model Definition and Setup ---
# Use AutoModelForSequenceClassification for classification tasks with Hugging Face models
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_classes)

# Determine device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
model.to(device)

# Optimizer and Learning Rate
optimizer = AdamW(model.parameters(), lr=2e-5) # Common learning rate for fine-tuning BERT
```
Output:
```bash
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using device: cuda
```


```py
EXPERIMENT_NAME = "mlflow_dl_model_chapter_03"
mlflow.set_tracking_uri('http://localhost:80')
mlflow.set_experiment(EXPERIMENT_NAME)
experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
print("experiment_id:", experiment.experiment_id)
```
Output:
```bash
experiment_id: 1
```


```py
# --- 4. Training Loop ---

num_epochs = 10 # As specified in PyTorch Lightning example

with mlflow.start_run(experiment_id=experiment.experiment_id, run_name="chapter03") as dl_model_tracking_run:
    params = {
        "model_name": model_name,
        "num_epochs": num_epochs,
        "num_classes": num_classes,
        "train_size": len(train_dataset),
        "val_size": len(val_dataset),
        "metric_function" : "accuracy",
        "device": str(device),
        "num_gpus": torch.cuda.device_count(),
        "batch_size": batch_size,
        "learning_rate": 2e-5,
        "max_token_length": max_token_length
    }
    mlflow.log_params(params)
    print(f"Run ID: {dl_model_tracking_run.info.run_id}")

    with open("model_summary.txt", "w") as f:
        f.write(str(summary(model)))
    mlflow.log_artifact("model_summary.txt")

    print("\nStarting training...")
    for epoch in range(num_epochs):
        print(f"---⌛ Epoch {epoch+1}/{num_epochs} ---")
        train_loss = train_epoch(model, train_dataloader, optimizer, device)
        val_loss, val_accuracy, val_precision, val_recall, val_f1 = evaluate_epoch(model, val_dataloader, device)

        print(f"Train Loss: {train_loss:.4f}")
        print(f"Validation Loss: {val_loss:.4f}, Accuracy: {val_accuracy:.4f}, "
                f"Precision: {val_precision:.4f}, Recall: {val_recall:.4f}, F1: {val_f1:.4f}")

    # Save the trained model to MLflow.
    mlflow.pytorch.log_model(model, artifact_path="model")

    # --- 5. Testing ---
    # Once the fine-tuning step is completed, we measure the model's
    # performance on the held-out test set:
    print("\nStarting testing...")
    test_loss, test_accuracy, test_precision, test_recall, test_f1 = evaluate_epoch(model, test_dataloader, device)
    print(f"Test Loss: {test_loss:.4f}, Accuracy: {test_accuracy:.4f}, "
            f"Precision: {test_precision:.4f}, Recall: {test_recall:.4f}, F1: {test_f1:.4f}")

    # --- 6. Prediction Example ---
    print("\nExample Prediction on new data:")
    model.eval()
    new_reviews = [
        "This movie was absolutely fantastic! I loved every moment of it.",
        "Utterly disappointing. A complete waste of time.",
        "It was okay, nothing special.",
        "The acting was superb, but the plot was a bit weak."
    ]

    for review in new_reviews:
        encoding = tokenizer.encode_plus(
            review,
            add_special_tokens=True,
            max_length=max_token_length,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        input_ids = encoding['input_ids'].to(device)
        attention_mask = encoding['attention_mask'].to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            predicted_label_id = torch.argmax(logits, dim=-1).item()
            predicted_sentiment = "positive" if predicted_label_id == 1 else "negative"
            print(f"Review: '{review}' -> Predicted Sentiment: {predicted_sentiment}")
```



```bash
Run ID: fd21aa738a3c4d108e54d6a68d5dc356

Starting training...
---⌛ Epoch 1/10 ---
Train Loss: 0.6655
Validation Loss: 0.6159, Accuracy: 0.6880, Precision: 0.6947, Recall: 0.6471, F1: 0.6701
---⌛ Epoch 2/10 ---
Train Loss: 0.5721
Validation Loss: 0.5051, Accuracy: 0.7592, Precision: 0.7645, Recall: 0.7345, F1: 0.7492
---⌛ Epoch 3/10 ---
Train Loss: 0.4823
Validation Loss: 0.4627, Accuracy: 0.7788, Precision: 0.7599, Recall: 0.8015, F1: 0.7801
---⌛ Epoch 4/10 ---
Train Loss: 0.4370
Validation Loss: 0.4524, Accuracy: 0.7860, Precision: 0.7480, Recall: 0.8489, F1: 0.7953
---⌛ Epoch 5/10 ---
Train Loss: 0.4090
Validation Loss: 0.4365, Accuracy: 0.7956, Precision: 0.7662, Recall: 0.8382, F1: 0.8006
---⌛ Epoch 6/10 ---
Train Loss: 0.3872
Validation Loss: 0.4274, Accuracy: 0.8036, Precision: 0.7911, Recall: 0.8137, F1: 0.8023
---⌛ Epoch 7/10 ---
Train Loss: 0.3684
Validation Loss: 0.4246, Accuracy: 0.8044, Precision: 0.7724, Recall: 0.8513, F1: 0.8099
---⌛ Epoch 8/10 ---
Train Loss: 0.3572
Validation Loss: 0.4160, Accuracy: 0.8072, Precision: 0.7845, Recall: 0.8358, F1: 0.8093
---⌛ Epoch 9/10 ---
Train Loss: 0.3362
Validation Loss: 0.4163, Accuracy: 0.8124, Precision: 0.7862, Recall: 0.8472, F1: 0.8156
---⌛ Epoch 10/10 ---
Train Loss: 0.3198
Validation Loss: 0.4432, Accuracy: 0.8084, Precision: 0.7592, Recall: 0.8913, F1: 0.8200

Starting testing...
Test Loss: 0.4330, Accuracy: 0.8160, Precision: 0.7820, Recall: 0.8811, F1: 0.8286

Example Prediction on new data:
Review: 'This movie was absolutely fantastic! I loved every moment of it.' -> Predicted Sentiment: positive
Review: 'Utterly disappointing. A complete waste of time.' -> Predicted Sentiment: negative
Review: 'It was okay, nothing special.' -> Predicted Sentiment: negative
Review: 'The acting was superb, but the plot was a bit weak.' -> Predicted Sentiment: negative
🏃 View run chapter03 at: http://localhost:80/#/experiments/1/runs/fd21aa738a3c4d108e54d6a68d5dc356
🧪 View experiment at: http://localhost:80/#/experiments/1
```
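
The training loop above prints its per-epoch results. One way to also record them in the tracking server, so that they appear as metric curves for the run, is an explicit `mlflow.log_metrics` call with the epoch as the step. The sketch below assumes it is called inside the active `mlflow.start_run` block. The `epoch_metrics` and `log_epoch` helpers are hypothetical and not part of the notebook.

```python
def epoch_metrics(train_loss, val_loss, val_accuracy, val_precision, val_recall, val_f1):
    """Bundle one epoch's results into a flat dict of metric names to floats."""
    return {
        "train_loss": float(train_loss),
        "val_loss": float(val_loss),
        "val_accuracy": float(val_accuracy),
        "val_precision": float(val_precision),
        "val_recall": float(val_recall),
        "val_f1": float(val_f1),
    }


def log_epoch(metrics, epoch):
    """Send one epoch's metrics to the active MLflow run."""
    # Imported lazily so the helper above can be used without MLflow installed.
    import mlflow
    mlflow.log_metrics(metrics, step=epoch)


# Values from epoch 1 of the run output above.
first_epoch = epoch_metrics(0.6655, 0.6159, 0.6880, 0.6947, 0.6471, 0.6701)
print(first_epoch)
```

Inside the loop, this would amount to calling `log_epoch(epoch_metrics(train_loss, val_loss, val_accuracy, val_precision, val_recall, val_f1), epoch)` right after the two print statements.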


```py
run_id = dl_model_tracking_run.info.run_id
print("run_id: {}; lifecycle_stage: {}".format(run_id,
    mlflow.get_run(run_id).info.lifecycle_stage))
```
Output:
```bash
run_id: fd21aa738a3c4d108e54d6a68d5dc356; lifecycle_stage: active
```


```py
# use the run_id to construct a logged_model URI. An example is shown here:
# logged_model = 'runs:/37a3fe9b6faf41d89001eca13ad6ca47/model'
import mlflow.pytorch

logged_model = f'runs:/{run_id}/model'

# Load model as a pytorch model, not as the pyfunc model
model = mlflow.pytorch.load_model(logged_model)

new_reviews = [
    "This movie was absolutely fantastic! I loved every moment of it.",
    "Utterly disappointing. A complete waste of time.",
    "It was okay, nothing special.",
    "The acting was superb, but the plot was a bit weak."
]

for review in new_reviews:
    encoding = tokenizer.encode_plus(
        review,
        add_special_tokens=True,
        max_length=max_token_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt',
    )
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)

    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predicted_label_id = torch.argmax(logits, dim=-1).item()
        predicted_sentiment = "positive" if predicted_label_id == 1 else "negative"
        print(f"Review: '{review}' -> Predicted Sentiment: {predicted_sentiment}")
```
Output:
```bash
Review: 'This movie was absolutely fantastic! I loved every moment of it.' -> Predicted Sentiment: positive
Review: 'Utterly disappointing. A complete waste of time.' -> Predicted Sentiment: negative
Review: 'It was okay, nothing special.' -> Predicted Sentiment: negative
Review: 'The acting was superb, but the plot was a bit weak.' -> Predicted Sentiment: negative
```


### ***MLFLOW.PYTORCH.LOAD_MODEL VERSUS MLFLOW.PYFUNC.LOAD_MODEL***

By default, and in the MLflow experiment tracking page's artifact section, if you have a logged model, MLflow will recommend using `mlflow.pyfunc.load_model` to load it back for prediction. However, this only works for inputs such as a pandas DataFrame, NumPy array, or tensor; it does not work for raw NLP text input. This is because `mlflow.pyfunc.load_model` is designed around a standardized interface, with the known limitation that it only accepts numeric input formats, so text and image data require a preprocessing step such as tokenization first. Since the model was saved with `mlflow.pytorch.log_model` (which is also what auto-logging for PyTorch Lightning uses), the correct way to load it back is `mlflow.pytorch.load_model`, as we have shown here. This returns the native PyTorch model, which we call with the tokenized review, exactly as we did right after training.

With that, we have successfully logged the model and loaded the model back to make a prediction. If we think this model is performing well enough, then we can register it.

```py
# register the model
model_registry_version = mlflow.register_model(logged_model, 'nlp_dl_model')
print(f'Model Name: {model_registry_version.name}')
print(f'Model Version: {model_registry_version.version}')
```
Output:
```bash
Successfully registered model 'nlp_dl_model'.
2025/05/25 17:26:27 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: nlp_dl_model, version 1
Created version '1' of model 'nlp_dl_model'.
Model Name: nlp_dl_model
Model Version: 1
```


![alt text](<MLflow tracking server web UI showing the newly registered model.png>)

**By default, a newly registered model's stage is `None`**, as shown in the preceding screenshot.

By **having a model registered with a version number and stage label**, we have **laid the foundation for deployment to staging (also known as pre-production) and then production**. We will discuss how to perform model deployment based on registered models later.
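
As a preview of what promotion could look like, the sketch below validates a stage label and then asks the registry to move a model version to that stage. The `validate_stage` and `promote` helpers are hypothetical. `MlflowClient.transition_model_version_stage` is the MLflow client call for stage changes, although, depending on your MLflow version, it may emit a deprecation warning.

```python
# The four stage labels listed earlier in this chapter.
VALID_STAGES = ("None", "Staging", "Production", "Archived")


def validate_stage(stage):
    """Return the stage unchanged, or raise if it is not a known stage label."""
    if stage not in VALID_STAGES:
        raise ValueError(f"Unknown stage {stage!r}; expected one of {VALID_STAGES}")
    return stage


def promote(name, version, stage):
    """Move a registered model version to a new stage."""
    stage = validate_stage(stage)
    # Imported lazily so the validation logic can run without MLflow installed.
    from mlflow.tracking import MlflowClient
    client = MlflowClient()
    return client.transition_model_version_stage(name=name, version=version, stage=stage)


print(validate_stage("Staging"))
```

With the tracking server running, `promote("nlp_dl_model", 1, "Staging")` would move the version registered above into the staging stage.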

At this point, we have solved the two issues we raised at the beginning of this section regarding the limitations of auto-logging:

* **How to load a logged DL PyTorch model** using the `mlflow.pytorch.load_model` API instead of the `mlflow.pyfunc.load_model` API
* **How to register a logged DL PyTorch model** using the `mlflow.register_model` API
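Both APIs above take a model URI as their first argument. The following is a minimal sketch of how such URIs are composed. It assumes the auto-logged model sits under the `model` artifact path of its run; the run ID shown is a hypothetical placeholder, and the two helper functions are illustrative names, not part of MLflow.

```py
def run_model_uri(run_id: str, artifact_path: str = "model") -> str:
    """Build the URI of a model logged under a run: runs:/<run_id>/<artifact_path>."""
    return f"runs:/{run_id}/{artifact_path}"


def registry_model_uri(name: str, version: int) -> str:
    """Build the URI of a registered model version: models:/<name>/<version>."""
    return f"models:/{name}/{version}"


# Hypothetical run ID, used only for illustration; use the ID of your own run.
run_id = "0123456789abcdef"

logged_model = run_model_uri(run_id)
registered_model = registry_model_uri("nlp_dl_model", 1)

print(logged_model)
print(registered_model)

# With MLflow installed and a tracking server configured, these URIs would be used as:
#   model = mlflow.pytorch.load_model(logged_model)
#   mlflow.register_model(logged_model, "nlp_dl_model")
#   model = mlflow.pytorch.load_model(registered_model)
```

Once a model is registered, the `models:/` form lets you refer to it by name and version instead of by run ID.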

### ***CHOICES OF MLFLOW DL MODEL LOGGING APIS***

For DL models, the **auto-logging for PyTorch only works for the PyTorch Lightning framework**. There are **other DL frameworks,** such as **TensorFlow, Keras, fastai, and MXNet**, that are also **supported by the corresponding MLflow auto-logging APIs**.

For **other PyTorch-based libraries such as Hugging Face, we can use MLflow's `mlflow.pyfunc.log_model` to log the model**, especially when we need multi-step DL model pipelines. We will implement such custom MLflow model flavors later in this book. If you don't want to use auto-logging for PyTorch, then you can use `mlflow.pytorch.log_model` directly. PyTorch's auto-logging uses `mlflow.pytorch.log_model` inside its implementation (see the official MLflow open source implementation [here](https://github.com/mlflow/mlflow/blob/290bf3d54d1e5ce61944455cb302a5d6390107f0/mlflow/pytorch/_pytorch_autolog.py#L314)).

Calling the model logging API directly also gives us a way to log and register the model in a single call. You can use the following line of code to both log and register the trained model:
```py
mlflow.pytorch.log_model(pytorch_model=trainer.model, artifact_path='dl_model', registered_model_name='nlp_dl_model')
```
Note that this line of code does not log any parameters or metrics of the model.
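To keep parameters and metrics alongside the model, the same call can be grouped with `mlflow.log_params` and `mlflow.log_metrics` inside one run. The following is a minimal sketch, not the notebook's own code: the helper name `log_register_with_context` is hypothetical, the `artifact_path` and registered model name are copied from the one-liner above, and it assumes that a tracking server is configured and that no other run is currently active.

```py
from typing import Any, Dict


def log_register_with_context(
    model: Any,
    params: Dict[str, Any],
    metrics: Dict[str, float],
    registered_model_name: str = "nlp_dl_model",
) -> str:
    """Log params, metrics, and the model in one run, then return the run ID."""
    # Imported inside the function so that defining this helper does not
    # itself require MLflow to be installed (an illustrative choice).
    import mlflow

    with mlflow.start_run() as run:
        mlflow.log_params(params)    # hyperparameters as key-value pairs
        mlflow.log_metrics(metrics)  # metric values must be numeric
        mlflow.pytorch.log_model(
            pytorch_model=model,
            artifact_path="dl_model",
            registered_model_name=registered_model_name,
        )
        return run.info.run_id


# Example call, assuming `trainer`, `params`, and `metrics` exist in your notebook:
#   run_id = log_register_with_context(trainer.model, params, metrics)
```

Grouping the three calls in a single run keeps the registered model version linked to the parameters and metrics that produced it.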

With that, **we have not only logged many experiments and models in the tracking server for offline experimentation but also registered performant models for production deployment in the future with version control and provenance tracking.** We can now answer some of the provenance questions that we posed at the beginning of this chapter.


The **why** and **where** provenance questions are yet to be fully answered; we will address them later in this book. The **why** provenance question for the production model **can only be tracked and logged when the model is ready for deployment**, where **we need to add comments and reasons to justify the model's deployment**. The **where** provenance question can be answered fully **when we have a multi-step model pipeline**. Here, however, we only have a single-step pipeline, which is the simplest case. **A multi-step pipeline contains explicitly separate, modularized code that specifies which step performs what functionality, so that we can easily change the detailed implementation of any step without changing the flow of the pipeline**. In the next two sections, we will investigate how we can track the metrics and parameters of models without using auto-logging.

## **Tracking model metrics**

The default metric for the text classification model in the PyTorch `lightning-flash` package is **Accuracy**. If we want to change the metric to **F1 score** (a harmonic mean of precision and recall), which is a very common metric for measuring a classifier's performance, then we need to change the configuration of the classifier model before we start the model training process. Let's learn how to make this change and then use MLflow's non-auto-logging API to log the metrics:

1. When defining the classifier variable, instead of using the default metric, we will pass a metric object, `torchmetrics.F1`, as the `metrics` argument, as follows:
```py
classifier_model = TextClassifier(backbone="prajjwal1/bert-tiny", num_classes=datamodule.num_classes, metrics=torchmetrics.F1(datamodule.num_classes))
```
This uses the `F1` module, one of the built-in metrics of `torchmetrics`, and takes the number of classes in the data we need to classify as its parameter (newer `torchmetrics` releases name this class `F1Score`). This makes sure that the model is trained and tested using this new metric. You will see an output similar to the following:

```py
{'test_cross_entropy': 0.785443127155304, 'test_f1': 0.5343999862670898}
```

This shows that the model training and testing were using the F1 score as the metric, not the default accuracy metric. For more information on how you can use torchmetrics for customized metrics, please consult [its documentation site](https://torchmetrics.readthedocs.io/en/latest/).

2. Now, if **we want to log all the metrics to the MLflow tracking server, including the training, validation, and testing metrics**, we need to get all the current metrics by reading the trainer's `callback_metrics` attribute, as follows:
```py
cur_metrics = trainer.callback_metrics
```

Then, we need to cast all the metric values to float to make sure that they are compatible with the MLflow `log_metrics` API:
```python
# Cast each metric value (a tensor) to a plain Python float
metrics = {name: float(value) for name, value in cur_metrics.items()}
```

3. Now, we can call MLflow's `log_metrics` to log all the metrics in the tracking server:
```py    
mlflow.log_metrics(metrics)
```

You will see the following metrics after using the F1 score as the classifier's metric, which will be logged in MLflow's tracking server:
```py
{
   'train_f1': 0.5838666558265686,
   'train_f1_step': 0.75,
   'train_cross_entropy': 0.7465656399726868,
   'train_cross_entropy_step': 0.30964696407318115,
   'val_f1': 0.5203999876976013,
   'val_cross_entropy': 0.8168156743049622,
   'train_f1_epoch': 0.5838666558265686,
   'train_cross_entropy_epoch': 0.7465656399726868,
   'test_f1': 0.5343999862670898,
   'test_cross_entropy': 0.785443127155304
}
```

Using MLflow's `log_metrics` API **gives us more control with additional lines of code**, but **if we are satisfied with its auto-logging capabilities, then the only thing we need to change is what metric we want to use for the model training and testing processes**. In this case, we only need to define a new metric to use when declaring a new DL model (that is, use the F1 score instead of the default accuracy metric).
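The cast-then-log steps can be wrapped in a small helper. The sketch below runs without a trained model: `FakeTensor` is a hypothetical stand-in for the single-element tensors found in `trainer.callback_metrics`, and `to_float_metrics` is an illustrative name rather than an MLflow or Lightning API.

```py
from typing import Any, Dict, Mapping


def to_float_metrics(raw_metrics: Mapping[str, Any]) -> Dict[str, float]:
    """Cast every metric value to a plain Python float."""
    return {name: float(value) for name, value in raw_metrics.items()}


class FakeTensor:
    """Hypothetical stand-in for a single-element tensor, for illustration only."""

    def __init__(self, value: float) -> None:
        self._value = value

    def __float__(self) -> float:
        return float(self._value)


raw = {"test_f1": FakeTensor(0.5344), "test_cross_entropy": FakeTensor(0.7854)}
metrics = to_float_metrics(raw)
print(metrics)

# With MLflow and a fitted trainer available, the same helper would be used as:
#   mlflow.log_metrics(to_float_metrics(trainer.callback_metrics))
```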

4. If you want to **track multiple model metrics simultaneously, such as the F1 score, accuracy, precision, and recall, then the only thing you need to do is define a Python list of metrics you want to compute and track**, as follows:

```python
list_of_metrics = [
   torchmetrics.Accuracy(),
   torchmetrics.F1(num_classes=datamodule.num_classes),
   torchmetrics.Precision(num_classes=datamodule.num_classes),
   torchmetrics.Recall(num_classes=datamodule.num_classes)
]
```
Then, in the model initialization statement, instead of passing a single metric to the `metrics` parameter, you can pass the `list_of_metrics` Python list that we just defined, as follows:
```py
classifier_model = TextClassifier(backbone="prajjwal1/bert-tiny", num_classes=datamodule.num_classes, metrics=list_of_metrics)
```
No more changes need to be made to the rest of the code. In the [`dl_model-non-auto-tracking.ipynb`](https://github.com/PacktPublishing/Practical-Deep-Learning-at-Scale-with-MLFlow/blob/main/chapter03/dl_model-non-auto-tracking.ipynb) notebook, you will notice that the preceding line is commented out by default. To use it, uncomment it and then comment out the previous single-metric line:
```py
classifier_model = TextClassifier(backbone="prajjwal1/bert-tiny", num_classes=datamodule.num_classes, metrics=torchmetrics.F1Score(datamodule.num_classes))
```

Then, when you run the rest of the notebook, you will get the model testing reports, along with the following metrics, in the notebook's output:
```python
{
   'test_accuracy': 0.6424000263214111,
   'test_cross_entropy': 0.6315688490867615,
   'test_f1': 0.6424000263214111,
   'test_precision': 0.6424000263214111,
   'test_recall': 0.6424000263214111
}
```
**You may notice that the numbers for accuracy, F1, precision, and recall are the same. This is because, by default, torchmetrics uses a micro-average method, which computes a single scalar average score for all the classes by counting total true positives, false negatives, and false positives**. In single-label classification, every misclassified sample counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the micro-averaged precision, recall, and F1 score all reduce to accuracy. Scikit-learn has an average option called `binary` that outputs only the score for the positive label when the model is a [binary classification model](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#).

However, `torchmetrics` does not support a binary average method for a binary classification model. The only alternative is the `none` method, which computes and returns the metric for each class separately, even for a binary classification model, so it does not produce a single scalar number. You can, however, always call `scikit-learn`'s metrics API to compute an F1 score or other metrics with the binary average method by passing two lists of values. Here, we can use `y_true` and `y_predict`, where `y_true` is the list of ground truth label values and `y_predict` is the list of model-predicted label values. This is a good exercise to try out, as it is common practice for all ML models and not special treatment for a DL model.
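To make the difference between the micro and binary averaging methods concrete, here is a minimal sketch in plain Python. The label lists are made-up illustrative values, and the helper functions are written from the textbook definitions rather than taken from `torchmetrics` or `scikit-learn`.

```py
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for one label."""
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == label and truth == label:
            tp += 1
        elif pred == label:
            fp += 1
        elif truth == label:
            fn += 1
    return tp, fp, fn


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0.0 when the denominator is 0."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Sum the counts over all labels first, then compute a single F1."""
    labels = set(y_true) | set(y_pred)
    counts = [confusion_counts(y_true, y_pred, label) for label in labels]
    tp, fp, fn = (sum(column) for column in zip(*counts))
    return f1_from_counts(tp, fp, fn)


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], pos_label: int = 1) -> float:
    """Compute F1 for the positive label only."""
    return f1_from_counts(*confusion_counts(y_true, y_pred, pos_label))


# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_predict = [1, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_predict)) / len(y_true)
print(accuracy, micro_f1(y_true, y_predict), binary_f1(y_true, y_predict))

# The scikit-learn equivalent of binary_f1 would be:
#   sklearn.metrics.f1_score(y_true, y_predict, average="binary")
```

On this data, the micro-averaged F1 score matches the accuracy, while the binary F1 score for the positive label is lower because it ignores the correctly predicted negatives.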

## **Tracking model parameters**

As we have already seen, **there are lots of benefits of using auto-logging in MLflow**, but **if we want to track additional model parameters, we can either use MLflow to log additional parameters on top of what auto-logging records, or directly use MLflow to log all the parameters we want without using auto-logging at all**.

Let's walk through a notebook without using MLflow auto-logging. If we want to have full control of what parameters will be logged by MLflow, we can use two APIs: `mlflow.log_param` and `mlflow.log_params`. The **first one logs a single pair of key-value parameters**, while **the second logs an entire dictionary of key-value parameters**. So, what kind of parameters might we be interested in tracking? The following answers this:

* **Model hyperparameters**: Hyperparameters are defined before the learning process begins, which means they control how the learning process learns. These parameters can be tuned and can directly affect how well a model trains. In a DL model, the list of hyperparameters includes the backbone language model, learning rate, loss function, the optimizer to be used, and many more. MLflow's auto-logging does not automatically log all the hyperparameters, so this is an opportunity for us to directly use MLflow's `log_params` API to record them in the experiment.
* **Model parameters**: These parameters are learned during the model training process. For a DL model, these usually refer to the neural network weights that are learned during training. We don't need to log these weight parameters individually since they are already in the logged DL model.
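Hyperparameters are often nested, for example, optimizer settings inside a model configuration, whereas `mlflow.log_params` takes a dictionary of key-value pairs. One option is to flatten the nested structure first so that each setting becomes its own searchable parameter. The following is a minimal sketch of that idea: `flatten_params` is a hypothetical helper, not an MLflow API, and the hyperparameter values are illustrative.

```py
from typing import Any, Dict, Mapping


def flatten_params(nested: Mapping[str, Any], prefix: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested hyperparameter mappings into a single-level dictionary."""
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        full_key = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, Mapping):
            # Recurse into nested settings, extending the key prefix
            flat.update(flatten_params(value, prefix=full_key, sep=sep))
        else:
            flat[full_key] = value
    return flat


# Illustrative values only; substitute your own model's hyperparameters.
hyperparams = {
    "backbone": "prajjwal1/bert-tiny",
    "max_epochs": 3,
    "optimizer": {"name": "Adam", "lr": 2e-5},
}

flat_params = flatten_params(hyperparams)
print(flat_params)

# With MLflow available and a run active, these would be logged as:
#   mlflow.log_params(flat_params)                         # whole dictionary at once
#   mlflow.log_param("backbone", hyperparams["backbone"])  # a single key-value pair
```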

Let's log these hyperparameters using MLflow's `log_params` API, as follows:
```py
params = {
    "model_name": model_name,
    "num_epochs": num_epochs,
    "num_classes": num_classes,
    "train_size": len(train_dataset),
    "val_size": len(val_dataset),
    "metric_function": "accuracy",
    "device": str(device),
    "num_gpus": torch.cuda.device_count(),
    "batch_size": batch_size,
    "learning_rate": 2e-5,
    "max_token_length": max_token_length
}
```
Note that here, we log the model name, the number of epochs and classes, the training and validation set sizes, the metric function, the device and GPU count, the batch size, the learning rate, and the maximum token length. The single line `mlflow.log_params(params)` logs all the key-value pairs in the `params` dictionary to the MLflow tracking server. If you see the following hyperparameters in the MLflow tracking server, then it means it works!

In [22]:
mlflow.log_params(params)