# Chapter 1: Deep Learning Life Cycle and MLOps Challenges

While research in deep learning (DL) has made giant leaps, bringing DL models from offline experimentation to production, and continuously improving them to deliver sustainable value, is still a challenge. For example, a VentureBeat article (https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/) reported that 87% of data science projects never make it to production. While there might be business reasons for such a low production rate, a major contributing factor is the difficulty caused by the lack of experiment management and of a mature platform for model production and feedback.

This chapter will help us understand these challenges and bridge the gaps by learning the concepts, steps, and components commonly used in the full life cycle of DL model development. We will also learn about the challenges of an emerging field known as Machine Learning Operations (MLOps), which aims to standardize and automate ML life cycle development, deployment, and operation. A solid understanding of these challenges will motivate us to learn the skills presented in the rest of this book using MLflow, an open source platform for the full ML life cycle. The business value of adopting MLOps best practices is considerable: faster time-to-market for model-derived product features, lower operating costs, agile A/B testing, and strategic decision making that ultimately improves customer experience.

## **Understanding the DL life cycle and MLOps challenges**

Most of the successful DL models deployed in production today follow these two steps:

1. **Self-supervised learning**: This refers to the pretraining of a model in a data-rich domain that does not require labeled data. This step produces a pretrained model, also called a foundation model, such as BERT and GPT-3 for NLP, or VGG nets for computer vision.
2. **Transfer learning**: This refers to the fine-tuning of the pretrained model on a specific prediction task, such as text sentiment classification, which requires labeled training data.

## **DL over classical ML**

Classical ML model development usually requires a feature engineering step to extract and transform raw data into features before training a model such as a decision tree or logistic regression. DL, in contrast, can learn the features automatically, which is especially attractive for modeling unstructured data such as text, images, video, audio, and speech. Because of this characteristic, DL is also called representation learning. In addition, DL is usually data- and compute-intensive, requiring Graphics Processing Units (GPUs), Tensor Processing Units (TPUs), or other types of hardware accelerators for at-scale training and inference. Explainability is also harder to implement for DL models than for traditional ML models, although recent progress has made it possible.
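To make the contrast concrete, the following is a minimal sketch of the kind of hand-written feature extraction a classical ML pipeline might need before training a logistic regression or decision tree. The word lists and the `extract_features` helper are hypothetical and chosen only for illustration; a DL model such as the one fine-tuned later in this chapter learns its own representation from the raw text instead.

```python
import re
from collections import Counter

# Hypothetical, hand-picked lexicons: exactly the kind of manual work
# that feature engineering involves and that DL tries to avoid.
POSITIVE_WORDS = {"fantastic", "loved", "superb"}
NEGATIVE_WORDS = {"disappointing", "waste", "weak"}


def extract_features(text):
    """Turn a raw review into a small dictionary of numeric features."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {
        "num_tokens": len(tokens),
        "num_positive": sum(counts[word] for word in POSITIVE_WORDS),
        "num_negative": sum(counts[word] for word in NEGATIVE_WORDS),
        "has_exclamation": int("!" in text),
    }


features = extract_features(
    "This movie was absolutely fantastic! I loved every moment of it."
)
print(features)
```

Every new signal (negation, sarcasm, domain vocabulary) would require another hand-written rule in a pipeline like this, which is the cost that learned representations are meant to remove.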

## **Implementing a basic DL sentiment classifier**

In [4]:
!uv add torchmetrics torchinfo transformers scikit-learn pandas requests tqdm
!uv add torch torchvision torchaudio

In [8]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.optim import AdamW
from torch.nn.functional import cross_entropy
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import pandas as pd
import os
import zipfile
import requests
from tqdm.auto import tqdm

# --- 1. Data Download Utility (similar to Flash's download_data) ---
def download_and_extract_zip(url, path):
    if not os.path.exists(path):
        os.makedirs(path)
    zip_file_name = os.path.join(path, "downloaded_data.zip")

    print(f"Downloading data from {url}...")
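    response = requests.get(url, stream=True, timeout=30)
    response.raise_for_status()  # Fail early on HTTP errors instead of saving an error page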
    total_size_in_bytes = int(response.headers.get('content-length', 0))
    block_size = 1024 # 1 Kibibyte
    progress_bar = tqdm(total=total_size_in_bytes, unit='iB', unit_scale=True)
    with open(zip_file_name, 'wb') as file:
        for data in response.iter_content(block_size):
            progress_bar.update(len(data))
            file.write(data)
    progress_bar.close()

    if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
        print("ERROR, something went wrong during download")
        return

    print(f"Extracting {zip_file_name} to {path}...")
    with zipfile.ZipFile(zip_file_name, 'r') as zip_ref:
        zip_ref.extractall(path)
    os.remove(zip_file_name) # Clean up the zip file
    print("Download and extraction complete.")

# --- 2. Custom Dataset for IMDB (replaces TextClassificationData) ---
class IMDBDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length):
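        # Re-index so that .loc[idx] lines up with positional indices from the
        # DataLoader; reset_index also returns a new DataFrame, so the caller's
        # DataFrame is not modified when the label column is added below
        self.dataframe = dataframe.reset_index(drop=True)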
        self.tokenizer = tokenizer
        self.max_length = max_length

        # Ensure 'sentiment' is mapped to numerical labels
        # 0 for negative, 1 for positive
        self.label_map = {'negative': 0, 'positive': 1}
        self.dataframe['sentiment_label'] = self.dataframe['sentiment'].map(self.label_map)


    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        review = str(self.dataframe.loc[idx, 'review'])
        label = self.dataframe.loc[idx, 'sentiment_label']

        # Call the tokenizer directly; encode_plus is a deprecated alias
        encoding = self.tokenizer(
            review,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt', # Return PyTorch tensors
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }
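
The `max_length`, `padding`, and `truncation` arguments used in the dataset above make every example the same length so that examples can be stacked into a batch. The following pure-Python sketch mimics that effect on a list of token IDs. The `pad_or_truncate` helper and the sample IDs are hypothetical; a real Hugging Face tokenizer additionally keeps its special tokens in place when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Truncation: drop everything beyond max_length
    input_ids = list(token_ids[:max_length])
    # The attention mask marks real tokens with 1 ...
    attention_mask = [1] * len(input_ids)
    # ... and padding positions with 0, so the model can ignore them
    num_padding = max_length - len(input_ids)
    return input_ids + [pad_id] * num_padding, attention_mask + [0] * num_padding


short_ids, short_mask = pad_or_truncate([5, 8, 13], max_length=5)
long_ids, long_mask = pad_or_truncate([5, 8, 13, 21, 34, 55], max_length=4)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

A smaller `max_length` speeds up training but discards more of each long review, so it is a trade-off worth recording with every experiment.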

In [9]:
# --- 3. Training Function ---
# Once we have the data, we can fine-tune a foundation model.
# The backbone is prajjwal1/bert-tiny, a much smaller BERT-like pretrained
# model located in the Hugging Face model repository
# (https://huggingface.co/prajjwal1/bert-tiny).
# train_epoch runs one pass over the training data and updates the weights;
# evaluate_epoch computes the loss and classification metrics without updating them.
def train_epoch(model, dataloader, optimizer, device):
    model.train()
    total_loss = 0
    for batch_idx, batch in enumerate(tqdm(dataloader, desc="Training")):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()

        loss.backward()
        optimizer.step()

    return total_loss / len(dataloader)

# --- 4. Evaluation Function ---
def evaluate_epoch(model, dataloader, device):
    model.eval()
    total_loss = 0
    all_predictions = []
    all_true_labels = []

    with torch.no_grad():
        for batch_idx, batch in enumerate(tqdm(dataloader, desc="Evaluating")):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_loss += loss.item()

            logits = outputs.logits
            predictions = torch.argmax(logits, dim=-1).cpu().numpy()
            all_predictions.extend(predictions)
            all_true_labels.extend(labels.cpu().numpy())

    avg_loss = total_loss / len(dataloader)
    accuracy = accuracy_score(all_true_labels, all_predictions)
    # Get precision, recall, f1-score for positive class (label 1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        all_true_labels, all_predictions, average='binary', pos_label=1
    )

    return avg_loss, accuracy, precision, recall, f1


In [10]:

# --- Main Script ---
# if __name__ == "__main__":
# --- 1. & 2. Data Download and Preparation ---
data_url = "https://pl-flash-data.s3.amazonaws.com/imdb.zip"
data_path = "./data/"
download_and_extract_zip(data_url, data_path)

train_df = pd.read_csv(os.path.join(data_path, "imdb/train.csv"))
val_df = pd.read_csv(os.path.join(data_path, "imdb/valid.csv"))
test_df = pd.read_csv(os.path.join(data_path, "imdb/test.csv"))

# Determine number of classes from the sentiment column
num_classes = train_df['sentiment'].nunique()
print(f"Number of classes: {num_classes}")

# Initialize Tokenizer
model_name = "prajjwal1/bert-tiny" # As used in Flash example
tokenizer = AutoTokenizer.from_pretrained(model_name)
max_token_length = 128 # A common max length for BERT-tiny, adjust as needed

# Create custom datasets
train_dataset = IMDBDataset(train_df, tokenizer, max_token_length)
val_dataset = IMDBDataset(val_df, tokenizer, max_token_length)
test_dataset = IMDBDataset(test_df, tokenizer, max_token_length)

# Create DataLoaders
batch_size = 64
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size)

# --- 3. Model Definition and Setup ---
# Use AutoModelForSequenceClassification for classification tasks with Hugging Face models
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_classes)

# Determine device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
model.to(device)

# Optimizer and Learning Rate
optimizer = AdamW(model.parameters(), lr=2e-5) # Common learning rate for fine-tuning BERT

Downloading data from https://pl-flash-data.s3.amazonaws.com/imdb.zip...
Extracting ./data/downloaded_data.zip to ./data/...
Download and extraction complete.
Number of classes: 2
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using device: cuda

In [11]:
# --- 4. Training Loop ---
num_epochs = 10 # As specified in PyTorch Lightning example

print("\nStarting training...")
for epoch in range(num_epochs):
    print(f"--- Epoch {epoch+1}/{num_epochs} ---")
    train_loss = train_epoch(model, train_dataloader, optimizer, device)
    val_loss, val_accuracy, val_precision, val_recall, val_f1 = evaluate_epoch(model, val_dataloader, device)

    print(f"Train Loss: {train_loss:.4f}")
    print(f"Validation Loss: {val_loss:.4f}, Accuracy: {val_accuracy:.4f}, "
            f"Precision: {val_precision:.4f}, Recall: {val_recall:.4f}, F1: {val_f1:.4f}")



Starting training...
--- Epoch 1/10 ---
Train Loss: 0.6459
Validation Loss: 0.5599, Accuracy: 0.7236, Precision: 0.7689, Recall: 0.6225, F1: 0.6880
--- Epoch 2/10 ---
Train Loss: 0.5087
Validation Loss: 0.4703, Accuracy: 0.7748, Precision: 0.7738, Recall: 0.7631, F1: 0.7684
--- Epoch 3/10 ---
Train Loss: 0.4406
Validation Loss: 0.4451, Accuracy: 0.7872, Precision: 0.7932, Recall: 0.7647, F1: 0.7787
--- Epoch 4/10 ---
Train Loss: 0.4091
Validation Loss: 0.4278, Accuracy: 0.7980, Precision: 0.8201, Recall: 0.7525, F1: 0.7848
--- Epoch 5/10 ---
Train Loss: 0.3804
Validation Loss: 0.4183, Accuracy: 0.8048, Precision: 0.8002, Recall: 0.8015, F1: 0.8008
--- Epoch 6/10 ---
Train Loss: 0.3591
Validation Loss: 0.4171, Accuracy: 0.8056, Precision: 0.7838, Recall: 0.8325, F1: 0.8074
--- Epoch 7/10 ---
Train Loss: 0.3371
Validation Loss: 0.4123, Accuracy: 0.8196, Precision: 0.8229, Recall: 0.8047, F1: 0.8137
--- Epoch 8/10 ---
Train Loss: 0.3143
Validation Loss: 0.4205, Accuracy: 0.8100, Precision: 0.8484, Recall: 0.7451, F1: 0.7934
--- Epoch 9/10 ---
Train Loss: 0.2931
Validation Loss: 0.4285, Accuracy: 0.8144, Precision: 0.8578, Recall: 0.7443, F1: 0.7970
--- Epoch 10/10 ---
Train Loss: 0.2736
Validation Loss: 0.4441, Accuracy: 0.8132, Precision: 0.8636, Recall: 0.7345, F1: 0.7938
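
One way to read the log above: the training loss falls in every epoch, while the validation loss reaches its minimum at epoch 7 and then rises, a pattern that usually indicates the onset of overfitting. The sketch below copies the validation losses from the log and shows how a simple patience-based early-stopping rule would react. The `stopping_epoch` helper and the patience value of 2 are illustrative assumptions, not part of the training loop above.

```python
# Validation losses copied from the training log, epochs 1 through 10
val_losses = [0.5599, 0.4703, 0.4451, 0.4278, 0.4183,
              0.4171, 0.4123, 0.4205, 0.4285, 0.4441]

# Epoch (1-based) with the lowest validation loss
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__) + 1


def stopping_epoch(losses, patience):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


stop_at = stopping_epoch(val_losses, patience=2)
print(f"Best epoch: {best_epoch}, early stopping would trigger at epoch: {stop_at}")
```

Tracking which epoch produced the best checkpoint is one of the small experiment-management tasks that becomes tedious by hand once many runs are involved.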


### **DL FINE-TUNING TIME**

Depending on your running environment, the fine-tuning step might take a couple of minutes on a GPU or around 10 minutes if you're only using a CPU. You can set num_epochs = 1 if you simply want to get a basic version of the sentiment classifier quickly.

In [None]:
# --- 5. Testing ---
# Once the fine-tuning step is completed, we measure the model's performance
# on the held-out test set by reusing evaluate_epoch():
print("\nStarting testing...")
test_loss, test_accuracy, test_precision, test_recall, test_f1 = evaluate_epoch(model, test_dataloader, device)
print(f"Test Loss: {test_loss:.4f}, Accuracy: {test_accuracy:.4f}, "
        f"Precision: {test_precision:.4f}, Recall: {test_recall:.4f}, F1: {test_f1:.4f}")

# --- 6. Prediction Example ---
print("\nExample Prediction on new data:")
model.eval()
new_reviews = [
    "This movie was absolutely fantastic! I loved every moment of it.",
    "Utterly disappointing. A complete waste of time.",
    "It was okay, nothing special.",
    "The acting was superb, but the plot was a bit weak."
]

for review in new_reviews:
    # Call the tokenizer directly; encode_plus is a deprecated alias
    encoding = tokenizer(
        review,
        add_special_tokens=True,
        max_length=max_token_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt',
    )
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)

    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predicted_label_id = torch.argmax(logits, dim=-1).item()
        predicted_sentiment = "positive" if predicted_label_id == 1 else "negative"
        print(f"Review: '{review}' -> Predicted Sentiment: {predicted_sentiment}")


Starting testing...
Test Loss: 0.4693, Accuracy: 0.8036, Precision: 0.8732, Recall: 0.7147, F1: 0.7861

Example Prediction on new data:
Review: 'This movie was absolutely fantastic! I loved every moment of it.' -> Predicted Sentiment: positive
Review: 'Utterly disappointing. A complete waste of time.' -> Predicted Sentiment: negative
Review: 'It was okay, nothing special.' -> Predicted Sentiment: negative
Review: 'The acting was superb, but the plot was a bit weak.' -> Predicted Sentiment: positive
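
As a sanity check on the reported test metrics, recall that F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall printed above, so the result agrees with the logged F1 only up to rounding. The helper name `f1_from` is a hypothetical one chosen for this example.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded test-set values copied from the output above
test_precision = 0.8732
test_recall = 0.7147
logged_f1 = 0.7861

recomputed_f1 = f1_from(test_precision, test_recall)
print(f"Recomputed F1: {recomputed_f1:.3f} (logged: {logged_f1})")
```

Precision well above recall suggests that this fine-tuned model is conservative about predicting the positive class: when it says "positive" it is usually right, but it misses a fair share of positive reviews.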


## **Understanding DL's full life cycle development**

You might have gathered that the core DL development paradigm revolves around three key artifacts: Data, Model, and Code. In addition to this, Explainability is another major artifact that is required in many mission-critical application scenarios such as medical diagnoses, the financial industry, and decision making for criminal justice.

Once we have a model that's good enough for the use cases and customer scenarios, the complexity increases as we need a way to continuously deploy and update the model in production, monitor the model and data drift, and then retrain the model when necessary. This complexity further increases when at-scale training, deployment, monitoring, and explainability are needed.

Let's examine what a DL life cycle looks like (see Figure 1.3). There are five stages:

1. Data collection, cleaning, and annotation/labeling.
2. Model development (which is also known as offline experimentation). The core DL development paradigm in Figure 1.1 is considered part of the model development stage, which itself can be an iterative process.
3. Model deployment and serving in production.
4. Model validation and A/B testing (which is also known as online experimentation; this is usually in a production environment).
5. Monitoring and feedback data collection during production.

## **Understanding MLOps challenges**

MLOps has some connections to DevOps, in which a set of technology stacks and standard operational procedures is used for software development and deployment combined with IT operations. Unlike traditional software development, however, ML and especially DL represent a new software development paradigm called Software 2.0. The resulting MLOps challenges can be grouped into foundation layers and pillars:

* Here are the three foundation layers:
  * Infrastructure management and automation
  * Application life cycle management and Continuous Integration and Continuous Deployment (CI/CD)
  * Service system observability
* Here are the four pillars:
  * Data observability and management
  * Model observability and life cycle management
  * Explainability and Artificial Intelligence (AI) observability
  * Code reproducibility and observability

Next, we will explain MLflow's role in each of these MLOps layers and pillars so that we have a clear picture of what MLflow can contribute across them:

* **Infrastructure management and automation**: This includes, but is not limited to, Kubernetes (also known as k8s) for automated container orchestration and Terraform (commonly used for managing hundreds of cloud services and access control). These tools are adapted to manage ML and DL applications that have deployed models as service endpoints. These infrastructure layers are not the focus of this book; instead, we will focus on how to deploy a trained DL model using MLflow's provided capabilities.

* **Application life cycle management and CI/CD**: This includes, but is not limited to, Docker containers for virtualization, container life cycle management tools such as Kubernetes, and CircleCI or Concourse for CI and CD. Usually, CI means that whenever there are code or model changes in a GitHub repository, a series of automatic tests will be triggered to make sure no breaking changes are introduced. Once these tests have been passed, new changes will be automatically released as part of a new package. This will then trigger a new deployment process (CD) to deploy the new package to the production environment (often, this will include human approval as a safety gate). Note that these tools are not unique to ML applications but have been adapted to ML and DL applications, especially when we require GPU and distributed clusters for the training and testing of DL models. In this book, we will not focus on these tools but will mention the integration points or examples when needed.

* **Service system observability**: This is mostly for monitoring the hardware/clusters/CPU/memory/storage, operating system, service availability, latency, and throughput. This includes tools such as Grafana, Datadog, and more. Again, these are not unique to ML and DL applications and are not the focus of this book.

* **Data observability and management**: This is traditionally under-represented in the DevOps world but becomes very important in MLOps as data is critical within the full life cycle of ML/DL models. This includes data quality monitoring, outlier detection, data drift and concept drift detection, bias detection, secured and compliant data sharing, data provenance tracking and versioning, and more. The tool stacks in this area that are suitable for ML and DL applications are still emerging. A few examples include DataFold (https://www.datafold.com/) and Databand (https://databand.ai/open-source/). A recent development in data management is a unified lakehouse architecture and implementation called Delta Lake (http://delta.io) that can be used for ML data management. MLflow has native integration points with Delta Lake, and we will cover that integration in this book.

* **Model observability and life cycle management**: This is unique to ML/DL models, and it only became widely available recently due to the rise of MLflow. This includes tools for model training, testing, versioning, registration, deployment, serialization, model drift monitoring, and more. We will learn about the exciting capabilities that MLflow provides in this area. Note that once we combine CI/CD tools with MLflow training/monitoring, user feedback loops, and human annotations, we can achieve Continuous Training, Continuous Testing, and Continuous Labeling. MLflow provides the foundational capabilities so that further automation in MLOps becomes possible, although such complete automation will not be the focus of this book. Interested readers can find relevant references at the end of this chapter to explore this area further.

* **Explainability and AI observability**: This is unique to ML/DL models and is especially important for DL models, which have traditionally been treated as black boxes. Understanding why a model makes certain predictions is critical for societally important applications. For example, in medical, financial, judicial, and many human-in-the-loop decision support applications, such as civilian and military emergency response, the demand for explainability keeps increasing. MLflow provides native integration with a popular explainability framework called SHAP, which we will cover in this book.

* **Code reproducibility and observability**: This is not entirely unique to ML/DL applications. However, DL models face some special challenges: DL code frameworks are diverse, and reproducing a model depends on more than the code alone (we also need the data and the execution environment, such as a GPU cluster). In addition, notebooks are commonly used in model development and production, so managing notebooks alongside model runs is important. Usually, GitHub is used to manage the code repository; however, we need to structure the ML project code so that it is reproducible either locally (such as on a laptop) or remotely (for example, in a Databricks GPU cluster). MLflow provides this capability, allowing DL projects that have been written once to run anywhere, whether in an offline experimentation environment or an online production environment. We will cover MLflow's MLproject capability in this book.
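
To illustrate the data drift detection mentioned under the data observability pillar, the following minimal sketch compares the distribution of one feature in the training data against recent production data using the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical cumulative distribution functions. The feature (review length in tokens), the sample values, and the 0.5 alert threshold are all assumptions made for illustration; production tools typically rely on significance tests and more robust statistics.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_ref(x) - ECDF_cur(x)|."""
    ref = sorted(reference)
    cur = sorted(current)

    def ecdf(sorted_values, x):
        # Fraction of values that are <= x
        return bisect_right(sorted_values, x) / len(sorted_values)

    # Both ECDFs are step functions, so the largest gap occurs at a data point
    return max(abs(ecdf(ref, x) - ecdf(cur, x)) for x in ref + cur)


# Hypothetical review lengths (in tokens) seen at training time and in production
train_lengths = [120, 95, 130, 110, 105, 125, 98, 115]
prod_lengths = [40, 35, 50, 45, 120, 38, 42, 47]

DRIFT_THRESHOLD = 0.5  # Assumed alert level, to be tuned per feature

statistic = ks_statistic(train_lengths, prod_lengths)
drift_detected = statistic > DRIFT_THRESHOLD
print(f"KS statistic: {statistic:.3f}, drift detected: {drift_detected}")
```

In this made-up example, production reviews are much shorter than the training reviews, so the statistic is large and the check flags drift, which would be a signal to investigate and possibly retrain the model.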