<div style="text-align: center; padding: 30px; border: 3px solid #f39c12; border-radius: 15px; background-color: #f4f6f7; font-family: 'Arial', sans-serif; box-shadow: 0 4px 10px rgba(0, 0, 0, 0.1);">
  
  <h1 style="color: #007acc; font-size: 36px; font-weight: bold; margin-bottom: 15px; text-transform: uppercase;">PyTorch Lightning and MLFlow Tutorial</h1>

  <h3 style="color: #333; font-size: 22px; font-weight: normal; margin-bottom: 20px;">Learning the Basics of PyTorch Lightning and MLFlow by Training a DenseNet-121 Model on CIFAR-100</h3>

  <div style="font-size: 18px; color: #333; margin-bottom: 10px;">
    <strong>Author:</strong> Seyed Abolfazl Mortazavi
  </div>
  <div style="font-size: 18px; color: #333; margin-bottom: 20px;">
    <strong>Date:</strong> December 2024
  </div>

  <div style="text-align: center;">
    <a href="https://github.com/SAMortazavi" style="font-size: 18px; color: #007acc; text-decoration: none; font-weight: bold; border: 2px solid #007acc; border-radius: 8px; padding: 10px 20px; transition: 0.3s; display: inline-block;">Visit GitHub</a>
  </div>
</div>


# PyTorch Lightning CIFAR-100 Classification Tutorial

This notebook demonstrates how to use **PyTorch Lightning** to train a deep learning model on the **CIFAR-100** dataset. The model employs a pre-trained **DenseNet-121** architecture as a backbone for feature extraction, with an additional fully connected (FC) layer for classification. The steps are as follows:

1. **Data Preprocessing**: Training images are augmented with random cropping (4-pixel reflect padding, cropped back to 32x32), horizontal flipping, and a random affine transform, then converted to tensors and normalized per channel. Test images are only converted to tensors and normalized.

2. **Data Loading**: The CIFAR-100 training set is split 80/20 into training and validation subsets, and the official test set is used for final evaluation. Data loaders batch the data (64 images per batch) and shuffle the training subset.

3. **Model Architecture**: The `MyModel` class is a **PyTorch Lightning** module that uses a pre-trained **DenseNet-121** as the backbone for feature extraction. A final linear layer maps the pooled features to the 100 CIFAR-100 classes. The model is trained on a GPU, with gradients accumulated over 5 batches (`accumulate_grad_batches=5`) to reach a larger effective batch size without extra memory.

4. **Training**: The model is trained for up to 25 epochs with the **Adam** optimizer and a **learning rate scheduler** (ReduceLROnPlateau). **Cross-entropy loss** is used for multi-class classification, and **accuracy** is tracked during validation and testing.

5. **Callbacks**: **ModelCheckpoint** saves the best-performing model based on validation loss, and **EarlyStopping** is configured to halt training if the validation loss does not improve for 5 consecutive epochs.

6. **Logging**: Training progress is logged with **MLFlow**, which replaces the originally planned **TensorBoard** logging and handles both experiment tracking and model logging.

**Note:** Reaching high accuracy is not the primary goal of this tutorial. The main objective is to demonstrate how to use **PyTorch Lightning** and **MLFlow** for model training and evaluation. The code was written on **Kaggle**, where the MLFlow UI is not directly reachable, so the **pyngrok** library is used to open a tunnel to it.


# **Importing necessary Libraries**

In [77]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import lightning as L
from torch.utils.data import DataLoader,random_split
from torchvision import datasets, transforms
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.loggers.tensorboard import TensorBoardLogger
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
from torchmetrics import Accuracy
import torchvision.models as models
from lightning.pytorch import Callback
import mlflow.pytorch
from lightning.pytorch.loggers import MLFlowLogger  # same package as the other Lightning imports

# **Defining Transforms for dataset**

In [11]:
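# Per-channel (mean, std) used for normalization. These values are the ones
# commonly quoted for CIFAR-10; they are kept here as in the original run.
stats = ((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))

# Training: augment, then convert to a tensor and normalize.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4, padding_mode='reflect'),
    transforms.RandomHorizontalFlip(),
    transforms.RandomAffine(degrees=(10, 30),
                            translate=(0.1, 0.3),
                            scale=(0.7, 1.3),
                            shear=0.1),
    transforms.ToTensor(),
    transforms.Normalize(*stats),
])

# Testing: no augmentation, only tensor conversion and normalization.
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(*stats),
])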
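
To make the effect of `transforms.Normalize(*stats)` concrete, the following minimal pure-Python sketch applies the per-channel arithmetic `(x - mean) / std` to a single pixel. The helper names `normalize_pixel` and `denormalize_pixel` are hypothetical and exist only for this illustration; the real transform operates on whole image tensors.

```python
# Per-channel statistics copied from the `stats` tuple above.
MEAN = (0.4914, 0.4822, 0.4465)
STD = (0.2023, 0.1994, 0.2010)


def normalize_pixel(rgb):
    """Apply (x - mean) / std to one RGB pixel with values in [0, 1]."""
    return tuple((x - m) / s for x, m, s in zip(rgb, MEAN, STD))


def denormalize_pixel(rgb):
    """Invert the normalization, e.g. before plotting an image."""
    return tuple(x * s + m for x, m, s in zip(rgb, MEAN, STD))


pixel = (0.5, 0.5, 0.5)  # a mid-gray pixel as produced by ToTensor()
normalized = normalize_pixel(pixel)
restored = denormalize_pixel(normalized)

print(normalized)
print(restored)
```

The inverse helper is handy when augmented images need to be displayed, since normalized tensors no longer lie in the `[0, 1]` range.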

# **Loading the Dataset**

In [14]:
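# The official training split is used for training and validation,
# the official test split for final evaluation.
train_data = datasets.CIFAR100(root='data', train=True, transform=train_transform, download=True)
test_data = datasets.CIFAR100(root='data', train=False, transform=test_transform, download=True)

# Hold out 20% of the training images for validation. Note that the
# validation subset shares train_transform, so it is augmented as well.
train_size = int(0.8 * len(train_data))
validation_size = len(train_data) - train_size
train_data, val_data = random_split(train_data, [train_size, validation_size])

train_loader = DataLoader(train_data, batch_size=64, shuffle=True, num_workers=3)
val_loader = DataLoader(val_data, batch_size=64, shuffle=False, num_workers=3)
test_loader = DataLoader(test_data, batch_size=64, shuffle=False, num_workers=3)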

Files already downloaded and verified
Files already downloaded and verified
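
As a quick sanity check on the split above, the subset sizes and the number of batches each loader yields can be worked out by hand. This is a minimal sketch that assumes the standard CIFAR-100 sizes (50,000 training and 10,000 test images) and does not touch the actual datasets.

```python
import math

NUM_TRAIN_IMAGES = 50_000  # size of the official CIFAR-100 training set
NUM_TEST_IMAGES = 10_000   # size of the official CIFAR-100 test set
BATCH_SIZE = 64

train_size = int(0.8 * NUM_TRAIN_IMAGES)
validation_size = NUM_TRAIN_IMAGES - train_size

# DataLoader keeps the last, smaller batch by default (drop_last=False),
# so the number of batches is rounded up.
train_batches = math.ceil(train_size / BATCH_SIZE)
val_batches = math.ceil(validation_size / BATCH_SIZE)
test_batches = math.ceil(NUM_TEST_IMAGES / BATCH_SIZE)

print(train_size, validation_size)
print(train_batches, val_batches, test_batches)
```

On the real objects, `len(train_data)`, `len(val_data)`, and `len(train_loader)` should report the same numbers.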


# **Creating the Model**

In [47]:
class MyModel(L.LightningModule):
    def __init__(self):
        super(MyModel, self).__init__()
        self.save_hyperparameters()
        
        # Backbone: DenseNet
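        # `pretrained=True` is deprecated in recent torchvision releases;
        # the `weights` argument loads the same ImageNet weights.
        backbone = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)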
        num_filters = backbone.classifier.in_features
        layers = list(backbone.features.children())
        self.feature_extractor = nn.Sequential(*layers)
        
        # Classifier
        num_target_classes = 100
        self.classifier = nn.Linear(num_filters, num_target_classes)
        
        # Metrics
        self.acc = Accuracy(task="multiclass", num_classes=num_target_classes)
    
    # Forward pass
    def forward(self, x):
        representations = self.feature_extractor(x)
        representations = representations.mean([2, 3])  # Global Average Pooling
        x = self.classifier(representations)
        return x

    # Training step
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_pred = self(x)
        training_loss = F.cross_entropy(y_pred, y)
        self.log("train_loss", training_loss, prog_bar=True)
        return training_loss

    # Validation step
    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_pred = self(x)
        val_loss = F.cross_entropy(y_pred, y)
        val_acc = self.acc(y_pred, y)
        self.log("val_loss", val_loss, prog_bar=True)
        self.log("val_acc", val_acc, prog_bar=True)

    # Test step
    def test_step(self, batch, batch_idx):
        x, y = batch
        y_pred = self(x)
        test_loss = F.cross_entropy(y_pred, y)
        test_acc = self.acc(y_pred, y)
        self.log("test_loss", test_loss, prog_bar=True)
        self.log("test_acc", test_acc, prog_bar=True)

    # Configure optimizers
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, mode="min", factor=0.1, patience=5
        )
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": scheduler,
                "monitor": "val_loss",
            },
        }
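
To see how the `ReduceLROnPlateau` settings in `configure_optimizers` behave, the sketch below drives the same scheduler configuration by hand. It is an illustration only: a single dummy parameter stands in for the model, and the constant `fake_val_loss` is a hypothetical stand-in for the monitored `val_loss`.

```python
import torch

# A single dummy parameter stands in for the model weights.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5
)

lr_history = []
for epoch in range(8):
    # Hypothetical validation loss that never improves after the first epoch.
    fake_val_loss = 1.0
    scheduler.step(fake_val_loss)
    lr_history.append(optimizer.param_groups[0]["lr"])

print(lr_history)
```

With PyTorch's default `threshold`, the first call records the best loss, the next five calls count as non-improving epochs that are still tolerated, and the learning rate should drop to roughly `1e-4` on the seventh call.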


# **Checkpoint and early stopping**

In [37]:
callback_checkpoint=ModelCheckpoint(
    dirpath='./',
    filename='Checkpoint',
    monitor='val_loss',
    mode='min'
)
early_stopping = EarlyStopping(monitor="val_loss", patience=5, mode="min", verbose=False)
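
The patience rule behind `EarlyStopping(monitor="val_loss", patience=5, mode="min")` can be illustrated with a few lines of plain Python. This is a simplified sketch, not Lightning's implementation: the function `epochs_until_stop` and the list of losses are hypothetical and made up for the example.

```python
def epochs_until_stop(val_losses, patience=5, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement: stop once `patience` such epochs accumulate.
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: the best value so far is reached in epoch 3.
losses = [2.0, 1.5, 1.2, 1.3, 1.25, 1.21, 1.4, 1.3, 1.1]
print(epochs_until_stop(losses))
```

In this made-up sequence the improvement in the last epoch is never reached, because five non-improving epochs have already passed by then. That trade-off is worth keeping in mind when choosing `patience`.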

# **MLFlow logger**

In [41]:
mlflow_logger=MLFlowLogger(
    experiment_name='final Run of the Lightning and MLFlow',
    log_model=True
)

# **Create and use Model**

In [48]:
model=MyModel()
trainer=L.Trainer(
        accelerator='gpu',devices=-1,
        # Pass both callbacks; otherwise early_stopping is defined but never used.
        callbacks=[callback_checkpoint,early_stopping],
        max_epochs=25,accumulate_grad_batches=5,logger=mlflow_logger
)
mlflow.pytorch.autolog()
trainer.fit(model,train_loader,val_loader)
mlflow.pytorch.log_model(
    pytorch_model=model,
    artifact_path="models",
    registered_model_name="DenseNet",
)

2024/12/09 06:07:07 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'cc53a0ed21d74a52acb1fd37878e2498', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current pytorch workflow
/opt/conda/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:654: Checkpoint directory /kaggle/working exists and is not empty.
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name              | Type               | Params | Mode 
-----------------------------------------------------------------
0 | feature_extractor | Sequential         | 7.0 M  | train
1 | classifier        | Linear             | 102 K  | train
2 | acc               | MulticlassAccuracy | 0      | train
-----------------------------------------------------------------
7.1 M     Trainable params
0         Non-trainable params
7.1 M     Total params
28.225    Total estimated model params size (MB)
433       Modules in train 


Successfully registered model 'DenseNet'.
Created version '1' of model 'DenseNet'.


<mlflow.models.model.ModelInfo at 0x792b96148eb0>
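
The `accumulate_grad_batches=5` setting passed to the trainer changes how often the optimizer steps. The arithmetic below is a minimal sketch that assumes a single GPU (as the log above suggests) and the 40,000-image training subset produced by the 80/20 split.

```python
import math

BATCH_SIZE = 64
ACCUMULATE_GRAD_BATCHES = 5
TRAIN_SAMPLES = 40_000  # 80% of the 50,000 CIFAR-100 training images

batches_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)

# Gradients from 5 consecutive batches are summed before each optimizer step,
# so every update is based on 5 * 64 images.
effective_batch_size = BATCH_SIZE * ACCUMULATE_GRAD_BATCHES
optimizer_steps_per_epoch = math.ceil(batches_per_epoch / ACCUMULATE_GRAD_BATCHES)

print(effective_batch_size)
print(optimizer_steps_per_epoch)
```

Only one batch of 64 images is held in memory at a time, so the larger effective batch comes at no extra memory cost.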

In [49]:
trainer.test(model,test_loader)

INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


[{'test_loss': 1.8499935865402222, 'test_acc': 0.5490999817848206}]
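
To put these numbers in context, they can be compared with a classifier that guesses uniformly among the 100 classes. The sketch below only uses the values printed above (rounded) and the fact that `F.cross_entropy` is based on the natural logarithm.

```python
import math

NUM_CLASSES = 100

# A classifier that assigns equal probability to every class has
# cross-entropy -ln(1/100) = ln(100) and an expected accuracy of 1/100.
chance_loss = math.log(NUM_CLASSES)
chance_accuracy = 1 / NUM_CLASSES

# Values reported by trainer.test above, rounded.
test_loss = 1.8500
test_acc = 0.5491

print(f"chance loss:     {chance_loss:.4f}")
print(f"chance accuracy: {chance_accuracy:.2%}")
print(f"reported loss:   {test_loss:.4f}")
print(f"reported acc:    {test_acc:.2%}")
```

The reported results are well above this baseline, which fits the note at the top: the run demonstrates the workflow and is not a tuned model.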

# **MLFlow UI using ngrok**

In [51]:
!pip install pyngrok --quiet


In [71]:
from pyngrok import ngrok
ngrok.kill()
NGROK_Token="YourOwnToken"  # replace with your own ngrok auth token
ngrok.set_auth_token(NGROK_Token)
ngrok_tunnel=ngrok.connect(addr='5000',proto='http',bind_tls=True)
print("MLflow tracking UI:",ngrok_tunnel.public_url)

MLflow tracking UI: https://d0f1-34-83-203-88.ngrok-free.app


In [72]:
!mlflow ui

[2024-12-09 06:43:16 +0000] [1601] [INFO] Starting gunicorn 23.0.0
[2024-12-09 06:43:16 +0000] [1601] [INFO] Listening at: http://127.0.0.1:5000 (1601)
[2024-12-09 06:43:16 +0000] [1601] [INFO] Using worker: sync
[2024-12-09 06:43:16 +0000] [1602] [INFO] Booting worker with pid: 1602
[2024-12-09 06:43:16 +0000] [1603] [INFO] Booting worker with pid: 1603
[2024-12-09 06:43:16 +0000] [1604] [INFO] Booting worker with pid: 1604
[2024-12-09 06:43:16 +0000] [1605] [INFO] Booting worker with pid: 1605
^C
[2024-12-09 06:45:31 +0000] [1601] [INFO] Handling signal: int
[2024-12-09 06:45:31 +0000] [1604] [INFO] Worker exiting (pid: 1604)
[2024-12-09 06:45:31 +0000] [1602] [INFO] Worker exiting (pid: 1602)
[2024-12-09 06:45:31 +0000] [1605] [INFO] Worker exiting (pid: 1605)
[2024-12-09 06:45:31 +0000] [1603] [INFO] Worker exiting (pid: 1603)
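
Because `!mlflow ui` blocks the notebook cell until it is interrupted, one possible alternative is to start the server as a background process. The sketch below is an assumption about how this could be done with the standard `subprocess` module; the helper names `mlflow_ui_command` and `start_mlflow_ui` are hypothetical, while `--port` and `--backend-store-uri` are options of the `mlflow ui` command.

```python
import subprocess


def mlflow_ui_command(port=5000, backend_store_uri=None):
    """Build the argument list for the `mlflow ui` command-line call."""
    command = ["mlflow", "ui", "--port", str(port)]
    if backend_store_uri is not None:
        command += ["--backend-store-uri", backend_store_uri]
    return command


def start_mlflow_ui(port=5000, backend_store_uri=None):
    """Start the UI as a background process and return its handle."""
    return subprocess.Popen(mlflow_ui_command(port, backend_store_uri))


print(mlflow_ui_command())
# In a notebook one would then call, for example:
# ui_process = start_mlflow_ui()   # returns immediately
# ...                              # browse the UI through the ngrok URL
# ui_process.terminate()           # stop the server when finished
```

The port must match the one given to `ngrok.connect`, which is `5000` in this notebook.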


# **Checking Saved Model from MLFlow**

In [74]:
mlflow.pytorch.log_model(
    pytorch_model=model,
    artifact_path="models",
    registered_model_name="DenseNet",
)


Registered model 'DenseNet' already exists. Creating a new version of this model...
Created version '3' of model 'DenseNet'.


<mlflow.models.model.ModelInfo at 0x792b9c182ad0>

# **Finding Run ID of the Model**

In [75]:
experiment = mlflow.get_experiment_by_name("final Run of the Lightning and MLFlow")
# search_runs expects a list of experiment IDs
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])
print(repr(runs[['run_id', 'artifact_uri']]))

                             run_id  \
0  d58ce97f670c417d9b3ba7626b2610da   

                                        artifact_uri  
0  file:///kaggle/working/mlruns/5469535754364011...  
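
Once a model is registered, it can in principle be loaded back with `mlflow.pytorch.load_model`, which accepts URIs of the form `models:/<name>/<version>` or `runs:/<run_id>/<artifact_path>`. The sketch below is an illustration under assumptions: the helper functions are hypothetical, and which URI applies depends on the run under which the model was actually logged. Loading also requires the `MyModel` class to be defined in the session.

```python
def registry_model_uri(name, version):
    """URI of a registered model version, e.g. models:/DenseNet/1."""
    return f"models:/{name}/{version}"


def run_model_uri(run_id, artifact_path):
    """URI of a model logged as an artifact of a specific run."""
    return f"runs:/{run_id}/{artifact_path}"


def load_registered_model(name, version):
    """Load a registered PyTorch model back into memory."""
    import mlflow.pytorch  # imported lazily so the helpers above work without MLflow

    return mlflow.pytorch.load_model(registry_model_uri(name, version))


print(registry_model_uri("DenseNet", 1))
# Example usage inside the notebook:
# loaded_model = load_registered_model("DenseNet", 1)
# loaded_model.eval()
```

Using the registry URI avoids having to know the run ID, since the model name `DenseNet` and its version numbers appear in the registration messages above.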
