<div style="text-align: center; padding: 30px; border: 3px solid #f39c12; border-radius: 15px; background-color: #f4f6f7; font-family: 'Arial', sans-serif; box-shadow: 0 4px 10px rgba(0, 0, 0, 0.1);">
  
  <h1 style="color: #007acc; font-size: 36px; font-weight: bold; margin-bottom: 15px; text-transform: uppercase;">PyTorch Lightning and MLFlow Tutorial</h1>

  <h3 style="color: #333; font-size: 22px; font-weight: normal; margin-bottom: 20px;">Learning the Basics of PyTorch Lightning and MLFlow by Training a ResNet50 Model on CIFAR-10</h3>

  <div style="font-size: 18px; color: #333; margin-bottom: 10px;">
    <strong>Author:</strong> Seyed Abolfazl Mortazavi
  </div>
  <div style="font-size: 18px; color: #333; margin-bottom: 20px;">
    <strong>Date:</strong> December 2024
  </div>

  <div style="text-align: center;">
    <a href="https://github.com/SAMortazavi" style="font-size: 18px; color: #007acc; text-decoration: none; font-weight: bold; border: 2px solid #007acc; border-radius: 8px; padding: 10px 20px; transition: 0.3s; display: inline-block;">Visit GitHub</a>
  </div>
</div>


# PyTorch Lightning CIFAR-10 Classification Tutorial

This notebook demonstrates how to use **PyTorch Lightning** to train a deep learning model on the **CIFAR-10** dataset. The model employs a pre-trained **ResNet-50** architecture as a backbone for feature extraction, with an additional fully connected (FC) layer for classification. The steps are as follows:

1. **Data Preprocessing**: Training images are augmented with random cropping (back to 32x32 pixels after 4-pixel padding) and random horizontal flipping, then converted to tensors and normalized. Test images are only converted to tensors and normalized.

2. **Data Loading**: The CIFAR-10 dataset is split into training, validation, and test sets. Data loaders are used to batch and shuffle the data for training and evaluation.

3. **Model Architecture**: The `MyModel` class is a **PyTorch Lightning** module that uses **ResNet-50** as a backbone for feature extraction. The backbone's forward pass runs under `torch.no_grad()`, so only the new final classification layer, sized for the 10 CIFAR-10 classes, receives gradient updates. The model trains on a GPU, with gradients accumulated over 5 batches per optimizer step.

4. **Training**: The model is trained for up to 25 epochs with the **Adam** optimizer and **learning rate scheduler** (ReduceLROnPlateau). The **cross-entropy loss** is used for multi-class classification, and **accuracy** is tracked during training, validation, and testing.

5. **Callbacks**: **ModelCheckpoint** saves a checkpoint every 5 epochs while monitoring validation loss, and an **EarlyStopping** callback is configured to halt training if validation loss doesn't improve for 5 consecutive epochs. Note that only the checkpoint callback is passed to the trainer in the run shown here.

6. **Logging**: Training progress was originally planned to be logged with **TensorBoard**; **MLFlow** is used instead for experiment tracking and model logging.

**Note:** In this tutorial, reaching high accuracy is not the primary goal. The main objective is to demonstrate how to use **PyTorch Lightning** and **MLFlow** effectively for model training and evaluation. Additionally, this code was written on **Kaggle**; because of that environment, the **pyngrok** library is used to open a tunnel to the MLflow UI.
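
To make the two quantities tracked in step 4 concrete, here is a minimal pure-Python sketch of cross-entropy loss and accuracy computed from raw logits. The helper names and the toy logits are illustrative assumptions; the notebook itself relies on `F.cross_entropy` and `torchmetrics.Accuracy`, which operate on tensors.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one sample: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def accuracy(batch_logits, targets):
    """Fraction of samples whose highest-scoring class equals the target."""
    correct = 0
    for logits, target in zip(batch_logits, targets):
        predicted = max(range(len(logits)), key=logits.__getitem__)
        correct += int(predicted == target)
    return correct / len(targets)


# Toy batch (hypothetical values): 2 samples, 3 classes.
batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
targets = [0, 1]

# Averaging over the batch mirrors the default "mean" reduction of F.cross_entropy.
mean_loss = sum(cross_entropy(l, t) for l, t in zip(batch_logits, targets)) / len(targets)
batch_acc = accuracy(batch_logits, targets)
print(f"loss={mean_loss:.4f} acc={batch_acc:.2f}")
```

The first sample is classified correctly and the second is not, so the accuracy of this toy batch is 0.5, while the loss is dominated by the confidently wrong second prediction.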


In [3]:
!pip install lightning --quiet
!pip install mlflow --quiet
!pip install pyngrok --quiet

# **Importing necessary Libraries**

In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import lightning as L
from torch.utils.data import DataLoader,random_split
from torchvision import datasets, transforms
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.loggers.tensorboard import TensorBoardLogger
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
from torchmetrics import Accuracy
import torchvision.models as models
from lightning.pytorch import Callback
import mlflow.pytorch
# Import from lightning.pytorch so the logger comes from the same package as L.Trainer
from lightning.pytorch.loggers import MLFlowLogger

# **Defining Transforms for dataset**

In [4]:
transform_train = transforms.Compose(
   [
       transforms.RandomCrop(32, padding=4),
       transforms.RandomHorizontalFlip(),
       transforms.ToTensor(),
       transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
   ],
)
transform_test = transforms.Compose(
   [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
)
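
As a small worked example of what `Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))` does per channel: `ToTensor` first scales pixel values to the range [0, 1], and normalization then computes `(value - mean) / std`, which maps that range to [-1, 1]. The sketch below uses plain Python floats and hypothetical helper names; it only illustrates the arithmetic, not the torchvision implementation.

```python
def normalize(value, mean=0.5, std=0.5):
    """Per-channel normalization applied after ToTensor: (value - mean) / std."""
    return (value - mean) / std


def denormalize(value, mean=0.5, std=0.5):
    """Inverse mapping, handy when plotting normalized images."""
    return value * std + mean


# Darkest, mid-gray, and brightest pixel values after ToTensor
for pixel in (0.0, 0.5, 1.0):
    print(pixel, "->", normalize(pixel))
```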

# **Loading the Dataset**

In [5]:
train_data=datasets.CIFAR10(root='data',train=True,download=True,transform=transform_train)
test_data=datasets.CIFAR10(root='data',train=False,download=True,transform=transform_test)

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to data/cifar-10-python.tar.gz


100%|██████████| 170498071/170498071 [00:02<00:00, 59116629.34it/s]


Extracting data/cifar-10-python.tar.gz to data
Files already downloaded and verified


In [6]:
train_size=int(0.8*len(train_data))
val_size=len(train_data)-train_size
train_data,val_data=random_split(train_data,[train_size,val_size])
train_loader=DataLoader(train_data,batch_size=64,shuffle=True,num_workers=3)
val_loader=DataLoader(val_data,batch_size=64,shuffle=False,num_workers=3)
test_loader=DataLoader(test_data,batch_size=64,shuffle=False,num_workers=3)
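
The split above can be checked with some quick arithmetic. This sketch assumes the standard CIFAR-10 sizes (50,000 training and 10,000 test images) and the default `drop_last=False` of `DataLoader`, so the last partial batch is kept.

```python
import math

NUM_TRAIN_IMAGES = 50_000  # standard CIFAR-10 training set size
NUM_TEST_IMAGES = 10_000   # standard CIFAR-10 test set size
BATCH_SIZE = 64

# Same 80/20 split as in the cell above
train_size = int(0.8 * NUM_TRAIN_IMAGES)
val_size = NUM_TRAIN_IMAGES - train_size


def num_batches(num_samples, batch_size):
    """Batches per epoch when the final partial batch is kept."""
    return math.ceil(num_samples / batch_size)


train_batches = num_batches(train_size, BATCH_SIZE)
val_batches = num_batches(val_size, BATCH_SIZE)
test_batches = num_batches(NUM_TEST_IMAGES, BATCH_SIZE)
print(train_size, val_size, train_batches, val_batches, test_batches)
```

One design point to keep in mind: `random_split` returns subsets of the same underlying dataset object, so the validation subset is read through `transform_train` and therefore also receives the random augmentations.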

# **Creating the Model**

In [None]:
class MyModel(L.LightningModule):
  # Metrics are recorded with self.log; the MLFlowLogger attached to the
  # Trainer forwards them to MLflow.
  def __init__(self):
    super().__init__()
    self.save_hyperparameters()
    # ResNet-50 with pre-trained ImageNet weights serves as the backbone
    self.backbone = models.resnet50(weights="DEFAULT")
    num_filters = self.backbone.fc.in_features
    # Drop the final FC layer of ResNet-50: it outputs 1000 ImageNet classes,
    # which does not match the 10 classes of CIFAR-10
    layers = list(self.backbone.children())[:-1]
    self.feature_extractor = nn.Sequential(*layers)
    self.feature_extractor.eval()
    # New FC layer for CIFAR-10 classification
    num_target_classes = 10
    self.classifier = nn.Linear(num_filters, num_target_classes)
    # Accuracy metric
    self.acc = Accuracy(task='multiclass', num_classes=10)

  # Forward pass: the backbone runs without gradients, so only the classifier is updated
  def forward(self, x):
    with torch.no_grad():
      representations = self.feature_extractor(x).flatten(1)
    x = self.classifier(representations)
    return x

  # Train, validation and test steps
  def training_step(self, batch, batch_idx):
    x, y = batch
    y_pred = self(x)
    trainingloss = F.cross_entropy(y_pred, y)
    self.log('train loss', trainingloss, prog_bar=True)
    return trainingloss

  def validation_step(self, batch, batch_idx):
    x, y = batch
    y_pred = self(x)
    val_loss = F.cross_entropy(y_pred, y)
    # Keep the name `val_loss`: the LR scheduler, ModelCheckpoint and
    # EarlyStopping all monitor this key
    self.log('val_loss', val_loss, prog_bar=True)
    val_acc = self.acc(y_pred, y)
    self.log('validation acc', val_acc, prog_bar=True)

  def test_step(self, batch, batch_idx):
    x, y = batch
    y_pred = self(x)
    loss = F.cross_entropy(y_pred, y)
    self.log('test loss', loss, prog_bar=True)
    test_acc = self.acc(y_pred, y)
    self.log('Test Acc', test_acc, prog_bar=True)

  # Optimizer and learning rate scheduler
  def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
      optimizer, mode="min", factor=0.1, patience=5
    )
    return {
      "optimizer": optimizer,
      "lr_scheduler": {
        "scheduler": scheduler,
        "monitor": "val_loss",
      },
    }
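
To illustrate the frozen-backbone idea in isolation, here is a small sketch with a tiny stand-in network instead of ResNet-50 (the class name, layer sizes, and random inputs are assumptions made up for this example). It shows that features computed under `torch.no_grad()` send no gradients back to the extractor. As a variation on `MyModel`, it also sets `requires_grad = False` and hands only the trainable parameters to the optimizer; in the notebook's model, `requires_grad` is left untouched, which is presumably why the model summary reports all 25.6 M parameters as trainable.

```python
import torch
import torch.nn as nn


class FrozenBackboneClassifier(nn.Module):
    """Tiny stand-in for the ResNet-50 setup: frozen extractor + trainable head."""

    def __init__(self, in_features=8, hidden=16, num_classes=10):
        super().__init__()
        self.feature_extractor = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, num_classes)
        # Variation on the notebook: mark the extractor's parameters as frozen
        for param in self.feature_extractor.parameters():
            param.requires_grad = False
        self.feature_extractor.eval()

    def forward(self, x):
        # Same pattern as MyModel.forward: no gradients flow into the extractor
        with torch.no_grad():
            features = self.feature_extractor(x)
        return self.classifier(features)


model = FrozenBackboneClassifier()
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)  # only the classifier is optimized

x = torch.randn(4, 8)
y = torch.randint(0, 10, (4,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

# Classifier weights (16 * 10) plus biases (10)
num_trainable = sum(p.numel() for p in trainable)
print("trainable parameters:", num_trainable)
```

The same arithmetic applies to the real model: the new head has 2048 * 10 + 10 = 20,490 parameters, which matches the 20.5 K reported for `classifier` in the model summary.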


# **Checkpoint and early stopping**

In [8]:
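callback_checkpoint = ModelCheckpoint(
    dirpath='/kaggle/working/checkpoints',
    monitor='val_loss',
    every_n_epochs=5,  # save a checkpoint every 5 epochs
    save_top_k=-1,     # keep every saved checkpoint
    # No space inside {val_loss:.2f}, so the placeholder matches the logged metric name
    filename="cifar10-{epoch:02d}-{val_loss:.2f}-{validation acc:.2f}",
    mode='min'
)
# TensorBoard logger kept from the original plan; the Trainer below uses the MLFlow logger instead
logger = TensorBoardLogger(save_dir="lightning_logs")
early_stopping = EarlyStopping(monitor="val_loss", patience=5, mode="min", verbose=False)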
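
Both `ReduceLROnPlateau` and `EarlyStopping` use `patience=5` here, but the two count slightly differently. The sketch below is a simplified pure-Python re-implementation of both rules (an illustration, not the library code: it ignores `ReduceLROnPlateau`'s `threshold` and `cooldown` options and `EarlyStopping`'s `min_delta`). To my understanding, the scheduler reduces the learning rate once the number of bad epochs *exceeds* the patience, whereas early stopping triggers once the count *reaches* it. The loss values are made up.

```python
def plateau_lr_schedule(losses, lr=1e-3, factor=0.1, patience=5):
    """Simplified ReduceLROnPlateau (mode='min'): reduce when bad epochs > patience."""
    best = float("inf")
    bad_epochs = 0
    lrs = []
    for loss in losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        lrs.append(lr)
    return lrs


def early_stop_epoch(losses, patience=5):
    """Simplified EarlyStopping (mode='min'): stop when bad checks >= patience."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: two improving epochs, then a plateau
losses = [1.0, 0.9] + [0.95] * 6
lrs = plateau_lr_schedule(losses)
stop_epoch = early_stop_epoch(losses)
print("learning rates:", lrs)
print("early stopping would trigger at epoch index:", stop_epoch)
```

Under these simplified rules, early stopping would fire one epoch before the first learning rate reduction when both share the same patience. If both mechanisms are meant to act, a larger `patience` for `EarlyStopping` than for the scheduler may be worth considering.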

# **MLFlow logger**

In [9]:
mlflow_logger = MLFlowLogger(
    experiment_name="CIFAR10 with Resnet50 final Run",  # this is the name of the experiment
    log_model=True
)

# **Create and use Model**

In [10]:
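# In this cell I create the model and train it while MLflow records the logs
model = MyModel()
# Note: early_stopping is defined above but not passed to the Trainer in this run;
# use callbacks=[callback_checkpoint, early_stopping] to enable it
trainer = L.Trainer(accelerator='gpu', devices=-1, callbacks=callback_checkpoint,
        max_epochs=25, accumulate_grad_batches=5, logger=mlflow_logger)
mlflow.pytorch.autolog()
trainer.fit(model, train_loader, val_loader)
# Log the trained model to MLflow and register it under the name "Resnet 50"
mlflow.pytorch.log_model(
    pytorch_model=model,
    artifact_path="models",
    registered_model_name="Resnet 50",
)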

Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth
100%|██████████| 97.8M/97.8M [00:00<00:00, 211MB/s] 
2024/12/08 13:27:16 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '3caacd4cd354425fb553f1e42c7b9fc1', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current pytorch workflow
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name              | Type               | Params | Mode 
-----------------------------------------------------------------
0 | backbone          | ResNet             | 25.6 M | train
1 | feature_extractor | Sequential         | 23.5 M | eval 
2 | classifier        | Linear             | 20.5 K | train
3 | acc               | MulticlassAccuracy | 0      | train
-----------------------------------------------------------------
25.6 M    Trainable params
0         Non-trainable params
25.6 M  

Sanity Checking: |          | 0/? [00:00<?, ?it/s]


Training: |          | 0/? [00:00<?, ?it/s]


Validation: |          | 0/? [00:00<?, ?it/s]

Successfully registered model 'Resnet 50'.
Created version '1' of model 'Resnet 50'.


<mlflow.models.model.ModelInfo at 0x7e6c548282e0>
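
The effect of `accumulate_grad_batches=5` in the trainer above can be worked out with a little arithmetic. This sketch assumes a single GPU (as the log line `CUDA_VISIBLE_DEVICES: [0]` suggests) and the 40,000-image training split with a batch size of 64.

```python
import math

BATCH_SIZE = 64
ACCUMULATE_GRAD_BATCHES = 5
TRAIN_IMAGES = 40_000  # 80% of the 50,000 CIFAR-10 training images

# Forward/backward passes per epoch
batches_per_epoch = math.ceil(TRAIN_IMAGES / BATCH_SIZE)

# Gradients from 5 consecutive batches are summed before each optimizer step
effective_batch_size = BATCH_SIZE * ACCUMULATE_GRAD_BATCHES
optimizer_steps_per_epoch = math.ceil(batches_per_epoch / ACCUMULATE_GRAD_BATCHES)

print("batches per epoch:", batches_per_epoch)
print("effective batch size:", effective_batch_size)
print("optimizer steps per epoch:", optimizer_steps_per_epoch)
```

In other words, memory usage stays at the level of a 64-image batch while each weight update behaves roughly like one computed on 320 images.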

In [11]:
trainer.test(model,test_loader)

INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing: |          | 0/? [00:00<?, ?it/s]

[{'test loss': 0.9138151407241821, 'Test Acc': 0.6945000290870667}]

In [13]:
from pyngrok import ngrok
ngrok.kill()
NGROK_Token="Your Own Token"
ngrok.set_auth_token(NGROK_Token)
ngrok_tunnel=ngrok.connect(addr='5000',proto='http',bind_tls=True)
print("MLflow tracking UI:",ngrok_tunnel.public_url)

MLflow tracking UI: https://7aaa-34-23-177-60.ngrok-free.app                                        


In [None]:
!mlflow ui


[2024-12-08 13:34:17 +0000] [1274] [INFO] Starting gunicorn 23.0.0
[2024-12-08 13:34:17 +0000] [1274] [INFO] Listening at: http://127.0.0.1:5000 (1274)
[2024-12-08 13:34:17 +0000] [1274] [INFO] Using worker: sync
[2024-12-08 13:34:17 +0000] [1275] [INFO] Booting worker with pid: 1275
[2024-12-08 13:34:17 +0000] [1276] [INFO] Booting worker with pid: 1276
[2024-12-08 13:34:17 +0000] [1277] [INFO] Booting worker with pid: 1277
[2024-12-08 13:34:18 +0000] [1278] [INFO] Booting worker with pid: 1278
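
Because `!mlflow ui` blocks its notebook cell until it is interrupted, one possible alternative is to start the UI as a background process. The sketch below is an assumption about how this could be arranged, not part of the original notebook: the helper names are hypothetical, and it relies on the `--host` and `--port` options of the `mlflow ui` command.

```python
import subprocess


def build_mlflow_ui_command(port=5000, host="127.0.0.1"):
    """Build the command line for the MLflow UI (hypothetical helper)."""
    return ["mlflow", "ui", "--host", host, "--port", str(port)]


def launch_mlflow_ui(port=5000):
    """Start the MLflow UI in the background and return the process handle."""
    return subprocess.Popen(
        build_mlflow_ui_command(port),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )


# Usage in a notebook cell (not executed here):
# ui_process = launch_mlflow_ui(5000)   # returns immediately
# ...open the ngrok tunnel to port 5000 as shown earlier...
# ui_process.terminate()                # stop the UI when finished
print(" ".join(build_mlflow_ui_command()))
```

With this arrangement the UI can be started before the tunnel is opened, and later cells remain usable while it runs.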
