## Profile PyTorch Code
How to use the `TensorBoard` or `PyTorch Kineto` plugin to profile PyTorch code.

<img src="https://i.imgur.com/fwSc5Z9.png"/>

The work done by `processes`, `threads` and `streams` on the CPU and GPU is displayed along with precise timing information in an interactive viewer.

In [1]:
import glob

import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms

from torch.profiler import tensorboard_trace_handler
import wandb


In [None]:
torchvision.datasets.MNIST.mirrors = [mirror for mirror in torchvision.datasets.MNIST.mirrors
                                        if not mirror.startswith('http://yann.lecun.com')]
wandb.login()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: raghvender. Use `wandb login --relogin` to force relogin


True

### Setup Profiling Training

### Model

In [None]:
OPTIMIZERS = {
    "Adadelta": optim.Adadelta,
    "Adagrad" : optim.Adagrad,
    "SGD": optim.SGD,
}

class Net(pl.LightningModule):
    """Very simple LeNet-style DNN, plus DropOut."""
    def __init__(self, optimizer="Adadelta"):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

        self.optimizer = self.set_optimizer(optimizer)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output

    def set_optimizer(self, optimizer):
        return OPTIMIZERS[optimizer]
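
The `9216` input features of `fc1` follow from the layer shapes, assuming the 28x28 single-channel MNIST images that this notebook loads later. The sketch below redoes that arithmetic, plus the per-layer parameter counts, in plain Python. `conv_out` is a hypothetical helper written for this illustration, not part of PyTorch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square conv or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


side = 28                                  # assumed input: 28x28 MNIST images
side = conv_out(side, kernel=3)            # conv1: 28 -> 26
side = conv_out(side, kernel=3)            # conv2: 26 -> 24
side = conv_out(side, kernel=2, stride=2)  # max_pool2d(x, 2): 24 -> 12
flat_features = 64 * side * side           # conv2 emits 64 channels
print(flat_features)

# Parameter counts: weights plus one bias per output channel or unit
params = {
    "conv1": 1 * 32 * 3 * 3 + 32,
    "conv2": 32 * 64 * 3 * 3 + 64,
    "fc1": flat_features * 128 + 128,
    "fc2": 128 * 10 + 10,
}
total = sum(params.values())
print(params, total)
```

Under these assumptions the counts should line up with the model summary that Lightning prints when training starts (320, 18.5 K, 1.2 M, and 1.3 K parameters).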

In [None]:
def training_step(self, batch, idx):
    inputs, labels = batch
    outputs = self(inputs)
    loss = F.nll_loss(outputs, labels)

    return {"loss": loss}

def configure_optimizers(self):
    return self.optimizer(self.parameters(), lr=0.1)

Net.training_step = training_step
Net.configure_optimizers = configure_optimizers

### Profiler Callback
The profiler's interface resembles that of a PyTorch optimizer: it has a `.step` method that we call to mark the boundaries of the code being profiled.

A single training step (`forward` and `backward` propagation) is both the typical target of performance optimizations and already rich enough to more than fill out a profiling `trace`, so we call `.step` once per training step.
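
Each `.step` call moves the profiler one position along its schedule. The sketch below mimics, in plain Python, the wait/warmup/active/repeat cycle as described in the `torch.profiler.schedule` documentation, using the values this notebook passes in the training cell (`1, 1, 2, 1`). `phase_for_step` and its phase names are hypothetical, chosen for illustration; the real schedule returns `ProfilerAction` values and this sketch does not call torch at all.

```python
def phase_for_step(step, wait, warmup, active, repeat=0):
    """Return the profiler phase for a 0-indexed step count.

    Hypothetical helper that mirrors the wait/warmup/active/repeat cycle;
    it does not import or call torch.
    """
    cycle = wait + warmup + active
    if repeat > 0 and step >= repeat * cycle:
        return "idle"             # all requested cycles are finished
    pos = step % cycle
    if pos < wait:
        return "idle"             # profiler is off
    if pos < wait + warmup:
        return "warmup"           # tracing is on, results are discarded
    if pos < cycle - 1:
        return "record"           # events are collected
    return "record_and_save"      # last active step: trace goes to on_trace_ready


phases = [phase_for_step(s, wait=1, warmup=1, active=2, repeat=1) for s in range(8)]
for step, phase in enumerate(phases):
    print(step, phase)
```

With these settings only the first four steps matter to the profiler: one idle, one warmup, and two recorded. Any further training steps run unprofiled.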

In [None]:
class TorchTensorboardProfilerCallback(pl.Callback):
    """Quick-and-dirty Callback for invoking TensorboardProfiler during training.
    
    For greater robustness, extend the pl.profiler.profilers.BaseProfiler. See
    https://pytorch-lightning.readthedocs.io/en/stable/advanced/profiler.html"""

    def __init__(self, profiler):
        super().__init__()
        self.profiler = profiler 

    def on_train_batch_end(self, trainer, pl_module, outputs, *args, **kwargs):
        self.profiler.step()
        pl_module.log_dict(outputs)  # also logging the loss, while we're here

### Run Profiled Training

In [None]:
# initial values are the library defaults, except batch_size (the DataLoader default is 1)
config = {"batch_size": 32,  # try log-spaced values from 1 to 50,000
          "num_workers": 0,  # try 0, 1, and 2
          "pin_memory": False,  # try False and True
          "precision": 32,  # try 16 and 32
          "optimizer": "Adadelta",  # try "Adadelta" and "SGD" (keys of OPTIMIZERS)
          }

with wandb.init(project='trace', config=config) as run:
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset = datasets.MNIST("../data", train=True, download=True,
                            transform=transform)
    ## Using a raw DataLoader, rather than LightningDataModule, for greater transparency
    trainloader = torch.utils.data.DataLoader(
      dataset,
      # Key performance-relevant configuration parameters:
      ## batch_size: how many datapoints are passed through the network at once?
      batch_size=wandb.config.batch_size,
      # larger batch sizes are more compute efficient, up to memory constraints

      ##  num_workers: how many side processes to launch for dataloading (should be >0)
      num_workers=wandb.config.num_workers,
      # needs to be tuned given model/batch size/compute

      ## pin_memory: should a fixed "pinned" memory block be allocated on the CPU?
      pin_memory=wandb.config.pin_memory,
      # should nearly always be True for GPU models, see https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
    )

    # Set up model
    model = Net(optimizer=wandb.config["optimizer"])

    # Set up profiler
    wait, warmup, active, repeat = 1, 1, 2, 1
    total_steps = (wait + warmup + active) * (1 + repeat)
    schedule =  torch.profiler.schedule(
      wait=wait, warmup=warmup, active=active, repeat=repeat)
    profiler = torch.profiler.profile(
      schedule=schedule, on_trace_ready=tensorboard_trace_handler("wandb/latest-run/tbprofile"), with_stack=True)

    with profiler:
        profiler_callback = TorchTensorboardProfilerCallback(profiler)

        trainer = pl.Trainer(gpus=1, max_epochs=1, max_steps=total_steps,
                            logger=pl.loggers.WandbLogger(log_model=True, save_code=True),
                            callbacks=[profiler_callback], precision=wandb.config.precision)

        trainer.fit(model, trainloader)

    profile_art = wandb.Artifact(f"trace-{wandb.run.id}", type="profile")
    profile_art.add_file(glob.glob("wandb/latest-run/tbprofile/*.pt.trace.json")[0], "trace.pt.trace.json")
    run.log_artifact(profile_art)

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to ../data\MNIST\raw\train-images-idx3-ubyte.gz



Extracting ../data\MNIST\raw\train-images-idx3-ubyte.gz to ../data\MNIST\raw

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to ../data\MNIST\raw\train-labels-idx1-ubyte.gz



Extracting ../data\MNIST\raw\train-labels-idx1-ubyte.gz to ../data\MNIST\raw

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to ../data\MNIST\raw\t10k-images-idx3-ubyte.gz



Extracting ../data\MNIST\raw\t10k-images-idx3-ubyte.gz to ../data\MNIST\raw

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to ../data\MNIST\raw\t10k-labels-idx1-ubyte.gz



Extracting ../data\MNIST\raw\t10k-labels-idx1-ubyte.gz to ../data\MNIST\raw



GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name     | Type    | Params
-------------------------------------
0 | conv1    | Conv2d  | 320   
1 | conv2    | Conv2d  | 18.5 K
2 | dropout1 | Dropout | 0     
3 | dropout2 | Dropout | 0     
4 | fc1      | Linear  | 1.2 M 
5 | fc2      | Linear  | 1.3 K 
-------------------------------------
1.2 M     Trainable params
0         Non-trainable params
1.2 M     Total params
4.800     Total estimated model params size (MB)


Training: 0it [00:00, ?it/s]
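
The `*.pt.trace.json` file that the training cell logs as an artifact is plain JSON, so it can also be inspected without the interactive viewer. The sketch below assumes the Chrome trace event layout: a top-level `traceEvents` list whose complete events carry `ph == "X"`, a `name`, and a `dur` in microseconds. It runs on a small synthetic trace with made-up timings, and `total_duration_by_name` is a hypothetical helper, not part of PyTorch or W&B.

```python
import json
import os
import tempfile
from collections import defaultdict


def total_duration_by_name(trace_path):
    """Sum the `dur` (microseconds) of complete ("X") events per event name."""
    with open(trace_path) as f:
        trace = json.load(f)
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        if event.get("ph") == "X":
            totals[event["name"]] += event.get("dur", 0)
    return dict(totals)


# Synthetic stand-in for a real trace file; events and timings are made up
fake_trace = {"traceEvents": [
    {"ph": "X", "name": "aten::conv2d", "ts": 0, "dur": 120},
    {"ph": "X", "name": "aten::conv2d", "ts": 200, "dur": 80},
    {"ph": "X", "name": "aten::linear", "ts": 300, "dur": 50},
    {"ph": "M", "name": "process_name", "args": {"name": "python"}},  # metadata, skipped
]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "trace.pt.trace.json")
    with open(path, "w") as f:
        json.dump(fake_trace, f)
    totals = total_duration_by_name(path)

print(totals)
```

To try it on a real run, pass the path matched by the `glob` call in the training cell instead of the temporary file. Sorting the result by value gives a rough list of the most expensive operations.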
