<a href="https://colab.research.google.com/github/wandb/edu/blob/main/lightning/cnn/debug_hyperparameters.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://i.imgur.com/gb6B4ig.png" width="400" alt="Weights & Biases" />

# Debugging Hyperparameters for a Convolutional Neural Network

## Installing and Importing Libraries

In [None]:
%%capture
!pip install pytorch-lightning torchviz wandb

repo_url = "https://raw.githubusercontent.com/wandb/edu/main/"
utils_path = "lightning/utils.py"
# Download a util file of helper methods for this notebook
!curl {repo_url + utils_path} --output utils.py

import math

import pytorch_lightning as pl
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
import torchvision.datasets
import wandb

import utils

In [None]:
!wandb login

## Defining the `Model`

In [None]:
class FullyConnected(pl.LightningModule):

  def __init__(self, config, batchnorm=False, dropout=0, activation=None):
    super().__init__()
    out_features = config["out_features"]
    self.linear = torch.nn.Linear(**config)
    if activation is None:
      activation = torch.nn.Identity  # defaults to passing inputs unchanged
    self.activation = activation()

    # add batchnorm and dropout
    post_act_layers = []
    if batchnorm:
      post_act_layers.append(torch.nn.BatchNorm1d(out_features))
    if dropout:
      post_act_layers.append(torch.nn.Dropout(dropout))
    self.post_act = torch.nn.Sequential(*post_act_layers)

  def forward(self, x):
    return self.post_act(self.activation(self.linear(x)))


class Convolution(pl.LightningModule):

  def __init__(self, config, batchnorm=False, dropout=0, activation=None):
    super().__init__()
    out_channels = config["out_channels"]
    self.conv2d = torch.nn.Conv2d(**config)
    if activation is None:
      activation = torch.nn.Identity  # defaults to passing inputs unchanged
    self.activation = activation()

    # add batchnorm and dropout
    post_act_layers = []
    if batchnorm:
      post_act_layers.append(torch.nn.BatchNorm2d(out_channels))
    if dropout:
      post_act_layers.append(torch.nn.Dropout2d(dropout))
    self.post_act = torch.nn.Sequential(*post_act_layers)

  def forward(self, x):
    return self.post_act(self.activation(self.conv2d(x)))

class LitCNN(utils.LoggedImageClassifierModule):
  """A simple CNN Model, with under-the-hood wandb
  and pytorch-lightning features (logging, metrics, etc.).
  """

  def __init__(self, config, max_images_to_display=32):  # make the model
    super().__init__(max_images_to_display=max_images_to_display)

    # first, convolutional component
    self.conv_layers = torch.nn.Sequential(
      Convolution(config["conv"][0],
                  activation=config["conv"]["activation"],
                  batchnorm=config["conv"]["batchnorm"],
                  dropout=config["conv"]["dropout"]),
      Convolution(config["conv"][1],
                  activation=config["conv"]["activation"],
                  batchnorm=config["conv"]["batchnorm"],
                  dropout=config["conv"]["dropout"]),
      torch.nn.MaxPool2d(**config["pool"]),
    )

    # need a fixed-size input for fully-connected component,
    #  so apply a "re-sizing" layer, to size set in config
    self.resize_layer = torch.nn.AdaptiveAvgPool2d(
      (config["final_height"], config["final_width"]))

    final_size = (config["final_height"] * config["final_width"]
                  * config["conv"][1]["out_channels"])
    config["fc"][0]["in_features"] = final_size

    # now, we can apply our fully-connected component
    self.fc_layers = torch.nn.Sequential(
      FullyConnected(config["fc"][0],
                     activation=config["fc"]["activation"],
                     batchnorm=config["fc"]["batchnorm"],
                     dropout=config["fc"]["dropout"]),
      FullyConnected(config["fc"][1],
                     activation=config["fc"]["activation"],
                     batchnorm=config["fc"]["batchnorm"],
                     dropout=config["fc"]["dropout"]),
      # "read-out" layer produces class predictions
      FullyConnected({"in_features": config["fc"][1]["out_features"],
                      "out_features": 10}),
    )

    self.loss = config["loss"]
    self.optimizer = config["optimizer"]
    self.optimizer_params = config["optimizer.params"]

  def forward(self, x):  # produce outputs
    # first apply convolutional layers
    for layer in self.conv_layers: 
      x = layer(x)

    # then convert to a fixed-size vector
    x = self.resize_layer(x)
    x = torch.flatten(x, start_dim=1)

    # then apply the fully-connected layers
    for layer in self.fc_layers: # snap together the LEGOs
      x = layer(x)

    return F.log_softmax(x, dim=1)  # compute log of softmax, for numerical reasons

  def configure_optimizers(self):  # ⚡: setup for .fit
    return self.optimizer(self.parameters(), **self.optimizer_params)

## Building and Training the `Model`

In [None]:
config = {
  "batch_size": 1024,
  "max_epochs": 5,
  "conv": {  # configuration for the convolutional layers
      "activation": torch.nn.Tanh,  # which activation function in the conv layers?
      "batchnorm": True,  # should we use batchnorm in the conv layers?
      "dropout": 0.9,  # how much dropout should we apply? set to 0 to deactivate
      0: {  # these are passed as kwargs to the first torch.nn.Conv2d
          "in_channels": 1,  # must match number of channels in data
          "out_channels": 3,  # must match conv[1]["in_channels"]
          "kernel_size": [7, 3],
          "padding": [5, 0],
          "stride": [1, 2], 
          "dilation": [2, 1],
      },
      1: {  # these are passed as kwargs to the second torch.nn.Conv2d
          "in_channels": 3,  # must match conv[0]["out_channels"]
          "out_channels": 128,
          "kernel_size": [2, 5],
          "padding": [1, 3],
          "stride": [1, 4], 
          "dilation": [6, 1],
      },
  },
  "pool": {  # these are passed as kwargs to torch.nn.MaxPool2d
      "kernel_size": 3,
      "stride": 1,
  },
  "final_height": 8,  # how large should we resize conv outputs to?
  "final_width": 8,   #  this hyperparameter can stay fixed
  "fc": {  # configuration for the fully-connected/torch.nn.Linear layers
        "activation": torch.nn.Identity,  # which activation function in the linear layers?
        "batchnorm": False,  # should we use batchnorm in the linear layers?
        "dropout": 0.,  # how much dropout should we apply? set to 0 to deactivate
        0 : {  # these are passed as kwargs to the first torch.nn.Linear
            "in_features": None,  # calculated from other values
            "out_features": 10,  # must match fc[1]["in_features"]
        },
        1 : {  # these are passed as kwargs to the second torch.nn.Linear
            "in_features": 10,  # must match fc[0]["out_features"]
            "out_features": 16,
        },
  },
  "loss": torch.nn.NLLLoss(),  # with log_softmax outputs, this is cross-entropy loss
  "optimizer": torch.optim.Adam,
  "optimizer.params": {"lr": 0.1},
}

dmodule = utils.MNISTDataModule(batch_size=config["batch_size"])
lcnn = LitCNN(config, max_images_to_display=32)
dmodule.prepare_data()
dmodule.setup()

### Debugging Code

In [None]:
# for debugging purposes (checking shapes, etc.), make these available
dloader = dmodule.train_dataloader()  # set up the Loader

example_batch = next(iter(dloader))  # grab a batch from the Loader
example_x, example_y = example_batch[0].to("cuda"), example_batch[1].to("cuda")

print(f"Input Shape: {example_x.shape}")

lcnn.to("cuda")
conv_outs = lcnn.conv_layers(example_x)
print(f"Conv Output Shape: {conv_outs.shape}")
fc_inputs = torch.flatten(lcnn.resize_layer(conv_outs), start_dim=1)
print(f"FC Input Shape: {fc_inputs.shape}")
outputs = F.log_softmax(lcnn.fc_layers(fc_inputs), dim=1)
print(f"Output Shape: {outputs.shape}")
print(f"Target Shape: {example_y.shape}")
print(f"Loss : {lcnn.loss(outputs, example_y)}")

### Running `.fit`

In [None]:
with wandb.init(config=config, project="debug-cnn", entity="wandb"):
  lcnn = LitCNN(config, max_images_to_display=32)
  dmodule = utils.MNISTDataModule(batch_size=config["batch_size"])
  # 👟 configure Trainer 
  trainer = pl.Trainer(gpus=1,  # use the GPU for .forward
                      logger=pl.loggers.WandbLogger(
                        save_code=True),  # log to Weights & Biases
                      callbacks=[utils.FilterLogCallback((), log_input=True)],
                      max_epochs=config["max_epochs"], log_every_n_steps=1,
                      progress_bar_refresh_rate=50)
  
  # 🏃‍♀️ run the Trainer on the model
  trainer.fit(lcnn, dmodule)
  

## Exercises


#### 1. Back to the Defaults



With the original, intentionally bad values in the config above,
the accuracy on the validation set after 5 epochs should typically be
around 50%.

This performance is far better than chance for this dataset (10% accuracy),
but very far away from the best possible performance with
the right hyperparameters
(near 100% accuracy).
It's important to reflect on what this means for debugging neural network
hyperparameter choices:
unless you know, from your own or others' past work,
what the performance ceiling on your metric is for your model,
you can't tell whether you need to keep tweaking the hyperparameters
or go back to the drawing board.

We can start by returning some of the hyperparameter values
to their defaults.

Walk through the `config` and find hyperparameters that have default values
in PyTorch:
`padding`, `stride`, and `dilation` for the `conv`
and `pool` layers, to start.

Look up the documentation for [`Conv2d`](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html)
and [`MaxPool2d`](https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html),
find the default values for these hyperparameters,
and set the values in the config to those default values,
then run training again.

Does the validation accuracy metric improve?
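
To build intuition for what `padding`, `stride`, and `dilation` do,
it can help to compute the output shape of each layer by hand.
The sketch below applies the output-size formula from the `Conv2d` and `MaxPool2d`
documentation to the values currently in the `config`.
It assumes 28 x 28 MNIST inputs, and the helper names
`conv_out_size` and `out_shape` are made up for this illustration.

```python
import math


def conv_out_size(size, kernel_size, padding=0, stride=1, dilation=1):
    """Output length along one dimension, per the formula in the Conv2d docs."""
    return math.floor(
        (size + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1)


def out_shape(hw, kernel_size, padding, stride, dilation):
    """Apply the formula to height and width separately."""
    return tuple(conv_out_size(s, k, p, st, d) for s, k, p, st, d
                 in zip(hw, kernel_size, padding, stride, dilation))


# Assumption: 28 x 28 MNIST inputs; layer values copied from the config above
shape = (28, 28)
shape = out_shape(shape, kernel_size=[7, 3], padding=[5, 0],
                  stride=[1, 2], dilation=[2, 1])  # conv[0]
conv0_shape = shape
shape = out_shape(shape, kernel_size=[2, 5], padding=[1, 3],
                  stride=[1, 4], dilation=[6, 1])  # conv[1]
conv1_shape = shape
shape = out_shape(shape, kernel_size=[3, 3], padding=[0, 0],
                  stride=[1, 1], dilation=[1, 1])  # pool
pool_shape = shape

print(conv0_shape, conv1_shape, pool_shape)  # (26, 13) (22, 4) (20, 2)
```

With the current values, the width shrinks from 28 to 2 before the resize layer
stretches the result back to 8 x 8.
Swapping in other values shows how each hyperparameter changes these shapes.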


#### 2. Standing on the Shoulders of Giants

Setting as many values to their defaults as possible is a good start,
but there are other hyperparameters,
like kernel size,
that don't have default values.
How do we set those?

There are many tutorials online about how to train
a convolutional neural network
to solve an image classification problem.
They will typically get 90% validation accuracy
or more on this problem.
These tutorials will generally include their hyperparameter choices,
either explicitly, inline (ideally with an explanation!),
or implicitly, in their code.

Review some of the tutorials and examples below to find
good values for the hyperparameters --
skip over the text and find the code,
so that you can focus on their hyperparameter choices.
All of them will choose kernel sizes,
channel/feature counts in convolutional/linear layers,
and activation functions.
Compare these against each other and the hyperparameters above
to find reasonable values.

- ["Convolutional Neural Networks in PyTorch", from Adventures in ML](https://adventuresinmachinelearning.com/convolutional-neural-networks-tutorial-in-pytorch/).
Look for the "Creating the Model" section.
Compare the channel counts and linear layer sizes
from this example with the values above.
- ["Dropout in PyTorch", by Ayush Thakur](https://wandb.ai/authors/ayusht/reports/Dropout-in-PyTorch-An-Example--VmlldzoxNTgwOTE).
Note where the Dropout is applied and what its value is
(`_forward_features` here applies the conv and pooling layers).
Compare those to the hyperparameters above.
- [Official PyTorch MNIST example code](https://github.com/pytorch/examples/blob/cbb760d5e50a03df667cdc32a61f75ac28e11cbf/mnist/main.py#L11).
Which activation function is used? What values are used for Dropout?
- ["MNIST Handwritten Digit Recognition in PyTorch", by Gregor Koehler](https://nextjournal.com/gkoehler/pytorch-mnist).
Note the choice of pooling size and stride here.
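
When comparing channel counts and linear layer sizes across examples,
it can be useful to translate them into parameter counts.
The following is a minimal sketch for the layers in the `config` above,
assuming the PyTorch defaults of `bias=True` and `groups=1`;
the helper names `conv2d_params` and `linear_params` are made up for this illustration.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases in a Conv2d layer with a [height, width] kernel."""
    kernel_h, kernel_w = kernel_size
    return out_channels * in_channels * kernel_h * kernel_w + out_channels


def linear_params(in_features, out_features):
    """Weights plus biases in a Linear layer."""
    return in_features * out_features + out_features


# Values copied from the config above; batchnorm parameters are not counted
counts = {
    "conv[0]": conv2d_params(1, 3, [7, 3]),
    "conv[1]": conv2d_params(3, 128, [2, 5]),
    "fc[0]": linear_params(8 * 8 * 128, 10),  # in_features = final_size
    "fc[1]": linear_params(10, 16),
    "read-out": linear_params(16, 10),
}
for name, count in counts.items():
    print(f"{name:>8}: {count:>6,}")
print(f"   total: {sum(counts.values()):>6,}")
```

Note that padding, stride, and dilation do not appear in these formulas:
they change the shapes of the outputs, not the number of parameters.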

There are some shared themes you might notice:
- Which activation function is most commonly used?
- Are dropout values typically above or below `0.5`?
- None of these examples use batch normalization.
What happens when you deactivate batch normalization?
- Are kernels/padding/dilation/stride typically symmetric
(height = width) or asymmetric? Do any of the examples deviate
from the default values for the parameters that have defaults?
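
To get a feel for what a dropout value means,
the sketch below applies `torch.nn.Dropout` with the rate from the `config` above
to a vector of ones and measures what fraction of the entries is zeroed.
It uses the element-wise `Dropout` for simplicity;
the conv layers above use `Dropout2d`, which zeroes entire channels at the same rate.

```python
import torch

torch.manual_seed(0)  # make the random mask reproducible

dropout = torch.nn.Dropout(p=0.9)  # same rate as config["conv"]["dropout"]
dropout.train()  # dropout is only active in training mode

x = torch.ones(10_000)
y = dropout(x)

zeroed_fraction = (y == 0).float().mean().item()
kept_value = y.max().item()  # survivors are scaled up by 1 / (1 - p)
print(f"fraction zeroed: {zeroed_fraction:.3f}")
print(f"value of surviving entries: {kept_value:.1f}")

dropout.eval()  # in evaluation mode, inputs pass through unchanged
eval_unchanged = torch.equal(dropout(x), x)
print(f"unchanged in eval mode: {eval_unchanged}")
```

Try a few values of `p` to see how much of the signal each one removes during training.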

You can also look further afield:
- Search the web for "deep learning pytorch mnist",
possibly including search terms like "dropout", "cnn", etc.,
to find even more examples and tutorials.
You might also find helpful nuggets
if you look for examples in Keras/TensorFlow!
- Research papers often contain useful insight,
even though they're generally harder to read
than blog posts.
The architecture above is inspired by the
[VGGNet paper](https://arxiv.org/pdf/1409.1556.pdf)
from ICLR 2015, by Simonyan and Zisserman.
See Section 2, "ConvNet Configurations",
for most of the hyperparameter choices.
Note that their classifier has 1000 classes,
so you'll want to scale the channel/feature
dimensions down accordingly.

> _Note_: some examples use a slightly different API
for the max-pooling/dropout layers
(e.g. [this line](https://github.com/pytorch/examples/blob/cbb760d5e50a03df667cdc32a61f75ac28e11cbf/mnist/main.py#L26)
in the official PyTorch example),
but that API has the same arguments
(details [here](https://stackoverflow.com/questions/58514197/difference-between-nn-maxpool2d-vs-nn-functional-max-pool2d)).
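
As a quick check of the note above, this sketch runs the same max-pooling operation
through both APIs and compares the results.
The input shape and the pooling arguments are arbitrary values chosen for illustration.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)  # (batch, channels, height, width)

# module API: configure the layer once, then call it like a function
pool_layer = torch.nn.MaxPool2d(kernel_size=2, stride=2)
module_out = pool_layer(x)

# functional API: pass the same arguments along with the input
functional_out = F.max_pool2d(x, kernel_size=2, stride=2)

print(module_out.shape)
print(torch.equal(module_out, functional_out))
```

The module form fits naturally into `torch.nn.Sequential`, as in the model above,
while the functional form is typically called directly inside `forward`.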