# Regularizations with CNNs

## Lab 3 Regularization by Initialization, Batch Norm & Dropout

Author: M. Rußwurm, 2024, based on notebooks from D. Tuia (2020)

In this lab, we get the more complex CNN model (from Lab 2) to work by:
1. changing the weight initialization
2. adding batch normalization
3. adding dropout

### Setup

Let's get the required Python packages.

**d2l** Package:
The "d2l" (short for "dive into deep learning") package is a Python library designed to accompany the book "Dive into Deep Learning".

**PyTorch**:
PyTorch is an open-source machine learning library and scientific computing framework, primarily used for deep learning applications.

**sklearn.metrics**:
The "sklearn.metrics" module is part of the scikit-learn library, a popular machine learning library in Python. The metrics module specifically focuses on providing tools for evaluating the performance of machine learning models.

In [52]:
!pip install -q d2l

import torch
from torch import nn
from d2l import torch as d2l
from sklearn.metrics import classification_report

## Data - FashionMNIST

Let's start by loading the FashionMNIST data.

Fashion MNIST is a dataset used in machine learning and computer vision, serving as a benchmark for image classification tasks. It consists of 70,000 grayscale images of clothing items, categorized into 10 classes such as t-shirts, dresses, and sneakers. Fashion MNIST is a popular alternative to the traditional handwritten digit MNIST dataset, providing a more complex challenge for developing and testing image recognition algorithms.

In [53]:
fashionMNIST = d2l.FashionMNIST(batch_size=128)

train_dataloader = fashionMNIST.get_dataloader(train=True)
val_dataloader = fashionMNIST.get_dataloader(train=False)

text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']

for batch in train_dataloader:
    X,y = batch
    fashionMNIST.visualize(batch)
    break

Later, we would like to validate the model.
The function given below
1. iterates through all data in a (validation) dataloader
2. stores the ground truth (y_true) and predictions (y_pred)
3. prints a classification report

In [54]:
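@torch.no_grad()
def validate(model, dataloader):
    # Run on the model's device (the Trainer may have moved it to a GPU)
    device = next(model.parameters()).device
    model.eval()  # BatchNorm uses running statistics, Dropout is disabled
    y_pred = []
    y_true = []
    for X, y in dataloader:
        y_true.append(y)
        # predicted class = index of the largest logit; move back to CPU for sklearn
        y_pred.append(model(X.to(device)).argmax(1).cpu())

    y_true = torch.hstack(y_true)
    y_pred = torch.hstack(y_pred)

    print(classification_report(y_true=y_true.numpy(), y_pred=y_pred.numpy(),
                                labels=torch.arange(10).numpy(), target_names=text_labels))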
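
As a small illustration of steps 2 and 3, the sketch below uses made-up logits for three classes (the values and the three-class subset of labels are hypothetical, not FashionMNIST outputs). It shows how `argmax(1)` turns a batch of scores into class predictions before they are compared with the ground truth.

```python
import torch
from sklearn.metrics import classification_report

# Made-up logits for a batch of 4 samples and 3 classes (illustrative values only)
logits = torch.tensor([[2.0, 0.1, -1.0],
                       [0.3, 1.5, 0.2],
                       [0.0, 0.2, 3.0],
                       [1.2, 0.9, 0.1]])
y_true = torch.tensor([0, 1, 2, 1])

# argmax over dimension 1 picks the highest-scoring class for each sample
y_pred = logits.argmax(1)
print(y_pred)  # tensor([0, 1, 2, 0])

# Fraction of samples where prediction and ground truth agree
accuracy = (y_pred == y_true).float().mean().item()
print(accuracy)  # 0.75

# Per-class precision, recall and F1 for the toy batch
print(classification_report(y_true=y_true.numpy(), y_pred=y_pred.numpy(),
                            target_names=['t-shirt', 'trouser', 'pullover']))
```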
    

## Run 1 - simple CNN (LeNet) default initialization

Let's create an instance of the LeNet model from Lab 2 again.

In [61]:
class LeNetCNNModel(d2l.Classifier):
    def __init__(self, lr=0.1, weight_decay=1e-4, momentum=0.4):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(6, kernel_size=5, padding=2), nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.LazyConv2d(16, kernel_size=5), nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
            nn.LazyLinear(120), nn.Sigmoid(),
            nn.LazyLinear(84), nn.Sigmoid(),
            nn.LazyLinear(10))
    def training_step(self, batch):
        Y_hat = self(*batch[:-1])
        loss = self.loss(Y_hat, batch[-1])
        self.plot('loss', loss, train=True)
        self.plot('acc', self.accuracy(Y_hat, batch[-1]), train=True)
        return loss  # the d2l Trainer takes care of the backward pass and the optimizer step

    def validation_step(self, batch):
        Y_hat = self(*batch[:-1])
        self.plot('loss', self.loss(Y_hat, batch[-1]), train=False)
        self.plot('acc', self.accuracy(Y_hat, batch[-1]), train=False)
        
    def configure_optimizers(self):
        # pass the momentum hyperparameter on to SGD (it was saved but never used)
        optimizer = torch.optim.SGD(self.parameters(), lr=self.lr,
                               momentum=self.momentum,
                               weight_decay=self.weight_decay)
        return optimizer

Let's train the model again.

**Task** 
* Train the model and confirm that it does not train in its current form.

In [62]:
#TODO: Train the model with weight_decay=1e-4, momentum=0.5
# model = ...
# trainer = ...
# trainer.fit(...)



model.layer_summary(X_shape=X.shape)
validate(model, val_dataloader)

## Run 2: Kaiming or Xavier Uniform Initialization of Weights

Let's now get the model to train by using a better weight initialization. Check the [Torch documentation](https://pytorch.org/docs/stable/nn.init.html) for details.
We give you the code that changes the initialization; please modify it to try the schemes listed in the task.

**Task**: Train the model with different initializations:
* replace `kaiming_uniform_` with `xavier_uniform_` (note that `xavier_uniform_` takes a `gain` argument instead of `nonlinearity`)
* test `kaiming_normal_` and `xavier_normal_`
* initialize the bias terms with 0 or leave them at their defaults
* change the non-linearity function from `sigmoid` to `relu`

For more information, check the [Torch documentation](https://pytorch.org/docs/stable/nn.init.html)
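
To get a feeling for what these initializers do, the following minimal sketch compares the sampling bounds of `xavier_uniform_` and `kaiming_uniform_` on a single linear layer. The layer size (400 inputs, 120 outputs) is chosen to match the first linear layer of LeNet; the variable names are only illustrative. The bounds follow the formulas in the PyTorch documentation, with gain 1 for Xavier and the `relu` gain for Kaiming.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
fan_in, fan_out = 400, 120  # illustrative: shape of LeNet's first linear layer
layer = nn.Linear(fan_in, fan_out)

# Xavier/Glorot uniform: U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)), gain = 1
nn.init.xavier_uniform_(layer.weight)
xavier_bound = math.sqrt(6.0 / (fan_in + fan_out))
xavier_max = layer.weight.abs().max().item()
print(f"xavier:  bound={xavier_bound:.4f}, max |w|={xavier_max:.4f}")

# Kaiming/He uniform (fan_in mode): U(-a, a) with a = gain * sqrt(3 / fan_in)
nn.init.kaiming_uniform_(layer.weight, nonlinearity="relu")
gain = nn.init.calculate_gain("relu")  # sqrt(2) for ReLU
kaiming_bound = gain * math.sqrt(3.0 / fan_in)
kaiming_max = layer.weight.abs().max().item()
print(f"kaiming: bound={kaiming_bound:.4f}, max |w|={kaiming_max:.4f}")

# Bias terms can be set to zero explicitly
nn.init.zeros_(layer.bias)
```

Both schemes scale the weights with the layer width so that activations neither blow up nor shrink as they pass through the network; they differ in which fan they use and in the gain applied for the non-linearity.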

In [63]:
# TODO try different initializations and observe the training (next cells)
def init_cnn(module):
    
    # Linear Layers
    if type(module) == nn.Linear:
        nn.init.kaiming_uniform_(module.weight, nonlinearity="sigmoid")
        
    # CNN Layers
    if type(module) == nn.Conv2d:
        nn.init.kaiming_uniform_(module.weight, nonlinearity="sigmoid")
        

In [64]:
model = LeNetCNNModel(lr=1, weight_decay=1e-4, momentum=0.5)
model.layer_summary(X_shape=X.shape)

model.apply_init([X], init_cnn)

trainer = d2l.Trainer(max_epochs=5, num_gpus=1)
trainer.fit(model, fashionMNIST)

validate(model, val_dataloader)

## Run 3 Model - Batch Normalization

Let's improve the model further by adding batch normalization layers after the convolutions and linear transformations.

**Task**
* add `nn.LazyBatchNorm2d()` and `nn.LazyBatchNorm1d()` to the model from above.

The layer summary should look like this:
```
Conv2d output shape:	  torch.Size([128, 6, 28, 28])
Sigmoid output shape:	  torch.Size([128, 6, 28, 28])
BatchNorm2d output shape: torch.Size([128, 6, 28, 28])
AvgPool2d output shape:	  torch.Size([128, 6, 14, 14])
Conv2d output shape:	  torch.Size([128, 16, 10, 10])
Sigmoid output shape:	  torch.Size([128, 16, 10, 10])
BatchNorm2d output shape: torch.Size([128, 16, 10, 10])
AvgPool2d output shape:	  torch.Size([128, 16, 5, 5])
Flatten output shape:	  torch.Size([128, 400])
Linear output shape:	  torch.Size([128, 120])
Sigmoid output shape:	  torch.Size([128, 120])
BatchNorm1d output shape: torch.Size([128, 120])
Linear output shape:	  torch.Size([128, 84])
Sigmoid output shape:	  torch.Size([128, 84])
BatchNorm1d output shape: torch.Size([128, 84])
Linear output shape:	  torch.Size([128, 10])
```
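
Before adding the layers to the model, it can help to look at a batch normalization layer in isolation. The sketch below is a toy setup: a standalone `nn.BatchNorm1d` with 4 features and random input, none of it taken from the lab model. It shows that the layer normalizes with the statistics of the current batch in training mode, but with its running averages in evaluation mode.

```python
import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)            # toy layer with 4 features
x = torch.randn(128, 4) * 5 + 3   # random batch with mean ~3 and std ~5

# Training mode: normalize with the mean/variance of this batch
bn.train()
out_train = bn(x)
print(out_train.mean(0))                 # close to 0 for every feature
print(out_train.std(0, unbiased=False))  # close to 1 for every feature

# The forward pass also updated the running statistics
# (default momentum 0.1, starting from mean 0 and variance 1)
print(bn.running_mean, bn.running_var)

# Evaluation mode: normalize with the running statistics instead
bn.eval()
out_eval = bn(x)
print(out_eval.mean(0))  # no longer 0, since the running mean has only moved 10% of the way
```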

In [66]:
class LeNetCNNModelBatchNorm(d2l.Classifier):
    def __init__(self, lr=0.1, weight_decay=1e-4, momentum=0.4):
        super().__init__()
        self.save_hyperparameters()
        
        # TODO add BatchNorm Layers
        # self.net = nn.Sequential(...)
        self.net = nn.Sequential(
            nn.LazyConv2d(6, kernel_size=5, padding=2), nn.Sigmoid(), nn.LazyBatchNorm2d(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.LazyConv2d(16, kernel_size=5), nn.Sigmoid(), nn.LazyBatchNorm2d(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
            nn.LazyLinear(120), nn.Sigmoid(), nn.LazyBatchNorm1d(),
            nn.LazyLinear(84), nn.Sigmoid(), nn.LazyBatchNorm1d(),
            nn.LazyLinear(10))
    def training_step(self, batch):
        Y_hat = self(*batch[:-1])
        loss = self.loss(Y_hat, batch[-1])
        self.plot('loss', loss, train=True)
        self.plot('acc', self.accuracy(Y_hat, batch[-1]), train=True)
        return loss  # the d2l Trainer takes care of the backward pass and the optimizer step

    def validation_step(self, batch):
        Y_hat = self(*batch[:-1])
        self.plot('loss', self.loss(Y_hat, batch[-1]), train=False)
        self.plot('acc', self.accuracy(Y_hat, batch[-1]), train=False)
        
    def configure_optimizers(self):
        # pass the momentum hyperparameter on to SGD (it was saved but never used)
        optimizer = torch.optim.SGD(self.parameters(), lr=self.lr,
                               momentum=self.momentum,
                               weight_decay=self.weight_decay)
        return optimizer
    
model = LeNetCNNModelBatchNorm(lr=1)
model.layer_summary(X_shape=X.shape)

Let's train the model and observe the training and performance.

In [67]:
model = LeNetCNNModelBatchNorm(lr=1)
model.apply_init([X], init_cnn)

trainer = d2l.Trainer(max_epochs=5, num_gpus=1)
trainer.fit(model, fashionMNIST)

validate(model, val_dataloader)

## Run 4 Model - Dropout

Let's now also add dropout:

**Task**
* add `nn.Dropout2d()` and `nn.Dropout1d()` layers in the model

The layer summary should look like this:
```
Conv2d output shape:	   torch.Size([128, 6, 28, 28])
Dropout2d output shape:	   torch.Size([128, 6, 28, 28])
Sigmoid output shape:	   torch.Size([128, 6, 28, 28])
BatchNorm2d output shape:  torch.Size([128, 6, 28, 28])
AvgPool2d output shape:	   torch.Size([128, 6, 14, 14])
Conv2d output shape:	   torch.Size([128, 16, 10, 10])
Dropout2d output shape:	   torch.Size([128, 16, 10, 10])
Sigmoid output shape:	   torch.Size([128, 16, 10, 10])
BatchNorm2d output shape:  torch.Size([128, 16, 10, 10])
AvgPool2d output shape:	   torch.Size([128, 16, 5, 5])
Flatten output shape:	   torch.Size([128, 400])
Linear output shape:	   torch.Size([128, 120])
Dropout1d output shape:	   torch.Size([128, 120])
Sigmoid output shape:	   torch.Size([128, 120])
BatchNorm1d output shape:  torch.Size([128, 120])
Linear output shape:	   torch.Size([128, 84])
Dropout1d output shape:	   torch.Size([128, 84])
Sigmoid output shape:	   torch.Size([128, 84])
BatchNorm1d output shape:  torch.Size([128, 84])
Linear output shape:	   torch.Size([128, 10])
```
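
To see what a dropout layer does on its own, here is a minimal sketch with a standalone `nn.Dropout2d` applied to a tensor of ones (a toy input, not the lab data; `p=0.5` is the PyTorch default). In training mode whole channels are zeroed and the surviving ones are scaled by `1 / (1 - p)`; in evaluation mode the layer passes its input through unchanged.

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout2d(p=0.5)
x = torch.ones(8, 6, 4, 4)  # toy batch: 8 samples, 6 channels, 4x4 feature maps

# Training mode: each channel is either dropped (all 0) or kept and scaled to 1 / (1 - p) = 2
drop.train()
out_train = drop(x)
print(out_train[0, :, 0, 0])  # one value per channel of the first sample: 0 or 2

# Evaluation mode: dropout is disabled, the input is returned unchanged
drop.eval()
out_eval = drop(x)
print(torch.equal(out_eval, x))  # True
```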

In [70]:
class LeNetCNNModelBatchNormDropout(d2l.Classifier):
    def __init__(self, lr=0.1, weight_decay=1e-4, momentum=0.4):
        super().__init__()
        self.save_hyperparameters()
        # TODO add nn.Dropout2d() and nn.Dropout1d() to the model from above
        # self.net = nn.Sequential(...)
        
        
    def training_step(self, batch):
        Y_hat = self(*batch[:-1])
        loss = self.loss(Y_hat, batch[-1])
        self.plot('loss', loss, train=True)
        self.plot('acc', self.accuracy(Y_hat, batch[-1]), train=True)
        return loss  # the d2l Trainer takes care of the backward pass and the optimizer step

    def validation_step(self, batch):
        Y_hat = self(*batch[:-1])
        self.plot('loss', self.loss(Y_hat, batch[-1]), train=False)
        self.plot('acc', self.accuracy(Y_hat, batch[-1]), train=False)
        
    def configure_optimizers(self):
        # pass the momentum hyperparameter on to SGD (it was saved but never used)
        optimizer = torch.optim.SGD(self.parameters(), lr=self.lr,
                               momentum=self.momentum,
                               weight_decay=self.weight_decay)
        return optimizer
    
model = LeNetCNNModelBatchNormDropout(lr=1, weight_decay=1e-4, momentum=0.5)
model.layer_summary(X_shape=X.shape)

In [71]:
model = LeNetCNNModelBatchNormDropout(lr=1, weight_decay=1e-4, momentum=0.5)
model.apply_init([X], init_cnn)
trainer = d2l.Trainer(max_epochs=5, num_gpus=1)
trainer.fit(model, fashionMNIST)

validate(model, val_dataloader)

# Questions

1. Can you explain, in terms of loss surfaces, why the initialization of a neural network matters (Run 1 vs Run 2)?



2. Why do Dropout (Run 4) and BatchNorm (Run 3) increase the training loss compared to the validation loss? What do BatchNorm and Dropout do differently during training compared to validation/testing?

