# Optional: Batch Normalization & Dropout 

In this Notebook we will introduce the idea of Batch Normalization and Dropout and how both these methods help in Neural Network training. 

By completing this exercise you will:
1. Know the implementation details of Batch Norm and Dropout
2. Notice the difference in behaviour during train and test time
3. Use Batch Norm and Dropout in a Fully Connected Layer to see how it affects training

Let us start with Batch Normalization:

# 1. Batch Normalization

## 1.1 What is Batch Normalization
One way to make deep networks easier to train is to use more sophisticated optimization procedures such as SGD+momentum, RMSProp, or Adam. Another strategy is to change the architecture of the network to make it easier to train. One idea along these lines is batch normalization, which was proposed by [1].

The idea is relatively straightforward. Machine learning methods tend to work better when their input data consists of uncorrelated features with zero mean and unit variance. When training a neural network, we can preprocess the data before feeding it to the network to explicitly decorrelate its features; this will ensure that the first layer of the network sees data that follows a nice distribution. However, even if we preprocess the input data, the activations at deeper layers of the network will likely no longer be decorrelated and will no longer have zero mean or unit variance since they are output from earlier layers in the network. Even worse, during the training process, the distribution of features at each layer of the network will shift as the weights of each layer are updated.

The authors of [1] hypothesize that the shifting distribution of features inside deep neural networks may make training deep networks more difficult. To overcome this problem, [1] proposes to insert batch normalization layers into the network. At training time, a batch normalization layer uses a minibatch of data to estimate the mean and standard deviation of each feature. These estimated means and standard deviations are then used to center and normalize the features of the minibatch. A running average of these means and standard deviations is kept during training, and at test time, these running averages are used to center and normalize features.

It is possible that this normalization strategy could reduce the representational power of the network since it may sometimes be optimal for certain layers to have features that are not zero-mean or unit variance. To this end, the batch normalization layer includes learnable shift and scale parameters for each feature dimension.

[1] Sergey Ioffe and Christian Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", ICML 2015.




### Before We Start

It is important that we take a look at the Mathematical formula behind Batch Norm. Please make sure to understand the formula below since this will definitely help you with the implementation later. :)

<!-- In case the image does not show, uncomment the following:
<img name="batchnorm" src="./img/batchnorm.jpg">
 -->
<img name="batchnorm" src="https://i2dl.vc.in.tum.de/static/images/exercise_08/batchnorm.jpg">


### A quick explanation of the formula
The formula may look a bit confusing at first glance, but it is simpler than it seems. Let's summarize the mathematics here: In the left column, we are given the input data, which consists (as always) of $N$ samples of dimension $D$. Furthermore, we need the learnable shift and scale parameters, which we call $\beta$ and $\gamma$. The intermediates are the mean $\mu$ and variance $\sigma^2$ that we compute from the input data, and $\hat{x}$, the normalized input data. The output $y$ combines the normalized data with the learnable parameters.

The right column contains the mathematical formulations and can be summarized as follows:
1. For the given input $x$, we calculate the mean $\mu$ of each feature across all input samples.
2. Based on the mean $\mu$, we compute the variance $\sigma^2$ of each feature across all input samples.
3. We then normalize the input data based on the computed mean and variance.
4. Finally, we combine the normalized data with the learnable parameters $\gamma$ and $\beta$.
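
The four steps above can be sketched in a few lines of NumPy. This is only an illustration and not the implementation in `exercise_code/layers.py`: the function name `batchnorm_train_sketch` and the small constant `eps` (added under the square root for numerical stability) are assumptions made for this example.

```python
import numpy as np

def batchnorm_train_sketch(x, gamma, beta, eps=1e-5):
    """Training-time batch norm for x of shape (N, D); illustrative only."""
    mu = x.mean(axis=0)                    # step 1: per-feature mean, shape (D,)
    var = x.var(axis=0)                    # step 2: per-feature variance, shape (D,)
    x_hat = (x - mu) / np.sqrt(var + eps)  # step 3: normalize (eps is an assumed stabilizer)
    y = gamma * x_hat + beta               # step 4: scale by gamma and shift by beta
    return y

rng = np.random.default_rng(0)
x = 5.0 * rng.standard_normal((200, 3)) + 12.0
gamma = np.array([1.0, 2.0, 3.0])
beta = np.array([11.0, 12.0, 13.0])

y = batchnorm_train_sketch(x, gamma, beta)
print("means:", y.mean(axis=0))  # close to beta
print("stds: ", y.std(axis=0))   # close to gamma
```

After normalization every feature has roughly zero mean and unit standard deviation, so the output statistics are determined by $\beta$ and $\gamma$ alone.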

Please remember that Batch Normalization behaves differently at training and test time. In the figure above, we see the behavior at training time. 
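
To make the train/test difference concrete, here is a minimal sketch of how running averages could be maintained and then used at test time. The exponential-moving-average update, the `momentum` value of 0.9, the `state` dictionary and the function name are all assumptions for illustration; the provided `batchnorm_forward` may organize this differently.

```python
import numpy as np

def batchnorm_forward_sketch(x, gamma, beta, state, mode, momentum=0.9, eps=1e-5):
    """Illustrative batch norm with running statistics (hypothetical helper)."""
    if mode == "train":
        # Use the statistics of the current minibatch ...
        mu, var = x.mean(axis=0), x.var(axis=0)
        # ... and fold them into the running averages (assumed update rule).
        state["running_mean"] = momentum * state["running_mean"] + (1 - momentum) * mu
        state["running_var"] = momentum * state["running_var"] + (1 - momentum) * var
    else:
        # At test time, rely only on the statistics collected during training.
        mu, var = state["running_mean"], state["running_var"]
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
D = 3
state = {"running_mean": np.zeros(D), "running_var": np.ones(D)}
gamma, beta = np.ones(D), np.zeros(D)

# Warm up the running averages with many training-time passes.
for _ in range(200):
    batch = 4.0 * rng.standard_normal((100, D)) + 7.0
    batchnorm_forward_sketch(batch, gamma, beta, state, mode="train")

test_batch = 4.0 * rng.standard_normal((100, D)) + 7.0
out = batchnorm_forward_sketch(test_batch, gamma, beta, state, mode="test")
print("test-time means:", out.mean(axis=0))
print("test-time stds: ", out.std(axis=0))
```

Because the test-time pass uses the running estimates rather than the statistics of the test batch itself, the resulting means and standard deviations are only approximately zero and one.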




## (Optional) Mount folder in Colab

Uncomment the following cell to mount your Google Drive if you are using the notebook in Google Colab:

In [None]:
# Use the following lines if you want to use Google Colab
# Don't forget to change to GPU. Runtime --> Change runtime type --> GPU
# NOTE: terminate all other colab sessions that use GPU!
# NOTE 2: Make sure the correct exercise folder (e.g exercise_08) is given.
# NOTE 3: For simplicity, create a folder "i2dl" within your main drive folder, and put the exercise there.

"""
from google.colab import drive
import os
gdrive_path='/content/gdrive/MyDrive/i2dl/exercise_08'

# This will mount your google drive under 'MyDrive'
drive.mount('/content/gdrive', force_remount=True)
# In order to access the files in this notebook we have to navigate to the correct folder
os.chdir(gdrive_path)
# Check manually if all files are present
"""

## 1.2 Batch Normalization: Implementation

In [7]:
# As usual, a bit of setup

import time
import numpy as np
import matplotlib.pyplot as plt
from exercise_code.layers import *
from exercise_code.tests import *
import torch.nn as nn
import torch.nn.functional as F
import torch
import torchvision
import torchvision.transforms as transforms
import os
import shutil

from torch.utils.tensorboard import SummaryWriter

from exercise_code.BatchNormModel import SimpleNetwork, BatchNormNetwork, DropoutNetwork

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

os.environ['KMP_DUPLICATE_LIB_OK']='True' # To prevent the kernel from dying.

# suppress cluttering warnings in solutions
import warnings
warnings.filterwarnings('ignore')

def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))


# This sets the device on which to run the code, which defaults to the CPU.
# If you have an Nvidia GPU, it will be used via CUDA.
# For Apple Silicon, you can try the "mps" device by uncommenting the "mps" line below.
# It is not guaranteed to work, and sometimes the CPU performs better.

device = torch.device(
    "cuda:0" if torch.cuda.is_available() else 
#   "mps" if torch.backends.mps.is_available() else 
    "cpu"
)

### Batch normalization: Forward Pass

<div class="alert alert-success">
    <h3>Task: Check Code</h3>
    <p>In the file <code>exercise_code/layers.py </code>, we have implemented the <code>batchnorm_forward</code> function. Read this implementation and make sure you understand what batch normalization is doing. Then execute the following cells to test the implementation.
 </p>
</div>

In [8]:
# Check the training-time forward pass by checking means and variances
# of features both before and after batch normalization

# Simulate the forward pass for a two-layer network
N, D1, D2, D3 = 200, 50, 60, 3
X = np.random.randn(N, D1)
W1 = np.random.randn(D1, D2)
W2 = np.random.randn(D2, D3)
a = np.maximum(0, X.dot(W1)).dot(W2)

print('Before batch normalization:')
print('  means: ', a.mean(axis=0))
print('  stds: ', a.std(axis=0), '\n')

# Means should be close to zero and stds close to one
print('After batch normalization with (gamma=1, beta=0)')
a_norm, _ = batchnorm_forward(a, np.ones(D3), np.zeros(D3), {'mode': 'train'})
print('  mean: ', a_norm.mean(axis=0))
print('  std: ', a_norm.std(axis=0) , '\n')

# Now means should be close to beta and stds close to gamma
beta = np.asarray([11.0, 12.0, 13.0])
gamma = np.asarray([1.0, 2.0, 3.0])
a_norm, _ = batchnorm_forward(a, gamma, beta, {'mode': 'train'})
print('After batch normalization with (nontrivial gamma, beta)')
print('  means: ', a_norm.mean(axis=0))
print('  stds: ', a_norm.std(axis=0) )

Before batch normalization:
  means:  [-18.45638153  -6.43262094 -16.38788136]
  stds:  [31.98106437 31.5359993  24.56387286] 

After batch normalization with (gamma=1, beta=0)
  mean:  [-4.44089210e-17 -9.21485110e-17  2.99760217e-17]
  std:  [1.         0.99999999 0.99999999] 

After batch normalization with (nontrivial gamma, beta)
  means:  [11. 12. 13.]
  stds:  [1.         1.99999999 2.99999998]


Since the running means and variances in batch norm are accumulated at training time,
we first run the training-time forward pass many times to warm up the running averages
before invoking the test-time forward pass. We then check the means and variances
of the activations after a test-time forward pass.

In [9]:
# Check the test-time forward pass by checking means and variances 
# of features after batch normalization

N, D1, D2, D3 = 200, 50, 60, 3
W1 = np.random.randn(D1, D2)
W2 = np.random.randn(D2, D3)

bn_param = {'mode': 'train'}
gamma = np.ones(D3)
beta = np.zeros(D3)
for t in range(50):
    X = np.random.randn(N, D1)
    a = np.maximum(0, X.dot(W1)).dot(W2)
    batchnorm_forward(a, gamma, beta, bn_param)
bn_param['mode'] = 'test'
X = np.random.randn(N, D1)
a = np.maximum(0, X.dot(W1)).dot(W2)
a_norm, _ = batchnorm_forward(a, gamma, beta, bn_param)

# Means should be close to zero and stds close to one, but will be
# noisier than training-time forward passes.
print('After batch normalization (test-time):')
print('  means: ', a_norm.mean(axis=0))
print('  stds: ', a_norm.std(axis=0))

After batch normalization (test-time):
  means:  [0.06199073 0.10025696 0.06207077]
  stds:  [1.08709564 0.96084071 0.99678458]


### Batch Normalization: Backward Pass
Since batch normalization is realized by a more complex function of learnable parameters, it is a good exercise to train your backprop skills through this computational graph.

<div class="alert alert-info">
    <h3>Task: Implement</h3>
    <p>Open <code>exercise_code/layers.py</code> and implement the backward pass for Batch Normalization in the function <code> batchnorm_backward() </code>.
    </p>
    <p> To derive the backward pass you should write out the computation graph for batch normalization and backprop through each of the intermediate nodes. Some intermediates may have multiple outgoing branches; make sure to sum gradients across these branches in the backward pass. You can stay close to the forward pass implementation we have provided for you, i.e. go line by line backward.
    </p>
    <p> Once you have finished, run the following to numerically check your backward pass.
    </p>
</div>

In [11]:
# Gradient check batchnorm backward pass

N, D = 4, 5
x = 5 * np.random.randn(N, D) + 12
gamma = np.random.randn(D)
beta = np.random.randn(D)
dout = np.random.randn(N, D)

bn_param = {'mode': 'train'}

fx = lambda x: batchnorm_forward(x, gamma, beta, bn_param)[0]
# Pass the perturbed parameter through explicitly instead of relying on in-place modification
fg = lambda g: batchnorm_forward(x, g, beta, bn_param)[0]
fb = lambda b: batchnorm_forward(x, gamma, b, bn_param)[0]

dx_num = eval_numerical_gradient_array(fx, x, dout)
da_num = eval_numerical_gradient_array(fg, gamma, dout)
db_num = eval_numerical_gradient_array(fb, beta, dout)

_, cache = batchnorm_forward(x, gamma, beta, bn_param)
dx, dgamma, dbeta = batchnorm_backward(dout, cache)

print('dx error: ', rel_error(dx_num, dx))
print('dgamma error: ', rel_error(da_num, dgamma))
print('dbeta error: ', rel_error(db_num, dbeta))

dx error:  1.085192131267446e-08
dgamma error:  4.597795665096508e-12
dbeta error:  4.37335740671464e-12
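
The check above compares the analytic gradients with numerical ones. The helper `eval_numerical_gradient_array` lives in `exercise_code/tests` and is not shown in this notebook; the following is a sketch of the general idea using centered finite differences. The function name `numerical_gradient_array_sketch` and the step size `h` are assumptions, and the toy function $f(x) = x^2$ is used only because its gradient is known in closed form.

```python
import numpy as np

def numerical_gradient_array_sketch(f, x, dout, h=1e-5):
    """Centered-difference estimate of d(sum(f(x) * dout)) / dx (illustrative)."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h          # perturb one entry upwards
        pos = f(x).copy()
        x[idx] = old - h          # perturb the same entry downwards
        neg = f(x).copy()
        x[idx] = old              # restore the original value
        # Chain rule: weight the change in the output by the upstream gradient.
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5))
dout = rng.standard_normal((4, 5))

dx_num = numerical_gradient_array_sketch(lambda v: v ** 2, x, dout)
dx_analytic = 2 * x * dout        # known gradient of the elementwise square
max_abs_diff = np.max(np.abs(dx_num - dx_analytic))
print("max abs difference:", max_abs_diff)
```

A small difference between the two gradients suggests that the analytic backward pass is correct; a large one usually points to a missing branch or a wrong sign in the derivation.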


## 1.3 Using Batch Normalization with PyTorch

Now that we have seen the implementation of Batch Normalization, it is interesting to see how it would affect the overall Model Performance. Since you have already worked with PyTorch in the last exercise, you have seen how easy it makes our lives. As an experiment, we will use a simple Fully Connected Network in PyTorch here.
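
As a rough idea of what a fully connected network with Batch Normalization can look like in PyTorch, consider the sketch below. It is not the `BatchNormNetwork` from `exercise_code/BatchNormModel.py`; the class name `TinyBatchNormNet`, the layer sizes and the input shape are assumptions for illustration. It also shows that `model.train()` and `model.eval()` switch between batch statistics and running statistics.

```python
import torch
import torch.nn as nn

class TinyBatchNormNet(nn.Module):
    """Hypothetical two-layer classifier with one BatchNorm1d layer."""
    def __init__(self, input_dim=784, hidden_dim=200, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),   # normalizes each hidden feature over the batch
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        return self.net(x.flatten(start_dim=1))

torch.manual_seed(0)
model = TinyBatchNormNet()
bn = model.net[1]
x = torch.randn(50, 1, 28, 28)

# Training mode: batch statistics are used and the running averages are updated.
model.train()
before = bn.running_mean.clone()
_ = model(x)
updated = not torch.equal(before, bn.running_mean)

# Evaluation mode: running averages are used and stay fixed.
model.eval()
with torch.no_grad():
    before = bn.running_mean.clone()
    out_a = model(x)
    out_b = model(x[:1])          # a single sample works in eval mode
frozen = torch.equal(before, bn.running_mean)

print("running mean updated in train mode:", updated)
print("running mean frozen in eval mode:", frozen)
```

In evaluation mode the output for a sample no longer depends on which other samples share its batch, which is why `out_b` matches the first row of `out_a` up to floating-point error.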

### Setup TensorBoard
In exercise 07 you've already learned how to use TensorBoard. Let's use it again to make the debugging of our network and training process more convenient! Throughout this notebook, feel free to add further logs or visualizations to your TensorBoard!

In [12]:
# Few Hyperparameters before we start things off
hidden_dim = 200
batch_size = 50

logdir = './batch_norm_logs'
if os.path.exists(logdir):
    # We delete the logs on the first run
    shutil.rmtree(logdir)
os.mkdir(logdir)

epochs = 5
learning_rate = 0.00005

In [None]:
################# COLAB ONLY #################
# %load_ext tensorboard
# %tensorboard --logdir=./ --port 6006

# Use the cmd for less trouble, if you can. From the working directory, run: tensorboard --logdir=./ --port 6006

### Train a model without Batch Normalization. 

<div class="alert alert-success">
    <h3>Task: Check Code</h3>
    <p>Let us first start with a simple network which does not make use of Batch Normalization. We have already implemented a simple network <code>SimpleNetwork</code> in <code>exercise_code/BatchNormModel.py</code>. Feel free to check it out and play around with the parameters. The cell below sets up a short training process for this network.
 </p>
</div>

In [13]:
from tqdm import tqdm
def create_tqdm_bar(iterable, desc):
    return tqdm(enumerate(iterable),total=len(iterable), ncols=150, desc=desc)

def train_model(model, train_loader, val_loader, loss_func, tb_logger, epochs=10, name='Autoencoder'):
    
    optimizer = model.configure_optimizer()
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=epochs * len(train_loader) / 5, gamma=0.7)
    validation_loss = 0
    model = model.to(device)
    for epoch in range(epochs):
        
        # Train
        training_loop = create_tqdm_bar(train_loader, desc=f'Training Epoch [{epoch + 1}/{epochs}]')
        training_loss = 0
        for train_iteration, batch in training_loop:
            optimizer.zero_grad()
            loss = model.training_step(batch, loss_func)
            loss.backward()
            optimizer.step()
            scheduler.step()
            
            training_loss += loss.item()

            # Update the progress bar.
            training_loop.set_postfix(train_loss = "{:.8f}".format(training_loss / (train_iteration + 1)), val_loss = "{:.8f}".format(validation_loss))

            # Update the tensorboard logger.
            tb_logger.add_scalar(f'{name}/train_loss', loss.item(), epoch * len(train_loader) + train_iteration)

        # Validation
        val_loop = create_tqdm_bar(val_loader, desc=f'Validation Epoch [{epoch + 1}/{epochs}]')
        validation_loss = 0
        with torch.no_grad():
            for val_iteration, batch in val_loop:
                loss = model.validation_step(batch, loss_func) # You need to implement this function.
                validation_loss += loss.item()

                # Update the progress bar.
                val_loop.set_postfix(val_loss = "{:.8f}".format(validation_loss / (val_iteration + 1)))

                # Update the tensorboard logger.
                tb_logger.add_scalar(f'{name}/val_loss', validation_loss / (val_iteration + 1), epoch * len(val_loader) + val_iteration)
        # This value is for the progress bar of the training loop.
        validation_loss /= len(val_loader)

In [14]:
# Train a simple model without batch normalization.

model = SimpleNetwork(hidden_dim=hidden_dim, batch_size=batch_size, learning_rate=learning_rate).to(device)
path = os.path.join('logs', 'Bn_model_logs')
if os.path.exists(path):
    shutil.rmtree(path)
path = os.path.join(path, f'simple-model')
tb_logger = SummaryWriter(path)

# Train the classifier.
train_dl, val_dl, _ = model.prepare_data()

epochs = 5
loss_func = F.cross_entropy # Cross-entropy loss for the classification task.
train_model(model, train_dl, val_dl, loss_func, tb_logger, epochs=epochs, name='BatchNorm')


Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ../datasets/FashionMNIST/raw/train-images-idx3-ubyte.gz


100%|██████████| 26.4M/26.4M [00:00<00:00, 96.6MB/s]


Extracting ../datasets/FashionMNIST/raw/train-images-idx3-ubyte.gz to ../datasets/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ../datasets/FashionMNIST/raw/train-labels-idx1-ubyte.gz


100%|██████████| 29.5k/29.5k [00:00<00:00, 4.54MB/s]


Extracting ../datasets/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ../datasets/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ../datasets/FashionMNIST/raw/t10k-images-idx3-ubyte.gz


100%|██████████| 4.42M/4.42M [00:00<00:00, 49.3MB/s]


Extracting ../datasets/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ../datasets/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ../datasets/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz


100%|██████████| 5.15k/5.15k [00:00<00:00, 27.7MB/s]

Extracting ../datasets/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to ../datasets/FashionMNIST/raw




Training Epoch [1/5]: 100%|█████████████████████████████████████████████| 960/960 [00:15<00:00, 60.49it/s, train_loss=0.54225833, val_loss=0.00000000]
Validation Epoch [1/5]: 100%|█████████████████████████████████████████████████████████████████| 240/240 [00:02<00:00, 105.12it/s, val_loss=0.42230390]
Training Epoch [2/5]: 100%|█████████████████████████████████████████████| 960/960 [00:14<00:00, 67.58it/s, train_loss=0.39014810, val_loss=0.42230390]
Validation Epoch [2/5]: 100%|█████████████████████████████████████████████████████████████████| 240/240 [00:02<00:00, 119.19it/s, val_loss=0.39300138]
Training Epoch [3/5]: 100%|█████████████████████████████████████████████| 960/960 [00:11<00:00, 86.86it/s, train_loss=0.34735235, val_loss=0.39300138]
Validation Epoch [3/5]: 100%|██████████████████████████████████████████████████████████████████| 240/240 [00:02<00:00, 86.31it/s, val_loss=0.34910136]
Training Epoch [4/5]: 100%|█████████████████████████████████████████████| 960/960 [00:13<00:0

### Train a model incl. Batch Normalization

<div class="alert alert-success">
    <h3>Task: Check Code</h3>
    <p> Now that we have seen how our simple network works, let us look at a model that actually uses Batch Normalization. Again, we provide you with such a model <code>BatchNormNetwork</code> in <code>exercise_code/BatchNormModel.py</code>. Same as before: Feel free to check it out and play around with the parameters. The cell below sets up a short training process for this model.
 </p>
</div>

In [15]:
model = BatchNormNetwork(hidden_dim=hidden_dim, batch_size=batch_size, learning_rate=learning_rate)

path = os.path.join('logs', 'Bn_model_logs')
path = os.path.join(path, f'BN-model')
tb_logger = SummaryWriter(path)

# Train the classifier.
train_dl, val_dl, _ = model.prepare_data()

epochs = 5
loss_func = F.cross_entropy # Cross-entropy loss for the classification task.
train_model(model, train_dl, val_dl, loss_func, tb_logger, epochs=epochs, name='BatchNorm')

Training Epoch [1/5]: 100%|█████████████████████████████████████████████| 960/960 [00:14<00:00, 65.28it/s, train_loss=0.47301565, val_loss=0.00000000]
Validation Epoch [1/5]: 100%|█████████████████████████████████████████████████████████████████| 240/240 [00:01<00:00, 153.23it/s, val_loss=0.39101067]
Training Epoch [2/5]: 100%|█████████████████████████████████████████████| 960/960 [00:14<00:00, 68.29it/s, train_loss=0.35042817, val_loss=0.39101067]
Validation Epoch [2/5]: 100%|█████████████████████████████████████████████████████████████████| 240/240 [00:02<00:00, 106.98it/s, val_loss=0.37462278]
Training Epoch [3/5]: 100%|█████████████████████████████████████████████| 960/960 [00:14<00:00, 66.83it/s, train_loss=0.30778737, val_loss=0.37462278]
Validation Epoch [3/5]: 100%|██████████████████████████████████████████████████████████████████| 240/240 [00:02<00:00, 81.89it/s, val_loss=0.33263666]
Training Epoch [4/5]: 100%|█████████████████████████████████████████████| 960/960 [00:13<00:00

### Observations
Take a look at TensorBoard to compare the performance of both networks:

In [None]:
################# COLAB ONLY #################
# %load_ext tensorboard
# %tensorboard --logdir=./ --port 6006

# Use the cmd for less trouble, if you can. From the working directory, run: tensorboard --logdir=./ --port 6006

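As you can see, using Batch Normalization resulted in better performance. In the run above, for example, the validation loss after the third epoch is about 0.333 with Batch Normalization, compared with about 0.349 without it. The graphs show the same trend: lower validation loss and higher validation accuracy. In general, Batch Normalization is helpful because it tends to speed up model training.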

Batch Normalization has other related benefits; for instance, it provides a bit of regularization. However, there are methods designed specifically for regularization, such as Dropout. So in the second part of this notebook, let's have a more detailed look at the effect of Dropout. :)

# 2. Dropout

## 2.1 What is Dropout

Dropout [1] is a technique for regularizing neural networks by randomly setting some features to zero during the forward pass. During training, each neuron is set to zero with probability p (a hyperparameter) and kept active otherwise; at test time, all neurons are kept. Dropout helps your neural network perform better on test data.

We want to repeat the approach that we saw above for Batch Normalization, but this time for Dropout. Let us thus first have a look at the implementation and then compare two networks with each other where one is using Dropout and one is not. 

[1] Srivastava et al, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", 2014

<!-- In case the image does not show, uncomment the following:<img name="dropout" src="./img/dropout.jpg"> -->
<img name="dropout" src="https://i2dl.vc.in.tum.de/static/images/exercise_08/dropout.jpg">
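
The different behaviour of Dropout at training and test time can be observed with PyTorch's built-in `torch.nn.functional.dropout`. This is only a demonstration with a built-in function, not a solution for the NumPy implementation asked for below. Here `p` is the probability of zeroing an element, which is PyTorch's convention; the input of all ones is an arbitrary choice that makes the scaling easy to see.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.ones(500, 500)
p = 0.3  # probability of setting an element to zero

# Training mode: elements are dropped and the survivors are rescaled by 1 / (1 - p).
out_train = F.dropout(x, p=p, training=True)
# Test mode: the input passes through unchanged.
out_test = F.dropout(x, p=p, training=False)

frac_zero = (out_train == 0).float().mean().item()
kept_value = out_train.max().item()
train_mean = out_train.mean().item()

print("fraction set to zero (train):", frac_zero)
print("value of kept elements (train):", kept_value)
print("mean of train-time output:", train_mean)
print("test-time output equals input:", torch.equal(out_test, x))
```

Rescaling the kept elements during training keeps the expected value of each activation the same in both modes, so no extra scaling is needed at test time.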


## 2.2 Dropout Implementation

### Dropout: Forward Pass

Dropout is a little less complex to implement than Batch Normalization, hence we ask you to implement both the forward and the backward pass. Let us start with the forward pass:

<div class="alert alert-info">
    <h3>Task: Implement</h3>
    <p> In the file <code>exercise_code/layers.py</code>, implement the forward pass for Dropout in <code>dropout_forward()</code>. Since Dropout behaves differently during training and testing, make sure to implement the operation for both modes.
    </p>
    <p> Once you have done so, run the cell below to test your implementation.
    </p>
</div>

In [19]:
x = np.random.randn(500, 500) + 10
# Let us use different dropout values(p) for our dropout layer and see their effects
for p in [0.3, 0.6, 0.75]:
    out, _ = dropout_forward(x, {'mode': 'train', 'p': p})
    out_test, _ = dropout_forward(x, {'mode': 'test', 'p': p})

    print('Running tests with p = ', p)
    print('Mean of input: ', x.mean())
    print('Mean of train-time output: ', out.mean())
    print('Mean of test-time output: ', out_test.mean())
    print('Fraction of train-time output set to zero: ', (out == 0).mean())
    print('Fraction of test-time output set to zero: ', (out_test == 0).mean())
    print()

Running tests with p =  0.3
Mean of input:  10.00299235287822
Mean of train-time output:  10.010620653806221
Mean of test-time output:  10.00299235287822
Fraction of train-time output set to zero:  0.299328
Fraction of test-time output set to zero:  0.0

Running tests with p =  0.6
Mean of input:  10.00299235287822
Mean of train-time output:  9.997353506875358
Mean of test-time output:  10.00299235287822
Fraction of train-time output set to zero:  0.6003
Fraction of test-time output set to zero:  0.0

Running tests with p =  0.75
Mean of input:  10.00299235287822
Mean of train-time output:  10.02131711890852
Mean of test-time output:  10.00299235287822
Fraction of train-time output set to zero:  0.749544
Fraction of test-time output set to zero:  0.0



### Dropout: Backward Pass

<div class="alert alert-info">
    <h3>Task: Implement</h3>
    <p> In the file <code>exercise_code/layers.py</code>, implement the backward pass for dropout in <code>dropout_backward()</code>. After doing so, run the following cell to numerically gradient-check your implementation.
    </p>
</div>

In [20]:
x = np.random.randn(10, 10) + 10
dout = np.random.randn(*x.shape)

dropout_param = {'mode': 'train', 'p': 0.8, 'seed': 123}
out, cache = dropout_forward(x, dropout_param)
dx = dropout_backward(dout, cache)
dx_num = eval_numerical_gradient_array(lambda xx: dropout_forward(xx, dropout_param)[0], x, dout)

print('dx relative error: ', rel_error(dx, dx_num))

dx relative error:  1.8929033151985206e-11


## 2.3 Using Dropout with PyTorch

Same experiment as for Batch Normalization: We will train a pair of two-layer networks on a training dataset, where one network uses no Dropout and the other uses a Dropout probability of 0.2 (the `dropout_p` argument in the cell further below). We will then compare the training and validation curves of the two networks over time.

### Setup TensorBoard

In exercise 07 you've already learned how to use TensorBoard. Let's use it again to make the debugging of our network and training process more convenient! Throughout this notebook, feel free to add further logs or visualizations to your TensorBoard!

In [21]:
# Few Hyperparameters before we start things off
hidden_dim = 200
batch_size = 50

epochs = 5
learning_rate = 0.00005

logdir = './logs/dropout_logs'
if os.path.exists(logdir):
    # We delete the logs on the first run
    shutil.rmtree(logdir)
os.makedirs(logdir, exist_ok=True)  # also creates the parent './logs' folder if it is missing

In [24]:
################# COLAB ONLY #################
# %load_ext tensorboard
# %tensorboard --logdir=./ --port 6006

# Use the cmd for less trouble if you can.



<div class="alert alert-success">
    <h3>Task: Check Code</h3>
    <p> As before, we have already implemented those two networks for you. You may check them out in <code>exercise_code/BatchNormModel.py</code>. As always, feel free to play around with the parameters here. Run the following two cells to set up both models and train them for a few epochs in order to compare the performance with and without Dropout.
 </p>
</div>

### Train a model without Dropout

Let us first start with a simple network `SimpleNetwork` which does not make use of Dropout.

In [25]:
# train a model without Dropout
model = SimpleNetwork(hidden_dim=hidden_dim, batch_size=batch_size, learning_rate=learning_rate)
path = os.path.join('logs', 'Dropout_model_logs')
if os.path.exists(path):
    shutil.rmtree(path)
path = os.path.join(path, f'Simple-model')
tb_logger = SummaryWriter(path)

# Train the classifier.
train_dl, val_dl, _ = model.prepare_data()

epochs = 5
loss_func = F.cross_entropy # Cross-entropy loss for the classification task.
train_model(model, train_dl, val_dl, loss_func, tb_logger, epochs=epochs, name='Dropout')

Training Epoch [1/5]: 100%|█████████████████████████████████████████████| 960/960 [00:13<00:00, 71.01it/s, train_loss=0.54338337, val_loss=0.00000000]
Validation Epoch [1/5]: 100%|█████████████████████████████████████████████████████████████████| 240/240 [00:01<00:00, 168.54it/s, val_loss=0.42335227]
Training Epoch [2/5]: 100%|█████████████████████████████████████████████| 960/960 [00:12<00:00, 79.08it/s, train_loss=0.39164129, val_loss=0.42335227]
Validation Epoch [2/5]: 100%|█████████████████████████████████████████████████████████████████| 240/240 [00:01<00:00, 149.18it/s, val_loss=0.39433253]
Training Epoch [3/5]: 100%|█████████████████████████████████████████████| 960/960 [00:13<00:00, 68.94it/s, train_loss=0.34860347, val_loss=0.39433253]
Validation Epoch [3/5]: 100%|█████████████████████████████████████████████████████████████████| 240/240 [00:01<00:00, 127.07it/s, val_loss=0.35084208]
Training Epoch [4/5]: 100%|█████████████████████████████████████████████| 960/960 [00:11<00:00

### Train a model with Dropout

Now that we have already seen how our simple network should work, let us look at the model `DropoutNetwork` that is actually using Dropout.

In [26]:
# train a model with Dropout
model = DropoutNetwork(hidden_dim=hidden_dim, batch_size=batch_size, learning_rate=learning_rate,dropout_p=0.2)
path = os.path.join('logs', 'Dropout_model_logs')
path = os.path.join(path, f'Dropout-model')
tb_logger = SummaryWriter(path)

# Train the classifier.
train_dl, val_dl, _ = model.prepare_data()

epochs = 5
loss_func = F.cross_entropy # Cross-entropy loss for the classification task.
train_model(model, train_dl, val_dl, loss_func, tb_logger, epochs=epochs, name='Dropout')

Training Epoch [1/5]: 100%|█████████████████████████████████████████████| 960/960 [00:12<00:00, 74.16it/s, train_loss=0.49560051, val_loss=0.00000000]
Validation Epoch [1/5]: 100%|█████████████████████████████████████████████████████████████████| 240/240 [00:01<00:00, 152.64it/s, val_loss=0.41237447]
Training Epoch [2/5]: 100%|█████████████████████████████████████████████| 960/960 [00:12<00:00, 78.91it/s, train_loss=0.37492261, val_loss=0.41237447]
Validation Epoch [2/5]: 100%|██████████████████████████████████████████████████████████████████| 240/240 [00:02<00:00, 89.07it/s, val_loss=0.39313713]
Training Epoch [3/5]: 100%|█████████████████████████████████████████████| 960/960 [00:13<00:00, 70.64it/s, train_loss=0.33560108, val_loss=0.39313713]
Validation Epoch [3/5]: 100%|█████████████████████████████████████████████████████████████████| 240/240 [00:01<00:00, 158.32it/s, val_loss=0.35334592]
Training Epoch [4/5]: 100%|█████████████████████████████████████████████| 960/960 [00:12<00:00

### Observations

Take a look at TensorBoard to compare the performance of both networks:

In [None]:
%load_ext tensorboard
%tensorboard --logdir logs/Dropout_model_logs

With Dropout, the training loss may be higher than without it, since part of the network is switched off in every training step; in exchange, the model tends to perform better on the validation set. In a short run like this one, the difference between the two networks can be small. Like Batch Normalization, Dropout behaves differently during training and testing.