<a href="https://colab.research.google.com/github/NeuromatchAcademy/course-content-dl/blob/main/tutorials/W1D5_Regularization/student/W1D5_Tutorial1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 1: Regularization techniques part 1
**Week 1, Day 5: Regularization**

**By Neuromatch Academy**

__Content creators:__ Ravi Teja Konkimalla, Mohitrajhu Lingan Kumaraian, Kevin Machado Gamboa, Kelson Shilling-Scrivo, Lyle Ungar

__Content reviewers:__ Piyush Chauhan, Kelson Shilling-Scrivo

__Content editors:__ Roberto Guidotti, Spiros Chavlis

__Production editors:__ Saeed Salehi, Spiros Chavlis

**Our 2021 Sponsors, including Presenting Sponsor Facebook Reality Labs**

<p align='center'><img src='https://github.com/NeuromatchAcademy/widgets/blob/master/sponsors.png?raw=True'/></p>

---
# Tutorial Objectives

1. Big ANNs are efficient universal approximators due to their adaptive basis functions
2. ANNs memorize some but generalize well
3. Regularization as shrinkage of overparameterized models: early stopping 

In [None]:
#@markdown Tutorial slides

#@markdown
# you should link the slides for all tutorial videos here (we will store pdfs on osf)

from IPython.display import HTML
HTML('<iframe src="https://docs.google.com/presentation/d/1N9aguIPiBSjzo0ToqPi5uwwG8ChY6_E3/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>')

---
# Setup
Note that some of the code for today can take up to an hour to run. We have therefore "hidden" the code and shown the resulting outputs.


In [None]:
# Imports
import time
import copy
import torch
import pathlib
import random

import numpy as np
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt
import matplotlib.animation as animation

from tqdm.auto import tqdm
from IPython.display import HTML, display
from torchvision import datasets, transforms
from torchvision.datasets import ImageFolder

In [None]:
# @title Figure Settings
import ipywidgets as widgets
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/content-creation/main/nma.mplstyle")

In [None]:
# @title Loading Animal Faces data
%%capture
!rm -r AnimalFaces32x32/
!git clone https://github.com/arashash/AnimalFaces32x32
!rm -r afhq/
!unzip ./AnimalFaces32x32/afhq_32x32.zip

In [None]:
# @title Loading Animal Faces Randomized data
%%capture
!rm -r Animal_faces_random/
!git clone https://github.com/Ravi3191/Animal_faces_random.git
!rm -r afhq_random_32x32/
!unzip ./Animal_faces_random/afhq_random_32x32.zip
!rm -r afhq_10_32x32/
!unzip ./Animal_faces_random/afhq_10_32x32.zip

In [None]:
# @title Plotting functions
def imshow(img):
  img = img / 2 + 0.5     # unnormalize
  npimg = img.numpy()
  plt.imshow(np.transpose(npimg, (1, 2, 0)))
  plt.axis(False)
  plt.show()


def plot_weights(norm, labels, ws, title='Weight Size Measurement'):
  plt.figure(figsize=[8, 6])
  plt.title(title)
  plt.ylabel('Frobenius Norm Value')
  plt.xlabel('Model Layers')
  plt.bar(labels, ws)
  plt.axhline(y=norm,
              linewidth=1,
              color='r',
              ls='--',
              label='Total Model F-Norm')
  plt.legend()
  plt.show()


def early_stop_plot(train_acc_earlystop, val_acc_earlystop, best_epoch):
  plt.figure(figsize=(8, 6))
  plt.plot(val_acc_earlystop,label='Val - Early',c='red',ls = 'dashed')
  plt.plot(train_acc_earlystop,label='Train - Early',c='red',ls = 'solid')
  plt.axvline(x=best_epoch, c='green', ls='dashed',
              label='Epoch for Max Val Accuracy')
  plt.title('Early Stopping')
  plt.ylabel('Accuracy (%)')
  plt.xlabel('Epoch')
  plt.legend()
  plt.show()

In [None]:
#@title Seeding for Reproducibility
def set_seed(seed=None, seed_torch=True):
  if seed is None:
    seed = np.random.choice(2 ** 32)
  random.seed(seed)
  np.random.seed(seed)
  if seed_torch:
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

  print(f'Random seed {seed} has been set.')


# In case that `DataLoader` is used
def seed_worker(worker_id):
  worker_seed = torch.initial_seed() % 2**32
  np.random.seed(worker_seed)
  random.seed(worker_seed)


set_seed(seed=90108, seed_torch=False)

In [None]:
#@title Set device (GPU or CPU). Execute `set_device()`

# inform the user if the notebook uses GPU or CPU.

def set_device():
  device = "cuda" if torch.cuda.is_available() else "cpu"
  if device != "cuda":
    print("WARNING: For this notebook to perform best, "
        "if possible, in the menu under `Runtime` -> "
        "`Change runtime type.`  select `GPU` ")
  else:
    print("GPU is enabled in this notebook.")

  return device


set_device()

**Ensure you're running a GPU notebook:**
From "Runtime" in the drop-down menu above, click "Change runtime type". Ensure that "Hardware Accelerator" says "GPU".

**Ensure you can save:** From "File", click "Save a copy in Drive"

In [None]:
# Seed parameter
# Note that if you change this value, some results may not be identical
# to the solutions
SEED = 2021

Let's start the tutorial by defining some functions which we will use frequently today, such as: `AnimalNet`, `train`, `test` and `main`.

In [None]:
# Network Class - Animal Faces
class AnimalNet(nn.Module):
  def __init__(self):
    super(AnimalNet, self).__init__()
    self.fc1 = nn.Linear(3 * 32 * 32, 128)
    self.fc2 = nn.Linear(128, 32)
    self.fc3 = nn.Linear(32, 3)

  def forward(self, x):
    x = x.view(x.shape[0],-1)
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)
    output = F.log_softmax(x, dim=1)
    return output

The train function takes in the current model, along with the train_loader and loss function, and updates the parameters for a single pass of the entire dataset. The test function takes in the current model after every epoch and calculates the accuracy on the test dataset.


In [None]:
def train(args, model, device, train_loader, optimizer,
          reg_function1=None, reg_function2=None, criterion=F.nll_loss):
  """
  Trains the current input model using the data
  from train_loader and updates parameters for a single pass
  """
  model.train()
  for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = model(data)
    if reg_function1 is None:
      loss = criterion(output, target)
    elif reg_function2 is None:
      loss = criterion(output, target)+args['lambda']*reg_function1(model)
    else:
      loss = criterion(output, target) + args['lambda1']*reg_function1(model) + args['lambda2']*reg_function2(model)
    loss.backward()
    optimizer.step()


def test(model, device, test_loader, criterion=F.nll_loss):
  """
  Tests the current Model
  """
  model.eval()
  test_loss = 0
  correct = 0
  with torch.no_grad():
    for data, target in test_loader:
      data, target = data.to(device), target.to(device)
      output = model(data)
      test_loss += criterion(output, target, reduction='sum').item()  # sum up batch loss
      pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
      correct += pred.eq(target.view_as(pred)).sum().item()

  test_loss /= len(test_loader.dataset)
  return 100. * correct / len(test_loader.dataset)


def main(args, model, train_loader, val_loader,
         reg_function1=None, reg_function2=None):
  """
  Trains the model with train_loader and tests the learned model using val_loader
  """

  use_cuda = not args['no_cuda'] and torch.cuda.is_available()
  device = set_device()

  model = model.to(device)
  optimizer = optim.SGD(model.parameters(), lr=args['lr'],
                        momentum=args['momentum'])

  val_acc_list, train_acc_list,param_norm_list = [], [], []
  for epoch in tqdm(range(args['epochs'])):
    train(args, model, device, train_loader, optimizer,
          reg_function1=reg_function1, reg_function2=reg_function2)
    train_acc = test(model,device,train_loader)
    val_acc = test(model,device,val_loader)
    param_norm = calculate_frobenius_norm(model)
    train_acc_list.append(train_acc)
    val_acc_list.append(val_acc)
    param_norm_list.append(param_norm)

  return val_acc_list, train_acc_list, param_norm_list, model, 0
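
The `reg_function1` and `reg_function2` arguments of `train` are called with the model, and their return value is scaled by `args['lambda']` (or `args['lambda1']` and `args['lambda2']`) and added to the loss. A minimal sketch of that composition follows. The `squared_weight_penalty` regularizer, the toy model, the random data, and the `'lambda'` value are all assumptions made for illustration and are not part of the tutorial.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def squared_weight_penalty(model):
  """Hypothetical regularizer: sum of squared parameters as a scalar tensor."""
  return sum((p ** 2).sum() for p in model.parameters())


torch.manual_seed(0)
# Toy stand-ins for the real model and data (assumptions for this sketch)
toy_model = nn.Sequential(nn.Linear(4, 3), nn.LogSoftmax(dim=1))
data = torch.randn(8, 4)
target = torch.randint(0, 3, (8,))
args = {'lambda': 1e-3}  # assumed value, chosen arbitrarily

output = toy_model(data)
data_loss = F.nll_loss(output, target)        # same default criterion as `train`
penalty = squared_weight_penalty(toy_model)   # plays the role of reg_function1(model)
loss = data_loss + args['lambda'] * penalty   # mirrors the `elif` branch of `train`
loss.backward()

print(f'data loss: {data_loss.item():.4f}, penalty: {penalty.item():.4f}')
```

Because the penalty is built from the parameters themselves, `loss.backward()` sends gradients through both terms, so the optimizer step trades off fitting the data against keeping the weights small.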

---
# Section 1: Regularization is Shrinkage

In [None]:
#@title Video 1: Introduction to Regularization
from ipywidgets import widgets

out2 = widgets.Output()
with out2:
  from IPython.display import IFrame
  class BiliVideo(IFrame):
      def __init__(self, id, page=1, width=400, height=300, **kwargs):
          self.id=id
          src = "https://player.bilibili.com/player.html?bvid={0}&page={1}".format(id, page)
          super(BiliVideo, self).__init__(src, width, height, **kwargs)

  video = BiliVideo(id=f"", width=854, height=480, fs=1)
  print("Video available at https://www.bilibili.com/video/{0}".format(video.id))
  display(video)

out1 = widgets.Output()
with out1:
  from IPython.display import YouTubeVideo
  video = YouTubeVideo(id=f"bc1nsP4htVg", width=854, height=480, fs=1, rel=0)
  print("Video available at https://youtube.com/watch?v=" + video.id)
  display(video)

out = widgets.Tab([out1, out2])
out.set_title(0, 'Youtube')
out.set_title(1, 'Bilibili')

display(out)

A key idea of neural nets is that they use models that are "too complex": complex enough to fit all the noise in the data. One then needs to "regularize" them so that the fitted model is complex enough, but not too complex. The more complex the model, the better it fits the training data; but if it is too complex, it generalizes less well: it memorizes the training data and is less accurate on future test data.

In [None]:
#@title Video 2: Regularization as Shrinkage
from ipywidgets import widgets

out2 = widgets.Output()
with out2:
  from IPython.display import IFrame
  class BiliVideo(IFrame):
      def __init__(self, id, page=1, width=400, height=300, **kwargs):
          self.id=id
          src = "https://player.bilibili.com/player.html?bvid={0}&page={1}".format(id, page)
          super(BiliVideo, self).__init__(src, width, height, **kwargs)

  video = BiliVideo(id=f"", width=854, height=480, fs=1)
  print("Video available at https://www.bilibili.com/video/{0}".format(video.id))
  display(video)

out1 = widgets.Output()
with out1:
  from IPython.display import YouTubeVideo
  video = YouTubeVideo(id=f"B4CsCKViB3k", width=854, height=480, fs=1, rel=0)
  print("Video available at https://youtube.com/watch?v=" + video.id)
  display(video)

out = widgets.Tab([out1, out2])
out.set_title(0, 'Youtube')
out.set_title(1, 'Bilibili')

display(out)

One way to think about regularization is in terms of the overall magnitude of the model's weights. A model with big weights can fit more data perfectly, whereas a model with smaller weights tends to underperform on the train set but can, surprisingly, do very well on the test set. Weights that are too small can also be an issue, since the model can then underfit.

This week we use the Frobenius norm taken over all the parameter tensors in the model (the square root of the sum of squares of every parameter) as a measure of the "size of the model".

## Coding Exercise 1: Frobenius Norm
Before we start, let's define the Frobenius norm, sometimes also called the Euclidean norm, of an $m \times n$ matrix $A$ as the square root of the sum of the absolute squares of its elements.
\begin{equation}
||A||_F= \sqrt{\sum_{i=1}^m\sum_{j=1}^n|a_{ij}|^2}
\end{equation} 

This is just a measure of how big the matrix is, analogous to how big a vector is.
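
As a small worked example of the formula, the sketch below computes the norm of a 2×2 matrix by hand and checks that it equals the Euclidean length of the same entries laid out as a vector. It deliberately uses plain Python lists rather than model parameters; the helper name `frobenius_norm` and the matrix values are made up for illustration.

```python
import math


def frobenius_norm(matrix):
  """Square root of the sum of squared entries of a list-of-lists matrix."""
  return math.sqrt(sum(abs(a) ** 2 for row in matrix for a in row))


A = [[1.0, 2.0],
     [3.0, 4.0]]

# Flatten the matrix and take the ordinary Euclidean length of the vector
flat = [a for row in A for a in row]
vector_length = math.sqrt(sum(a * a for a in flat))

norm_A = frobenius_norm(A)
print(norm_A)         # sqrt(1 + 4 + 9 + 16) = sqrt(30), about 5.477
print(vector_length)  # same value
```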

 **Hint:** Use functions `model.parameters()` or `model.named_parameters()`


In [None]:
def calculate_frobenius_norm(model):
  ####################################################################
  # Fill in all missing code below (...),
  # then remove or comment the line below to test your function
  raise NotImplementedError("Define `calculate_frobenius_norm` function")
  ####################################################################
  norm = 0.0
  # Sum the square of all parameters
  for param in model.parameters():
    norm += ...

  # Take a square root of the sum of squares of all the parameters
  norm = ...
  return norm


# Seed added for reproducibility
set_seed(seed=SEED)

## uncomment below to test your code
# net = nn.Linear(10, 1)
# print(f'Frobenius Norm of Single Linear Layer: {calculate_frobenius_norm(net)}')

[*Click for solution*](https://github.com/NeuromatchAcademy/course-content-dl/tree/main//tutorials/W1D5_Regularization/solutions/W1D5_Tutorial1_Solution_bcc74b44.py)



```
Random seed 2021 has been set.
Frobenius Norm of Single Linear Layer: 0.6572162508964539
```

Apart from calculating the weight size for an entire model, we could also determine the weight size in every layer. For this, we can modify our `calculate_frobenius_norm` function as shown below. 

**Have a look at how it works!**

In [None]:
# Frobenius Norm per Layer
def calculate_frobenius_norm(model):

  # initialization of variables
  norm, ws, labels = 0.0, [], []

  # Sum the squares of all the parameters
  for name, parameters in model.named_parameters():
    p = torch.sum(parameters**2)
    norm += p

    ws.append((p**0.5).cpu().detach().numpy())
    labels.append(name)

  # Take a square root of the sum of squares of all the parameters
  norm = (norm**0.5).cpu().detach().numpy()

  return norm, ws, labels


set_seed(SEED)
net = nn.Linear(10,1)
norm, ws, labels = calculate_frobenius_norm(net)
print(f'Frobenius Norm of Single Linear Layer: {norm:.4f}')
# Plots the weights
plot_weights(norm, labels, ws)

Using the latest version of `calculate_frobenius_norm`, we can also obtain the Frobenius norm per layer for a whole NN model and use the `plot_weights` function to visualize them.

In [None]:
# Creates a new model
model = AnimalNet()

# Calculates the Frobenius norm per layer
norm, ws, labels = calculate_frobenius_norm(model)
print(f'Frobenius Norm of Models weights: {norm:.4f}')

# Plots the weights
plot_weights(norm, labels, ws)
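
When reading these plots, note how the per-layer values returned by `calculate_frobenius_norm` relate to the total shown as the dashed line: the total is the square root of the sum of the squared per-layer norms, not their plain sum. A small sketch with made-up per-layer values (the names and numbers below are hypothetical, chosen so the arithmetic is easy to check):

```python
import math

# Hypothetical per-layer Frobenius norms, not taken from any trained model
layer_norms = {'fc1.weight': 3.0, 'fc1.bias': 4.0, 'fc2.weight': 12.0}

# Total norm: square root of the sum of squared per-layer norms
total_norm = math.sqrt(sum(v ** 2 for v in layer_norms.values()))

# Plain sum of the bars, shown only for comparison
plain_sum = sum(layer_norms.values())

print(total_norm)  # sqrt(9 + 16 + 144) = 13.0
print(plain_sum)   # 19.0
```

This is why the dashed line can sit well below the combined height of the bars, and why the largest layer dominates the total.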

---
# Section 2: Overfitting


In [None]:
#@title Video 3: Overfitting
from ipywidgets import widgets

out2 = widgets.Output()
with out2:
  from IPython.display import IFrame
  class BiliVideo(IFrame):
      def __init__(self, id, page=1, width=400, height=300, **kwargs):
          self.id=id
          src = "https://player.bilibili.com/player.html?bvid={0}&page={1}".format(id, page)
          super(BiliVideo, self).__init__(src, width, height, **kwargs)

  video = BiliVideo(id=f"", width=854, height=480, fs=1)
  print("Video available at https://www.bilibili.com/video/{0}".format(video.id))
  display(video)

out1 = widgets.Output()
with out1:
  from IPython.display import YouTubeVideo
  video = YouTubeVideo(id=f"RlaGyRKP2nY", width=854, height=480, fs=1, rel=0)
  print("Video available at https://youtube.com/watch?v=" + video.id)
  display(video)

out = widgets.Tab([out1, out2])
out.set_title(0, 'Youtube')
out.set_title(1, 'Bilibili')

display(out)

## Section 2.1: Visualizing Overfitting



Let's create a synthetic dataset that we will use to illustrate overfitting in neural networks.

In [None]:
# creating train data
# input
X = torch.rand((10, 1))
# output
Y = 2*X + 2*torch.empty((X.shape[0], 1)).normal_(mean=0, std=1)  # adding Gaussian noise to the data

# visualizing train data
plt.figure(figsize=(8, 6))
plt.scatter(X.numpy(),Y.numpy())
plt.xlabel('input (x)')
plt.ylabel('output(y)')
plt.title('toy dataset')
plt.show()

#creating test dataset
X_test = torch.linspace(0, 1, 40)
X_test = X_test.reshape((40, 1, 1))

Let's create an overparametrized Neural Network that can fit on the dataset that we just created and train it. 

First, let's build the model architecture:

In [None]:
# Network Class - 2D
class Net(nn.Module):
  def __init__(self):
    super(Net, self).__init__()

    self.fc1 = nn.Linear(1, 300)
    self.fc2 = nn.Linear(300, 500)
    self.fc3 = nn.Linear(500, 1)

  def forward(self, x):
    x = F.leaky_relu(self.fc1(x))
    x = F.leaky_relu(self.fc2(x))
    output = self.fc3(x)
    return output

Next, let's define the different parameters for training our model:


In [None]:
# train the network on toy dataset
model = Net()

criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr = 1e-4)

iters = 0
# Calculates frobenius before training
normi, wsi, label = calculate_frobenius_norm(model)

At this point, we can now train our model.

In [None]:
# initializing variables

# losses
train_loss = []
test_loss = []
# model norm
model_norm = []
# Initializing variables to store weights
norm_per_layer = []

max_epochs = 10000

running_predictions = np.empty((40, int(max_epochs / 500 + 1)))

for epoch in tqdm(range(max_epochs)):
  # frobenius norm per epoch
  norm, pl, layer_names = calculate_frobenius_norm(model)

  # training
  model_norm.append(norm)
  norm_per_layer.append(pl)
  model.train()
  optimizer.zero_grad()
  predictions = model(X)
  loss = criterion(predictions, Y)
  loss.backward()
  optimizer.step()

  train_loss.append(loss.item())  # store a Python float rather than a tensor
  model.eval()
  Y_test = model(X_test)
  loss = criterion(Y_test, 2*X_test)
  test_loss.append(loss.item())

  if (epoch % 500 == 0 or epoch == max_epochs - 1):
    running_predictions[:, iters] = Y_test[:, 0, 0].detach().numpy()
    iters += 1

Now that we have finished training, let's see how the model has evolved over the training process.

In [None]:
#@title Animation (Run Me!)

# create a figure and axes
fig = plt.figure(figsize=(14, 5))
ax1 = plt.subplot(121)
ax2 = plt.subplot(122)
# organizing subplots
plot1, = ax1.plot([],[])
plot2 = ax2.bar([], [])


def frame(i):
  ax1.clear()
  title1 = ax1.set_title('')
  ax1.set_xlabel("Input(x)")
  ax1.set_ylabel("Output(y)")

  ax2.clear()
  ax2.set_xlabel('Layer names')
  ax2.set_ylabel('Frobenius norm')
  title2 = ax2.set_title('Weight Measurement: Frobenius Norm')

  ax1.scatter(X.numpy(),Y.numpy())
  plot1 = ax1.plot(X_test[:,0,:].detach().numpy(), running_predictions[:,i])
  title1.set_text(f'Epochs: {i * 500}')
  plot2 = ax2.bar(label, norm_per_layer[i*500])
  plt.axhline(y=model_norm[i*500], linewidth=1,
              color='r', ls='--',
              label=f'Norm: {model_norm[i*500]:.2f}')
  plt.legend()

  return plot1, plot2


anim = animation.FuncAnimation(fig, frame, frames=range(20),
                               blit=False, repeat=False, repeat_delay=10000)
html_anim = HTML(anim.to_html5_video());
plt.close()
display(html_anim)

In [None]:
#@title Plot the train and test losses
plt.figure(figsize=(8, 6))
plt.plot(train_loss,label='train_loss')
plt.plot(test_loss,label='test_loss')
plt.ylabel('loss')
plt.xlabel('epochs')
plt.title('loss vs epoch')
plt.legend()
plt.show()
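
Besides reading the curves by eye, the epoch at which each loss is lowest can be located numerically. A minimal sketch, assuming per-epoch losses stored as lists of floats; the `toy_` values below are made up and are not the output of the training run above:

```python
# Made-up loss curves: train loss keeps falling, test loss turns back up
toy_train_loss = [4.0, 2.5, 1.6, 1.0, 0.6, 0.3, 0.1]
toy_test_loss = [3.8, 2.6, 1.9, 1.7, 1.9, 2.4, 3.0]


def argmin(values):
  """Index of the smallest element in a list."""
  return min(range(len(values)), key=values.__getitem__)


best_train_epoch = argmin(toy_train_loss)
best_test_epoch = argmin(toy_test_loss)

# Generalization gap per epoch (test loss minus train loss)
gap = [te - tr for tr, te in zip(toy_train_loss, toy_test_loss)]

print(best_train_epoch)  # 6, the last epoch
print(best_test_epoch)   # 3, well before training ends
```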

### Think! 2.1: Interpreting losses

Regarding the train and test graph above, discuss among yourselves:

*   What trend do you see in the train and test losses? (Where do you see the minimum of each loss?)
*   What does it tell us about the model we trained?

  


[*Click for solution*](https://github.com/NeuromatchAcademy/course-content-dl/tree/main//tutorials/W1D5_Regularization/solutions/W1D5_Tutorial1_Solution_141a7a50.py)



Now let's visualize the Frobenius norm of the model over the course of training. You should see that the magnitude of the weights increases over the epochs.

In [None]:
#@title Frobenius norm of the model
plt.figure(figsize=(8, 6))
plt.plot(model_norm)
plt.ylabel('norm of the model')
plt.xlabel('epochs')
plt.title('Size of the model vs Epochs')  # Change title to Frobenius norm of the model
plt.show()

Finally, you can compare the Frobenius norm per layer in the model, before and after training.

In [None]:
#@title Frobenius norm per layer before and after training
normf, wsf, label = calculate_frobenius_norm(model)

plot_weights(float(normi), label, wsi, title='Weight Size Before Training')
plot_weights(float(normf), label, wsf, title='Weight Size After Training')

## Section 2.2: Overfitting on Test Dataset


In principle we should not touch our test set until after we have chosen all our hyperparameters. Were we to use the test data in the model selection process, there is a risk that we might overfit the test data. Then we would be in serious trouble. If we overfit our training data, there is always the evaluation on test data to keep us honest. But if we overfit the test data, how would we ever know?

Note that there is another kind of overfitting: you do "honest" fitting on one set of images or posts, or medical records, but it may not generalize to other sets of images, posts or medical records.


#### Validation Dataset
A common practice to address this problem is to split our data three ways, using a validation dataset (or validation set) to tune the hyperparameters. Ideally, we would touch the test data only once, to assess the very best model or to compare a small number of models to each other. In practice, real-world test data is seldom discarded after just one use.
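
A three-way split can be sketched on plain indices. The helper `three_way_split`, the dataset size, and the split sizes below are assumptions for illustration; later in this tutorial the actual split is done with `torch.utils.data.random_split`.

```python
import random


def three_way_split(n_examples, n_train, n_val, seed=0):
  """Shuffle indices once and cut them into train / validation / test parts."""
  indices = list(range(n_examples))
  random.Random(seed).shuffle(indices)  # seeded for reproducibility
  train_idx = indices[:n_train]
  val_idx = indices[n_train:n_train + n_val]
  test_idx = indices[n_train + n_val:]  # everything that is left
  return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(1000, 700, 150, seed=2021)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 150 150
```

Hyperparameters would be tuned against `val_idx` only, with `test_idx` held back for the final assessment.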



---
# Section 3: Memorization


Given sufficiently large networks and enough training, Neural Networks can achieve almost 100% train accuracy.

In this section we train three MLPs; one each on:


1.   Animal Faces Dataset
2.   A Completely Noisy Dataset (Random Shuffling of all labels)
3.   A partially Noisy Dataset (Random Shuffling of 15% labels)
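
The noisy datasets used here are downloaded ready-made, so the snippet below is only an illustration of what "shuffling a fraction of the labels" can mean. The helper `shuffle_label_fraction` and the toy label list are hypothetical and say nothing about how the downloaded files were actually produced.

```python
import random


def shuffle_label_fraction(labels, fraction, seed=0):
  """Permute the labels among a randomly chosen fraction of the positions."""
  rng = random.Random(seed)
  noisy = list(labels)
  n_noisy = int(round(fraction * len(noisy)))
  chosen = rng.sample(range(len(noisy)), n_noisy)  # positions to disturb
  permuted = [noisy[i] for i in chosen]
  rng.shuffle(permuted)
  for i, lab in zip(chosen, permuted):
    noisy[i] = lab
  return noisy


toy_labels = ['cat', 'dog', 'wild'] * 40               # 120 toy labels
partly = shuffle_label_fraction(toy_labels, 0.15)      # 18 positions disturbed
fully = shuffle_label_fraction(toy_labels, 1.0)        # every position disturbed

n_changed = sum(a != b for a, b in zip(toy_labels, partly))
print(n_changed)  # at most 18; some shuffled positions receive their old label
```

Note that shuffling preserves the class counts, so even the fully shuffled set has the same number of each label as the original.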


Now, think for a couple of minutes as to what the train and test accuracies of each of these models might be, given that you train for sufficient time and use a powerful network.

First, let's create the required dataloaders for all three datasets. Notice how we split the data. We train on a fraction of the dataset as it will be faster to train and will overfit more clearly.

In [None]:
# Dataloaders for the Dataset
batch_size = 128
classes = ('cat', 'dog', 'wild')

# defining number of examples for train, val, and test
len_train, len_val, len_test = 100, 100, 14430

train_transform = transforms.Compose([
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
     ])
data_path = pathlib.Path('.')/'afhq'  # using pathlib to be compatible with all OS's
img_dataset = ImageFolder(data_path/'train', transform=train_transform)

# For reproducibility
g = torch.Generator()
g.manual_seed(SEED)

In [None]:
# Dataloaders for the Original Dataset

img_train_data, img_val_data,_ = torch.utils.data.random_split(img_dataset,
                                                               [len_train,
                                                                len_val,
                                                                len_test])

# Creating train_loader and Val_loader
train_loader = torch.utils.data.DataLoader(img_train_data,
                                           batch_size=batch_size,
                                           worker_init_fn=seed_worker,
                                           generator=g)

val_loader = torch.utils.data.DataLoader(img_val_data,
                                         batch_size=1000,
                                         worker_init_fn=seed_worker,
                                         generator=g)

In [None]:
# Dataloaders for the Random Dataset

# splitting randomized data into training and validation data
data_path = pathlib.Path('.')/'afhq_random_32x32/afhq_random' # using pathlib to be compatible with all OS's
img_dataset = ImageFolder(data_path/'train', transform=train_transform)
random_img_train_data, random_img_val_data,_ = torch.utils.data.random_split(img_dataset, [len_train, len_val, len_test])

# Randomized train and validation dataloader
rand_train_loader = torch.utils.data.DataLoader(random_img_train_data,
                                                batch_size=batch_size,
                                                num_workers=0,
                                                worker_init_fn=seed_worker,
                                                generator=g)

rand_val_loader = torch.utils.data.DataLoader(random_img_val_data,
                                              batch_size=1000,
                                              num_workers=0,
                                              worker_init_fn=seed_worker,
                                              generator=g)

In [None]:
# Dataloaders for the Partially Random Dataset

# Splitting data between training and validation dataset for partially randomized data
data_path = pathlib.Path('.')/'afhq_10_32x32/afhq_10' # using pathlib to be compatible with all OS's
img_dataset = ImageFolder(data_path/'train', transform=train_transform)
partially_random_train_data, partially_random_val_data,_ = torch.utils.data.random_split(img_dataset, [len_train, len_val, len_test])

# Training and Validation loader for partially randomized data
partial_rand_train_loader = torch.utils.data.DataLoader(partially_random_train_data,
                                                        batch_size=batch_size,
                                                        num_workers=0,
                                                        worker_init_fn=seed_worker,
                                                        generator=g)

partial_rand_val_loader = torch.utils.data.DataLoader(partially_random_val_data,
                                                      batch_size=1000,
                                                      num_workers=0,
                                                      worker_init_fn=seed_worker,
                                                      generator=g)

Now let's define a model which has many parameters compared to the training dataset size, and train it on these datasets.

In [None]:
# Network Class - Animal Faces
class BigAnimalNet(nn.Module):
  def __init__(self):
    super(BigAnimalNet, self).__init__()
    self.fc1 = nn.Linear(3*32*32, 124)
    self.fc2 = nn.Linear(124, 64)
    self.fc3 = nn.Linear(64, 3)

  def forward(self, x):
    x = x.view(x.shape[0], -1)
    x = F.leaky_relu(self.fc1(x))
    x = F.leaky_relu(self.fc2(x))
    x = self.fc3(x)
    output = F.log_softmax(x, dim=1)
    return output
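
To make "many parameters compared to the training dataset size" concrete, the parameter count of `BigAnimalNet` can be worked out from its layer shapes. The helper `linear_param_count` is introduced here for illustration; the layer sizes and the 100 training examples come from the cells above.

```python
def linear_param_count(in_features, out_features):
  """Weights plus biases of a fully connected layer."""
  return in_features * out_features + out_features


# (in_features, out_features) for fc1, fc2, fc3 of BigAnimalNet
layer_shapes = [(3 * 32 * 32, 124), (124, 64), (64, 3)]
per_layer = [linear_param_count(i, o) for i, o in layer_shapes]
total_params = sum(per_layer)
n_train_examples = 100

print(per_layer)                        # [381052, 8000, 195]
print(total_params)                     # 389247
print(total_params / n_train_examples)  # about 3892 parameters per training example
```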

Before training our `BigAnimalNet()`, calculate the Frobenius norm again.

In [None]:
model = BigAnimalNet()
normi, wsi, label = calculate_frobenius_norm(model)

Now, train our `BigAnimalNet()` model

In [None]:
# Here we have 100 true train data.

args = {
    'epochs': 200,
    'lr': 5e-3,
    'momentum': 0.9,
    'no_cuda': False,
}

acc_dict = {}

start_time = time.time()
val_acc_pure, train_acc_pure, _, model ,_ = main(args=args,
                                                 model=model,
                                                 train_loader=train_loader,
                                                 val_loader=val_loader)
end_time = time.time()

print("Time to memorize the dataset:",end_time - start_time)

# Train and Test accuracy plot
plt.figure(figsize=(8, 6))
plt.plot(val_acc_pure, label='Val Accuracy Pure', c='red', ls='dashed')
plt.plot(train_acc_pure, label='Train Accuracy Pure', c='red', ls='solid')
plt.axhline(y=max(val_acc_pure), c='green', ls='dashed',
            label='max Val accuracy pure')
plt.title('Memorization')
plt.ylabel('Accuracy (%)')
plt.xlabel('Epoch')
plt.legend()
plt.show()

In [None]:
#@markdown #### Frobenius norm for AnimalNet before and after training
normf, wsf, label = calculate_frobenius_norm(model)

plot_weights(float(normi), label, wsi, title='Weight Size Before Training')
plot_weights(float(normf), label, wsf, title='Weight Size After Training')

## Coding Exercise 2: Data Visualizer
Before we train the model on data with random labels, let's visualize the data and verify for ourselves that the labels are random. Here, we have classes = ("cat", "dog", "wild").

**Hint:** Use the `.permute()` method. `plt.imshow()` expects its input to be a NumPy array with shape (Px, Py, 3), where Px and Py are the number of pixels along the x and y axes, respectively.

In [None]:
def visualize_data(dataloader):
  ####################################################################
  # Fill in all missing code below (...),
  # then remove or comment the line below to test your function
  # The dataloader here gives out mini batches of 100 data points.
  raise NotImplementedError("Complete the `visualize_data` function")
  ####################################################################

  for idx, (data,label) in enumerate(dataloader):
    plt.figure(idx)
    # Choose the datapoint you would like to visualize
    index = ...

    # choose that datapoint using index and permute the dimensions
    # and bring the pixel values between [0,1]
    data = ...

    # Convert the torch tensor into numpy
    data = ...

    plt.imshow(data)
    plt.axis(False)
    image_class = classes[...]
    print(f'The image belongs to : {image_class}')

  plt.show()


## uncomment to run the function
# visualize_data(rand_train_loader)

[*Click for solution*](https://github.com/NeuromatchAcademy/course-content-dl/tree/main//tutorials/W1D5_Regularization/solutions/W1D5_Tutorial1_Solution_83bd9281.py)

*Example output:*

<img alt='Solution hint' align='left' width=832.0 height=832.0 src=https://raw.githubusercontent.com/NeuromatchAcademy/course-content-dl/main/tutorials/W1D5_Regularization/static/W1D5_Tutorial1_Solution_83bd9281_1.png>



Now let's train the network on the shuffled data and see if it memorizes.

In [None]:
# Here we have 100 completely shuffled train data.
args = {
    'epochs': 200,
    'lr': 5e-3,
    'momentum': 0.9,
    'no_cuda': False
}

acc_dict = {}
model = BigAnimalNet()


val_acc_random, train_acc_random, _,model,_ = main(args,
                                                   model,
                                                   rand_train_loader,
                                                   val_loader)

# Train and Test accuracy plot
plt.figure(figsize=(8, 6))
plt.plot(val_acc_random, label='Val Accuracy random', c='red', ls='dashed')
plt.plot(train_acc_random, label='Train Accuracy random', c='red', ls='solid')
plt.axhline(y=max(val_acc_random), c='green', ls='dashed', label='Max Val Accuracy random')
plt.title('Memorization')
plt.ylabel('Accuracy (%)')
plt.xlabel('Epoch')
plt.legend()
plt.show()

Finally, let's train on a partially shuffled dataset where 15% of the labels are noisy.

In [None]:
# Here we have 100 partially shuffled train data.
args = {
    'epochs': 200,
    'lr': 5e-3,
    'momentum': 0.9,
    'no_cuda': False,
}

acc_dict = {}
model = BigAnimalNet()


val_acc_shuffle, train_acc_shuffle, _,_,_ = main(args,
                                                 model,
                                                 partial_rand_train_loader,
                                                 val_loader)

# train and test acc plot
plt.figure(figsize=(8, 6))
plt.plot(val_acc_shuffle, label='Val Accuracy shuffle', c='red', ls='dashed')
plt.plot(train_acc_shuffle, label='Train Accuracy shuffle', c='red', ls='solid')
plt.axhline(y=max(val_acc_shuffle), c='green', ls='dashed', label='Max Val Accuracy shuffle')
plt.title('Memorization')
plt.ylabel('Accuracy (%)')
plt.xlabel('Epoch')
plt.legend()
plt.show()

In [None]:
#@markdown #### Plotting them all together (Run Me!)
plt.figure(figsize=(8, 6))
plt.plot(val_acc_pure,label='Val - Pure',c='red',ls = 'dashed')
plt.plot(train_acc_pure,label='Train - Pure',c='red',ls = 'solid')
plt.plot(val_acc_random,label='Val - Random',c='blue',ls = 'dashed')
plt.plot(train_acc_random,label='Train - Random',c='blue',ls = 'solid')
plt.plot(val_acc_shuffle, label='Val - shuffle', c='y', ls='dashed')
plt.plot(train_acc_shuffle, label='Train - shuffle', c='y', ls='solid')

plt.title('Memorization')
plt.ylabel('Accuracy (%)')
plt.xlabel('Epoch')
plt.legend()
plt.show()

## Think! 2: Does it Generalize?
Given that the neural network fits (memorizes) the training data perfectly:

*   Do you think it generalizes well?
*   What makes you think it does or doesn't?


[*Click for solution*](https://github.com/NeuromatchAcademy/course-content-dl/tree/main//tutorials/W1D5_Regularization/solutions/W1D5_Tutorial1_Solution_8f1d49a8.py)



Isn't it surprising that the NN was able to achieve 100% training accuracy on randomly shuffled labels? This is one of the reasons why training accuracy is not a good indicator of model performance.

It is also interesting to note that the model trained on slightly shuffled data sometimes does slightly better than the one trained on pure data. Shuffling some of the labels is a form of regularization, one of many ways of adding noise to the training data.
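
To make "partially shuffled" concrete, here is a minimal sketch of one way label noise could be injected. It is not the code that builds `partial_rand_train_loader` in this tutorial; the helper name `partially_shuffle_labels`, the toy labels, and the seed are assumptions made purely for illustration.

```python
import random


def partially_shuffle_labels(labels, fraction=0.15, seed=0):
  """Return a copy of `labels` in which a random `fraction` of the
  positions have had their labels permuted among themselves.

  Hypothetical helper for illustration, not part of the tutorial code.
  """
  rng = random.Random(seed)
  noisy = list(labels)
  n_noisy = int(round(fraction * len(noisy)))

  # Pick the positions whose labels will be shuffled
  idx = rng.sample(range(len(noisy)), n_noisy)

  # Permute the labels found at those positions
  picked = [noisy[i] for i in idx]
  rng.shuffle(picked)
  for i, lab in zip(idx, picked):
    noisy[i] = lab
  return noisy


labels = [i % 3 for i in range(100)]  # toy labels for 3 classes
noisy_labels = partially_shuffle_labels(labels, fraction=0.15)
n_changed = sum(a != b for a, b in zip(labels, noisy_labels))
print(f'{n_changed} of {len(labels)} labels changed')
```

Because a shuffled label can land on a position that already had the same class, the number of labels that actually change is at most 15 here, so the effective noise level is somewhat below the nominal fraction.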

---
# Section 4: Early Stopping


In [None]:
#@title Video 4: Early Stopping
from ipywidgets import widgets

out2 = widgets.Output()
with out2:
  from IPython.display import IFrame
  class BiliVideo(IFrame):
      def __init__(self, id, page=1, width=400, height=300, **kwargs):
          self.id=id
          src = "https://player.bilibili.com/player.html?bvid={0}&page={1}".format(id, page)
          super(BiliVideo, self).__init__(src, width, height, **kwargs)

  video = BiliVideo(id=f"", width=854, height=480, fs=1)
  print("Video available at https://www.bilibili.com/video/{0}".format(video.id))
  display(video)

out1 = widgets.Output()
with out1:
  from IPython.display import YouTubeVideo
  video = YouTubeVideo(id=f"GA6J-50GCWs", width=854, height=480, fs=1, rel=0)
  print("Video available at https://youtube.com/watch?v=" + video.id)
  display(video)

out = widgets.Tab([out1, out2])
out.set_title(0, 'Youtube')
out.set_title(1, 'Bilibili')

display(out)


Now that we have established that the validation accuracy reaches its peak well before the model overfits, we want to stop the training early. You should also have observed from the plots above that the train/test loss on real data is not very smooth, so the choice of stopping epoch can play a very large role in the val/test accuracy of your model.

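Early stopping halts training once the validation accuracy has stopped improving. In practice, we wait a fixed number of epochs without improvement (the *patience*) before stopping, and we keep the model from the best epoch.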

<center><img src="https://images.deepai.org/glossary-terms/early-stopping-machine-learning-5422207.jpg" alt="Overfitting" width="600"/></center>
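
The stopping rule can be tried out on its own, without any training. The sketch below applies a patience-based rule to an already recorded validation curve; the function name `early_stop_epoch` and the accuracy values are made up for illustration, and the rule (stop once `wait > patience`) is just one reasonable convention.

```python
def early_stop_epoch(val_accs, patience=20):
  """Return (stop_epoch, best_epoch, best_acc) for a recorded curve.

  Hypothetical helper: scans the curve and stops once more than
  `patience` consecutive epochs have passed without a new best value.
  """
  best_acc, best_epoch, wait = float('-inf'), 0, 0
  for epoch, acc in enumerate(val_accs):
    if acc > best_acc:
      # New best: remember it and reset the counter
      best_acc, best_epoch, wait = acc, epoch, 0
    else:
      wait += 1
    if wait > patience:
      return epoch, best_epoch, best_acc
  # Patience never ran out: training would run to the last epoch
  return len(val_accs) - 1, best_epoch, best_acc


# Made-up validation accuracies (%), for illustration only
toy_val_acc = [50, 55, 61, 60, 59, 62, 61, 60, 58, 57, 56]
stop_epoch, best_epoch, best_acc = early_stop_epoch(toy_val_acc, patience=3)
print(f'stopped at epoch {stop_epoch}, best epoch {best_epoch} ({best_acc}%)')
```

With `patience=3`, the dip after epoch 2 is tolerated because epoch 5 sets a new best, and training only stops at epoch 9. A patience that is too small would have stopped during the first dip and missed the later peak, which is why a noisy validation curve calls for some patience.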

## Coding Exercise 3: Early Stopping
Reimplement the main function to include early stopping as described above. Then run the code below to validate your implementation.

In [None]:
def early_stopping_main(args, model, train_loader, val_loader):

  ####################################################################
  # Fill in all missing code below (...),
  # then remove or comment the line below to test your function
  raise NotImplementedError("Complete the early_stopping_main function")
  ####################################################################

  use_cuda = not args['no_cuda'] and torch.cuda.is_available()
  device = torch.device('cuda' if use_cuda else 'cpu')

  model = model.to(device)
  optimizer = optim.SGD(model.parameters(),
                        lr=args['lr'],
                        momentum=args['momentum'])

  best_acc = 0.0
  best_epoch = 0

  # Number of successive epochs without improvement to wait before stopping training
  patience = 20

  # Keeps track of the number of epochs since val_acc last exceeded best_acc
  wait = 0

  val_acc_list, train_acc_list = [], []
  for epoch in tqdm(range(args['epochs'])):

    # train the model
    ...

    # calculate training accuracy
    train_acc = ...

    # calculate validation accuracy
    val_acc = ...

    if (val_acc > best_acc):
      best_acc = val_acc
      best_epoch = epoch
      best_model = copy.deepcopy(model)
      wait = 0
    else:
      wait += 1

    if (wait > patience):
      print(f'early stopped on epoch: {epoch}')
      break

    train_acc_list.append(train_acc)
    val_acc_list.append(val_acc)

  return val_acc_list, train_acc_list, best_model, best_epoch


args = {
    'epochs': 200,
    'lr': 5e-4,
    'momentum': 0.99,
    'no_cuda': False,
}

model = AnimalNet()

## Uncomment to test
# val_acc_earlystop, train_acc_earlystop, _, best_epoch = early_stopping_main(args, model, train_loader, val_loader)
# print(f'Maximum Validation Accuracy is reached at epoch: {best_epoch:2d}')
# early_stop_plot(train_acc_earlystop, val_acc_earlystop, best_epoch)

[*Click for solution*](https://github.com/NeuromatchAcademy/course-content-dl/tree/main//tutorials/W1D5_Regularization/solutions/W1D5_Tutorial1_Solution_dd5edfb8.py)

*Example output:*

<img alt='Solution hint' align='left' width=1120.0 height=832.0 src=https://raw.githubusercontent.com/NeuromatchAcademy/course-content-dl/main/tutorials/W1D5_Regularization/static/W1D5_Tutorial1_Solution_dd5edfb8_2.png>



## Think! 3: Early Stopping

Discuss with your pod:

*   Do you think early stopping can be harmful for training your network? Why or why not?

[*Click for solution*](https://github.com/NeuromatchAcademy/course-content-dl/tree/main//tutorials/W1D5_Regularization/solutions/W1D5_Tutorial1_Solution_683d27d3.py)

