<a href="https://colab.research.google.com/github/NeuromatchAcademy/course-content-dl/blob/main/tutorials/W1D5_Regularization/student/W1D5_Tutorial2.ipynb" target="_blank"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"/></a>

# Tutorial 2: Regularization techniques part 2
**Week 1, Day 5: Regularization**

**By Neuromatch Academy**


__Content creators:__ Ravi Teja Konkimalla, Mohitrajhu Lingan Kumaraian, Kevin Machado Gamboa, Kelson Shilling-Scrivo, Lyle Ungar

__Content reviewers:__ Piyush Chauhan, Kelson Shilling-Scrivo

__Content editors:__ Roberto Guidotti, Spiros Chavlis

__Production editors:__ Saeed Salehi, Spiros Chavlis

**Our 2021 Sponsors, including Presenting Sponsor Facebook Reality Labs**

<p align='center'><img src='https://github.com/NeuromatchAcademy/widgets/blob/master/sponsors.png?raw=True'/></p>

---
# Tutorial Objectives

1.   Regularization as shrinkage of overparameterized models: L1, L2
2.   Regularization by Dropout
3.   Regularization by Data Augmentation
4.   Perils of Hyper-Parameter Tuning
5.   Rethinking generalization

In [None]:
#@markdown Tutorial slides

#@markdown **Do not read them now.**

from IPython.display import HTML
HTML('<iframe src="https://docs.google.com/presentation/d/1N9aguIPiBSjzo0ToqPi5uwwG8ChY6_E3/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>')

---
# Setup
Note that some of the code for today can take up to an hour to run. We have therefore "hidden" that code and shown the resulting outputs.


**Ensure you're running a GPU notebook:**
From "Runtime" in the drop-down menu above, click "Change runtime type". Ensure that "Hardware Accelerator" says "GPU".

**Ensure you can save:** From "File", click "Save a copy in Drive"

In [None]:
# Imports
from __future__ import print_function

import time
import copy
import torch
import random
import pathlib

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation

import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.nn.utils.prune as prune

from torchvision import datasets, transforms
from torchvision.datasets import ImageFolder
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader, TensorDataset

from tqdm.auto import tqdm
from IPython.display import HTML, display

##  Figure Settings


In [None]:
# @title Figure Settings
import ipywidgets as widgets
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/content-creation/main/nma.mplstyle")

##  Loading Animal Faces data


In [None]:
# @title Loading Animal Faces data
%%capture
!rm -r AnimalFaces32x32/
!git clone https://github.com/arashash/AnimalFaces32x32
!rm -r afhq/
!unzip ./AnimalFaces32x32/afhq_32x32.zip

##  Loading Animal Faces Randomized data


In [None]:
# @title Loading Animal Faces Randomized data
%%capture
!rm -r Animal_faces_random/
!git clone https://github.com/Ravi3191/Animal_faces_random.git
!rm -r afhq_random_32x32/
!unzip ./Animal_faces_random/afhq_random_32x32.zip
!rm -r afhq_10_32x32/
!unzip ./Animal_faces_random/afhq_10_32x32.zip

##  Plotting functions


In [None]:
# @title Plotting functions
def imshow(img):
  img = img / 2 + 0.5     # unnormalize
  npimg = img.numpy()
  plt.imshow(np.transpose(npimg, (1, 2, 0)))
  plt.axis(False)
  plt.show()

def plot_weights(norm, labels, ws, title='Weight Size Measurement'):
  plt.figure(figsize=[8, 6])
  plt.title(title)
  plt.ylabel('Frobenius Norm Value')
  plt.xlabel('Model Layers')
  plt.bar(labels, ws)
  plt.axhline(y=norm,
              linewidth=1,
              color='r',
              ls='--',
              label='Total Model F-Norm')
  plt.legend()
  plt.show()

##  Helper functions


In [None]:
# @title Helper functions

##Network Class - Animal Faces
class AnimalNet(nn.Module):
  def __init__(self):
    super(AnimalNet, self).__init__()
    self.fc1 = nn.Linear(3 * 32 * 32, 128)
    self.fc2 = nn.Linear(128, 32)
    self.fc3 = nn.Linear(32, 3)

  def forward(self, x):
    x = x.view(x.shape[0], -1)
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)
    output = F.log_softmax(x, dim=1)
    return output


def train(args, model, device, train_loader, optimizer, epoch,
          reg_function1=None, reg_function2=None, criterion=F.nll_loss):
  """
  Trains the input model on the data from train_loader,
  updating the parameters for a single pass (one epoch)
  """
  model.train()
  for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = model(data)
    if reg_function1 is None:
      loss = criterion(output, target)
    elif reg_function2 is None:
      loss = criterion(output, target)+args['lambda']*reg_function1(model)
    else:
      loss = criterion(output, target)+args['lambda1']*reg_function1(model)+args['lambda2']*reg_function2(model)
    loss.backward()
    optimizer.step()


def test(model, device, test_loader, loader = 'Test', criterion=F.nll_loss):
  """
  Tests the current Model
  """
  model.eval()
  test_loss = 0
  correct = 0
  with torch.no_grad():
    for data, target in test_loader:
      data, target = data.to(device), target.to(device)
      output = model(data)
      test_loss += criterion(output, target, reduction='sum').item()  # sum up batch loss
      pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
      correct += pred.eq(target.view_as(pred)).sum().item()

  test_loss /= len(test_loader.dataset)
  return 100. * correct / len(test_loader.dataset)


def main(args, model, train_loader, val_loader, test_data,
         reg_function1=None, reg_function2=None, criterion=F.nll_loss):
  """
  Trains the model with train_loader and tests the learned model using val_loader
  """

  use_cuda = not args['no_cuda'] and torch.cuda.is_available()
  device = torch.device('cuda' if use_cuda else 'cpu')

  model = model.to(device)
  optimizer = optim.SGD(model.parameters(), lr=args['lr'], momentum=args['momentum'])

  val_acc_list, train_acc_list,param_norm_list = [], [], []
  for epoch in tqdm(range(args['epochs'])):
    train(args, model, device, train_loader, optimizer, epoch,reg_function1=reg_function1,reg_function2=reg_function2)
    train_acc = test(model,device,train_loader, 'Train')
    val_acc = test(model,device,val_loader, 'Val')
    param_norm = calculate_frobenius_norm(model)
    train_acc_list.append(train_acc)
    val_acc_list.append(val_acc)
    param_norm_list.append(param_norm)

  return val_acc_list, train_acc_list, param_norm_list, model, 0


def calculate_frobenius_norm(model):
  """
  Returns the Frobenius norm of all model parameters
  """
  norm = 0.0

  # Sum the squares of all parameters (detached from the autograd graph)
  for name, param in model.named_parameters():
    norm += torch.norm(param).data**2

  # Take the square root of the sum of squares
  return norm**0.5


class Net(nn.Module):
  def __init__(self):
    super(Net, self).__init__()

    self.fc1 = nn.Linear(1, 300)
    self.fc2 = nn.Linear(300, 500)
    self.fc3 = nn.Linear(500, 1)

  def forward(self, x):
    x = F.leaky_relu(self.fc1(x))
    x = F.leaky_relu(self.fc2(x))
    output = self.fc3(x)
    return output

# Network Class - Animal Faces
class BigAnimalNet(nn.Module):
  def __init__(self):
    super(BigAnimalNet, self).__init__()
    self.fc1 = nn.Linear(3*32*32, 124)
    self.fc2 = nn.Linear(124, 64)
    self.fc3 = nn.Linear(64, 3)

  def forward(self, x):
    x = x.view(x.shape[0],-1)
    x = F.leaky_relu(self.fc1(x))
    x = F.leaky_relu(self.fc2(x))
    x = self.fc3(x)
    output = F.log_softmax(x, dim=1)
    return output


def visualize_data(dataloader):

  for idx, (data,label) in enumerate(dataloader):
    plt.figure(idx)
    # Choose the datapoint you would like to visualize
    index = 22

    # choose that datapoint using index and permute the dimensions
    # and bring the pixel values between [0,1]
    data = data[index].permute(1, 2, 0) * \
           torch.tensor([0.5, 0.5, 0.5]) + \
           torch.tensor([0.5, 0.5, 0.5])

    # Convert the torch tensor into numpy
    data = data.numpy()

    plt.imshow(data)
    plt.axis(False)
    image_class = classes[label[index].item()]
    print(f'The image belongs to : {image_class}')

  plt.show()


def early_stopping_main(args, model, train_loader, val_loader, test_data):

  device = set_device()

  model = model.to(device)
  optimizer = optim.SGD(model.parameters(), lr=args['lr'], momentum=args['momentum'])

  best_acc  = 0.0
  best_epoch = 0

  # Number of successive epochs that you want to wait before stopping training process
  patience = 20

  # Keeps track of the number of epochs during which val_acc did not exceed best_acc
  wait = 0

  val_acc_list, train_acc_list = [], []
  for epoch in tqdm(range(args['epochs'])):
    train(args, model, device, train_loader, optimizer, epoch)
    train_acc = test(model,device,train_loader, 'Train')
    val_acc = test(model,device,val_loader, 'Val')
    if (val_acc > best_acc):
      best_acc = val_acc
      best_epoch = epoch
      best_model = copy.deepcopy(model)
      wait = 0
    else:
      wait += 1
    if (wait > patience):
      print('early stopped on epoch:',epoch)
      break
    train_acc_list.append(train_acc)
    val_acc_list.append(val_acc)

  return val_acc_list, train_acc_list, best_model, best_epoch

##  Seeding for Reproducibility


In [None]:
#@title Seeding for Reproducibility
def set_seed(seed=None, seed_torch=True):
  if seed is None:
    seed = np.random.choice(2 ** 32)
  random.seed(seed)
  np.random.seed(seed)
  if seed_torch:
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

  print(f'Random seed {seed} has been set.')


# In case that `DataLoader` is used
def seed_worker(worker_id):
  worker_seed = torch.initial_seed() % 2**32
  np.random.seed(worker_seed)
  random.seed(worker_seed)


set_seed(seed=90108, seed_torch=True)

##  Set device (GPU or CPU). Execute `set_device()`


In [None]:
#@title Set device (GPU or CPU). Execute `set_device()`

# inform the user if the notebook uses GPU or CPU.

def set_device():
  device = "cuda" if torch.cuda.is_available() else "cpu"
  if device != "cuda":
    print("WARNING: For this notebook to perform best, "
        "if possible, in the menu under `Runtime` -> "
        "`Change runtime type.`  select `GPU` ")
  else:
    print("GPU is enabled in this notebook.")

  return device


set_device()

In [None]:
# Seed parameter
# Note that if you change this value, some results may differ
# from the solutions
SEED = 90108
DEVICE = set_device()

##  Dataloaders for the Dataset


In [None]:
#@title Dataloaders for the Dataset
##Dataloaders for the Dataset
batch_size = 128
classes = ('cat', 'dog', 'wild')

train_transform = transforms.Compose([
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
     ])
data_path = pathlib.Path('.')/'afhq' # using pathlib to be compatible with all OS's
img_dataset = ImageFolder(data_path/'train', transform=train_transform)


####################################################
g_seed = torch.Generator()
g_seed.manual_seed(SEED)


##Dataloaders for the  Original Dataset
img_train_data, img_val_data,_ = torch.utils.data.random_split(img_dataset,
                                                               [100, 100, 14430])

#Creating train_loader and Val_loader
train_loader = torch.utils.data.DataLoader(img_train_data,batch_size=batch_size,
                                           worker_init_fn=seed_worker,
                                           generator=g_seed)
val_loader = torch.utils.data.DataLoader(img_val_data,batch_size=1000,
                                         worker_init_fn=seed_worker,
                                         generator=g_seed)

#creating test dataset
test_transform = transforms.Compose([
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
     ])
img_test_dataset = ImageFolder(data_path/'val', transform=test_transform)


####################################################

##Dataloaders for the  Random Dataset

#splitting randomized data into training and validation data
data_path = pathlib.Path('.')/'afhq_random_32x32/afhq_random' # using pathlib to be compatible with all OS's
img_dataset = ImageFolder(data_path/'train', transform=train_transform)
random_img_train_data, random_img_val_data,_ = torch.utils.data.random_split(img_dataset, [100,100,14430])

#Randomized train and validation dataloader
rand_train_loader = torch.utils.data.DataLoader(random_img_train_data,
                                                batch_size=batch_size,num_workers = 0,
                                                worker_init_fn=seed_worker,
                                                generator=g_seed)
rand_val_loader = torch.utils.data.DataLoader(random_img_val_data,
                                              batch_size=1000,
                                              num_workers = 0,
                                              worker_init_fn=seed_worker,
                                              generator=g_seed)

####################################################

##Dataloaders for the Partially Random Dataset

#Splitting data between training and validation dataset for partially randomized data
data_path = pathlib.Path('.')/'afhq_10_32x32/afhq_10' # using pathlib to be compatible with all OS's
img_dataset = ImageFolder(data_path/'train', transform=train_transform)
partially_random_train_data, partially_random_val_data, _ = torch.utils.data.random_split(img_dataset, [100,100,14430])

#Training and Validation loader for partially randomized data
partial_rand_train_loader = torch.utils.data.DataLoader(partially_random_train_data,
                                                        batch_size=batch_size,num_workers = 0,
                                                        worker_init_fn=seed_worker,
                                                        generator=g_seed)
partial_rand_val_loader = torch.utils.data.DataLoader(partially_random_val_data,
                                                      batch_size=1000,
                                                      num_workers = 0,
                                                      worker_init_fn=seed_worker,
                                                      generator=g_seed)

---
# Section 1: L1 and L2 Regularization


##  Video 1: L1 and L2 regression


In [None]:
#@title Video 1: L1 and L2 regression
from ipywidgets import widgets

out2 = widgets.Output()
with out2:
  from IPython.display import IFrame
  class BiliVideo(IFrame):
    def __init__(self, id, page=1, width=400, height=300, **kwargs):
      self.id=id
      src = "https://player.bilibili.com/player.html?bvid={0}&page={1}".format(id, page)
      super(BiliVideo, self).__init__(src, width, height, **kwargs)

  video = BiliVideo(id=f"", width=730, height=410, fs=1)
  print("Video available at https://www.bilibili.com/video/{0}".format(video.id))
  display(video)

out1 = widgets.Output()
with out1:
  from IPython.display import YouTubeVideo
  video = YouTubeVideo(id=f"oQNdloKdysM", width=730, height=410, fs=1, rel=0)
  print("Video available at https://youtube.com/watch?v=" + video.id)
  display(video)

out = widgets.Tab([out1, out2])
out.set_title(0, 'Youtube')
out.set_title(1, 'Bilibili')

display(out)

You may have come across L1 and L2 regularization in other courses. They are the most common types of regularization. Both modify the cost function by adding a regularization term:

***Cost function = Loss (say, binary cross entropy) + Regularization term***

The regularization term penalizes large parameters, so training is pushed toward smaller weights. The underlying assumption is that models with smaller weights are simpler and therefore overfit less.

Discuss with your teammates: is this assumption a good one?
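
As a minimal sketch of the cost function above (not the tutorial's training loop), the snippet below adds a penalty to a data loss for a small linear layer. The names `penalty`, `lam`, and `toy_model`, the toy tensors, and the value of `lam` are illustrative assumptions. The sum of absolute parameter values is used only as one example of a regularization term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy data and model (illustrative assumptions, not the tutorial's dataset)
x = torch.randn(8, 4)
y = torch.randint(0, 3, (8,))
toy_model = nn.Linear(4, 3)


def penalty(model):
  """Example regularization term: sum of absolute parameter values."""
  return sum(p.abs().sum() for p in model.parameters())


lam = 0.01  # regularization strength (hypothetical value)

data_loss = F.cross_entropy(toy_model(x), y)
cost = data_loss + lam * penalty(toy_model)

# The penalty is non-negative, so the cost is never below the data loss
print(f'loss: {data_loss.item():.4f}, cost: {cost.item():.4f}')
```

Calling `cost.backward()` would propagate gradients from both terms, which is how the penalty influences the parameter updates.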

## Section 1.1: Unregularized Model

 #### Dataloaders for Regularization


In [None]:
#@markdown #### Dataloaders for Regularization
data_path = pathlib.Path('.')/'afhq' # using pathlib to be compatible with all OS's
img_dataset = ImageFolder(data_path/'train', transform=train_transform)

# Splitting dataset
reg_train_data, reg_val_data,_ = torch.utils.data.random_split(img_dataset,
                                                               [30, 100, 14500])
g = torch.Generator()
g.manual_seed(SEED)

# Creating train_loader and Val_loader
reg_train_loader = torch.utils.data.DataLoader(reg_train_data,
                                               batch_size=batch_size,
                                               worker_init_fn=seed_worker,
                                               generator=g)
reg_val_loader = torch.utils.data.DataLoader(reg_val_data,
                                             batch_size=1000,
                                             worker_init_fn=seed_worker,
                                             generator=g)

Now let's train a model without any regularization and keep it aside as our benchmark for this section.

In [None]:
args = {
    'epochs': 150,
    'lr': 5e-3,
    'momentum': 0.99,
    'no_cuda': False,
}

acc_dict = {}
model = AnimalNet()

val_acc_unreg, train_acc_unreg,param_norm_unreg,_ ,_ = main(args,
                                                            model,
                                                            reg_train_loader,
                                                            reg_val_loader,
                                                            img_test_dataset)

# Train and validation accuracy plot
plt.figure(figsize=(8, 6))
plt.plot(val_acc_unreg, label='Val Accuracy', c='red', ls='dashed')
plt.plot(train_acc_unreg, label='Train Accuracy', c='red', ls='solid')
plt.axhline(y=max(val_acc_unreg), c='green', ls='dashed')
plt.title('Unregularized Model')
plt.ylabel('Accuracy (%)')
plt.xlabel('Epoch')
plt.legend()
plt.show()
print('maximum Validation Accuracy reached:%f' % max(val_acc_unreg))

## Section 1.2: L1 Regularization

L1 (or "LASSO") regularization uses a penalty equal to the sum of the absolute values of all the weights in the network, resulting in the following loss function ($L$ is the usual cross-entropy loss):

\begin{equation}
L_R = L + \lambda \sum \left| w^{(r)}_{ij} \right|
\end{equation}

At a high level, L1 regularization is similar to L2 regularization since it leads to smaller weights. (You will see the analogy in the next subsection.) It results in the following weight update equation when using stochastic gradient descent, where $\eta$ is the learning rate and $\text{sgn}$ is the sign function, such that $\text{sgn}(w) = +1$ if $w > 0$, $\text{sgn}(w) = -1$ if $w < 0$, and $\text{sgn}(0) = 0$:

\begin{equation}
w^{(r)}_{ij} \leftarrow w^{(r)}_{ij} - \eta \lambda \, \text{sgn}\left(w^{(r)}_{ij}\right) - \eta \frac{\partial L}{\partial w^{(r)}_{ij}}
\end{equation}
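
The L1 update rule can be checked numerically. The sketch below is an illustration under stated assumptions: the quadratic `data_loss` stands in for the data loss $L$, and the values of `eta` and `lam` are made up. It compares one plain gradient step on the regularized loss with the closed-form update that uses the sign function.

```python
import torch

torch.manual_seed(0)

eta, lam = 0.1, 0.01  # learning rate and L1 strength (hypothetical values)
w = torch.randn(5, requires_grad=True)
x = torch.randn(5)

# Stand-in for the data loss L (an arbitrary smooth function of w)
data_loss = ((w * x).sum() - 1.0) ** 2
grad_L = torch.autograd.grad(data_loss, w, retain_graph=True)[0]

# Gradient of the regularized loss L_R = L + lam * sum(|w|)
reg_loss = data_loss + lam * w.abs().sum()
grad_LR = torch.autograd.grad(reg_loss, w)[0]

# Closed-form update: w - eta * lam * sgn(w) - eta * dL/dw
w_closed = w.detach() - eta * lam * torch.sign(w.detach()) - eta * grad_L

# One plain gradient step on the regularized loss
w_sgd = w.detach() - eta * grad_LR

print('updates agree:', torch.allclose(w_closed, w_sgd, atol=1e-6))
```

Note that the L1 term moves every nonzero weight toward zero by the same amount `eta * lam`, regardless of the weight's magnitude.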

### Coding Exercise 1.1: L1 Regularization

Write a function that calculates the L1 norm of all the tensors of a PyTorch model.

In [None]:
def l1_reg(model):
  """
    Inputs: PyTorch model
    This function calculates the L1 norm of all the tensors in the model
  """
  l1 = 0.0
  ####################################################################
  # Fill in all missing code below (...),
  # then remove or comment the line below to test your function
  raise NotImplementedError("Complete the l1_reg function")
  ####################################################################
  for param in model.parameters():
    l1 += ...

  return l1


set_seed(seed=SEED)
## uncomment to test
# net = nn.Linear(20, 20)
# print(f'L1 norm of the model: {l1_reg(net)}')

[*Click for solution*](https://github.com/NeuromatchAcademy/course-content-dl/tree/main//tutorials/W1D5_Regularization/solutions/W1D5_Tutorial2_Solution_a6c2ad69.py)



```
Random seed 2021 has been set.
L1 norm of the model: 48.445133209228516
```

Now, let's train a classifier which uses L1 regularization. Tune the hyperparameter `lambda` such that the validation accuracy is higher than that of the unregularized model.

In [None]:
args = {
    'epochs': 150,
    'lr': 5e-3,
    'momentum': 0.99,
    'no_cuda': False,
    'lambda': 0.0  # <<<<<<<< Tune the hyperparameter lambda
}

acc_dict = {}
model = AnimalNet()

val_acc_l1reg, train_acc_l1reg, param_norm_l1reg, _, _ = main(args,
                                                              model,
                                                              reg_train_loader,
                                                              reg_val_loader,
                                                              img_test_dataset,
                                                              reg_function1=l1_reg)

# Train and validation accuracy plot
plt.figure(figsize=(8, 6))
plt.plot(val_acc_l1reg, label='Val Accuracy L1 Regularized',
         c='red', ls='dashed')
plt.plot(train_acc_l1reg, label='Train Accuracy L1 regularized',
         c='red', ls='solid')
plt.axhline(y=max(val_acc_l1reg), c='green', ls='dashed')
plt.title('L1 regularized model')
plt.ylabel('Accuracy (%)')
plt.xlabel('Epoch')
plt.legend()
plt.show()
print('maximum Validation Accuracy reached:%f'%max(val_acc_l1reg))

What value of `lambda` worked for L1 regularization?

## Section 1.3: L2 / Ridge Regularization

L2 regularization, sometimes referred to as “weight decay”, is widely used. It works by adding a quadratic penalty term to the cross-entropy loss function $L$, which results in a new loss function $L_R$ given by:

\begin{equation}
L_R = L + \lambda \sum \left( w^{(r)}_{ij} \right)^2
\end{equation}

To get further insight into L2 regularization, we investigate its effect on the gradient-descent update equations for the weight and bias parameters. Taking the derivative of both sides of the above equation with respect to a single weight, we obtain

\begin{equation}
\frac{\partial L_R}{\partial w^{(r)}_{ij}} = \frac{\partial L}{\partial w^{(r)}_{ij}} + 2 \lambda w^{(r)}_{ij}
\end{equation}

Thus the weight update rule becomes:

\begin{equation}
w^{(r)}_{ij} \leftarrow w^{(r)}_{ij} - \eta \frac{\partial L}{\partial w^{(r)}_{ij}} - 2 \eta \lambda w^{(r)}_{ij} = (1 - 2 \eta \lambda) w^{(r)}_{ij} - \eta \frac{\partial L}{\partial w^{(r)}_{ij}}
\end{equation}

where $\eta$ is the learning rate. Each step therefore shrinks the weight by a constant factor before applying the usual gradient update, which is why the method is called weight decay.
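
A quick numerical check of the L2 update, under illustrative assumptions (the quadratic `data_loss` and the values of `eta` and `lam` are made up for the demonstration). Differentiating the penalty $\lambda \sum w^2$ contributes $2 \lambda w$ to the gradient, so a plain gradient step on the regularized loss is the same as first shrinking the weights by the factor $(1 - 2 \eta \lambda)$ and then taking the usual step on the data loss. Some texts write the penalty as $(\lambda / 2) \sum w^2$ so that the factor of 2 disappears.

```python
import torch

torch.manual_seed(0)

eta, lam = 0.1, 0.01  # learning rate and L2 strength (hypothetical values)
w = torch.randn(5, requires_grad=True)
x = torch.randn(5)

# Stand-in for the data loss L (an arbitrary smooth function of w)
data_loss = ((w * x).sum() - 1.0) ** 2
grad_L = torch.autograd.grad(data_loss, w, retain_graph=True)[0]

# Gradient of the regularized loss L_R = L + lam * sum(w ** 2)
reg_loss = data_loss + lam * (w ** 2).sum()
grad_LR = torch.autograd.grad(reg_loss, w)[0]

# One plain gradient step on the regularized loss
w_sgd = w.detach() - eta * grad_LR

# "Weight decay" form: shrink the weights, then step on the data loss only
w_decay = (1 - 2 * eta * lam) * w.detach() - eta * grad_L

print('updates agree:', torch.allclose(w_sgd, w_decay, atol=1e-6))
```

In contrast to L1, the shrinkage here is proportional to the weight itself, so large weights are pulled toward zero more strongly than small ones.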

### Coding Exercise 1.2: L2 Regularization

Write a function that calculates the L2 norm of all the tensors of a PyTorch model. (What did we call this before?)

In [None]:
def l2_reg(model):

  """
    Inputs: PyTorch model
    This function calculates the L2 norm of all the tensors in the model
  """

  l2 = 0.0
  ####################################################################
  # Fill in all missing code below (...),
  # then remove or comment the line below to test your function
  raise NotImplementedError("Complete the l2_reg function")
  ####################################################################
  for param in model.parameters():
    l2 += ...

  return l2


set_seed(SEED)
## uncomment to test
# net = nn.Linear(20, 20)
# print(f'L2 norm of the model: {l2_reg(net)}')

[*Click for solution*](https://github.com/NeuromatchAcademy/course-content-dl/tree/main//tutorials/W1D5_Regularization/solutions/W1D5_Tutorial2_Solution_984c4088.py)



```
Random seed 2021 has been set.
L2 norm of the model: 7.328375816345215
```

Now we'll train a classifier which uses L2 regularization. Tune the hyperparameter `lambda` such that the val accuracy is higher than that of the unregularized model.

In [None]:
args = {
    'test_batch_size': 1000,
    'epochs': 150,
    'lr': 5e-3,
    'momentum': 0.99,
    'no_cuda': False,
    'lambda': 0.0  # <<<<<<<< Tune the hyperparameter lambda
}

acc_dict = {}
model = AnimalNet()

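# Note: `train` applies a single penalty only when it is passed as
# `reg_function1`, so the L2 penalty goes there. The same loaders as the
# unregularized benchmark are used so that the accuracies are comparable.
val_acc_l2reg, train_acc_l2reg, param_norm_l2reg, model, _ = main(args,
                                                                  model,
                                                                  reg_train_loader,
                                                                  reg_val_loader,
                                                                  img_test_dataset,
                                                                  reg_function1=l2_reg)

# Train and validation accuracy plot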
plt.figure(figsize=(8, 6))
plt.plot(val_acc_l2reg, label='Val Accuracy L2 regularized',
         c='red', ls='dashed')
plt.plot(train_acc_l2reg, label='Train Accuracy L2 regularized',
         c='red', ls='solid')
plt.axhline(y=max(val_acc_l2reg), c='green', ls='dashed')
plt.title('L2 Regularized Model')
plt.ylabel('Accuracy (%)')
plt.xlabel('Epoch')
plt.legend()
plt.show()
print('maximum Validation Accuracy reached:%f'%max(val_acc_l2reg))

What value of `lambda` worked for L2 regularization?

 #### Visualize all of them together (Run Me!)


In [None]:
#@markdown #### Visualize all of them together (Run Me!)
args = {'test_batch_size': 1000,
        'epochs': 150,
        'lr': 5e-3,
        'momentum': 0.99,
        'no_cuda': False,
        'lambda1': 0.001,
        'lambda2': 0.001
        }
model = AnimalNet()
# Use the same loaders as the unregularized benchmark so the curves are comparable
val_acc_l1l2reg, train_acc_l1l2reg, param_norm_l1l2reg, _, _ = main(args,
                                                  model,
                                                  reg_train_loader,
                                                  reg_val_loader,
                                                  img_test_dataset,
                                                  reg_function1=l1_reg,
                                                  reg_function2=l2_reg)

plt.figure(figsize=(12, 6))
plt.plot(val_acc_l2reg,c='red',ls = 'dashed')
plt.plot(train_acc_l2reg,label='L2 regularized',c='red',ls = 'solid')
plt.axhline(y=max(val_acc_l2reg),c = 'red',ls = 'dashed')
plt.plot(val_acc_l1reg,c='green',ls = 'dashed')
plt.plot(train_acc_l1reg,label='L1 regularized',c='green',ls = 'solid')
plt.axhline(y=max(val_acc_l1reg),c = 'green',ls = 'dashed')
plt.plot(val_acc_unreg,c='blue',ls = 'dashed')
plt.plot(train_acc_unreg,label='Unregularized',c='blue',ls = 'solid')
plt.axhline(y=max(val_acc_unreg),c = 'blue',ls = 'dashed')
plt.plot(val_acc_l1l2reg,c='orange',ls = 'dashed')
plt.plot(train_acc_l1l2reg,label='L1+L2 regularized',c='orange',ls = 'solid')
plt.axhline(y=max(val_acc_l1l2reg),c = 'orange',ls = 'dashed')

plt.title('Unregularized Vs L1-Regularized vs L2-regularized Vs L1+L2 regularized')
plt.xlabel('epoch')
plt.ylabel('Accuracy(%)')
plt.legend()
plt.show()

Now, let's visualize what these different regularization methods do to the parameters of the model. We observe the effect by computing the size (technically, the Frobenius norm) of the model parameters.
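
To make the notion of "size" concrete, here is a small sketch that computes the Frobenius norm per parameter tensor and for the whole model. The helper `layer_norms` and the toy `nn.Sequential` model are hypothetical and only illustrate the computation. The total norm is the square root of the sum of the squared per-tensor norms.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def layer_norms(model):
  """Return ({parameter name: Frobenius norm}, total norm) for a model."""
  norms = {name: torch.norm(p.detach()).item()
           for name, p in model.named_parameters()}
  # Squared norms add up across tensors, so combine them before the root
  total = sum(v ** 2 for v in norms.values()) ** 0.5
  return norms, total


toy_model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
norms, total = layer_norms(toy_model)

for name, value in norms.items():
  print(f'{name}: {value:.4f}')
print(f'total: {total:.4f}')
```

Values like these could be passed to the `plot_weights` helper defined earlier, for example as `plot_weights(total, list(norms.keys()), list(norms.values()))`.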

 #### Visualize Norm of the Models (Train Me!)


In [None]:
#@markdown #### Visualize Norm of the Models (Train Me!)
plt.figure(figsize=(8, 6))
plt.plot(param_norm_unreg,label='Unregularized',c = 'blue')
plt.plot(param_norm_l1reg,label = 'L1 Regularized', c='green')
plt.plot(param_norm_l2reg,label='L2 Regularized',c='red')
plt.plot(param_norm_l1l2reg,label='L1+L2 Regularized',c='orange')
plt.title('Parameter Norm as a function of training Epoch')
plt.xlabel('epoch')
plt.ylabel('Parameter Norms')
plt.legend()
plt.show()

In the above plots, you should have seen that even after the model achieves 100% train accuracy, the validation accuracies keep fluctuating. This suggests that the model is still trying to learn something. Why would this be the case?

---
# Section 2: Dropout


##  Video 2: Dropout


In [None]:
#@title Video 2: Dropout
from ipywidgets import widgets

out2 = widgets.Output()
with out2:
  from IPython.display import IFrame
  class BiliVideo(IFrame):
    def __init__(self, id, page=1, width=400, height=300, **kwargs):
      self.id=id
      src = "https://player.bilibili.com/player.html?bvid={0}&page={1}".format(id, page)
      super(BiliVideo, self).__init__(src, width, height, **kwargs)

  video = BiliVideo(id=f"", width=730, height=410, fs=1)
  print("Video available at https://www.bilibili.com/video/{0}".format(video.id))
  display(video)

out1 = widgets.Output()
with out1:
  from IPython.display import YouTubeVideo
  video = YouTubeVideo(id=f"UZfUzawej3A", width=730, height=410, fs=1, rel=0)
  print("Video available at https://youtube.com/watch?v=" + video.id)
  display(video)

out = widgets.Tab([out1, out2])
out.set_title(0, 'Youtube')
out.set_title(1, 'Bilibili')

display(out)

In dropout, we literally drop out (zero out) some neurons during training. On each training iteration, standard dropout zeros out some fraction (usually 1/2) of the nodes in each layer before calculating the subsequent layer. Randomly selecting a different subset to drop out each time introduces noise into the process and reduces overfitting.

<center><img src="https://d2l.ai/_images/dropout2.svg" alt="Dropout" width="600"/></center>
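
The behaviour described above can be seen directly with `nn.Dropout`. This is a small sketch on a vector of ones, where the names and the choice `p = 0.5` are illustrative. In training mode PyTorch zeros each element with probability `p` and rescales the survivors by `1 / (1 - p)`, so the expected activation is unchanged. In evaluation mode the layer is the identity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

p = 0.5  # dropout probability (illustrative value)
drop = nn.Dropout(p)
x = torch.ones(1000)

# Training mode: random elements are zeroed, survivors are scaled by 1 / (1 - p)
drop.train()
y_train = drop(x)

# Evaluation mode: dropout is switched off and the input passes through
drop.eval()
y_eval = drop(x)

frac_zeroed = (y_train == 0).float().mean().item()
print(f'fraction zeroed in train mode: {frac_zeroed:.3f}')
print('distinct values in train mode:', y_train.unique().tolist())
print('eval mode leaves input unchanged:', torch.equal(y_eval, x))
```

This is why the training loops in this tutorial call `model.train()` before the update step and `model.eval()` before computing the test loss.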


Now let's revisit the toy dataset that we generated above to visualize how the dropout stabilizes training on a noisy dataset. We will slightly modify the architecture we used above to add dropout layers.

In [None]:
# Network Class - 2D
class Net(nn.Module):
  def __init__(self):
    super(Net, self).__init__()

    self.fc1 = nn.Linear(1, 300)
    self.fc2 = nn.Linear(300, 500)
    self.fc3 = nn.Linear(500, 1)
    self.dropout1 = nn.Dropout(0.4)
    self.dropout2 = nn.Dropout(0.2)

  def forward(self, x):
    x = F.leaky_relu(self.dropout1(self.fc1(x)))
    x = F.leaky_relu(self.dropout2(self.fc2(x)))
    output = self.fc3(x)
    return output

 #### Run to train the default network


In [None]:
#@markdown #### Run to train the default network

#creating train data
X = torch.rand((10,1))
X, _ = X.sort(dim = 0)  # sort returns (values, indices) and is not in-place
Y = 2*X + 2*torch.empty((X.shape[0],1)).normal_(mean=0,std=1) # adding Gaussian noise to the data

X = X.unsqueeze_(1)
Y = Y.unsqueeze_(1)

#creating test dataset
X_test = torch.linspace(0, 1, 40)
X_test = X_test.reshape((40, 1, 1))

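#train the network on toy dataset
# `Net` now refers to the dropout variant defined above, so switch dropout off
# (p = 0) to obtain the default network used as the no-dropout baseline
model = Net()
for module in model.modules():
  if isinstance(module, nn.Dropout):
    module.p = 0.0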
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(),lr = 1e-4)
max_epochs = 10000
iters = 0

running_predictions = np.empty((40,(int)(max_epochs/500 + 1)))

train_loss = []
test_loss = []
model_norm = []

for epoch in tqdm(range(max_epochs)):

  #training
  model_norm.append(calculate_frobenius_norm(model))
  model.train()
  optimizer.zero_grad()
  predictions = model(X)
  loss = criterion(predictions,Y)
  loss.backward()
  optimizer.step()

  train_loss.append(loss.data)
  model.eval()
  Y_test = model(X_test)
  loss = criterion(Y_test,2*X_test)
  test_loss.append(loss.data)

  if (epoch % 500 == 0 or epoch == max_epochs - 1):
    running_predictions[:,iters] = Y_test[:,0,0].detach().numpy()
    iters += 1


model = BigAnimalNet()

args = {'test_batch_size': 1000,
        'epochs': 200,
        'lr': 5e-3,
        'momentum': 0.9,
        'no_cuda': False,
        }

val_acc_pure, train_acc_pure, _, model ,_ = main(args,
                                                 model,
                                                 train_loader,
                                                 val_loader,
                                                 img_test_dataset)

In [None]:
# train the network on toy dataset
model = Net()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)
max_epochs = 10000
iters = 0

running_predictions_dp = np.empty((40, (int)(max_epochs / 500)))

train_loss_dp = []
test_loss_dp = []
model_norm_dp = []

for epoch in tqdm(range(max_epochs)):

  # training
  model_norm_dp.append(calculate_frobenius_norm(model))
  model.train()
  optimizer.zero_grad()
  predictions = model(X)
  loss = criterion(predictions, Y)
  loss.backward()
  optimizer.step()

  train_loss_dp.append(loss.data)
  model.eval()
  Y_test = model(X_test)
  loss = criterion(Y_test, 2*X_test)
  test_loss_dp.append(loss.data)

  # store a snapshot every 500 epochs (20 snapshots in total)
  if epoch % 500 == 0:
    running_predictions_dp[:, iters] = Y_test[:, 0, 0].detach().numpy()
    iters += 1

Now that we have finished training, let's see how the model has evolved over the training process.

 #### Visualization (Run Me!)


In [None]:
#@markdown #### Visualization (Run Me!)
fig = plt.figure(figsize=(8, 6))
ax = plt.axes()
def frame(i):
    ax.clear()
    ax.scatter(X[:,0,:].numpy(),Y[:,0,:].numpy())
    plot = ax.plot(X_test[:,0,:].detach().numpy(),running_predictions_dp[:,i])
    title = "Epoch: " + str(i * 500)
    plt.title(title)
    ax.set_xlabel("X axis")
    ax.set_ylabel("Y axis")
    return plot
anim = animation.FuncAnimation(fig, frame, frames=range(20), blit=False, repeat=False, repeat_delay=10000)
html_anim = HTML(anim.to_html5_video());
plt.close()
display(html_anim)

 #### Plot the test losses


In [None]:
#@markdown #### Plot the test losses
plt.figure(figsize=(8, 6))
plt.plot(test_loss_dp,label='test_loss dropout',c = 'blue',ls='dashed')
plt.plot(test_loss, label='test_loss',c = 'red',ls='dashed')
plt.ylabel('loss')
plt.xlabel('epochs')
plt.title('dropout vs without dropout')
plt.legend()
plt.show()

 #### Plot the train losses


In [None]:
#@markdown #### Plot the train losses
plt.figure(figsize=(8, 6))
plt.plot(train_loss_dp,label='train_loss dropout',c = 'blue',ls='dashed')
plt.plot(train_loss, label='train_loss',c = 'red',ls='dashed')
plt.ylabel('loss')
plt.xlabel('epochs')
plt.title('dropout vs without dropout')
plt.legend()
plt.show()

 #### Plot model weights with epoch


In [None]:
#@markdown #### Plot model weights with epoch
plt.figure(figsize=(8, 6))
plt.plot(model_norm_dp, label = 'dropout')
plt.plot(model_norm, label = 'no dropout')
plt.ylabel('norm of the model')
plt.xlabel('epochs')
plt.legend()
plt.title('Size of the model vs Epochs')
plt.show()

Do you think this performed better than the initial model?

## Section 2.1: Dropout Implementation Caveats


*  Dropout is applied only during training. At test time the complete set of model weights is used, so it is important to call `model.eval()` before testing the model.

* Dropout reduces the capacity of the model during training, so as a general practice wider networks are used with dropout. If you use a dropout probability of 0.5, you might want to double the number of hidden neurons in that layer.
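
To make the first caveat concrete, here is a minimal sketch (not part of the original tutorial code) of how `nn.Dropout` behaves in the two modes. The names `drop` and `x` are illustrative. In training mode PyTorch zeroes each element with probability `p` and scales the survivors by `1 / (1 - p)`; in evaluation mode the layer passes its input through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)  # illustrative layer, p is the drop probability
x = torch.ones(1, 8)      # illustrative input of all ones

# Training mode: elements are randomly zeroed, survivors are scaled by 1/(1-p) = 2
drop.train()
y_train = drop(x)

# Evaluation mode: dropout is disabled and the input is returned as is
drop.eval()
y_eval = drop(x)

print("train mode:", y_train)
print("eval mode: ", y_eval)
```

Forgetting `model.eval()` means test predictions are computed with randomly masked activations, which makes the reported test loss noisy and usually worse.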

Now, let's see how dropout fares on the Animal Faces Dataset. We first modify the existing model to include dropout and then train the model.

In [None]:
# Network Class - Animal Faces
class AnimalNetDropout(nn.Module):
  def __init__(self):
    torch.manual_seed(32)
    super(AnimalNetDropout, self).__init__()
    self.fc1 = nn.Linear(3*32*32, 248)
    self.fc2 = nn.Linear(248, 210)
    self.fc3 = nn.Linear(210, 3)
    self.dropout1 = nn.Dropout(p=0.5)
    self.dropout2 = nn.Dropout(p=0.3)

  def forward(self, x):
    x = x.view(x.shape[0], -1)  # flatten each image into a vector
    x = F.leaky_relu(self.dropout1(self.fc1(x)))
    x = F.leaky_relu(self.dropout2(self.fc2(x)))
    x = self.fc3(x)
    output = F.log_softmax(x, dim=1)
    return output

In [None]:
args = {
    'test_batch_size': 1000,
    'epochs': 200,
    'lr': 5e-3,
    'batch_size': 32,
    'momentum': 0.9,
    'no_cuda': False,
    'seed': 1,
    'log_interval': 100
}

acc_dict = {}
model = AnimalNetDropout()

val_acc_dropout, train_acc_dropout, _, model ,_ = main(args,
                                                       model,
                                                       train_loader,
                                                       val_loader,
                                                       img_test_dataset)

# Train and validation accuracy plot

plt.plot(val_acc_pure, label='Val', c='blue', ls='dashed')
plt.plot(train_acc_pure, label='Train', c='blue', ls='solid')
plt.plot(val_acc_dropout, label='Val - DP', c='red', ls='dashed')
plt.plot(train_acc_dropout, label='Train - DP', c='red', ls='solid')
plt.title('Dropout')
plt.ylabel('Accuracy (%)')
plt.xlabel('Epoch')
plt.legend()
plt.show()

When do you think dropout can perform badly, and do you think its placement within a model matters?

---
# Section 3: Data Augmentation


##  Video 3: Data Augmentation


In [None]:
#@title Video 3: Data Augmentation
from ipywidgets import widgets

out2 = widgets.Output()
with out2:
  from IPython.display import IFrame
  class BiliVideo(IFrame):
    def __init__(self, id, page=1, width=400, height=300, **kwargs):
      self.id=id
      src = "https://player.bilibili.com/player.html?bvid={0}&page={1}".format(id, page)
      super(BiliVideo, self).__init__(src, width, height, **kwargs)

  video = BiliVideo(id=f"", width=730, height=410, fs=1)
  print("Video available at https://www.bilibili.com/video/{0}".format(video.id))
  display(video)

out1 = widgets.Output()
with out1:
  from IPython.display import YouTubeVideo
  video = YouTubeVideo(id=f"nm44FhjL3xc", width=730, height=410, fs=1, rel=0)
  print("Video available at https://youtube.com/watch?v=" + video.id)
  display(video)

out = widgets.Tab([out1, out2])
out.set_title(0, 'Youtube')
out.set_title(1, 'Bilibili')

display(out)

Data augmentation is often used to increase the number of training samples. Now we will explore the effects of data augmentation on regularization. Here, regularization is achieved by adding noise to the training data after every epoch.

PyTorch's torchvision module provides a few built-in data augmentation techniques, which we can use on image datasets. Some of the techniques we most frequently use are:


*   Random Crop
*   Random Rotate
*   Vertical Flip
*   Horizontal Flip



 ####  Data Loader without Data Augmentation


In [None]:
#@markdown ####  Data Loader without Data Augmentation
train_transform = transforms.Compose([
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
     ])
data_path = pathlib.Path('.')/'afhq' # using pathlib to be compatible with all OS's
img_dataset = ImageFolder(data_path/'train', transform=train_transform)

#Splitting dataset
img_train_data, img_val_data,_ = torch.utils.data.random_split(img_dataset, [250,100,14280])

#Creating train_loader and Val_loader
train_loader = torch.utils.data.DataLoader(img_train_data,batch_size=batch_size,worker_init_fn=seed_worker)
val_loader = torch.utils.data.DataLoader(img_val_data,batch_size=1000,worker_init_fn=seed_worker)

Define a DataLoader using [torchvision.transforms](https://pytorch.org/docs/stable/torchvision/transforms.html) which randomly augments the data for us. 

In [None]:
##Data Augmentation using transforms
new_transforms = transforms.Compose([
                                     transforms.RandomHorizontalFlip(p=0.1),
                                     transforms.RandomVerticalFlip(p=0.1),
                                     transforms.ToTensor(),
                                     transforms.Normalize((0.5, 0.5, 0.5),
                                                          (0.5, 0.5, 0.5))
])

data_path = pathlib.Path('.')/'afhq' # using pathlib to be compatible with all OS's
img_dataset = ImageFolder(data_path/'train', transform=new_transforms)
#Splitting dataset
new_train_data, _,_ = torch.utils.data.random_split(img_dataset,
                                                    [250, 100, 14280])

#Creating train_loader and Val_loader
new_train_loader = torch.utils.data.DataLoader(new_train_data,
                                               batch_size=batch_size,
                                               worker_init_fn=seed_worker)

In [None]:
args = {
    'epochs': 250,
    'lr': 1e-3,
    'momentum': 0.99,
    'no_cuda': False,
}


acc_dict = {}
model = AnimalNet()

val_acc_dataaug, train_acc_dataaug, param_norm_datadug, _ ,_ = main(args,
                                                                    model,
                                                                    new_train_loader,
                                                                    val_loader,
                                                                    img_test_dataset)
model = AnimalNet()
val_acc_pure, train_acc_pure, param_norm_pure, _, _ = main(args,
                                                           model,
                                                           train_loader,
                                                           val_loader,
                                                           img_test_dataset)


# Train and validation accuracy plot
plt.figure(figsize=(8, 6))
plt.plot(val_acc_pure, label='Val Accuracy Pure',
         c='red', ls='dashed')
plt.plot(train_acc_pure, label='Train Accuracy Pure',
         c='red', ls='solid')
plt.plot(val_acc_dataaug, label='Val Accuracy data augment',
         c='blue', ls='dashed')
plt.plot(train_acc_dataaug, label='Train Accuracy data augment',
         c='blue', ls='solid')
plt.axhline(y=max(val_acc_pure), c='red', ls='dashed')
plt.axhline(y=max(val_acc_dataaug), c='blue', ls='dashed')
plt.title('Data Augmentation')
plt.ylabel('Accuracy (%)')
plt.xlabel('Epoch')
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
plt.plot(param_norm_pure, c='red', label='Without Augmentation')
plt.plot(param_norm_datadug, c='blue', label='With Augmentation')
plt.title('Norm of parameters as a function of training epoch')
plt.xlabel('epoch')
plt.ylabel('Norm of model parameters')
plt.legend()
plt.show()

Can you think of more ways of augmenting training data? (Think of other problems beyond object recognition.)

### Think! 3.1: Thought Question
Why does it work better to regularize an overparameterized ANN than to start with a smaller one? Think about the regularization methods you know.
Each group has 10 minutes for discussion.

---
# Section 4: Stochastic Gradient Descent


##  Video 4: SGD


In [None]:
#@title Video 4: SGD
from ipywidgets import widgets

out2 = widgets.Output()
with out2:
  from IPython.display import IFrame
  class BiliVideo(IFrame):
    def __init__(self, id, page=1, width=400, height=300, **kwargs):
      self.id=id
      src = "https://player.bilibili.com/player.html?bvid={0}&page={1}".format(id, page)
      super(BiliVideo, self).__init__(src, width, height, **kwargs)

  video = BiliVideo(id=f"", width=730, height=410, fs=1)
  print("Video available at https://www.bilibili.com/video/{0}".format(video.id))
  display(video)

out1 = widgets.Output()
with out1:
  from IPython.display import YouTubeVideo
  video = YouTubeVideo(id=f"rjzlFvJhNqE", width=730, height=410, fs=1, rel=0)
  print("Video available at https://youtube.com/watch?v=" + video.id)
  display(video)

out = widgets.Tab([out1, out2])
out.set_title(0, 'Youtube')
out.set_title(1, 'Bilibili')

display(out)

## Section 4.1: Learning Rate
In this section, we will see how the learning rate can act as a regularizer while training a neural network. In summary:


*   Smaller learning rates regularize less. They slowly converge to deep minima.
*   Larger learning rates regularize more by missing local minima and converging to broader, flatter minima, which often generalize better.

But beware, a very large learning rate may result in overshooting or finding a really bad local minimum.
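
One way to build intuition for these points is a toy quadratic loss `f(x) = 0.5 * a * x**2`, where the curvature `a` controls how sharp the minimum is. Plain gradient descent then updates `x` to `(1 - lr * a) * x`, so it converges only when `lr < 2 / a`. The sketch below is an illustration under that simplifying assumption, not the AnimalNet experiment; `gradient_descent`, `sharp`, and `flat` are hypothetical names.

```python
def gradient_descent(curvature, lr, x0=1.0, steps=50):
    """Run plain gradient descent on f(x) = 0.5 * curvature * x**2."""
    x = x0
    for _ in range(steps):
        grad = curvature * x  # derivative of the quadratic loss
        x = x - lr * grad
    return x

sharp, flat = 100.0, 1.0           # curvature of a sharp and a flat minimum
small_lr, large_lr = 0.005, 0.05   # stability limits are 2/100 = 0.02 and 2/1 = 2

for name, a in [("sharp", sharp), ("flat", flat)]:
    for lr in (small_lr, large_lr):
        x_final = gradient_descent(a, lr)
        print(f"{name} minimum, lr={lr}: |x| after 50 steps = {abs(x_final):.3e}")
```

The small learning rate settles into the sharp minimum but crawls in the flat one, while the large learning rate is thrown out of the sharp minimum and still converges in the flat one. Pushing the learning rate past `2 / a` for the flat basin as well would make every run diverge, which is the overshooting case mentioned above.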



In the block below, we will train the Animal Net model with different learning rates and see how that affects the regularization.

 #### Generating Data Loaders


In [None]:
#@markdown #### Generating Data Loaders
batch_size = 128
train_transform = transforms.Compose([
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
     ])

data_path = pathlib.Path('.')/'afhq' # using pathlib to be compatible with all OS's
img_dataset = ImageFolder(data_path/'train', transform=train_transform)
img_train_data, img_val_data = torch.utils.data.random_split(img_dataset, [11700, 2930])

full_train_loader = torch.utils.data.DataLoader(img_train_data,
                                                batch_size=batch_size,
                                                num_workers=2,
                                                worker_init_fn=seed_worker,
                                                generator=g)
full_val_loader = torch.utils.data.DataLoader(img_val_data,
                                              batch_size=1000,
                                              num_workers=4,
                                              worker_init_fn=seed_worker,
                                              generator=g)

test_transform = transforms.Compose([
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))    # [TO-DO]
     ])
img_test_dataset = ImageFolder(data_path/'val', transform=test_transform)
# img_test_loader = DataLoader(img_test_dataset, batch_size=batch_size,shuffle=False, num_workers=1)
classes = ('cat', 'dog', 'wild')

In [None]:
args = {
    'test_batch_size': 1000,
    'epochs': 350,
    'batch_size': 32,
    'momentum': 0.99,
    'no_cuda': False
}

lr = [5e-4, 1e-3, 5e-3]
acc_dict = {}

for i in range(len(lr)):
    model = AnimalNet()
    args['lr'] = lr[i]
    val_acc, train_acc, param_norm,_,_ = main(args,
                                              model,
                                              train_loader,
                                              val_loader,
                                              img_test_dataset)
    acc_dict['val_'+str(i)] = val_acc
    acc_dict['train_'+str(i)] = train_acc
    acc_dict['param_norm'+str(i)] = param_norm

 #### Plot Train and Validation accuracy (Run me)


In [None]:
#@markdown #### Plot Train and Validation accuracy (Run me)
plt.figure(figsize=(8, 6))
plt.plot(acc_dict['val_0'], linestyle='dashed',label='lr = 5e-4 - validation', c = 'blue')
plt.plot(acc_dict['train_0'],label = '5e-4 - train', c = 'blue')
plt.plot(acc_dict['val_1'], linestyle='dashed',label='lr = 1e-3 - validation', c = 'green')
plt.plot(acc_dict['train_1'],label='1e-3 - train', c = 'green')
plt.plot(acc_dict['val_2'], linestyle='dashed',label='lr = 5e-3 - validation', c = 'purple')
plt.plot(acc_dict['train_2'],label = '5e-3 - train', c = 'purple')
plt.title('Optimal Learning Rate')
plt.ylabel('Accuracy (%)')
plt.xlabel('Epoch')
print('Maximum Validation Accuracy obtained with lr = 5e-4: '+str(max(acc_dict['val_0'])))
print('Maximum Validation Accuracy obtained with lr = 1e-3: '+str(max(acc_dict['val_1'])))
print('Maximum Validation Accuracy obtained with lr = 5e-3: '+str(max(acc_dict['val_2'])))
plt.legend()
plt.show()

 #### Plot parametric norms (Run me)


In [None]:
#@markdown #### Plot parametric norms (Run me)
plt.figure(figsize=(8, 6))
plt.plot(acc_dict['param_norm0'],label='lr = 5e-4',c='blue')
plt.plot(acc_dict['param_norm1'],label = 'lr = 1e-3',c='green')
plt.plot(acc_dict['param_norm2'],label ='lr = 5e-3', c='red')
plt.legend()
plt.xlabel('epoch')
plt.ylabel('parameter norms')
plt.show()

In the model above, we observe something different from what we expected. Why do you think this is happening?

---
# Section 5: Hyperparameter Tuning


##  Video 5: Hyperparameter tuning


In [None]:
#@title Video 5: Hyperparameter tuning
from ipywidgets import widgets

out2 = widgets.Output()
with out2:
  from IPython.display import IFrame
  class BiliVideo(IFrame):
    def __init__(self, id, page=1, width=400, height=300, **kwargs):
      self.id=id
      src = "https://player.bilibili.com/player.html?bvid={0}&page={1}".format(id, page)
      super(BiliVideo, self).__init__(src, width, height, **kwargs)

  video = BiliVideo(id=f"", width=730, height=410, fs=1)
  print("Video available at https://www.bilibili.com/video/{0}".format(video.id))
  display(video)

out1 = widgets.Output()
with out1:
  from IPython.display import YouTubeVideo
  video = YouTubeVideo(id=f"HgkiKRYc-3A", width=730, height=410, fs=1, rel=0)
  print("Video available at https://youtube.com/watch?v=" + video.id)
  display(video)

out = widgets.Tab([out1, out2])
out.set_title(0, 'Youtube')
out.set_title(1, 'Bilibili')

display(out)



Hyperparameter tuning is often difficult and time consuming. It is a key part of training any deep learning model that generalizes well. There are a few techniques that we can use to guide us during the search.



*   Grid Search: Try all combinations of a predefined set of values for each hyperparameter
*   Random Search: Randomly try different combinations of hyperparameters
*   Coordinate-wise Gradient Descent: Start at one set of hyperparameters and try changing one at a time, accepting any change that reduces your validation error
*   Bayesian Optimization/AutoML: Start from a set of hyperparameters that have worked well on a similar problem, and then do some sort of local exploration (e.g. gradient descent) from there.
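
The first two strategies in the list can be sketched in a few lines. In the block below, `validation_error` is a hypothetical stand-in for "train a model with these hyperparameters and return its validation error"; its formula, the candidate values, and the search ranges are all assumptions chosen only so the example runs instantly.

```python
import itertools
import random

def validation_error(lr, dropout):
    """Hypothetical proxy for training a model and measuring validation error."""
    return (lr - 1e-3) ** 2 * 1e4 + (dropout - 0.3) ** 2

def grid_search(lrs, dropouts):
    """Evaluate every combination on the grid and return the best one."""
    candidates = itertools.product(lrs, dropouts)
    return min(candidates, key=lambda hp: validation_error(*hp))

def random_search(n_trials, seed=0):
    """Sample hyperparameters at random and return the best sample."""
    rng = random.Random(seed)
    trials = [
        (10 ** rng.uniform(-4, -2),  # learning rate, sampled on a log scale
         rng.uniform(0.0, 0.6))      # dropout probability
        for _ in range(n_trials)
    ]
    return min(trials, key=lambda hp: validation_error(*hp))

best_grid = grid_search(lrs=[5e-4, 1e-3, 5e-3], dropouts=[0.0, 0.3, 0.5])
best_random = random_search(n_trials=9)

print("grid search best (lr, dropout):  ", best_grid)
print("random search best (lr, dropout):", best_random)
```

Both searches use the same budget of nine evaluations here. Grid search can only return values that were placed on the grid, whereas random search tries nine distinct values of each hyperparameter, which tends to help when only a few hyperparameters really matter.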

There are lots of choices, like what range to explore over, which parameter to optimize first, etc. Some hyperparameters don’t matter much (people use a dropout of either 0.5 or 0, but not much else). Others can matter a lot more (e.g. size and depth of the neural net). The key is to see what worked on similar problems.

One can automate the process of tuning the network architecture using "Neural Architecture Search" (NAS), which designs new architectures from a few building blocks (linear layers, convolutional layers, etc.) and optimizes the design based on performance, using a wide range of techniques such as grid search, reinforcement learning, gradient descent, and evolutionary algorithms. This requires a great deal of computing power. Read this [article](https://lilianweng.github.io/lil-log/2020/08/06/neural-architecture-search.html) to learn more about NAS.


Which regularization technique today do you think had the biggest effect on the network? Why do you think so? Can you apply all of the regularization methods on the same network?

---
# Section 6: Adversarial Attacks


##  Video 6: Adversarial Attacks


In [None]:
#@title Video 6: Adversarial Attacks
from ipywidgets import widgets

out2 = widgets.Output()
with out2:
  from IPython.display import IFrame
  class BiliVideo(IFrame):
    def __init__(self, id, page=1, width=400, height=300, **kwargs):
      self.id=id
      src = "https://player.bilibili.com/player.html?bvid={0}&page={1}".format(id, page)
      super(BiliVideo, self).__init__(src, width, height, **kwargs)

  video = BiliVideo(id=f"", width=730, height=410, fs=1)
  print("Video available at https://www.bilibili.com/video/{0}".format(video.id))
  display(video)

out1 = widgets.Output()
with out1:
  from IPython.display import YouTubeVideo
  video = YouTubeVideo(id=f"LzPPoiKi5jE", width=730, height=410, fs=1, rel=0)
  print("Video available at https://youtube.com/watch?v=" + video.id)
  display(video)

out = widgets.Tab([out1, out2])
out.set_title(0, 'Youtube')
out.set_title(1, 'Bilibili')

display(out)

Designing perturbations to the input data to trick a machine learning model is called an "adversarial attack". These attacks are an inevitable consequence of learning in high dimensional space with complex decision boundaries. Depending on the application, these attacks can be very dangerous.

![Adversarial Examples of a Stop Sign](https://media.springernature.com/lw685/springer-static/image/art%3A10.1186%2Fs13638-020-01775-5/MediaObjects/13638_2020_1775_Fig1_HTML.png?as=webp)
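
As a minimal sketch of how such a perturbation can be designed, the block below uses the fast gradient sign method (FGSM), one well-known attack that is chosen here for illustration and is not necessarily the one discussed in the video. The tiny linear `model`, the random input `x`, the label `y`, and the budget `epsilon` are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

model = nn.Linear(4, 2)   # hypothetical stand-in for a trained classifier
x = torch.randn(1, 4)     # hypothetical clean input
y = torch.tensor([0])     # hypothetical true label

def fgsm(model, x, y, epsilon):
    """Move every input feature by epsilon in the direction that increases the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()  # gradient of the loss with respect to the input
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

x_adv = fgsm(model, x, y, epsilon=0.1)

with torch.no_grad():
    loss_clean = F.cross_entropy(model(x), y).item()
    loss_adv = F.cross_entropy(model(x_adv), y).item()

print("largest change to any input feature:", (x_adv - x).abs().max().item())
print("loss on clean input:      ", loss_clean)
print("loss on adversarial input:", loss_adv)
```

No feature moves by more than `epsilon`, yet the loss goes up. In a high-dimensional input such as an image, many such small changes add up, which is why perturbations that are barely visible can flip a prediction.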

Hence, it is important for us to build models which can defend against such attacks. One possible way to do this is by regularizing the networks, which smooths the decision boundaries. A few ways of building models robust to such attacks are:



*   [Defensive Distillation](https://deepai.org/machine-learning-glossary-and-terms/defensive-distillation): Models trained via distillation are less prone to such attacks because they are trained on soft labels and there is an element of randomness in the training process.
*   [Feature Squeezing](https://evademl.org/squeezing/): Identifies adversarial attacks on deployed online classifiers by comparing the model's prediction before and after squeezing the input.
*   [SGD](https://arxiv.org/abs/1706.06083): You can also pick weights that minimize what the adversary is trying to maximize via SGD.

In the optional supplemental project, you can design an attack and defend your model against it using regularization techniques you learned this week. 


---
# Optional Supplements

1.   [Understanding Generalization](https://docs.google.com/document/d/1XOaTXYBleQlDNFM1-t512RHfJXRwA4-LIejuBA6pbLY/edit)
2.   [Adversarial Attacks](https://)