# Lecture 5: Overfitting and Dropout


This notebook contains both theoretical and coding exercises. The number of points each exercise is worth is stated in its task description.

Always use the Jupyter kernel associated with the dedicated environment when running the notebook and completing your exercises.

## Exercise 1 (Theory) (20/100)

### Bias Variance Tradeoff

Consider the squared loss function commonly used in regression problems, defined as $L(y, \hat{y}) = [(y + \epsilon) - \hat{y}]^2$, where $y$ is the true target value, $\epsilon$ is random noise, and $\hat{y}$ is the value predicted by the model.

- **Task (1.a)** **(10 pts.)** Derive the decomposition of the expected squared error $\mathrm{Err}=E[(y + \epsilon - \hat{y})^2]$ into bias, variance, and irreducible error terms (refer to the lecture for more details).
- **Task (1.b)** **(10 pts.)** Discuss this decomposition in light of the bias-variance tradeoff in machine learning models and in the context of overfitting. What can you say about each individual term in the final expression you obtain?
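
Once you have a derivation, a numerical check can help confirm it. The following is only a sketch under assumptions that are not part of the exercise: the target function `f`, the polynomial model, the fixed training grid, the noise level `sigma`, and the evaluation point `x0` are all hypothetical choices made for illustration. It repeatedly refits a polynomial on freshly drawn noisy targets and compares the empirical expected squared error at `x0` with the sum of the estimated squared bias, variance, and noise variance.

```python
import numpy as np


def f(x):
    # Hypothetical ground-truth function (an assumption for this sketch).
    return np.sin(2 * np.pi * x)


def simulate(degree, n_train=30, n_repeats=2000, sigma=0.3, x0=0.25, seed=0):
    """Estimate error, bias^2, variance and noise variance at the point x0."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_train)  # fixed training inputs
    preds = np.empty(n_repeats)
    sq_err = np.empty(n_repeats)
    for r in range(n_repeats):
        # Draw a new noisy training set and refit the polynomial model.
        t = f(x) + rng.normal(0.0, sigma, n_train)
        preds[r] = np.polyval(np.polyfit(x, t, degree), x0)
        # Compare the prediction with a fresh noisy observation y + epsilon.
        noisy_target = f(x0) + rng.normal(0.0, sigma)
        sq_err[r] = (noisy_target - preds[r]) ** 2
    bias_sq = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    noise = sigma ** 2
    return sq_err.mean(), bias_sq, variance, noise


for degree in (1, 5):
    err, bias_sq, variance, noise = simulate(degree)
    print(f"degree={degree}: Err={err:.4f}, "
          f"bias^2 + variance + noise={bias_sq + variance + noise:.4f}")
```

Comparing the two polynomial degrees shows how the individual terms shift as model capacity grows, which may be useful when discussing Task (1.b).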


> #### Your solution here

## Exercise 2 (Theory) (15/100)

### Bias Variance Tradeoff (part 2)

Consider a regression problem where you have a dataset consisting of $n$ data points where each point consists of a pair $(\mathbf{x}, y)$ with $y=f(\mathbf{x})$. You decide to fit a linear regression model to this dataset. After training the model, you evaluate its performance using mean squared error (MSE). You notice that the model has a high MSE on both the training and test datasets.

- **Task (2.a)** **(5 pts.)** Explain whether the model is suffering from high bias, high variance, or both.
- **Task (2.b)** **(5 pts.)** Propose one approach to decrease the bias and one approach to decrease the variance of the model.
- **Task (2.c)** **(5 pts.)** Provide a justification for each proposed approach.

> #### Your solution here

## Exercise 1 (Programming) (65/100)

### Dropout

The code below loads the CIFAR10 dataset and splits it into train and test datasets.

In [None]:
import torch
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np

transform = transforms.Compose([
  transforms.ToTensor(),
  transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

batch_size = 64


train_dataset = torchvision.datasets.CIFAR10(root="./data", train=True, 
                                             download=True, transform=transform)
                                             
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size,
                                           shuffle=True, num_workers=2)

test_dataset = torchvision.datasets.CIFAR10(root="./data", train=False, 
                                             download=True, transform=transform)
                                             
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size,
                                           shuffle=False, num_workers=2)
                                           
# the CIFAR10 classes
classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')


### **Task (1.a)** **(10 pts.)** 
Drawing on the experience you gained in the previous exercise sheets, implement a neural network, choosing the number of layers, activation functions, etc. freely, based on your intuition.

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # ----------------------------------
        #  Your code here
        # ----------------------------------

    def forward(self, x):
        # ----------------------------------
        #  Your code here
        # ----------------------------------
        pass

net = Net()

### **Task (1.b)** **(10 pts.)** 
For the next task, we want to use the cross-entropy loss as the objective function. Implement the cross-entropy loss from scratch. This should take predicted values and true values as input. 

In [None]:
def my_cross_entropy(y_pred, y_true):
    # ----------------------------------
    #  Your code here
    # ----------------------------------
    pass
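
One possible way to validate your implementation is to compare it against PyTorch's built-in loss on random inputs. The sketch below assumes, and this is an assumption since the task does not fix the interface, that `y_pred` holds raw logits of shape `(batch, classes)`, that `y_true` holds integer class indices, and that the loss is averaged over the batch. The helper `agrees_with_reference` and the function `stand_in` are hypothetical names. `stand_in` is built from library functions only to demonstrate the check, so it is not a from-scratch solution.

```python
import torch
import torch.nn.functional as F


def agrees_with_reference(loss_fn, n_samples=16, n_classes=10, atol=1e-5, seed=0):
    """Return True if loss_fn matches F.cross_entropy on random logits."""
    gen = torch.Generator().manual_seed(seed)
    logits = torch.randn(n_samples, n_classes, generator=gen)
    targets = torch.randint(0, n_classes, (n_samples,), generator=gen)
    expected = F.cross_entropy(logits, targets)  # mean reduction by default
    actual = loss_fn(logits, targets)
    return torch.allclose(actual, expected, atol=atol)


def stand_in(y_pred, y_true):
    # Placeholder built from library calls; replace with my_cross_entropy.
    return F.nll_loss(F.log_softmax(y_pred, dim=1), y_true)


print(agrees_with_reference(stand_in))
```

After completing the task, calling `agrees_with_reference(my_cross_entropy)` gives a quick consistency check, provided your function follows the same input conventions.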

### **Task (1.c)** **(20 pts.)** 
The cell below gives you an implementation for computing the loss on the test data. Implement an appropriate training loop and train the neural network for a sufficient number of epochs. How many training steps you can run will likely depend on your laptop/device.

Print the train and test loss after every $n$ steps. In the cell below $n$ is set to 1000 but you are free to change it to any value you think is informative to appreciate the change in the loss. 

> Note: More experienced students can use a GPU (if available) for faster computation. Check out how to do this with the `.to(device)` method in PyTorch.
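
As a minimal sketch of the device-placement pattern mentioned in the note: the model and every tensor it consumes must live on the same device. The small linear `model` and the random tensors below are hypothetical stand-ins for your own `net` and for the batches from `train_loader`. The code falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Pick the GPU if one is available, otherwise stay on the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Toy stand-in for `net`; .to(device) moves the parameters in place.
model = nn.Linear(4, 2).to(device)

# Tensors are not moved in place: .to(device) returns a new tensor.
inputs = torch.randn(8, 4).to(device)
labels = torch.randint(0, 2, (8,)).to(device)

outputs = model(inputs)
loss = F.cross_entropy(outputs, labels)
print(outputs.device, loss.item())
```

In a training loop, the same two `.to(device)` calls would be applied to each batch of inputs and labels before the forward pass.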

In [None]:
import torch.optim as optim

optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

def get_test_loss(net, loss_criterion, data_loader):
  testing_loss = []
  with torch.no_grad():
    for data in data_loader:
      inputs, labels = data
      outputs = net(inputs)
      # calculate the loss for this batch
      loss = loss_criterion(outputs, labels)
      # add the loss of this batch to the list
      testing_loss.append(loss.item())
  # calculate the average loss
  return sum(testing_loss) / len(testing_loss)

In [None]:
training_loss, testing_loss = [], []
running_loss = []
i = 0

## To run on GPUs
# device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
# 
# then call any_tensor = any_tensor.to(device)

for epoch in range(150): # 150 epochs
  for data in train_loader:
    # TODO: get the data 

    # TODO: forward pass

    # TODO: backward pass

    # TODO: update the parameters (optimizer step)

    if i % 1000 == 0:
      # TODO: store and print avg_train_loss and avg_test_loss every 1000 steps.
      pass
    i += 1  # count training steps


### **Task (1.d)** **(5 pts.)** 
What are the final values of the train and test loss that you achieved? What do you observe, e.g., regarding overfitting? Explain in your own words (no implementation required) how one could, in principle, improve either loss or both, for example by changing the network structure, using regularization, splitting the data differently, or tuning hyperparameters.

### **Task (1.e)** **(20 pts.)**

**Task (1.e.1)** **(15 pts.)** PyTorch provides the class `torch.nn.Dropout`, which implements dropout, an efficient regularization technique for reducing overfitting. You can refer to the [PyTorch documentation](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html) for more information about it.
Using the dropout class, modify the neural network you implemented above to incorporate dropout into its structure. You will have to do some hyperparameter tuning, e.g., of the dropout probability and the number of dropout layers, to identify the best setup. Once you have done that, train the modified network again, as you did above.
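
To get a feel for what the layer does, the following minimal sketch applies `nn.Dropout` to a vector of ones in training mode and in evaluation mode. The probability `p=0.5` and the input size are arbitrary choices for illustration. In training mode each entry is zeroed with probability `p` and the surviving entries are scaled by `1 / (1 - p)`. In evaluation mode the layer passes its input through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()             # training mode: dropout is active
y_train = drop(x)        # entries are either 0 or 1 / (1 - p) = 2

drop.eval()              # evaluation mode: dropout is disabled
y_eval = drop(x)         # identical to x

print("values seen in training mode:", y_train.unique().tolist())
print("fraction zeroed:", (y_train == 0).float().mean().item())
print("unchanged in eval mode:", torch.equal(y_eval, x))
```

This is why it matters to call `net_dropout.train()` before training and `net_dropout.eval()` before computing the test loss.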

**Task (1.e.2)** **(5 pts.)** What are the train and test losses now? What can you observe? Try to plot the losses as a function of epochs/training steps using `matplotlib` or `seaborn` to support your argument. 

In [None]:
class NetDropout(nn.Module):
    def __init__(self):
        super().__init__()
        # ----------------------------------
        #  Your code here
        # ----------------------------------

    def forward(self, x):
        # ----------------------------------
        #  Your code here
        # ----------------------------------
        pass

net_dropout = NetDropout()

Train the new `net_dropout` model.

In [None]:
# ----------------------------------
#  Your code here
# ----------------------------------

Plot the losses as a function of training steps.
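
One way to structure the plot is sketched below. The helper `plot_losses` is a hypothetical name, and it assumes that the train and test losses were recorded in two lists of equal length, one entry every `log_every` training steps, as in the `training_loss` and `testing_loss` lists initialized earlier.

```python
import matplotlib.pyplot as plt


def plot_losses(train_losses, test_losses, log_every=1000):
    """Plot train and test loss against the training step they were logged at."""
    # Entry k of each list is assumed to be recorded at step k * log_every.
    steps = [k * log_every for k in range(len(train_losses))]
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(steps, train_losses, label="train")
    ax.plot(steps, test_losses, label="test")
    ax.set_xlabel("training step")
    ax.set_ylabel("cross-entropy loss")
    ax.legend()
    return fig, ax
```

After training, a call such as `plot_losses(training_loss, testing_loss, log_every=1000)` draws both curves on the same axes, which makes a widening gap between train and test loss easy to spot.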

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# ----------------------------------
#  Your code here
# ----------------------------------