# Lecture 5: Overfitting and Dropout
Benedikt Auer, Paul Ludwig, Jannik Niebling 

In this notebook, you'll find various tasks encompassing both theoretical and coding exercises. Each exercise corresponds to a specific number of points, which are explicitly indicated within the task description.

Always use the Jupyter kernel associated with the dedicated environment when running the notebook and completing your exercises.

## Exercise 1 (Theory) (20/100)

### Bias Variance Tradeoff

Consider the squared loss function commonly used in regression problems, defined as $L(y, \hat{y}) = [(y + \epsilon) - \hat{y}]^2$, where $y$ is the true target value, $\epsilon$ is random noise, and $\hat{y}$ is the value predicted by the model.

- **Task (1.a)** **(10 pts.)** Derive the decomposition of the expected squared error $\mathrm{Err}=E[(y + \epsilon - \hat{y})^2]$ into bias, variance, and irreducible error terms (refer to the lecture for more details).
- **Task (1.b)** **(10 pts.)** Discuss this decomposition in light of the bias-variance tradeoff in machine learning models and in the context of overfitting. What can you say about each individual term in the final expression you obtain?
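
As a numerical sanity check of the decomposition asked for in Task (1.a), the following sketch estimates squared bias, variance, and noise variance by simulation and compares their sum with a direct estimate of $\mathrm{Err}$. Everything in it is an assumption made for illustration: the ground-truth function `f`, the noise level `SIGMA`, the evaluation point `X0`, and the deliberately simple model (predicting the mean of the training targets). It also assumes zero-mean noise that is independent of the training data.

```python
import random
import statistics

random.seed(0)

def f(x):
    return x * x              # hypothetical ground-truth function

SIGMA = 0.3                   # std of the noise epsilon (assumed zero-mean)
X0 = 0.9                      # point at which the error is evaluated
N_TRAIN = 20                  # size of each simulated training set
N_REPEATS = 20000             # number of simulated training sets

def fit_constant_model():
    """Train a very simple model: predict the mean of the noisy training targets."""
    xs = [random.uniform(0.0, 1.0) for _ in range(N_TRAIN)]
    ys = [f(x) + random.gauss(0.0, SIGMA) for x in xs]
    return statistics.fmean(ys)

# One prediction y_hat per independently drawn training set
preds = [fit_constant_model() for _ in range(N_REPEATS)]
# Independent noisy observations y + epsilon at X0
noisy_targets = [f(X0) + random.gauss(0.0, SIGMA) for _ in range(N_REPEATS)]

bias_sq = (statistics.fmean(preds) - f(X0)) ** 2
variance = statistics.pvariance(preds)
noise = SIGMA ** 2

err_direct = statistics.fmean([(t - p) ** 2 for t, p in zip(noisy_targets, preds)])
err_decomposed = bias_sq + variance + noise

print(f"bias^2={bias_sq:.4f} variance={variance:.4f} noise={noise:.4f}")
print(f"direct estimate={err_direct:.4f} sum of terms={err_decomposed:.4f}")
```

The constant model has a large bias and a small variance here; a more flexible model would typically shift weight from the first term to the second, while the noise term stays the same.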


> #### Your solution here

## Exercise 2 (Theory) (15/100)

### Bias Variance Tradeoff (part 2)

Consider a regression problem where you have a dataset consisting of $n$ data points where each point consists of a pair $(\mathbf{x}, y)$ with $y=f(\mathbf{x})$. You decide to fit a linear regression model to this dataset. After training the model, you evaluate its performance using mean squared error (MSE). You notice that the model has a high MSE on both the training and test datasets.

- **Task (2.a)** **(5 pts.)** Explain whether the model is suffering from high bias, high variance, or both.
- **Task (2.b)** **(5 pts.)** Propose one approach to decrease the bias and one approach to decrease the variance of the model.
- **Task (2.c)** **(5 pts.)** Provide a justification for each proposed approach.
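
To make the scenario concrete, here is a small sketch of a linear model fitted to noise-free data from a nonlinear function. The target `f(x) = x**2`, the sample sizes, and the helper names `fit_line` and `mse` are assumptions chosen for illustration only; they are not part of the exercise.

```python
import random
import statistics

random.seed(0)

def f(x):
    return x * x  # hypothetical nonlinear ground truth, y = f(x) without noise

def fit_line(xs, ys):
    """Ordinary least squares for y ~ a * x + b (closed form, one feature)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def mse(params, xs, ys):
    a, b = params
    return statistics.fmean([(a * x + b - y) ** 2 for x, y in zip(xs, ys)])

x_train = [random.uniform(-1, 1) for _ in range(200)]
x_test = [random.uniform(-1, 1) for _ in range(200)]
y_train = [f(x) for x in x_train]
y_test = [f(x) for x in x_test]

# Plain linear model on x: the error is large on both splits.
line = fit_line(x_train, y_train)
print("linear    train/test MSE:", mse(line, x_train, y_train), mse(line, x_test, y_test))

# Same fitting routine on the feature z = x**2: the error vanishes on both splits.
z_train = [x * x for x in x_train]
z_test = [x * x for x in x_test]
quad = fit_line(z_train, y_train)
print("quadratic train/test MSE:", mse(quad, z_train, y_train), mse(quad, z_test, y_test))
```

In this toy setting the train and test errors of the linear model are similar to each other, which is the pattern described in the exercise text.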

> #### Your solution here

## Exercise 1 (Programming) (65/100)

### Dropout

The code below loads the CIFAR10 dataset and splits it into train and test datasets.

In [2]:
import torch
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np

transform = transforms.Compose([
  transforms.ToTensor(),
  transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

batch_size = 64


train_dataset = torchvision.datasets.CIFAR10(root="./data", train=True, 
                                             download=True, transform=transform)
                                             
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size,
                                           shuffle=True, num_workers=2)

test_dataset = torchvision.datasets.CIFAR10(root="./data", train=False, 
                                             download=True, transform=transform)
                                             
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size,
                                           shuffle=False, num_workers=2)
                                           
# the CIFAR10 classes
classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')


Files already downloaded and verified
Files already downloaded and verified


### **Task (1.a)** **(10 pts.)** 
From the experience you gained in the previous exercise sheets, implement a neural network, choosing the number of layers, activation functions etc. freely (based on your intuition).

Before implementing the NN, we take a look at the data:

In [15]:
print(train_dataset[0][0].size())
print(train_dataset[0][1])

torch.Size([3, 32, 32])
6


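As we can see, each picture has a resolution of 32x32 pixels; the leading 3 corresponds to the three colour channels. The second line is the label, i.e. the class of the picture. It is a scalar class index (not a one-hot vector), which is important for the implementation of the cross-entropy.
Since we are working in the field of image recognition, we use convolutional and pooling layers. Since it is a classification task, the network outputs one score per class. A softmax at the end is only needed when the loss expects probabilities, as our own `my_cross_entropy` does; `nn.CrossEntropyLoss` applies the (log-)softmax internally, so for it the network returns the raw scores.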
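
The input size `16 * 5 * 5` of the first fully connected layer in the network below follows from the image resolution and the layer settings. A small sketch to check it, assuming no padding, stride 1 for the convolutions, and stride 2 for the pooling (the helper `conv_out` is a hypothetical name, not a PyTorch function):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32                                   # CIFAR10 images are 32x32
size = conv_out(size, kernel=5)             # Conv2d(3, 6, 5)
size = conv_out(size, kernel=2, stride=2)   # MaxPool2d(2, 2)
size = conv_out(size, kernel=5)             # Conv2d(6, 16, 5)
size = conv_out(size, kernel=2, stride=2)   # MaxPool2d(2, 2)

flat_features = 16 * size * size            # 16 output channels of the second convolution
print(size, flat_features)
```

The spatial size shrinks as 32, 28, 14, 10, 5, so each image is flattened to 400 features.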

In [4]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.Conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.Conv2 = nn.Conv2d(6, 16, 5)
        # note: ** is exponentiation in Python; ^ is bitwise XOR (2^7 == 5)
        self.lin1 = nn.Linear(16 * 5 * 5, 2**7)
        self.lin2 = nn.Linear(2**7, 2**6)
        self.lin3 = nn.Linear(2**6, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.Conv1(x)))  # (3, 32, 32) -> (6, 14, 14)
        x = self.pool(F.relu(self.Conv2(x)))  # (6, 14, 14) -> (16, 5, 5)
        x = x.view(-1, 16 * 5 * 5)            # flattens each image to a vector
        x = F.relu(self.lin1(x))
        x = F.relu(self.lin2(x))
        x = self.lin3(x)                      # raw class scores (logits)
        # x = F.softmax(x, dim=1)  # only needed when using my_cross_entropy
        return x

net = Net()

### **Task (1.b)** **(10 pts.)** 
For the next task, we want to use the cross-entropy loss as the objective function. Implement the cross-entropy loss from scratch. This should take predicted values and true values as input. 

We implement it as given in the lecture, i.e. between two probability distributions. 

In [54]:
def my_cross_entropy(y_pred, y_true):
    log_preds = torch.log(y_pred)
    # gather selects, for every sample, the log-probability predicted for its true class
    targets_log_probs = log_preds.gather(1, y_true.unsqueeze(1))
    loss = -targets_log_probs.sum()
    return loss


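We test our implementation against the built-in function.
The example predictions are treated as probabilities (i.e. as if they came from a softmax layer; the rows only approximately sum to 1, which does not matter for this comparison), so the built-in `nn.CrossEntropyLoss()`, which expects unnormalized scores, can't be used directly. Instead, the `NLLLoss()` function is used. This so-called negative log-likelihood loss essentially picks the negated input value at the true class for each sample and, with `reduction='sum'`, sums these values over the batch. When combined with a predicted input of `log(y_pred)`, it mimics the cross-entropy as given in the lecture.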

In [70]:
my_test_true = torch.LongTensor([1,0])
my_test_pred =  torch.tensor([[0.001, 0.09, 0.99], [0.99,0.05,0.05] ]) 

test_cross_entropy = nn.NLLLoss(reduction ='sum')
print(test_cross_entropy(torch.log(my_test_pred),my_test_true))
print(my_cross_entropy(my_test_pred,my_test_true))

tensor(2.4180)
tensor(2.4180)
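
The value printed above can also be checked by hand, without PyTorch. The sketch below uses the same numbers as the test; the function name `cross_entropy_sum` is hypothetical and the plain-list representation is an assumption made for illustration.

```python
import math

def cross_entropy_sum(probs, targets):
    """Sum over the batch of -log p(true class); probs holds one row of class probabilities per sample."""
    return -sum(math.log(row[t]) for row, t in zip(probs, targets))

probs = [[0.001, 0.09, 0.99], [0.99, 0.05, 0.05]]   # same values as in the test above
targets = [1, 0]                                    # true class index per sample

loss = cross_entropy_sum(probs, targets)            # -log(0.09) - log(0.99)
print(round(loss, 4))

# Dividing by the batch size gives the mean over samples instead of the sum
print(round(loss / len(targets), 4))
```

Only the entries at the true classes (0.09 and 0.99) enter the result. Note that `nn.CrossEntropyLoss` averages over the batch by default (`reduction='mean'`), whereas `my_cross_entropy` sums, so the two differ by a factor of the batch size.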


### **Task (1.c)** **(20 pts.)** 
The cell below gives you an implementation for computing the loss on the test data. Implement an appropriate training loop and train the neural network for a sufficient number of epochs. How many training steps you can achieve likely depends on your laptop/device.

Print the train and test loss after every $n$ steps. In the cell below $n$ is set to 1000 but you are free to change it to any value you think is informative to appreciate the change in the loss. 

> Note: For the more experienced students, you can leverage GPUs (if you have them) to achieve faster computation. Check out how to achieve this by using the `.to(device)` function in PyTorch.
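
A minimal sketch of the `.to(device)` pattern mentioned in the note, assuming a small stand-in model (`nn.Linear`) and random data in place of the real network and the CIFAR10 batches. It falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Stand-in model (assumption): any nn.Module is moved the same way
model = nn.Linear(3 * 32 * 32, 10).to(device)

inputs = torch.randn(4, 3 * 32 * 32)      # fake batch of 4 flattened images
targets = torch.randint(0, 10, (4,))      # fake class labels

# For tensors, .to(device) returns a tensor on the target device,
# so the result has to be assigned back
inputs, targets = inputs.to(device), targets.to(device)

outputs = model(inputs)
loss = nn.functional.cross_entropy(outputs, targets)
print(outputs.shape, outputs.device, loss.item())
```

In a training loop, the model is moved once before training, while every batch of inputs and labels is moved inside the loop.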

In [67]:
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def get_test_loss(net, loss_criterion, data_loader):
  testing_loss = []
  with torch.no_grad():
    for data in data_loader:
      inputs, labels = data
      outputs = net(inputs)
      # calculate the loss for this batch
      loss = loss_criterion(outputs, labels)
      # add the loss of this batch to the list
      testing_loss.append(loss.item())
  # calculate the average loss
  return sum(testing_loss) / len(testing_loss)

In [69]:
training_loss, testing_loss = [], []
running_loss = []
n = 1000
epochs = 150
net = Net()
# the optimizer must be re-created for the new network, otherwise it
# keeps updating the parameters of the previous `net` instance
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
net.train()
## To run on GPUs
# device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
# net.to(device)
# then call any_tensor = any_tensor.to(device)

for epoch in range(epochs): # 150 epochs
  for i, data in enumerate(train_loader):
      inputs, targets = data
      # forward pass
      outputs = net(inputs)
      loss = criterion(outputs, targets)

      # backward pass & parameter update
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
      running_loss.append(loss.item())
      # note: with batch_size = 64 one epoch has 782 steps, so for n = 1000
      # this condition only holds at the first step of every epoch
      if i % n == 0:
        net.eval()
        # testing_loss.append(get_test_loss(net, criterion, test_loader))
        training_loss.append(sum(running_loss) / len(running_loss))

        print(f'Epoch [{epoch+1}/{epochs}], Step [{i+1}], Loss: {training_loss[-1]:.8f}')
        # print(f'Epoch [{epoch + 1}/{epochs}], Step [{i + 1}], Loss: {training_loss[-1]}, Test Loss: {testing_loss[-1]:.4f}')
        # TODO: store and print avg_train_loss and avg_test_loss every n steps.
        running_loss = []
        net.train()
  

Epoch [1/150], Step [1], Loss: 2.31376672

KeyboardInterrupt: 

### **Task (1.d)** **(5 pts.)** 
What are the final values for train and test loss that you achieved? What can you observe, e.g. regarding overfitting? Explain (in your own words, no implementation required) how one could in principle improve one loss (or the other, or both), e.g. by changing the network structure, using regularization, splitting the data differently, or tuning hyperparameters.

### **Task (1.e)** **(20 pts.)**

**Task (1.e.1)** **(15 pts.)**  PyTorch has a class called `torch.nn.Dropout()` which implements Dropout, an efficient regularization technique for preventing overfitting. You can refer to the [pytorch documentation](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html) for more information about it. You have to do some hyperparameter tuning, e.g., dropout probability, number of dropout layers etc., to identify the best setup.
Using the dropout class, modify the neural network you implemented above in order to incorporate dropout into its structure. Once you have done that, train the modified NN again, similarly to what you have done above. 
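
Before adding dropout to the network, it can help to look at what a single `nn.Dropout` layer does to a tensor. The sketch below is only an illustration on a vector of ones; the probability `p=0.5` and the input are arbitrary choices, not a recommended setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
y_train = drop(x)   # training mode: entries are zeroed with probability p,
                    # the remaining ones are scaled by 1 / (1 - p)

drop.eval()
y_eval = drop(x)    # evaluation mode: dropout is the identity

print(y_train.unique(), y_train.mean().item())
print(torch.equal(y_eval, x))
```

This is why the training loop switches between `net.train()` and `net.eval()`: with dropout layers in the model, the test loss should be computed in evaluation mode, where no units are dropped.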

**Task (1.e.2)** **(5 pts.)** What are the train and test losses now? What can you observe? Try to plot the losses as a function of epochs/training steps using `matplotlib` or `seaborn` to support your argument. 

In [None]:
class NetDropout(nn.Module):
    def __init__(self):
        super().__init__()
        # ----------------------------------
        #  Your code here
        # ----------------------------------

    def forward(self, x):
        # ----------------------------------
        #  Your code here
        # ----------------------------------
        pass

net_dropout = NetDropout()

Train the new `net_dropout` model.

In [None]:
# ----------------------------------
#  Your code here
# ----------------------------------

Plot the losses as a function of training steps.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# ----------------------------------
#  Your code here
# ----------------------------------