<img src="https://s8.hostingkartinok.com/uploads/images/2018/08/308b49fcfbc619d629fe4604bceb67ac.jpg" width="500" height="450">
<h3 style="text-align: center;"><b>PhysTech School of Applied Mathematics and Informatics (FPMI) MIPT </b> </h3>

---

<h2 style="text-align: center;"><b>Homework assignment: Dropout on a regression task</b></h2>

---

A reminder of how a fully connected neural network works:

In [1]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "https://www.oreilly.com/library/view/tensorflow-for-deep/9781491980446/assets/tfdl_0402.png")

## Problem

Overfitting is one of the biggest problems of deep networks. Its nature is the following: the model over-adapts to the training set and becomes extremely good at explaining samples from the training dataset, but it fails to capture the true underlying structure of the data, which results in poor performance on validation and test datasets.

A good example of overfitting is shown in the following image: the black dots are training examples, the blue line is the function learned by the model, and the black line is the real function the data comes from (the dots are noisy samples of it).

In [2]:
Image(url= "https://upload.wikimedia.org/wikipedia/commons/6/68/Overfitted_Data.png")

## Dropout 

As we learned from the lecture, one of the great techniques to fight overfitting is dropout.

#### How it works

On each training iteration, some nodes in a layer with dropout are temporarily treated as eliminated. How many are eliminated depends on the dropout parameter: for example, if you set it to 0.2, each node is dropped with probability 0.2, so about 20% of the layer's nodes are thrown away on each iteration (a different, randomly chosen set every time).

"Eliminated" nodes take part in neither the forward pass nor the backward pass.
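
The mechanism described above can be sketched by hand with a random mask. This is only an illustration, not how the homework networks should be built: the names `p_drop`, `x` and `keep_mask` are made up for this example, and the sketch leaves out the rescaling of the kept activations.

```python
import torch

torch.manual_seed(0)  # only to make this illustration repeatable

p_drop = 0.2                              # hypothetical dropout parameter
x = torch.ones(1000, requires_grad=True)  # stand-in for one layer's outputs

# Each node is kept independently with probability 1 - p_drop.
keep_mask = (torch.rand(1000) >= p_drop).float()

# Forward pass: "eliminated" nodes output exactly 0.
out = x * keep_mask

# Backward pass: eliminated nodes receive zero gradient.
out.sum().backward()

fraction_dropped = 1.0 - keep_mask.mean().item()
print(f"fraction of nodes dropped on this iteration: {fraction_dropped:.3f}")
print("gradient is zero exactly where nodes were dropped:",
      torch.equal(x.grad, keep_mask))
```

Running the masking line again draws a new mask, which is why a different set of nodes is eliminated on every iteration.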

In the image below you can see the same network without dropout (left) and with dropout applied (right):

In [3]:
Image(url= "https://hsto.org/web/dd8/171/16f/dd817116fc2348e78272577153e31d2d.jpeg")

During inference no nodes are thrown away. The full network therefore has more active nodes than any of the thinned networks it was trained as, so the layer outputs have to be rescaled to keep their expected magnitude the same in both regimes. This was covered in more detail in the lecture.

Let's recall the main features of dropout:
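
As a small illustration of this rescaling, the sketch below applies `torch.nn.Dropout` to a vector of ones in both modes. PyTorch applies the scaling at training time: kept activations are multiplied by `1 / (1 - p)`, and in eval mode the layer is an identity. The input vector and the variable names are assumptions made up for the example.

```python
import torch

torch.manual_seed(0)  # only to make this illustration repeatable

drop = torch.nn.Dropout(p=0.5)  # p is the probability of zeroing a node
x = torch.ones(10000)

# Train mode: each entry is either zeroed or scaled by 1 / (1 - p) = 2.
drop.train()
train_out = drop(x)

# Eval mode: nothing is dropped and nothing is rescaled.
drop.eval()
eval_out = drop(x)

print("values seen in train mode:", train_out.unique().tolist())
print("mean in train mode: %.3f" % train_out.mean().item())
print("eval output equals input:", torch.equal(eval_out, x))
```

The mean in train mode stays close to 1, the same as in eval mode, which is the point of the rescaling.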

<ul> 
    <li>kind of model ensembling (throwing out some nodes on each iteration results in "different" networks trained each iteration)</li>

    <li>prevents co-adaptations of the network’s units so they learn more robust features</li>

    <li>can be considered as a kind of regularization for a neural network</li>
</ul> 

P.S. In practice (in PyTorch code) we do not need to implement "switching off" dropout during inference or the calibration of layer outputs ourselves: it is all done inside torch. In code we only need to tell torch that we want to use dropout after a specific layer and to switch the network between train and eval modes. You'll see it in practice below.

-----------------------------------------------------------------------------

We will now try to solve a **regression** task. We'll generate two datasets, $y$ and $test\_y$, from the same distribution, train our network on $y$, and use $test\_y$ as a test set. Of course, we will compare two ways of training the network: with dropout and without =)

In [3]:
import torch
import numpy as np
import matplotlib.pyplot as plt
from time import sleep
from IPython.display import clear_output  # used to redraw the plot in the training loop
%config InlineBackend.figure_format = 'svg' 
torch.manual_seed(2)   
np.random.seed(2)

We show you how $y$ and $test\_y$ were generated, but so that the homework questions have fixed correct answers, we use the particular arrays of $y$ and $test\_y$ given below:

In [4]:
# Amount of dots (data samples)
N_SAMPLES = 20

# Train set
x = torch.linspace(-1, 1, N_SAMPLES).view(N_SAMPLES, -1)
# y = x + 0.3 * torch.normal(torch.zeros(N_SAMPLES, 1), torch.ones(N_SAMPLES, 1))
y = torch.tensor([[-1.3122],[-0.6198],[-1.1807],[-1.0171],[-0.5700],[-0.4886],[-0.0489],[ 0.0027],[-0.4012],[ 0.1495]
                  ,[-0.2844],[ 0.1303],[ 0.3053],[ 0.7042],[ 0.5683],[ 1.1048],[ 0.4623],[ 0.4167],[ 0.8422],[ 1.2097]])

# Test set
test_x = torch.linspace(-1, 1, N_SAMPLES).view(N_SAMPLES, -1)
#test_y = test_x + 0.3 * torch.normal(torch.zeros(N_SAMPLES, 1), torch.ones(N_SAMPLES, 1))
test_y = torch.tensor([[-1.3883],[-0.9020],[-0.8601],[-0.8968],[-0.2950],[-1.0038],[-0.5611],[ 0.4919],[-0.4285],[-0.0572],
                       [ 0.4683],[ 0.9315],[ 0.1591],[ 0.3946],[ 0.1438],[ 0.7278],[ 1.1072],[ 0.6730],[ 1.0141],[ 0.6687]])

Let's look at our generated data:

In [None]:
plt.scatter('''your code here''', c='magenta', s=20, alpha=0.6, label='train')
plt.scatter('''your code here''', c='cyan', s=20, alpha=0.6, label='test')
plt.legend(loc='upper left')
plt.ylim((-2.5, 2.5))
plt.show()

Let's build a 3-layer neural network with ReLU as the activation function between layers:

In [None]:
# Hidden layers size
N_HIDDEN = 300
net_overfitting = torch.nn.Sequential(
    torch.nn.Linear(1, N_HIDDEN),
    <Your code here>
    torch.nn.Linear(N_HIDDEN, 1),
)
print(net_overfitting)  # prints network architecture

In the second network, let's add Dropout after the first and second layers with keep_rate=0.5 (in PyTorch this corresponds to a drop probability `p=0.5`, so on average 50% of the nodes are eliminated on each iteration):

In [None]:
net_dropped = torch.nn.Sequential(
    torch.nn.Linear(1, N_HIDDEN),
    <Your code here>
    torch.nn.Linear(N_HIDDEN, 1),
)
print(net_dropped) # prints network architecture

Optimizers for each of the nets:

In [None]:
optimizer_ofit = torch.optim.Adam(<Your code here>, lr=0.01)
optimizer_drop = torch.optim.Adam(<Your code here>, lr=0.01)

We'll use MSE as the loss function:

In [None]:
loss_func = torch.nn.<Your code here>
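
For reference, the mean squared error between predictions $\hat{y}$ and targets $y$ over $N$ samples is usually defined as below. The notation $\hat{y}_i$ for the network's prediction on the $i$-th sample is introduced here only for this formula.

$$\mathrm{MSE}(\hat{y}, y) = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2$$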

In [None]:
# Here we'll store errors on y and test_y
error = []
error_test = []
for t in range(500):
    # forward pass of x
    pred_ofit = <Your code here>
    pred_drop = <Your code here>
    
    loss_ofit = loss_func(pred_ofit, y)
    loss_drop = loss_func(pred_drop, y)

    # gradient descent step
    optimizer_ofit.zero_grad()
    optimizer_drop.zero_grad()
    <Your code here>

    # we'll check the performance of our nets every 20 steps
    if t % 20 == 0:
        # we need to switch to eval mode for evaluating
        net_overfitting.eval()
        net_dropped.eval()

        # plotting
        clear_output(wait=True)
        sleep(0.05)
        
        # forward pass of test_x (should define test_pred_ofit and test_pred_drop)
        <Your code here>
        
        # Plot our data and nets' predictions on test data:
        plt.scatter(<Your code here>, c='magenta', s=20, alpha=0.6, label='train')
        plt.scatter(<Your code here>, c='cyan', s=20, alpha=0.6, label='test')
        plt.plot(test_x.data.numpy(), test_pred_ofit.data.numpy(), 'r-', lw=3, label='overfitting')
        plt.plot(test_x.data.numpy(), test_pred_drop.data.numpy(), 'b--', lw=3, label='dropout(50%)')
        
        # Errors of nets on y and test_y
        error.append((loss_func(test_pred_ofit, y).data.numpy(), loss_func(test_pred_drop, y).data.numpy()))
        error_test.append(<Your code here>)
        
        plt.text(-0.5, -1.8, 'overfitting loss=%.4f' % <Your code here>, fontdict={'size': 20, 'color':  'red'})
        plt.text(-0.5, -2.3, 'dropout loss=%.4f' % <Your code here>, fontdict={'size': 20, 'color': 'blue'})
        plt.legend(loc='upper left')
        plt.ylim((-2.5, 2.5))
        plt.show()
        
        # switch back to train mode
        net_overfitting.train()
        net_dropped.train()

# Print losses of the networks on y
print('overfitting loss train', 'dropout loss train')
for i in range(len(error)):
    print('%.4f\t\t\t%.4f' % (error[i][0], error[i][1]))

print()
# Print losses of the networks on test_y
print('overfitting loss test', '   dropout loss test')
for i in range(len(error_test)):
    print('%.4f\t\t\t%.4f' % '''your code here''')

# Questions for homework assignments:
1. What is the MSE between $y$ and $test\_y$ (computed with loss_func)?
<br> Answer: ...
2. How many trainable parameters does the hidden linear layer have (don't forget the bias b)?
<br> Answer: ...
3. What $overfitting\:loss\:train$ error did you get at the end of training (rounded to 2 decimal places)?
<br> Answer: ...
4. What $dropout\:loss\:train$ error did you get at the end of training (rounded to 2 decimal places)?
<br> Answer: ...
5. What $overfitting\:loss\:test$ error did you get at the end of training (rounded to 2 decimal places)?
<br> Answer: ...
6. What $dropout\:loss\:test$ error did you get at the end of training (rounded to 2 decimal places)?
<br> Answer: ...

<br> Compare 3 and 4 and think about why you got such a result.
<br> Compare 5 and 6 and think about why you got such a result.
<br> Also compare 5 and 6 to the previous two errors (3 and 4).