<img src="https://raw.githubusercontent.com/pytorch/pytorch/master/docs/source/_static/img/pytorch-logo-dark.png" width="600px"/>

# **PyTorch** (1/2): the world's simplest neural net from scratch

In this notebook, we will build a simple neural network to learn a linear equation with one variable a.k.a. a _line_.

## What is PyTorch?
* open-source machine learning library
* developed by Facebook AI Research + community
* used in most state-of-the-art research
* _Python first_: deeply integrated into Python, it should _feel familiar_ to use

### Main Components
* ```torch```: tensor library (like NumPy but with GPU support)
* ```torch.autograd```: automatic differentiation library
* ```torch.jit```: compile PyTorch code to TorchScript for deployment (e.g. in standalone C++ program)
* ```torch.nn```: neural network library
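To give a feel for the "NumPy with GPU support" idea, here is a minimal sketch. The variable names are made up for this illustration, and the GPU is only used if one happens to be available.

```python
import numpy as np
import torch

# create a tensor from a Python list, much like np.array
t = torch.tensor([1.0, 2.0, 3.0])

# convert to NumPy and back (on the CPU, both objects share the same memory)
n = t.numpy()
t2 = torch.from_numpy(n)

# element-wise operations look just like NumPy
doubled = t * 2

# use the GPU if one is available, otherwise stay on the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
t_dev = t.to(device)

print(doubled, t_dev.device)
```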

Useful Links
* [🔗 Official PyTorch Documentation](https://pytorch.org/docs/stable/index.html)

Tutorials:
* [🔗 60 Minute Blitz](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)
* [🔗 PyTorch Tutorials with Examples](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html)

In [None]:
%matplotlib widget
import matplotlib as mpl
import matplotlib.pyplot as plt

import numpy as np
import torch

# Training a Neural Network to Learn a Line

## Creating our Dataset

As we have seen in the ML introduction, we need some data to train our model.

In our case, we can just generate data points for an arbitrary line function. In order to do this we first need to define our _target_, i.e. the line equation that we want our model to learn. 

As you might recall, a _line_ is defined by the equation $$ y = ax + b $$

It has one variable ($x$) and two parameters:
* $a$ (_slope_) and 
* $b$ (_intercept_). 

The _parameters_ are what we want our model to learn.

Let's create our target line function by generating some random values for $a$ and $b$.


> When developing algorithms that contain any kind of randomness, for reproducibility it is always a good idea to set a _manual seed_ before you start. This will initialize the random number generators to the same state every time you run your experiment, i.e. you will always get the same results.
>
> In NumPy, you can do this by passing an arbitrary number to the function `numpy.random.seed()`.
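
The effect of seeding can be checked with a small experiment. This is only a sketch with made-up variable names: resetting the seed to the same value should reproduce the same "random" numbers.

```python
import numpy as np

np.random.seed(42)
first_draw = np.random.uniform(size=3)

np.random.seed(42)  # reset the generator to the same state
second_draw = np.random.uniform(size=3)

# both draws contain exactly the same values
print(np.array_equal(first_draw, second_draw))
```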

In [None]:
seed = 42
np.random.seed(seed) # for reproducibility

In [None]:
a, b = np.random.uniform(low=-2.0, high=2.0, size=2)
a, b

We use these to define our target function $f$:

In [None]:
def f(x):
    return a*x + b

Next, we create our _independent variable_ $x$ by generating 1000 evenly spaced numbers over the interval $[-\pi, \pi]$.

In [None]:
n_samples = 1000

x = np.linspace(-np.pi, np.pi, n_samples, dtype=np.float32)

> Note that the number of samples we generate will influence our results: in general, the more data points we allow the model to learn from, the better our final performance tends to be.

Let's create our _dependent variable_ $y$ by passing $x$ to our function $f$:

In [None]:
y_true = f(x)

Let's have a look at what our line looks like.

In [None]:
fig, ax = plt.subplots()

ax.plot(x, y_true, color='#455a64', label='y_true')
ax.grid(True, alpha=0.4)
ax.legend()
fig.tight_layout();

In the real world, you usually don't know the _true_ process that you are trying to model. You only have some data obtained from observing the process.

Any measured data will include some kind of noise (e.g. slight deviations in temperature readings due to the limited precision of the sensor itself). Luckily, neural networks can handle noisy data quite well. In fact, introducing noise during training (especially for small datasets) can improve the robustness of the network and result in better generalization (see e.g. [here](https://machinelearningmastery.com/train-neural-networks-with-noise-to-reduce-overfitting/)).

So, let's go ahead and add some noise to our _true_ $y$ data.

In [None]:
e = np.random.normal(size=y_true.shape, scale=0.5).astype(np.float32)
y_noisy = y_true + e

In [None]:
ax.scatter(x, y_noisy, s=1, color='#90a4ae', alpha=0.75, label='y_noisy')
ax.legend();

## Defining our Model

If we want a neural network to learn this function, we need a model that also has (at least) two parameters. 

But first we need to decide what _kind_ of layers our neural network is made of. Since we want to model a linear equation, `torch.nn.Linear` might be a good fit. Let's have a look at [what this layer does](https://pytorch.org/docs/master/generated/torch.nn.Linear.html#torch.nn.Linear).

This is exactly what we need! Conceptually, it looks like this:

<img src="../assets/images/linear_unit.svg" alt="Linear unit" width="400px"/>



We have $n$ inputs $x_1, x_2, \dots, x_n$, each of which is multiplied by its corresponding weight $w_1, w_2, \dots, w_n$. The output $y$ is the weighted sum of all inputs plus a single bias term $b$.
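
To make the weighted sum concrete, the following sketch compares the output of a `torch.nn.Linear` layer with the same value computed by hand. The number of inputs (3) and the input values are arbitrary choices for this example.

```python
import torch

torch.manual_seed(0)

n_inputs = 3  # arbitrary number of inputs for this example
layer = torch.nn.Linear(n_inputs, 1)

inputs = torch.tensor([1.0, 2.0, 3.0])

# output computed by the layer
out_layer = layer(inputs)

# the same value computed by hand: weighted sum of the inputs plus the bias
weights = layer.weight[0]  # shape (n_inputs,)
bias = layer.bias[0]       # scalar
out_manual = (weights * inputs).sum() + bias

print(out_layer.item(), out_manual.item())
```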

For our specific case, it looks even simpler:

<img src="../assets/images/linear_unit_single.svg" alt="Linear unit with a single input" width="400px"/>

We just have a single input ($x$), the weight corresponds to the slope $a$ and the bias $b$ corresponds to the intercept.

To construct an instance of `torch.nn.Linear`, we need to specify the sizes of our input and output samples. In our case, these are just scalar values (i.e. shape `1`).

Again, we set a manual seed here because the layer object takes care of randomly initializing its parameters. The PyTorch way of doing this is by calling `torch.manual_seed()`.

In [None]:
torch.manual_seed(seed); # for reproducibility

model = torch.nn.Sequential(
    torch.nn.Linear(1, 1),
    torch.nn.Flatten(0, 1)
)

We also add a `Flatten` layer at the end. This will flatten the output to a 1D tensor to match the shape of our labels (i.e. the `y_noisy` vector).
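
The shape change can be seen in a small sketch. The tensor of five zero-valued samples is just a stand-in for real data, and the names are made up for this example.

```python
import torch

demo_input = torch.zeros(5, 1)         # 5 samples with 1 feature each
demo_linear = torch.nn.Linear(1, 1)
demo_flatten = torch.nn.Flatten(0, 1)  # merge dimensions 0 and 1

out_linear = demo_linear(demo_input)
out_flat = demo_flatten(out_linear)

print(out_linear.shape)  # torch.Size([5, 1])
print(out_flat.shape)    # torch.Size([5])
```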

We can have a look at our model's parameters:

In [None]:
[(name, tensor.item()) for (name, tensor) in model.named_parameters()]

As expected, we have two scalar parameters: a _weight_ and a _bias_.

Currently, our training data is stored in NumPy arrays. To be able to pass the data to our model, we have to convert them to PyTorch tensors:

In [None]:
xx = torch.from_numpy(x).unsqueeze(-1)
yy = torch.from_numpy(y_noisy)

## Constructing the Training Loop

As mentioned before, our model parameters are initialized to random values. 

Let's have a look at what our model predicts with these initial parameters.

In [None]:
n_training_steps = 200

In [None]:
fig, axes = plt.subplots(nrows=2, figsize=(6, 6), gridspec_kw={'height_ratios':[0.7, 0.3]})
axes[0].scatter(x, y_noisy, s=0.5, color='#90a4ae', alpha=0.75, label='y_noisy')
axes[0].plot(x, y_true, linewidth=1, label='y_true', color='#455a64')
axes[0].set_title('Epoch 0')
line_pred = axes[0].plot(x, model(xx).detach().numpy(), label='y_pred', color='#ffa000')[0]
line_loss = axes[1].semilogy(np.arange(n_training_steps)[0], 1000, label='training loss')[0]
axes[1].set_xlim([0, n_training_steps])
axes[1].set_xlabel('# epoch')
axes[1].set_ylabel('mse loss')
for ax in axes:
    ax.grid(True, which='both', alpha=0.25)
    ax.legend()
fig.canvas.header_visible = False
fig.tight_layout()

def update_every(iteration, every = 50):
    return iteration % every == every - 1

def set_plot_data(y_pred, loss):
    line_pred.set_ydata(y_pred)
    line_loss.set_data(np.arange(loss.shape[0]), loss)
    axes[0].set_title(f'Epoch {loss.size - 1}: loss {loss[-1]:.5f}')
    axes[1].set_xlim([0, n_training_steps])
    axes[1].relim()
    axes[1].autoscale_view()
    fig.canvas.draw()

Let's remind ourselves of the <a href="#detailed_loop">Detailed Training Loop Diagram</a> and compare what we have achieved so far:

* [x] Collect input data (our tensor `xx`) and
* [x] Corresponding label data (our tensor `yy`)
* [x] Define a model architecture
* [x] Initialize model parameters to random values
* [x] Calculate predictions (our tensor `y_pred`) based on current model parameters and input data
* [ ] Define a loss function to determine the performance
* [ ] Update model parameters based on loss using automated process (Stochastic Gradient Descent)


To update our model parameters, we need to define a suitable loss function. 

We want `y_pred` to exactly match `y_true`, so we need a mathematical way of figuring out how _similar_ our prediction is to our target line.

We can use the _mean squared error (MSE)_ to achieve this. It is defined like this:

$\text{MSE} = \frac{1}{n} \sum_{i = 1}^{n} (Y_i - \hat{Y_i})^2$

For every data sample $i$, we compute the squared difference of the target value $Y_i$ and the predicted value $\hat{Y_i}$. The squared differences are then added up across all samples and divided by the number of samples to yield a single scalar value, the MSE.
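
As a quick sanity check of the formula, the sketch below computes the MSE by hand and compares it with PyTorch's built-in `torch.nn.MSELoss`. The three target and prediction values are invented for this example.

```python
import torch

targets = torch.tensor([1.0, 2.0, 3.0])
predictions = torch.tensor([1.5, 2.0, 2.0])

# MSE by hand: mean of the squared differences
mse_manual = ((targets - predictions) ** 2).mean()

# the same value using PyTorch's built-in loss
mse_torch = torch.nn.MSELoss()(predictions, targets)

print(mse_manual.item(), mse_torch.item())
```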

#### Stochastic Gradient Descent (SGD)

Furthermore, we need the automated process that will find values for our model parameters $w$ and $b$, such that the predictions closely resemble our target line. This is what the SGD algorithm does for us.

Specifically, these are the steps that are required to make our model learn from its experience:

<br>
<img src="../assets/images/sgd.svg" alt="Stochastic Gradient Descent process" width="700px"/>
<a href="https://colab.research.google.com/github/fastai/fastbook/blob/master/04_mnist_basics.ipynb">(Image Source)</a>
<br>

1. *Initialize* the weights.
1. For each input sample, use these weights to *predict* the output value.
1. Based on these predictions, calculate how good the model is (its *loss*).
1. Calculate the *gradient*, which measures, for each weight, how changing that weight would change the loss.
1. *Step* (that is, change) all the weights based on that calculation.
1. Go back to step 2, and *repeat* the process.
1. Iterate until you decide to *stop* the training process (for instance, because the model is good enough or you don't want to wait any longer).
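
The steps above can be sketched without any optimizer class. This is a simplified illustration, not the training loop used in this notebook: the toy line $y = 2x - 1$, the step size, and the number of iterations are assumptions, and because every step uses all samples it is strictly speaking plain (full-batch) gradient descent rather than the stochastic variant.

```python
import torch

torch.manual_seed(0)

# toy data for the line y = 2x - 1 (values chosen for this example only)
x_demo = torch.linspace(-1, 1, 100)
y_demo = 2 * x_demo - 1

# 1. initialize the weights
w = torch.randn(1, requires_grad=True)
c = torch.randn(1, requires_grad=True)

step_size = 0.1
for step in range(500):                        # 6./7. repeat until we stop
    pred = w * x_demo + c                      # 2. predict
    loss_demo = ((pred - y_demo) ** 2).mean()  # 3. loss (MSE)
    loss_demo.backward()                       # 4. gradient
    with torch.no_grad():                      # 5. step
        w -= step_size * w.grad
        c -= step_size * c.grad
    # reset the gradients so they do not accumulate across iterations
    w.grad.zero_()
    c.grad.zero_()

print(w.item(), c.item(), loss_demo.item())
```

After the loop, `w` and `c` should be close to the slope and intercept of the toy line.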

Essentially, we want SGD to find the (_a_) minimum of our loss function.

<img src="https://mlfromscratch.com/content/images/2019/12/gradient-descent-optimized--1-.gif" alt="SGD" width="600px"/>

Example from [here](https://mlfromscratch.com/optimizers-explained/#/) (animation originally from [3blue1brown](https://www.youtube.com/watch?v=IHZwWFHWa-w)).


In [None]:
learning_rate = 0.0175
# Adam is an adaptive variant of stochastic gradient descent
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [None]:
loss_fn = torch.nn.MSELoss()
losses = np.ones(n_training_steps) * np.inf

params_and_loss = torch.zeros((n_training_steps, 3))

torch.manual_seed(seed)
model[0].reset_parameters()
set_plot_data(model(xx).detach().numpy(), losses[:1])

In [None]:
for t in range(n_training_steps):
    y_pred = model(xx) # run forward pass
    
    loss = loss_fn(y_pred, yy) # calculate loss

    # record current parameter values and loss
    params_and_loss[t] = torch.tensor([*[p.item() for p in model.parameters()], loss.item()])
    
    # update plots if needed
    losses[t] = loss.item()
    if update_every(t, 5):
        set_plot_data(y_pred.detach().numpy(), losses[:t+1])

    optimizer.zero_grad() # zero out any previous gradients
    
    loss.backward() # compute gradient of loss w.r.t model parameters
    
    optimizer.step() # update parameters
    
set_plot_data(y_pred.detach().numpy(), losses)

If we have a look at the model parameters again, we can verify that they now closely resemble our _target_ parameters:

In [None]:
[param.item() for param in model.parameters()]

In [None]:
[a, b] # target parameters

# Plotting the Weight Landscape

In [None]:
n_mesh = 50

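# indexing='ij' keeps the historical default and avoids a warning in recent PyTorch versions
xv, yv = torch.meshgrid(torch.linspace(a - 2, a + 2, n_mesh), torch.linspace(b - 2, b + 2, n_mesh), indexing='ij')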
zv = torch.zeros_like(xv)

In [None]:
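with torch.no_grad():
    pa, pb = model.parameters()
    # use names other than `x` and `y` so the data arrays are not overwritten
    for i, (a_i, b_i) in enumerate(zip(xv.flatten(), yv.flatten())):
        pa.copy_(a_i.reshape(pa.shape))
        pb.copy_(b_i.reshape(pb.shape))
        # evaluate the loss at this grid point and store it at the matching index
        zv[i // n_mesh, i % n_mesh] = loss_fn(model(xx), yy)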

In [None]:
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

min_loss = zv.min().item()
max_loss = zv.max().item()
ax.plot_surface(
    xv.numpy(), yv.numpy(), zv.numpy(),
    cmap=mpl.cm.inferno,
    norm=mpl.colors.LogNorm(vmin=min_loss, vmax=max_loss * 2),
    antialiased=False)
ax.plot(*params_and_loss.T, zorder=20, color='white')

ax.set_xlabel('a')
ax.set_ylabel('b')
ax.set_zlabel('MSE')
ax.view_init(elev=30, azim=-110)
fig.tight_layout()

In [None]:
fig, ax = plt.subplots()
cs = ax.contourf(
    xv.numpy(), yv.numpy(), zv.numpy(), 
    levels=np.logspace(np.log10(min_loss), np.log10(max_loss), n_mesh),
    norm=mpl.colors.LogNorm(vmin=min_loss, vmax=max_loss * 2),
    cmap=mpl.cm.inferno)
ax.plot(*params_and_loss.T[:2], color='white', label='model params')
ax.scatter(a, b, color='red', label='target')
ax.text(*(params_and_loss[0, :2] + 0.01), 'initial', color='white')
ax.text(*(params_and_loss[-1, :2] - 0.01), 'final', color='white', ha='right')

ax.set_xlabel('a')
ax.set_ylabel('b')
ax.set_title('Model parameter trajectory during training')
ax.legend();
fig.tight_layout()

# A Closer Look: Automatic differentiation with ```torch.autograd```

`torch.autograd` is PyTorch’s automatic differentiation engine that powers neural network training.

Adapted from [here](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#sphx-glr-beginner-blitz-autograd-tutorial-py).

Let’s take a look at how `autograd` collects gradients. We create two tensors `a` and `b` with `requires_grad=True`. This signals to `autograd` that every operation on them should be tracked.

In [None]:
a = torch.tensor([4.], requires_grad=True)
b = torch.tensor([2.], requires_grad=True)

We create another tensor ```y``` from ```a``` and ```b```:

$ y = 2a - b$

In [None]:
y = 2 * a - b
y

Let's assume ```a``` and ```b``` to be parameters of a neural network, and ```y``` to be the error. In NN training, we want gradients of the error w.r.t the parameters, i.e.

$ \frac{\partial y}{\partial a} = 2 $ and $\frac{\partial y}{\partial b} = -1 $

When we call ```.backward()``` on ```y```, autograd calculates these gradients and stores them in the respective tensors' ```.grad``` attribute.

In [None]:
y.backward()

The gradients are now deposited in ```a.grad``` and ```b.grad```.

In [None]:
a.grad, b.grad
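
One detail that connects back to the training loop: `.grad` accumulates across backward passes, which is why the loop calls `optimizer.zero_grad()` before each `loss.backward()`. The sketch below repeats the example with fresh tensors (named `p` and `q` here to avoid clashing with `a` and `b`) and runs a second backward pass.

```python
import torch

p = torch.tensor([4.0], requires_grad=True)
q = torch.tensor([2.0], requires_grad=True)

out = 2 * p - q
out.backward()
print(p.grad, q.grad)  # tensor([2.]) tensor([-1.])

# a second backward pass adds to the stored gradients
out_again = 2 * p - q
out_again.backward()
print(p.grad, q.grad)  # tensor([4.]) tensor([-2.])

# resetting the gradients by hand (what optimizer.zero_grad() does for us)
p.grad.zero_()
q.grad.zero_()
```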