## Loss Functions, Optimizers, & The Training Loop

In [1]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

In [53]:
import gzip
import pickle
import numpy as np
import pandas as pd

In [60]:
import torch.nn.functional as F
import torch
import torch.nn as nn
from torch.nn import init
from torch import tensor

In [55]:
from fastai import datasets

In [31]:
def get_data(MNIST_URL = 'http://deeplearning.net/data/mnist/mnist.pkl'):
    path = datasets.download_data(MNIST_URL, ext='.gz')
    with gzip.open(path, 'rb') as f:
        ((X_train, y_train), (X_val, y_val), _) = pickle.load(f, encoding='latin-1')
    return map(tensor, (X_train, y_train, X_val, y_val))

def normalize(x, m, s):
    return (x-m)/s

In [32]:
torch.nn.modules.conv._ConvNd.reset_parameters??

Signature: torch.nn.modules.conv._ConvNd.reset_parameters(self)
Docstring: <no docstring>
Source:   
    def reset_parameters(self):
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)

In [33]:
X_train, y_train, X_test, y_test = get_data()

In [34]:
train_mean, train_std = X_train.mean(), X_train.std()

In [35]:
X_train = normalize(X_train, train_mean, train_std)
X_test = normalize(X_test, train_mean, train_std)

In [37]:
X_train = X_train.view(-1, 1, 28, 28)
X_test = X_test.view(-1, 1, 28, 28)
X_train.shape, X_test.shape

(torch.Size([50000, 1, 28, 28]), torch.Size([10000, 1, 28, 28]))

In [39]:
n = X_train.shape[0]
c = y_test.max() + 1
nh = 32
n, c

(50000, tensor(10))

Let's create a `Conv2d` layer:

In [40]:
l1 = nn.Conv2d(in_channels=1, out_channels=nh, kernel_size=5)

In [42]:
x = X_test[:100]
x.shape

torch.Size([100, 1, 28, 28])

In [43]:
def stats(x):
    return x.mean(), x.std()

In [44]:
stats(l1.weight), stats(l1.bias)

((tensor(-0.0031, grad_fn=<MeanBackward0>),
  tensor(0.1149, grad_fn=<StdBackward0>)),
 (tensor(0.0149, grad_fn=<MeanBackward0>),
  tensor(0.1316, grad_fn=<StdBackward0>)))

Let's check the output:

In [45]:
t = l1(x)

In [46]:
stats(t)

(tensor(0.0071, grad_fn=<MeanBackward0>),
 tensor(0.6753, grad_fn=<StdBackward0>))

We would like the outputs to have a mean of 0 and a standard deviation of 1. The mean is close to 0, but the standard deviation (about 0.68) falls well short of 1.
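As a rough cross-check on the weight statistics above, the `reset_parameters` source shown earlier calls `init.kaiming_uniform_` with `a=math.sqrt(5)`. The sketch below reproduces that arithmetic in plain Python, assuming the usual Kaiming formulas (gain $\sqrt{2/(1+a^2)}$, std = gain / $\sqrt{\text{fan\_in}}$, uniform bound $\sqrt{3} \cdot$ std) and a fan-in of $1 \times 5 \times 5 = 25$ for this layer. The helper name `kaiming_uniform_stats` is made up for illustration.

```python
import math

def kaiming_uniform_stats(fan_in, a):
    """Return (std, bound) for Kaiming-uniform weights with leaky-ReLU slope `a`.

    Hypothetical helper: it mirrors the formulas, not PyTorch's actual code path.
    """
    gain = math.sqrt(2.0 / (1.0 + a ** 2))   # leaky-ReLU gain
    std = gain / math.sqrt(fan_in)            # target std of the weights
    bound = math.sqrt(3.0) * std              # U(-bound, bound) has that std
    return std, bound

# One input channel and a 5x5 kernel give fan_in = 25.
std, bound = kaiming_uniform_stats(fan_in=25, a=math.sqrt(5))
print(f"std={std:.4f}, bound={bound:.4f}")
```

Under these assumptions the predicted weight std is about 0.115, which is consistent with the measured 0.1149.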

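Let's compare this with standard Kaiming initialization using a leak of 1. A leak of 1 turns the leaky ReLU into the identity, which matches our setup because we are not applying an activation function yet. Recall the definition: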

$$\text{LeakyReLU}(x,\alpha)=\begin{cases}
x, & \text{if } x \ge 0 \\
\alpha x, & \text{if } x < 0
\end{cases}$$
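The piecewise definition translates directly into code. This is a scalar, plain-Python sketch for illustration only (the notebook itself relies on `F.leaky_relu`); the function name `leaky_relu` and the default slope of 0.01 are choices made here.

```python
def leaky_relu(x, alpha=0.01):
    """Scalar leaky ReLU: x if x >= 0, else alpha * x."""
    return x if x >= 0 else alpha * x

xs = [-2.0, -0.5, 0.0, 1.5]
identity = [leaky_relu(x, alpha=1.0) for x in xs]   # slope 1 on both sides
relu = [leaky_relu(x, alpha=0.0) for x in xs]       # negatives mapped to zero

print(identity == xs)                 # does alpha=1 leave the inputs unchanged?
print(all(v >= 0 for v in relu))      # does alpha=0 remove every negative value?
```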

When we switch to a standard Kaiming normal initialization with $a=1$, we get an output with $\mu \approx 0$ and $\sigma \approx 1$. So far so good.
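To see why $a=1$ leads to $\sigma \approx 1$, here is a torch-free sketch under idealized assumptions: the inputs are independent standard normal values (real MNIST pixels are not), the bias is ignored, and each output is a dot product over `fan_in = 25` weights drawn from a normal distribution with std = gain / $\sqrt{\text{fan\_in}}$, where gain $= \sqrt{2/(1+a^2)}$. The function name `simulate_output_std` and the sample count are illustrative.

```python
import math
import random
import statistics

def simulate_output_std(fan_in=25, a=1.0, n_samples=20_000, seed=0):
    """Empirical mean and std of w.x for Kaiming-normal w and standard-normal x."""
    rng = random.Random(seed)
    gain = math.sqrt(2.0 / (1.0 + a ** 2))    # equals 1 when a = 1
    w_std = gain / math.sqrt(fan_in)
    outputs = []
    for _ in range(n_samples):
        # Fresh weights and inputs for every sample keep the draws independent.
        out = sum(rng.gauss(0.0, w_std) * rng.gauss(0.0, 1.0)
                  for _ in range(fan_in))
        outputs.append(out)
    return statistics.mean(outputs), statistics.pstdev(outputs)

mean, std = simulate_output_std()
print(f"mean={mean:.3f}, std={std:.3f}")
```

Each of the 25 products contributes a variance of $1/25$, so the sum has variance 1 in this idealized setting.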

Let's define a function `f1` that applies the layer followed by a leaky ReLU; with the default `a=0` it is a plain ReLU:

In [61]:
def f1(x, a=0):
    return F.leaky_relu(l1(x), a)

In [64]:
init.kaiming_normal_(l1.weight, a=0)
stats(f1(x))

(tensor(0.5330, grad_fn=<MeanBackward0>),
 tensor(1.0288, grad_fn=<StdBackward0>))

Because ReLU zeroes out the negative values, the mean is no longer 0; it shifts to $\approx 1/2$.
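A quick sanity check on that shift, assuming the pre-activations are roughly zero-mean Gaussian with standard deviation $\sqrt{2}$ (the Kaiming gain for $a=0$ applied to unit-variance inputs): for $z \sim N(0, \sigma^2)$, $E[\max(0, z)] = \sigma/\sqrt{2\pi}$, which gives $1/\sqrt{\pi} \approx 0.56$. The simulation below is an idealized sketch rather than the notebook's data, so it only approximately matches the measured 0.53. The name `relu_mean` is hypothetical.

```python
import math
import random

def relu_mean(sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[max(0, z)] for z ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    total = sum(max(0.0, rng.gauss(0.0, sigma)) for _ in range(n_samples))
    return total / n_samples

sigma = math.sqrt(2.0)                          # assumed pre-activation std
closed_form = sigma / math.sqrt(2.0 * math.pi)  # equals 1 / sqrt(pi)
estimate = relu_mean(sigma)
print(f"closed form={closed_form:.3f}, simulated={estimate:.3f}")
```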

Let's go back and look at how the default `Conv2d` initialization handles it:

In [65]:
l1 = nn.Conv2d(1, nh, 5)

In [66]:
stats(f1(x))

(tensor(0.1968, grad_fn=<MeanBackward0>),
 tensor(0.3751, grad_fn=<StdBackward0>))

In [67]:
l1.weight.shape

torch.Size([32, 1, 5, 5])

In [68]:
rec_fs = l1.weight[0,0].numel()
rec_fs

25

In [69]:
nf, ni, *_ = l1.weight.shape
nf, ni

(32, 1)

Let's calculate the fan-in and fan-out, i.e. the number of connections **in** to each output activation and **out** of each input activation:

In [70]:
fan_in = ni * rec_fs
fan_out = nf * rec_fs
fan_in, fan_out

(25, 800)
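The same calculation can be wrapped in a small helper that works for any convolution weight shape of the form `(out_channels, in_channels, *kernel_size)`. This is a plain-Python sketch; the name `fan_in_out` is hypothetical, and it is meant to mirror what the private `init._calculate_fan_in_and_fan_out` seen in the source listing computes for conv weights.

```python
import math

def fan_in_out(weight_shape):
    """Return (fan_in, fan_out) for a shape (n_out, n_in, *kernel_size)."""
    n_out, n_in, *kernel = weight_shape
    receptive_field = math.prod(kernel)   # e.g. 5 * 5 = 25; 1 if no kernel dims
    return n_in * receptive_field, n_out * receptive_field

print(fan_in_out((32, 1, 5, 5)))    # the conv layer used above
print(fan_in_out((64, 32, 3, 3)))   # a hypothetical second layer
```

For the layer above this reproduces `(25, 800)`. The fan-in is what the default Kaiming mode (`fan_in`) uses to scale the weights.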