<a href="https://colab.research.google.com/github/BHouwens/DL_From_Foundations/blob/main/02_fully-connected.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Forward and Backward Passes**

In this notebook we'll be looking at setting up the forward and backward passes for the fully connected model architecture.

## **Colab Setup**
Setup depends on your environment. This notebook assumes Google Colab, with the GitHub repo already cloned into Google Drive. The cell below mounts the drive, changes into the repo folder and pulls the latest changes:

In [3]:
from google.colab import drive
drive.mount('/content/gdrive')

%cd gdrive/My Drive/git_folder/DL_From_Foundations
! git pull

Mounted at /content/gdrive
/content/gdrive/My Drive/git_folder/DL_From_Foundations
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 9 (delta 6), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (9/9), done.
From https://github.com/BHouwens/DL_From_Foundations
   d75ee4b..dca6ac7  main       -> origin/main
Updating d75ee4b..dca6ac7
Fast-forward
 02_fully-connected.ipynb | 258 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 231 insertions(+), 27 deletions(-)


In [2]:
%load_ext autoreload
%autoreload 2

%matplotlib inline



---



## **Normalize Data**

In [4]:
#export
from exp.nb_01 import *
from keras.datasets import mnist

def get_data():
    # Load the data into the train and validation sets
    (x_train, y_train), (x_valid, y_valid) = mnist.load_data()

    # Map the sets to tensors
    x_train,y_train,x_valid,y_valid = map(tensor, (x_train,y_train,x_valid,y_valid))

    # Flatten the 28 * 28 to match the course
    x_train = torch.flatten(x_train, 1)
    x_valid = torch.flatten(x_valid, 1)

    # Return in the order callers unpack: inputs then targets, train then valid
    return x_train, y_train, x_valid, y_valid

def normalize(x, m, s): return (x-m)/s

In [5]:
x_train,y_train,x_valid,y_valid = get_data()

train_mean,train_std = x_train.float().mean(),x_train.float().std()
train_mean,train_std

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


(tensor(33.3184), tensor(78.5675))

In [6]:
x_train = normalize(x_train, train_mean, train_std)

# NB: Use training, not validation mean for validation set
x_valid = normalize(x_valid, train_mean, train_std)

In [7]:
train_mean,train_std = x_train.mean(),x_train.std()
train_mean,train_std

(tensor(1.8892e-08), tensor(1.))

In [8]:
# Let's check out the shapes
n,m = x_train.shape
c = y_train.max()+1
n,m,c

(60000, 784, tensor(10, dtype=torch.uint8))

In [None]:
x_valid.shape

torch.Size([10000, 784])

Now we're talking!
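
As a quick sanity check of the normalisation step, here is a minimal sketch on synthetic data. The tensors `fake_train` and `fake_valid` are made-up stand-ins for the MNIST pixels, not the real dataset. It shows that normalising with the training statistics gives the training set a mean of 0 and a $\sigma$ of 1, while the validation set lands close to, but not exactly on, those values:

```python
import torch

def normalize(x, m, s): return (x - m) / s

torch.manual_seed(0)

# Hypothetical stand-ins for pixel data in [0, 255]
fake_train = torch.rand(1000, 784) * 255
fake_valid = torch.rand(200, 784) * 255

# Statistics are computed on the training set only
m, s = fake_train.mean(), fake_train.std()
fake_train_n = normalize(fake_train, m, s)
fake_valid_n = normalize(fake_valid, m, s)

print(fake_train_n.mean(), fake_train_n.std())  # very close to 0 and 1
print(fake_valid_n.mean(), fake_valid_n.std())  # near 0 and 1, but not exact
```

Reusing the training statistics keeps both sets on the same scale, which is what the model will see at inference time.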



---



## **Foundations (v1.0)**

In [9]:
# num hidden nodes
nh = 50

The aim of weight initialization is to prevent layer activation outputs from exploding or vanishing during the course of a forward pass through a deep neural network. If either occurs, loss gradients will either be too large or too small to flow backwards beneficially, and the network will take longer to converge, if it is even able to do so at all (think of what this looks like in a geometric sense, with gradients along a curve).

We can read more about weight initialisation specifically [here](https://towardsdatascience.com/weight-initialization-in-neural-networks-a-journey-from-the-basics-to-kaiming-954fb9b47c79).
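
To see exploding and vanishing activations in practice, here is a minimal sketch that pushes a random batch through a stack of purely linear layers. The layer count, width and batch size are arbitrary choices for illustration, and there is no activation function or bias, so this is a simplification of a real network:

```python
import math
import torch

torch.manual_seed(0)
n_layers, width = 20, 256  # arbitrary sizes chosen for illustration

def final_std(scale):
    """Std of the activations after n_layers linear layers whose weights have std `scale`."""
    x = torch.randn(64, width)
    for _ in range(n_layers):
        x = x @ (torch.randn(width, width) * scale)
    return x.std().item()

print(final_std(1.0))                   # explodes to a huge value
print(final_std(0.01))                  # vanishes towards zero
print(final_std(1 / math.sqrt(width)))  # stays on the order of 1
```

Only the last scale, $1/\sqrt{\text{width}}$, keeps the activations in a usable range, which is the idea behind the initialisation schemes discussed in this section.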

In [10]:
# 2 layers, so we need 2 weights and 2 biases (simplified Kaiming init)
w1 = torch.randn(m,nh)/math.sqrt(m)
b1 = torch.zeros(nh)
w2 = torch.randn(nh,1)/math.sqrt(nh)
b2 = torch.zeros(1)

What's happening above is a simplified version of Kaiming initialisation: `torch.randn` draws from a standard normal distribution (mean 0, variance 1), and dividing by $\sqrt{m}$ scales the standard deviation of the weights down to $1/\sqrt{m}$. Kaiming is a modification of Xavier initialisation, which in its uniform form sets a layer's weights to values drawn from a uniform distribution bounded between

$$\pm\frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}}$$

where $n_i$ is the number of incoming network connections, or “fan-in,” to the layer, and $n_{i + 1}$ is the number of outgoing network connections from that layer, also known as the “fan-out.”
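
As a rough illustration of the Xavier bound, the sketch below uses only the Python standard library, so it runs independently of the notebook state. The helper name `xavier_uniform_bound` is hypothetical, and the fan-in and fan-out are taken from the first layer of this notebook (784 inputs, 50 hidden nodes):

```python
import math
import random
import statistics

def xavier_uniform_bound(fan_in, fan_out):
    """Bound b such that weights are drawn from U(-b, b)."""
    return math.sqrt(6) / math.sqrt(fan_in + fan_out)

fan_in, fan_out = 784, 50  # m inputs, nh hidden nodes
bound = xavier_uniform_bound(fan_in, fan_out)

random.seed(0)
weights = [random.uniform(-bound, bound) for _ in range(fan_in * fan_out)]

# A uniform distribution on (-b, b) has std b / sqrt(3),
# which here equals sqrt(2 / (fan_in + fan_out))
print(bound)
print(statistics.pstdev(weights), math.sqrt(2 / (fan_in + fan_out)))
```

The $\sqrt{6}$ in the bound is there so that the resulting standard deviation works out to $\sqrt{2 / (n_i + n_{i+1})}$.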

In [11]:
def linear_layer(x, w, b):
  """
  Creates a linear layer with a matrix multiplication of 
  x and w, plus the bias
  """
  return x@w + b

In [None]:
layer_activation = linear_layer(x_train, w1, b1)

Because the inputs are normalised and the weights are scaled by $1/\sqrt{m}$, the output of the layer should also have a mean close to 0 and a $\sigma$ close to 1:

In [None]:
# we used kaiming init, which is designed to do this
layer_activation.mean(), layer_activation.std()

(tensor(0.0142), tensor(0.9757))

Now let's define ReLU for our activation:

In [12]:
# clamp_min replaces negatives with zeros
def relu(x): return x.clamp_min(0.)

But think about this geometrically. Let's say we plotted all our activations on the $x$ and $y$ axes. We know the activations already have a mean of 0 and a $\sigma$ of 1, but now ReLU went and replaced all the values in the negative portions of the $x$ and $y$ axes with 0 (so it "pulled" them toward the origin). 

We can't reasonably expect that the mean and $\sigma$ will still be the normalised values we want, can we?
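
To put numbers on this, here is a minimal sketch that assumes the pre-activations are exactly standard normal, which is an idealisation of what the linear layer produces. It uses only the standard library. Under that assumption, a rectified standard normal has mean $1/\sqrt{2\pi}$ and variance $\frac{1}{2} - \frac{1}{2\pi}$:

```python
import math
import random
import statistics

def relu(x): return max(x, 0.0)

random.seed(0)
samples = [random.gauss(0, 1) for _ in range(200_000)]
rectified = [relu(x) for x in samples]

mean = statistics.fmean(rectified)
std = statistics.pstdev(rectified)

# Closed-form values for a rectified standard normal
expected_mean = 1 / math.sqrt(2 * math.pi)          # about 0.40
expected_std = math.sqrt(0.5 - 1 / (2 * math.pi))   # about 0.58

print(mean, expected_mean)
print(std, expected_std)
```

So even with perfectly normalised inputs to ReLU, the mean shifts up and the spread shrinks.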

In [None]:
layer_activation = relu(linear_layer(x_train, w1, b1))

# but now we don't have the right mean and stdev
layer_activation.mean(), layer_activation.std()

(tensor(0.3912), tensor(0.5861))

No dice!

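The fix is Kaiming initialisation. The PyTorch docs for `init.kaiming_normal_` describe its parameter `a` as `a: the negative slope of the rectifier used after this layer (0 for ReLU by default)`, and the weights are drawn from a normal distribution with

$$\text{std} = \sqrt{\frac{2}{(1 + a^2) \times \text{fan\_in}}}$$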

This was introduced by *He et al.* in [Delving Deep into Rectifiers](https://arxiv.org/abs/1502.01852), which was also the first paper that claimed "super-human performance" on ImageNet. (The paper also introduced the PReLU activation; ResNets came from the same authors in a later paper.)

The interesting insight the team made in this paper, regarding initialisation in particular, was to replace the 1 in the numerator with a 2, which compensates for ReLU zeroing out roughly half of the activations!
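
One way to see why the 2 matters: for a zero-mean, symmetric pre-activation $y$, ReLU discards the negative half, so $E[\text{relu}(y)^2] = \frac{1}{2}\text{Var}(y)$. The sketch below stacks ReLU layers and compares a numerator of 1 against a numerator of 2. The width, depth and batch size are arbitrary values chosen for illustration, and the helper name `second_moment_after` is hypothetical:

```python
import math
import torch

torch.manual_seed(0)
width, n_layers = 512, 20  # arbitrary sizes for illustration

def second_moment_after(numerator):
    """Mean of squared activations after n_layers of relu(x @ w),
    with weights of std sqrt(numerator / width)."""
    x = torch.randn(256, width)
    for _ in range(n_layers):
        w = torch.randn(width, width) * math.sqrt(numerator / width)
        x = (x @ w).clamp_min(0.)
    return x.pow(2).mean().item()

print(second_moment_after(1))  # shrinks by roughly half per layer
print(second_moment_after(2))  # stays on the order of 1
```

With a numerator of 1 the signal halves at every layer and has all but vanished after 20 layers; with a numerator of 2 the halving is cancelled out.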

In [None]:
# We can just change the weight initialisation from before by "adding a 2"
w1 = torch.randn(m,nh)*math.sqrt(2/m)

w1.mean(), w1.std()

(tensor(-0.0003), tensor(0.0504))

In [None]:
# Let's try again (although these runs can be random)
layer_activation = relu(linear_layer(x_train, w1, b1))

layer_activation.mean(), layer_activation.std()

(tensor(0.5439), tensor(0.8457))

That's closer, although still not at the 0 mean and 1 $\sigma$ we're looking for (the mean can't reach 0, since ReLU outputs are never negative). We can replace our manual scaling with PyTorch's built-in Kaiming initialiser and, just to make sure we're not cheating, run the same layer again and see if we get a similar result:

In [1]:
from torch.nn import init

In [13]:
w1 = torch.zeros(m,nh)
init.kaiming_normal_(w1, mode='fan_out')

layer_activation = relu(linear_layer(x_train, w1, b1))
layer_activation.mean(), layer_activation.std()

(tensor(0.4811), tensor(0.7848))

Great! But what does `mode='fan_out'` really mean? According to the PyTorch docs, `fan_out` preserves the magnitude of the variance in the backward pass of the net. `fan_in`, by contrast, preserves it in the forward pass.

That's kind of strange though: we're checking the activations on the forward pass, so shouldn't we be using `fan_in`? Let's take a look at PyTorch's linear layer.

In [15]:
torch.nn.functional.linear??

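It turns out that PyTorch's linear layer stores its weights with shape `(out_features, in_features)` and transposes them before doing the forward pass! Our `w1` has shape `(m, nh)`, i.e. `(in, out)`, so the dimension that `init` treats as `fan_out` is actually our fan-in.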
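
The sketch below illustrates the two weight layouts side by side. For a 2D tensor, `torch.nn.init` takes `fan_out` from dimension 0 and `fan_in` from dimension 1, so `mode='fan_out'` on our `(m, nh)` layout and `mode='fan_in'` on PyTorch's `(nh, m)` layout should both produce a standard deviation of about $\sqrt{2/m}$. The variable names `w_ours` and `w_torch` are illustrative only:

```python
import math
import torch
from torch.nn import init

torch.manual_seed(0)
m, nh = 784, 50

# Our convention: weights stored as (in_features, out_features), used as x @ w
w_ours = torch.empty(m, nh)
init.kaiming_normal_(w_ours, mode='fan_out')  # dim 0 (= m here) is treated as fan_out

# PyTorch convention: weights stored as (out_features, in_features)
w_torch = torch.empty(nh, m)
init.kaiming_normal_(w_torch, mode='fan_in')  # dim 1 (= m here) is treated as fan_in

print(w_ours.std().item(), w_torch.std().item(), math.sqrt(2 / m))

# F.linear applies the transpose itself: it computes x @ w_torch.t()
x = torch.randn(8, m)
out = torch.nn.functional.linear(x, w_torch)
print(torch.allclose(out, x @ w_torch.t(), atol=1e-4))
```

In other words, the apparent mismatch comes from the storage layout rather than from the initialisation itself.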