## Data Augmentation:



- Random image cropping:
    - Taking multiple crops of the same image (without cropping out the meaningful content), passing each through the network, then averaging the resulting losses lets the network learn more from each image
    - In the case of AlexNet, the first convolutional layer uses filters with a stride of 4. Even if some input features are skipped for one crop, applying the convolution to multiple random crops should cover the input features that were skipped
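
The cropping idea above can be sketched in a few lines of NumPy. This is a minimal illustration rather than AlexNet's actual pipeline: the crop size, the number of crops, and `toy_loss` (a hypothetical stand-in for a real network's loss) are all assumptions made for the example.

```python
import numpy as np

def random_crops(image, crop_size, num_crops, rng):
    """Return `num_crops` random square crops of an (H, W, C) image."""
    height, width = image.shape[:2]
    crops = []
    for _ in range(num_crops):
        # Pick a top-left corner so the crop stays inside the image.
        top = rng.integers(0, height - crop_size + 1)
        left = rng.integers(0, width - crop_size + 1)
        crops.append(image[top:top + crop_size, left:left + crop_size])
    return np.stack(crops)

def averaged_crop_loss(image, label, loss_fn, crop_size=224, num_crops=8, seed=0):
    """Average `loss_fn` over several random crops of one image."""
    rng = np.random.default_rng(seed)
    crops = random_crops(image, crop_size, num_crops, rng)
    losses = [loss_fn(crop, label) for crop in crops]
    return float(np.mean(losses))

def toy_loss(crop, label):
    # Hypothetical stand-in for "forward pass + loss" of a real network.
    return abs(float(crop.mean()) - label)

image = np.random.default_rng(1).random((256, 256, 3))
loss = averaged_crop_loss(image, label=0.5, loss_fn=toy_loss)
```

In a real training loop, `loss_fn` would run the network on the crop and the averaged loss would be backpropagated.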

## Dealing with deeper networks:
- 10+ layers:
    - Weight initialisation &mdash; choosing initial weights that make backpropagation work better. Normally, we initialise weights by drawing values at random from a normal distribution with $\mu = 0$. With weight initialisation, we select the standard deviation $\sigma$ more carefully.
    - Batch normalisation &mdash;
- 30+ layers:
    - Skip connections &mdash;
- 100+ layers:
    - Identity skip connections &mdash;



### Weight Initialisation


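Suppose $y = \sum_{k=1}^{n} w_k x_k$. If the $w_k$ and $x_k$ are all independent with zero mean, each $w_k$ having variance $\text{Var}(w)$ and each $x_k$ having variance $\text{Var}(x)$, then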
$$
    \text{Var}(y) = n\text{Var}(w)\text{Var}(x). \tag{1}
$$
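
Equation $(1)$ can be checked numerically. The sketch below assumes zero-mean Gaussian $w_k$ and $x_k$; the values of `n`, `sigma_w`, and `sigma_x` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 20_000
sigma_w, sigma_x = 0.3, 2.0

# Each row is one independent draw of (w_1..w_n) and (x_1..x_n).
w = rng.normal(0.0, sigma_w, size=(trials, n))
x = rng.normal(0.0, sigma_x, size=(trials, n))
y = (w * x).sum(axis=1)

empirical = y.var()                       # sample variance of y
predicted = n * sigma_w**2 * sigma_x**2   # right-hand side of (1)
print(empirical, predicted)
```

The two printed values should agree to within a few percent; the gap shrinks as `trials` grows.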

Consider one layer $(i)$ of a deep network, with values:
- weights $w_{jk}^{(i)}$
- activations $x_{k}^{(i)}$ coming in from the previous layer, for $1 \leq k \leq n_{i}$, where $n_i$ is the number of nodes in layer $(i)$
- activations $x_{j}^{(i+1)}$ at the next layer, for $1 \leq j \leq n_{i+1}$, where $n_{i+1}$ is the number of nodes in layer $(i+1)$

The activation value at each node in the next layer is given by:

$$
    x_j^{(i+1)} = \texttt{activation}(\sum_{k=1}^{n_i} w_{jk}^{(i)} x_{k}^{(i)}). \tag{2}
$$

From $(1)$ and $(2)$, we have 

$$
    \text{Var}(\sum_{k=1}^{n_i} w_{jk}^{(i)} x_{k}^{(i)}) = n_i \text{Var}(w_{jk}^{(i)}) \text{Var}(x_{k}^{(i)}),
$$

$$
    \text{Var}(x^{(i+1)}) \approx G_0 n_i \text{Var}(w^{(i)}) \text{Var}(x^{(i)}), \tag{3}
$$
where $G_0$ is a constant that accounts for the activation function; for $\texttt{ReLU}$ we normally just set $G_0 = \frac{1}{2}$.

Suppose a network has $D$ layers, with input vector $x = x^{(1)}$ and output $z$. Applying $(3)$ across all $D$ layers, we have

$$
    \text{Var}(z) \approx \underbrace{\prod_{i=1}^{D} \big( G_0 n_i \text{Var}(w^{(i)}) \big)}_{\text{We want this to be } \approx 1} \cdot \text{Var}(x). \tag{4}
$$

If the factors in the product in $(4)$ are consistently less than 1, the activations decay exponentially with depth as they propagate from input towards output. If the factors are consistently greater than 1, the activations grow exponentially going from the input layer to the output layer.
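
The decay and growth described by $(4)$ can be seen in a small simulation. This is a sketch under stated assumptions: a fully connected $\texttt{ReLU}$ network with equal layer widths and Gaussian weights, tracking the mean squared activation (the quantity for which the $\texttt{ReLU}$ factor of $\frac{1}{2}$ applies when pre-activations are symmetric about zero). The function name, width, and depth are illustrative.

```python
import numpy as np

def forward_second_moments(depth, width, weight_std, seed=0):
    """Mean squared activation after each layer of a random ReLU network."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, size=(width, 256))  # 256 random input vectors
    moments = [float(np.mean(x**2))]
    for _ in range(depth):
        w = rng.normal(0.0, weight_std, size=(width, width))
        x = np.maximum(w @ x, 0.0)               # linear layer followed by ReLU
        moments.append(float(np.mean(x**2)))
    return moments

width, depth = 256, 20
# Per-layer factor G_0 * n * Var(w) with G_0 = 1/2:
too_small = forward_second_moments(depth, width, np.sqrt(1.0 / width))  # factor 1/2
balanced = forward_second_moments(depth, width, np.sqrt(2.0 / width))   # factor 1
too_large = forward_second_moments(depth, width, np.sqrt(4.0 / width))  # factor 2

print(too_small[-1], balanced[-1], too_large[-1])
```

With these settings the final values should differ by many orders of magnitude, while the balanced case stays near its starting scale.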

Similarly, for backpropagation we have

$$
    \text{Var}(\frac{\partial}{\partial x}) \approx \underbrace{\prod_{i=1}^{D} \big( G_1 n_i \text{Var}(w^{(i)}) \big)}_{\text{We want this to be } \approx 1} \cdot \text{Var}(\frac{\partial}{\partial z}), \tag{5}
$$

where we likewise want the gradients not to grow or decay exponentially across layers.
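
The backward counterpart in $(5)$ can be simulated the same way. The sketch below assumes an equal-width fully connected $\texttt{ReLU}$ network, uses a random vector as a stand-in for the gradient at the output, and pushes it back through the layers by hand; all names and sizes are illustrative.

```python
import numpy as np

def backward_gradient_moments(depth, width, weight_std, seed=0):
    """Mean squared gradient at each layer, from output back to input."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, size=(width, 128))
    weights, masks = [], []
    for _ in range(depth):                      # forward pass, storing ReLU masks
        w = rng.normal(0.0, weight_std, size=(width, width))
        pre = w @ x
        x = np.maximum(pre, 0.0)
        weights.append(w)
        masks.append(pre > 0)
    grad = rng.normal(0.0, 1.0, size=x.shape)   # stand-in for the gradient at z
    moments = [float(np.mean(grad**2))]
    for w, mask in zip(reversed(weights), reversed(masks)):
        grad = w.T @ (grad * mask)              # chain rule through ReLU, then linear
        moments.append(float(np.mean(grad**2)))
    return moments

width, depth = 256, 20
shrinking = backward_gradient_moments(depth, width, np.sqrt(1.0 / width))
stable = backward_gradient_moments(depth, width, np.sqrt(2.0 / width))
print(shrinking[-1], stable[-1])
```

With the smaller standard deviation the gradient reaching the input should be vanishingly small, while the second setting keeps it near its original scale.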

*Weight initialisation* therefore comes down to choosing the variance of the weights $w_{jk}^{(i)}$ in every layer $i$ such that:

$$
    G_1 n_i \text{Var}(w^{(i)}) = 1, \tag{6}
$$
where usually $G_1 = \frac{1}{2}$, which gives $\text{Var}(w^{(i)}) = \frac{2}{n_i}$, i.e. $\sigma = \sqrt{2 / n_i}$.
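
Solving $(6)$ for the standard deviation gives $\sigma = \sqrt{1 / (G_1 n_i)}$. A minimal sketch of an initialiser built on that rule follows; the function names and the `(fan_out, fan_in)` weight layout are assumptions for illustration. With `gain=0.5` this matches what is commonly called He initialisation for $\texttt{ReLU}$ networks.

```python
import math
import numpy as np

def init_std(fan_in, gain=0.5):
    """Standard deviation sigma satisfying gain * fan_in * sigma**2 = 1."""
    return math.sqrt(1.0 / (gain * fan_in))

def init_weights(fan_in, fan_out, gain=0.5, seed=0):
    """Draw a (fan_out, fan_in) weight matrix from N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, init_std(fan_in, gain), size=(fan_out, fan_in))

w = init_weights(fan_in=512, fan_out=256)
print(init_std(512), w.std())
```

Here `fan_in` plays the role of $n_i$, so wider layers get proportionally smaller initial weights.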


#### Example:
<img src="images/weight-initialisation-1.png" width="50%">
<em><p style="text-align: center;">Error on the y axis</p></em>

The above graph shows the difference in training speed for a 22-layer $\texttt{ReLU}$ network when suboptimal initial weights are chosen (blue line) versus when the weights are chosen according to $(6)$.


### Batch Normalisation:

