# Weight initialization

## Numerical problems

* Vanishing gradients
* Exploding gradients
* Breaking symmetry

For details see {cite:p}`zhang2023dive`, [section 5.4](https://d2l.ai/chapter_multilayer-perceptrons/numerical-stability-and-init.html).
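
The symmetry problem can be illustrated with a small NumPy sketch. It assumes a hypothetical two-layer network with a $\tanh$ hidden layer and squared-error loss (the sizes, the constant `0.5`, and the helper `hidden_weight_grad` are chosen only for illustration). If all weights start from the same constant, every hidden unit receives the same gradient, so the units can never become different from each other; random initialization breaks this tie.

```python
import numpy as np

def hidden_weight_grad(W1, W2, x, y):
    """Gradient of 0.5 * ||tanh(x W1) W2 - y||^2 with respect to W1."""
    h = np.tanh(x @ W1)               # hidden activations, shape (batch, n_hidden)
    err = h @ W2 - y                  # output residual, shape (batch, n_out)
    dh = (err @ W2.T) * (1 - h ** 2)  # backpropagate through tanh
    return x.T @ dh                   # shape (n_in, n_hidden), one column per hidden unit

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 1
x = rng.normal(size=(8, n_in))
y = rng.normal(size=(8, n_out))

# Constant initialization: all hidden units start identical.
grad_const = hidden_weight_grad(
    np.full((n_in, n_hidden), 0.5), np.full((n_hidden, n_out), 0.5), x, y
)
# Random initialization: hidden units start different.
grad_rand = hidden_weight_grad(
    rng.normal(size=(n_in, n_hidden)), rng.normal(size=(n_hidden, n_out)), x, y
)

# Compare every column of the gradient with the first column.
const_columns_equal = np.allclose(grad_const, grad_const[:, :1])
rand_columns_equal = np.allclose(grad_rand, grad_rand[:, :1])
print("constant init, identical gradients:", const_columns_equal)
print("random init, identical gradients:  ", rand_columns_equal)
```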

## Random initialization

[ML Handbook chapter](https://education.yandex.ru/handbook/ml/article/beta-tonkosti-obucheniya)

How should we initialize the weights $\boldsymbol W \in \mathbb R^{n_{\mathrm{in}} \times n_{\mathrm{out}}}$ of a linear layer?

* To preserve zero mean:
    
    $$
        \mathbb Ew_{ij} = 0
    $$

* To preserve variance during the forward pass:
    
    $$
        \mathbb V w_{ij} = \frac 1{n_{\mathrm{in}}}
    $$
    
* To preserve variance during the backward pass:
    
    $$
        \mathbb V w_{ij} = \frac 1{n_{\mathrm{out}}}
    $$
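
The two variance conditions above can be checked numerically. The sketch below assumes a single linear layer $\boldsymbol y = \boldsymbol x \boldsymbol W$ with zero-mean Gaussian weights and unit-variance inputs and upstream gradients; the sizes and the helper `variance_ratios` are hypothetical. It measures how much the layer rescales the variance in each direction.

```python
import numpy as np

def variance_ratios(n_in, n_out, weight_var, batch=2000, seed=0):
    """Return (forward, backward) variance ratios for one linear layer y = x W."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, np.sqrt(weight_var), size=(n_in, n_out))
    x = rng.normal(size=(batch, n_in))    # unit-variance inputs
    g = rng.normal(size=(batch, n_out))   # unit-variance gradients w.r.t. the output
    forward = (x @ W).var() / x.var()     # Var(output) / Var(input)
    backward = (g @ W.T).var() / g.var()  # Var(grad w.r.t. input) / Var(grad w.r.t. output)
    return forward, backward

n_in, n_out = 256, 1024
fwd_a, bwd_a = variance_ratios(n_in, n_out, 1 / n_in)
fwd_b, bwd_b = variance_ratios(n_in, n_out, 1 / n_out)
print(f"Var w = 1/n_in : forward {fwd_a:.2f}, backward {bwd_a:.2f}")
print(f"Var w = 1/n_out: forward {fwd_b:.2f}, backward {bwd_b:.2f}")
```

With $n_{\mathrm{in}} \ne n_{\mathrm{out}}$, each choice should keep one ratio close to $1$ while the other is off by roughly a factor of $n_{\mathrm{out}} / n_{\mathrm{in}}$ or its inverse.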

The last two conditions can hold simultaneously only if $n_{\mathrm{in}} = n_{\mathrm{out}}$. In {cite:p}`Glorot2010UnderstandingTD`, a compromise was suggested:

$$
    \mathbb V w_{ij} = \frac 2{n_{\mathrm{in}} + n_{\mathrm{out}}}.
$$

In particular, the authors proposed sampling the weights from the uniform distribution

```{math}
:label: glorot-init
    w_{ij} \sim U\bigg[-\sqrt{\frac 6{n_{\mathrm{in}} + n_{\mathrm{out}}}}, \sqrt{\frac 6{n_{\mathrm{in}} + n_{\mathrm{out}}}}\bigg]
```

**Q.** Why such strange numbers?
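
A hint: the variance of $U[-a, a]$ is $\frac{a^2}{3}$, so $a = \sqrt{\frac 6{n_{\mathrm{in}} + n_{\mathrm{out}}}}$ gives exactly the variance $\frac 2{n_{\mathrm{in}} + n_{\mathrm{out}}}$. A minimal NumPy sketch to check this empirically (the function name `glorot_uniform` and the layer sizes are illustrative assumptions, not taken from any library):

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    """Sample W of shape (n_in, n_out) from U[-a, a] with a = sqrt(6 / (n_in + n_out))."""
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

rng = np.random.default_rng(0)
n_in, n_out = 300, 500
W = glorot_uniform(n_in, n_out, rng)

target_var = 2.0 / (n_in + n_out)  # the compromise variance
print(f"empirical mean:     {W.mean():.5f}")
print(f"empirical variance: {W.var():.6f}")
print(f"target variance:    {target_var:.6f}")
```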

Initialization {eq}`glorot-init` works well when the activation function has a range symmetric about zero (e.g., $\psi(x) = \tanh(x)$). For the ReLU activation, {cite:p}`He2015DelvingDI` suggests

$$
    w_{ij} \sim \mathcal N\Big(0, \frac 2{n_{\mathrm{in}}}\Big).
$$
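
Here the second parameter of $\mathcal N$ is the variance, so the standard deviation is $\sqrt{2 / n_{\mathrm{in}}}$. The factor $2$ compensates for ReLU zeroing out roughly half of the pre-activations. The sketch below illustrates this under simplifying assumptions: a hypothetical stack of equal-width layers without biases, unit-variance Gaussian input, and an illustrative helper `preactivation_variances`. It compares the variance $\frac 2{n_{\mathrm{in}}}$ with the variance $\frac 1{n_{\mathrm{in}}}$.

```python
import numpy as np

def preactivation_variances(weight_var_fn, width=512, depth=20, batch=1000, seed=0):
    """Variance of the pre-activations after each layer of a ReLU network."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(batch, width))  # unit-variance pre-activations before the first ReLU
    variances = []
    for _ in range(depth):
        std = np.sqrt(weight_var_fn(width))
        W = rng.normal(0.0, std, size=(width, width))
        z = np.maximum(z, 0.0) @ W       # ReLU followed by a linear layer
        variances.append(z.var())
    return variances

he = preactivation_variances(lambda n_in: 2.0 / n_in)
naive = preactivation_variances(lambda n_in: 1.0 / n_in)
print(f"Var w = 2/n_in: variance after the last layer {he[-1]:.3f}")
print(f"Var w = 1/n_in: variance after the last layer {naive[-1]:.2e}")
```

With variance $\frac 2{n_{\mathrm{in}}}$ the signal should stay on the order of $1$ throughout the stack, while with $\frac 1{n_{\mathrm{in}}}$ it is expected to shrink by about a factor of $2$ per layer.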