Why you need good init
---

Original material [link](https://github.com/fastai/course-v3/blob/master/nbs/dl2/02b_initializing.ipynb)

1. Suppose you have two tensors, one for the data and the other for the coefficients (weights). Repeat the calculation 100 times, feeding each result back in as the next input.
2. Check the mean and standard deviation.
3. Check when the values become NaN (after how many loops).
4. Scale the weights down by a small factor (0.1 in the code below) and see what happens.


In [1]:
import torch

In [2]:
n = 512
x = torch.randn(n)
a = torch.randn(n,n)

In [10]:
def describe(x): return(x.type(), f"mean: {x.mean()}, std: {x.std()}")

In [5]:
for i in range(100): x = a @ x

In [11]:
describe(x)

('torch.FloatTensor', 'mean: nan, std: nan')

In [16]:
x = torch.randn(n)
a = torch.randn(n,n)

for i in range(100):
    x = a @ x
    if not x.std() == x.std(): print(i); break

28


[^2]: Why did they check with the std and not the mean?

In [17]:
x = torch.randn(n)
a = torch.randn(n,n)

for i in range(100):
    x = a @ x
    if not x.mean() == x.mean(): print(i); break

27


---

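Wait, how do we compare values when one of them is NaN? NaN is never equal to anything, including itself, so `v == v` is False exactly when `v` is NaN. That is why the loops above test `x.std() == x.std()`.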

In [None]:
a = float('NaN')

In [None]:
b = a
a == b

False
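The comparison above returns False because NaN is defined to be unequal to every value, itself included. A minimal sketch of the same check using only the standard library (the helper name `is_nan` is made up for illustration):

```python
import math

nan = float("nan")

# NaN is the only float value that is not equal to itself (IEEE 754).
self_equal = (nan == nan)


def is_nan(v):
    """Self-inequality test, the same idea as `x.std() == x.std()` above."""
    return v != v


# Compare the self-inequality trick with the explicit library check.
checks = [is_nan(nan), is_nan(1.0), math.isnan(nan)]
print(self_equal, checks)
```

In PyTorch the explicit form of this test is `torch.isnan(x).any()`.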

In [18]:
a = torch.randn(n, n) * 0.1
x = torch.randn(n)

for i in range(100):
    x = a @ x
    if not x.mean()==x.mean(): print(i); break

In [19]:
describe(x)

('torch.FloatTensor',
 'mean: 7.452759524637429e+34, std: 2.1790210271840137e+36')
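The numbers above can be roughly predicted. The sketch below assumes that one `x = a @ x` step multiplies the typical size of `x` by about `weight_std * sqrt(n)`, where `weight_std` is the standard deviation of the entries of `a`. This is a heuristic, not an exact result, and the function names are made up for illustration.

```python
import math

n = 512
FLOAT32_MAX = 3.4028235e38  # largest finite float32 value


def growth_per_step(weight_std, n):
    """Rough factor by which the scale of x grows in one `x = a @ x` step,
    assuming zero-mean entries of a with standard deviation weight_std."""
    return weight_std * math.sqrt(n)


def steps_until_overflow(weight_std, n):
    """Rough number of steps before values of order 1 leave the float32 range."""
    return math.log(FLOAT32_MAX) / math.log(growth_per_step(weight_std, n))


def log10_scale_after(steps, weight_std, n):
    """Rough order of magnitude (log10) of x after the given number of steps."""
    return steps * math.log10(growth_per_step(weight_std, n))


overflow_steps = steps_until_overflow(1.0, n)      # unscaled weights
magnitude_small = log10_scale_after(100, 0.1, n)   # weights scaled by 0.1
print(overflow_steps, magnitude_small)
```

Under these assumptions the estimate is a little over 28 steps for unscaled weights, in line with the 27 and 28 printed above. With the 0.1 factor the values still grow by about 2.26 per step, which after 100 steps lands within an order of magnitude or so of the std shown. A factor of `1 / sqrt(n)` would make the growth per step equal to 1.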

Write-out Answer
---

1. Write down three strategies for initializing a weight matrix.
2. Explain which one Xavier used.

1. Answer
- Orthogonal initialization
- Make x and a@x have the same scale (the one Xavier used)
- ???

2. Xavier divided the initial values by the square root of the number of inputs (i.e. `sqrt(n)`, as in `torch.randn(n, n) / sqrt(n)`).


The magic number for scaling
---

5. Scale the weights by the Xavier magic number (`1 / sqrt(n)`) and see what happens after 100 loops.
6. Suppose you have the equation $$y_{i} = a_{i,0} x_{0} + a_{i,1} x_{1} + \cdots + a_{i,n-1} x_{n-1} = \sum_{k=0}^{n-1} a_{i,k} x_{k}$$ and express $y_i$ in code.


In [21]:
from math import sqrt

In [27]:
a = torch.randn(n, n) / sqrt(n)
x = torch.randn(n)

for i in range(100):
    x = a @ x
describe(x)

('torch.FloatTensor', 'mean: 0.07117890566587448, std: 1.7585952281951904')

In [28]:
y_1 = sum(a_k * x_k for a_k, x_k in zip(a[1], x))  # row 1 of a dotted with x

In [29]:
(a @ x)[1]

tensor(-1.8616)

In [30]:
y_1

tensor(-1.8616)

7. Suppose you compute $y = a @ x$ 100 times, re-initializing a and x for every computation. Get the average mean and average variance of $y$.
8. Get the average mean and variance of a single product $a_{i,k} x_{k}$.

In [31]:
mean, std = 0., 0.
for i in range(100):
    x, a = torch.randn(n), torch.randn(n,n)
    y = a @ x
    mean += y.mean()
    std += y.std()
mean / 100, std / 100    

(tensor(-0.0004), tensor(22.6223))

In [32]:
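std.pow(2)

tensor(5117677.)

Note that `std` is still the running sum over the 100 runs here, so this squares 2262.23. The averaged value is `(std / 100).pow(2)`, roughly 22.62 squared, or about 512, which is `n`.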
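The averaged std of about 22.62 printed above is close to $\sqrt{512} \approx 22.63$. A short calculation shows why, assuming the entries of $a$ and $x$ are independent with zero mean and unit variance (as they are when drawn with `torch.randn`):

$$\operatorname{Var}(y_i) = \sum_{k=0}^{n-1} \operatorname{Var}(a_{i,k} x_{k}) = \sum_{k=0}^{n-1} E[a_{i,k}^2]\, E[x_{k}^2] = n$$

The cross terms vanish because the products $a_{i,k} x_{k}$ are uncorrelated, so the standard deviation of $y_i$ is $\sqrt{n}$. Dividing $a$ by $\sqrt{n}$ divides the variance by $n$ and brings it back to 1.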

In [42]:
mean, std, var = 0., 0., 0.
for i in range(100):
    x, a = torch.randn(1), torch.randn(1)
    y = a * x
    mean += y.mean()
    std += y.std()
    var += y.pow(2).mean()

mean / 100, std / 100, var/100

(tensor(-0.2147), tensor(nan), tensor(0.8951))

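[^1]: Why is the standard deviation NaN? Each `y` here holds a single element, and `std()` uses the unbiased estimator, which divides by n - 1 = 0.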
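The NaN in the std column can be reproduced without PyTorch. The sketch below is a hand-written stand-in for a standard deviation with and without the n - 1 correction. The function `sample_std` is hypothetical and is not PyTorch's implementation.

```python
import math


def sample_std(values, unbiased=True):
    """Standard deviation of a list; with unbiased=True the sum of squared
    deviations is divided by n - 1, otherwise by n."""
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    denom = n - 1 if unbiased else n
    if denom == 0:
        # 0 / 0 in floating point is NaN; Python raises instead, so return it explicitly.
        return float("nan")
    return math.sqrt(squared_dev / denom)


single = [0.7]                                    # one sample, like y = a * x with randn(1)
std_unbiased = sample_std(single)                 # n - 1 = 0, so no estimate exists
std_biased = sample_std(single, unbiased=False)   # divides by n = 1
second_moment = sum(v * v for v in single) / len(single)
print(std_unbiased, std_biased, second_moment)
```

The second moment, computed as `y.pow(2).mean()` in the cell above, is still defined for a single element, which is why the `var` column has a value while the `std` column does not.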

Adding ReLU in the mix
---

9. Do the same thing as No. 8, except this time apply ReLU.
    - Can you explain why that number came out? (Hint: you can apply Kaiming initialization.)

10. Same as No. 9, but this time use the whole matrix (i.e. $y$, where $y_i = \sum_{k=0}^{n-1} a_{i,k} x_{k}$).

11. Same as No. 10, but this time use Kaiming scaling (initialization).

No. 9 (single product with ReLU)

$a_{i,k} x_{k}$ without Kaiming scaling

In [52]:
mean, std, var = 0., 0., 0.
for i in range(100):
    x = torch.randn(1)
    a = torch.randn(1)  # no sqrt(2) factor here (without Kaiming scaling)
    y = (a * x).clamp_min(0.)
    mean += y.mean()
    std += y.std()
    var += y.pow(2).mean()
mean / 100, std / 100, var / 100    

(tensor(0.2762), tensor(nan), tensor(0.3445))

With Kaiming scaling (weights multiplied by `sqrt(2)`)

In [53]:
mean, std, var = 0., 0., 0.
for i in range(100):
    x = torch.randn(1)
    a = torch.randn(1) * sqrt(2)
    y = (a * x).clamp_min(0.)
    mean += y.mean()
    std += y.std()
    var += y.pow(2).mean()
mean / 100, std / 100, var / 100    

(tensor(0.4582), tensor(nan), tensor(1.1150))
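One way to explain these numbers: ReLU zeroes the negative half of a symmetric distribution, so it halves the second moment. A product of two standard normals has second moment 1, so about 0.5 after ReLU, and the `sqrt(2)` factor doubles it back to about 1. With only 100 samples the estimates above are noisy. The sketch below repeats the experiment with more samples, using only the standard library. The function name and sample count are assumptions made for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so repeated runs give the same estimate


def relu(v):
    return max(v, 0.0)


def relu_product_stats(weight_scale, samples=200_000):
    """Estimate the mean and second moment of relu(a * x), where a and x are
    standard normal and a is additionally multiplied by weight_scale."""
    total, total_sq = 0.0, 0.0
    for _ in range(samples):
        a = random.gauss(0.0, 1.0) * weight_scale
        x = random.gauss(0.0, 1.0)
        y = relu(a * x)
        total += y
        total_sq += y * y
    return total / samples, total_sq / samples


plain_mean, plain_sq = relu_product_stats(1.0)              # no Kaiming factor
kaiming_mean, kaiming_sq = relu_product_stats(math.sqrt(2))  # with sqrt(2)
print(plain_mean, plain_sq, kaiming_mean, kaiming_sq)
```

The expected values are a mean of 1/pi (about 0.318) and a second moment of 0.5 without the factor, and a second moment of 1 with it.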

No. 10 (whole matrix, without Kaiming scaling)

In [54]:
mean, std, var = 0., 0., 0.
for i in range(100):
    x = torch.randn(n)
    a = torch.randn(n, n)
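    y = (a * x).clamp_min(0.)  # elementwise (broadcast) product, so the statistics match the single-product case; a @ x would give the matrix product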
    mean += y.mean()
    std += y.std()
    var += y.pow(2).mean()
mean / 100, std / 100, var / 100    

(tensor(0.3193), tensor(0.6327), tensor(0.5027))

No. 11 (whole matrix, with Kaiming scaling)

In [51]:
mean, std, var = 0., 0., 0.
for i in range(100):
    x = torch.randn(n)
    a = torch.randn(n, n) * sqrt(2/n)
    y = (a@x).clamp_min(0.)
    mean += y.mean()
    std += y.std()
    var += y.pow(2).mean()
mean / 100, std / 100, var / 100

(tensor(0.5656), tensor(0.8269), tensor(1.0061))
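For the matrix product, each $y_i$ is a sum of $n$ products, so its second moment is about $n$ before ReLU and about $n/2$ after it. Scaling the weights by `sqrt(2/n)` should bring that back to roughly 1, as in the output above. The sketch below checks both cases with the standard library only. It uses `n = 64` rather than 512 to keep pure Python fast, and the function name is made up for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so repeated runs give the same estimate


def relu_matvec_second_moment(n, weight_scale, trials=100):
    """Average of relu(a @ x)**2 over fresh n x n Gaussian matrices a
    (multiplied by weight_scale) and fresh Gaussian vectors x."""
    total, count = 0.0, 0
    for _ in range(trials):
        x = [random.gauss(0.0, 1.0) for _ in range(n)]
        for _ in range(n):  # build and use one row of a at a time
            y_i = sum(random.gauss(0.0, 1.0) * weight_scale * x_k for x_k in x)
            y_i = max(y_i, 0.0)  # ReLU
            total += y_i * y_i
            count += 1
    return total / count


n = 64  # assumption: smaller than the 512 used above, for speed
unscaled = relu_matvec_second_moment(n, 1.0)
kaiming = relu_matvec_second_moment(n, math.sqrt(2 / n))
print(unscaled, kaiming)
```

Without scaling the estimate should land near `n / 2`, and with the Kaiming factor near 1.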