In [1]:
import numpy as np

In [2]:
dh = np.array([[[ 0.0047309 , -0.02851535, -0.13561962,  0.07165096,
          0.01057472, -0.03511244],
        [ 0.04032968, -0.01704817,  0.07002992, -0.04101618,
         -0.05707668, -0.03169758],
        [-0.02885697,  0.04073668, -0.04297836, -0.02013535,
          0.04352404,  0.03589717],
        [ 0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ]],

       [[ 0.05022935, -0.02123297,  0.08722008, -0.05108438,
         -0.07108724, -0.03947835],
        [-0.05051402,  0.07130938, -0.07523346, -0.03524686,
          0.07618864,  0.06283784],
        [ 0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ]]])

In [3]:
rng = np.random.default_rng(1000)

## Dive into the math
---
Start from the layer norm formula:

$$ LN(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \gamma + \beta$$

From the softmax and cross-entropy gradient we have

$$dZ = \frac {\partial \mathcal {L}} {\partial Z} = \frac {\partial \mathcal {L}}{\partial P} \frac {\partial P} {\partial Z} = \frac{P - Y} {N}$$

where
* $Z$ are the logits
* $P$ is $\mathrm{softmax}\left(Z\right)$
* $Y$ is the one-hot encoding of the target tokens
* $N$ is the number of tokens that count toward the loss after masking
* $\mathcal {L}$ is the cross-entropy loss
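The gradient above can be sketched in NumPy as follows. This is only an illustration under assumed inputs: the logits, targets, and padding mask are made up here (the mask pattern is chosen to have 5 active tokens), and masked positions are given a zero gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, V = 2, 4, 20                      # batch, sequence length, vocab size

Z = rng.standard_normal((B, T, V))      # hypothetical logits
targets = rng.integers(0, V, size=(B, T))
mask = np.array([[1, 1, 1, 0],
                 [1, 1, 0, 0]], dtype=np.float64)  # 1 = token counts in the loss

# Numerically stable softmax over the vocabulary axis
Z_shift = Z - Z.max(axis=-1, keepdims=True)
P = np.exp(Z_shift) / np.exp(Z_shift).sum(axis=-1, keepdims=True)

# One-hot targets Y
Y = np.eye(V)[targets]

N = mask.sum()                          # token count after masking
dZ = (P - Y) / N * mask[..., None]      # masked positions get zero gradient
```

Each row of `dZ` sums to zero, because both `P` and `Y` sum to one along the vocabulary axis.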

At the LM head we have

$$Z = h\ @\ W_{vocab} + b_{vocab}$$

where
* $h \in \mathbb{R} ^ {B \times T \times d_{model}}$
* $W = W_{vocab} \in \mathbb{R} ^ {d_{model} \times V}$
* $b = b_{vocab} \in \mathbb{R} ^ {V}$

$$\frac {\partial \mathcal {L}}{\partial W} = dW = \sum_{b, t} h_{b,t}^\intercal dZ_{b, t}$$

$$\frac {\partial \mathcal {L}}{\partial b} = db = \sum_{b, t} dZ_{b, t}$$

and $dh$

$$ \frac {\partial \mathcal {L}} {\partial h} = \frac {\partial \mathcal {L}} {\partial Z} \cdot \frac {\partial Z} {\partial h} = dh$$

$$dh_{b, t} = dZ_{b, t} W^\intercal$$
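A minimal sketch of these three gradients, assuming random stand-in values for $h$, $W$, and $dZ$ (none of them come from the earlier episode). Flattening the batch and sequence axes turns the sums over $(b, t)$ into ordinary matrix products.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, d_model, V = 2, 4, 6, 20

h = rng.standard_normal((B, T, d_model))
W = rng.standard_normal((d_model, V))
dZ = rng.standard_normal((B, T, V))      # stand-in for (P - Y) / N

# Flatten batch and sequence so the sums over (b, t) become matrix products
h2 = h.reshape(B * T, d_model)
dZ2 = dZ.reshape(B * T, V)

dW = h2.T @ dZ2                          # (d_model, V)
db = dZ2.sum(axis=0)                     # (V,)
dh = (dZ2 @ W.T).reshape(B, T, d_model)  # (B, T, d_model)
```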

This is where we start:

$$h = LN(y)$$
where $y$ is the output matrix of the transformer stack (written $\mathbf{X}$ below).



$$LN(\mathbf{X}) = (\hat{x} \odot \gamma) \oplus \beta$$

where 

$$\hat{x} = \frac {\mathbf{X} - \mu} {\sqrt{\sigma^2 + \epsilon}}$$

What does the $\odot$ operator do?

Given
$$\mathbf{A} = \begin{bmatrix} a_{1, 1} & a_{1, 2} \\ a_{2, 1} & a_{2, 2} \end{bmatrix}$$
and
$$\mathbf{B} = \begin{bmatrix} b_{1, 1} \\ b_{2, 1} \end{bmatrix}$$
then 
$$ \mathbf{A} \odot \mathbf{B} = \begin{bmatrix} 
    a_{1, 1} \times b_{1, 1} & a_{1, 2} \times b_{1, 1}\\
    a_{2, 1} \times b_{2, 1} & a_{2, 2} \times b_{2, 1}
    \end{bmatrix}$$
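The same broadcast can be tried in NumPy. The sketch below uses made-up numbers and shows two cases: a column vector of shape `(2, 1)` scales each row, as in the matrix example above, while a 1-D vector of shape `(2,)` scales each column, which is how $\gamma$ acts on the feature axis in layer norm.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b_col = np.array([[10.0],
                  [100.0]])             # shape (2, 1): one factor per row
b_row = np.array([10.0, 100.0])         # shape (2,):   one factor per column

row_scaled = A * b_col   # [[10, 20], [300, 400]]
col_scaled = A * b_row   # [[10, 200], [30, 400]]
```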

In [4]:
B = 2
V = 20
d_model = 6
T = 4

In [27]:
X = rng.random((B, T, d_model), np.float64)
print(X)

[[[0.07660225 0.09861362 0.06647744 0.7077515  0.90849204 0.40254213]
  [0.50306421 0.24188559 0.69874299 0.88569365 0.93542321 0.19316749]
  [0.95909555 0.67499364 0.74070019 0.43406363 0.61999626 0.52964891]
  [0.65987263 0.79797313 0.13226049 0.86629113 0.70724855 0.34756816]]

 [[0.41495181 0.27558004 0.46345484 0.44044984 0.10794388 0.56698408]
  [0.21903772 0.38334926 0.80146845 0.90795037 0.3352147  0.15266463]
  [0.65710443 0.2512089  0.88560038 0.17242145 0.4099706  0.47180624]
  [0.13481341 0.54750085 0.2043635  0.77804228 0.54646899 0.63532663]]]


In [28]:
gamma = rng.random(d_model, dtype=np.float64)
beta = rng.random(d_model, dtype=np.float64)
print("gamma =>", gamma)
print("beta =>", beta)

gamma => [0.06913433 0.95613202 0.19942924 0.28350887 0.36286223 0.44302021]
beta => [0.42059962 0.04916507 0.43676247 0.17128328 0.36089499 0.67962496]


In some cases the vector $\gamma$ is converted into a diagonal matrix, as below, so that the product can be written as a standard matrix multiplication.

In [29]:
gamma_diag = np.diag(gamma)
print(gamma_diag)

[[0.06913433 0.         0.         0.         0.         0.        ]
 [0.         0.95613202 0.         0.         0.         0.        ]
 [0.         0.         0.19942924 0.         0.         0.        ]
 [0.         0.         0.         0.28350887 0.         0.        ]
 [0.         0.         0.         0.         0.36286223 0.        ]
 [0.         0.         0.         0.         0.         0.44302021]]


In [30]:
print(X @ gamma_diag)

[[[0.00529585 0.09428764 0.01325755 0.20065383 0.32965745 0.1783343 ]
  [0.03477901 0.23127455 0.13934978 0.251102   0.33942975 0.0855771 ]
  [0.06630643 0.64538304 0.14771727 0.12306089 0.22497322 0.23464517]
  [0.04561985 0.76296766 0.02637661 0.24560121 0.25663379 0.15397972]]

 [[0.02868742 0.2634909  0.09242645 0.12487143 0.03916876 0.2511854 ]
  [0.01514303 0.36653251 0.15983624 0.25741198 0.12163675 0.06763351]
  [0.04542847 0.24018887 0.17661461 0.04888301 0.14876285 0.2090197 ]
  [0.00932023 0.5234831  0.04075606 0.22058188 0.19829296 0.28146254]]]


In Python (NumPy), however, we can simply use the `*` operator to multiply element by element (broadcasting).

$\mathbf{X} \odot \gamma = $

In [31]:
print(X * gamma) 

[[[0.00529585 0.09428764 0.01325755 0.20065383 0.32965745 0.1783343 ]
  [0.03477901 0.23127455 0.13934978 0.251102   0.33942975 0.0855771 ]
  [0.06630643 0.64538304 0.14771727 0.12306089 0.22497322 0.23464517]
  [0.04561985 0.76296766 0.02637661 0.24560121 0.25663379 0.15397972]]

 [[0.02868742 0.2634909  0.09242645 0.12487143 0.03916876 0.2511854 ]
  [0.01514303 0.36653251 0.15983624 0.25741198 0.12163675 0.06763351]
  [0.04542847 0.24018887 0.17661461 0.04888301 0.14876285 0.2090197 ]
  [0.00932023 0.5234831  0.04075606 0.22058188 0.19829296 0.28146254]]]
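A quick check that the two forms agree, using fresh random inputs rather than the `X` and `gamma` from the cells above. The broadcast version avoids building a $d_{model} \times d_{model}$ matrix that is mostly zeros.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((2, 4, 6))
gamma = rng.random(6)

via_matmul = X @ np.diag(gamma)   # matrix product with a diagonal matrix
via_broadcast = X * gamma         # element-wise product along the last axis

same = np.allclose(via_matmul, via_broadcast)
```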


In [37]:
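# NOTE: this applies the scale and shift to X directly, without the
# normalization step, so `xhat` here is not the normalized x-hat of the formula.
xhat = X * gamma + beta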
print(xhat)

[[[0.42589547 0.14345271 0.45002002 0.37193711 0.69055243 0.85795926]
  [0.45537863 0.28043962 0.57611226 0.42238528 0.70032474 0.76520206]
  [0.48690605 0.69454811 0.58447975 0.29434417 0.58586821 0.91427013]
  [0.46621948 0.81213273 0.46313908 0.4168845  0.61752877 0.83360468]]

 [[0.44928704 0.31265597 0.52918892 0.29615471 0.40006375 0.93081036]
  [0.43574265 0.41569758 0.59659871 0.42869526 0.48253174 0.74725847]
  [0.4660281  0.28935394 0.61337708 0.22016629 0.50965784 0.88864466]
  [0.42991986 0.57264817 0.47751853 0.39186516 0.55918794 0.9610875 ]]]
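The cell above applies $\gamma$ and $\beta$ to `X` without normalizing it first. For comparison, here is a sketch of the full forward pass that follows the $\hat{x}$ formula. The function name `layer_norm_forward`, the value `eps=1e-5`, and the random inputs are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def layer_norm_forward(X, gamma, beta, eps=1e-5):
    """Layer norm over the last axis; returns the output h and the cache xhat."""
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)          # biased variance (divide by d)
    xhat = (X - mu) / np.sqrt(var + eps)
    h = xhat * gamma + beta
    return h, xhat

rng = np.random.default_rng(0)
X = rng.random((2, 4, 6))
gamma = rng.random(6)
beta = rng.random(6)

h, xhat_norm = layer_norm_forward(X, gamma, beta)
```

After this step every token vector in `xhat_norm` has mean close to 0 and variance close to 1 along the feature axis.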


From the previous episode (LM head gradient), recall

$$ h = LN(\mathbf{X}) = (\hat{x}\odot\gamma) \oplus \beta $$

We already have $dh$ from the previous episode:

In [38]:
dh

array([[[ 0.0047309 , -0.02851535, -0.13561962,  0.07165096,
          0.01057472, -0.03511244],
        [ 0.04032968, -0.01704817,  0.07002992, -0.04101618,
         -0.05707668, -0.03169758],
        [-0.02885697,  0.04073668, -0.04297836, -0.02013535,
          0.04352404,  0.03589717],
        [ 0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ]],

       [[ 0.05022935, -0.02123297,  0.08722008, -0.05108438,
         -0.07108724, -0.03947835],
        [-0.05051402,  0.07130938, -0.07523346, -0.03524686,
          0.07618864,  0.06283784],
        [ 0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ]]])

In [39]:
dh.shape

(2, 4, 6)

In [40]:
dh.reshape(B * T , d_model)

array([[ 0.0047309 , -0.02851535, -0.13561962,  0.07165096,  0.01057472,
        -0.03511244],
       [ 0.04032968, -0.01704817,  0.07002992, -0.04101618, -0.05707668,
        -0.03169758],
       [-0.02885697,  0.04073668, -0.04297836, -0.02013535,  0.04352404,
         0.03589717],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.05022935, -0.02123297,  0.08722008, -0.05108438, -0.07108724,
        -0.03947835],
       [-0.05051402,  0.07130938, -0.07523346, -0.03524686,  0.07618864,
         0.06283784],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ]])

$$d\beta = \frac{\partial \mathcal {L}} {\partial \beta} = \sum_{batch}\sum_{sequence} \frac{\partial \mathcal {L}}{\partial h} =\sum_{batch}\sum_{sequence} dh$$

In [41]:
dBeta = np.sum(dh.reshape(B*T, d_model), axis=0, keepdims=True)
dBeta

array([[ 0.01591894,  0.04524957, -0.09658144, -0.07583181,  0.00212348,
        -0.00755336]])

$$ d\gamma = \frac {\partial \mathcal {L}}{\partial \gamma} = \sum_{batch} \sum_{sequence} \left(\frac{\partial \mathcal {L}} {\partial h} \odot \hat{x}\right)$$

In [44]:
dh.shape, xhat.shape

((2, 4, 6), (2, 4, 6))

In [50]:
dgamma = np.sum((dh * xhat).reshape(B * T, d_model), axis=0, keepdims = True)
print(dgamma)
print("dgamma shape =", dgamma.shape)

[[ 0.00688579  0.04242652 -0.04453472 -0.02684074  0.00115355 -0.01135114]]
dgamma shape = (1, 6)
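One way to gain confidence in the $d\beta$ and $d\gamma$ formulas is a finite-difference check. The sketch below uses stand-in random values for $\hat{x}$ and $dh$, and a hypothetical surrogate loss $\sum dh \odot h$, chosen only because its gradient with respect to $h$ is exactly $dh$.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, d_model = 2, 4, 6
xhat = rng.standard_normal((B, T, d_model))   # stand-in for the normalized input
gamma = rng.random(d_model)
beta = rng.random(d_model)
dh = rng.standard_normal((B, T, d_model))     # stand-in upstream gradient

def surrogate_loss(gamma, beta):
    # Any loss whose gradient w.r.t. h equals dh; linear in h is the simplest choice
    return np.sum(dh * (xhat * gamma + beta))

# Analytic gradients: sum over the batch and sequence axes
dgamma = np.sum(dh * xhat, axis=(0, 1))
dbeta = np.sum(dh, axis=(0, 1))

# Central finite differences, one coordinate at a time
step = 1e-6
dgamma_num = np.zeros(d_model)
dbeta_num = np.zeros(d_model)
for k in range(d_model):
    e = np.zeros(d_model)
    e[k] = step
    dgamma_num[k] = (surrogate_loss(gamma + e, beta) - surrogate_loss(gamma - e, beta)) / (2 * step)
    dbeta_num[k] = (surrogate_loss(gamma, beta + e) - surrogate_loss(gamma, beta - e)) / (2 * step)
```

Summing over `axis=(0, 1)` gives the same values as the reshape-then-sum used above, only with shape `(d_model,)` instead of `(1, d_model)`.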


$$d \hat{x} = \frac{\partial \mathcal{L}}{\partial \hat{x}} = \left( \frac{\partial \mathcal{L}} {\partial h} \odot \frac{\partial h}{\partial \hat{x}} \right) = \frac{\partial \mathcal{L}} {\partial h} \odot \gamma$$

In [52]:
dxhat = dh * gamma
print(dxhat)

[[[ 0.00032707 -0.02726444 -0.02704652  0.02031368  0.00383717
   -0.01555552]
  [ 0.00278817 -0.0163003   0.01396601 -0.01162845 -0.02071097
   -0.01404267]
  [-0.00199501  0.03894964 -0.00857114 -0.00570855  0.01579323
    0.01590317]
  [ 0.          0.          0.          0.          0.
    0.        ]]

 [[ 0.00347257 -0.02030152  0.01739423 -0.01448287 -0.02579487
   -0.01748971]
  [-0.00349225  0.06818118 -0.01500375 -0.0099928   0.02764598
    0.02783843]
  [ 0.          0.          0.          0.          0.
    0.        ]
  [ 0.          0.          0.          0.          0.
    0.        ]]]


Next comes the hardest part of the layer norm backward pass: differentiating $\hat{x}$ with respect to $\mathbf{X}$.
$$\hat{x} = \frac {\mathbf{X} - \mu} {\sqrt{\sigma^2 + \epsilon}}$$

$$\frac{\partial \hat{x}}{\partial X} = ?$$

Write $\hat{x} = u / v$ with
$$u = \mathbf{X} - \mu$$
$$v = \sqrt{\sigma^2 + \epsilon}$$

By the quotient rule, $\left(u / v\right)' = \left(u'v - uv'\right) / v^2$ with $v^2 = \sigma^2 + \epsilon$:

$$\frac {\partial \hat{x}} {\partial \mathbf{X}} = \frac{\frac{\partial \left(\mathbf{X} - \mu\right) } {\partial \mathbf{X}} \cdot \sqrt{\sigma^2 + \epsilon} -  \left(\mathbf{X} - \mu\right)\cdot \frac{\partial \sqrt{\sigma^2 + \epsilon}}{\partial \mathbf{X}}}{\sigma^2 + \epsilon}$$

#### first part
---
$$\frac{\partial \left(\mathbf{X} - \mu\right) } {\partial \mathbf{X}}$$

$$ \frac{\partial \mathbf{X}}{\partial \mathbf{X}} = \begin{bmatrix} 
    \frac{\partial x_1}{\partial x_1} & \frac{\partial x_1}{\partial x_2} & \cdots & \frac{\partial x_1}{\partial x_n} \\
    \frac{\partial x_2}{\partial x_1} & \frac{\partial x_2}{\partial x_2} & \cdots & \frac{\partial x_2}{\partial x_n} \\
    \vdots & \vdots & \ddots & \vdots \\
    \frac{\partial x_m}{\partial x_1} & \frac{\partial x_m}{\partial x_2} & \cdots & \frac{\partial x_m}{\partial x_n} \end{bmatrix}$$

where $m = n = d$, the feature dimension ($d_{model}$).

$$\frac {\partial x_i}{\partial x_j} = \delta_{i, j}$$

and $$\frac{\partial \mu} {\partial x_j} = \frac{\partial}{\partial x_j} \frac {x_1 + x_2 + \cdots + x_d}{d} = \frac{1}{d}$$

then $$\frac{\partial \left(x_i - \mu\right)}{\partial x_j} = \delta_{i, j} - \frac{1}{d}$$
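As a small sanity check, the Jacobian $\delta_{i,j} - \frac{1}{d}$ can be built explicitly. Centering is a linear map, so applying this matrix to a vector should reproduce $x - \mu$ exactly. The vector `x` below is a made-up single token with $d = 6$.

```python
import numpy as np

d = 6
rng = np.random.default_rng(0)
x = rng.random(d)

# Analytic Jacobian: J[i, j] = delta_ij - 1/d
J = np.eye(d) - np.ones((d, d)) / d

# Centering is linear, so applying J reproduces x - mean(x)
centered = J @ x
```

Each row of `J` sums to zero, which reflects that shifting every $x_j$ by the same constant leaves $x_i - \mu$ unchanged.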

#### Second part
---

$$\frac{\partial \sqrt{\sigma^2 + \epsilon}}{\partial x_j}$$

Using the chain rule:
$$\frac{\partial \sqrt{\sigma^2 + \epsilon}}{\partial x_j} = \frac{1}{2 \cdot \sqrt{\sigma^2 + \epsilon}} \cdot \frac{\partial \sigma^2}{\partial x_j}$$


where $$\sigma^2 = \frac{1}{d}\sum_i \left( x_i - \mu \right)^2$$


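$$\frac{\partial \sigma^2}{\partial x_j} = \frac{1}{d} \sum_i 2\left(x_i - \mu\right) \frac{\partial \left(x_i - \mu\right)}{\partial x_j}$$

$$\frac{\partial \sigma^2}{\partial x_j} = \frac{2}{d} \sum_i \left(x_i - \mu\right)\left(\delta_{i, j} - \frac{1}{d}\right) = \frac{2 \left(x_j - \mu\right)}{d}$$

The $\frac{1}{d}$ term drops out because $\sum_i \left(x_i - \mu\right) = 0$.

Finally:
$$\frac{\partial \sqrt{\sigma^2 + \epsilon}}{\partial x_j} = \frac{1}{2 \cdot \sqrt{\sigma^2 + \epsilon}} \cdot \frac{2 \left(x_j - \mu\right)}{d} = \frac{x_j - \mu}{d \sqrt{\sigma^2 + \epsilon}}$$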
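A numerical sanity check of this derivative, assuming the closed form $\frac{x_j - \mu}{d\sqrt{\sigma^2 + \epsilon}}$ that results from substituting $\frac{\partial \sigma^2}{\partial x_j} = \frac{2(x_j - \mu)}{d}$ into the chain rule. The vector `x`, the value `eps = 1e-5`, and the helper name `std_eps` are illustrative assumptions.

```python
import numpy as np

d = 6
eps = 1e-5
rng = np.random.default_rng(0)
x = rng.random(d)

def std_eps(x):
    # sqrt(sigma^2 + eps) with the biased variance (divide by d)
    return np.sqrt(x.var() + eps)

# Closed form: d sqrt(sigma^2 + eps) / d x_j = (x_j - mu) / (d * sqrt(sigma^2 + eps))
analytic = (x - x.mean()) / (d * std_eps(x))

# Central finite differences, one coordinate at a time
step = 1e-6
numeric = np.zeros(d)
for j in range(d):
    e = np.zeros(d)
    e[j] = step
    numeric[j] = (std_eps(x + e) - std_eps(x - e)) / (2 * step)
```

The two vectors should agree to several decimal places. The entries of `analytic` also sum to zero, since $\sum_j (x_j - \mu) = 0$.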
