In [None]:
'''
Key concepts:
* RMS Norm
* KV-cache
* Rotary Positional embedding
* Grouped query attention
* SwiGLU

Notes:
- Layer normalization is done primarily to deal with `internal covariate shift`: drastic weight updates made by SGD change the distribution of each neuron's inputs from step to step, and that shifting distribution slows down training.
- LayerNorm: a unique (mean, variance) pair for each sample, computed across that sample's features.
- BatchNorm: a unique (mean, variance) pair for each feature, computed across the samples in the batch.
- RMSNorm: Hypothesizes that the scaling is mostly responsible for the success of the normalization. The re-centering is thus not needed and the mean doesn't have to be calculated.
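- Rotary positional embedding: Rotates each query/key vector by an angle proportional to its absolute position, so that the query-key dot product depends only on the relative position of the two tokens.
    - They are parameter-efficient: the rotation angles are fixed, so no position parameters are learned.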
    - They show better invariance to permutations.
    - Better generalization.
    - Easy to implement.
- Rotary positional embedding is only applied to the queries and keys, not to the values.
- Rotary positional embedding is applied after the input has been multiplied by W_q and W_k, i.e. to the projected Q and K.
- KV-cache is an important concept that can be applied to all autoregressive Transformer models (only during INFERENCE)
- KV-cache caches the K and V values from previous steps, so at each step the only new Q we need is the single vector for the latest token.
  Instead of doing a (N,F)x(N,F).T self-attention computation each time, we only do a
  (1,F)x(N,F).T each time, where (1,F) is the shape of Q. This allows us to skip the re-calculation of the entire self-attention matrix each time.
- In GPUs, memory transfer is really expensive and about 20x slower than matrix multiplication operations.
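- For batched multi-head attention (b = batch size, n = sequence length, d = model dimension, h = no. of heads):
    - Total operations -> O(bnd**2)
    - Total memory accesses -> O(bnd + bhn**2 + d**2)
- Memory access is not the bottleneck here since (Total memory accesses)/(Total operations) <<< 1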
- Multi-query attention (MQA): Only the queries keep multiple heads. All N query heads share a single K head and a single V head.
- Grouped-query attention (GQA): We reduce the no. of heads for the K and V values but don't remove them completely; each K/V head is shared by a group of query heads. A good compromise between speed and quality.
- Swish(x) = x * (1/(1+exp(-beta * x))); SwiGLU(x) = Swish(x @ W) * (x @ V), i.e. a linear unit gated by Swish -> Works due to divine benevolence
'''
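
The KV-cache note above can be checked on a toy example. This is a minimal sketch in plain Python (no tensor library), assuming Q, K and V are already projected and there is a single head; `attend`, `full_causal_attention` and `KVCache` are hypothetical names made up for illustration, not part of any library.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    # Single-query attention: q is (F,), keys and values are (N, F).
    # This is the (1,F)x(N,F).T product from the notes.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def full_causal_attention(qs, ks, vs):
    # Without a cache: recompute every row, token i attends to tokens 0..i.
    return [attend(qs[i], ks[: i + 1], vs[: i + 1]) for i in range(len(qs))]

class KVCache:
    # Hypothetical helper: stores K and V rows from previous decoding steps.
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        # Append the newest token's K and V, then attend with its single Q.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

# Toy projected vectors for three tokens (assumed values, F = 2).
qs = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4]]
ks = [[0.2, 0.1], [-0.3, 0.6], [0.7, 0.2]]
vs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

cache = KVCache()
cached_out = [cache.step(q, k, v) for q, k, v in zip(qs, ks, vs)]
full_out = full_causal_attention(qs, ks, vs)
```

Each cached step produces the same output row as the full causal recomputation, while only computing one row of attention scores per new token.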

In [ ]:
'''
Layer Normalization
'''
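
A rough side-by-side of the two normalizations on a single sample, as a sketch only: plain Python, learned gain and bias left out of `layer_norm`, and an optional `gain` for `rms_norm`. The function names and the `eps` default are assumptions for illustration.

```python
import math

def layer_norm(x, eps=1e-6):
    # Re-center and re-scale one sample using its own mean and variance.
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def rms_norm(x, gain=None, eps=1e-6):
    # Re-scale only: divide by the root mean square, no mean is computed.
    if gain is None:
        gain = [1.0] * len(x)
    rms = math.sqrt(sum(xi * xi for xi in x) / len(x) + eps)
    return [g * xi / rms for g, xi in zip(gain, x)]

x = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(x)                              # mean is approximately 0
rn = rms_norm(x)                                # mean square is approximately 1
rn_scaled = rms_norm([10.0 * xi for xi in x])   # approximately equal to rn
```

The LayerNorm output is re-centered around zero, while the RMSNorm output keeps its offset and only has its scale fixed; scaling the input leaves the RMSNorm output essentially unchanged.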