In [3]:
import torch
import torch.nn as nn 
from torch.nn import functional as F 
torch.manual_seed(1337)

<torch._C.Generator at 0x14fe5a22250>

## Contents

1. Preparing training data
2. Definition of a first simple model, only using embeddings
3. Introduction to self attention
4. Adding a Feed forward layer

## Data

Loading in the data

In [6]:
with open("./data/input.txt", "r", encoding="utf-8") as f:
    text = f.read()

In [7]:
len(text)

1115394

Unique characters present in the text

In [8]:
characters = sorted(list(set(text)))
vocab_size = len(characters)
vocab_size, "".join(characters)

(65, "\n !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")

Index to character and character to index functions for encoding and decoding

In [9]:
stoi = {ch:i for i,ch in enumerate(characters)}
itos = {i:ch for i,ch in enumerate(characters)}

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join([itos[i] for i in l])

encode("hello"), decode(encode("hello"))

([46, 43, 50, 50, 53], 'hello')

Representation of the original text as a sequence of integers

In [10]:
data = torch.tensor(encode(text),dtype=torch.long)

Splitting the data into a training set (first 90%) and a test set (last 10%)

In [11]:
n = int(0.9*len(data))
train_data = data[:n]
test_data = data[n:]

Only chunks of the dataset are fed to the model at a time. Their maximum length is called the block size, also known as the context length.

In [12]:
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

As the model always predicts the next character, (x, y) pairs can be created by sampling chunks from the dataset that are offset by one character. A single chunk of block_size + 1 characters yields block_size training examples, and the transformer can be trained on all of the examples below.

In [13]:
x= train_data[:block_size]
y= train_data[1:block_size+1]

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"input {context}, target {target}")

input tensor([18]), target 47
input tensor([18, 47]), target 56
input tensor([18, 47, 56]), target 57
input tensor([18, 47, 56, 57]), target 58
input tensor([18, 47, 56, 57, 58]), target 1
input tensor([18, 47, 56, 57, 58,  1]), target 15
input tensor([18, 47, 56, 57, 58,  1, 15]), target 47
input tensor([18, 47, 56, 57, 58,  1, 15, 47]), target 58


In [24]:
batch_size=4

In [27]:
def get_batch(split):
    data= train_data if split == "train" else test_data
    ix = torch.randint(len(data) -block_size, (batch_size,)) #Random sampling batch_size indexes from the dataset to use as starting points for chunks
    x = torch.stack([data[i:i+block_size] for i in ix]) # Stacking the results up in a batch_size x chunk_size tensor
    y = torch.stack([data[i+1:i+block_size+1] for i in ix]) # Offset by one because they are the targets of x
    return x, y

In [26]:
xb, yb = get_batch("train")
print("inputs:")
print(xb.shape)
print(xb)
print("targets:")
print(yb.shape)
print(yb)


inputs:
torch.Size([4, 8])
tensor([[58, 46, 43,  1, 43, 39, 56, 57],
        [39, 58, 47, 53, 52, 12,  1, 37],
        [53, 56, 43,  1, 21,  1, 41, 39],
        [50, 39, 52, 63,  1, 47, 58, 57]])
targets:
torch.Size([4, 8])
tensor([[46, 43,  1, 43, 39, 56, 57, 10],
        [58, 47, 53, 52, 12,  1, 37, 53],
        [56, 43,  1, 21,  1, 41, 39, 51],
        [39, 52, 63,  1, 47, 58, 57, 43]])


Representation of the individual training examples contained in a batch and their respective targets. Each sampled chunk provides block_size training examples, one for every context length from 1 to block_size.

In [19]:
for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"Context: {context.tolist()}, target: {target}")

Context: [24], target: 43
Context: [24, 43], target: 58
Context: [24, 43, 58], target: 5
Context: [24, 43, 58, 5], target: 57
Context: [24, 43, 58, 5, 57], target: 1
Context: [24, 43, 58, 5, 57, 1], target: 46
Context: [24, 43, 58, 5, 57, 1, 46], target: 43
Context: [24, 43, 58, 5, 57, 1, 46, 43], target: 39
Context: [44], target: 53
Context: [44, 53], target: 56
Context: [44, 53, 56], target: 1
Context: [44, 53, 56, 1], target: 58
Context: [44, 53, 56, 1, 58], target: 46
Context: [44, 53, 56, 1, 58, 46], target: 39
Context: [44, 53, 56, 1, 58, 46, 39], target: 58
Context: [44, 53, 56, 1, 58, 46, 39, 58], target: 1
Context: [52], target: 58
Context: [52, 58], target: 1
Context: [52, 58, 1], target: 58
Context: [52, 58, 1, 58], target: 46
Context: [52, 58, 1, 58, 46], target: 39
Context: [52, 58, 1, 58, 46, 39], target: 58
Context: [52, 58, 1, 58, 46, 39, 58], target: 1
Context: [52, 58, 1, 58, 46, 39, 58, 1], target: 46
Context: [25], target: 17
Context: [25, 17], target: 27
Context: [

## Model definition and training

A simple bigram model that only uses an embedding table as its parameters.

The token indexes are used to look up the embeddings. Each index has its own embedding, so when e.g. index 24 is queried, the row with index 24 of the embedding table is returned. In the model below, the next token is predicted from a single token only.
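
The lookup described above can be checked in isolation. The following is a minimal sketch, assuming the same vocabulary size of 65; the names `table` and `idx` and the three example indexes are only illustrative. It shows that calling an `nn.Embedding` is the same as indexing rows of its weight matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 65  # same vocabulary size as above

# One row per token index; each row holds the logits for the next token
table = nn.Embedding(vocab_size, vocab_size)

idx = torch.tensor([24, 43, 58])  # three illustrative token indexes
logits = table(idx)               # looks up the rows 24, 43 and 58

print(logits.shape)  # torch.Size([3, 65])
# The lookup is plain row indexing into the weight matrix
print(torch.equal(logits, table.weight[idx]))  # True
```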

In [17]:
class BigramModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
    
    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx) # Dimensions: Batch, Time, Channel

        if targets is None:
            loss=None
        else:
            # F.cross_entropy expects logits of shape (N, C) and targets of shape (N,)
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, idx, max_new_tokens):

        for _ in range(max_new_tokens):
            logits, loss = self(idx)

            # Only the last time step will be considered 
            logits = logits[:, -1, :]

            # Get probabilities
            probs = F.softmax(logits, dim=-1)

            # Sampling from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)

            # Appending the sampled token to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)

        return idx    
    


Some output of an untrained model.

In [18]:
m = BigramModel(vocab_size=vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
loss

idx = torch.zeros((1,1), dtype=torch.long)
print(decode(m.generate(idx=idx, max_new_tokens=100)[0].tolist()))

torch.Size([32, 65])

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3
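
As a reference point for the loss values, it helps to know what a model that assigns the same probability to all 65 characters would score. The sketch below only assumes the vocabulary size from above; `uniform_loss` is an illustrative name.

```python
import math

vocab_size = 65

# Cross entropy of a uniform prediction: -ln(1 / vocab_size)
uniform_loss = -math.log(1.0 / vocab_size)
print(round(uniform_loss, 4))  # 4.1744
```

The first training losses reported further below (about 4.7) are higher than this value, presumably because the randomly initialised logits are not uniform.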


Training a model

In [19]:
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [20]:
batch_size = 32

for steps in range(10000):
    xb, yb = get_batch("train")

    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    print(loss.item())

4.704006195068359
4.721118927001953
4.653193473815918
4.706261157989502
4.780904293060303
4.751267910003662
4.8395490646362305
4.667973041534424
4.743716716766357
4.774043083190918
4.6908278465271
4.789142608642578
4.61777925491333
4.650947093963623
4.886447429656982
4.703796863555908
4.757591724395752
4.65510892868042
4.709283828735352
4.6745147705078125
4.760501384735107
4.7892632484436035
4.653748512268066
4.6619181632995605
4.673007488250732
4.66577672958374
4.7301106452941895
4.755304336547852
4.712186813354492
4.745501518249512
4.726755619049072
4.735108375549316
4.777461051940918
4.643350601196289
4.6651835441589355
4.79764461517334
4.717412948608398
4.683647155761719
4.81886100769043
4.613771915435791
4.573785781860352
4.560741901397705
4.81563138961792
4.6061553955078125
4.619696140289307
4.725419521331787
4.650487899780273
4.5941481590271
4.7202863693237305
4.699342250823975
4.6724138259887695
4.727972984313965
4.66152286529541
4.616766929626465
4.599857807159424
4.6533403396

Output of a trained model

In [21]:
print(decode(m.generate(idx=idx, max_new_tokens=400)[0].tolist()))


Iyoteng h hasbe pave pirance
Rie hicomyonthar's
Plinseard ith henoure wounonthioneir thondy, y heltieiengerofo'dsssit ey
KIN d pe wither vouprrouthercc.
hathe; d!
My hind tt hinig t ouchos tes; st yo hind wotte grotonear 'so it t jod weancotha:
h hay.JUCle n prids, r loncave w hollular s O:
HIs; ht anjx?

DUThinqunt.

LaZAnde.
athave l.
KEONH:
ARThanco be y,-hedarwnoddy scace, tridesar, wnl'shenou


# Building a transformer

In the following we build on the model from before and add the elements of a transformer decoder to it.
![](images/decoder.png)

## 1. Positional embeddings

Self-attention mechanisms, while powerful for modelling relationships between tokens in a sequence, lack an inherent understanding of the order of those tokens. This is because attention operates over a set of vectors without any inherent notion of their spatial arrangement. To address this limitation, positional embeddings are introduced.
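
The missing sense of order can be demonstrated with a small sketch. It assumes a stripped-down attention without mask and with identity projections (q = k = v = x), which is a simplification and not the implementation used later; `attend` and `perm` are illustrative names. Shuffling the input tokens merely shuffles the output rows in the same way, so the result carries no information about where a token was located.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
T, C = 5, 4
x = torch.randn(T, C)     # one sequence of T token vectors
perm = torch.randperm(T)  # a random reordering of the positions

def attend(x):
    # Unmasked self-attention with identity projections (q = k = v = x)
    wei = F.softmax(x @ x.T / C**0.5, dim=-1)
    return wei @ x

# Permuting the inputs only permutes the outputs in the same way
print(torch.allclose(attend(x)[perm], attend(x[perm]), atol=1e-5))  # True
```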

An example of how positional embeddings could be implemented.

In [None]:
n_embed = 32 # embedding size, the same value is used for the attention heads below

class BigramModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed) # embedding for each position

    def forward(self, idx, targets=None):
        B, T = idx.shape
        token_embeddings = self.token_embedding_table(idx) # Dimensions: Batch, Time, n_embed
        positional_embedding = self.position_embedding_table(torch.arange(T, device=idx.device)) # integers 0..T-1, embedded to create a (T, n_embed) matrix
        x = token_embeddings + positional_embedding # broadcast over the batch dimension -> (B, T, n_embed)
        return x # the full model further below continues from here with the transformer blocks and the lm_head


## 2. Self attention

Self attention can be seen as a communication mechanism within a network of nodes, analogous to a directed graph. Each node in this network possesses a vector of information.

Affinities between nodes are determined by calculating dot products between the query vector of one node and the key vectors of all other nodes. Higher dot products indicate stronger affinities, suggesting the nodes find each other's information relevant.

The fundamental goal of self-attention is to allow these nodes to exchange information in a data-dependent manner, meaning the flow of information is determined by the content of the data itself. <br>

### Masking

A triangular mask is used to prevent future tokens from influencing past tokens. This ensures a unidirectional flow of information, which is essential for predicting upcoming tokens in a sequence. The tensors in the examples below use the following dimensions:

● B: Batch size - the number of independent sequences being processed simultaneously. <br>
● T: Time - the maximum context length for making predictions.<br>
● C: Channels - in this context, refers to the embedding dimensions.<br>


In [22]:
B, T, C = 4,8,2
x = torch.randn(B, T, C)
x.shape

torch.Size([4, 8, 2])

As a first, simplified example, each token is replaced by the average of itself and all preceding tokens. This average is a very lossy summary of the context and is only used here to illustrate the mechanics.

In [26]:
xbow = torch.zeros((B,T,C)) #bow = bag of words, init at 0
for b in range(B): # iteration over batch
    for t in range(T): # iterating over time
        xprev = x[b, :t+1] # context including current token itself 
        xbow[b,t] = torch.mean(xprev, 0) 
xbow[0], x[0]

(tensor([[0.4241, 2.3706],
         [0.2898, 1.2700],
         [0.1273, 1.0406],
         [0.1800, 0.6874],
         [0.3613, 0.5511],
         [0.2946, 0.1472],
         [0.2726, 0.4710],
         [0.0927, 0.5778]]),
 tensor([[ 0.4241,  2.3706],
         [ 0.1555,  0.1694],
         [-0.1978,  0.5817],
         [ 0.3381, -0.3723],
         [ 1.0866,  0.0062],
         [-0.0391, -1.8723],
         [ 0.1407,  2.4141],
         [-1.1666,  1.3247]]))

torch.tril creates a lower triangular matrix, which is ideal for vectorizing the averaging and for filtering out future tokens.

In [12]:
wei = torch.tril(torch.ones(T,T))
wei

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

Using matmul makes this process more efficient. By normalizing each row of the torch.tril() matrix so that it sums to 1, the averaging can be expressed as a single matrix multiplication.

In [28]:
wei = wei / wei.sum(1, keepdim = True)
xbow2 = wei @ x # (T, T) @ (B, T, C) -> torch creates batch dimension -> (B, T, T) @ (B, T, C) ----> (B, T, C)

In [30]:
xbow2[0], xbow[0]

(tensor([[0.4241, 2.3706],
         [0.2898, 1.2700],
         [0.1273, 1.0406],
         [0.1800, 0.6874],
         [0.3613, 0.5511],
         [0.2946, 0.1472],
         [0.2726, 0.4710],
         [0.0927, 0.5778]]),
 tensor([[0.4241, 2.3706],
         [0.2898, 1.2700],
         [0.1273, 1.0406],
         [0.1800, 0.6874],
         [0.3613, 0.5511],
         [0.2946, 0.1472],
         [0.2726, 0.4710],
         [0.0927, 0.5778]]))

Another way of achieving the same result uses softmax: future positions are set to -inf, so that they receive a weight of 0 after the softmax.

In [38]:
tril = torch.tril(torch.ones(T,T))
tril

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

In [43]:
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril==0, float('-inf'))
wei = F.softmax(wei, dim=1) # Softmax normalizes each row, that's why the result is the same as the averaging above
xbow3 = wei @ x
torch.allclose(xbow, xbow3)

True

In [45]:
xbow3

tensor([[[ 0.4241,  2.3706],
         [ 0.2898,  1.2700],
         [ 0.1273,  1.0406],
         [ 0.1800,  0.6874],
         [ 0.3613,  0.5511],
         [ 0.2946,  0.1472],
         [ 0.2726,  0.4710],
         [ 0.0927,  0.5778]],

        [[ 0.7175,  0.1091],
         [ 0.6630,  0.0273],
         [ 0.5679,  0.2643],
         [ 0.0765,  0.2857],
         [-0.0240,  0.3696],
         [ 0.1839,  0.0726],
         [-0.0977, -0.0122],
         [-0.0985,  0.0877]],

        [[ 0.6477,  0.4192],
         [ 0.5112,  0.4835],
         [ 0.3243,  0.2685],
         [ 0.0693,  0.2880],
         [ 0.2925,  0.1148],
         [ 0.1442,  0.1071],
         [ 0.0084,  0.4233],
         [ 0.1843,  0.5972]],

        [[ 0.6527,  0.0654],
         [ 0.1873, -0.7585],
         [-0.4758, -0.6699],
         [-0.3655, -0.6324],
         [-0.2404, -0.6654],
         [-0.4016, -0.6217],
         [-0.6164, -0.6241],
         [-0.4988, -0.5355]]])

### Self attention implementation

Implementation of actual self attention, instead of plain averages.

For every single token/node a query, a key and a value vector are created. <br>
Query = what am I looking for? <br>
Key = what do I contain? <br>
Value = the information I pass on when another token attends to me <br>
Affinity/relevance between tokens = dot product of query and key <br>
The dot product of each query with all keys is calculated (wei in the code below). 

In [7]:
B, T, C = 4,8,32 
x = torch.randn(B, T, C) # Matrix of random embeddings

#single head of attention
head_size=16

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x) # projection into B, T, head_size
q = query(x) # projection into  B, T, head_size

#Calculation of affinities through dot products
wei = q @ k.transpose(-2, -1) # Only the last two dimensions are transposed, not the batch dimension: (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T,T)) # mask
wei = wei.masked_fill(tril==0, float('-inf'))
wei = F.softmax(wei, dim=-1) # Normalize over the key dimension, so that each token's attention weights sum to 1

v = value(x) # projection into  B, T, head_size
out = wei @ v 

out.shape

torch.Size([4, 8, 16])

- In self attention, the sequences in a batch are processed independently of each other. Attention "communication" only happens inside a sequence.
- On its own, attention has no notion of position, which is why the positional embeddings are needed.
- This whole process can be seen as a directed graph where the weights describe the affinity.
- The above example is a "decoder" attention block, which is used in an autoregressive manner, hence the masking.
- An encoder would not use this mask, as use cases such as sentiment analysis are not autoregressive and do not require that condition.
- This example is called self attention because the keys, queries and values all come from the same source x. In the original transformer model, cross attention additionally takes keys and values from a second source, the encoder.


Additionally, q @ k needs to be divided by the square root of the head size for normalization. This has the following reasons:
● Variance Control: When the query (Q) and key (K) vectors are initialised with random values, the variance of the dot product Q⋅K grows with the head size. This high variance can lead to unstable gradients during training, making it challenging for the model to learn effectively.

● Softmax Sensitivity: The softmax function, used to normalise attention scores into a probability distribution, is very sensitive to large input values. If the dot products are too large, the softmax function can create extremely peaky distributions, essentially focusing attention on a single token and ignoring others.

● Stabilising Attention Weights: Dividing by the square root of the head size helps control the variance of the dot products, ensuring a more stable distribution of attention weights. This prevents the softmax function from creating overly sharp peaks and allows for a more balanced flow of information between tokens. Without scaling, the softmax results will converge towards one-hot vectors.
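
The softmax sensitivity mentioned above can be illustrated with a few numbers. This is a minimal sketch in plain Python; the score values and the factor 8 are arbitrary choices for illustration, and `softmax` is a small helper written here, not the PyTorch function.

```python
import math

def softmax(xs):
    # Subtract the maximum for numerical stability
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.1, -0.2, 0.3, -0.2, 0.5]
small = softmax(scores)                   # moderate inputs: fairly diffuse weights
large = softmax([8 * s for s in scores])  # same scores scaled up: one weight dominates

print([round(p, 3) for p in small])
print([round(p, 3) for p in large])
```

With the scaled-up scores, most of the probability mass ends up on the largest entry, which is the behaviour the division by the square root of the head size is meant to avoid.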

An example of the variances being very different.

In [8]:
k = torch.randn(B,T, head_size)
q = torch.randn(B,T, head_size)

#Calculation of affinities through dot products
wei = q @ k.transpose(-2, -1) 

In [9]:
k.var()

tensor(1.0449)

In [10]:
q.var()

tensor(1.0700)

In [11]:
wei.var()

tensor(17.4690)
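
The variance of wei is close to the head size of 16, which is what a short calculation predicts. Assuming that the components of q and k are independent with zero mean and unit variance (as is the case for torch.randn), and writing d for the head size:

$$
\operatorname{Var}(q \cdot k) = \operatorname{Var}\Big(\sum_{i=1}^{d} q_i k_i\Big) = \sum_{i=1}^{d} \operatorname{Var}(q_i)\,\operatorname{Var}(k_i) = d
$$

$$
\operatorname{Var}\Big(\frac{q \cdot k}{\sqrt{d}}\Big) = \frac{d}{d} = 1
$$

Dividing by the square root of d therefore brings the variance back to roughly 1.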

Scaling wei by the square root of the head size

In [12]:
wei = wei * head_size ** -0.5 # divide by the square root of the head size
wei.var()

tensor(1.0918)

### Multi-head attention

Similar to feature maps in CNNs, transformers use multiple heads of attention, each potentially focussing on different aspects of a sequence. 

In [30]:
n_embed=32

class AttentionHead(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.head_size = head_size
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
    
    def forward(self, x):
        
        B, T, C = x.shape

        k = self.key(x) # projection into B, T, head_size
        q = self.query(x) # projection into  B, T, head_size
        v = self.value(x)

        self_attention = q @ k.transpose(-2, -1) * self.head_size**-0.5 # affinities, scaled by the square root of the head size

        self_attention = self_attention.masked_fill(self.tril[:T, :T]==0, float('-inf')) # applying the mask
        self_attention = F.softmax(self_attention, dim=-1) # normalize, so that each token's attention weights sum to 1

        self_attention = self_attention @ v

        return self_attention

class MultiheadAttention(nn.Module):
    def __init__(self,  num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([AttentionHead(head_size=head_size) for _ in range(num_heads)])

    def forward(self, x):
        multi_head_attention = torch.cat([h(x) for h in self.heads], dim=-1) # cat over channel dim
        return multi_head_attention

Usually, the head size is chosen as the embedding size divided by the number of heads, so that concatenating the head outputs gives an n_embed sized result again. <br>

e.g. n_heads = 4, n_embed = 32. <br>
Each attention head outputs a result of size 32 / 4 = 8. <br>
Concatenating the 4 heads brings back a result of size 32.
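
The shape arithmetic above can be checked without any attention code. The sketch below uses random tensors as stand-ins for the head outputs; `head_outputs` and `combined` are illustrative names.

```python
import torch

B, T = 4, 8
n_embed, n_heads = 32, 4
head_size = n_embed // n_heads  # 8

# Stand-ins for the outputs of the individual heads, each of shape (B, T, head_size)
head_outputs = [torch.randn(B, T, head_size) for _ in range(n_heads)]

# Concatenating along the channel dimension restores the embedding size
combined = torch.cat(head_outputs, dim=-1)
print(combined.shape)  # torch.Size([4, 8, 32])
```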

In [None]:
self_attention_heads = MultiheadAttention(4, n_embed//4)

## Feed forward

After the tokens have communicated with each other during the self-attention stage, they need to "think" on the data they received. The feedforward layer allows this individual processing to occur.<br>
The feedforward layer is applied independently to all tokens. <br>
In the original Transformer paper, "Attention is All You Need", the feedforward network is referred to as a "position-wise feed-forward network", which is simply a multi-layer perceptron (MLP)

Crucially, the feedforward network operates on each token's representation independently. This allows individual tokens to process the gathered contextual information and refine their representations based on their specific role and meaning within the sequence.

In [28]:
class FeedForward(nn.Module):
    def __init__(self, n_embed) -> None:
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_embed, n_embed),
                                nn.ReLU()
        )

    def forward(self, x):
        return self.net(x)
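
That the feedforward layer treats every token independently can be verified directly. The sketch below assumes a small network of the same form as above, with arbitrary illustrative sizes: applying it to the whole (B, T, C) tensor gives the same values as applying it to each token vector on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, n_embed = 2, 3, 4
net = nn.Sequential(nn.Linear(n_embed, n_embed), nn.ReLU())
x = torch.randn(B, T, n_embed)

out_all = net(x)  # applied to the whole (B, T, C) tensor at once

# Applying the same network to every token separately gives the same values
out_single = torch.stack(
    [torch.stack([net(x[b, t]) for t in range(T)]) for b in range(B)]
)
print(torch.allclose(out_all, out_single, atol=1e-6))  # True
```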

## Putting everything together into a single transformer block

The attention mechanism and the feedforward neural net will be combined into a single transformer block. In the original paper, this block is repeated many times.

In [22]:
class TransformerBlock(nn.Module):
    def __init__(self, n_embed, n_head):
        super().__init__()
        head_size = n_embed // n_head
        self.self_attention = MultiheadAttention(n_head, head_size)
        self.ff = FeedForward(n_embed=n_embed)

    def forward(self, x):
        x = x + self.self_attention(x)
        x = x + self.ff(x)
        return x

## Optimization

The model below stacks several transformer blocks and is therefore already a fairly deep neural network. Deep networks tend to suffer from optimization issues such as exploding or vanishing gradients. 

In [None]:
class BigramModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed) # embedding for each position

        self.blocks = nn.Sequential(
            TransformerBlock(n_embed, n_head=4),
            TransformerBlock(n_embed, n_head=4),
            TransformerBlock(n_embed, n_head=4)
        )
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        token_embeddings = self.token_embedding_table(idx) # Dimensions: Batch, Time, n_embed
        positional_embedding = self.position_embedding_table(torch.arange(T, device=idx.device)) # integers 0..T-1, embedded to create a (T, n_embed) matrix
        
        x = token_embeddings + positional_embedding
        x = self.blocks(x)
        logits = self.lm_head(x)

        if targets is None:
            loss=None
        else:
            # F.cross_entropy expects logits of shape (N, C) and targets of shape (N,)
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

### Skip/ residual connections

Gradient Superhighway: During backpropagation, gradients are distributed equally at addition nodes. This means that a portion of the gradient from the loss function directly reaches earlier layers without being diminished by passing through intermediate layers. This creates a "gradient superhighway" that helps mitigate the vanishing gradient problem.<br>

Easier Optimization: Residual blocks, which are the blocks with skip connections, are often initialized to have little impact on the network's output initially. This allows the network to start with a simpler structure that is easier to optimize. As training progresses, these blocks gradually contribute more, allowing the network to learn more complex features.
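
The effect of the skip connection on the gradient can be made concrete with a toy calculation. This is only a sketch under strong simplifications: every layer is the scalar function f(x) = w * x, and the values of `depth` and `w` are hypothetical. It compares the derivative of the output with respect to the input for a plain stack and for a residual stack.

```python
depth = 10
w = 0.1  # hypothetical small weight, so each layer's local derivative is 0.1

def plain_grad(depth, w):
    # x_{l+1} = w * x_l, so the chain rule multiplies `depth` factors of w
    grad = 1.0
    for _ in range(depth):
        grad *= w
    return grad

def residual_grad(depth, w):
    # x_{l+1} = x_l + w * x_l, so every factor becomes (1 + w)
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + w
    return grad

print(plain_grad(depth, w))     # about 1e-10: the gradient has vanished
print(residual_grad(depth, w))  # about 2.59: the gradient still gets through
```

The constant 1 in each factor is the contribution of the addition node. It lets the gradient pass even when the layer itself contributes very little.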

In [24]:
class TransformerBlock(nn.Module):
    def __init__(self, n_embed, n_head):
        super().__init__()
        head_size = n_embed // n_head
        self.self_attention = MultiheadAttention(n_head, head_size)
        self.ff = FeedForward(n_embed=n_embed)

    def forward(self, x):
        x = x + self.self_attention(x) # Skip connection 
        x = x + self.ff(x) # Skip connection 
        return x

For this to work, the outputs of the heads and of the feedforward net need to be projected back into the residual pathway. The paper also suggests making the inner layer of the feedforward net 4 times as wide as the embedding size.

In [25]:
class MultiheadAttention(nn.Module):
    def __init__(self,  num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([AttentionHead(head_size=head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embed, n_embed)
    def forward(self, x):
        multi_head_attention = torch.cat([h(x) for h in self.heads], dim=-1) # cat over channel dim
        multi_head_attention = self.proj(multi_head_attention) # projection of the concatenated head outputs back into the residual pathway
        return multi_head_attention

class FeedForward(nn.Module):
    def __init__(self, n_embed) -> None:
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_embed, n_embed*4),
                                nn.ReLU(),
                                nn.Linear(n_embed*4, n_embed) # projection back into the residual pathway
        )
    
    def forward(self, x):
        return self.net(x)

### Layernorm

How Layer Normalisation (LayerNorm) works
- LayerNorm normalises the activations of each individual training example independently.
- It calculates the mean and variance of activations across all neurons within a single layer for a given example.
- Then, it uses these statistics to normalise the activations within that layer for that specific example.

Gamma (γ) and beta (β) are learnable parameters in Layer Normalization that allow the network to undo the normalization if needed or to scale and shift the normalized values. 
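
Written as a formula, for a single example with activations x_1, ..., x_d in a layer (d and ε are notation introduced here; ε is the small constant eps that avoids a division by zero):

$$
\mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad
\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2, \qquad
y_i = \gamma_i \, \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i
$$

With γ initialised to ones and β to zeros, the output initially has zero mean and unit variance per example. Note that nn.LayerNorm uses this biased variance (division by d), whereas x.var() in PyTorch divides by d - 1 by default, so a hand-written version based on x.var() differs slightly.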

Key Benefits
- Improved Optimisation: LayerNorm helps to smooth the optimisation landscape, making it easier for the optimiser to find a good solution.
- Faster Convergence: By stabilising training, LayerNorm can lead to faster convergence to a good set of weights.
- Better Generalisation: LayerNorm can improve the model's ability to generalise to unseen data.

In [26]:
class LayerNorm1d: # (used to be BatchNorm1d)

  def __init__(self, dim, eps=1e-5):
    self.eps = eps
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)

  def __call__(self, x):
    # calculate the forward pass
    xmean = x.mean(1, keepdim=True) # mean over the features of each example (dim 0 would give the batch mean, i.e. BatchNorm)
    xvar = x.var(1, keepdim=True) # variance over the features of each example (dim 0 would give the batch variance)
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
    self.out = self.gamma * xhat + self.beta
    return self.out

  def parameters(self):
    return [self.gamma, self.beta]

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size 32 of 100-dimensional vectors
x = module(x)
x.shape

torch.Size([32, 100])

PyTorch implementation of layernorm in the transformer block. In the original paper it is applied after the transformations such as self attention or feed forward, but it has become common practice to apply it before them (pre-norm formulation). A layernorm is also added once more at the end, before the lm_head. 

In [28]:
class TransformerBlock(nn.Module):
    def __init__(self, n_embed, n_head):
        super().__init__()
        head_size = n_embed // n_head
        self.self_attention = MultiheadAttention(n_head, head_size)
        self.ff = FeedForward(n_embed=n_embed)
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)

    def forward(self, x):
        x = x + self.self_attention(self.ln1(x)) # Skip connection , Layernorm applied before self attention
        x = x + self.ff(self.ln2(x)) # Skip connection, Layernorm applied before feed forward
        return x

To prevent overfitting, dropout should be added to the attention heads and the feedforward net. On every forward pass during training, a random subset of values (activations or attention weights) is set to zero, which prevents some of the nodes from communicating. Because a different subset is dropped each time, this effectively trains an ensemble of sub-networks. At test time, dropout is disabled again.
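
The difference between training and test time can be seen with nn.Dropout directly. This is a minimal sketch; the rate p = 0.2 and the input of ones are arbitrary choices for illustration. In training mode, the surviving values are scaled by 1 / (1 - p) so that the expected value stays the same.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.2)  # p = 0.2 is an arbitrary illustrative rate
x = torch.ones(2, 8)

drop.train()              # training mode: values are zeroed at random
train_out = drop(x)       # survivors are scaled by 1 / (1 - p) = 1.25
print(train_out)

drop.eval()               # evaluation mode: dropout does nothing
eval_out = drop(x)
print(torch.equal(eval_out, x))  # True
```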

## Final notes

- The version implemented here is a decoder-only model: it does not use cross attention to an encoder, and it uses the triangular mask, which an encoder does not.
    - In cross attention, the keys and values come from the encoder. 
- The original paper, which includes both an encoder and a decoder, focussed on machine translation, a sequence-to-sequence task. 
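
For comparison with the self attention head above, a single cross attention head could look roughly as follows. This is a sketch under assumptions, not part of the model built here: `enc_out` is a random stand-in for an encoder output, and the sizes are arbitrary. The queries come from the decoder side, the keys and values from the encoder side, and no triangular mask is applied.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
B, T_dec, T_enc, C, head_size = 2, 5, 7, 32, 16

dec_x = torch.randn(B, T_dec, C)    # decoder-side token representations
enc_out = torch.randn(B, T_enc, C)  # stand-in for the encoder output

query = nn.Linear(C, head_size, bias=False)
key = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

q = query(dec_x)    # queries come from the decoder
k = key(enc_out)    # keys come from the encoder
v = value(enc_out)  # values come from the encoder

wei = q @ k.transpose(-2, -1) * head_size**-0.5  # (B, T_dec, T_enc)
wei = F.softmax(wei, dim=-1)                     # no causal mask here
out = wei @ v                                    # (B, T_dec, head_size)
print(out.shape)  # torch.Size([2, 5, 16])
```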