# **Example 6.3.1 (Text generation using a decoder-only transformer architecture)**

In an earlier example we built a multi-head attention module and used it to model a text file containing a poem. Here we show how that structure, a bigram language model combined with multi-head attention, can be improved further to generate better text. Let us look at the added modules.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Importing the libraries

Let us import the required libraries. The code setup is the same as in the multi-head attention example. Here we add more transformer components and increase the capacity of the model, which lets it extract richer features from the text and typically leads to better results.

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F


We will be using the same Multihead class and self-attention module as in Example 6.2.2. The only difference is the added projection layer, a linear layer of size $n_{embd} \times n_{embd}$ that is applied to the concatenated head outputs. We also add dropout layers to help regularize the training.

In [None]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size): #This has a linear transformation for query, key and value, a dropout module and a triangular matrix for the decoder
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x): # Computes the attention weights, applies dropout, and returns the weighted sum of the values ("out")
        B,T,C = x.shape
        k = self.key(x)   # (B,T,head_size)
        q = self.query(x) # (B,T,head_size)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size) as in scaled dot-product attention
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T), each position sees only itself and earlier positions
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,head_size)
        out = wei @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size): # Builds num_heads heads (class Head above) inside an nn.ModuleList, plus the output projection and dropout
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x): # Concatenates the head outputs along the feature dimension, then applies the projection and dropout
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out
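
The masking and scaling steps inside `Head.forward` can be checked in isolation. The following is a minimal sketch on random tensors, not the trained model; the sizes `T = 4` and `head_size = 8` are toy values chosen only for illustration.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
T, head_size = 4, 8                      # toy sizes, chosen only for illustration
q = torch.randn(1, T, head_size)         # queries, (B, T, head_size)
k = torch.randn(1, T, head_size)         # keys,    (B, T, head_size)

# scaled dot-product scores, shape (B, T, T)
scores = q @ k.transpose(-2, -1) * head_size ** -0.5

# lower-triangular mask: position t may only attend to positions <= t
tril = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(tril == 0, float('-inf'))

# softmax turns each row into a probability distribution over visible positions
weights = F.softmax(scores, dim=-1)
print(weights[0])
```

Every row of `weights` sums to 1, the entries above the diagonal are exactly zero, and the first row puts all of its weight on the first token because that is the only position it is allowed to see.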

In the original transformer architecture, each attention layer is followed by a feed-forward part, and together they form a block that is repeated several times. The feed-forward part is a simple neural network that is applied to every token position separately.

So far we have not implemented the feed-forward part: the model took the output of the attention heads and computed the logits directly from it. Self-attention lets the tokens exchange information, but the model then had no step in which each token could process what it had gathered from the heads. The feed-forward network supplies that per-token computation.

Let us implement this. A small feed-forward block is built from a linear layer followed by a _ReLU_ non-linearity. In the model, the feed-forward block has n_embd nodes, and the output of self-attention is passed sequentially to it. In short, self-attention gathers the data, and the feed-forward part then processes it for each token individually.

In [None]:
class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd), # processes each token independently, using the information gathered by self-attention
            nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)
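
That the feed-forward part works on each token individually can be seen with a small experiment. This is a sketch under assumed toy sizes; `ffwd_demo` and `n_embd_demo` are hypothetical names that mirror the layer above, and the weights are random rather than trained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd_demo = 8                                   # hypothetical small embedding size
ffwd_demo = nn.Sequential(nn.Linear(n_embd_demo, n_embd_demo), nn.ReLU())

x = torch.randn(1, 5, n_embd_demo)                # (B, T, C): one sequence of 5 tokens
x_changed = x.clone()
x_changed[0, 2] += 1.0                            # perturb only the token at position 2

with torch.no_grad():
    y = ffwd_demo(x)
    y_changed = ffwd_demo(x_changed)

# largest absolute change in the output, reported per token position
diff = (y - y_changed).abs().amax(dim=-1)[0]
print(diff)
```

Only position 2 can show a non-zero difference: the linear layer acts on the last dimension, so no information moves between positions here. Mixing across positions is left entirely to self-attention.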

According to the original paper on Transformers, however, the combination of multi-head attention and feed-forward network is applied several times in succession. To make this easy, we create a class below called _Block_ that chains multi-head attention and the feed-forward network, and the model stacks this block repeatedly. The number of repetitions is set by the hyperparameter n_layer (6 in the hyperparameters defined later in this example).

Since the model is now fairly deep, it can become hard to optimize. Two techniques from the paper help train such a deep model:

- skip connection
- layer normalization

Next we implement both.

In the code below, the feed-forward module becomes a linear layer that expands to $4 \times n_{embd}$, a non-linearity, a linear layer that projects back to $n_{embd}$, and dropout. In the _Block_ class, the multi-head attention layer is followed by the feed-forward layer, and each is wrapped in a residual (skip) connection as in the original transformer architecture. Note that the code applies layer normalization to the input of each sub-layer (the pre-norm arrangement), whereas the original paper normalizes after the residual addition.

In [None]:
class FeedFoward(nn.Module): # Linear layer, ReLU, linear layer, then dropout
    """ position-wise feed-forward network: expand to 4 * n_embd, apply ReLU, project back to n_embd """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)



class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x)) # layer norm, then multi-head attention, then residual add (pre-norm)
        x = x + self.ffwd(self.ln2(x)) # layer norm, then feed-forward, then residual add (pre-norm)
        return x
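
The two ingredients of `Block.forward` can be examined separately. The sketch below uses random data and a hypothetical feature size `C = 16`. It checks that `nn.LayerNorm` standardizes each token's feature vector, and that a sub-layer contributing nothing leaves the input untouched because of the skip connection.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C = 16                                            # hypothetical feature size
ln = nn.LayerNorm(C)
x = 5.0 + 3.0 * torch.randn(2, 4, C)              # (B, T, C), deliberately shifted and scaled

# layer norm works over the last dimension, separately for every token
y = ln(x)
mean = y.mean(dim=-1)                             # close to 0 for every token
var = y.var(dim=-1, unbiased=False)               # close to 1 for every token

# a sub-layer with all-zero weights adds nothing, so the residual path returns x itself
sublayer = nn.Linear(C, C)
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)
out = x + sublayer(ln(x))
print(mean.abs().max().item(), var.mean().item(), torch.allclose(out, x))
```

The second check illustrates why skip connections ease optimization: the input always has a direct path to the output, so each block only has to learn a correction to it.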



## Bigram language model

The model introduced in Example 6.2.2 is used here, modified to incorporate the feed-forward layer and layer normalization. A new hyperparameter, **n_layer**, sets how many blocks are stacked back to back. Although the class keeps the name BigramLanguageModel, each prediction now depends on up to block_size previous tokens rather than on the last token alone.

In [None]:
# decoder-only transformer language model (the class keeps its bigram name from the earlier example)
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token embeddings and learned position embeddings, both of size n_embd
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C), built on the same device as the input
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
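
One iteration of the loop in `generate` can be traced by hand. In this sketch the model call is replaced by random logits, which is an assumption made only so the example runs on its own; `block_size_demo` and `vocab_size_demo` are hypothetical toy values.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
block_size_demo = 4                               # hypothetical context length
vocab_size_demo = 5                               # hypothetical vocabulary size
idx = torch.tensor([[0, 3, 1, 4, 2, 2]])          # (B=1, T=6) running sequence

# crop to the last block_size tokens, since the position table has no more entries
idx_cond = idx[:, -block_size_demo:]
print(idx_cond)                                   # tensor([[1, 4, 2, 2]])

# stand-in for `logits, loss = self(idx_cond)`
logits = torch.randn(1, idx_cond.shape[1], vocab_size_demo)

last = logits[:, -1, :]                           # (B, vocab): prediction for the next token
probs = F.softmax(last, dim=-1)                   # (B, vocab): sums to 1
idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1): sampled token id
idx = torch.cat((idx, idx_next), dim=1)           # (B, T+1)
print(idx.shape)                                  # torch.Size([1, 7])
```

Sampling with `torch.multinomial` instead of taking the arg-max is what makes the generated text vary from run to run; the full sequence keeps growing while only the cropped window is fed back to the model.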

## Define the hyperparameters

Now let us define all the hyperparameters needed for the code to run.

In [None]:
# hyperparameters
batch_size = 64 # how many independent sequences will we process in parallel?
block_size = 256 # what is the maximum context length for predictions?
max_iters = 2001
eval_interval = 10
learning_rate = 5e-5
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2
# ------------
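
A few quantities follow directly from these settings and are worth working out before training. The sketch below only does arithmetic on the values copied from the cell above; the evaluation count assumes the rule used in the training loop of this example, which evaluates whenever the step is a multiple of eval_interval or is the last step.

```python
# Values copied from the hyperparameter cell above
batch_size, block_size = 64, 256
max_iters, eval_interval, eval_iters = 2001, 10, 200
n_embd, n_head = 384, 6

# each head works in a slice of the embedding; concatenating the heads restores n_embd
head_size = n_embd // n_head
print(head_size)                                  # 64
print(n_head * head_size == n_embd)               # True

# number of characters processed in one training step
tokens_per_batch = batch_size * block_size
print(tokens_per_batch)                           # 16384

# steps at which the train and validation losses are estimated
eval_steps = [s for s in range(max_iters)
              if s % eval_interval == 0 or s == max_iters - 1]
n_evals = len(eval_steps)
print(n_evals)                                    # 201

# every evaluation runs eval_iters batches on each of the two splits
eval_forward_passes = n_evals * 2 * eval_iters
print(eval_forward_passes)                        # 80400
```

With these settings the loss estimation performs far more forward passes than the 2001 training steps themselves, so a larger eval_interval or a smaller eval_iters would shorten the run considerably at the price of a coarser or noisier loss curve.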

We use the same _poem.txt_ dataset as before, with the same setup. Let us see how this model performs.
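
The character-level setup can be illustrated on a toy string before running it on the poem. This is a sketch only: `text_demo` is a made-up stand-in for the file contents, and the `_demo` names are hypothetical.

```python
import torch

text_demo = "the ancient mariner"                 # made-up stand-in for the poem text
chars_demo = sorted(set(text_demo))               # unique characters form the vocabulary
stoi_demo = {ch: i for i, ch in enumerate(chars_demo)}
itos_demo = {i: ch for i, ch in enumerate(chars_demo)}

encode_demo = lambda s: [stoi_demo[c] for c in s]             # string -> list of integers
decode_demo = lambda l: ''.join(itos_demo[i] for i in l)      # list of integers -> string

data_demo = torch.tensor(encode_demo(text_demo), dtype=torch.long)

# one training example: the target is the input shifted by one character
block_demo, i = 8, 3
x = data_demo[i:i + block_demo]
y = data_demo[i + 1:i + block_demo + 1]
print(repr(decode_demo(x.tolist())))              # ' ancient'
print(repr(decode_demo(y.tolist())))              # 'ancient '
```

At every position t the model is trained to predict `y[t]` from `x[:t+1]`, so a single window of block_size characters yields block_size training targets.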

In [None]:
loss1 = []
loss2_tr = []
loss2_val = []
torch.manual_seed(1337)

with open('/content/drive/MyDrive/DL_Book_Notebooks/Chapter 6: Attention Networks and transformers/Data/Rime_ancient_poem2.txt', 'r', encoding = 'utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and validation splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out



model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        loss2_tr.append(losses['train'].item())
        loss2_val.append(losses['val'].item())
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    loss1.append(loss.item()) # store a plain float so the autograd graph is not kept alive
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))


10.762014 M parameters
step 0: train loss 3.5376, val loss 3.5295
step 10: train loss 2.7569, val loss 2.7632
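
The printed parameter count can be reproduced by hand from the layer definitions. The sketch below is plain arithmetic; `count_parameters` is a hypothetical helper, and the vocabulary size of 30 is not stated anywhere in the example but is the value that makes the total agree with the printed figure. The `tril` buffers are excluded because buffers are not parameters.

```python
import math

def count_parameters(vocab_size, n_embd=384, n_head=6, n_layer=6, block_size=256):
    """Count trainable parameters of the model defined above (hypothetical helper)."""
    head_size = n_embd // n_head
    attn = n_head * 3 * n_embd * head_size        # key, query, value projections, no bias
    proj = n_embd * n_embd + n_embd               # output projection with bias
    ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    norms = 2 * (2 * n_embd)                      # ln1 and ln2, weight and bias each
    block = attn + proj + ffwd + norms
    embeddings = vocab_size * n_embd + block_size * n_embd
    final_norm = 2 * n_embd
    lm_head = n_embd * vocab_size + vocab_size
    return n_layer * block + embeddings + final_norm + lm_head

total = count_parameters(vocab_size=30)           # 30 is inferred, see the note above
print(total / 1e6, 'M parameters')                # 10.762014 M parameters

# loss of a model that assigns equal probability to every character
uniform_loss = math.log(30)
print(round(uniform_loss, 4))                     # 3.4012
```

Almost all of the parameters sit in the six blocks, about 1.77 M each, while the embeddings and output layer together add only about 0.12 M. Under the same assumption about the vocabulary, the step-0 loss of about 3.54 is close to the uniform-guess value, which is what one would expect from an untrained model.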


In [None]:
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

In [None]:
import matplotlib.pyplot as plt

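plt.plot([float(l) for l in loss1]) # per-step training loss as plain floats
plt.xlabel('training step')
plt.ylabel('training loss')
plt.show()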