
Import statements

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

Hyperparameters

The hyperparameters below (batch size, block size, learning rate, embedding width, number of heads and layers, dropout) govern both the training run and the model architecture.

In [None]:
# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
torch.manual_seed(1337)

<torch._C.Generator at 0x7ac9dea4bb10>

Load Data, create vocabulary

Encoder and Decoder

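The text is loaded from the file 'poems_dataset.txt', which is assumed to hold a set of poems.
This is not the Shakespeare dataset; it is a custom dataset of poems from various genres. The vocabulary is the set of unique characters in the text: encode maps a string to a list of integer ids, and decode maps the ids back to a string.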

In [None]:
# alternative dataset (not used here): wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('/content/poems_dataset.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }

encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
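
To make the round trip concrete, the sketch below rebuilds the same character-level mappings on a short made-up string and checks that decoding the encoded ids returns the original text. The string `sample` and the `toy_` names are assumptions for illustration and are not taken from poems_dataset.txt.

```python
# Toy string standing in for the poems file (hypothetical example text).
sample = "roses are red"

# Same construction as in the cell above, applied to the toy string.
toy_chars = sorted(set(sample))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    """Map each character to its integer id."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Map integer ids back to characters and join them."""
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("red rose")
print(ids)
print(toy_decode(ids))
```

Because the vocabulary is built from the text itself, encoding a character that never appears in that text raises a KeyError.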

Train-validation split

In [None]:
# Train and validation splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

Batch data generator

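The function get_batch samples a small random batch for training or validation. The inputs x are batch_size windows of block_size tokens, and the targets y are the same windows shifted one position to the right.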

In [None]:
# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y
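
The relationship between x and y is easiest to see on a tiny sequence. The sketch below mirrors the slicing in get_batch with plain Python lists instead of tensors; `toy_data`, `toy_block_size`, and the helper `toy_batch` are hypothetical names used only here.

```python
import random

toy_data = list(range(10, 20))   # stand-in for the encoded token ids
toy_block_size = 4               # assumed context length for the demo

def toy_batch(data, block_size, batch_size, rng):
    """Sample windows x and their one-step-shifted targets y."""
    # valid start offsets are 0 .. len(data) - block_size - 1, as with torch.randint
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i:i + block_size] for i in starts]
    y = [data[i + 1:i + block_size + 1] for i in starts]
    return x, y

x, y = toy_batch(toy_data, toy_block_size, batch_size=2, rng=random.Random(0))
for xi, yi in zip(x, y):
    # at position t the model sees xi[:t+1] and must predict yi[t]
    print(xi, "->", yi)
```

Each window therefore supplies block_size training examples at once, one for every prefix length.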


Estimate Loss

The function estimate_loss averages the loss over eval_iters random batches from each split, with gradients disabled and the model in evaluation mode, and returns the mean training and validation losses.

In [None]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
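
A quick sanity check for the first values this function reports: a model that assigns equal probability to every character has a cross-entropy of ln(vocab_size). The sketch below computes that baseline in plain Python and shows the kind of averaging estimate_loss performs. The vocabulary size of 77 is an assumption for illustration (the notebook never prints vocab_size), and the three batch losses are made-up numbers.

```python
import math

def uniform_cross_entropy(vocab_size):
    """Cross-entropy (in nats) of predicting every token with probability 1/vocab_size."""
    return -math.log(1.0 / vocab_size)

def mean_loss(batch_losses):
    """Average per-batch losses, as estimate_loss does over eval_iters batches."""
    return sum(batch_losses) / len(batch_losses)

assumed_vocab_size = 77  # assumption: not stated in the notebook
print(f"uniform baseline: {uniform_cross_entropy(assumed_vocab_size):.4f}")
print(f"mean of three noisy batch losses: {mean_loss([2.1, 1.9, 2.0]):.4f}")
```

A freshly initialized model usually starts near this baseline, and a loss well above it at step 0 would hint at a problem in the data pipeline or the initialization.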

Transformer components

Several helper classes (Head, MultiHeadAttention, FeedFoward, and Block) define the different parts of the Transformer architecture.

In [None]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)
        # compute attention scores ("affinities")
        # note: the scale uses C = n_embd; standard scaled dot-product attention divides by sqrt(head_size)
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # causal mask: hide future positions
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B, T, head_size)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ position-wise MLP: linear layer, ReLU, linear layer, then dropout """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
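
The masking step in Head is what makes the attention causal: scores for future positions are set to -inf, so softmax gives them zero weight while each row still sums to 1. The following sketch reproduces that step with plain Python lists for a single sequence. The score matrix is made up, and `causal_softmax` is a hypothetical helper, not part of the notebook.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax over a T x T score matrix, ignoring positions j > i."""
    T = len(scores)
    weights = []
    for i in range(T):
        # keep only the current and earlier positions (lower triangle)
        visible = scores[i][: i + 1]
        m = max(visible)                           # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # future positions get exactly zero weight, as exp(-inf) would
        weights.append([e / total for e in exps] + [0.0] * (T - i - 1))
    return weights

# made-up raw attention scores for a sequence of length 3
scores = [[0.5, 2.0, -1.0],
          [1.0, 1.0, 3.0],
          [0.0, 0.0, 0.0]]

for row in causal_softmax(scores):
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its weight is 1; the last token spreads its attention over all three positions.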


Model Definition

The BigramLanguageModel is the main model class. Despite the name, it is a small Transformer language model: a token embedding layer, a position embedding, a stack of Transformer blocks, a final layer norm, and a linear layer (lm_head) that produces the logits for the next token.

In [None]:
# Transformer language model (the class keeps the name BigramLanguageModel)
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token and position embeddings, each of width n_embd
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
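
As a cross-check on the architecture, the number of trainable parameters can be tallied by hand from the layer shapes. The sketch below does this in plain Python with the hyperparameters listed earlier as defaults. The function name and the vocabulary size of 77 are assumptions (vocab_size depends on the dataset and is not printed in the notebook). With 77 characters the tally comes to 211,277, which agrees with the 0.211277 M parameters reported by the training cell.

```python
def count_parameters(vocab_size, n_embd=64, n_head=4, n_layer=4, block_size=32):
    """Tally trainable parameters of the model defined above from its layer shapes."""
    head_size = n_embd // n_head
    # key, query, value projections (no bias) for every head
    attention = n_head * 3 * n_embd * head_size
    # output projection after concatenating the heads (weights + bias)
    proj = n_embd * n_embd + n_embd
    # feed-forward: n_embd -> 4*n_embd -> n_embd, both layers with bias
    ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # two LayerNorms per block, each with a weight and a bias vector
    norms = 2 * 2 * n_embd
    per_block = attention + proj + ffwd + norms

    embeddings = vocab_size * n_embd + block_size * n_embd
    final_norm = 2 * n_embd
    lm_head = n_embd * vocab_size + vocab_size
    return n_layer * per_block + embeddings + final_norm + lm_head

# assumption: a 77-character vocabulary
print(count_parameters(77) / 1e6, "M parameters")
```

Nearly all of the parameters sit in the four Transformer blocks; the embeddings and lm_head together contribute only about 129 parameters per vocabulary entry plus the position table.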

Model Initialization and Training

The model is instantiated, moved to a GPU if available, and trained using the AdamW optimizer.

In [None]:
model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


0.211277 M parameters
step 0: train loss 4.5418, val loss 4.5445
step 100: train loss 2.5777, val loss 2.6137
step 200: train loss 2.4446, val loss 2.4946
step 300: train loss 2.3710, val loss 2.4182
step 400: train loss 2.2922, val loss 2.3427
step 500: train loss 2.2533, val loss 2.2988
step 600: train loss 2.1942, val loss 2.2505
step 700: train loss 2.1483, val loss 2.2189
step 800: train loss 2.1026, val loss 2.1867
step 900: train loss 2.0633, val loss 2.1508
step 1000: train loss 2.0210, val loss 2.1285
step 1100: train loss 1.9749, val loss 2.0841
step 1200: train loss 1.9677, val loss 2.0881
step 1300: train loss 1.9428, val loss 2.0479
step 1400: train loss 1.9141, val loss 2.0435
step 1500: train loss 1.8870, val loss 2.0109
step 1600: train loss 1.8738, val loss 2.0009
step 1700: train loss 1.8521, val loss 1.9843
step 1800: train loss 1.8375, val loss 1.9833
step 1900: train loss 1.8174, val loss 1.9794
step 2000: train loss 1.8052, val loss 1.9745
step 2100: train loss 1.

Text Generation

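After training, generation starts from a single-token context holding index 0 (the first character in the sorted vocabulary), and the model samples 2000 new characters one at a time, each conditioned on at most the last block_size characters.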

In [None]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
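
The generation loop can be illustrated without a trained network. The sketch below follows the same steps as generate (crop the context to block_size, get a next-token distribution, sample, append) in plain Python. `toy_next_token_probs` is a made-up stand-in for the model that simply returns a uniform distribution, so the output is a list of random ids rather than text.

```python
import random

def toy_next_token_probs(context, vocab_size):
    """Hypothetical stand-in for the model: uniform distribution over the vocabulary."""
    return [1.0 / vocab_size] * vocab_size

def toy_generate(ids, max_new_tokens, block_size, vocab_size, rng):
    """Autoregressive sampling: one new token per step, appended to the sequence."""
    ids = list(ids)
    for _ in range(max_new_tokens):
        context = ids[-block_size:]                    # crop to the last block_size tokens
        probs = toy_next_token_probs(context, vocab_size)
        # draw one index according to probs (the role torch.multinomial plays above)
        next_id = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        ids.append(next_id)
    return ids

out = toy_generate([0], max_new_tokens=10, block_size=4, vocab_size=5, rng=random.Random(1337))
print(out)
```

Sampling from the distribution, rather than always taking the most likely character, is why repeated runs of the real model produce different text unless the random seed is fixed.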


By the gives hose by a good ot
At not them as whirch,. Rest that
Long weeds to me. He poon allow it oused
On flying just things know, paber's my friend,
And seacheds and till how Give:
When them tread the charf of at he tell happes,
And shaultious leaved sput I remember by atter hoof
My hose hills ghakeV, or star and got with betwext to doth.

Thou, chart in
Wit thing'n be bellen sizarsizer that be two sprent
Deaven thinknes, to my thy 'mouther's from and his bed to good,
Or vent somed up by My greatess gone.'

There's you amigon's was one, a time's you shoulds,
And three the rises, that with the reaves.
Cut stoon my eye?'I haves for one and sylvarky.
I was thou

When I glades one to we made.

Now vergit, not come rogettist staren
That asintic far plowed, so where revers in the sended holdid
And Moved we to runter of Her cottives green.
They riefull;
Annythingth by equartly jerest from loteling to to make. Se too that.Deft she it lanced Now,
And this not 'serever pace by the hath cour