# Generative Artificial Intelligence

* How to create a model that can generate images/text/audio/etc. from user prompts?
* Unlabelled data availbale in abundance. Learn hidden sructure from the unlabelled data.

In this notebook we specifically address the task of text generation.

### What is text generation?
Given a sequence, predict what is the next token in the sequence. The sequence can be a series of words or characters and the objective is to predict next word or character respectively in sequence.
$$P(w_{t} | w_{t-1}, w_{t-2}, w_{t-3},...,w_{1})$$

#### Some basic terminology:
**Tokens/Tokenization** - Given a sequence of characters, tokenization is the process of dividing the sequence into smaller units called tokens. Tokens can be individual characters, segments of words, complete words or portions of sentences. Tokens obtained are converted into 1-hot vectors to be fed into the model.

**Generative Model** - A model that learns to sample from the probability distribution to generate data that seem to be from the same probability distribution as training data.

**Discriminative Model** - In contrast to generative models, discriminative models are trained to differentiate between classes or categories.


### Dataset

In [1]:
# Read the input corpus
with open('tiny_shakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print("Length of text: ", len(text))
print(f"\nSample text:\n{text[:400]}")

Length of text:  1115393

Sample text:
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it 


### Tokenization
One of the easiest language model to start with is the character level model where each character is a token. It encodes minimum token level information but is easy to implement.

In [2]:
# Create characters as vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("Vocabulary: ", ''.join(chars))
print("Vocabulary size: ", vocab_size)

Vocabulary:  
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocabulary size:  65


To feed characters into a model they need to converted into numbers that can be processed by a model.

In [3]:
# Encoder and decoder function for idx to char and back
stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

In [4]:
print(encode('Shakespeare'))
print(decode(encode('Shakespeare')))

[31, 46, 39, 49, 43, 57, 54, 43, 39, 56, 43]
Shakespeare


In [5]:
# Convert text to torch tensor
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
data = torch.tensor(encode(text), dtype=torch.long)

In [6]:
# Split data into train and validation
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

# Get single batch of data for training
def get_batch(split='train', block_size=8, batch_size=4):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

In [7]:
x, y  = get_batch(split='train')
x[0], y[0]

(tensor([14, 59, 58,  6,  1, 51, 39, 42]),
 tensor([59, 58,  6,  1, 51, 39, 42, 39]))

### Generative Model
**Possible generative models**:
1. **N-gram model** - Given n-previous tokens in the sequence, predict the next token. Most common approaches are bigram or trigram model with bayes estimation. Larger the value of **N**, more context information can be incorporated.
2. **Recurrent neural networks** - A goto neural network achitecture for working with sequential data. Behind the scenes, just a neural network that processes each token of the sequential input one at a time. 

<p align="center">
<img src="assets/rnn.webp" width="700">
</p>

Condenses entire history of the sequence into a single vecctor. Theoretically RNNs can process infinite history but this is limited proctically by computational constraints and memory requirements. Even with a large enough history, RNNs struggle with long term dependencies.

3. **Transformer models** - Introduced in 2017 by the paper [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf). The paper introduces an architecture that provides a differentiable lookup method for the called `Attention` that potentially solves the problem of long term dependencies by allowing the model to lookup specific information from the history as required.

e.g., Prompt - Where is Eiffel Tower located? Answer - It is located in Paris. _Here `It` is related to `Eiffel Tower` and `Paris` to `Where`_.

In this notebook we will start with a simple Bigram model and slowly build our way towards a Transformer model.

### Bigram Model

**Embedding layer** - Converts from an index-based representation to a vector representation i.e., each index is mapped to a vector.

In [8]:
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

example_layer = nn.Embedding(vocab_size, vocab_size).to(device)
print(x[0])
print(example_layer(x[0]).shape)

tensor([14, 59, 58,  6,  1, 51, 39, 42])
torch.Size([8, 65])


In [9]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        
    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx) # (B, T, C) (4, 8, 65)
        
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, loss = self(idx)  # calls forward function
            logits = logits[:, -1, :] # only consider the last output
            probs = F.softmax(logits, dim=-1) # normalize it to a probabilty distribution
            idx_next = torch.multinomial(probs, num_samples=1) # Sample from the distribution
            idx = torch.cat((idx, idx_next), dim=1) # Add it to the generated sequence
            
        return idx

#### Instantiate Bigram model

In [10]:
m = BigramLanguageModel(vocab_size).to(device)
xb, yb = get_batch(split='train')
logits, loss = m(xb, yb)
print(logits.shape, yb.shape)

torch.Size([32, 65]) torch.Size([4, 8])


In [11]:
idx = torch.tensor(encode('Thou'), dtype=torch.long).unsqueeze(dim=0).to(device)
print(decode(m.generate(idx, 10)[0].tolist()))

ThoutMtoIQlNhv


#### Train the model

In [12]:
@torch.no_grad()
def estimate_loss(model, eval_iters=300):
    model.eval()
    out = {}
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split=split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = loss.mean()
    model.train()
    return out['train'], out['val']

def train_model(model, optimizer, block_size=8, batch_size=4, train_iters=10000):
    for step in range(train_iters):
        xb, yb = get_batch('train', batch_size=batch_size, block_size=block_size)
        logits, loss = m(xb, yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        if ((step + 1) % 1000 == 0):
            train_loss, val_loss = estimate_loss(model=model)
            print(f'Step {step + 1}: train loss {train_loss:.4f}, validation loss {val_loss:.4f}')

optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)
train_model(m, optimizer)
print(f'Trained model validation loss: {estimate_loss(m)[1]:.4f}')

Step 1000: train loss 3.7891, validation loss 3.9341
Step 2000: train loss 3.3235, validation loss 3.7550
Step 3000: train loss 3.1876, validation loss 3.2936
Step 4000: train loss 2.7422, validation loss 3.2484
Step 5000: train loss 2.7747, validation loss 2.7088
Step 6000: train loss 2.3555, validation loss 2.7764
Step 7000: train loss 2.4933, validation loss 2.5743
Step 8000: train loss 2.5843, validation loss 2.6449
Step 9000: train loss 2.6714, validation loss 2.8557
Step 10000: train loss 2.5463, validation loss 2.1739
Trained model validation loss: 2.6996


In [13]:
idx = torch.tensor(encode('Thou'), dtype=torch.long).unsqueeze(dim=0).to(device)
print(decode(m.generate(idx, 100)[0].tolist()))

Thou poandiseron; emarlY rewillio$ou he, s:

JRUCA send ayoulecont lspadoY cedshagave I whe s we,
d he b


### Transformer Models

In bigram model, we are only considering the last character of the sequence to generate a new character. With the help of transformers we will enable the model to look into the entire history i.e., all the characters (limited to block size of the data) in the sequence so far.

At the core of the transformer is a `single attention head` (referred to in the paper as Scaled Dot-Product Attention).

<p align="center">
<img src="assets/Single-head attention.png" width="200">
</p>

$$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt(d_k)})V$$

**Q, K, V** - Query, Key and Value vectors corresponding to the vector representations for each token.

In [14]:
xb, yb = get_batch(split='train')
print(xb[0, :4])

for b in range(1):
    for t in range(4):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"input: {context}, target: {target}")

tril = torch.tril(torch.ones(4, 4))
print(tril)

tensor([58,  1, 54, 50])
input: tensor([58]), target: 1
input: tensor([58,  1]), target: 54
input: tensor([58,  1, 54]), target: 50
input: tensor([58,  1, 54, 50]), target: 43
tensor([[1., 0., 0., 0.],
        [1., 1., 0., 0.],
        [1., 1., 1., 0.],
        [1., 1., 1., 1.]])


In [15]:
class SingleHeadAttention(nn.Module):
    def __init__(self, head_size=32, embed_dim=32, block_size=8, dropout=0.4) -> None:
        super().__init__()
        self.head_size = head_size
        self.key = nn.Linear(embed_dim, head_size, bias=False)
        self.query = nn.Linear(embed_dim, head_size, bias=False)
        self.value = nn.Linear(embed_dim, embed_dim, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        B, T, _ = x.shape
        k = self.key(x) # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)
        # Attention scores
        weights = q @ k.transpose(-2, -1) * self.head_size**-0.5 # (B, T, head_size) @ (B, head_size, T) --> # (B, T, T)
        weights = weights.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        weights = F.softmax(weights, dim=-1) # (B, T, T)
        weights = self.dropout(weights)
        # weighted aggregation of values
        v = self.value(x) # (B, T, embed_dim)
        context = weights @ v # (B, T, T) @ (B, T, embed_dim) -> (B, T, embed_dim)
        return context

    
class LanguageModelBase(nn.Module):
    def __init__(self, head_size=32, embed_dim=32, block_size=8, dropout=0.4) -> None:
        super().__init__()
        self.block_size = block_size
        self.token_embedding_table = nn.Embedding(vocab_size, embed_dim)
        self.positional_embedding_table = nn.Embedding(block_size, embed_dim)
        self.attention_head = SingleHeadAttention(head_size, embed_dim, block_size, dropout)
        self.lm_head = nn.Linear(embed_dim, vocab_size)
        
    def forward(self, idx, targets = None):
        _, T = idx.shape
        token_embed = self.token_embedding_table(idx) # (B, T, embed_dim)
        pos_embed = self.positional_embedding_table(torch.arange(T, device=device)) # (T, embed_dim)
        x = token_embed + pos_embed # (B, T, embed_dim)
        x = self.attention_head(x) # (B, T, embed_dim)
        logits = self.lm_head(x) # (B, T, vocab_size)
        
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss
        
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.block_size:]
            logits, loss = self(idx_cond)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
            
        return idx

In [16]:
model = LanguageModelBase()
m = model.to(device)
print(sum(p.numel() for p in m.parameters()) / 1e3, 'K parameters')
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)
train_model(model=m, optimizer=optimizer)
print(f'Trained model validation loss: {estimate_loss(m)[1]:.4f}')

7.553 K parameters
Step 1000: train loss 2.5853, validation loss 2.7865
Step 2000: train loss 2.8465, validation loss 2.7734
Step 3000: train loss 2.3241, validation loss 2.2861
Step 4000: train loss 2.9462, validation loss 2.6620
Step 5000: train loss 2.3600, validation loss 2.7450
Step 6000: train loss 2.3211, validation loss 2.7743
Step 7000: train loss 2.6133, validation loss 2.6025
Step 8000: train loss 2.5949, validation loss 2.6655
Step 9000: train loss 2.3887, validation loss 2.4869
Step 10000: train loss 2.1348, validation loss 2.3411
Trained model validation loss: 2.3338


In [17]:
idx = torch.tensor(encode('Thou'), dtype=torch.long).unsqueeze(dim=0).to(device)
print(decode(m.generate(idx, 100)[0].tolist()))

Thou carevau esun hee :Ath he mer ths.sliincutus bthr se acci kY'si:
Ml m nitoxc yawet betw shy s nt shr


#### Multi-head Attention

One possible way of understanding attention head is that it looks for specific speech information in the sequence. E.g., the head might look for relevant nouns in the sequence. The idea of multi-head attention is that each head can look for different kind of speech information, e.g., one head for relevant nouns, one for relevant verbs, one for relevant prepositions, etc.

<p align="center">
<img src="assets/multihead_attention.png" width="250">
</p>

In [18]:
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads=4, head_size=32, embed_dim=32, dropout=0.4, block_size=8) -> None:
        super().__init__()
        self.heads = nn.ModuleList([SingleHeadAttention(head_size, embed_dim, block_size, dropout) for _ in range(num_heads)])
        self.proj = nn.Linear(embed_dim * num_heads, embed_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1) # (B, T, num_heads*head_size)
        out = self.proj(out) # (B, T, embed_dim)
        out = self.dropout(out) # (B, T, embed_dim)
        return out
    
class LanguageModelMultiHead(LanguageModelBase):
    def __init__(self, head_size=32, embed_dim=32, block_size=8, dropout=0.4, num_heads=4) -> None:
        super().__init__(head_size, embed_dim, block_size, dropout)
        self.attention_head = MultiHeadAttention(num_heads, head_size, embed_dim, dropout, block_size)

In [19]:
model = LanguageModelMultiHead()
m = model.to(device)
print(sum(p.numel() for p in m.parameters()) / 1e3, 'K parameters')
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)
train_model(model=m, optimizer=optimizer)
print(f'Trained model validation loss: {estimate_loss(m)[1]:.4f}')

20.897 K parameters
Step 1000: train loss 2.7360, validation loss 2.7585
Step 2000: train loss 2.4645, validation loss 2.2319
Step 3000: train loss 2.4067, validation loss 2.6485
Step 4000: train loss 2.8033, validation loss 2.4727
Step 5000: train loss 2.5226, validation loss 2.2189
Step 6000: train loss 2.5559, validation loss 2.5681
Step 7000: train loss 2.3045, validation loss 2.6631
Step 8000: train loss 2.6865, validation loss 2.5036
Step 9000: train loss 2.4763, validation loss 2.4336
Step 10000: train loss 2.5803, validation loss 2.2242
Trained model validation loss: 2.2565


In [20]:
idx = torch.tensor(encode('Thou'), dtype=torch.long).unsqueeze(dim=0).to(device)
print(decode(m.generate(idx, 100)[0].tolist()))

Thoug umerod g you i't clcn,
M ilsemyofrdy
Hhead ou
Aranomm,
vos,
Amec tt br?
obleplwardat lorig I ot
Al


# Attention Is All You Need

So far we implemented a multi-head attention module. In the paper the author suggests to stack multiple such blocks, thereby increasing the depth of the network such that multiple attention blocks can interact with each other.

<p align="center">
<img src="assets/transformers.png" width="300">
</p>

In [21]:
class FeedForward(nn.Module):
    def __init__(self, embed_dim=32, dropout=0.4) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.ReLU(),
            nn.Linear(4 * embed_dim, embed_dim),
            nn.Dropout(dropout),
        )
        
    def forward(self, x):
        return self.net(x)
    

class Block(nn.Module):
    def __init__(self, embed_dim=32, n_head=4, head_size=32, dropout=0.4, block_size=8) -> None:
        super().__init__()
        self.ma = MultiHeadAttention(n_head, head_size, embed_dim, dropout, block_size)
        self.ffn = FeedForward(embed_dim)
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ln2 = nn.LayerNorm(embed_dim)
        
    def forward(self, x):
        x = x + self.ma(self.ln1(x)) # Communication layer (B, T, embed_dim)
        x = x + self.ffn(self.ln2(x)) # Computation layer (B, T, embed_dim)
        return x
    
class LanguageModelTransformer(LanguageModelBase):
    def __init__(self, head_size=32, embed_dim=32, block_size=8, dropout=0.4, num_head=4, num_blocks=4) -> None:
        super().__init__(head_size, embed_dim, block_size, dropout)
        self.attention_head = nn.Sequential(
            *[Block(embed_dim, num_head, head_size, dropout, block_size) for _ in range(num_blocks)],
            nn.LayerNorm(embed_dim)
        )

In [22]:
model = LanguageModelTransformer()
m = model.to(device)
print(sum(p.numel() for p in m.parameters()) / 1e3, 'K parameters')
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)
train_model(model=m, optimizer=optimizer)
print(f'Trained model validation loss: {estimate_loss(m)[1]:.4f}')

104.129 K parameters


Step 1000: train loss 2.4669, validation loss 2.8703
Step 2000: train loss 2.5872, validation loss 2.0732
Step 3000: train loss 2.7851, validation loss 2.2142
Step 4000: train loss 2.3523, validation loss 2.4505
Step 5000: train loss 2.4850, validation loss 1.7209
Step 6000: train loss 2.3785, validation loss 2.4120
Step 7000: train loss 2.7865, validation loss 2.5848
Step 8000: train loss 2.3630, validation loss 2.3989
Step 9000: train loss 2.2790, validation loss 2.0910
Step 10000: train loss 2.0408, validation loss 2.3849
Trained model validation loss: 2.6399


In [23]:
idx = torch.tensor(encode('Thou'), dtype=torch.long).unsqueeze(dim=0).to(device)
print(decode(m.generate(idx, 100)[0].tolist()))

Thou not o I,
Drondy Mfut IN vioverr, latt, chagutere, is foriir!
3
FAIS:
Is!
In; le to not hacho Of by 


In [25]:
head_size = 512
embed_dim = 256
block_size = 512
dropout = 0.1
num_head = 12
num_blocks = 12
batch_size = 8
eval_iters = 300
train_iters = 10000


model = LanguageModelTransformer(head_size, embed_dim, block_size, dropout, num_head, num_blocks)
m = model.to(device)
print(sum(p.numel() for p in m.parameters()) / 1e6, 'M parameters')
# optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)
# train_model(model, optimizer, block_size, batch_size, train_iters)
# print(f'Trained model validation loss: {estimate_loss(m)[1]:.4f}')

63.110209 M parameters


#### ChatGPT
We have so far trained a GPT (Generalized Pretrained Transformer). This is the first out of three steps executed by OpenAI to train ChatGPT.