## Exercise: Language Model with Self-Attention and Causal Masks

We continue along the same line as before: training a language model on the text of the book "O Guarani", by José de Alencar.

In this exercise, we train a language model with self-attention and a causal mask. The causal mask prevents the model from accessing future tokens when predicting the next one; this is the approach used by large language models such as GPT.
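To make the effect of the mask concrete, here is a minimal plain-Python sketch (no PyTorch, made-up scores). The helper name `causal_softmax` and the toy score matrix are assumptions for illustration only; the notebook itself applies the same idea with `masked_fill` and `F.softmax`.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions (j > i) get a score of -inf, so exp(-inf) == 0.0
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        row_max = max(masked)  # always finite: the diagonal is never masked
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy 3x3 score matrix (illustrative values only)
scores = [[0.0, 1.0, 2.0],
          [1.0, 1.0, 5.0],
          [0.5, 0.5, 0.5]]
weights = causal_softmax(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row still sums to 1, but all the weight sits on the current and earlier positions, however large the score of a future position is.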

Use the matrix implementation of self-attention from the previous class.

### Required modifications

* Add the causal mask to the `forward` function of the self-attention head.
* Modify our dataloader to return inputs (a list of $n$ tokens) and targets (a list of $n$ tokens, shifted left by one token). For example, for the sequence `[1, 2, 3, 4, 5]` with `seq_len=4`: `input = [1, 2, 3, 4]`, `target = [2, 3, 4, 5]` (see slide 50).
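The input/target shift can be sketched in plain Python as follows. The function name `make_windows` is hypothetical and lists stand in for tensors; it is only meant to show how each target is its input moved one token ahead.

```python
def make_windows(tokens, seq_len):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for i in range(len(tokens) - seq_len):
        inp = tokens[i:i + seq_len]
        tgt = tokens[i + 1:i + seq_len + 1]
        pairs.append((inp, tgt))
    return pairs

pairs = make_windows([1, 2, 3, 4, 5, 6], seq_len=4)
for inp, tgt in pairs:
    print(inp, "->", tgt)
```

With a causal mask, position `k` of the input is trained to predict position `k` of the target, so a single window yields `seq_len` training signals.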

### Extra
* MultiHeadAttention: extend the self-attention head to multiple heads. This is optional, but it can be interesting to see how the model behaves.
* Generation diagram: draw a diagram showing the steps of token generation (as in slide 47).
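As a starting point for the generation diagram, the steps of autoregressive generation can be traced with a small plain-Python sketch. Everything here is an assumption for illustration: `toy_next_token` is a stand-in for "run the model and sample", and the token values are arbitrary.

```python
def generate(tokens, next_token_fn, max_new_tokens, context_size):
    """Autoregressive loop: crop the context, predict one token, append, repeat."""
    tokens = list(tokens)
    steps = []
    for _ in range(max_new_tokens):
        context = tokens[-context_size:]    # keep only the most recent tokens
        new_token = next_token_fn(context)  # stand-in for model forward + sampling
        steps.append((context, new_token))
        tokens.append(new_token)
    return tokens, steps

def toy_next_token(context):
    # Hypothetical "model": always predicts the last token plus one
    return context[-1] + 1

tokens, steps = generate([10, 11], toy_next_token, max_new_tokens=3, context_size=3)
for context, new_token in steps:
    print(f"context={context} -> next={new_token}")
```

Once the sequence is longer than `context_size`, the oldest tokens fall out of the window, which is the cropping step a diagram should make visible.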

### Tips

* Use Karpathy's video as a basis: https://www.youtube.com/watch?v=kCc8FmEb1nY. Note that in the video he first implements a bigram model and then a language model with self-attention. The self-attention model is implemented around minute 40, but the whole video is worth watching.
* Use this implementation as a basis: https://colab.research.google.com/drive/1vFTg4MSXVJwNSzPjaCcvmqhxTP7gK7HA?usp=sharing. Note how the model is organized and how the mask is implemented in the MultiHeadAttention class.
* Use `context_size=9`

### Student: Pedro Rodrigues Corrêa

In [None]:
from torch.utils.data import DataLoader, Dataset
import torch.nn as nn
import torch
import torch.optim as optim
import torch.nn.functional as F
import numpy as np

# Parameters

In [None]:
context_size = 9
embedding_dim = 128
batch_size = 16
dropout = 0.1
lr = 0.001
n_heads = 8
n_layer = 4

## Download and load the dataset

In [None]:
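# Read each volume with an explicit encoding (UTF-8 is assumed here) and close the file afterwards
with open("pg67724.txt", "r", encoding="utf-8") as f:
    text1 = f.read()
text1_pt = text1.split("\n\n")[35:2675]

with open("pg67725.txt", "r", encoding="utf-8") as f:
    text2 = f.read()
text2_pt = text2.split("\n\n")[32:2191]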

paragraphs = text1_pt + text2_pt

cleaned_paragraphs = [paragraph.replace("\n", " ") for paragraph in paragraphs if paragraph.strip()]

len(paragraphs), len(cleaned_paragraphs)

(4799, 4740)

## Dataset analysis

In [None]:
# Count the words in the dataset
from collections import Counter
import re

def count_words(texts):
    word_counts = Counter()
    for text in texts:
        word_counts.update(re.findall(r'\w+', text.lower()))
    return word_counts

word_counts = count_words(cleaned_paragraphs)

len(word_counts)

11837

## Building a vocabulary

In [None]:
vocab_size = 10000

UNK = '<UNK>'
most_frequent_words = [UNK] + [word for word, count in word_counts.most_common(vocab_size)]
vocab = {word: i for i, word in enumerate(most_frequent_words, 0)}
vocab_size = len(vocab)

In [None]:
import re

# Encode a sentence into a list of indices based on a vocabulary
def encode_sentence(sentence, vocab):
    # Tokenize the sentence into words and punctuation marks
    tokens = re.findall(r'\w+|[.,!?-]', sentence.lower())
    # Map each token to its index; tokens missing from the vocabulary map to 0 (<UNK>).
    # The vocabulary above is built from \w+ matches only, so punctuation also maps to 0.
    encoded_sentence = [vocab.get(word, 0) for word in tokens]
    return encoded_sentence

# Decode a list of indices back into a list of words using a vocabulary
def decode_sentence(encoded_sentence, vocab):
    # Invert the vocabulary once instead of scanning it for every index
    index_to_word = {code: word for word, code in vocab.items()}
    # Indices missing from the vocabulary are decoded as "<UNK>"
    return [index_to_word.get(index, "<UNK>") for index in encoded_sentence]
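The following self-contained sketch reproduces the vocabulary and encoding logic on a toy corpus to show which tokens end up as `<UNK>`. The corpus, `toy_vocab`, and the `encode` helper are made up for illustration; only the two regular expressions are taken from the cells above.

```python
import re
from collections import Counter

UNK = '<UNK>'
corpus = ["o sol, o mar.", "o sol!"]

# Count words with the \w+ pattern used to build the vocabulary
counts = Counter()
for text in corpus:
    counts.update(re.findall(r'\w+', text.lower()))

# Index 0 is reserved for <UNK>; the remaining words are ordered by frequency
words = [UNK] + [w for w, _ in counts.most_common(10)]
toy_vocab = {word: i for i, word in enumerate(words)}

def encode(sentence, vocab):
    # Same tokenization as encode_sentence: words plus a few punctuation marks
    tokens = re.findall(r'\w+|[.,!?-]', sentence.lower())
    return tokens, [vocab.get(tok, 0) for tok in tokens]

tokens, ids = encode("O sol e o mar?", toy_vocab)
print(toy_vocab)
print(list(zip(tokens, ids)))
```

Both the unseen word `e` and the punctuation mark `?` are encoded as 0, because punctuation is never counted when the vocabulary is built.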

## Dataset class

In [None]:
def gera_input_target(text, context_size):
    # Lists of input windows (contexts) and their shifted targets
    contexts = []
    targets = []

    for paragraph in text:
        text_encoded = encode_sentence(paragraph, vocab)
        # Slide a window of size 'context_size' over the encoded paragraph
        for i in range(len(text_encoded) - context_size):
            # Input: 'context_size' tokens starting at index 'i'
            context = text_encoded[i: i + context_size]
            # Target: the same window shifted one token ahead (next-token labels)
            target = text_encoded[i + 1: i + context_size + 1]
            contexts.append(context)
            targets.append(target)

    # Stack into one tensor of shape (2, num_samples, context_size): [contexts, targets]
    return torch.stack((torch.tensor(contexts), torch.tensor(targets)))

In [None]:
class MyDataset(Dataset):
    def __init__(self, contexts, targets):
        # Initialize the dataset with contexts and targets
        self.contexts = contexts
        self.targets = targets

    def __len__(self):
        # Return the length of the dataset (number of samples)
        return len(self.contexts)

    def __getitem__(self, idx):
        # Get a sample at the given index
        return self.contexts[idx], self.targets[idx]

In [None]:
from sklearn.model_selection import train_test_split

contexts, targets = gera_input_target(cleaned_paragraphs, context_size)
X_train, X_test, y_train, y_test = train_test_split(contexts, targets, test_size=0.2, random_state=18)



In [None]:
# Build the training and test datasets
train_dataset = MyDataset(X_train, y_train)
test_dataset = MyDataset(X_test, y_test)

In [None]:
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
sample = next(iter(train_loader))

## Model

In [None]:
# Use the GPU if one is available; otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, context_size, embedding_dim):
        super().__init__()
        # Precompute the sinusoidal positional encoding matrix
        pe = torch.zeros(context_size, embedding_dim)
        position = torch.arange(0, context_size, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embedding_dim, 2).float() * (-np.log(10000.0) / embedding_dim))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Register as a buffer (with a batch dimension) so it follows the module across devices
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # x is (batch, seq_len, embedding_dim) or (seq_len, embedding_dim)
        seq_len = x.size(-2)
        # Add the encodings for the first seq_len positions to the input embeddings
        return x + self.pe[:, :seq_len, :].to(x.device)
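For reference, the `div_term` expression in the class above corresponds to the usual sinusoidal encoding, written here with $d$ for `embedding_dim` and $pos$ for the position index (the symbols are chosen for this note, not taken from the code):

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)
$$

The code computes $10000^{-2i/d}$ as $\exp\!\left(-2i \cdot \ln(10000)/d\right)$, which is the same quantity.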

In [None]:
class Head(nn.Module):
    def __init__(self, context_size, embedding_dim, head_size):
        super().__init__()
        # Linear projections for queries, keys, and values
        self.linearWQ = nn.Linear(embedding_dim, head_size, bias=False)
        self.linearWK = nn.Linear(embedding_dim, head_size, bias=False)
        self.linearWV = nn.Linear(embedding_dim, head_size, bias=False)
        # Lower-triangular matrix used as the causal mask (a buffer, not a parameter)
        self.register_buffer('tril', torch.tril(torch.ones(context_size, context_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape      # batch, seq_len, embedding_dim
        k = self.linearWK(x)   # (B, T, head_size)
        q = self.linearWQ(x)   # (B, T, head_size)
        v = self.linearWV(x)   # (B, T, head_size)

        # Attention scores ("affinities"), scaled by the square root of the key dimension
        scores = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)

        # Causal mask: position i must not attend to positions j > i
        scores = scores.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # (B, T, T)

        probs = F.softmax(scores, dim=-1)  # (B, T, T)
        probs = self.dropout(probs)

        # Weighted aggregation of the values
        out = probs @ v  # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out
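In equation form, the head above follows standard scaled dot-product attention with an additive causal mask. The notation ($Q$, $K$, $V$, $M$, $d_k$) is introduced for this note, and dropout on the attention weights is left out:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V,
\qquad
M_{ij} =
\begin{cases}
0 & \text{if } j \le i \\
-\infty & \text{if } j > i
\end{cases}
$$

Here $d_k$ is the key dimension. In the single-head `MiniGPT` defined later, `head_size` equals `embedding_dim`, so $d_k$ is 128.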

In [None]:
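class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        # Head expects (context_size, embedding_dim, head_size); the first two are notebook globals
        self.heads = nn.ModuleList([Head(context_size, embedding_dim, head_size) for _ in range(num_heads)])
        self.linearWO = nn.Linear(head_size * num_heads, embedding_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Concatenate the heads along the feature dimension, then project back to embedding_dim
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.linearWO(out))
        return out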
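As a compact summary of the class above, with $h$ heads and $W^{O}$ standing for `linearWO` (the symbols are chosen for this note):

$$
\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O},
\qquad
\mathrm{head}_i \in \mathbb{R}^{T \times d_{\text{head}}}
$$

With the parameters set earlier (`embedding_dim = 128`, `n_heads = 8`), each head has $d_{\text{head}} = 128 / 8 = 16$ features, and the concatenation restores $8 \times 16 = 128$ features before the output projection.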

In [None]:
class FeedFoward(nn.Module):
    def __init__(self, embedding_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, 4 * embedding_dim),
            nn.ReLU(),
            nn.Linear(4 * embedding_dim, embedding_dim),
            nn.Dropout(dropout)

        )

    def forward(self, x):
        return self.net(x)

In [None]:
class Block(nn.Module):
    def __init__(self, embedding_dim, n_head):
        # embedding_dim: embedding dimension, n_head: number of attention heads
        super().__init__()
        head_size = embedding_dim // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(embedding_dim)
        self.ln1 = nn.LayerNorm(embedding_dim)
        self.ln2 = nn.LayerNorm(embedding_dim)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

In [None]:
class Microtransformer(nn.Module):

    def __init__(self):
        super().__init__()
        # token embedding table: maps each token index to a dense vector
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.positional_embedding = PositionalEncoding(context_size, embedding_dim)
        self.blocks = nn.Sequential(*[Block(embedding_dim, n_head=n_heads) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(embedding_dim) # final layer norm
        self.lm_head = nn.Linear(embedding_dim, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.embedding(idx) # (B,T,C)
        # PositionalEncoding.forward already returns its input plus the encodings
        x = self.positional_embedding(tok_emb) # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        #print(x)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last context_size tokens
            idx_cond = idx[:, -context_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


In [None]:
class Head_ramon(nn.Module):
    """ one head of self-attention """

    def __init__(self, context_size, embedding_dim, head_size):
        super().__init__()
        self.key   = nn.Linear(embedding_dim, head_size, bias=False)
        self.query = nn.Linear(embedding_dim, head_size, bias=False)
        self.value = nn.Linear(embedding_dim, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(context_size, context_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):

        B,T,C = x.shape       # (B) batch, (T) context_size, (C) embedding_dim
        k     = self.key(x)   # (B,T,head_size)
        q     = self.query(x) # (B,T,head_size)
        v     = self.value(x) # (B,T,head_size)

        # compute the attention scores ("affinities"), scaled by the key dimension
        scores = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5   # (B, T, hs) @ (B, hs, T) -> (B, T, T)

        # apply the causal mask
        scores = scores.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)

        probs  = F.softmax(scores, dim=-1) # (B, T, T)

        probs  = self.dropout(probs)

        # perform the weighted aggregation of the values
        out = probs @ v       # (B, T, T) @ (B, T, C) -> (B, T, C)

        return out

In [None]:
class MiniGPT(torch.nn.Module):

    def __init__(self, vocab_size, context_size, embedding_dim):
        super(MiniGPT, self).__init__()

        # Embedding layer to convert token indices to dense vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # Learned positional embedding: one vector per position in the context
        self.pos_encoding = nn.Embedding(context_size, embedding_dim)

        # Self-attention layer
        self.self_attention_layer = Head(context_size, embedding_dim, embedding_dim)

        # Linear layer for weighted sum after self-attention
        self.linearWO = nn.Linear(embedding_dim, embedding_dim)

        # First fully connected layer for the feedforward network
        self.fc1 = nn.Linear(embedding_dim, 4 * embedding_dim)

        # ReLU activation function
        self.relu = nn.ReLU()

        # Second fully connected layer for the feedforward network
        self.fc2 = nn.Linear(4 * embedding_dim, embedding_dim)

        # Layer normalization for the self-attention and feedforward network outputs
        self.ln_1 = nn.LayerNorm(embedding_dim)
        self.ln_2 = nn.LayerNorm(embedding_dim)

        # Layer normalization for the final output
        self.ln_f = nn.LayerNorm(embedding_dim)

        # Linear layer for outputting logits
        self.lm_head = nn.Linear(embedding_dim, vocab_size)

        # Dropout layer for regularization
        self.dropout = nn.Dropout(dropout) # 'dropout' is the global rate set in the parameters cell above


    def forward(self, input):
        B, T = input.shape

        # Embedding layer
        embedding = self.embedding(input)

        # Positional embedding for positions 0..T-1, created on the same device as the input
        pos_embedding = self.pos_encoding(torch.arange(T, device=input.device))
        x = embedding + pos_embedding

        # Self-attention block
        x_sa = self.ln_1(x)
        x_sa = self.self_attention_layer(x_sa)
        x_sa = self.linearWO(x_sa)
        x_sa = self.dropout(x_sa)
        x = x + x_sa

        # Feedforward block
        x_ffwd = self.ln_2(x)
        x_ffwd = self.fc1(x_ffwd)
        x_ffwd = self.relu(x_ffwd)
        x_ffwd = self.fc2(x_ffwd)
        x_ffwd = self.dropout(x_ffwd)
        x = x + x_ffwd

        # Final layer normalization
        x = self.ln_f(x)

        # Output logits
        logits = self.lm_head(x)

        return logits

    def generate(self, idx, max_new_tokens=10):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):

            # Get the last context_size tokens for the next prediction
            idx_cond = idx[:, -context_size:]

            # Predict logits for next token
            logits = self(idx_cond)   # (B, context_size, vocab_size)

            # Focus only on the last time step
            logits = logits[:, -1, :] # (B, vocab_size)

            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, vocab_size)

            # Exclude the <unk> token (encoded as 0) by assigning zero probability
            probs[:, 0] = 0.0

            # Normalize probabilities to ensure sum equals 1
            probs = probs / probs.sum(dim=-1, keepdim=True)

            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)

            # Append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


In [None]:
model = MiniGPT(vocab_size, context_size, embedding_dim)
model

MiniGPT(
  (embedding): Embedding(10001, 128)
  (pos_encoding): Embedding(9, 128)
  (self_attention_layer): Head(
    (linearWQ): Linear(in_features=128, out_features=128, bias=False)
    (linearWK): Linear(in_features=128, out_features=128, bias=False)
    (linearWV): Linear(in_features=128, out_features=128, bias=False)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (linearWO): Linear(in_features=128, out_features=128, bias=True)
  (fc1): Linear(in_features=128, out_features=512, bias=True)
  (relu): ReLU()
  (fc2): Linear(in_features=512, out_features=128, bias=True)
  (ln_1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
  (ln_2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
  (ln_f): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
  (lm_head): Linear(in_features=128, out_features=10001, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

In [None]:
sample = next(iter(train_loader))
input = sample[0]
target = sample[1]

In [None]:
def get_num_params(model):
    n_params = sum(p.numel() for p in model.parameters())
    return n_params

print(get_num_params(model))

2769553
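The total above can be checked by hand from the layer shapes in the printed model. The sketch below is plain arithmetic; the dictionary keys are labels chosen for this note, and the `tril` mask is left out because it is a buffer rather than a parameter.

```python
vocab_size, context_size, d = 10001, 9, 128

param_counts = {
    "embedding":    vocab_size * d,               # token embedding table
    "pos_encoding": context_size * d,             # learned positional embedding
    "attention":    3 * d * d,                    # W_Q, W_K, W_V (no bias)
    "linearWO":     d * d + d,                    # weights + bias
    "fc1":          d * (4 * d) + 4 * d,          # weights + bias
    "fc2":          (4 * d) * d + d,              # weights + bias
    "layer_norms":  3 * 2 * d,                    # ln_1, ln_2, ln_f: scale and shift
    "lm_head":      d * vocab_size + vocab_size,  # weights + bias
}
total = sum(param_counts.values())

for name, count in param_counts.items():
    print(f"{name:>12}: {count:>9,}")
print(f"{'total':>12}: {total:>9,}")
```

The token embedding and the output layer `lm_head` together hold about 2.57 million of the 2.77 million parameters, so the vocabulary size dominates the model size.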


In [None]:
model.to(device)
output = model(input.to(device))

In [None]:
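# Note: dim=1 is the sequence axis of the (B, T, vocab_size) logits, so the values below are
# positions (0 to context_size-1); use dim=-1 to obtain the predicted token ids instead.
output.argmax(dim=1)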

tensor([[6, 6, 7,  ..., 8, 2, 3],
        [3, 2, 7,  ..., 8, 2, 5],
        [6, 2, 1,  ..., 8, 2, 8],
        ...,
        [4, 2, 2,  ..., 3, 3, 8],
        [2, 2, 2,  ..., 8, 3, 4],
        [8, 2, 2,  ..., 3, 2, 5]], device='cuda:0')

In [None]:
target

tensor([[   6,   56,  233,   50,    0,    5,    6, 5931,   18],
        [6321,    0,    5, 3184,    1,  784,    8,  144,  354],
        [ 786,    0,    3,    2, 1070,    4,  102,   10, 1127],
        [   2,  776,   32,    3,  562,  629, 1297,    0,    5],
        [ 394,  140,  314,    0,    1, 3003,   11,  296,   32],
        [ 163,   31,    3,   16,    0, 2380,    4,    0,   89],
        [  55,    5,    3,   16,  163,   21,   42,    0,  710],
        [   2,   41,    7,   75, 1827,    7,  353,   22,    3],
        [3831,    0,   41,    3, 1235,    2,   82,  323,    1],
        [   3, 1340,    9, 4174,    4,    7, 2567,   18,  402],
        [  24,  393, 3226,  167,  135,    0, 1248,   21,   34],
        [ 789,    0,    7,  355, 4277,    4, 1006,    4, 1844],
        [   6,   20,   16,  170,    0,    3,  141,  512,    1],
        [   6,   22,    3,   69,    6,   27, 3659, 3619,    0],
        [   0,    6,  161,   61,  496,  386,    0,    1,   50],
        [   9, 4769, 4541,   30,  718,  

## Training

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=lr)
model.to(device)

# Average loss over the training set before any training
model.eval()  # Set the model to evaluation mode
initial_loss = 0
with torch.no_grad():
    for inputs, targets in train_loader:
        inputs = inputs.to(device)
        targets = targets.to(device)

        # Forward pass
        logits = model(inputs)  # (B, T, vocab_size)

        # Flatten so that every position is one classification example
        B, T, C = logits.shape
        logits  = logits.view(B * T, C)
        targets = targets.view(B * T)

        initial_loss += criterion(logits, targets)

    avg_loss = initial_loss / len(train_loader)

initial_PPL = torch.exp(avg_loss)
print(f'Initial Loss: {avg_loss:.4f}, Initial Perplexity: {initial_PPL}')

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for inputs, targets in train_loader:
        inputs = inputs.to(device)
        targets = targets.to(device)

        # Forward pass
        logits = model(inputs)

        B, T, C = logits.shape
        logits  = logits.view(B * T, C)
        targets = targets.view(B * T)

        loss_train = criterion(logits, targets)

        # Backward and optimize
        optimizer.zero_grad()
        loss_train.backward()
        optimizer.step()

        running_loss += loss_train.item()

    # Mean training loss over the epoch and the corresponding perplexity
    train_loss = running_loss / len(train_loader)
    ppl_train = np.exp(train_loss)

    # Evaluate on the whole test set, accumulating the loss over all batches
    model.eval()
    running_test_loss = 0.0
    with torch.no_grad():
        for inputs, targets in test_loader:
            inputs = inputs.to(device)
            targets = targets.to(device)
            logits = model(inputs)

            B, T, C = logits.shape
            logits  = logits.view(B * T, C)
            targets = targets.view(B * T)

            running_test_loss += criterion(logits, targets).item()

    test_loss = running_test_loss / len(test_loader)
    ppl_test = np.exp(test_loss)

    print(f'Epoch [{epoch+1}/{num_epochs}], '
          f'Train Loss: {train_loss:.4f}, '
          f'Train PPL: {ppl_train:.4f}, '
          f'Test Loss: {test_loss:.4f}, '
          f'Test PPL: {ppl_test:.4f}')

Initial Loss: 9.3449, Initial Perplexity: 11439.98046875
Epoch [1/10], Train Loss: 4.2310, Train PPL: 68.7869, Test Loss: 3.1590, Test PPL: 23.5476
Epoch [2/10], Train Loss: 3.6684, Train PPL: 39.1897, Test Loss: 2.6677, Test PPL: 14.4063
Epoch [3/10], Train Loss: 2.6844, Train PPL: 14.6489, Test Loss: 2.4348, Test PPL: 11.4133
Epoch [4/10], Train Loss: 3.0766, Train PPL: 21.6852, Test Loss: 2.0025, Test PPL: 7.4076
Epoch [5/10], Train Loss: 2.8853, Train PPL: 17.9089, Test Loss: 2.0793, Test PPL: 7.9987
Epoch [6/10], Train Loss: 2.3482, Train PPL: 10.4665, Test Loss: 2.0459, Test PPL: 7.7364
Epoch [7/10],      
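The relation between the reported loss and perplexity can be checked with a few lines of plain Python. This is a small sketch using only numbers that appear in this notebook (the vocabulary size of 10001 and the first epoch's test loss of 3.1590); the function name `perplexity` is chosen for this note.

```python
import math

vocab_size = 10001

# Cross-entropy of a model that spreads probability uniformly over the vocabulary
uniform_loss = math.log(vocab_size)

def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(loss)

print(f"uniform baseline loss: {uniform_loss:.4f}")
print(f"perplexity at loss 3.1590: {perplexity(3.1590):.2f}")
```

The initial loss of 9.3449 is close to the uniform baseline $\ln(10001) \approx 9.21$, which is what one would expect from an untrained model, and exponentiating the first epoch's test loss recovers the reported perplexity of about 23.5.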

## Usage example

In [None]:
prompt = "não"
idx = torch.tensor(encode_sentence(prompt, vocab), dtype=torch.long).unsqueeze(0)
idx = idx.to(device)

# Generate in evaluation mode (dropout off) and without tracking gradients
model.eval()
with torch.no_grad():
    idx = model.generate(idx, 25)
output = decode_sentence(idx[0].tolist(), vocab)

for word in output:
    print(word)

não
truão
da
vossa
laia
és
pery
do
vencedor
da
guerra
não
contra
essa
esperança
que
poder
repellia
a
embriaguez
do
prazer
como
a
espada
conteve


# References

Ramon Simões Abilio