## Exercise: Language model with self-attention and causal masks

We continue along the same line as before, training a language model on the text of the book "O Guarani", by José de Alencar.

In this exercise, we train a language model with self-attention and a causal mask. The causal mask keeps each position from attending to future tokens, which is the approach used by large language models such as GPT.

Use the matrix implementation of self-attention from the previous class.
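
As a minimal sketch of what the causal mask does (assuming PyTorch; the names `scores`, `mask` and `weights` are illustrative and not part of the exercise code), entries above the diagonal are set to `-inf` before the softmax, so each row spreads its attention only over itself and earlier tokens:

```python
import torch
import torch.nn.functional as F

T = 4  # illustrative sequence length
torch.manual_seed(0)
scores = torch.randn(T, T)  # stand-in for the scaled q @ k^T attention scores

# Lower-triangular matrix: row t has ones for positions 0..t and zeros for the future
mask = torch.tril(torch.ones(T, T))

# Future positions get -inf, which the softmax turns into zero weight
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)

print(weights)
```

Each row of `weights` still sums to 1, and the first row puts all of its weight on the first token because nothing else is visible to it.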

### Required modifications

* Add the causal mask to the `forward` function of the self-attention head.
* Modify our dataloader to return inputs (a list of $n$ tokens) and targets (a list of $n$ tokens, shifted left by 1 token). For example, for the sequence `[1, 2, 3, 4, 5]` with `seq_len=4`: `input = [1, 2, 3, 4]`, `target = [2, 3, 4, 5]` (see slide 50).
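
The input/target shift from the second item can be sketched in plain Python. The helper name `make_pairs` is hypothetical and only illustrates the windowing; it is not the notebook's dataloader:

```python
def make_pairs(tokens, seq_len):
    """Return (input, target) windows; each target is its input shifted by one token."""
    pairs = []
    # The last valid start index leaves room for the extra target token
    for i in range(len(tokens) - seq_len):
        inp = tokens[i : i + seq_len]
        tgt = tokens[i + 1 : i + seq_len + 1]
        pairs.append((inp, tgt))
    return pairs


pairs = make_pairs([1, 2, 3, 4, 5, 6], seq_len=4)
for inp, tgt in pairs:
    print(inp, "->", tgt)
```

With a causal mask, position `t` of the input only sees tokens up to `t`, so every one of the `seq_len` positions in a window contributes a next-token prediction to the loss.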

### Extra
* MultiHeadAttention: modify the self-attention head to use multiple heads. This is optional, but it can be interesting to see how the model behaves.
* Generation diagram: draw a diagram showing the steps of token generation (as in slide 47).
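
One possible way to collect the steps for such a diagram is to record, at every generation step, the context the model sees and the token it produces. The sketch below is only an assumption about what the diagram should show; `trace_generation` and the toy `next_token_fn` are hypothetical stand-ins for sampling from the trained model:

```python
def trace_generation(prompt_ids, next_token_fn, context_size, steps):
    """Return one (context, new_token) record per generation step."""
    ids = list(prompt_ids)
    trace = []
    for _ in range(steps):
        context = ids[-context_size:]       # the model only sees the last context_size tokens
        new_token = next_token_fn(context)  # stand-in for sampling from the model
        trace.append((context, new_token))
        ids.append(new_token)               # the new token becomes part of the next context
    return trace


# Hypothetical "model": always predicts the last token id plus one
trace = trace_generation([1, 2, 3], lambda ctx: ctx[-1] + 1, context_size=3, steps=3)
for i, (context, new_token) in enumerate(trace, start=1):
    print(f"step {i}: context={context} -> new token={new_token}")
```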

### Tips

* Use Karpathy's video as a base: https://www.youtube.com/watch?v=kCc8FmEb1nY. Note that in the video he first implements a bigram model and then a language model with self-attention. The self-attention model is implemented around minute 40, but the whole video is worth watching.
* Use this implementation as a base: https://colab.research.google.com/drive/1vFTg4MSXVJwNSzPjaCcvmqhxTP7gK7HA?usp=sharing. Note how the model is organized and how the mask is implemented in the MultiHeadAttention class.
* Use `context_size=9`

Student: Matheus Rodrigues de Souza Félix \
mrsf@cin.ufpe.br / matheusrdgsf@gmail.com

## Download and load the dataset

In [None]:
import os

if not os.path.exists("data"):
    os.mkdir("data")
    !wget https://www.gutenberg.org/ebooks/67724.txt.utf-8 -P data/
    !wget https://www.gutenberg.org/ebooks/67725.txt.utf-8 -P data/

In [None]:
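def read_text(path):
    # Read with an explicit encoding and close the file when done
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


text = read_text("data/67724.txt.utf-8") + read_text("data/67725.txt.utf-8")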

paragraphs = text.split("\n\n")

cleaned_paragraphs = [
    paragraph.replace("\n", " ") for paragraph in paragraphs if paragraph.strip()
]

len(cleaned_paragraphs)

4892

In [None]:
context_size = 9
embedding_dim = 64
batch_size = 1024
vocab_size = 10000
debug = False
dropout = 0.2
n_head = 4
n_layer = 4
pattern_re = r"\w+|[,;.:!?\']"

## Dataset analysis

In [None]:
import re
from collections import Counter

regular_expression = re.compile(pattern_re)


def count_words(texts):
    word_counts = Counter()
    for text in texts:
        word_counts.update(re.findall(regular_expression, text.lower()))
    return word_counts


word_counts = count_words(cleaned_paragraphs)

print(len(word_counts))
print(word_counts)

12610
Counter({'.': 8870, ',': 7693, 'a': 4595, 'que': 4340, 'o': 4079, 'de': 3960, 'e': 3658, 'se': 2402, ';': 2357, 'um': 1711, 'do': 1442, 'não': 1280, 'uma': 1250, 'da': 1133, 'os': 1123, 'com': 1015, 'sua': 925, 'para': 857, 'seu': 777, '!': 773, 'pery': 732, 'as': 726, 'em': 724, 'no': 664, '?': 628, 'por': 622, 'ao': 594, 'como': 594, 'lhe': 558, 'd': 493, 'á': 490, 'tinha': 478, 'era': 469, ':': 468, 'cecilia': 457, 'na': 455, 'é': 441, 'sobre': 416, 'mas': 410, 'elle': 407, 'the': 376, 'dos': 373, 'indio': 340, 'me': 325, 'seus': 324, 'mais': 318, 'antonio': 303, 'quando': 288, 'alvaro': 278, 'disse': 259, 'das': 258, 'vos': 254, 'of': 252, 'ella': 233, 'olhos': 227, 'te': 227, 'senhora': 227, 'menina': 215, 'pela': 213, 'tu': 204, "'": 203, 'depois': 200, 'nos': 200, 'isabel': 197, 'havia': 195, 'gutenberg': 194, 'fidalgo': 194, 'casa': 192, 'estava': 187, 'ainda': 186, 'tempo': 182, 'já': 181, 'mariz': 180, 'project': 176, 'aventureiros': 175, 'momento': 174, 'loredano': 174

## Building a vocabulary

In [None]:
most_frequent_words = ["<unk>"] + [
    word for word, count in word_counts.most_common(vocab_size)
]
vocab = {word: i for i, word in enumerate(most_frequent_words)}
vocab_size += 1

In [None]:
print(vocab)

{'<unk>': 0, '.': 1, ',': 2, 'a': 3, 'que': 4, 'o': 5, 'de': 6, 'e': 7, 'se': 8, ';': 9, 'um': 10, 'do': 11, 'não': 12, 'uma': 13, 'da': 14, 'os': 15, 'com': 16, 'sua': 17, 'para': 18, 'seu': 19, '!': 20, 'pery': 21, 'as': 22, 'em': 23, 'no': 24, '?': 25, 'por': 26, 'ao': 27, 'como': 28, 'lhe': 29, 'd': 30, 'á': 31, 'tinha': 32, 'era': 33, ':': 34, 'cecilia': 35, 'na': 36, 'é': 37, 'sobre': 38, 'mas': 39, 'elle': 40, 'the': 41, 'dos': 42, 'indio': 43, 'me': 44, 'seus': 45, 'mais': 46, 'antonio': 47, 'quando': 48, 'alvaro': 49, 'disse': 50, 'das': 51, 'vos': 52, 'of': 53, 'ella': 54, 'olhos': 55, 'te': 56, 'senhora': 57, 'menina': 58, 'pela': 59, 'tu': 60, "'": 61, 'depois': 62, 'nos': 63, 'isabel': 64, 'havia': 65, 'gutenberg': 66, 'fidalgo': 67, 'casa': 68, 'estava': 69, 'ainda': 70, 'tempo': 71, 'já': 72, 'mariz': 73, 'project': 74, 'aventureiros': 75, 'momento': 76, 'loredano': 77, 'só': 78, 'mesmo': 79, 'italiano': 80, 'todos': 81, 'pelo': 82, 'vida': 83, 'sem': 84, 'dous': 85, 'to

In [None]:
def encode_sentence(sentence, vocab):
    return [vocab.get(word, 0) for word in re.findall(pattern_re, sentence.lower())]


def decode_sentence(sentence, most_frequent_words):
    return " ".join([most_frequent_words[c] for c in sentence])


text = cleaned_paragraphs[0]

code = encode_sentence(text, vocab)
decode = decode_sentence(code, most_frequent_words)

print(code)
print(decode)

[41, 74, 66, 544, 53, 5, 745, 34, 1066, 1067, 2, 1256, 1, 140, 53, 719, 151, 544, 331, 282, 41, 506, 53, 1257, 2658, 116, 41, 470, 382, 104, 1258, 420, 2659, 53, 41, 2660, 545, 24, 1943, 104, 148, 2661, 24, 2662, 2663, 1, 92, 471, 595, 472, 2, 1511, 472, 1944, 90, 2664, 506, 472, 1068, 41, 340, 53, 41, 74, 66, 402, 1945, 148, 151, 544, 90, 1512, 545, 746, 1, 66, 1, 747, 1, 341, 92, 332, 259, 936, 116, 41, 470, 382, 2, 92, 827, 937, 86, 1513, 41, 684, 53, 41, 1514, 1259, 92, 332, 936, 1515, 1069, 151, 544, 1]
the project gutenberg ebook of o guarany : romance brazileiro , vol . 1 of 2 this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and with almost no restrictions whatsoever . you may copy it , give it away or re use it under the terms of the project gutenberg license included with this ebook or online at www . gutenberg . org . if you are not located in the united states , you will have to check the laws of the country where
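
The decoded paragraph above is Project Gutenberg license text rather than the novel, which is also why English tokens such as `the`, `of`, `project` and `gutenberg` rank among the most frequent words. A possible cleanup is sketched below, under the assumption that each file wraps the book between `*** START OF ...` and `*** END OF ...` marker lines. The helper name `strip_gutenberg_boilerplate` is hypothetical, and the marker format should be checked against the downloaded files before relying on it:

```python
import re


def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the START and END marker lines, if both are found."""
    # Assumed marker format: "*** START OF THE PROJECT GUTENBERG EBOOK ... ***"
    start = re.search(r"\*\*\*\s*START OF.*?\*\*\*", raw_text, flags=re.DOTALL)
    end = re.search(r"\*\*\*\s*END OF.*?\*\*\*", raw_text, flags=re.DOTALL)
    if start is None or end is None or end.start() < start.end():
        return raw_text  # markers not found: leave the text untouched
    return raw_text[start.end() : end.start()].strip()


sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Primeiro paragrapho do romance.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text..."
)
print(strip_gutenberg_boilerplate(sample))
```

Applying such a filter to each file before splitting into paragraphs would change the vocabulary and every number reported in this notebook, so the recorded outputs correspond to the unfiltered text.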

## Dataset class

In [None]:
import torch
from torch.utils.data import DataLoader, Dataset
import random
from sklearn.model_selection import train_test_split

random.seed(18)
torch.manual_seed(18)


def create_dataset(paragraphs, context_size, include_unk=True):
    input_sequences = []
    target_sequences = []
    unk_id = vocab["<unk>"]  # encode_sentence maps out-of-vocabulary words to this id

    for paragraph in paragraphs:
        words = encode_sentence(paragraph, vocab)
        # range() is empty when len(words) <= context_size, so short paragraphs are skipped
        for i in range(len(words) - context_size):
            input_sequence = words[i : i + context_size]
            # Same window shifted by one token: target[t] is the token that follows input[t]
            target_sequence = words[i + 1 : i + context_size + 1]
            # Tokens are integer ids, so compare against unk_id rather than a string
            # (filtering idea credited to Ramon Simões)
            if include_unk or unk_id not in input_sequence + target_sequence:
                input_sequences.append(input_sequence)
                target_sequences.append(target_sequence)
    return torch.stack((torch.tensor(input_sequences), torch.tensor(target_sequences)))

In [None]:
class MyDataset(Dataset):
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

In [None]:
inputs, targets = create_dataset(cleaned_paragraphs, context_size)

x_train, x_validation, y_train, y_validation = train_test_split(
    inputs, targets, test_size=0.2, random_state=18
)

train_data = MyDataset(x_train, y_train)
valid_data = MyDataset(x_validation, y_validation)


train_loader = DataLoader(
    train_data, batch_size=batch_size, shuffle=True, drop_last=True
)
val_loader = DataLoader(
    valid_data, batch_size=batch_size, shuffle=False, drop_last=True
)

In [None]:
len(train_data), len(valid_data)

(79294, 19824)

## Model

In [None]:
import torch.nn as nn
import torch.nn.functional as F

### Head

In [None]:
class Head(nn.Module):
    def __init__(self, context_size, embedding_dim, head_size):
        super().__init__()
        self.key = nn.Linear(embedding_dim, head_size, bias=False)
        self.value = nn.Linear(embedding_dim, head_size, bias=False)
        self.query = nn.Linear(embedding_dim, head_size, bias=False)

        self.register_buffer(
            "tril", torch.tril(torch.ones(context_size, context_size))
        )  # Karpathy

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape

        k = self.key(x)  # (B, T, hs), where hs = head_size
        q = self.query(x)  # (B, T, hs)

        # Scaled dot-product scores: (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = q @ k.transpose(1, 2) / (k.shape[-1] ** 0.5)
        # Causal mask: positions above the diagonal (future tokens) get -inf
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # (B, T, T)
        wei = F.softmax(wei, dim=-1)  # (B, T, T)
        wei = self.dropout(wei)

        v = self.value(x)  # (B, T, hs)
        out = wei @ v  # (B, T, T) @ (B, T, hs) -> (B, T, hs)

        return out

### Multihead Attention

In [None]:
class MultiheadAttention(nn.Module):
    def __init__(self, context_size, embedding_dim, head_size, n_head):
        super().__init__()
        self.heads = nn.ModuleList(
            [
                Head(
                    context_size=context_size,
                    embedding_dim=embedding_dim,
                    head_size=head_size,
                )
                for _ in range(n_head)
            ]
        )
        self.wo = nn.Linear(
            embedding_dim, embedding_dim
        )  # nn.Linear(head_size * n_head, embedding_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([head(x) for head in self.heads], dim=-1)
        out = self.wo(out)  # (B, T, C)

        return out

### FeedForward

In [None]:
class FeedForward(nn.Module):
    def __init__(self, embedding_dim):
        super().__init__()
        self.linear_1 = nn.Linear(embedding_dim, 4 * embedding_dim)
        self.relu = nn.ReLU()
        self.linear_2 = nn.Linear(4 * embedding_dim, embedding_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.linear_1(x)
        x = self.relu(x)
        x = self.linear_2(x)
        x = self.dropout(x)
        return x

### Block

In [None]:
class Block(nn.Module):
    def __init__(self, context_size, embedding_dim, n_head):
        super().__init__()
        head_size = embedding_dim // n_head
        self.mha = MultiheadAttention(
            context_size=context_size,
            embedding_dim=embedding_dim,
            head_size=head_size,
            n_head=n_head,
        )
        self.ff = FeedForward(embedding_dim)
        self.ln_1 = nn.LayerNorm(embedding_dim)
        self.ln_2 = nn.LayerNorm(embedding_dim)

    def forward(self, x):
        x = x + self.mha(self.ln_1(x))
        x = x + self.ff(self.ln_2(x))
        return x

In [None]:
class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size, n_head, n_layer):
        super().__init__()

        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.context_size = context_size

        # Embedding
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.positional_embedding = nn.Embedding(context_size, embedding_dim)

        # Transformer
        self.blocks = nn.Sequential(
            *[
                Block(
                    context_size=context_size,
                    embedding_dim=embedding_dim,
                    n_head=n_head,
                )
                for _ in range(n_layer)
            ]
        )

        # Final Normalization
        self.ln = nn.LayerNorm(embedding_dim)

        # Language Model Head
        self.lm_head = nn.Linear(embedding_dim, vocab_size)

    def forward(self, x):
        B, T = x.shape
        x = self.embedding(x) + self.positional_embedding(
            torch.arange(T, device=x.device)
        )
        x = self.blocks(x)
        x = self.ln(x)
        x = self.lm_head(x)
        return x

In [None]:
device = (
    "cpu" if debug else (torch.device("cuda" if torch.cuda.is_available() else "cpu"))
)
model = LanguageModel(
    vocab_size,
    embedding_dim=embedding_dim,
    context_size=context_size,
    n_layer=n_layer,
    n_head=n_head,
).to(device)

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


print(f"The model has {count_parameters(model):,} trainable parameters")

The model has 1,490,001 trainable parameters
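
The count above can be reproduced by hand from the layer shapes. The sketch below is a closed-form version that assumes the architecture exactly as defined in the cells above; the function name `count_lm_parameters` is hypothetical:

```python
def count_lm_parameters(vocab_size, embedding_dim, context_size, n_head, n_layer):
    """Closed-form parameter count for the LanguageModel architecture above."""
    d = embedding_dim
    head_size = d // n_head

    embeddings = vocab_size * d + context_size * d      # token + positional tables
    heads = n_head * 3 * d * head_size                  # key, value, query (no bias)
    wo = d * d + d                                      # output projection with bias
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)
    layer_norms = 2 * 2 * d                             # two LayerNorms, weight + bias each
    block = heads + wo + feed_forward + layer_norms

    final_ln = 2 * d
    lm_head = d * vocab_size + vocab_size
    return embeddings + n_layer * block + final_ln + lm_head


print(count_lm_parameters(10001, 64, 9, n_head=4, n_layer=4))  # 1490001
print(count_lm_parameters(10001, 64, 9, n_head=1, n_layer=1))  # 1340625
```

Roughly 1.29M of the parameters sit in the token embedding table and the output layer, so the vocabulary size matters far more for the total than the number of heads or blocks.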


In [None]:
model

LanguageModel(
  (embedding): Embedding(10001, 64)
  (positional_embedding): Embedding(9, 64)
  (blocks): Sequential(
    (0): Block(
      (mha): MultiheadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head(
            (key): Linear(in_features=64, out_features=16, bias=False)
            (value): Linear(in_features=64, out_features=16, bias=False)
            (query): Linear(in_features=64, out_features=16, bias=False)
            (dropout): Dropout(p=0.2, inplace=False)
          )
        )
        (wo): Linear(in_features=64, out_features=64, bias=True)
        (dropout): Dropout(p=0.2, inplace=False)
      )
      (ff): FeedForward(
        (linear_1): Linear(in_features=64, out_features=256, bias=True)
        (relu): ReLU()
        (linear_2): Linear(in_features=256, out_features=64, bias=True)
        (dropout): Dropout(p=0.2, inplace=False)
      )
      (ln_1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
      (ln_2): LayerNorm((64,), eps=1e-05, eleme

### Test Inference and Loss Computing

In [None]:
_input, _target = next(iter(train_loader))
_input, _target = _input.to(device), _target.to(device)

In [None]:
output = model(_input)
output.shape

torch.Size([1024, 9, 10001])

In [None]:
criterion = nn.CrossEntropyLoss()

output = model(_input)
B, T, C = output.shape

# compute loss
criterion(output.view(B * T, C), _target.view(B * T))

tensor(9.3884, device='cuda:0', grad_fn=<NllLossBackward0>)
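
As a sanity check on the value above: a model that spreads probability uniformly over the vocabulary has cross-entropy $\ln(\text{vocab\_size})$, and perplexity is the exponential of the mean loss. The sketch below computes both for the 10001-token vocabulary; the untrained model's loss of 9.3884 is close to, and slightly above, this baseline:

```python
import math

vocab_size = 10001  # 10000 most frequent words + <unk>

# A model that assigns equal probability to every token has loss ln(vocab_size)
uniform_loss = math.log(vocab_size)
# Perplexity is the exponential of the mean cross-entropy loss
uniform_perplexity = math.exp(uniform_loss)

print(f"uniform loss: {uniform_loss:.4f}")              # 9.2104
print(f"uniform perplexity: {uniform_perplexity:.0f}")  # 10001
```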

### Initial Loss and PPL

In [None]:
epochs = 10
lr = 1e-3
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

In [None]:
def run_eval(model, val_loader):
    model.eval()

    total_loss = 0
    accuracy = 0
    samples = 0
    with torch.no_grad():
        for _input, target in val_loader:
            _input, target = _input.to(device), target.to(device)

            output = model(_input)
            B, T, C = output.shape

            output = output.view(B * T, C)
            target = target.view(B * T)

            loss = criterion(output, target)

            accuracy += (output.argmax(dim=1) == target).sum().item()
            samples += output.shape[0]
            total_loss += loss.item()

    mean_loss = total_loss / len(val_loader)
    perplexity = torch.exp(torch.tensor(mean_loss))
    print(
        f"Val Loss: {mean_loss}, Val Perplexity: {perplexity.item()}, Val Accuracy: {accuracy/samples*100:.3f}%"
    )


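# Fixed 10-word prompt used to compare generations across runs
# (`val` is not defined anywhere in this notebook, so the prompt is set explicitly)
dummy_generate = "--É facil! respondeo a moça corando por sua vez; depois"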


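def get_random_sentence(loader):
    """
    Return a random input sequence from `loader`, decoded back to text.
    loader: train/val DataLoader
    """
    batch_index = random.randint(0, len(loader) - 1)
    # Advance a single iterator up to the chosen batch; calling
    # next(iter(loader)) in a loop would restart the loader on every call
    for i, data in enumerate(loader):
        if i == batch_index:
            break
    batch = data[0]
    sample_index = random.randint(0, len(batch) - 1)
    text = decode_sentence(batch[sample_index].tolist(), most_frequent_words)
    return text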


def generate_text(
    model, vocab, text=dummy_generate, max_length=15, random_generate=False
):
    if random_generate:
        text = get_random_sentence(val_loader)
    print(f"input text: {text}")
    model.eval()
    words = text.split(" ")
    words_encoded = encode_sentence(text, vocab)
    # Prompts shorter than context_size are fine; only an empty prompt is rejected
    if not words_encoded:
        raise ValueError("The input text must contain at least one token.")
    with torch.no_grad():
        for _ in range(max_length):
            # The model sees at most the last context_size tokens
            input_ids = words_encoded[-context_size:]
            _input = torch.tensor(input_ids).unsqueeze(0).to(device)

            # Keep only the distribution predicted for the last position
            probs = F.softmax(model(_input)[:, -1, :], dim=-1)

            # Never sample <unk> (id 0), then renormalize
            probs[:, 0] = 0.0
            probs = probs / probs.sum(dim=-1, keepdim=True)

            # Sample the next token id from the distribution
            token_id = torch.multinomial(probs, num_samples=1).item()

            words_encoded.append(token_id)
            words.append(most_frequent_words[token_id])
    return " ".join(words)

In [None]:
run_eval(model, val_loader)
print()
generate_text(model, vocab, text=dummy_generate, random_generate=False)

Val Loss: 9.393739901090923, Val Perplexity: 12012.9404296875, Val Accuracy: 0.006%

input text: --É facil! respondeo a moça corando por sua vez; depois


'--É facil! respondeo a moça corando por sua vez; depois conservarei abraçai cabecinha fizerão do amo ignorado largura gozava caro even responder soffrer abbade ouviste'

## Train

In [None]:
for epoch in range(epochs):
    model.train()
    train_loss = 0
    train_accuracy = 0
    train_samples = 0

    for _input, target in train_loader:
        _input, target = _input.to(device), target.to(device)

        output = model(_input)
        B, T, C = output.shape

        output = output.view(B * T, C)
        target = target.view(B * T)

        loss = criterion(output, target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        train_accuracy += (output.argmax(dim=1) == target).sum().item()
        train_samples += output.shape[0]

    train_loss /= len(train_loader)
    train_accuracy /= train_samples
    train_perplexity = torch.exp(torch.tensor(train_loss))

    print(f"Epoch {epoch+1}/{epochs}")
    print(
        f"Train Loss: {train_loss}, Train Perplexity: {train_perplexity.item()}, Train Accuracy: {train_accuracy*100:.3f}%"
    )

    run_eval(model, val_loader)
    print()

Epoch 1/10
Train Loss: 7.140251952332336, Train Perplexity: 1261.7464599609375, Train Accuracy: 7.961%
Val Loss: 6.305620093094675, Val Perplexity: 547.64111328125, Val Accuracy: 10.108%

Epoch 2/10
Train Loss: 6.0492678060160054, Train Perplexity: 423.8025817871094, Train Accuracy: 11.677%
Val Loss: 5.749647190696315, Val Perplexity: 314.0798034667969, Val Accuracy: 13.218%

Epoch 3/10
Train Loss: 5.486372650443734, Train Perplexity: 241.3800048828125, Train Accuracy: 14.545%
Val Loss: 5.265896847373561, Val Perplexity: 193.619873046875, Val Accuracy: 15.714%

Epoch 4/10
Train Loss: 5.09214211748792, Train Perplexity: 162.73809814453125, Train Accuracy: 16.531%
Val Loss: 4.955184585169742, Val Perplexity: 141.90878295898438, Val Accuracy: 17.221%

Epoch 5/10
Train Loss: 4.814550331660679, Train Perplexity: 123.29136657714844, Train Accuracy: 17.875%
Val Loss: 4.718911196056165, Val Perplexity: 112.04618835449219, Val Accuracy: 18.496%

Epoch 6/10
Train Loss: 4.591078609615177, Train P

## Evaluation

In [None]:
run_eval(model, val_loader)
print()
generate_text(model, vocab, text=dummy_generate, max_length=15, random_generate=False)

Val Loss: 3.9374912412543046, Val Perplexity: 51.289764404296875, Val Accuracy: 24.665%

input text: --É facil! respondeo a moça corando por sua vez; depois


'--É facil! respondeo a moça corando por sua vez; depois immovel com um prazer contra a qual lhe deo sessenta na realisação da terra ;'

## Usage example

In [None]:
text = "um dia a praia será um local"

max_length = 30
generate_text(model, vocab, text, max_length)

input text: um dia a praia será um local


'um dia a praia será um local servio e o ar negros : a relva me déste ; atacão , sob por dôr em sangue como ha cousa de te pede . ao longe de uma vez'

## Single Head

In [None]:
context_size = 9
embedding_dim = 64
batch_size = 1024
debug = False
dropout = 0.2
vocab_size = 10001  # Vocabulary already processed, 10000 + 1
n_head = 1
n_layer = 1
pattern_re = r"\w+|[,;.:!?\']"

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_single_head = LanguageModel(
    vocab_size,
    embedding_dim=embedding_dim,
    context_size=context_size,
    n_layer=n_layer,
    n_head=n_head,
).to(device)

In [None]:
print(f"The model has {count_parameters(model_single_head):,} trainable parameters")

The model has 1,340,625 trainable parameters


In [None]:
model_single_head

LanguageModel(
  (embedding): Embedding(10001, 64)
  (positional_embedding): Embedding(9, 64)
  (blocks): Sequential(
    (0): Block(
      (mha): MultiheadAttention(
        (heads): ModuleList(
          (0): Head(
            (key): Linear(in_features=64, out_features=64, bias=False)
            (value): Linear(in_features=64, out_features=64, bias=False)
            (query): Linear(in_features=64, out_features=64, bias=False)
            (dropout): Dropout(p=0.2, inplace=False)
          )
        )
        (wo): Linear(in_features=64, out_features=64, bias=True)
        (dropout): Dropout(p=0.2, inplace=False)
      )
      (ff): FeedForward(
        (linear_1): Linear(in_features=64, out_features=256, bias=True)
        (relu): ReLU()
        (linear_2): Linear(in_features=256, out_features=64, bias=True)
        (dropout): Dropout(p=0.2, inplace=False)
      )
      (ln_1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
      (ln_2): LayerNorm((64,), eps=1e-05, elementwise

In [None]:
epochs = 10
lr = 1e-3
criterion = nn.CrossEntropyLoss()
# The optimizer must hold the single-head model's parameters, not those of `model`
optimizer = torch.optim.AdamW(model_single_head.parameters(), lr=lr)

In [None]:
run_eval(model_single_head, val_loader)
print()
generate_text(
    model_single_head, vocab, text=dummy_generate, max_length=15, random_generate=False
)

Val Loss: 9.347227347524543, Val Perplexity: 11466.982421875, Val Accuracy: 0.009%

input text: --É facil! respondeo a moça corando por sua vez; depois


'--É facil! respondeo a moça corando por sua vez; depois franco riçar prazer demonios brocados comprehende desapego dobrado sublime imperiosa jardim procurou tabaco lamparina ignorão'

In [None]:
for epoch in range(epochs):
    model_single_head.train()
    train_loss = 0
    train_accuracy = 0
    train_samples = 0

    for _input, target in train_loader:
        _input, target = _input.to(device), target.to(device)

        output = model_single_head(_input)
        B, T, C = output.shape

        output = output.view(B * T, C)
        target = target.view(B * T)

        loss = criterion(output, target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        train_accuracy += (output.argmax(dim=1) == target).sum().item()
        train_samples += output.shape[0]

    train_loss /= len(train_loader)
    train_accuracy /= train_samples
    train_perplexity = torch.exp(torch.tensor(train_loss))

    print(f"Epoch {epoch+1}/{epochs}")
    print(
        f"Train Loss: {train_loss}, Train Perplexity: {train_perplexity.item()}, Train Accuracy: {train_accuracy*100:.3f}%"
    )

    run_eval(model_single_head, val_loader)
    print()

Epoch 1/10
Train Loss: 7.142873404862045, Train Perplexity: 1265.05810546875, Train Accuracy: 7.791%
Val Loss: 6.305801617471795, Val Perplexity: 547.7403564453125, Val Accuracy: 10.073%

Epoch 2/10
Train Loss: 6.04838582447597, Train Perplexity: 423.42889404296875, Train Accuracy: 11.665%
Val Loss: 5.749014954817922, Val Perplexity: 313.88128662109375, Val Accuracy: 13.136%

Epoch 3/10
Train Loss: 5.4855546951293945, Train Perplexity: 241.1826934814453, Train Accuracy: 14.533%
Val Loss: 5.261554416857268, Val Perplexity: 192.7808837890625, Val Accuracy: 15.696%

Epoch 4/10
Train Loss: 5.092209190517277, Train Perplexity: 162.74903869628906, Train Accuracy: 16.501%
Val Loss: 4.957072910509612, Val Perplexity: 142.177001953125, Val Accuracy: 17.191%

Epoch 5/10
Train Loss: 4.816159948126062, Train Perplexity: 123.48994445800781, Train Accuracy: 17.843%
Val Loss: 4.719716649306448, Val Perplexity: 112.13645935058594, Val Accuracy: 18.599%

Epoch 6/10
Train Loss: 4.592525667958445, Train 

In [None]:
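# In the recorded run, the metrics printed below are identical to the pre-training
# evaluation of model_single_head, i.e. its weights had not been updated
run_eval(model_single_head, val_loader)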
print()
generate_text(model_single_head, vocab, text=dummy_generate, random_generate=False)

Val Loss: 9.347227347524543, Val Perplexity: 11466.982421875, Val Accuracy: 0.009%

input text: --É facil! respondeo a moça corando por sua vez; depois


'--É facil! respondeo a moça corando por sua vez; depois sobresaltando libras phrases estime estremecia diria caminhada ceremonia trabalho contrariedade rei azulão infantil elevarão cujo'