<a href="https://colab.research.google.com/github/Antuko7/Story_Creation_IA/blob/main/Childrems_Stories_Creation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
nome = "Antonio Angulo"
print(f'My name is {nome}')

My name is Antonio Angulo


# Exercise: Language Model (Bengio 2003) - MLP + Embeddings

In this exercise we train a simple neural network to predict the next word of a text, given the previous words as input. This task is called "language modeling".
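To make the task concrete, here is a minimal sketch of how (context, target) training pairs can be built with a sliding window. It assumes whitespace-split words instead of BERT tokens, and the helper name `make_examples` is hypothetical, introduced only for illustration.

```python
from typing import List, Tuple


def make_examples(tokens: List[str], context_size: int) -> List[Tuple[List[str], str]]:
    """Pair every window of `context_size` tokens with the token that follows it."""
    return [
        (tokens[i:i + context_size], tokens[i + context_size])
        for i in range(len(tokens) - context_size)
    ]


words = "Once upon a time a castle".split()
examples = make_examples(words, context_size=3)
for context, target in examples:
    print(context, "->", target)
```

A text with `n` tokens yields `n - context_size` examples, so texts with `context_size` tokens or fewer contribute nothing.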

The dataset is fairly large, so you will most likely need a GPU to run your experiments.

Some useful tips:
- **WARNING:** the dataset is quite large. Do not print it.
- While debugging, make your dataset very small so that debugging is faster and does not need a GPU. Only turn the GPU on once your training loop is working.
- Do not leave this exercise for the day before the deadline. It is labor-intensive.
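One possible way to follow the second tip is to wrap the full dataset in `torch.utils.data.Subset` while debugging. The sketch below uses a toy `TensorDataset` as a stand-in for the real dataset, and `debug_size` is a hypothetical name with an arbitrary value.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Toy stand-in for the real dataset: 1000 (context, target) pairs of token ids.
full_dataset = TensorDataset(
    torch.randint(0, 100, (1000, 9)),  # contexts of 9 token ids
    torch.randint(0, 100, (1000,)),    # next-token targets
)

debug_size = 64  # hypothetical value; small enough to iterate quickly on CPU
debug_dataset = Subset(full_dataset, range(debug_size))
debug_loader = DataLoader(debug_dataset, batch_size=32, shuffle=True)

for contexts, targets in debug_loader:
    print(contexts.shape, targets.shape)
```

Switching back to the full dataset then only requires passing `full_dataset` to the `DataLoader`.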

In [None]:
# We use the transformers library to get access to the BERT tokenizer.
!pip install transformers



## Package imports

In [None]:
import collections
import itertools
import functools
import math
import random

import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import DataLoader
from tqdm import tqdm_notebook


In [None]:
# Check which GPU we are using
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f'Using {device}')

Using cpu


In [None]:
random.seed(123)
np.random.seed(123)
torch.manual_seed(123)
torch.cuda.manual_seed(123)

## MyDataset implementation

In [None]:
from typing import List

from torch.utils.data import Dataset


def tokenize(text: str, tokenizer):
    # Return plain Python token ids, without the [CLS]/[SEP] special tokens.
    return tokenizer(text, return_tensors=None, add_special_tokens=False).input_ids


class MyDataset(Dataset):
    def __init__(self, texts: List[str], tokenizer, context_size: int):
        self.tokensIds_n = []  # contexts: lists of context_size token ids
        self.y = []            # targets: the token id that follows each context
        for text in texts:
            tokens_ids = tokenize(text, tokenizer)
            # Slide a window over the text; texts with at most context_size tokens yield no examples.
            for i in range(len(tokens_ids) - context_size):
                self.tokensIds_n.append(tokens_ids[i:i + context_size])
                self.y.append(tokens_ids[i + context_size])

    def __len__(self):
        return len(self.tokensIds_n)

    def __getitem__(self, idx):
        return torch.tensor(self.tokensIds_n[idx]).long(), torch.tensor(self.y[idx]).long()

## Test that your MyDataset implementation is correct

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

dummy_texts = ['Once upon a time a castle ', 'The secret of the pirate']

dummy_dataset = MyDataset(texts=dummy_texts, tokenizer=tokenizer, context_size=3)
dummy_loader = DataLoader(dummy_dataset, batch_size=6, shuffle=False)
print(len(dummy_dataset))
assert len(dummy_dataset) == 5
print('passed the dataset size assert')

5
passed the dataset size assert


# Loading the dataset

We use a small sample of the [Children Stories Text Corpus](https://www.kaggle.com/datasets/edenbd/children-stories-text-corpus) dataset to train and evaluate our language model.

In [None]:
!wget -nc https://github.com/Antuko7/Story_Creation_IA/cleaned_merged_fairy_tales_without_eos.txt

File ‘cleaned_merged_fairy_tales_without_eos.txt’ already there; not retrieving.



In [None]:
# Load datasets
context_size = 9

valid_examples = 100
test_examples = 100
with open('cleaned_merged_fairy_tales_without_eos.txt') as f:
    texts = f.readlines()

#print('Truncating for debugging purposes.')
#texts = texts[:500]  

training_texts = texts[:-(valid_examples + test_examples)]
valid_texts = texts[-(valid_examples + test_examples):-test_examples]
test_texts = texts[-test_examples:]

training_dataset = MyDataset(texts=training_texts, tokenizer=tokenizer, context_size=context_size)
valid_dataset = MyDataset(texts=valid_texts, tokenizer=tokenizer, context_size=context_size)
test_dataset = MyDataset(texts=test_texts, tokenizer=tokenizer, context_size=context_size)

Token indices sequence length is longer than the specified maximum sequence length for this model (1490 > 512). Running this sequence through the model will result in indexing errors


In [None]:
print(f'training examples: {len(training_dataset)}')
print(f'valid examples: {len(valid_dataset)}')
print(f'test examples: {len(test_dataset)}')

training examples: 49148
valid examples: 1829
test examples: 3995


In [None]:
class LanguageModel(torch.nn.Module):

    def __init__(self, vocab_size, context_size, embedding_dim, hidden_size):
        """
        Implements the Neural Language Model proposed by Bengio et al. (2003).

        Args:
            vocab_size (int): Size of the input vocabulary.
            context_size (int): Size of the sequence to consider as context for prediction.
            embedding_dim (int): Dimension of the embedding layer for each word in the context.
            hidden_size (int): Size of the hidden layer.
        """
        super(LanguageModel, self).__init__()

        self.context_size = context_size
        self.embeddings_dim = embedding_dim
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(self.context_size*self.embeddings_dim, hidden_size)
        self.linear2 = nn.Linear(hidden_size, hidden_size*2)  # one extra layer compared to Bengio's original model
        self.linear3 = nn.Linear(hidden_size*2, vocab_size, bias=False)
        self.relu1 = nn.ReLU()
        self.relu2 = nn.ReLU()

    def forward(self, inputs):
        """
        Args:
            inputs is a LongTensor of shape (batch_size, context_size)
        """
        out =  self.embeddings(inputs).view(-1,self.context_size*self.embeddings_dim)
        out = self.linear1(out)
        out = self.relu1(out)
        out = self.linear2(out)
        out = self.relu2(out)
        out = self.linear3(out)
        
        return out

## Test the model with an example

In [None]:
model = LanguageModel(
    vocab_size=tokenizer.vocab_size,
    context_size=context_size,
    embedding_dim=32,
    hidden_size=64,
).to(device)

sample_train, _ = next(iter(DataLoader(training_dataset)))
sample_train_gpu = sample_train.to(device)
model(sample_train_gpu).shape

torch.Size([1, 28996])

In [None]:
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Number of model parameters: {num_params}')

Number of model parameters: 4666176
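The reported count can be checked by hand. The sketch below adds up the weights and biases of each layer for the configuration used above (a vocabulary of 28996 tokens, `context_size=9`, `embedding_dim=32`, `hidden_size=64`), keeping in mind that `linear3` is created with `bias=False`.

```python
vocab_size, context_size, embedding_dim, hidden_size = 28996, 9, 32, 64

# One embedding vector per vocabulary entry.
embeddings = vocab_size * embedding_dim
# Weights plus biases of the first two linear layers.
linear1 = context_size * embedding_dim * hidden_size + hidden_size
linear2 = hidden_size * (hidden_size * 2) + hidden_size * 2
# The output projection has weights only (bias=False).
linear3 = (hidden_size * 2) * vocab_size

total = embeddings + linear1 + linear2 + linear3
print(total)  # 4666176
```

Roughly 80% of the parameters sit in the output projection `linear3`, and most of the rest in the embedding table; the hidden layers are comparatively tiny.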


## Perplexity assert


In [None]:
def perplexity(logits, target):
    """
    Computes the perplexity.

    Args:
        logits: a FloatTensor of shape (batch_size, vocab_size)
        target: a LongTensor of shape (batch_size,)

    Returns:
        A scalar FloatTensor holding the perplexity.
    """
    # Perplexity is the exponential of the mean cross-entropy (in nats).
    crossentropy = nn.functional.cross_entropy(logits, target)
    return torch.exp(crossentropy)


n_examples = 100

sample_train, target_token_ids = next(iter(DataLoader(training_dataset, batch_size=n_examples)))
sample_train_gpu = sample_train.to(device)
target_token_ids = target_token_ids.to(device)
logits = model(sample_train_gpu)

my_perplexity = perplexity(logits=logits, target=target_token_ids)

print(f'my perplexity:              {int(my_perplexity)}')
print(f'correct initial perplexity: {tokenizer.vocab_size}')

assert math.isclose(my_perplexity, tokenizer.vocab_size, abs_tol=2000)
print('Passed the perplexity assert')

my perplexity:              28650
correct initial perplexity: 28996
Passed the perplexity assert
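The expected initial value follows from the definition: perplexity is the exponential of the mean cross-entropy, and a model that spreads probability uniformly over V tokens has cross-entropy log V, hence perplexity V. A minimal sketch that checks this with constant logits (the batch size is illustrative):

```python
import math

import torch
import torch.nn.functional as F

vocab_size = 28996  # vocabulary size printed by the assert cell above
batch_size = 4      # illustrative

# All-zero logits give a uniform distribution after softmax.
uniform_logits = torch.zeros(batch_size, vocab_size)
targets = torch.randint(0, vocab_size, (batch_size,))

cross_entropy = F.cross_entropy(uniform_logits, targets)
uniform_ppl = torch.exp(cross_entropy).item()

print(f"cross-entropy: {cross_entropy.item():.4f}  (log V = {math.log(vocab_size):.4f})")
print(f"perplexity:    {uniform_ppl:.1f}")
```

A freshly initialized network is close to uniform but not exactly so, which is why the assert allows a tolerance.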


## Training and validation loop

In [None]:
max_examples = 100_000
eval_every_steps = 100
lr = 3e-5


model = LanguageModel(
    vocab_size=tokenizer.vocab_size,
    context_size=context_size,
    embedding_dim=32,
    hidden_size=64,
).to(device)

train_loader = DataLoader(training_dataset, batch_size=32, shuffle=True, drop_last=True)
validation_loader = DataLoader(valid_dataset, batch_size=32)

optimizer = torch.optim.Adam(model.parameters(), lr=lr)


def train_step(input, target):
    model.train()
    model.zero_grad()

    logits = model(input.to(device))
    loss = nn.functional.cross_entropy(logits, target.to(device))
    loss.backward()
    optimizer.step()

    return loss.item()


def validation_step(input, target):
    model.eval()  # train_step switches back to training mode on its next call
    logits = model(input)
    loss = nn.functional.cross_entropy(logits, target)
    return loss.item()


train_losses = []
n_examples = 0
step = 0
while n_examples < max_examples:
    for input, target in train_loader:
        loss = train_step(input, target)  # train_step moves the batch to the device
        train_losses.append(loss)
        
        if step % eval_every_steps == 0:
            train_ppl = np.exp(np.average(train_losses))

            with torch.no_grad():
                valid_ppl = np.exp(np.average([
                    validation_step(input.to(device), target.to(device))
                    for input, target in validation_loader]))

            print(f'{step} steps; {n_examples} examples so far; train ppl: {train_ppl:.2f}, valid ppl: {valid_ppl:.2f}')
            train_losses = []

        n_examples += len(input)  # Increment of batch size
        step += 1
        if n_examples >= max_examples:
            break

0 steps; 0 examples so far; train ppl: 29626.91, valid ppl: 28612.29
100 steps; 3200 examples so far; train ppl: 28359.82, valid ppl: 27276.89
200 steps; 6400 examples so far; train ppl: 26367.70, valid ppl: 25675.30
300 steps; 9600 examples so far; train ppl: 23865.79, valid ppl: 23542.38
400 steps; 12800 examples so far; train ppl: 20182.51, valid ppl: 20405.98
500 steps; 16000 examples so far; train ppl: 14869.95, valid ppl: 15820.30
600 steps; 19200 examples so far; train ppl: 8951.48, valid ppl: 9925.95
700 steps; 22400 examples so far; train ppl: 3777.18, valid ppl: 4463.70
800 steps; 25600 examples so far; train ppl: 1051.95, valid ppl: 1571.66
900 steps; 28800 examples so far; train ppl: 409.00, valid ppl: 700.05
1000 steps; 32000 examples so far; train ppl: 233.05, valid ppl: 464.92
1100 steps; 35200 examples so far; train ppl: 175.22, valid ppl: 380.85
1200 steps; 38400 examples so far; train ppl: 154.36, valid ppl: 337.90
1300 steps; 41600 examples so far; train ppl: 144.04,

## Final evaluation on the test dataset

Bonus: the model with the lowest perplexity on the test dataset earns 0.5 points on the final grade.

In [None]:
test_loader = DataLoader(test_dataset, batch_size=64)

def validation_step(input, target):
    model.eval()
    logits = model(input)
    loss = nn.functional.cross_entropy(logits, target)
    return loss.item()

with torch.no_grad():
    test_ppl = np.exp(np.average([
        validation_step(input.to(device), target.to(device))
        for input, target in test_loader
    ]))

print(f'test perplexity: {test_ppl}')

test perplexity: 77.60827536967055
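Note that averaging per-batch losses weights every batch equally, so a smaller final batch counts as much as a full one. Below is a sketch of an example-weighted alternative; the loss values are made up purely for illustration, and the helper `weighted_perplexity` is hypothetical.

```python
import math
from typing import Sequence


def weighted_perplexity(batch_losses: Sequence[float], batch_sizes: Sequence[int]) -> float:
    """Perplexity from per-batch mean losses, weighting each batch by its size."""
    total_loss = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return math.exp(total_loss / sum(batch_sizes))


# Made-up numbers: two full batches of 64 and a final batch of 3 examples.
losses = [4.0, 4.2, 6.0]
sizes = [64, 64, 3]

unweighted = math.exp(sum(losses) / len(losses))
weighted = weighted_perplexity(losses, sizes)
print(f"unweighted: {unweighted:.2f}, weighted: {weighted:.2f}")
```

When all batches have the same size the two values coincide, so the difference only matters when the last batch is much smaller than the rest.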


## Test your model with a sentence

Pick a sentence generated by the model that you find interesting.

In [None]:
prompt = 'After they had gone he felt lonely and concern'  # e.g. 'I like to eat pizza because it makes me'
max_output_tokens = 10

for _ in range(max_output_tokens):
    input_ids = tokenize(text=prompt, tokenizer=tokenizer)
    input_ids_truncated = input_ids[-context_size:]  # Only the last <context_size> tokens are fed to the model.
    input_ids_truncated = torch.tensor(input_ids_truncated).long()
    logits = model(input_ids_truncated.to(device))
    # With argmax, the model output at each step is the highest-probability token.
    # This is called greedy decoding.
    predicted_id = torch.argmax(logits).item()
    input_ids += [predicted_id]  # Append the token chosen at this step to the input.
    prompt = tokenizer.decode(input_ids)
    print(prompt)

After they had gone he felt lonely and concern "
After they had gone he felt lonely and concern " q
After they had gone he felt lonely and concern " q -
After they had gone he felt lonely and concern " q - -
After they had gone he felt lonely and concern " q - - -
After they had gone he felt lonely and concern " q - - - "
After they had gone he felt lonely and concern " q - - - " -
After they had gone he felt lonely and concern " q - - - " - "
After they had gone he felt lonely and concern " q - - - " - " -
After they had gone he felt lonely and concern " q - - - " - " - -
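
Greedy decoding always picks the single most likely token, which can lead to repetitive output like the one above. A common alternative is to sample from the predicted distribution, optionally restricted to the k most likely tokens and rescaled by a temperature. The sketch below is an illustration, not part of the original exercise: `sample_next_token`, `temperature`, and `top_k` are hypothetical names, and the logits are random stand-ins for a model output.

```python
import torch


def sample_next_token(logits: torch.Tensor, temperature: float = 1.0, top_k: int = 50) -> int:
    """Sample a token id from the top_k most likely entries of a 1-D logits tensor."""
    top_values, top_indices = torch.topk(logits, k=top_k)
    probs = torch.softmax(top_values / temperature, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)  # index into the top_k list
    return top_indices[choice].item()


torch.manual_seed(0)
fake_logits = torch.randn(1000)  # stand-in for the logits of a single prediction

print(sample_next_token(fake_logits, temperature=0.8, top_k=10))
# With top_k=1 the procedure reduces to greedy decoding.
print(sample_next_token(fake_logits, top_k=1) == torch.argmax(fake_logits).item())
```

Lower temperatures concentrate probability on the most likely tokens, while higher ones make the output more varied.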
