# Reference notebook

Name: Pedro Rodrigues Corrêa

## Instructions:


Train a BERT model (or a variant) for binary classification on the IMDB dataset (20k/5k training/validation samples) and measure its accuracy.

Important:
- You must implement your own training loop.
- Implement gradient accumulation.

Tips:
- BERT usually learns a task well in a few epochs (3 to 5). If it takes more than 5 epochs to reach 80% accuracy, adjust the hyperparameters.

- Fix for out-of-memory errors:
  - Using bfloat16 allows nearly doubling the batch size
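
As a rough illustration of the bfloat16 tip: a bfloat16 value keeps the sign bit, the 8 exponent bits and the top 7 mantissa bits of a float32, so it occupies 2 bytes instead of 4. The sketch below emulates the format with the standard library only. The helper names are hypothetical, and truncation is used for simplicity, whereas real hardware typically rounds to nearest.

```python
import struct

def to_bfloat16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern of x (truncating, not rounding)."""
    float32_bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return float32_bits >> 16  # keep sign, 8 exponent bits, top 7 mantissa bits

def from_bfloat16_bits(bits: int) -> float:
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]

value = 3.14159
approx = from_bfloat16_bits(to_bfloat16_bits(value))
relative_error = abs(approx - value) / value

print(approx)          # value after the round trip through 16 bits
print(relative_error)  # stays below 2**-7 because 7 mantissa bits are kept
```

The batch size only "nearly" doubles because, in a typical mixed-precision setup, some tensors are still kept in float32.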

Optional:
- You may use the Trainer class from the Hugging Face Transformers library to check that your training loop is correct. Note that implementing your own loop is still mandatory.

# Parameters

In [None]:
batch_size = 32
lr = 1e-4
epochs = 3
max_len = 128
accumulation_steps = 3
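
To make the interaction between `batch_size` and `accumulation_steps` concrete, the following sketch works out the effective batch size and the number of optimizer updates per epoch. It assumes the 20k training split mentioned in the instructions; `num_train_examples` is a name introduced here for illustration.

```python
import math

batch_size = 32
accumulation_steps = 3
num_train_examples = 20_000  # assumed: size of the training split from the instructions

# Each optimizer update sees the gradient of this many examples.
effective_batch_size = batch_size * accumulation_steps

# Number of micro-batches the DataLoader yields per epoch (last one may be smaller).
batches_per_epoch = math.ceil(num_train_examples / batch_size)

# Complete accumulation groups, i.e. full optimizer updates, per epoch.
updates_per_epoch = batches_per_epoch // accumulation_steps

print(effective_batch_size, batches_per_epoch, updates_per_epoch)
```

Because the learning rate acts once per optimizer update, it is the effective batch size that matters when tuning `lr`.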

# Fixing the seed

In [None]:
import random
import torch
import torch.nn.functional as F
import numpy as np
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertTokenizer, AutoModelForSequenceClassification
from tqdm import tqdm

In [None]:
random.seed(123)
np.random.seed(123)
torch.manual_seed(123)

<torch._C.Generator at 0x7be8e8d9da90>
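
The point of fixing the seed is that random operations, such as the shuffle used later for the train/validation split, give the same result on every run. A minimal standard-library sketch of that property follows; `shuffled_copy` is a hypothetical helper, not part of the notebook.

```python
import random

def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a dedicated, seeded generator."""
    rng = random.Random(seed)  # independent generator; global state is untouched
    items = list(items)
    rng.shuffle(items)
    return items

data = list(range(10))
first = shuffled_copy(data, seed=123)
second = shuffled_copy(data, seed=123)

print(first == second)  # same seed, same order
```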

## Preparing the data

First, we download the dataset:

In [None]:
!wget -nc http://files.fast.ai/data/aclImdb.tgz
!tar -xzf aclImdb.tgz

File ‘aclImdb.tgz’ already there; not retrieving.


## Loading the dataset

We artificially create a training split (20k examples) and a validation split (5k examples).

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [None]:
import os

max_valid = 5000

def load_texts(folder):
    texts = []
    for path in os.listdir(folder):
        with open(os.path.join(folder, path), encoding='utf-8') as f:
            texts.append(f.read())
    return texts

x_train_pos = load_texts('aclImdb/train/pos')
x_train_neg = load_texts('aclImdb/train/neg')
x_test_pos = load_texts('aclImdb/test/pos')
x_test_neg = load_texts('aclImdb/test/neg')

x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg
y_train = [True] * len(x_train_pos) + [False] * len(x_train_neg)
y_test = [True] * len(x_test_pos) + [False] * len(x_test_neg)

# Shuffle the training set before splitting it into train/validation.
c = list(zip(x_train, y_train))
random.shuffle(c)
x_train, y_train = zip(*c)

x_valid = x_train[-max_valid:]
y_valid = y_train[-max_valid:]
x_train = x_train[:-max_valid]
y_train = y_train[:-max_valid]

print(len(x_train), 'training samples.')
print(len(x_valid), 'development samples.')
print(len(x_test), 'test samples.')

print('First 3 training samples:')
for x, y in zip(x_train[:3], y_train[:3]):
    print(y, x[:100])

print('Last 3 training samples:')
for x, y in zip(x_train[-3:], y_train[-3:]):
    print(y, x[:100])

print('First 3 validation samples:')
for x, y in zip(x_valid[:3], y_valid[:3]):
    print(y, x[:100])

print('Last 3 validation samples:')
for x, y in zip(x_valid[-3:], y_valid[-3:]):
    print(y, x[:100])

20000 training samples.
5000 development samples.
25000 test samples.
First 3 training samples:
False This film is a perfect example of the recent crop of horror films that simply are not fully realized
True This has the funnist jokes out of all the Cheech & Chong flicks. It's the first one I saw with these
True what is wrong with you people, if you weren't blown away by the action car sequences and jessica Sim
Last 3 training samples:
False The movie is boring, the characters and scenarios are unrealistic, unbelievable, the action is hilar
True This is an above average Jackie Chan flick, due to the fantastic finale and great humor, however oth
True A stunning realization occurs when some sort of phenomenon takes place!! Be it, firecrackers going o
First 3 validation samples:
True Definitely an odd debut for Michael Madsen. Madsen plays Cecil Moe, an alcoholic family man whose li
True Probably the worst Dolph film ever. There's nothing you'd want or expect here.
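
The split relies on shuffling texts and labels as pairs, so each review keeps its label, and then slicing off the last examples for validation. The toy sketch below shows the same idea on made-up data and checks the class balance afterwards; `split_train_valid` and the toy reviews are hypothetical.

```python
import random
from collections import Counter

def split_train_valid(texts, labels, n_valid, seed=123):
    """Shuffle (text, label) pairs together and hold out the last n_valid pairs."""
    pairs = list(zip(texts, labels))
    random.Random(seed).shuffle(pairs)
    return pairs[:-n_valid], pairs[-n_valid:]

texts = [f"review {i}" for i in range(10)]
labels = [True] * 5 + [False] * 5
train, valid = split_train_valid(texts, labels, n_valid=2)

print(len(train), len(valid))
print(Counter(label for _, label in train))  # class balance of the training part
```

A quick look at the label counts is a cheap way to confirm that the shuffle happened before the split; without it, the held-out slice would contain a single class.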

In [None]:
class IMDbDataset(Dataset):
    def __init__(self, reviews, labels, max_len):
        self.reviews = reviews
        self.labels = labels
        self.tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
        self.max_len = max_len

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, idx):
        review = str(self.reviews[idx])
        label = self.labels[idx]

        encoding = self.tokenizer.encode_plus(
            review,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            return_attention_mask=True,
            return_tensors='pt',
            truncation=True
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }
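
To clarify what `padding='max_length'` and `truncation=True` do to a single review, here is a simplified pure-Python imitation. It is a sketch under assumptions: `pad_or_truncate` is a hypothetical helper, it works on already-tokenized ids, and it uses the special-token ids of the BERT uncased vocabulary (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`).

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids of the BERT uncased vocabulary

def pad_or_truncate(token_ids, max_len):
    """Imitate padding='max_length' with truncation=True for one sequence."""
    body = token_ids[: max_len - 2]        # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 marks real tokens
    n_pad = max_len - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([2023, 2143], max_len=6)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1010)), max_len=6)

print(short_ids, short_mask)  # padded up to max_len, mask is 0 on the padding
print(long_ids, long_mask)    # cut down to max_len, mask is all ones
```

Every item therefore has exactly `max_len` ids, which is what lets the `DataLoader` stack them into a batch without a custom collate function.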

In [None]:
ds_train = IMDbDataset(x_train, y_train, max_len = max_len)

In [None]:
ds_train[0]

{'input_ids': tensor([  101,  2023,  2143,  2003,  1037,  3819,  2742,  1997,  1996,  3522,
         10416,  1997,  5469,  3152,  2008,  3432,  2024,  2025,  3929,  3651,
          1012,  2045,  2024,  2048,  5847,  2000,  2202,  1999,  5469,  3152,
          1024,  2593,  2017,  2123,  1005,  1056,  2428,  4863,  2054,  1005,
          1055,  2183,  2006,  1006,  2030,  2040,  1996,  6359,  2003,  1010,
          2066,  1999,  3146,  8859, 10376,  9288,  1007,  2030,  2507,  1996,
          3494,  1037,  2843,  1997,  2067,  2466,  1998, 23191,  2061,  2008,
          2673,  2003,  4541,  1006, 14414,  2071,  9280,  2022,  2019,  2742,
          1997,  2023,  1007,  1012,  1026,  7987,  1013,  1028,  1026,  7987,
          1013,  1028,  6854,  1010, 19815, 11896,  1999,  2023,  2181,  1012,
          1045,  2156,  7078,  2053,  3114,  2000,  2507,  1037,  2235, 14021,
          5596,  1997,  1996,  2067,  2466,  2005,  7010,  2302,  3929, 11847,
          1996, 11305,  1997,  2010,  2

In [None]:
model = AutoModelForSequenceClassification.from_pretrained('distilbert/distilbert-base-uncased', num_labels=2).to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
train_dataset = IMDbDataset(x_train, y_train, max_len=max_len)
valid_dataset = IMDbDataset(x_valid, y_valid, max_len=max_len)
test_dataset = IMDbDataset(x_test, y_test, max_len=max_len)

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
valid_dataloader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

In [None]:
def evaluate(model, eval_dataloader, device = device):
    model.eval()
    total_loss = 0.0
    correct_preds = 0
    total_preds = 0

    with torch.no_grad():
        for batch in tqdm(eval_dataloader, desc="Evaluating"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            logits = outputs.logits

            _, predicted = torch.max(logits, 1)
            correct_preds += torch.sum(predicted == labels).item()
            total_preds += len(labels)

            total_loss += loss.item()

    avg_loss = total_loss / len(eval_dataloader)
    accuracy = correct_preds / total_preds

    return avg_loss, accuracy

In [None]:
# Check that the accuracy before training is approximately 50%
evaluate(model, valid_dataloader)

Evaluating: 100%|██████████| 313/313 [00:50<00:00,  6.24it/s]


(0.7040197356059529, 0.496)
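
The loss gives a second sanity check next to the accuracy. A classifier that assigns equal probability to both classes has a cross-entropy of ln 2, so a loss close to that value is what an untrained classification head is expected to produce. The short computation below shows the reference value.

```python
import math

num_classes = 2
uniform_prob = 1 / num_classes

# Cross-entropy of a prediction that gives every class the same probability.
chance_loss = -math.log(uniform_prob)

print(round(chance_loss, 4))
```

The measured loss of about 0.704 and accuracy of 0.496 are both close to these chance-level values.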

In [None]:
def train(model, train_dataloader, optimizer):
    model.train()
    steps = 0
    optimizer.zero_grad()

    for epoch in range(epochs):
        # Reset the running statistics so that each epoch reports its own metrics.
        total_loss = 0.0
        correct_preds = 0
        total_preds = 0

        for batch in tqdm(train_dataloader, desc=f"Epoch {epoch + 1}/{epochs} - Training"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            logits = outputs.logits

            _, predicted = torch.max(logits, 1)
            correct_preds += torch.sum(predicted == labels).item()
            total_preds += len(labels)

            # Log the unscaled loss; only the backpropagated value is divided, so
            # the accumulated gradient is an average over the micro-batches.
            total_loss += loss.item()
            (loss / accumulation_steps).backward()

            # Gradients are zeroed only after an optimizer step, so they
            # accumulate across `accumulation_steps` consecutive batches.
            if (steps + 1) % accumulation_steps == 0:
                optimizer.step()
                optimizer.zero_grad()

            steps += 1

        avg_loss = total_loss / len(train_dataloader)
        accuracy = correct_preds / total_preds

        print(f"Epoch {epoch + 1}/{epochs} - Training Loss: {avg_loss:.4f}, Training Accuracy: {accuracy:.4f}")

    # Apply any gradient left over from an incomplete accumulation group.
    if steps % accumulation_steps != 0:
        optimizer.step()
        optimizer.zero_grad()

    print("Training finished.")
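
The reason for dividing the loss by `accumulation_steps` before `backward()` can be checked on a toy problem. When the micro-batches have equal size, the sum of their scaled gradients equals the gradient of one large batch. The sketch below uses a one-parameter linear model with mean squared error in plain Python; the data and the `mse_grad` helper are made up for illustration.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
accumulation_steps = 3
micro_batch_size = len(xs) // accumulation_steps

# Gradient computed on all six examples at once.
full_batch_grad = mse_grad(w, xs, ys)

# Same gradient built from three micro-batches of two examples each.
accumulated_grad = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    micro_xs = xs[start:start + micro_batch_size]
    micro_ys = ys[start:start + micro_batch_size]
    accumulated_grad += mse_grad(w, micro_xs, micro_ys) / accumulation_steps

print(full_batch_grad, accumulated_grad)
```

The equality only holds if the gradient is not reset between micro-batches, which is why the gradients must be zeroed once per optimizer update and not once per batch.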

In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
train(model, train_dataloader, optimizer)

Epoch 1/3 - Training: 100%|██████████| 1250/1250 [05:30<00:00,  3.79it/s]


Epoch 1/3 - Training Loss: 0.1244, Training Accuracy: 0.8381


Epoch 2/3 - Training: 100%|██████████| 1250/1250 [05:11<00:00,  4.02it/s]


Epoch 2/3 - Training Loss: 0.2312, Training Accuracy: 0.8516


Epoch 3/3 - Training: 100%|██████████| 1250/1250 [05:10<00:00,  4.03it/s]

Epoch 3/3 - Training Loss: 0.3218, Training Accuracy: 0.8658
Training finished.



