# Reference notebook

Name:

## Instructions:


Train and measure the accuracy of a BERT model (or a variant) for binary classification on the IMDB dataset (20k/5k training/validation samples).

Important:
- You must implement your own training loop.
- Implement gradient accumulation.

Tips:
- BERT usually learns a task well in a few epochs (3 to 5). If it takes more than 5 epochs to reach 80% accuracy, adjust the hyperparameters.

- Fix for out-of-memory errors:
  - Using bfloat16 lets you almost double the batch size.
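
As a rough illustration of the bfloat16 tip, the arithmetic below compares the memory taken by a single activation tensor of shape (batch, sequence, hidden) in float32 (4 bytes per element) and in bfloat16 (2 bytes per element). The sequence length of 512 matches the `context_size` used later in this notebook; the hidden size of 768 is an assumption based on base-sized BERT models, and `activation_mib` is a hypothetical helper. Real memory use also includes weights, gradients and optimizer state, so this is only a sketch of why the batch size can roughly double.

```python
# Bytes used by one element of each dtype
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2}

def activation_mib(batch_size, seq_len, hidden_size, dtype):
    """Memory in MiB of one (batch_size, seq_len, hidden_size) tensor."""
    num_elements = batch_size * seq_len * hidden_size
    return num_elements * BYTES_PER_ELEMENT[dtype] / 2**20

# hidden_size=768 is an assumption (base-sized BERT variants)
fp32_batch_60 = activation_mib(60, 512, 768, "float32")
bf16_batch_60 = activation_mib(60, 512, 768, "bfloat16")
bf16_batch_120 = activation_mib(120, 512, 768, "bfloat16")

print(fp32_batch_60, bf16_batch_60, bf16_batch_120)  # 90.0 45.0 90.0
```

In this simplified view, a batch of 120 in bfloat16 costs the same as a batch of 60 in float32.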

Optional:
- You can use the Trainer class from the Hugging Face Transformers library to check that your training loop is correct. Note that implementing your own loop is still mandatory.

# Setting the seed

In [19]:
%pip install evaluate
from datetime import datetime
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [20]:
import random
import torch
import torch.nn.functional as F
import numpy as np

In [21]:
random.seed(123)
np.random.seed(123)
torch.manual_seed(123)

<torch._C.Generator at 0x7cb6c4f21ef0>
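
The three seeding calls above can be bundled into a single helper so the same seed is reapplied before each experiment. The sketch below is one possible way to do that; `set_seed` is a hypothetical name, and the explicit CUDA seeding is an extra precaution rather than something the original cell does.

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 123) -> None:
    """Seed Python, NumPy and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Explicitly seed every CUDA device when a GPU is present
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

# Re-seeding reproduces the same random draws
set_seed(123)
first_draw = torch.rand(3)
set_seed(123)
second_draw = torch.rand(3)
print(torch.equal(first_draw, second_draw))
```

Fixing the seeds makes shuffling and weight initialization repeatable, although some GPU kernels may still be non-deterministic.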

## Preparing the data

First, we download the dataset:

In [22]:
import os

if not os.path.isfile("aclImdb.tgz"):
	!wget -nc http://files.fast.ai/data/aclImdb.tgz
	!tar -xzf aclImdb.tgz

## Loading the dataset

We artificially create a training split (20k examples) and a validation split (5k examples).

In [23]:
import os

max_valid = 5000

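def load_texts(folder):
    # Read every review file in the folder into a list of strings
    texts = []
    for path in os.listdir(folder):
        # An explicit encoding avoids platform-dependent decoding errors
        with open(os.path.join(folder, path), encoding='utf-8') as f:
            texts.append(f.read())
    return texts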

x_train_pos = load_texts('aclImdb/train/pos')
x_train_neg = load_texts('aclImdb/train/neg')
x_test_pos = load_texts('aclImdb/test/pos')
x_test_neg = load_texts('aclImdb/test/neg')

x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg
y_train = [True] * len(x_train_pos) + [False] * len(x_train_neg)
y_test = [True] * len(x_test_pos) + [False] * len(x_test_neg)

# Shuffle the training set before splitting it into train/validation.
c = list(zip(x_train, y_train))
random.shuffle(c)
x_train, y_train = zip(*c)

x_valid = x_train[-max_valid:]
y_valid = y_train[-max_valid:]
x_train = x_train[:-max_valid]
y_train = y_train[:-max_valid]

print(len(x_train), 'training samples.')
print(len(x_valid), 'validation samples.')
print(len(x_test), 'test samples.')

print('First 3 training samples:')
for x, y in zip(x_train[:3], y_train[:3]):
    print(y, x[:100])

print('Last 3 training samples:')
for x, y in zip(x_train[-3:], y_train[-3:]):
    print(y, x[:100])

print('First 3 validation samples:')
# Pair validation texts with validation labels (not the test labels)
for x, y in zip(x_valid[:3], y_valid[:3]):
    print(y, x[:100])

print('Last 3 validation samples:')
for x, y in zip(x_valid[-3:], y_valid[-3:]):
    print(y, x[:100])

20000 training samples.
5000 validation samples.
25000 test samples.
First 3 training samples:
False I run a group to stop comedian exploitation and I just spent the past 2 months hearing horror storie
False This is not a good film. The acting is remarkably stiff and unconvincing.The film doesn't seem to kn
True Whatever you become in your life,you must never forget that you have roots.This is the story of true
Last 3 training samples:
False This movie never made it to theaters in our area, so when it became available on DVD I was one of th
True The film begins with a bunch of kids in reform school and focuses on a kid named 'Gabe', who has app
True A common plotline in films consists of the main characters leaving the hustle and bustle of the city
First 3 validation samples:
True If you've seen this movie, you've been to Puerto Rico. I've lived in Puerto Rico all my life, and ha
True Fairly funny Jim Carrey vehicle that has him as a News reporter who temporari

## Dataset class

In [24]:
from torch.utils.data import DataLoader, Dataset

class IMDBDataset(Dataset):
    def __init__(self, text_inputs, targets, context_size, tokenizer):
        self.text_inputs = text_inputs
        self.targets = targets
        self.context_size = context_size
        self.tokenizer = tokenizer

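        # Tokenize all texts up front. Calling the tokenizer directly replaces
        # the deprecated batch_encode_plus; padding=True pads every sequence
        # to the longest one in the dataset (at most max_length tokens).
        self.encoded_text = self.tokenizer(
            list(self.text_inputs),
            add_special_tokens=True,
            max_length=self.context_size,
            padding=True,
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )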

    def __len__(self):
        return len(self.text_inputs)

    def __getitem__(self, idx):
        inputs = self.encoded_text['input_ids'][idx]
        attention_mask = self.encoded_text['attention_mask'][idx]

        return {
            'inputs': inputs,
            'attention_mask': attention_mask,
            'targets': torch.tensor(self.targets[idx], dtype=torch.long)
        }


In [25]:
from transformers import AutoTokenizer

context_size = 512
#tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
tokenizer =  AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased')

train_data = IMDBDataset(x_train, y_train, context_size, tokenizer)
test_data = IMDBDataset(x_test, y_test, context_size, tokenizer)
val_data = IMDBDataset(x_valid, y_valid, context_size, tokenizer)

In [26]:
# Show the size of the training, test and validation sets
print(f'Train Length {len(train_data)}')
print(f'Test Length {len(test_data)}')
print(f'Validation Length {len(val_data)}')

Train Length 20000
Test Length 25000
Validation Length 5000


In [27]:
from transformers import DataCollatorWithPadding
#data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

batch_size = 60

# Only the training data needs shuffling; evaluation order does not affect the metrics.
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False)
val_loader = DataLoader(val_data, batch_size=batch_size, shuffle=False)
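
The commented-out `DataCollatorWithPadding` above points at dynamic padding, where each batch is padded only to its own longest sequence instead of to a dataset-wide length. The sketch below shows the idea in plain Python; `pad_batch` is a hypothetical helper, and the padding id of 0 is an assumption taken from the zeros visible in the sample batch printed in this notebook.

```python
def pad_batch(token_id_lists, pad_id=0):
    """Pad each sequence to the longest one in this batch only."""
    longest = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        num_padding = longest - len(ids)
        input_ids.append(list(ids) + [pad_id] * num_padding)
        # 1 marks real tokens, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * num_padding)
    return input_ids, attention_mask

# Two toy sequences of different lengths (token ids are illustrative)
toy_batch = [[101, 2023, 102], [101, 2023, 2003, 2307, 102]]
padded_ids, padded_mask = pad_batch(toy_batch)
print(padded_ids)   # [[101, 2023, 102, 0, 0], [101, 2023, 2003, 2307, 102]]
print(padded_mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Batches made of short reviews then occupy fewer tokens, which can reduce both memory use and compute per step.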

In [28]:
sample= next(iter(train_loader))
sample

{'inputs': tensor([[  101,  8840, 16585,  ...,  2077,  1010,   102],
         [  101,  2061,  8576,  ...,     0,     0,     0],
         [  101,  2065,  2017,  ...,     0,     0,     0],
         ...,
         [  101,  1045,  2442,  ...,     0,     0,     0],
         [  101,  2028,  1997,  ...,     0,     0,     0],
         [  101,  2023,  2003,  ...,     0,     0,     0]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 'targets': tensor([0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1,
         0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
         1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0])}

In [29]:
from transformers import AutoModelForSequenceClassification
#model  = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model  = AutoModelForSequenceClassification.from_pretrained('distilbert/distilbert-base-uncased', num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [30]:
# Use the GPU if one is available; otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [31]:
# Training setup
import torch.optim as optim
import torch.nn as nn

lr = 2e-5

criterion = nn.CrossEntropyLoss()
optimizer  = torch.optim.AdamW(model.parameters(), lr=lr)
#scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

val_loss_list = []
train_loss_list = []

model.to(device, dtype=torch.bfloat16)


OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 3.81 GiB of which 5.19 MiB is free. Including non-PyTorch memory, this process has 3.79 GiB memory in use. Of the allocated memory 3.70 GiB is allocated by PyTorch, and 448.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
from tqdm import tqdm
# Initial Loss
initial_loss = 0

model.eval()

with torch.no_grad():
    for batch in tqdm(train_loader):
        inputs = batch['inputs'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        targets = batch['targets'].to(device)
        # Forward pass
        outputs = model(inputs, attention_mask=attention_mask, labels=targets)
        # .item() converts the loss to a Python float so the sum is not kept in bfloat16
        initial_loss += outputs.loss.item()

initial_loss = initial_loss / len(train_loader)
print(f'\nInitial Loss: {initial_loss}')


100%|██████████| 334/334 [02:01<00:00,  2.75it/s]


Initial Loss: 0.73828125
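
Two numbers in the output above can be sanity-checked with simple arithmetic. An untrained binary classifier that assigns roughly equal probability to both classes should have a cross-entropy loss near ln(2), and the number of batches per epoch should equal the training set size divided by the batch size, rounded up. The sketch below uses the values from this notebook (20000 training samples, batch size 60).

```python
import math

num_classes = 2
# Cross-entropy of a uniform prediction over num_classes classes
expected_initial_loss = math.log(num_classes)

num_train_samples = 20000
batch_size = 60
# The last batch is smaller, so round up
steps_per_epoch = math.ceil(num_train_samples / batch_size)

print(f"{expected_initial_loss:.4f}")  # 0.6931
print(steps_per_epoch)                 # 334
```

The measured initial loss of about 0.74 is close to this reference value, and 334 matches the progress bar total.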





In [None]:
import time

def train_one_epoch(model, dataloader, optimizer, loss_function, epoch_index, device):
    # loss_function is kept in the signature for compatibility; the model
    # computes its own cross-entropy loss when `labels` is passed.
    running_loss = 0.0
    model.to(device)
    start_time = time.time()  # Start time of the epoch

    # Iterate over the dataloader argument rather than a global loader
    for batch in tqdm(dataloader):
        inputs = batch['inputs'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        targets = batch['targets'].to(device)

        # Forward pass
        optimizer.zero_grad()
        outputs = model(inputs, attention_mask=attention_mask, labels=targets)
        loss = outputs.loss

        # Backward and optimize
        # scaler.scale(outputs.loss).backward()
        # scaler.step(optimizer)
        # scaler.update()
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    epoch_duration = time.time() - start_time  # Duration of the epoch
    mean_loss = running_loss / len(dataloader)  # Average loss over all batches

    print(f'Epoch [{epoch_index + 1}], '
          f'Loss: {mean_loss:.4f}, '
          f'Elapsed Time: {epoch_duration:.2f} sec')

    return mean_loss
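
The instructions ask for gradient accumulation, which the function above does not implement: it calls `optimizer.step()` after every batch. A minimal sketch of the technique follows, using a toy linear classifier and random data as stand-ins (both are assumptions, not the notebook's model or dataset). Each micro-batch loss is divided by `accumulation_steps` so the accumulated gradient approximates the mean over the larger effective batch, and the optimizer only steps once per group of micro-batches.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins (assumptions): a linear classifier and random data
model = nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(32, 8)
labels = torch.randint(0, 2, (32,))
micro_batch_size = 4
accumulation_steps = 4  # effective batch size = 4 * 4 = 16

micro_batches = [
    (features[i:i + micro_batch_size], labels[i:i + micro_batch_size])
    for i in range(0, len(features), micro_batch_size)
]

optimizer_steps = 0
optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches, start=1):
    loss = loss_fn(model(x), y)
    # Scale the loss so the summed gradients average over the effective batch
    (loss / accumulation_steps).backward()
    # Update weights only after accumulation_steps micro-batches (or at the end)
    if step % accumulation_steps == 0 or step == len(micro_batches):
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1

print(f"{len(micro_batches)} micro-batches -> {optimizer_steps} optimizer steps")
```

With a small micro-batch and several accumulation steps, the effective batch size can stay large while only one micro-batch of activations is held in GPU memory at a time.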

In [None]:
# Training steps
import evaluate
import torch.nn.functional as F
import time
#from torch.cuda.amp import GradScaler

#scaler = GradScaler()
# Start at infinity so the first epoch always counts as an improvement
best_val_loss = float('inf')

num_epochs = 5
for epoch in range(num_epochs):
    print(f'Epoch {epoch}:')
    accuracy = evaluate.load("accuracy")
    model.train(True)
    loss = train_one_epoch(model, train_loader, optimizer, criterion, epoch, device)

    train_loss_list.append(loss)

    running_val_loss = 0.0

    # Put the model into evaluation mode
    model.eval()
    model.to(device)

    # Disable gradient tracking during validation
    print(f"Running Validation for Epoch {epoch}")
    with torch.no_grad():
        # val_batch avoids shadowing the val_data dataset defined earlier
        for val_batch in tqdm(val_loader):
            val_inputs = val_batch['inputs'].to(device)
            val_attention_mask = val_batch['attention_mask'].to(device)
            val_targets = val_batch['targets'].to(device)

            val_outputs = model(val_inputs, attention_mask=val_attention_mask, labels=val_targets)
            logits = val_outputs.logits
            probabilities = F.softmax(logits, dim=-1)
            predicted_labels = torch.argmax(probabilities, dim=-1)

            # Accumulate as a Python float rather than a bfloat16 tensor
            running_val_loss += val_outputs.loss.item()
            accuracy.add_batch(references=val_targets, predictions=predicted_labels)

    val_loss = running_val_loss / len(val_loader)
    val_loss_list.append(val_loss)
    val_acc = accuracy.compute()
    val_acc = val_acc['accuracy']

    # Track the best (lowest) validation loss and save that model's state
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), f'/content/drive/My Drive/ia024-aula4-models/model_{timestamp}_best')

    print(f'Validation Loss: {val_loss:.4f}')
    print(f'Validation Accuracy: {val_acc:.4f}')
    print('===============================================================================')

Epoch 0:


  0%|          | 0/334 [00:00<?, ?it/s]


OutOfMemoryError: CUDA out of memory. Tried to allocate 46.00 MiB. GPU 0 has a total capacity of 3.81 GiB of which 25.19 MiB is free. Including non-PyTorch memory, this process has 3.77 GiB memory in use. Of the allocated memory 3.65 GiB is allocated by PyTorch, and 32.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
from tqdm import tqdm
import evaluate
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

best_model  = AutoModelForSequenceClassification.from_pretrained('distilbert/distilbert-base-uncased', num_labels=2)
best_model.load_state_dict(torch.load(f'/content/drive/My Drive/ia/model_20240410_224419_best'))

accuracy = evaluate.load("accuracy")
best_model.eval()
best_model.to(device)

with torch.no_grad():
    # test_batch avoids shadowing the test_data dataset defined earlier
    for test_batch in tqdm(test_loader):
        test_inputs = test_batch['inputs'].to(device)
        test_attention_mask = test_batch['attention_mask'].to(device)
        test_targets = test_batch['targets'].to(device)

        test_outputs = best_model(test_inputs, attention_mask=test_attention_mask, labels=test_targets)
        logits = test_outputs.logits
        probabilities = F.softmax(logits, dim=-1)
        predicted_labels = torch.argmax(probabilities, dim=-1)

        accuracy.add_batch(references=test_targets, predictions=predicted_labels)

test_acc = accuracy.compute()
test_acc = test_acc['accuracy']

print(f'\nTest Accuracy: {test_acc:.4f}')
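
To cross-check the number reported by `evaluate`, accuracy can also be computed by hand as the fraction of predictions that match the reference labels. The sketch below is a plain-Python version; the `accuracy_score` name and the toy labels are illustrative assumptions.

```python
def accuracy_score(predictions, references):
    """Fraction of predictions equal to the reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return correct / len(references)

# Toy labels: 3 of the 4 predictions are correct
toy_predictions = [1, 0, 1, 1]
toy_references = [1, 0, 0, 1]
print(accuracy_score(toy_predictions, toy_references))  # 0.75
```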