# Fine Tuning do BERT no IMDB

Nome: Elton Cardoso do Nascimento

## Instruções:
> 
> 
> Treinar e medir a acurácia de um modelo BERT (ou variantes) para classificação binária usando o dataset do IMDB (20k/5k amostras de treino/validação).
> 
> Importante:
> - [x] Deve-se implementar o próprio laço de treinamento.
> - [x] Implementar o acumulo de gradiente.
> 
> Dicas:
> - BERT geralmente costuma aprender bem uma tarefa com poucas épocas (de 3 a 5 épocas). Se tiver demorando mais de 5 épocas para chegar em 80% de acurácia, ajuste os hiperparametros.
> 
> - Solução para erro de memória:
>   - Usar bfloat16 permite quase dobrar o batch size
> 
> Opcional:
> - Pode-se usar a função trainer da biblioteca Transformers/HuggingFace para verificar se seu laço de treinamento está correto. Note que ainda assim é obrigatório implementar o laço próprio.

In [None]:
import os # Manipular arquivos
import random # Operações randômicas
import pickle # Serializar/deserializar backups
import time # Medição de tempo
from concurrent.futures import ThreadPoolExecutor # Parelização
from typing import Tuple, List, Dict, Optional # Type hints

import numpy as np # Operações vetoriais
import matplotlib.pyplot as plt # Plots
import torch # ML
from torch.utils.data import Dataset, DataLoader # Preparação de dados

try:
    import wandb # Logging
except:
    wandb = None

## Fixando a seed

In [192]:
def reset_seeds():
    random.seed(123)
    np.random.seed(123)
    torch.manual_seed(123)

In [193]:
reset_seeds()

## Preparando Dados

> Primeiro, fazemos download do dataset:

In [3]:
if not os.path.isfile("aclImdb.tgz"):
    !curl -LO http://files.fast.ai/data/aclImdb.tgz
    !tar -xzf aclImdb.tgz

### Carregando o dataset

> Criaremos uma divisão de treino (20k exemplos) e validação (5k exemplos) artificialmente.

In [4]:
max_valid = 5000

In [5]:
def load_texts(folder):
    texts = []
    for path in os.listdir(folder):
        with open(os.path.join(folder, path), encoding="utf8") as f:
            texts.append(f.read())
    return texts

In [6]:
executor = ThreadPoolExecutor(max_workers=4)

folders = ['aclImdb/train/pos', 'aclImdb/train/neg', 'aclImdb/test/pos', 'aclImdb/test/neg']

futures = []
for folder in folders:
    future = executor.submit(load_texts, folder) 

    futures.append(future)

all_texts = []

for future in futures:
    texts = future.result()

    all_texts.append(texts)

executor.shutdown()

x_train_pos = all_texts[0]
x_train_neg = all_texts[1]
x_test_pos = all_texts[2]
x_test_neg = all_texts[3]


In [7]:
x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg
y_train = [True] * len(x_train_pos) + [False] * len(x_train_neg)
y_test = [True] * len(x_test_pos) + [False] * len(x_test_neg)

In [8]:
# Embaralhamos o treino para depois fazermos a divisão treino/valid.
c = list(zip(x_train, y_train))
random.shuffle(c)
x_train, y_train = zip(*c)

x_valid = x_train[-max_valid:]
y_valid = y_train[-max_valid:]
x_train = x_train[:-max_valid]
y_train = y_train[:-max_valid]

In [9]:
print(len(x_train), 'amostras de treino.')
print(len(x_valid), 'amostras de desenvolvimento.')
print(len(x_test), 'amostras de teste.')

20000 amostras de treino.
5000 amostras de desenvolvimento.
25000 amostras de teste.


In [10]:
print('3 primeiras amostras treino:')
for x, y in zip(x_train[:3], y_train[:3]):
    print(y, x[:100])

print('3 últimas amostras treino:')
for x, y in zip(x_train[-3:], y_train[-3:]):
    print(y, x[:100])

print('3 primeiras amostras validação:')
for x, y in zip(x_valid[:3], y_test[:3]):
    print(y, x[:100])

print('3 últimas amostras validação:')
for x, y in zip(x_valid[-3:], y_valid[-3:]):
    print(y, x[:100])

3 primeiras amostras treino:
False POSSIBLE SPOILERS<br /><br />The Spy Who Shagged Me is a muchly overrated and over-hyped sequel. Int
False The long list of "big" names in this flick (including the ubiquitous John Mills) didn't bowl me over
True Bette Midler showcases her talents and beauty in "Diva Las Vegas". I am thrilled that I taped it and
3 últimas amostras treino:
False I was previously unaware that in the early 1990's Devry University (or was it ITT Tech?) added Film 
True The story and music (George Gershwin!) are wonderful, as are Levant, Guetary, Foch, and, of course, 
True This is my favorite show. I think it is utterly brilliant. Thanks to David Chase for bringing this i
3 primeiras amostras validação:
True Why has this not been released? I kind of thought it must be a bit rubbish since it hasn't been. How
True I was amazingly impressed by this movie. It contained fundamental elements of depression, grief, lon
True photography was too jumpy to follow. dark scenes hard to

In [203]:
GOOD_MOVIE = 1 #True
BAD_MOVIE = 0 #False

### Tokenizador

Preparamos o tokenizador para uso. No caso vamos utilizar o tokenizador preparado para o modelo BERT:

In [11]:
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-cased')

Using cache found in C:\Users\Elton/.cache\torch\hub\huggingface_pytorch-transformers_main
  from .autonotebook import tqdm as notebook_tqdm


Podemos testar o tokenizador imprimindo uma sequência (observe o token inicial "101"=\<CLS> e final "102"=\<SEP>):

In [213]:
tokens = tokenizer(x_train[0], add_special_tokens=True, padding="max_length", max_length=512)

tokens["input_ids"][:10], tokens["input_ids"][-10:]

([101, 153, 9025, 13882, 13360, 2036, 16625, 2346, 17656, 9637],
 [156, 2328, 12165, 2508, 1110, 1141, 10010, 10866, 119, 102])

### Dataset e Dataloader

Definimos o dataset para realizar a tokenizador e manipular os dados:

In [156]:
class IMDB_Dataset(Dataset):
    '''
    Dataset for sentiment analisys

    Input: tokenized review and mask (for padding).
    Output: if is a good (1) or bad (0) review. 
    '''
    def __init__(self, x_data:List[str], y_data:List[bool], tokenizer) -> None:
        """
        Creates a new dataset.

        Args:
            x_data (List[str]): dataset reviews.
            y_data (List[bool]): dataset targets.
            tokenizer: tokenizer to encode reviews.

        """

        super().__init__()

        self._x_data = tokenizer(x_data, 
                                 return_tensors="pt", #Return as torch tensor 
                                 padding=True, #Add padding to small sequences
                                 return_token_type_ids=False, #Don't return sequence mask (only one sequence)
                                 truncation=True) #Truncate big sentences (max = 512 tokens, with CLS and SEP)

        self._y_data = torch.tensor(y_data, dtype=torch.float32)

        self._size = len(self._y_data)

    def __len__(self) -> int:
        """
        Gets the size of the dataset.

        Returns:
            int: dataset size.
        """

        return self._size
    
    def __getitem__(self, idx:int) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """
        Gets a item of the dataset.

        Args:
            idx (int): data index.

        Returns:
            torch.Tensor: dataset input.
            torch.Tensor: dataset attention mask. 
            torch.Tensor: dataset target.
        """
        return self._x_data["input_ids"][idx], self._x_data["attention_mask"][idx], self._y_data[idx]

In [157]:
datasets = {}

xs = [x_train, x_valid, x_test]
ys = [y_train, y_valid, y_test]
names = ["train", "val", "test"]

for i in range(3):
    dataset = IMDB_Dataset(xs[i], ys[i], tokenizer)
    datasets[names[i]] = dataset

Para evitar precisamos realizar várias vezes a tokenização durante o desenvolvimento, podemos serializar o dataset para posteriormente deserializá-lo:

In [158]:
file_name = "datasets.bin"
with open(file_name, "wb") as file:
    pickle.dump(datasets, file)

In [159]:
with open(file_name, "rb") as file:
    datasets = pickle.load(file)

E defimos uma função para criar os dataloaders a partir dos ddatasets e batch size:

In [164]:
def create_dataloaders(datasets:Dict[str, Dataset], batch_size:int) -> Dict[str, DataLoader]:
    '''
    Generate dataloaders from datasets.

    Args:
        datasets (Dict[str, Dataset]): named datasets.
        batch_size (int): batch sizes.

    Returns:
        Dict[str, DataLoader]: dataloaders for the datasets.
    '''


    dataloaders = {}

    for name in names:
        dataloaders[name] = DataLoader(datasets[name], batch_size=batch_size, shuffle=True)
    
    return dataloaders

## Preparação do modelo

Prepamos o modelo a ser utilizado, que é um modelo BERT com uma camada adicional para realizar a classificação, que recebe como entrada o embedding final relacionado ao token CLS:

In [167]:
class BinaryClassifierBERT(torch.nn.Module):
    '''
    Classifier model using BERT.
    '''

    def __init__(self, dropout_rate:float=0) -> None:
        '''
        Model constructor.

        Args:
            dropout_rate (float, optional): Dropout before the final layer. Defaults to 0.
        '''
        super().__init__()
        
        self.bert = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-cased')
        
        self.dropout = torch.nn.Dropout(dropout_rate)
        self.linear = torch.nn.Linear(768, 1)

    def forward(self, input_ids:torch.Tensor, attention_masks:Optional[torch.Tensor]=None) -> torch.Tensor:
        '''
        Computes the classification for the input.

        Args:
            input_ids (torch.Tensor): tokenized input.
            attention_masks (torch.Tensor, optional): attention mask of the input. Defaults to None.

        Returns:
            torch.Tensor: inference result.
        '''
        
        bert_output = self.bert(input_ids=input_ids, attention_mask=attention_masks)
        c_vector = bert_output.last_hidden_state[:, 0]

        y = self.dropout(c_vector)
        y = self.linear(y)

        return y

## Treino

Nesta seção iremos realizar o treino, iniciando pela definição de algumas funções auxiliares.

### Funções auxiliares

Iremos definir três funções auxiliares: uma para calcular a perplexidade a partir da loss, outra para printar informações e uma final para calcular a loss:

In [177]:
def ppl(loss:torch.Tensor) -> torch.Tensor:
    """
    Computes the perplexity from the loss.

    Args:
        loss (torch.Tensor): loss to compute the perplexity.

    Returns:
        torch.Tensor: corresponding perplexity.
    """
    return torch.exp(loss)

In [190]:
def print_info(loss_value:torch.Tensor, epoch:int, total_epochs:int, time:float=0.0):
    """
    Prints the information of a epoch.

    Args:
        loss_value (torch.Tensor): epoch loss.
        epoch (int): epoch number.
        total_epochs (int): total number of epochs. 
        time (float, optional): time to run the epoch. Don't print if is 0.0. Defaults to 0.0.
    """
    ppl_value = ppl(loss_value)

    
    print(f'Epoch [{epoch+1}/{total_epochs}], \
            Loss: {loss_value.item():.4f}, \
            Perplexity: {ppl_value.item():.4f}', end="")
    
    if time != 0:
        print(f", Elapsed Time: {time:.2f} sec")    
    else:
        print("")

In [178]:
MODE_TRAIN = 0
MODE_EVALUATE = 1

In [176]:
def compute_loss(model:torch.nn.Module, loader:DataLoader, criterion:torch.nn.Module, mode:int = MODE_EVALUATE) -> torch.Tensor:
    """
    Computes the loss from a model across a dataset.

    Args:
        model (torch.nn.Module): model to evaluate.
        loader (DataLoader): dataset.
        criterion (torch.nn.Module): loss function to compute.
        mode (int): mode of the computation. 
                    If MODE_EVALUATE, computes without gradient, in eval mode and detachs loss.
                    If MODE_TRAIN, computes with gradient and in train mode.
                    Default is MODE_EVALUATE.

    Returns:
        torch.Tensor: resulting loss.
    """
    device = next(iter(model.parameters())).device

    if mode == MODE_EVALUATE:
        model.eval()
        torch.set_grad_enabled(False)
    elif mode == MODE_TRAIN:
        model.train()
        torch.set_grad_enabled(True)
    else:
        raise ValueError(f"Unknown mode: {mode}.")

    total_loss = torch.tensor(0, dtype=torch.float32, device=device)
    n = 0
    for inputs, masks, targets in loader:
        inputs = inputs.to(device)
        masks = masks.to(device)

        targets = targets.reshape(-1)
        targets = targets.to(device)
        
        logits = model(inputs, masks)
        logits = logits.view(-1, logits.shape[-1])

        loss = criterion(logits.squeeze(), targets)
        total_loss += loss*targets.size(0)

        n += targets.size(0)

    total_loss /= n 
    
    torch.set_grad_enabled(True)

    if mode == MODE_EVALUATE:
        total_loss = total_loss.detach()

    return total_loss.detach()


### Inicialização

Começamos o processo de treino inicializando as variáveis.

Definimos se será realizado o logging utilizando o wandb:

In [None]:
use_wandb = False

Checamos se existe uma GPU disponível:

In [None]:
# Verifica se há uma GPU disponível e define o dispositivo para GPU se possível, caso contrário, usa a CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

Definimos os parâmetros de treino:

In [None]:
batch_size = 64
dropout_rate = 0
lr = 5e-5
n_epoch = 2
optimizer_class = torch.optim.Adam
weight_decay = 0

config = {
    "batch_size": batch_size,
    "dropout_rate": dropout_rate,
    "lr": lr,
    "n_epoch": n_epoch,
    "optimizer_class": optimizer_class.__name__,
    "weight_decay": weight_decay,
}

if use_wandb:
    wandb.init(project="IA024-04-TransformDecoder", config=config)

Reiniciamos as sementes:

In [None]:
reset_seeds()

Criamos o modelo, loss, otimizador e dataloaders:

In [169]:
model = BinaryClassifierBERT(dropout_rate)
model.to(device)

criterion = torch.nn.BCEWithLogitsLoss()
optimizer = optimizer_class(model.parameters(), lr=lr, weight_decay=weight_decay)
dataloaders = create_dataloaders(datasets, batch_size)

Using cache found in C:\Users\Elton/.cache\torch\hub\huggingface_pytorch-transformers_main


### Treino

E finalmente podemos realizar o processo de treino em si:

In [None]:
hist = {}
hist["loss_train"] = []
hist["loss_val"] = []
hist["ppl_train"] = []
hist["ppl_val"] = []

#Informações antes da primeira epoch
prev_loss = compute_loss(model, dataloaders["train"], criterion, MODE_EVALUATE)
print_info(prev_loss, -1, n_epoch, 0)

for epoch in range(n_epoch):
    start_time = time.time() 

    loss_train = compute_loss(model, dataloaders["train"], criterion, MODE_TRAIN)

    end_time = time.time() 
    
    epoch_duration = end_time - start_time 

    ppl_train = ppl(loss_train)

    print_info(loss_train, epoch, n_epoch, epoch_duration)
    
    #Validation stats
    print("VAL ", end="")
    loss_val = compute_loss(model, dataloaders["val"], criterion, MODE_EVALUATE)
    ppl_val = ppl(loss_val)
    print_info(loss_val, epoch, n_epoch)

    #Save history
    hist["loss_train"].append(loss_train.item())
    hist["loss_val"].append(loss_val.item())
    hist["ppl_train"].append(ppl_train.item())
    hist["ppl_val"].append(ppl_val.item())

    log = {
        "loss_train": loss_train.item(),
        "loss_val": loss_val.item(),
        "ppl_train": ppl_train.item(),
        "ppl_val": ppl_val.item()
    }

    if use_wandb:
        wandb.log(log)

for key in hist:
    hist[key] = np.array(hist[key])

if use_wandb:
    wandb.finish()

Plotamos os gráficos das estatísticas obtidas durante o treinamento, onde podemos observar que TODO

In [None]:
plt.plot(hist["loss_train"], "o-")
plt.plot(hist["loss_val"], "o-")

plt.legend(["Train", "Val"])
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Loss history")

plt.show()

In [None]:
plt.plot(hist["ppl_train"], "o-")
plt.plot(hist["ppl_val"], "o-")

plt.legend(["Train", "Val"])
plt.xlabel("Epoch")
plt.ylabel("PPL")
plt.title("Perplexity history")

plt.show()

## Avaliação

Para avaliação começamos calculando a loss no dataset de teste:

In [None]:
test_loss = compute_loss(model, dataloaders["test"], criterion, mode=MODE_EVALUATE)
test_ppl = ppl(test_loss)

test_ppl.item()

Calculamos quantos modelos adicionais foram necessários:

In [189]:
n_param_bert = sum([p.numel() for p in model.bert.parameters()])

n_param = sum([p.numel() for p in model.parameters()])
n_param-n_param_bert

769

E verificamos qualitativamente a saída do modelo:

In [None]:
tokens = tokenizer('''This must be gambling debt. Because only when someone threatens to break your legs 
                      if you don't pay will you go and agree to make a film like this.''', 
                    return_tensors="pt",
                    return_token_type_ids=False,
                    truncation=True)

with torch.no_grad():
    logits = model(tokens["input_ids"], tokens["attention_mask"])

print("Result:", logits.item())
print(f"Is this a good movie? {torch.round(logits).item() == GOOD_MOVIE}")

Podemos observar que ele realizou TODO corretamente/incorretamente TODO a classificação