# Fine Tuning do BERT no IMDB

Nome: Elton Cardoso do Nascimento

## Instruções:
> 
> 
> Treinar e medir a acurácia de um modelo BERT (ou variantes) para classificação binária usando o dataset do IMDB (20k/5k amostras de treino/validação).
> 
> Importante:
> - Deve-se implementar o próprio laço de treinamento.
> - Implementar o acumulo de gradiente.
> 
> Dicas:
> - BERT geralmente costuma aprender bem uma tarefa com poucas épocas (de 3 a 5 épocas). Se tiver demorando mais de 5 épocas para chegar em 80% de acurácia, ajuste os hiperparametros.
> 
> - Solução para erro de memória:
>   - Usar bfloat16 permite quase dobrar o batch size
> 
> Opcional:
> - Pode-se usar a função trainer da biblioteca Transformers/HuggingFace para verificar se seu laço de treinamento está correto. Note que ainda assim é obrigatório implementar o laço próprio.

## Fixando a seed

In [2]:
import os
import random
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from typing import Tuple, List, Union

import numpy as np
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader # Preparação de dados

In [3]:
random.seed(123)
np.random.seed(123)
torch.manual_seed(123);

## Preparando Dados

> Primeiro, fazemos download do dataset:

In [4]:
if not os.path.isfile("aclImdb.tgz"):
    !curl -LO http://files.fast.ai/data/aclImdb.tgz
    !tar -xzf aclImdb.tgz

### Carregando o dataset

> Criaremos uma divisão de treino (20k exemplos) e validação (5k exemplos) artificialmente.

In [5]:
max_valid = 5000

In [6]:
def load_texts(folder):
    texts = []
    for path in os.listdir(folder):
        with open(os.path.join(folder, path), encoding="utf8") as f:
            texts.append(f.read())
    return texts

In [7]:
executor = ThreadPoolExecutor(max_workers=4)

folders = ['aclImdb/train/pos', 'aclImdb/train/neg', 'aclImdb/test/pos', 'aclImdb/test/neg']

futures = []
for folder in folders:
    future = executor.submit(load_texts, folder) 

    futures.append(future)

all_texts = []

for future in futures:
    texts = future.result()

    all_texts.append(texts)

executor.shutdown()

x_train_pos = all_texts[0]
x_train_neg = all_texts[1]
x_test_pos = all_texts[2]
x_test_neg = all_texts[3]


In [8]:
x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg
y_train = [True] * len(x_train_pos) + [False] * len(x_train_neg)
y_test = [True] * len(x_test_pos) + [False] * len(x_test_neg)

In [9]:
# Embaralhamos o treino para depois fazermos a divisão treino/valid.
c = list(zip(x_train, y_train))
random.shuffle(c)
x_train, y_train = zip(*c)

x_valid = x_train[-max_valid:]
y_valid = y_train[-max_valid:]
x_train = x_train[:-max_valid]
y_train = y_train[:-max_valid]

In [10]:
print(len(x_train), 'amostras de treino.')
print(len(x_valid), 'amostras de desenvolvimento.')
print(len(x_test), 'amostras de teste.')

20000 amostras de treino.
5000 amostras de desenvolvimento.
25000 amostras de teste.


In [11]:
print('3 primeiras amostras treino:')
for x, y in zip(x_train[:3], y_train[:3]):
    print(y, x[:100])

print('3 últimas amostras treino:')
for x, y in zip(x_train[-3:], y_train[-3:]):
    print(y, x[:100])

print('3 primeiras amostras validação:')
for x, y in zip(x_valid[:3], y_test[:3]):
    print(y, x[:100])

print('3 últimas amostras validação:')
for x, y in zip(x_valid[-3:], y_valid[-3:]):
    print(y, x[:100])

3 primeiras amostras treino:
False POSSIBLE SPOILERS<br /><br />The Spy Who Shagged Me is a muchly overrated and over-hyped sequel. Int
False The long list of "big" names in this flick (including the ubiquitous John Mills) didn't bowl me over
True Bette Midler showcases her talents and beauty in "Diva Las Vegas". I am thrilled that I taped it and
3 últimas amostras treino:
False I was previously unaware that in the early 1990's Devry University (or was it ITT Tech?) added Film 
True The story and music (George Gershwin!) are wonderful, as are Levant, Guetary, Foch, and, of course, 
True This is my favorite show. I think it is utterly brilliant. Thanks to David Chase for bringing this i
3 primeiras amostras validação:
True Why has this not been released? I kind of thought it must be a bit rubbish since it hasn't been. How
True I was amazingly impressed by this movie. It contained fundamental elements of depression, grief, lon
True photography was too jumpy to follow. dark scenes hard to

### Tokenizador

In [12]:
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-cased')

Using cache found in C:\Users\Elton/.cache\torch\hub\huggingface_pytorch-transformers_main
  from .autonotebook import tqdm as notebook_tqdm


In [13]:
BERT_CLS = 101
BERT_SEP = 102

In [14]:
tokens = tokenizer.encode(x_train[0], add_special_tokens=True, padding="max_length", max_length=512)

tokens[:10]

[101, 153, 9025, 13882, 13360, 2036, 16625, 2346, 17656, 9637]

### Dataset e Dataloader

In [15]:
class IMDB_Dataset(Dataset):
    def __init__(self, x_data:Tuple[str], y_data:Tuple[bool], tokenizer) -> None:
        super().__init__()

        x_tokens = torch.empty((len(x_data), 512), dtype=torch.float32)

        for i in range(len(x_data)):
            x = x_data[i]
            tokens = tokenizer.encode(x, add_special_tokens=True, padding="max_length", max_length=512)

            if len(tokens) > 512:
                tokens = tokens[:512]
                tokens[-1] = BERT_SEP
                
            x_tokens[i] = torch.Tensor(tokens)

        self._x_data = x_tokens
        
        self._y_data = torch.tensor(y_data, dtype=torch.float32)

        self._size = len(self._x_data)

    def __len__(self) -> int:
        """
        Gets the size of the dataset.

        Returns:
            int: dataset size.
        """

        return self._size
    
    def __getitem__(self, idx:int) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Gets a item of the dataset.

        Args:
            idx (int): data index.

        Returns:
            torch.Tensor: dataset input. 
            torch.Tensor: dataset target.
        """
        return self._x_data[idx], self._y_data[idx]

In [19]:
args = [(x_train, y_train, tokenizer), (x_valid, y_valid, tokenizer), (x_test, y_test, tokenizer)]

xs = [x_train, x_valid, x_test]
ys = [y_train, y_valid, y_test]

with ProcessPoolExecutor() as executor:
    futures = []
    for i in range(3):
        future = executor.submit(IMDB_Dataset, x_data=xs[i], y_data=ys[i], tokenizer=tokenizer) 

        futures.append(future)
    
    datasets = []

    for future in futures:
        dataset = future.result()

        datasets.append(dataset)

BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

thread: 2m58s

In [None]:
raise ValueError

## Preparação do modelo

https://pytorch.org/hub/huggingface_pytorch-transformers/

In [None]:

base_model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-cased')

Using cache found in C:\Users\Elton/.cache\torch\hub\huggingface_pytorch-transformers_main
  from .autonotebook import tqdm as notebook_tqdm
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Using cache found in C:\Users\Elton/.cache\torch\hub\huggingface_pytorch-transformers_main


In [None]:
config = torch.hub.load('huggingface/pytorch-transformers', 'config', 'bert-base-uncased') 

Using cache found in C:\Users\Elton/.cache\torch\hub\huggingface_pytorch-transformers_main
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [None]:
tokens_tensor = torch.tensor([tokens, tokens])

In [None]:
output = base_model(tokens_tensor)

In [None]:
output.last_hidden_state.shape

torch.Size([2, 512, 768])

In [None]:
class SentimentAnalisysBERT(torch.nn.Module):
    def __init__(self, dropout_rate:float=0) -> None:
        super().__init__()
        
        self.bert_model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-cased')
        
        self.dropout = torch.nn.Dropout(dropout_rate)
        self.linear = torch.nn.Linear(768, 1)
        self.relu = torch.nn.ReLU()

    def forward(self, x:torch.Tensor) -> torch.Tensor:
        bert_output = self.bert_model(x)
        c_vector = bert_output.last_hidden_state[:, 0]

        y = self.dropout(c_vector)
        y = self.linear(y)
        y = self.relu(y)

        return y

In [None]:
model = SentimentAnalisysBERT()

Using cache found in C:\Users\Elton/.cache\torch\hub\huggingface_pytorch-transformers_main


108310272

In [None]:
n_param_bert = sum([p.numel() for p in model.bert_model.parameters()])

n_param = sum([p.numel() for p in model.parameters()])
n_param-n_param_bert

769

In [None]:
logits = model(tokens_tensor)

In [None]:
criterion = torch.nn.BCEWithLogitsLoss()


In [None]:
targets = torch.tensor(y_train[:2], dtype=torch.float32)

In [None]:
targets.dtype

torch.float32