# **Task \#4 B**: Machine Learning MC886/MO444

## **Natural Language Processing (NLP)**


In [None]:
print("Gabriel Borges Gutierrez" + " RA 237300")
print("Marcelo Antunes Soares Fantini" + " RA 108341")
print("Rubens de Castro Pereira" + " RA 217146")
print("")
print("Rubens's notebook - v01")

Gabriel Borges Gutierrez RA 237300
Marcelo Antunes Soares Fantini RA 108341
Rubens de Castro Pereira RA 217146

Rubens's notebook - v01


## Objective:

This notebook has two main objectives: you can either fine-tune a BERT model for sentiment analysis or fine-tune a T5 model to translate English to Portuguese.

**You can choose which task to perform. The BERT activity is relatively easy and should take less time than the T5 task. However, fine-tuning these models takes time. Therefore, it is recommended to first test the models on a small subset of the data (such as one batch) to ensure the functions are working correctly. Once confirmed, you can proceed to train the models on the entire dataset.**

**If you complete both tasks, you will earn extra points.**
**Note: In this work, you can use scikit-learn, PyTorch, and the Hugging Face API.**


## **Sentiment Analysis**

Sentiment analysis is a task in natural language processing that involves determining the sentiment expressed in a given text, classifying it as positive, negative, or neutral. It helps analyze people's opinions and emotions from text data, enabling businesses to understand customer feedback, monitor brand reputation, and make informed decisions.

In this notebook, we will use the IMDB Dataset, which is widely used in the field of natural language processing and sentiment analysis. It comprises a large collection of movie reviews from the IMDB website, with each review labeled as either positive or negative based on the sentiment expressed in the text.

![bert_model](https://drive.google.com/uc?export=view&id=1rWKk7K5-0MX8EkjPeRZTaG7byFQuD6Cx)

BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art deep learning model for natural language processing (NLP). It is based on the Transformer architecture and is pre-trained on a large corpus of text data. BERT is designed to understand the context and meaning of words in a sentence by considering both the left and right context, enabling it to capture intricate language patterns. It has achieved remarkable results across various NLP tasks, including text classification, named entity recognition, and question answering, and has significantly advanced the field of NLP.


In [None]:
from google.colab import drive

drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Download Dependencies


In [None]:
%%time
!wget -nc -q http://files.fast.ai/data/aclImdb.tgz
!tar -xzf aclImdb.tgz
!pip install datasets transformers -q
print(f'Dataset downloaded with success.')

Dataset downloaded with success.
CPU times: user 220 ms, sys: 35 ms, total: 255 ms
Wall time: 39 s


In [None]:
import os
import copy
import torch
import random
import numpy as np
import matplotlib.pyplot as plt

from tqdm.notebook import tqdm
from torch.utils.data import DataLoader, Dataset
from torch import nn
# transformers.AdamW is deprecated in favor of the PyTorch optimizer
from torch.optim import AdamW
from transformers import (
    BertTokenizerFast,
    BertForSequenceClassification,
    get_linear_schedule_with_warmup,
)

from datasets import load_metric

print(f"Libraries imported successfully.")

Libraries imported successfully.


### Parameters


In [None]:
params = {
    'bert_version': 'bert-base-uncased',
    # There are multiple versions of BERT available at the following links:
    # https://huggingface.co/google/bert_uncased_L-12_H-768_A-12
    # https://huggingface.co/bert-base-uncased
    # You can explore these links to access different versions of BERT.

    'batch_size': 8,
    'learning_rate': 1e-4,  # Choose a learning rate between 1e-4 and 1e-5
    # The maximum length of the sentence (can be adjusted)
    'max_length': 300,
    # Choose a value between 1 and 5 (or alternatively, use early stopping)
    'epochs': 1,
}

In [None]:
# Important: Fix seeds so we can replicate results
random_seed = 42
random.seed(random_seed)
np.random.seed(random_seed)
torch.manual_seed(random_seed)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")

Device: cuda:0


### Load data

Here we are loading the data. For training, we will use 20k samples, 5k samples for validation, and 25k samples for testing.


In [None]:
max_valid = 5000


def load_texts(folder):
    texts = []
    for path in os.listdir(folder):
        # Read explicitly as UTF-8 so the result does not depend on the platform default
        with open(os.path.join(folder, path), encoding='utf-8') as f:
            texts.append(f.read())
    return texts


x_train_pos = load_texts('aclImdb/train/pos')
x_train_neg = load_texts('aclImdb/train/neg')
x_test_pos = load_texts('aclImdb/test/pos')
x_test_neg = load_texts('aclImdb/test/neg')

x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg
y_train = [True] * len(x_train_pos) + [False] * len(x_train_neg)
y_test = [True] * len(x_test_pos) + [False] * len(x_test_neg)

c = list(zip(x_train, y_train))
random.shuffle(c)
x_train, y_train = zip(*c)

x_valid = x_train[-max_valid:]
y_valid = y_train[-max_valid:]
x_train = x_train[:-max_valid]
y_train = y_train[:-max_valid]

print('\nFirst three train samples:')
for i, (source, target) in enumerate(zip(x_train[:3], y_train[:3])):
    print(f"{i}: Input: {source}\n   Target: {'positive' if target else 'negative'}\n")

print('-'*200)
print('\nFirst three valid samples:')
for i, (source, target) in enumerate(zip(x_valid[:3], y_valid[:3])):
    print(f"{i}: Input: {source}\n   Target: {'positive' if target else 'negative'}\n")

print('-'*200)
print(f'Train size: {len(x_train)}')
print(f'Valid size: {len(x_valid)}')
print(f'Test  size: {len(x_test)}')


First three train samples:
0: Input: On the surface, this movie would appear to deal with the psychological process called individuation, that is how to become a true self by embracing the so-called 'dark' side of human nature. Thus, we have the Darkling, a classic shadowy devilish creature desperately seeking the company (that is, recognition) of men, and the story revolves around the various ways in which this need is handled, more or less successfully. <br /><br />However, if we dig a little deeper, we find that what this movie is actually about is how you should relate to your car like you would to any other person: - in the opening scene, the main character (male car mechanic fallen from grace)is collecting bits and pieces from car wrecks with his daughter, when a car wreck nearly smashes the little girl. Lesson #1: Cars are persons embodied with immortal souls, and stealing from car wrecks is identical with grave robbery. The wicked have disturbed the dead and must be punished. 

In [None]:
# y_train = [True] * len(x_train_pos) + [False] * len(x_train_neg)
# print([True])
# print(len(x_train_pos))
# print([True] * len(x_train_pos))
# print(len([True] * len(x_train_pos)))
# print()
# print([False])
# print(len(x_train_neg))
# y_train
# y_test

### Tokenizer

To use text as input for a deep learning model, we first need to tokenize each sentence based on a set of rules. After tokenization, each token is mapped to its corresponding index in the vocabulary, and the resulting sequence of indices forms the model's input vector. The model uses this vector during training to update its weights. Here is an example demonstrating how the BERT tokenizer works:

![bert_tokenizer](https://drive.google.com/uc?export=view&id=11LioDFis0JE3ghr672PEIeaAxZO42gUL)

Initially, the input sentence is divided into tokens drawn from the BERT tokenizer's predefined vocabulary. Next, the tokenizer adds two special tokens: CLS and SEP. CLS marks the start of the input and is used for tasks like classification, while SEP marks the separation between sentences. Additionally, to ensure all inputs have the same length, the tokenizer appends the PAD token to shorter inputs.

Finally, each token is converted into a predetermined index for BERT input. This indexing enables the BERT model to train and update its weights effectively.
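
As a rough illustration of the steps above, the sketch below mimics them with a toy whitespace tokenizer. The word entries in `TOY_VOCAB` and their ids are invented for this example, and `toy_encode` is a hypothetical helper; only the special-token ids (`[PAD]` = 0, `[UNK]` = 100, `[CLS]` = 101, `[SEP]` = 102) match `bert-base-uncased`. The real tokenizer splits text into WordPiece subwords rather than on whitespace.

```python
# Toy vocabulary: special-token ids follow bert-base-uncased,
# the word ids are invented for this illustration.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "the": 1, "movie": 2, "was": 3, "great": 4}


def toy_encode(sentence, max_length):
    """Lowercase, split on whitespace, add [CLS]/[SEP], truncate and pad."""
    tokens = sentence.lower().split()
    # Reserve two positions for [CLS] and [SEP] when truncating.
    tokens = ["[CLS]"] + tokens[:max_length - 2] + ["[SEP]"]
    # 1 marks real tokens, 0 marks padding positions.
    attention_mask = [1] * len(tokens) + [0] * (max_length - len(tokens))
    tokens = tokens + ["[PAD]"] * (max_length - len(tokens))
    # Words missing from the vocabulary fall back to [UNK].
    input_ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = toy_encode("The movie was great", max_length=8)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The `attention_mask` plays the same role as in the real encodings: it tells the model which positions hold actual tokens and which are padding.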


In [None]:
class IMDBDataset(Dataset):
    def __init__(self, data, labels):
        super().__init__()
        self.data = data
        self.labels = torch.Tensor(labels).long()

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        item = {key: value[index] for key, value in self.data.items()}
        item['labels'] = self.labels[index]
        return item


print(f'Class IMDBDataset created successfully ')

Class IMDBDataset created successfully 


In [None]:
%%time

tokenizer = BertTokenizerFast.from_pretrained(params['bert_version'], disable_tqdm=False)


## TOKENIZE
train_encodings = tokenizer(list(x_train), truncation=True, padding=True, return_tensors='pt', max_length=params['max_length'])
valid_encodings = tokenizer(list(x_valid), truncation=True, padding=True, return_tensors='pt', max_length=params['max_length'])
test_encodings  = tokenizer(list(x_test),  truncation=True, padding=True, return_tensors='pt', max_length=params['max_length'])

## DATASET
train_dataset = IMDBDataset(data=train_encodings, labels=y_train)
valid_dataset = IMDBDataset(data=valid_encodings, labels=y_valid)
test_dataset  = IMDBDataset(data=test_encodings, labels=y_test)

## DATALOADER
train_loader = DataLoader(train_dataset, batch_size=params['batch_size'], shuffle=True, num_workers=1)
valid_loader = DataLoader(dataset=valid_dataset, batch_size=params['batch_size'], num_workers=1)
test_loader  = DataLoader(dataset=test_dataset, batch_size=params['batch_size'], num_workers=1)

print()
print(f'Dataset and loader created successfully')
version = params['bert_version']
batch_size = params['batch_size']
print(f'params[bert_version] : {version}')
print(f'params[batch_size]   : {batch_size}')
print()



Dataset and loader created successfully
params[bert_version] : bert-base-uncased
params[batch_size]   : 8

CPU times: user 1min 18s, sys: 2.53 s, total: 1min 20s
Wall time: 1min 9s


In [None]:
# print(len(train_dataset))
# print(len(train_loader))
# print()

# print(len(valid_dataset))
# print(len(valid_loader))
# print()

# print(len(test_dataset))
# print(len(test_loader))
# print()

# Some of the common BERT tokens
# marker for ending of a sentence
print(tokenizer.sep_token, tokenizer.sep_token_id)
# start of each sentence, so BERT knows we’re doing classification
print(tokenizer.cls_token, tokenizer.cls_token_id)
print(tokenizer.pad_token, tokenizer.pad_token_id)  # special token for padding
# placeholder for tokens not found in the tokenizer's vocabulary
print(tokenizer.unk_token, tokenizer.unk_token_id)

[SEP] 102
[CLS] 101
[PAD] 0
[UNK] 100


In [None]:
# Examples
data = next(iter(train_loader))
print(data.keys())

print(data["input_ids"].shape)
print(data["token_type_ids"].shape)
print(data["attention_mask"].shape)
print(data["labels"].shape)

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
torch.Size([8, 300])
torch.Size([8, 300])
torch.Size([8, 300])
torch.Size([8])


### Useful functions

**Note:** The following functions are provided as suggestions. You are free to modify and create your own functions, classes, or code. Feel free to customize!

**If the batch does not fit in memory, use gradient accumulation.**

**Hint 1:** Example of gradient accumulation in PyTorch: https://kozodoi.me/blog/20210219/gradient-accumulation.

**Hint 2:** If preferred, you can utilize the [Trainer](https://huggingface.co/docs/transformers/training) from Hugging Face for assistance.
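
A minimal sketch of gradient accumulation, assuming a generic PyTorch model rather than the BERT setup of this notebook. The tiny linear model, the random data, and the `accum_steps` value are placeholders chosen for the example. Each mini-batch loss is divided by `accum_steps` so that the accumulated gradient matches the average over the larger effective batch, and the optimizer steps only once every `accum_steps` mini-batches.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Placeholder model and data (not the notebook's BERT model).
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(8, 4)
targets = torch.randint(0, 2, (8,))
# Four mini-batches of size 2 each.
mini_batches = list(zip(inputs.split(2), targets.split(2)))

accum_steps = 4  # effective batch size = 2 * 4 = 8
num_updates = 0

optimizer.zero_grad()
for step, (x, y) in enumerate(mini_batches, start=1):
    loss = loss_fn(model(x), y)
    # Scale the loss so the summed gradients equal the mean over the effective batch.
    (loss / accum_steps).backward()

    if step % accum_steps == 0:
        optimizer.step()       # one weight update per accum_steps mini-batches
        optimizer.zero_grad()  # reset the accumulated gradients
        num_updates += 1

print(f"Weight updates performed: {num_updates}")
```

In a training function such as `train_epoch`, the same pattern would replace the per-batch `optimizer.step()` and `optimizer.zero_grad()` calls.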


### From Kaggle Sentiment Analysis using BERT:

https://www.kaggle.com/code/prakharrathi25/sentiment-analysis-using-bert/notebook


In [None]:
def train_epoch(model, data_loader, optimizer, device):
    model.train()
    metric = load_metric("accuracy")
    losses = []
    for batch in data_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask, labels=labels)

        loss = outputs.loss

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

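        predictions = outputs.logits.argmax(dim=-1)

        # Move tensors to the CPU before handing them to the metric
        metric.add_batch(predictions=predictions.cpu(), references=labels.cpu())
        losses.append(loss.item())

    # metric.compute() returns a dict such as {'accuracy': ...}; return the float
    return np.mean(losses), metric.compute()['accuracy']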

@torch.no_grad()
def evaluate_epoch(model, data_loader, device):
    model.eval()
    metric = load_metric("accuracy")
    losses = []
    for batch in data_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask, labels=labels)

        loss = outputs.loss
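        predictions = outputs.logits.argmax(dim=-1)

        # Move tensors to the CPU before handing them to the metric
        metric.add_batch(predictions=predictions.cpu(), references=labels.cpu())
        losses.append(loss.item())

    # metric.compute() returns a dict such as {'accuracy': ...}; return the float
    return np.mean(losses), metric.compute()['accuracy']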

In [None]:
class EarlyStopping():
    '''
      Early stopping prevents overfitting by stopping the training process when
      the model's performance on a validation set starts to worsen.

      Parameters:
      -----------
      patience  : int
        Tolerance for no improvement.
      min_delta : float
        Minimum change required.
    '''

    def __init__(self, patience=1, min_delta=0.0001):
        self.patience = patience
        self.counter = 0
        self.best_score = None
        self.best_model_wts = None
        self.min_delta = min_delta

    def __call__(self, model, val_loss):
        score = -val_loss

        if self.best_score is None:
            self.best_score = score
            self.best_model_wts = copy.deepcopy(model.state_dict())
            return False

        elif score < self.best_score + self.min_delta:
            self.counter += 1
            if self.counter >= self.patience:
                return True
        else:
            self.best_score = score
            self.best_model_wts = copy.deepcopy(model.state_dict())
            self.counter = 0
        return False
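
To see how the `patience` and `min_delta` rule behaves, the sketch below replays the same comparison on a made-up list of validation losses, without a model. `should_stop_at` is a hypothetical helper written for this illustration; it mirrors the counter logic of `EarlyStopping.__call__` but does not save any weights.

```python
def should_stop_at(val_losses, patience=1, min_delta=0.0001):
    """Return the epoch index at which training would stop, or None."""
    best_score = None
    counter = 0
    for epoch, val_loss in enumerate(val_losses):
        score = -val_loss  # higher is better
        if best_score is None:
            best_score = score
        elif score < best_score + min_delta:
            # No sufficient improvement over the best score so far.
            counter += 1
            if counter >= patience:
                return epoch
        else:
            best_score = score
            counter = 0
    return None


# Made-up validation losses: improvement stalls after the third epoch.
losses = [0.60, 0.45, 0.40, 0.41, 0.42]
print(should_stop_at(losses, patience=1))
print(should_stop_at(losses, patience=2))
```

A larger `patience` tolerates more consecutive epochs without improvement before stopping, at the cost of extra training time.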

### Train the BERT model

**Note:** The following functions are provided as suggestions. You are free to modify and create your own functions, classes, or code. Feel free to customize!

**Hint 1:** See the [BertForSequenceClassification](https://huggingface.co/docs/transformers/v4.30.0/en/model_doc/bert#transformers.BertForSequenceClassification) documentation for more information.

**Hint 2:** Instead of saving information by epoch, you can save it by step. A step corresponds to a single update of the model's weights based on a mini-batch of data, while an epoch represents a complete pass through the entire training dataset. The number of steps is determined by the batch size and the total number of training examples, whereas the number of epochs is a user-defined hyperparameter.

**Hint 3:** BERT adapts very well to classification problems, so in just 3 or 4 epochs, the results are already acceptable (**BERT Base**). If the results are still not good, check the learning rate.

**Hint 4:** Run small tests first, such as training on a single batch, to verify that the training and evaluation functions work. Once they do, proceed to train the model on the entire dataset.
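
As a worked example of the step/epoch distinction in Hint 2, the snippet below computes the number of optimizer steps for the settings used in this notebook (20,000 training samples, a `batch_size` of 8, and 1 epoch). It assumes one weight update per mini-batch, that is, no gradient accumulation.

```python
import math

num_train_samples = 20_000  # training split size used in this notebook
batch_size = 8              # params['batch_size']
epochs = 1                  # params['epochs']

# One step = one weight update on one mini-batch; the last batch may be smaller.
steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * epochs

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps:     {total_steps}")
```

With the default `DataLoader` settings, `steps_per_epoch` equals `len(train_loader)`, so `total_steps` is the same quantity as `len(train_loader) * params['epochs']`.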


In [None]:
model = BertForSequenceClassification.from_pretrained(params['bert_version'])
model = model.to(device)

# optimizer = None # https://pytorch.org/docs/stable/optim.html
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)

# scheduler = None # https://pytorch.org/docs/stable/optim.html (not mandatory)
total_steps = len(train_loader) * params['epochs']
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=total_steps)

early_stopping = EarlyStopping()

history = {'train_loss': [], 'valid_loss': [],
           'train_acc': [], 'valid_acc': []}

for epoch in tqdm(range(params['epochs']), desc='Training'):
    train_loss, train_acc = train_epoch(
        model,
        train_loader,
        optimizer,
        device
    )

    valid_loss, valid_acc = evaluate_epoch(
        model,
        valid_loader,
        device
    )

    history['train_loss'].append(train_loss)
    history['train_acc'].append(train_acc)

    history['valid_loss'].append(valid_loss)
    history['valid_acc'].append(valid_acc)

    if early_stopping(model, valid_loss):
        break

print(f'Finished the training of the model.')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

Training:   0%|          | 0/1 [00:00<?, ?it/s]

SequenceClassifierOutput(loss=None, logits=tensor([[ 0.1984, -0.0967],
        [ 0.0228,  0.1180],
        [-0.0535,  0.0761],
        [ 0.1896, -0.0640],
        [ 0.2562,  0.0580],
        [-0.1529,  0.2377],
        [ 0.1706,  0.0156],
        [-0.0920, -0.1030]], device='cuda:0', grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
------------------
tensor([[ 0.1984, -0.0967],
        [ 0.0228,  0.1180],
        [-0.0535,  0.0761],
        [ 0.1896, -0.0640],
        [ 0.2562,  0.0580],
        [-0.1529,  0.2377],
        [ 0.1706,  0.0156],
        [-0.0920, -0.1030]], device='cuda:0', grad_fn=<AddmmBackward0>)


TypeError: ignored

In [None]:
history

### Evaluation of the model


In [None]:
# Save Best Model Weights and history
path = 'drive/MyDrive'
torch.save(early_stopping.best_model_wts, f'{path}/weights_bert.pth')

np.save(f'{path}/train_loss_bert', np.array(history['train_loss']))
np.save(f'{path}/valid_loss_bert', np.array(history['valid_loss']))

np.save(f'{path}/train_acc_bert', np.array(history['train_acc']))
np.save(f'{path}/valid_acc_bert', np.array(history['valid_acc']))

In [None]:
# Load
train_loss = np.load(f'{path}/train_loss_bert.npy')
train_acc = np.load(f'{path}/train_acc_bert.npy')

valid_loss = np.load(f'{path}/valid_loss_bert.npy')
valid_acc = np.load(f'{path}/valid_acc_bert.npy')

#### Plot the Train and Valid loss


In [None]:
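## --- Insert code here --- ##
plt.plot(history['train_loss'], label='Train loss')
plt.plot(history['valid_loss'], label='Valid loss')
plt.legend()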

#### Evaluate in Test set


In [None]:
## --- Insert code here --- ##
model = BertForSequenceClassification.from_pretrained(params["bert_version"])
# load_state_dict() does not return the model, so move it to the device separately
model.load_state_dict(early_stopping.best_model_wts)
model = model.to(device)
model.eval()  # disable dropout for evaluation
metric = load_metric("accuracy")

with torch.no_grad():  # gradients are not needed at test time
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask, labels=labels)

        predictions = outputs.logits.argmax(dim=-1)

        # Move predictions to the CPU before handing them to the metric
        metric.add_batch(predictions=predictions.cpu(), references=batch['labels'])

print(f'Test accuracy: {metric.compute()}')

> What are your conclusions?
