# Fine Tune BERT

In this notebook we fine-tune BERT for binary classification.

Code was extracted from this tutorial:

https://colab.research.google.com/github/Ankur3107/colab_notebooks/blob/master/classification/BERT_Fine_Tuning_Sentence_Classification_v2.ipynb#scrollTo=6J-FYdx6nFE_

The task we train BERT to solve is Named Entity Disambiguation: given an entity mention, a set of candidate entities, and the context in which the mention appears, which candidate is the context referring to?

For example, the entity "Jaguar" has a lot of options:

- Jaguar as an animal
- Jaguar as a brand of cars
- Jaguar as a supercomputer
- etc.

So, given the context:
 - "The man saw a Jaguar speed in the highway" -> the context is most likely referring to "Jaguar" as a car.
 - "The prey saw the jaguar cross the jungle" -> the context is most likely referring to "jaguar" as an animal.

We formulate this as a binary classification problem: for every entity in the dataset, BERT receives the context in which the entity is mentioned, one candidate option, and a label of 1 if that option is correct or 0 if it is incorrect.

The dataset consists of all combinations of the candidate options for an entity and the sentences that mention it, along with a 0/1 label indicating whether the option is correct for that mention (context). For more information on how this dataset was built, see the `1-make_dataset.ipynb` notebook. Each "bert_qry" input text was shortened so that it does not exceed 512 tokens.

For the Jaguar example, the dataset would look like:
```
[
    {
        "bert_qry": "Is 'Jaguar' in the context of: 'The man saw a Jaguar speed in the highway', referring to [SEP] Jaguar as an animal?",
        "label": 0
    },
    {
        "bert_qry": "Is 'Jaguar' in the context of: 'The man saw a Jaguar speed in the highway', referring to [SEP] Jaguar as a brand of cars?",
        "label": 1
    },
    {
        "bert_qry": "Is 'Jaguar' in the context of: 'The man saw a Jaguar speed in the highway', referring to [SEP] Jaguar as a supercomputer?",
        "label": 0
    },
    {
        "bert_qry": "Is 'Jaguar' in the context of: 'The prey saw the jaguar cross the jungle', referring to [SEP] Jaguar as an animal?",
        "label": 1
    },
    {
        "bert_qry": "Is 'Jaguar' in the context of: 'The prey saw the jaguar cross the jungle', referring to [SEP] Jaguar as a brand of cars?",
        "label": 0
    },
    {
        "bert_qry": "Is 'Jaguar' in the context of: 'The prey saw the jaguar cross the jungle', referring to [SEP] Jaguar as a supercomputer?",
        "label": 0
    }
]
```
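The pairing of one mention with every candidate option can be sketched as follows. This is only an illustration, not the code from `1-make_dataset.ipynb`: the helper name `build_bert_queries` and its arguments are hypothetical, and the query template simply mirrors the example rows shown above.

```python
def build_bert_queries(entity, context, options, correct_option):
    """Pair one mention (entity + context) with every candidate option.

    Hypothetical helper: returns one row per option, labeled 1 only for
    the option equal to `correct_option` and 0 otherwise.
    """
    rows = []
    for option in options:
        query = (
            f"Is '{entity}' in the context of: '{context}', "
            f"referring to [SEP] {option}?"
        )
        rows.append({"bert_qry": query, "label": int(option == correct_option)})
    return rows


options = [
    "Jaguar as an animal",
    "Jaguar as a brand of cars",
    "Jaguar as a supercomputer",
]
rows = build_bert_queries(
    entity="Jaguar",
    context="The man saw a Jaguar speed in the highway",
    options=options,
    correct_option="Jaguar as a brand of cars",
)
for row in rows:
    print(row["label"], row["bert_qry"])
```

Because every mention produces one row per option, most rows carry the label 0, which is worth keeping in mind when reading accuracy numbers.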

About the dataset used for this training:

Our dataset comprises news articles scraped from a Mexican newspaper. NER was applied to each article to extract all entities mentioned in it. For each entity, we queried Wikidata to retrieve all possible options. An LLM was used as a teacher to disambiguate a subset of the whole dataset (`0-ask_stable_beluga.ipynb`), and the teacher's outputs were used as the labels. A preprocessing step then created the "bert_qry" text, making sure it does not exceed 512 tokens (`1-make_dataset.ipynb`).

In [1]:
import torch
# AdamW is taken from torch.optim further down, so it is not imported from transformers.
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset, random_split
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import json
from tqdm import tqdm
import nltk
import re

In [2]:
def print_progress_bar(iteration, total, bar_length=50):
    progress = float(iteration) / float(total)
    arrow = '=' * int(round(progress * bar_length) - 1)
    spaces = ' ' * (bar_length - len(arrow))

    print(f'Progress: [{arrow + spaces}] {int(progress * 100)}%', end='\r')

## Load Data

In [3]:
# This dataset was built beforehand: each example pairs an entity mention with the
# sentence in which it is mentioned (the context) and one candidate option, so it can
# be given to BERT as input. The BERT queries were also shortened to ensure
# that they don't exceed 512 tokens.
with open("datasets/dataset_for_bert_fine_tune_shortened.json", "r") as file:
    data_for_bert_fine_tunning = json.load(file)

## Tokenize and Encode the Data

In [4]:
# Load the tokenizer.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

In [6]:
total_exs = len(data_for_bert_fine_tunning)
input_ids = []
attention_masks = []
labels = []  # This should be 0 for incorrect options and 1 for the correct one.
for e, example in enumerate(data_for_bert_fine_tunning):
    
    print_progress_bar(iteration=e, total=total_exs)
    query = example['bert_qry']
    
    encoded_dict = tokenizer.encode_plus(
        query,                           # Sentence to encode.
        add_special_tokens = True,       # Add '[CLS]' and '[SEP]'
        max_length = 512,                # Set the max length to 512
        padding='max_length',            # Pad to max length (512)
        truncation=True,                 # Truncate when greater than 512
        return_attention_mask = True,    # Construct attention masks.
        return_tensors = 'pt',           # Return pytorch tensors.
    )

    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])
    labels.append(example["label"])
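To make the role of the attention mask concrete, here is a toy sketch of padding and masking in plain Python. It is not how the tokenizer is implemented: `pad_with_mask` is a hypothetical helper, the token ids are made up, and real truncation in the tokenizer also takes care of the special tokens, which this sketch does not.

```python
def pad_with_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad a list of token ids and build its attention mask.

    The mask holds 1 for real tokens and 0 for padding positions.
    """
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    padded = kept + [pad_id] * n_pad
    mask = [1] * len(kept) + [0] * n_pad
    return padded, mask


# Made-up ids for illustration; real ids come from the tokenizer vocabulary.
ids, mask = pad_with_mask([101, 7, 8, 9, 102], max_length=8)
print(ids)
print(mask)
```

With `max_length = 512`, every example occupies 512 positions regardless of its real length, which is what makes the tensors in the following cells so large.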



In [9]:
labels = torch.tensor(labels)

In [26]:
# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
# `labels` may already be a tensor from the previous cell; as_tensor handles both cases.
labels = torch.as_tensor(labels)

In [10]:
input_ids.shape, attention_masks.shape, labels.shape

(torch.Size([3360112, 512]), torch.Size([3360112, 512]), torch.Size([3360112]))
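The shapes above imply a large memory footprint, since every row is padded to 512 positions. A rough back-of-the-envelope estimate, assuming the ids and masks are stored as 64-bit integers (an assumption about the tensor dtype, not something the output shows):

```python
num_examples = 3_360_112   # rows reported by the shape output above
max_length = 512           # padded sequence length
bytes_per_value = 8        # assumption: int64 values

bytes_per_tensor = num_examples * max_length * bytes_per_value
gib_per_tensor = bytes_per_tensor / 2**30

print(f"{gib_per_tensor:.1f} GiB per tensor")
print(f"{2 * gib_per_tensor:.1f} GiB for input_ids + attention_masks")
```

Under that assumption, holding both tensors in RAM takes tens of GiB, so padding to the longest sequence in each batch or tokenizing lazily may be worth considering on smaller machines.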

## Create and Split Data

In [30]:
# Convert the dataset into a DataLoader - combines a dataset and a sampler, and provides an iterable over the dataset.
batch_size = 16  # You might need to adjust this depending on your GPU.

dataset = TensorDataset(input_ids, attention_masks, labels)
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size

train_dataset, val_dataset = random_split(
    dataset, 
    [train_size, val_size]
)

train_dataloader = DataLoader(
    train_dataset,  # The training samples.
    sampler = RandomSampler(train_dataset), # Select batches randomly
    batch_size = batch_size
)

validation_dataloader = DataLoader(
    val_dataset, # The validation samples.
    sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
    batch_size = batch_size
)
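As a sanity check, the split sizes and the number of batches per epoch can be reproduced with plain arithmetic from the dataset size reported earlier. The variable names below are only for illustration.

```python
import math

num_examples = 3_360_112   # from the tensor shapes printed earlier
batch_size = 16

# Same 80/20 split as in the cell above.
train_size = int(0.8 * num_examples)
val_size = num_examples - train_size

# A DataLoader that keeps the last partial batch yields ceil(n / batch_size) batches.
train_batches = math.ceil(train_size / batch_size)
val_batches = math.ceil(val_size / batch_size)

print(train_size, val_size)
print(train_batches, val_batches)
```

The number of training batches obtained this way matches the `168,006` total reported in the training log further down.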

## Load Model

In [32]:
# Tell PyTorch to run this model on the GPU (falls back to the CPU if CUDA is unavailable).
device_to_use = "cuda:0" if torch.cuda.is_available() else "cpu"
device = torch.device(device_to_use)

In [33]:
model = BertForSequenceClassification.from_pretrained(
    "bert-base-cased", 
    num_labels = 2,  # Binary classification, 1/0.
    output_attentions = False, 
    output_hidden_states = False, 
).to(device)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initi

## Train the Model

In [34]:
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr = 2e-5, # args.learning_rate
    eps = 1e-8 # args.adam_epsilon
)

In [35]:
from transformers import get_linear_schedule_with_warmup

# Number of training epochs (authors recommend between 2 and 4)
epochs = 4

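# The training loop below steps the optimizer and the scheduler once every
# `gradient_accumulation_steps` batches, so the total number of scheduler
# steps is (batches per epoch // accumulation steps) * number of epochs.
gradient_accumulation_steps = 4  # Must match the value used in the training loop.
total_steps = (len(train_dataloader) // gradient_accumulation_steps) * epochs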

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(
    optimizer, 
    num_warmup_steps = 0, # Default value in run_glue.py
    num_training_steps = total_steps
)
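The shape of a linear schedule with warmup can be sketched in plain Python. This is a re-derivation for illustration, not the library code: `linear_schedule_factor` is a hypothetical name, and it returns the multiplier applied to the base learning rate at a given scheduler step.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier for the base learning rate at scheduler step `step`.

    Rises linearly from 0 to 1 during warmup, then decays linearly to 0
    at `num_training_steps` and stays at 0 afterwards.
    """
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
total = 1000  # small hypothetical step count, just to show the shape
for step in (0, 250, 500, 1000):
    print(step, base_lr * linear_schedule_factor(step, 0, total))
```

With zero warmup steps the learning rate starts at its full value and reaches zero only if the scheduler is stepped `num_training_steps` times, so that argument should equal the number of `scheduler.step()` calls the training loop actually makes.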

In [36]:
torch.cuda.empty_cache()
gradient_accumulation_steps = 4

# Store the average loss after each epoch so we can plot them.
loss_values = []

for epoch_i in range(0, epochs):
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.
    
    print(
        '======== Epoch {:} / {:} ========'.format(
            epoch_i + 1, 
            epochs
        )
    )
    # empty gpu cache to free memory
    torch.cuda.empty_cache()
    # Reset the total loss for this epoch.
    total_train_loss = 0

    # Put the model into training mode. Don't be misled--the call to 
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):
        # Progress update every 2000 batches.
        if step % 2000 == 0 and not step == 0:
            # Report progress.
            print(
                '  Batch {:>5,}  of  {:>5,}.'.format(
                    step, 
                    len(train_dataloader)
                )
            )

        # Unpack this training batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using the 
        # `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)     

        # Perform a forward pass (evaluate the model on this training batch).
        # Because we have provided the `labels`, the returned output contains
        # the loss in addition to the logits.
        # The documentation for this `model` function is here: 
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        outputs = model(
            b_input_ids, 
            token_type_ids=None, 
            attention_mask=b_input_mask, 
            labels=b_labels
        )

        # Then access the loss and logits directly
        loss = outputs.loss
        loss /= gradient_accumulation_steps 
        #logits = outputs.logits

        # Perform a backward pass to calculate the gradients.
        loss.backward()
        
        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_train_loss += loss.item() * gradient_accumulation_steps

        # Perform optimization step every 'gradient_accumulation_steps' batches
        if (step + 1) % gradient_accumulation_steps == 0:
            # Clip the norm of the gradients to 1.0 to prevent the "exploding gradients" problem
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            # Update parameters
            optimizer.step()
            # Update the learning rate.
            scheduler.step()
            # Clear the gradients
            model.zero_grad()          

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)
    # Store the loss value for plotting the learning curve.
    loss_values.append(avg_train_loss)
    
    print("  Average training loss: {0:.2f}".format(avg_train_loss))

    # ========================================
    #               Validation
    # ========================================
    print("Running Validation...")

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables
    total_eval_accuracy = 0
    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        # Unpack this validation batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using the 
        # `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        
        # Telling the model not to compute or store gradients, saving memory and
        # speeding up validation
        with torch.no_grad():        
            # Forward pass, calculate logit predictions.
            # Because we pass `labels`, the output contains both the loss
            # and the logits.
            # token_type_ids is the same as the "segment ids", which 
            # differentiates sentence 1 and 2 in 2-sentence tasks.
            # The documentation for this `model` function is here: 
            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
            outputs = model(
                b_input_ids, 
                token_type_ids=None, 
                attention_mask=b_input_mask,
                labels=b_labels
            )
        
        # access the loss
        loss = outputs.loss
        # update total evaluation loss
        total_eval_loss += loss.item()

        # Move logits and labels to CPU
        logits = outputs.logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Count the correct predictions for this batch of validation sentences,
        # and accumulate the count over all batches.
        preds = np.argmax(logits, axis=1).flatten()
        total_eval_accuracy += np.sum(preds == label_ids)

    # Report the final accuracy for this validation run.
    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader.dataset)
    print("  Accuracy: {0:.2f}".format(avg_val_accuracy))

    # Calculate the average loss over all of the batches.
    avg_val_loss = total_eval_loss / len(validation_dataloader)
    print("  Validation Loss: {0:.2f}".format(avg_val_loss))

print("Training complete!")

  Batch 2,000  of  168,006.
  Batch 4,000  of  168,006.
  Batch 6,000  of  168,006.
  Batch 8,000  of  168,006.
  Batch 10,000  of  168,006.
  Batch 12,000  of  168,006.
  Batch 14,000  of  168,006.
  Batch 16,000  of  168,006.
  Batch 18,000  of  168,006.
  Batch 20,000  of  168,006.
  Batch 22,000  of  168,006.
  Batch 24,000  of  168,006.
  Batch 26,000  of  168,006.
  Batch 28,000  of  168,006.
  Batch 30,000  of  168,006.
  Batch 32,000  of  168,006.
  Batch 34,000  of  168,006.
  Batch 36,000  of  168,006.
  Batch 38,000  of  168,006.
  Batch 40,000  of  168,006.
  Batch 42,000  of  168,006.
  Batch 44,000  of  168,006.
  Batch 46,000  of  168,006.
  Batch 48,000  of  168,006.
  Batch 50,000  of  168,006.
  Batch 52,000  of  168,006.
  Batch 54,000  of  168,006.
  Batch 56,000  of  168,006.
  Batch 58,000  of  168,006.
  Batch 60,000  of  168,006.
  Batch 62,000  of  168,006.
  Batch 64,000  of  168,006.
  Batch 66,000  of  168,006.
  Batch 68,000  of  168,006.
  Batch 70,000  of
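The effect of gradient accumulation in the loop above can be summarized with a little arithmetic. The numbers come from the cells and the log above; the variable names are only for illustration.

```python
batches_per_epoch = 168_006        # from the training log above
batch_size = 16
gradient_accumulation_steps = 4
epochs = 4

# Gradients from 4 batches are summed before each optimizer step.
effective_batch_size = batch_size * gradient_accumulation_steps

# The optimizer and scheduler step once every 4 batches.
optimizer_steps_per_epoch = batches_per_epoch // gradient_accumulation_steps

# Batches at the end of an epoch that are not followed by an optimizer
# step within that epoch.
leftover_batches = batches_per_epoch % gradient_accumulation_steps

scheduler_steps_total = optimizer_steps_per_epoch * epochs

print(effective_batch_size, optimizer_steps_per_epoch)
print(leftover_batches, scheduler_steps_total)
```

Since the batch count is not a multiple of the accumulation steps, the gradients of the last few batches of each epoch stay accumulated without an update; stepping the optimizer once more at the end of the epoch would be one way to handle them.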

In [37]:
# Save the model to the path of your choice, for example 'model_save/'
model.save_pretrained('model_save/test1')
tokenizer.save_pretrained('model_save/test1')

('model_save/test1\\tokenizer_config.json',
 'model_save/test1\\special_tokens_map.json',
 'model_save/test1\\vocab.txt',
 'model_save/test1\\added_tokens.json')
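This notebook stops at saving the model and does not show inference. One plausible way to use the binary classifier for disambiguation, which is an assumption here and not something the notebook does, is to score every candidate option of a mention and keep the one with the highest probability of label 1. The sketch below uses made-up logits and a hypothetical helper `positive_probability`.

```python
import math


def positive_probability(logits):
    """Softmax over [logit_for_label_0, logit_for_label_1]; returns P(label = 1)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


# Made-up logits for three candidate options of a single mention.
candidate_logits = {
    "Jaguar as an animal": [2.1, -1.3],
    "Jaguar as a brand of cars": [-0.7, 1.9],
    "Jaguar as a supercomputer": [3.0, -2.5],
}

scores = {
    option: positive_probability(pair)
    for option, pair in candidate_logits.items()
}
best_option = max(scores, key=scores.get)
print(best_option)
```

Ranking by probability always returns exactly one option per mention; a threshold on the score could be added if "none of the options" should be a possible answer.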