## Deliverable 1: BASELINE - Baseline implementation (Due 30th September 2024)

Using the BERT-Base model (https://huggingface.co/google-bert/bert-base-uncased) and the SQuAD dataset (https://rajpurkar.github.io/SQuAD-explorer/), select a PyTorch implementation that trains the model on a single GPU. This implementation is referred to below as the **BASELINE implementation**. You may start from an implementation found on the Internet, since writing one from scratch is probably beyond your current expertise.

Measure the training time of that code on a single GPU. If the time is too short (less than one minute), add more training epochs, or look for a larger dataset or a more sophisticated model architecture.

Profiling the training run with TensorBoard or any other tool will count as a plus.

# BASELINE SINGLE GPU

This notebook fine-tunes the BERT-Base model (https://huggingface.co/google-bert/bert-base-uncased) on the SQuAD dataset (https://rajpurkar.github.io/SQuAD-explorer/).

The objective of the task is to measure the training time on a single NVIDIA A100 GPU.

Optional: profile the training with TensorBoard.

RESOURCES
https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb#scrollTo=jwMn3_6gx6P8

https://www.youtube.com/watch?v=wG2J_MJEjSQ

https://www.youtube.com/watch?v=IcrN_L2w0_Y

https://lightning.ai/pages/community/tutorial/how-to-speed-up-pytorch-model-training/

https://datasets.activeloop.ai/docs/ml/datasets/squad-dataset/

https://knswamy.medium.com/nlp-deep-learning-training-on-downstream-tasks-using-pytorch-lightning-question-answering-on-17d2a0965733

https://pytorch.org/text/0.9.0/_modules/torchtext/datasets/squad2.html

https://pytorchnlp.readthedocs.io/en/latest/_modules/torchnlp/datasets/squad.html



In [1]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


In [2]:

# run this cell, then restart the runtime before continuing
# !pip install datasets transformers --quiet
! pip install -q transformers[torch] datasets

ERROR: Operation cancelled by user

In [3]:
!pip install pytorch-lightning --quiet
#!pip install colorama --quiet

ERROR: Operation cancelled by user

## Load Datasets

In [5]:
from datasets import load_dataset
# Load the dataset
squad = load_dataset("squad")

ModuleNotFoundError: No module named 'datasets'

In [None]:
squad

In [None]:

example = squad['train'][10]
for key in example:
    print(key, ":", example[key])

In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

model = AutoModelForQuestionAnswering.from_pretrained('bert-base-uncased')

In [None]:
def prepare_train_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride.
    # As a result, one example may produce several features when its context is long,
    # each of those features having a context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",  # truncate context, not the question
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context.
    # This will help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
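
The core of `prepare_train_features` is the conversion of a character-level answer span into token indices through the offset mapping. The sketch below isolates that step on a toy example. It is a simplified illustration, not the notebook's code: `char_span_to_token_span` is a hypothetical helper, and the offsets come from a plain whitespace split rather than from the BERT tokenizer.

```python
import re


def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to inclusive (start_token, end_token) indices.

    offsets: list of (char_start, char_end) pairs, one per context token.
    Returns None when the answer is not fully inside the given tokens,
    which is the case the notebook labels with the CLS index.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None

    # Walk right past every token that starts at or before the answer start,
    # then step back one token.
    start_token = 0
    while start_token < len(offsets) and offsets[start_token][0] <= start_char:
        start_token += 1
    start_token -= 1

    # Walk left past every token that ends at or after the answer end,
    # then step forward one token.
    end_token = len(offsets) - 1
    while end_token >= 0 and offsets[end_token][1] >= end_char:
        end_token -= 1
    end_token += 1

    return start_token, end_token


context = "Paris is the capital of France"
# Toy "tokenizer": one token per whitespace-separated word.
offsets = [m.span() for m in re.finditer(r"\S+", context)]

answer = "capital of France"
start_char = context.index(answer)
span = char_span_to_token_span(offsets, start_char, start_char + len(answer))
print(span)
```

With real tokenizer output, the same walk runs only over the tokens whose `sequence_ids` value is 1 (the context), as in the function above.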

In [None]:

# Apply the function to our data
tokenized_datasets = squad.map(prepare_train_features, batched=True, remove_columns=squad["train"].column_names)

In [None]:
squad

In [None]:
tokenized_datasets
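
`tokenized_datasets` can contain more rows than `squad` because contexts that do not fit in `max_length=384` tokens are split into overlapping windows that share `stride=128` tokens. The sketch below illustrates the windowing arithmetic only. It is a simplification under stated assumptions: `window_starts` is a hypothetical helper, and the 14-token question length is made up for the example.

```python
def window_starts(n_context_tokens, window, stride):
    """Return the start offsets of overlapping windows over a context.

    window: number of context tokens that fit in one feature.
    stride: number of tokens shared by two consecutive windows.
    """
    if stride >= window:
        raise ValueError("stride must be smaller than window")
    starts = []
    start = 0
    while True:
        starts.append(start)
        if start + window >= n_context_tokens:
            break  # this window already reaches the end of the context
        start += window - stride  # advance, keeping `stride` tokens of overlap
    return starts


# Assumed budget: 384 tokens minus a 14-token question and 3 special tokens
# ([CLS], [SEP], [SEP]).
window = 384 - 14 - 3

short_context = window_starts(300, window, 128)
long_context = window_starts(800, window, 128)
print(short_context)  # one feature
print(long_context)   # several overlapping features
```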

In [None]:

from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="finetune-BERT-squad",  # checkpoints and logs are written here
    #eval_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
)

In [None]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

In [None]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"].select(range(1000)),
    eval_dataset=tokenized_datasets["validation"].select(range(100)),
    data_collator=data_collator,
    tokenizer=tokenizer,
)
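
Because the measured training time scales with the number of optimizer steps, it helps to know that number before starting the run. A minimal sketch of the arithmetic for the settings above (1000 training features, batch size 8, 5 epochs), assuming a single device and no gradient accumulation; `total_optimizer_steps` is an illustrative helper, not a `transformers` function.

```python
import math


def total_optimizer_steps(n_samples, batch_size, epochs):
    """Number of optimizer steps for a single device without gradient accumulation."""
    steps_per_epoch = math.ceil(n_samples / batch_size)  # last batch may be partial
    return steps_per_epoch * epochs


steps = total_optimizer_steps(n_samples=1000, batch_size=8, epochs=5)
print(steps)  # 625
```

If the run finishes in under a minute, raising `num_train_epochs` or selecting more than 1000 features increases this count proportionally.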

In [None]:
# Run the trainer
import torch

trainer.train()
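
Since the objective is the training time, one option is to wrap the training call in a wall-clock timer. The sketch below uses a stand-in workload so that it runs on its own. `timed` and its `sync` parameter are illustrative names, not part of the notebook. On a GPU, passing `torch.cuda.synchronize` as `sync` would make sure queued kernels have finished before the clock is read.

```python
import time


def timed(fn, sync=None):
    """Call fn() and return (result, elapsed_seconds).

    sync: optional callable (for example torch.cuda.synchronize) invoked
    before each clock read so that pending GPU work is included.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in this notebook the call would presumably be
# timed(trainer.train, sync=torch.cuda.synchronize).
result, seconds = timed(lambda: sum(range(1_000_000)))
print(f"Elapsed: {seconds:.3f} s")
```

The object returned by `trainer.train()` also carries a `metrics` dictionary that should include `train_runtime`, which can be cross-checked against the manual measurement.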

# Evaluate the Model

In [None]:
instance = squad['train'][20]
context = instance['context']
question = instance['question']

In [None]:
context

In [None]:
instance['answers']

In [None]:

given_answer = instance['answers']['text'][0]  # Assuming the first answer is the correct one
given_answer_start = instance['answers']['answer_start'][0]
given_answer, given_answer_start

In [None]:
inputs = tokenizer(question, context, return_tensors='pt', max_length=512, truncation=True)

In [None]:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # switch off dropout; the model is left in training mode after trainer.train()

In [None]:
inputs = {k: v.to(device) for k, v in inputs.items()}

In [None]:
# Get model's output
with torch.no_grad():
    output = model(**inputs)

In [None]:
# Get the predicted answer
start_idx = torch.argmax(output.start_logits)
end_idx = torch.argmax(output.end_logits)

predicted_answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][start_idx:end_idx + 1]))
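
Taking the two `argmax` values independently can produce an end index that lies before the start index, which yields an empty answer. A common alternative is to score every valid `(start, end)` pair and keep the best one. The sketch below shows this on plain Python lists. `best_span` and the `max_answer_len` limit of 30 tokens are assumptions for illustration, not part of the notebook.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed logit, with start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_logit + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Toy logits where the independent argmax would give start=2, end=0.
start_logits = [0.1, 0.2, 5.0, 0.3]
end_logits = [4.0, 0.1, 1.0, 2.0]
span = best_span(start_logits, end_logits)
print(span)
```

With the model output above, the inputs would be something like `output.start_logits[0].tolist()` and `output.end_logits[0].tolist()`.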


In [None]:
predicted_answer, start_idx, end_idx, start_idx.item(), end_idx.item()

In [None]:
correct = (predicted_answer.lower() == given_answer.lower())
evaluation = 'Correct' if correct else f'Incorrect (Predicted: {predicted_answer}, Given: {given_answer})'
print(evaluation)

In [None]:
# Function to evaluate a single instance
def evaluate_instance(instance, device):
    context = instance['context']
    question = instance['question']
    given_answer = instance['answers']['text'][0]  # Assuming the first answer is the correct one

    # Tokenize the data
    inputs = tokenizer(question, context, return_tensors='pt', max_length=512, truncation=True)

    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Apply the BERT model
    with torch.no_grad():  # No need to calculate gradients
        output = model(**inputs)

    # Get the predicted answer
    start_idx = torch.argmax(output.start_logits)
    end_idx = torch.argmax(output.end_logits)
    predicted_answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][start_idx:end_idx + 1]))

    return predicted_answer.lower() == given_answer.lower()

In [None]:
from tqdm import tqdm

In [None]:
correct_count = 0
total_count = 100

# Note: these examples come from the training split, so this is not a held-out accuracy.
for i in tqdm(range(total_count)):
    correct_count += evaluate_instance(squad['train'][i], device)

In [None]:
# Calculate and output the accuracy
accuracy = correct_count / total_count
print(f'Accuracy: {accuracy * 100:.2f}%')
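
The accuracy above counts a prediction as correct only when the lowercased strings are identical, so a stray article or punctuation mark turns a good answer into a miss. SQuAD results are usually reported as exact match and token-level F1 after normalizing both strings. The sketch below follows that normalization (lowercase, strip punctuation, drop the articles a/an/the, collapse whitespace). The function names are illustrative and the example strings are made up.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """True when both answers are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(reference)


def f1_score(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = f1_score("the tall Eiffel Tower", "Eiffel Tower")
print(em, round(f1, 2))
```

F1 gives partial credit when the predicted span is slightly too long or too short, which makes it a useful companion to the strict accuracy figure.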

# Track Metrics on Tensorboard

In [None]:
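# Install the PyTorch Profiler plugin for TensorBoard
!pip install torch_tb_profiler
# Start TensorBoard on the profiler output directory
%load_ext tensorboard
%tensorboard --logdir ./log
# When TensorBoard runs locally, the profiler view is at http://localhost:6006/#pytorch_profiler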
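
The `./log` directory has to be filled by a profiler before TensorBoard has anything to show. A minimal sketch of how `torch.profiler` can write traces there, using a toy linear model and random data as a stand-in for BERT so that it runs in seconds. The model, the data shapes, the schedule values, and the `profile_training_steps` name are all assumptions for illustration.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler


def profile_training_steps(log_dir="./log", n_steps=6):
    """Profile a few optimizer steps of a toy model and write traces to log_dir."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.nn.Linear(64, 2).to(device)  # toy stand-in for the BERT model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    inputs = torch.randn(32, 64, device=device)
    targets = torch.randint(0, 2, (32,), device=device)

    activities = [ProfilerActivity.CPU]
    if device == "cuda":
        activities.append(ProfilerActivity.CUDA)

    with profile(
        activities=activities,
        # Skip 1 step, warm up for 1 step, then record 3 steps.
        schedule=schedule(wait=1, warmup=1, active=3),
        on_trace_ready=tensorboard_trace_handler(log_dir),
    ) as prof:
        for _ in range(n_steps):
            loss = torch.nn.functional.cross_entropy(model(inputs), targets)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            prof.step()  # tell the profiler that one training step has finished
    return log_dir


trace_dir = profile_training_steps("./log")
print("Traces written to", trace_dir)
```

The same `with profile(...)` pattern would wrap a manual training loop over the SQuAD features. Hooking it into `Trainer` would need a callback that calls `prof.step()`, which is not shown here.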

In [None]:
#%load_ext tensorboard
#%tensorboard --logdir lightning_logs/

In [None]:
#model.eval()
#model.freeze()
#test_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size = 5, shuffle=False)

In [None]:
# I try this when Colab runs out of Cuda memory
#torch.cuda.empty_cache()

In [None]:
#!/opt/bin/nvidia-smi

In [None]:
#!ps -aux|grep python

In [None]:
# This is the best way to free up GPU memory - kill the ipykernel process
#!kill -9 1129

In [None]:
## Tried the LR finder in PyTorch Lightning. It does not work in multi-GPU setups, and the initial results of the learning rate finder were not satisfactory.
## This code won't work without defining bert_imdb variable
## bert_ner = NERModel(transformer = transformer_model, n_tags = len(tag_complete))
## trainer = pl.Trainer(gpus=1, max_epochs=1, auto_lr_find=True)

# Run learning rate finder
# lr_finder = trainer.fit(bert_ner)

# Results can be found in
# lr_finder.results

# Plot with
# fig = lr_finder.plot(suggest=True)
# fig.show()