# ***Paul SERIN***
# *HPC TOOLS : Deliverable 1*

## ***Step 1:*** Import libraries

In [1]:
import json
from pathlib import Path
import torch
from torch.utils.data import DataLoader
import time
from tqdm import tqdm

## ***Step 2:*** Retrieve and Store the data

Here I read the passages, queries and answers from the train and validation .json files and store them in lists. A close look at these files shows that each passage comes with several queries, and each query can have several answers, so I store one (passage, query, answer) triple per answer.

In [2]:
pathTrainData = "../data/train-v2.0.json"
pathTestData =  "../data/dev-v2.0.json"

def load_data(file_path):
    """Load a JSON file and return its contexts, questions and answers."""
    with open(file_path, "r") as f:
        data = json.load(f)

    contexts = []
    questions = []
    answers = []

    for group in data['data']:
        for passage in group['paragraphs']:
            context = passage['context']
            for qa in passage['qas']:
                question = qa['question']
                for answer in qa['answers']:
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer)
    
    return contexts, questions, answers

train_contexts, train_questions, train_answers = load_data(pathTrainData)
test_contexts, test_questions, test_answers = load_data(pathTestData)
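
To make the flattening concrete, here is a minimal sketch on a hand-made dictionary with the same nesting as the SQuAD files (data, paragraphs, qas, answers). The passage and questions are invented for illustration, and the `flatten` helper is a hypothetical stand-in for `load_data`; only the key names come from the code above. A question with an empty `answers` list adds no row, which, as far as I understand, is why the unanswerable questions of SQuAD v2.0 do not show up in the lists.

```python
# Toy data shaped like a SQuAD file; the text itself is made up for illustration.
toy = {
    "data": [
        {
            "paragraphs": [
                {
                    "context": "Paris is the capital of France.",
                    "qas": [
                        {
                            "question": "What is the capital of France?",
                            "answers": [{"text": "Paris", "answer_start": 0}],
                        },
                        {
                            "question": "Which country has Paris as its capital?",
                            "answers": [
                                {"text": "France", "answer_start": 24},
                                {"text": "France", "answer_start": 24},
                            ],
                        },
                        # An unanswerable question: its answer list is empty.
                        {"question": "Who is the mayor?", "answers": []},
                    ],
                }
            ]
        }
    ]
}


def flatten(data):
    """Return one (context, question, answer) row per answer, like load_data."""
    rows = []
    for group in data["data"]:
        for passage in group["paragraphs"]:
            for qa in passage["qas"]:
                for answer in qa["answers"]:
                    rows.append((passage["context"], qa["question"], answer))
    return rows


rows = flatten(toy)
print(len(rows))  # 3 rows: 1 + 2 + 0
```

The same passage is repeated once per answer, which is why the three lists always have the same length.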

## ***Step 3:*** Check the data

As you can see, the training data yields 86821 passages, queries and answers. Each answer is stored as a dictionary, with the answer string under the "text" key and the character index at which the answer starts under the "answer_start" key. Note that the character index at which the answer ends is missing.

In [3]:
print(len(train_contexts))
print(len(train_questions))
print(len(train_answers))

86821
86821
86821


In [4]:
print("Passage: ",train_contexts[0])
print("Query: ",train_questions[0])
print("Answer: ",train_answers[0])

Passage:  Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
Query:  When did Beyonce start becoming popular?
Answer:  {'text': 'in the late 1990s', 'answer_start': 269}


As you can see, the validation data yields 20302 passages, queries and answers.

In [5]:
print(len(test_contexts))
print(len(test_questions))
print(len(test_answers))

20302
20302
20302


In [6]:
print("Passage: ",test_contexts[0])
print("Query: ",test_questions[0])
print("Answer: ",test_answers[0])

Passage:  The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.
Query:  In what country is Normandy located?
Answer:  {'text': 'France', 'answer_start': 159}


## ***Step 4:*** Find the end position character

Because the BERT model needs both the start and the end character position of the answer, I have to compute the end position and store it for later. I noticed that SQuAD answers sometimes "eat" one or two characters of the real answer in the passage. For example (as a colleague pointed out on Piazza), for the word "sixth" in the passage, SQuAD gives the answer "six". In these cases I shift the start and end indices back by one or two characters so that the slice of the passage matches the given answer. I do this because BERT works with ***tokens*** in a specific format, so the SQuAD dataset has to be processed to match the input that BERT expects.

Find the end position character in the train and validation data

In [7]:
def adjust_answer_indices(answers, contexts):
    """
    Adjust the start and end indices of answers to ensure they correctly align
    with the context text. It handles cases where the actual answer might differ
    by one or two characters from the indexed position.
    
    Parameters:
    answers (list): List of answer dictionaries containing 'text' and 'answer_start'.
    contexts (list): List of context strings from which the answers are extracted.
    """
    for answer, context in zip(answers, contexts):
        real_answer = answer['text']
        start_idx = answer['answer_start']
        end_idx = start_idx + len(real_answer)  # Calculate the end index

        # Check if the real answer matches the exact indexed position
        if context[start_idx:end_idx] == real_answer:
            answer['answer_end'] = end_idx
        # Handle case where the real answer is off by one character
        elif context[start_idx-1:end_idx-1] == real_answer:
            answer['answer_start'] = start_idx - 1
            answer['answer_end'] = end_idx - 1
        # Handle case where the real answer is off by two characters
        elif context[start_idx-2:end_idx-2] == real_answer:
            answer['answer_start'] = start_idx - 2
            answer['answer_end'] = end_idx - 2

# Adjust the indices for both training and test sets
adjust_answer_indices(train_answers, train_contexts)
adjust_answer_indices(test_answers, test_contexts)
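
As a small worked example of the index adjustment, the sketch below uses an invented sentence whose recorded start index is one character too far to the right. The helper name `align` is hypothetical, and it differs from `adjust_answer_indices` in two ways that are my own assumptions: it returns the indices instead of modifying the dictionary, and it returns `None` (and refuses negative starts) when no shift matches.

```python
def align(context, answer):
    """Return (start, end) so that context[start:end] == answer['text'], or None.

    Tries the recorded start index, then one and two characters to the left.
    """
    text = answer["text"]
    for shift in (0, 1, 2):
        start = answer["answer_start"] - shift
        end = start + len(text)
        if start >= 0 and context[start:end] == text:
            return start, end
    return None


context = "She finished sixth in the race."
# The recorded start (14) is one character too far to the right.
answer = {"text": "sixth", "answer_start": 14}
print(align(context, answer))  # (13, 18)
```

The end index is exclusive, so `context[13:18]` is exactly `"sixth"`.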


## ***Step 5:*** Tokenize passages and queries

This task asks for the pretrained BERT-base model "bert-base-uncased" to be used for tokenization.

In [8]:
import warnings
warnings.filterwarnings("ignore")
from transformers import AutoTokenizer, BertForQuestionAnswering  # AdamW is imported from torch.optim in Step 10
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

train_encodings = tokenizer(
    train_contexts, 
    train_questions, 
    truncation=True, 
    padding=True, 
    # clean_up_tokenization_spaces=True  # Delete warning
)

test_encodings = tokenizer(
    test_contexts, 
    test_questions, 
    truncation=True, 
    padding=True, 
    # clean_up_tokenization_spaces=True 
)


## ***Step 6:*** Convert the start-end positions to tokens start-end positions

In [9]:
def add_token_positions(encodings, answers):
  start_positions = []
  end_positions = []

  count = 0

  for i in range(len(answers)):
    start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
    end_positions.append(encodings.char_to_token(i, answers[i]['answer_end']))

    # if start position is None, the answer passage has been truncated
    if start_positions[-1] is None:
      start_positions[-1] = tokenizer.model_max_length

    # if end position is None, 'answer_end' (exclusive) points to the character after the last token, so retry with answer_end - 1
    if end_positions[-1] is None:
      end_positions[-1] = encodings.char_to_token(i, answers[i]['answer_end'] - 1)
      # if end position is still None the answer passage has been truncated
      if end_positions[-1] is None:
        count += 1
        end_positions[-1] = tokenizer.model_max_length

  print(count)

  # Update the data in dictionary
  encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(train_encodings, train_answers)
add_token_positions(test_encodings, test_answers)

10
16


I observed that, after tokenization, the end position was sometimes None. In that case I retry with the character one position to the left. After this retry the end position is still None for only 10 answers in the train data (out of 86821) and 16 answers in the validation data (out of 20302), and these get model_max_length, as for the start position. I also tried shifting one position to the right (+1) or two positions to the left or right (+/- 2), but more answers were left as None (e.g. 526, against only 10 with this code). So I kept this strategy in the end, in order to have as few "burned" answers as possible.
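
The reason the first lookup can return None is easier to see on a toy example. The sketch below is not the real `char_to_token` from the tokenizer (which works on WordPiece tokens and also returns None for truncated text); it is a simplified stand-in that assumes a whitespace tokenizer, with hypothetical helper names, and the sentence is a fragment of the first training passage.

```python
import re


def tokenize_with_offsets(text):
    """Split on whitespace and keep each token's (start, end) character span."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_to_token(tokens, char_idx):
    """Index of the token whose span covers char_idx, or None if no token does."""
    for i, (_, start, end) in enumerate(tokens):
        if start <= char_idx < end:
            return i
    return None


context = "rose to fame in the late 1990s as lead singer"
tokens = tokenize_with_offsets(context)

answer_text = "in the late 1990s"
answer_start = context.index(answer_text)
answer_end = answer_start + len(answer_text)  # exclusive end index

print(char_to_token(tokens, answer_start))    # 3 (token "in")
print(char_to_token(tokens, answer_end))      # None: this index is the space after "1990s"
print(char_to_token(tokens, answer_end - 1))  # 6 (token "1990s")
```

Because the end index is exclusive, it usually lands on a space or punctuation right after the answer, so stepping back one character recovers the last token of the answer.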

## ***Step 7:*** Create a Dataset class

Create a SquadDataset class (inheriting from torch.utils.data.Dataset) that converts the encodings into datasets and makes it easier to train and validate on the data prepared above.

In [10]:
class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

In [11]:
train_dataset = SquadDataset(train_encodings)
test_dataset = SquadDataset(test_encodings)

## ***Step 8:*** Use of DataLoader

I pass the datasets to a DataLoader in order to split them into batches of size 8. I will explain the choice of this batch size later.

In [12]:
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=True)

small_train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=16,
    sampler=torch.utils.data.SubsetRandomSampler(range(int(0.05 * len(train_dataset))))
)

small_test_loader = torch.utils.data.DataLoader(
    test_dataset,
    batch_size=16,
    sampler=torch.utils.data.SubsetRandomSampler(range(int(0.05 * len(test_dataset))))
)
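
To get a feeling for these settings, the arithmetic below works out how many batches each loader produces in a single process. It only assumes the dataset lengths printed in Step 3 and the fact that, without `drop_last`, the last batch may be smaller than the others.

```python
import math

train_size, test_size = 86821, 20302  # lengths printed in Step 3

# Full loaders: batch_size=8, the last batch may be incomplete.
print(math.ceil(train_size / 8))  # 10853
print(math.ceil(test_size / 8))   # 2538

# 5% subsets: int() truncates, then batches of 16.
small_train = int(0.05 * train_size)
small_test = int(0.05 * test_size)
print(small_train, math.ceil(small_train / 16))  # 4341 272
print(small_test, math.ceil(small_test / 16))    # 1015 64
```

Note that `range(int(0.05 * n))` always selects the first 5% of the indices; `SubsetRandomSampler` only randomizes the order in which they are visited.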

## ***Step 9:*** Use GPU

## ***Step 10:*** Build the Bert model

I chose BertForQuestionAnswering from the transformers library, as it is the most relevant model for this task. When a model is instantiated with from_pretrained(), the configuration and pretrained weights of the specified model are used to initialize it. I also used PyTorch's AdamW optimizer, which implements gradient bias correction as well as weight decay.
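
To illustrate what "bias correction" and "weight decay" mean here, the sketch below performs one AdamW update on a single scalar parameter in plain Python. It is not the PyTorch implementation: the function name `adamw_step` is hypothetical, the learning rate 5e-5 is the one used in this notebook, and the other hyperparameters are what I believe to be the `torch.optim.AdamW` defaults (an assumption worth checking against the documentation).

```python
import math


def adamw_step(p, g, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update of scalar parameter p with gradient g at step t (1-based)."""
    p = p * (1 - lr * weight_decay)      # decoupled weight decay
    m = beta1 * m + (1 - beta1) * g      # running mean of the gradient
    v = beta2 * v + (1 - beta2) * g * g  # running mean of the squared gradient
    m_hat = m / (1 - beta1 ** t)         # bias correction of both estimates,
    v_hat = v / (1 - beta2 ** t)         # which start at zero
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v


p, m, v = adamw_step(p=1.0, g=0.5, m=0.0, v=0.0, t=1)
print(p)  # slightly below 1.0
```

At the first step the corrected estimates are `m_hat = g` and `v_hat = g * g`, so the gradient part of the update has magnitude close to `lr` whatever the scale of the gradient, while the weight decay shrinks the parameter independently of the gradient.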



In [13]:
import os
import torch
import pytorch_lightning as pl
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint, LearningRateMonitor, Timer
from pytorch_lightning.callbacks import Callback
from tqdm import tqdm
from transformers import BertForQuestionAnswering
from torch.optim import AdamW
import time

# Disable tokenizer parallelism
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Pick another port to avoid the EADDRINUSE error
#os.environ["MASTER_PORT"] = str(12355)

# Optimize for Tensor Cores
torch.set_float32_matmul_precision('high')

# Callback that measures and prints the execution time
class TimeCallback(Callback):
    def on_train_start(self, trainer, pl_module):
        self.start_time = time.time()
        print(f"\nTraining started at {time.strftime('%H:%M:%S')}")

    def on_train_end(self, trainer, pl_module):
        end_time = time.time()
        total_time = end_time - self.start_time
        print(f"\nTraining finished at {time.strftime('%H:%M:%S')}")
        print(f"Total training time: {total_time:.2f} seconds")

# SLURM-specific callback
class MySlurmCallback(Callback):
    def on_train_start(self, trainer, pl_module):
        slurm_id = os.getenv('SLURM_JOB_ID')
        slurm_rank = os.getenv('SLURM_PROCID')
        device_id = torch.cuda.current_device()
        print(f"SLURM_JOB_ID: {slurm_id}, SLURM_PROCID: {slurm_rank}, CUDA Device ID: {device_id}")

# Adapt the model as a LightningModule, with tqdm for progress tracking
class BertLightning(pl.LightningModule):
    def __init__(self, model_name='bert-base-uncased', learning_rate=5e-5):
        super().__init__()
        self.model = BertForQuestionAnswering.from_pretrained(model_name)
        self.learning_rate = learning_rate
        self.train_loss = 0.0
        self.num_batches = 0

    def forward(self, input_ids, attention_mask, start_positions, end_positions):
        return self.model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)

    def training_step(self, batch, batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        start_positions = batch['start_positions']
        end_positions = batch['end_positions']

        outputs = self(input_ids, attention_mask, start_positions, end_positions)
        loss = outputs[0]

        # Logging training loss
        self.log('train_loss', loss)

        # Accumulate the losses for tracking
        self.train_loss += loss.item()
        self.num_batches += 1

        return loss

    def on_train_epoch_end(self):
        avg_loss = self.train_loss / self.num_batches if self.num_batches > 0 else 0
        self.log('avg_train_loss', avg_loss)  # Log the average loss
        print(f"Epoch {self.current_epoch + 1} average training loss: {avg_loss:.4f}")

        # Reset the counters for the next epoch
        self.train_loss = 0.0
        self.num_batches = 0

    def validation_step(self, batch, batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        start_positions = batch['start_positions']
        end_positions = batch['end_positions']

        outputs = self(input_ids, attention_mask, start_positions, end_positions)
        loss = outputs[0]

        # Logging validation loss
        self.log('val_loss', loss)

    def configure_optimizers(self):
        return AdamW(self.model.parameters(), lr=self.learning_rate)



In [14]:

# Pick another port to avoid the EADDRINUSE error
os.environ["MASTER_PORT"] = "12358"  # Ensure this is unique if you're running multiple processes
os.environ["WORLD_SIZE"] = "2"  # Set this to the number of GPUs you have available
os.environ["LOCAL_RANK"] = "0"  # Set this for each process

model = BertLightning()

trainer = Trainer(
    max_epochs=2, 
    num_nodes=1,  # Only use one node in the notebook
    accelerator="gpu",
    devices=2,  # Use the number of GPUs available (should be >1 for DDP)
    strategy="ddp_notebook",  # Set to DDP strategy
    callbacks=[
        EarlyStopping(monitor='train_loss'), 
        ModelCheckpoint(dirpath='checkpoints/', filename='{epoch}-{train_loss:.2f}'), 
        LearningRateMonitor(logging_interval='step'), 
        Timer(),
        MySlurmCallback(),
    ]
)

print("""
                        ##################################################
                        #                                                #
                        #         START TRAINING AND EVALUATION          #
                        #                                                #
                        ##################################################
""")

total_train_time = time.time()  # Start timing the entire training process

trainer.fit(model, train_loader, test_loader)

total_train_time = time.time() - total_train_time 

minutes = total_train_time // 60
seconds = total_train_time % 60
print(f"Total Training Time: {int(minutes)} minutes {seconds:.2f} seconds")

print("""
                        ##################################################
                        #                                                #
                        #                      END                       #
                        #                                                #
                        ##################################################
""")


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs



                        ##################################################
                        #                                                #
                        #         START TRAINING AND EVALUATION          #
                        #                                                #
                        ##################################################



Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

2024-10-06 12:43:43.969552: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-10-06 12:43:43.982765: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-06 12:43:43.999046: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unab

Sanity Checking: |                                                                               | 0/? [00:00<…

SLURM_JOB_ID: 8840548, SLURM_PROCID: 0, CUDA Device ID: 0SLURM_JOB_ID: 8840548, SLURM_PROCID: 0, CUDA Device ID: 1



Training: |                                                                                      | 0/? [00:00<…

Validation: |                                                                                    | 0/? [00:00<…

Epoch 1 average training loss: 1.1865
Epoch 1 average training loss: 1.1662


Validation: |                                                                                    | 0/? [00:00<…

Epoch 2 average training loss: 0.6414
Epoch 2 average training loss: 0.6292


`Trainer.fit` stopped: `max_epochs=2` reached.


Total Training Time: 14 minutes 33.01 seconds

                        ##################################################
                        #                                                #
                        #                      END                       #
                        #                                                #
                        ##################################################
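
Each "Epoch N average training loss" line appears twice in the output above, with slightly different values. My understanding is that each of the two DDP processes runs `on_train_epoch_end` and prints the average over its own share of the training data. The sketch below shows, in plain Python, one simple way a dataset can be split between two processes: pad to a multiple of the number of processes, then take every second index. The helper name `shard_indices` is hypothetical, and the real distributed sampler also shuffles the indices at each epoch, which is left out here.

```python
import math


def shard_indices(dataset_size, world_size, rank):
    """Simplified, non-shuffled sharding: pad to a multiple of world_size, then stride."""
    per_rank = math.ceil(dataset_size / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_size))
    indices += indices[: total - dataset_size]  # repeat the first samples as padding
    return indices[rank:total:world_size]


shards = [shard_indices(86821, world_size=2, rank=r) for r in range(2)]
print([len(s) for s in shards])       # [43411, 43411]
print(math.ceil(len(shards[0]) / 8))  # 5427 batches of 8 per process
```

Under these assumptions each process handles about half of the 86821 training samples per epoch, so it averages the loss over a different set of batches, which would explain the two nearby values.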

