# 1. Dataset Construction

In this task, we construct a dataset of question-answer pairs that will be used to train our BERT-based question-answering model. The dataset contains **1015** question-answer pairs, which were sourced from ChatGPT.

The dataset follows a SQuAD-style format, where each question corresponds to a passage of text (context), and the answer is a span of text within that context. We will ensure that each question is answered by the corresponding passage and that the `answer_start` positions are correctly specified.

### Steps:
1. We will structure the dataset in a JSON format with the following attributes:
   - `context`: The paragraph or passage from which answers can be extracted.
   - `qas`: A list of questions and their corresponding answers, with `answer_start` giving the 0-based character offset at which the answer begins in the `context`.
2. We will include **1000-1500** question-answer pairs in total, ensuring that they are diverse and representative of the content on pressure ulcers.

### Example:
```json
{
  "context": "Pressure ulcers, also known as bedsores or decubitus ulcers...",
  "qas": [
    {
      "id": 1,
      "question": "What is another name for pressure ulcers?",
      "answers": [
        {
          "text": "bedsores or decubitus ulcers",
          "answer_start": 31
        }
      ]
    }
  ]
}
```

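One way to check that `answer_start` values are correctly specified is to slice the context at the stated offset and compare the result with the answer text. The sketch below assumes 0-based character offsets (the SQuAD convention); `find_misaligned` is a hypothetical helper name used only for illustration.

```python
def find_misaligned(paragraph):
    """Return (question id, stated offset, offset found by search) for every
    answer whose text does not appear at the stated offset in the context."""
    context = paragraph["context"]
    problems = []
    for qa in paragraph["qas"]:
        for answer in qa["answers"]:
            start = answer["answer_start"]
            text = answer["text"]
            if context[start:start + len(text)] != text:
                # str.find returns -1 when the text is absent from the context
                problems.append((qa["id"], start, context.find(text)))
    return problems


paragraph = {
    "context": "Pressure ulcers, also known as bedsores or decubitus ulcers...",
    "qas": [{
        "id": 1,
        "question": "What is another name for pressure ulcers?",
        "answers": [{"text": "bedsores or decubitus ulcers", "answer_start": 31}],
    }],
}

# An empty list means every stated offset lines up with its answer text
print(find_misaligned(paragraph))
```

Running a check like this over the whole file before tokenization catches offset errors early, since a wrong offset silently produces wrong training labels.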
# 2. Data Pre-processing

Data pre-processing is an essential step before training a model. In this section, we will preprocess the dataset in preparation for training the BERT model. This will include **tokenization** and **padding**.

### Tasks:
1. **Loading the Dataset**:
   We will first load the JSON file into a Python dictionary to ensure we have all the questions, answers, and contexts in the correct format.

2. **Tokenization**:
   Tokenization is the process of converting the text (questions and contexts) into tokens that can be fed into the model. We will use the `BertTokenizerFast` to tokenize both the questions and contexts.

3. **Padding**:
   Since the tokenized sequences can have varying lengths, we will pad them to a fixed length to ensure that all inputs are of equal length. This helps with batch processing during model training.
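The padding step can be illustrated without the real tokenizer. The following is a minimal sketch in plain Python: `pad_to_length` is a hypothetical helper, and the ids are placeholders, except that 101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` ids in the `bert-base-uncased` vocabulary.

```python
def pad_to_length(token_ids, max_length, pad_id=0):
    """Right-pad a list of token ids and build the matching attention mask
    (1 for real tokens, 0 for padding)."""
    n_pad = max(0, max_length - len(token_ids))
    padded = token_ids + [pad_id] * n_pad
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return padded, attention_mask


# Two sequences of different lengths, padded to the longer one
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 102]]
max_len = max(len(ids) for ids in batch)

for ids in batch:
    padded, mask = pad_to_length(ids, max_len)
    print(padded, mask)
```

The attention mask lets the model ignore the padded positions, so padding changes the shape of the batch but not the content the model attends to.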

In [None]:
import json
import torch
from datasets import Dataset
from transformers import BertTokenizerFast, BertForQuestionAnswering, TrainingArguments, Trainer
from transformers import default_data_collator
import evaluate




### Loading and Flattening the SQuAD Dataset

In this step, we load the SQuAD-style dataset (`data-V2.json`) and flatten it into a format suitable for model training. Each entry contains a question, context, and its corresponding answer(s). This transformation ensures that the data is structured as a list of question-answer pairs, each with an associated context.

**Why**: 
- The dataset is originally in a nested JSON format that groups questions under paragraphs and articles. Flattening the dataset simplifies it into individual question-answer pairs, making it compatible with model training.
- This step prepares the data for the next stages of tokenization and model input.

The final output is a Hugging Face `Dataset` object, which is a convenient format for further processing and training with Hugging Face transformers.


In [2]:
# Load the SQuAD-style dataset
with open('data/data-V2.json') as f:
    squad_data = json.load(f)

# Flatten the JSON into QA-style examples
examples = []
for article in squad_data["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            question = qa["question"]
            answer = qa["answers"][0]
            examples.append({
                "id": qa["id"],
                "context": context,
                "question": question,
                "answers": {
                    "text": [answer["text"]],
                    "answer_start": [answer["answer_start"]]
                }
            })

# Convert to Hugging Face Dataset
raw_dataset = Dataset.from_list(examples)


### Tokenization and Preprocessing

Here, we use the BERT tokenizer (`BertTokenizerFast`) to process the dataset. The tokenizer converts text into a format that can be input to the BERT model, including token IDs, attention masks, and token type IDs. The `preprocess` function is applied to each example in the dataset.

**Why**: 
- **Truncation**: We set `truncation="only_second"` so that only the context (the second sequence) is truncated when the combined input exceeds `max_length`; the question is always kept whole.
- **Max Length**: The input length is restricted to 384 tokens to limit memory usage; this is below the 512-token maximum that BERT accepts.
- **Stride**: A stride of 128 sets the number of tokens shared by consecutive windows when a long context is split. Overlapping windows are only produced when `return_overflowing_tokens=True` is also passed; with the call below, contexts that exceed the limit are simply cut off at 384 tokens.
- **Padding**: We pad sequences to the maximum length to create uniform input sizes, required for batch processing.
- **Offsets Mapping**: This returns the start and end positions of tokens in the original text, which will be useful during training for identifying the correct answer span.
- **Token Type IDs**: These identify whether a token belongs to the question or the context, which is required by BERT for distinguishing the two segments.

The result is a tokenized dataset ready for training.
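To make the meaning of `stride` concrete, here is a toy sketch of overlapping windows in plain Python. It is not the tokenizer's implementation: `sliding_windows` is a hypothetical helper, and in the real tokenizer the room left for context tokens is `max_length` minus the question and special tokens.

```python
def sliding_windows(context_tokens, window_size, stride):
    """Split tokens into windows of at most `window_size` tokens, where
    consecutive windows share `stride` tokens."""
    if not 0 <= stride < window_size:
        raise ValueError("stride must be non-negative and smaller than window_size")
    windows = []
    start = 0
    while True:
        windows.append(context_tokens[start:start + window_size])
        if start + window_size >= len(context_tokens):
            break  # this window already reaches the end of the context
        start += window_size - stride
    return windows


tokens = list(range(10))  # stand-in for ten context tokens
for window in sliding_windows(tokens, window_size=4, stride=2):
    print(window)
# [0, 1, 2, 3]
# [2, 3, 4, 5]
# [4, 5, 6, 7]
# [6, 7, 8, 9]
```

Because neighbouring windows overlap, an answer that straddles a window boundary still appears whole in at least one window, provided it is no longer than the overlap.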


In [3]:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def preprocess(example):
    return tokenizer(
        example["question"],
        example["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        padding="max_length",
        return_offsets_mapping=True,
        return_token_type_ids=True
    )

tokenized_dataset = raw_dataset.map(preprocess, batched=True)


Map: 100%|██████████| 1015/1015 [00:00<00:00, 2510.44 examples/s]


### Adding Token Positions

In this step, we calculate the token start and end positions corresponding to the answer span in the context. These positions are necessary for training BERT for question-answering tasks, where the model learns to predict the start and end tokens of the correct answer.

**Why**:
- **CLS Token**: The `cls_token_id` marks the beginning of each sequence in BERT's input format. Its index is conventionally used as the label for examples whose answer does not fit within the truncated input.
- **Offsets**: We use the offsets generated during tokenization to find the exact span of the answer in the tokenized text.
- **Start/End Character**: We convert the character-based start and end positions of the answer into token positions using the offsets. This allows us to align the original answer span with token indices in the BERT input.
- **While Loops**: These loops identify the correct token range by checking where the answer's start and end characters appear in the tokenized offsets. We adjust indices to ensure that the token span is correctly assigned.
- **Remove Columns**: After extracting the start and end positions, we remove unnecessary columns like the `offset_mapping`, `answers`, and `question` to reduce the dataset size and keep only the relevant features.

The output dataset now includes `start_positions` and `end_positions`, which are used during model training to guide BERT in locating the correct answer span.
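The character-to-token conversion can be tried on a toy example. This sketch uses a simple regex in place of the BERT tokenizer, so the offsets cover context tokens only; in a full model input the question, special, and padding tokens also carry offsets, and the search should be limited to the context tokens. `char_span_to_token_span` is a hypothetical helper name.

```python
import re


def char_span_to_token_span(offsets, start_char, end_char):
    """Map the character span [start_char, end_char) to inclusive token
    indices, given each token's (start, end) character offsets."""
    token_start = 0
    while token_start < len(offsets) and offsets[token_start][0] <= start_char:
        token_start += 1
    token_start -= 1  # last token that starts at or before the answer

    token_end = len(offsets) - 1
    while token_end >= 0 and offsets[token_end][1] >= end_char:
        token_end -= 1
    token_end += 1  # first token that ends at or after the answer
    return token_start, token_end


context = "Pressure ulcers are also called bedsores."
# Toy tokenization: words and punctuation marks with their character offsets
offsets = [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", context)]

answer = "bedsores"
start_char = context.index(answer)
end_char = start_char + len(answer)

print(char_span_to_token_span(offsets, start_char, end_char))  # (5, 5)
```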


In [4]:
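def add_token_positions(example):
    cls_index = example["input_ids"].index(tokenizer.cls_token_id)
    offsets = example["offset_mapping"]
    type_ids = example["token_type_ids"]
    start_char = example["answers"]["answer_start"][0]
    end_char = start_char + len(example["answers"]["text"][0])

    # Restrict the search to context tokens. Question, special, and padding tokens
    # also carry offsets, so scanning the whole sequence can return wrong indices
    # (padding offsets are (0, 0), which would leave the end index at len(offsets)).
    # For BERT pairs, token_type_ids is 1 for the context and its closing [SEP].
    context_start = type_ids.index(1)
    context_end = len(type_ids) - type_ids[::-1].index(1) - 2  # token before the closing [SEP]

    # If truncation cut the answer off, label the example with the [CLS] index
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        example["start_positions"] = cls_index
        example["end_positions"] = cls_index
        return example

    # Advance to the last context token that starts at or before the answer
    token_start_index = context_start
    while token_start_index <= context_end and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    token_start_index -= 1

    # Walk back to the first context token that ends at or after the answer
    token_end_index = context_end
    while token_end_index >= context_start and offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    token_end_index += 1

    example["start_positions"] = token_start_index
    example["end_positions"] = token_end_index
    return example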

tokenized_dataset = tokenized_dataset.map(add_token_positions)
tokenized_dataset = tokenized_dataset.remove_columns(["offset_mapping", "answers", "question", "context", "id"])


Map: 100%|██████████| 1015/1015 [00:01<00:00, 578.32 examples/s]


### Splitting the Dataset and Initializing the Model

In this section, we split the tokenized dataset into training and evaluation sets and initialize the BERT model for question answering.

1. **Splitting the Dataset**:
   - We use the `train_test_split` function from Hugging Face's `datasets` library to divide the tokenized dataset into training and evaluation sets.
   - We specify a **test size of 20%** (`test_size=0.2`), meaning that 80% of the data will be used for training and 20% will be reserved for evaluation.
   - The resulting split is stored in the `train_dataset` and `eval_dataset` variables, which will be used during model training and evaluation.

2. **Loading the BERT Model**:
   - We load the pre-trained BERT model (`bert-base-uncased`) for question answering using the `BertForQuestionAnswering` class.
   - The pre-trained checkpoint provides the encoder weights only; the question-answering head (`qa_outputs`) is newly initialized, so the model must be fine-tuned before it can predict the start and end positions of answers in a given context.

The next step will involve training this model using the preprocessed training dataset and evaluating its performance on the evaluation set.


In [5]:
train_test = tokenized_dataset.train_test_split(test_size=0.2)
train_dataset = train_test['train']
eval_dataset = train_test['test']

model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# 3. Model Development/Training

In this section, we will train a **BERT-based model** for the task of question-answering using the pre-processed data. We will use the Hugging Face `transformers` library to fine-tune the pre-trained BERT model on our custom dataset.

### Steps:
1. **Model Selection**:
   We will use `BertForQuestionAnswering`, which adds a span-prediction head on top of the pre-trained BERT encoder. The `bert-base-uncased` checkpoint has not been fine-tuned on SQuAD, so the head is trained from scratch on our dataset.

2. **Hyperparameter Configuration**:
   Choosing the right hyperparameters is crucial for effective training. We will configure the learning rate, batch size, and the number of epochs. We will also discuss the reason for selecting these values.

3. **Training**:
   We will use the `Trainer` API to fine-tune the model.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./bert-qa",                         # Directory where the model and checkpoints will be saved
    evaluation_strategy="epoch",                     # Evaluate the model after each epoch
    learning_rate=2e-5,                              # Learning rate for fine-tuning; a typical value for BERT
    per_device_train_batch_size=16,                  # Batch size for training; increases training speed without overloading memory
    per_device_eval_batch_size=16,                   # Batch size for evaluation
    num_train_epochs=3,                              # Number of epochs to train the model
    weight_decay=0.01,                               # Weight decay for regularization to avoid overfitting
    save_strategy="epoch",                           # Save the model after each epoch
    logging_dir="./logs",                            # Directory for logging information
    logging_steps=50,                                # Log every 50 steps for better monitoring of training progress
    warmup_steps=500,                                # Warmup steps to gradually increase the learning rate at the beginning of training
    load_best_model_at_end=True,                     # Load the best model after training based on evaluation results
    metric_for_best_model="eval_loss",               # Metric to monitor for the best model
    greater_is_better=False,                         # Indicate whether higher metric values are better (for loss, False)
    report_to="tensorboard",                         # Report to TensorBoard for visualizing training progress
    fp16=True,                                       # Enable mixed precision training for faster training on compatible hardware
    dataloader_num_workers=4,                        # Number of workers for loading data; improves data loading speed
)




# Training Arguments Justification

In this section, we will discuss the configuration of the **TrainingArguments** used for fine-tuning the BERT model. The parameters selected are designed to achieve an optimal balance between training efficiency and model performance. Below is a justification for each parameter used:

### Parameters:
1. **output_dir**:
   - **Value**: `./bert-qa`
   - **Justification**: This is the directory where the model and checkpoints will be saved during training. It allows us to keep track of the model versions.

2. **evaluation_strategy**:
   - **Value**: `epoch`
   - **Justification**: The model will be evaluated after each epoch. This ensures that we can monitor its progress throughout training and make adjustments if necessary.

3. **learning_rate**:
   - **Value**: `2e-5`
   - **Justification**: A common learning rate for fine-tuning BERT-based models is `2e-5`. It allows the model to learn effectively without making large, unstable updates to the weights. This value has been widely used in the literature for similar tasks and provides good performance.

4. **per_device_train_batch_size**:
   - **Value**: `16`
   - **Justification**: A batch size of `16` is used to increase the speed of training without overloading the GPU memory. Larger batch sizes can help with convergence but may lead to memory issues on GPUs with limited memory.

5. **per_device_eval_batch_size**:
   - **Value**: `16`
   - **Justification**: The evaluation batch size is also set to `16`, which matches the training batch size. This helps ensure consistency in processing and allows us to evaluate the model efficiently.

6. **num_train_epochs**:
   - **Value**: `3`
   - **Justification**: We train for 3 epochs to avoid overfitting while ensuring sufficient training. The choice of 3 epochs is a tradeoff between training time and model generalization. We will monitor the model's performance during training to ensure it does not start overfitting.

7. **weight_decay**:
   - **Value**: `0.01`
   - **Justification**: A weight decay of `0.01` is used as a regularization technique to prevent the model from overfitting. Weight decay applies a penalty to large weights, which encourages simpler models that generalize better.

8. **save_strategy**:
   - **Value**: `epoch`
   - **Justification**: The model will be saved after each epoch. This allows us to retain checkpoints for each training stage, which can be helpful for later model analysis or resuming training from a specific epoch.

9. **logging_dir**:
   - **Value**: `./logs`
   - **Justification**: This is the directory where the logs of the training process will be stored. It helps with monitoring the training progress and debugging any issues that arise.

10. **logging_steps**:
    - **Value**: `50`
    - **Justification**: Logs are generated every 50 steps during training to provide regular updates on the model's progress. This frequency can be adjusted depending on the size of the dataset.

11. **warmup_steps**:
    - **Value**: `500`
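    - **Justification**: Warmup steps gradually increase the learning rate from 0 to the specified learning rate (`2e-5`) over the first 500 steps. This helps the model to start learning slowly and avoid large updates early on. Note that the run in this notebook has fewer than 500 optimizer steps in total (the training log shows 408), so the warmup never completes and the learning rate stays below `2e-5`; a smaller value (or `warmup_ratio`) would be needed for the full learning rate to be reached.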

12. **load_best_model_at_end**:
    - **Value**: `True`
    - **Justification**: This option ensures that after training, we load the best model based on the evaluation results. This is particularly useful when monitoring metrics like validation loss or accuracy.

13. **metric_for_best_model**:
    - **Value**: `eval_loss`
    - **Justification**: We use `eval_loss` as the metric to determine the best model. Lower loss indicates better performance, so we will select the model with the lowest evaluation loss.

14. **greater_is_better**:
    - **Value**: `False`
    - **Justification**: Since we are monitoring `eval_loss`, lower loss values indicate better performance. Therefore, `greater_is_better` is set to `False`.

15. **report_to**:
    - **Value**: `tensorboard`
    - **Justification**: TensorBoard will be used for visualizing training progress and metrics such as loss, accuracy, and others. This provides an intuitive and interactive way to monitor the training process.

16. **fp16**:
    - **Value**: `True`
    - **Justification**: Mixed precision training is enabled to speed up the training process and reduce memory usage. This is especially helpful when working with large models like BERT.

17. **dataloader_num_workers**:
    - **Value**: `4`
    - **Justification**: Setting the number of workers to `4` helps to load data more efficiently during training. This allows the CPU to process multiple data batches in parallel, improving training speed.

### Summary:
The chosen hyperparameters are aimed at optimizing training efficiency while preventing overfitting. The learning rate is conservative, and the batch size is set to a reasonable value for the available hardware. Regular evaluation, warmup steps, and logging will ensure the training process is monitored effectively.
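To see how `warmup_steps` interacts with the length of a run, the sketch below reimplements a linear warmup followed by linear decay in plain Python. It assumes the `Trainer`'s default linear schedule; `linear_schedule_lr` is a hypothetical helper, and the total of 408 steps is taken from the training log in this notebook.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear warmup from 0 to base_lr over
    `warmup_steps`, then linear decay to 0 at `total_steps`."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


base_lr = 2e-5
warmup_steps = 500
total_steps = 408  # from the training log; shorter than the warmup

# Highest learning rate reached at any point in the run
peak = max(linear_schedule_lr(s, base_lr, warmup_steps, total_steps)
           for s in range(total_steps + 1))
print(f"peak learning rate: {peak:.3e}")
```

With these values the peak stays at roughly 82% of `2e-5`, because the run ends while the warmup is still in progress.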


# Model Training

This cell trains the BERT model for question answering using the `Trainer` API.

In [9]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=default_data_collator
)

trainer.train()


{'eval_loss': nan, 'eval_runtime': 29.3628, 'eval_samples_per_second': 6.914, 'eval_steps_per_second': 0.885, 'epoch': 1.0}
{'eval_loss': nan, 'eval_runtime': 29.5683, 'eval_samples_per_second': 6.865, 'eval_steps_per_second': 0.879, 'epoch': 2.0}
{'eval_loss': nan, 'eval_runtime': 29.7766, 'eval_samples_per_second': 6.817, 'eval_steps_per_second': 0.873, 'epoch': 3.0}
{'eval_loss': nan, 'eval_runtime': 29.4694, 'eval_samples_per_second': 6.889, 'eval_steps_per_second': 0.882, 'epoch': 4.0}

100%|██████████| 408/408 [35:16<00:00,  5.19s/it]

{'train_runtime': 2116.0308, 'train_samples_per_second': 1.535, 'train_steps_per_second': 0.193, 'train_loss': 0.0, 'epoch': 4.0}





TrainOutput(global_step=408, training_loss=0.0, metrics={'train_runtime': 2116.0308, 'train_samples_per_second': 1.535, 'train_steps_per_second': 0.193, 'total_flos': 636518899408896.0, 'train_loss': 0.0, 'epoch': 4.0})

### Model Saving and Reloading

This code saves the trained BERT model and tokenizer to the specified directory `./bert-pressure-ulcers` for later use. 

1. **Saving the Model**: 
   - `model.save_pretrained("./bert-pressure-ulcers")`: Saves the trained model weights and configuration.
   - `tokenizer.save_pretrained("./bert-pressure-ulcers")`: Saves the tokenizer configuration for consistent tokenization during future use.

2. **Reloading the Model**: 
   - The model and tokenizer can be reloaded with the commented-out code:
     - `model = BertForQuestionAnswering.from_pretrained("./bert-pressure-ulcers")`
     - `tokenizer = BertTokenizerFast.from_pretrained("./bert-pressure-ulcers")`
   
   This allows you to resume inference or fine-tuning without retraining from scratch.


In [10]:
model.save_pretrained("./bert-pressure-ulcers")
tokenizer.save_pretrained("./bert-pressure-ulcers")

# Reload later:
# model = BertForQuestionAnswering.from_pretrained("./bert-pressure-ulcers")
# tokenizer = BertTokenizerFast.from_pretrained("./bert-pressure-ulcers")


('./bert-pressure-ulcers\\tokenizer_config.json',
 './bert-pressure-ulcers\\special_tokens_map.json',
 './bert-pressure-ulcers\\vocab.txt',
 './bert-pressure-ulcers\\added_tokens.json',
 './bert-pressure-ulcers\\tokenizer.json')

### Answer Extraction from the Model

This function, `get_answer()`, takes a question and context as input and uses the fine-tuned BERT model to extract an answer.

1. **Tokenization**: 
   - The question and context are tokenized into input format compatible with the model using `tokenizer()`. It ensures that the sequence length does not exceed 384 tokens and applies truncation when needed.

2. **Model Inference**:
   - The model makes predictions for the starting and ending positions of the answer in the context (`start_logits` and `end_logits`).

3. **Probability Calculation**:
   - The logits are converted into probabilities using softmax, giving for each token the probability that it is the start (or the end) of the answer.

4. **Answer Extraction**:
   - The function iterates over the possible token positions and selects the span with the highest probability (based on both start and end positions).
   - The identified token span is decoded back into a string to provide the final answer.

5. **Example**:
   - For the question, "What causes pressure ulcers?", the function will extract the answer from the provided context.
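The span-selection step can be checked on toy numbers without loading the model. The sketch below is a plain-Python illustration with made-up logits; `softmax` and `best_span` are hypothetical helpers, not library functions.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with end >= start, at most
    `max_answer_len` tokens long, that maximizes P(start) * P(end)."""
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    best, best_score = (0, 0), -1.0
    for start, p_start in enumerate(start_probs):
        for end in range(start, min(start + max_answer_len, len(end_probs))):
            score = p_start * end_probs[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best, best_score


# Made-up logits for a five-token input
start_logits = [0.1, 0.2, 3.0, 0.3, 0.1]
end_logits = [0.1, 0.1, 0.5, 2.5, 0.2]

span, score = best_span(start_logits, end_logits)
print(span)  # (2, 3)
```

Requiring `end >= start` matters when the most likely end token precedes the most likely start token; the search then settles on the best valid pair rather than an empty or reversed span.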



In [None]:
def get_answer(question, context):
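    # Tokenize the question and context, truncating only the context if the pair exceeds the max length
    inputs = tokenizer(question, context, return_tensors="pt", truncation="only_second", max_length=384)
    # Move the inputs to the model's device and disable dropout for inference
    inputs = inputs.to(model.device)
    model.eval()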
    
    # Disable gradient calculation as we are in inference mode
    with torch.no_grad():
        # Perform the model's forward pass to get start and end logits
        outputs = model(**inputs)

    # Extract the start and end logits from the model's output
    start_logits = outputs.start_logits[0]
    end_logits = outputs.end_logits[0]

    # Apply softmax to convert logits into probabilities
    start_probs = torch.softmax(start_logits, dim=0)
    end_probs = torch.softmax(end_logits, dim=0)

    # Initialize variables to track the best start and end indices
    max_prob = 0
    best_start, best_end = 0, 0

    # Loop through all possible start positions
    for start_idx in range(len(start_probs)):
        # Loop through all possible end positions (make sure the end index is after the start index)
        for end_idx in range(start_idx, min(start_idx + 30, len(end_probs))):  # Limit the answer length to 30 tokens
            # Calculate the combined probability for this (start, end) pair
            prob = start_probs[start_idx] * end_probs[end_idx]
            
            # If this probability is higher than the current max, update the best start and end indices
            if prob > max_prob:
                best_start = start_idx
                best_end = end_idx
                max_prob = prob

    # Extract the tokens from the input_ids corresponding to the best start and end indices
    answer_ids = inputs["input_ids"][0][best_start:best_end + 1]
    
    # Decode the token ids back into a string (removing special tokens)
    return tokenizer.decode(answer_ids, skip_special_tokens=True)

# Example question
question = "What causes pressure ulcers?"
# Get the context from the SQuAD dataset (this is the paragraph that the model will refer to when answering the question)
context = squad_data["data"][0]["paragraphs"][0]["context"]
# Print the predicted answer
print("Prediction:", get_answer(question, context))


Prediction: prolonged pressure, particularly over bony prominences such as the sacrum, heels, and hips. these ulcers often


### BERTScore Evaluation

BERTScore is a metric used to evaluate the quality of generated text by comparing it to a reference using BERT embeddings. It computes precision, recall, and F1 scores by comparing token-level representations of both the predicted and ground truth answers.

1. **Prediction**:
   - We use the `get_answer()` function to predict an answer to the question.

2. **Ground Truth**:
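   - The ground truth (correct answer) is defined as `"bedsores or decubitus ulcers"` in this example. Note that this string answers the question "What is another name for pressure ulcers?" from Section 1, whereas the prediction answers "What causes pressure ulcers?"; for a meaningful score, the reference should be the annotated answer to the same question.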

3. **BERTScore Calculation**:
   - The `evaluate.load("bertscore")` function loads the BERTScore evaluation module.
   - `bertscore.compute()` compares the predicted answer (`pred`) to the ground truth (`gt`) using BERT embeddings to calculate precision, recall, and F1 scores.

4. **Output**:
   - The results show how similar the predicted answer is to the ground truth based on semantic similarity:
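     - **Precision**: For each token in the prediction, the similarity to its best-matching token in the ground truth, averaged over the prediction's tokens.
     - **Recall**: For each token in the ground truth, the similarity to its best-matching token in the prediction, averaged over the ground truth's tokens.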
     - **F1 Score**: Harmonic mean of precision and recall.
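The matching behind these three scores can be sketched with toy vectors. The example below uses made-up 2-D embeddings and plain cosine similarity; the real metric uses contextual embeddings from a pre-trained model, so this only illustrates the greedy-matching idea. `cosine` and `greedy_match_scores` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(pred_vecs, ref_vecs):
    """Precision, recall, and F1 from greedy best-match similarities."""
    # Precision: each predicted token is matched to its most similar reference token
    precision = sum(max(cosine(p, r) for r in ref_vecs) for p in pred_vecs) / len(pred_vecs)
    # Recall: each reference token is matched to its most similar predicted token
    recall = sum(max(cosine(r, p) for p in pred_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy embeddings: three predicted tokens, two reference tokens
pred_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]

precision, recall, f1 = greedy_match_scores(pred_vecs, ref_vecs)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")
```

Here every reference token has an exact match in the prediction, so recall is 1, while the extra predicted token lowers precision.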

The cell below computes and prints the BERTScore for this example:


In [None]:
bertscore = evaluate.load("bertscore")

# Example prediction
pred = get_answer(question, context)
gt = "bedsores or decubitus ulcers"

results = bertscore.compute(predictions=[pred], references=[gt], lang="en")
print(f"BERTScore Precision: {results['precision'][0]:.3f}, Recall: {results['recall'][0]:.3f}, F1: {results['f1'][0]:.3f}")


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERTScore Precision: 0.900, Recall: 0.820, F1: 0.858


### Conclusion

In this project, we have developed a question-answering system using the BERT architecture to answer questions related to pressure ulcers. Through various stages, we have prepared the dataset, tokenized the text, and fine-tuned the BERT model on the task of extractive question answering.

The evaluation of our model using **BERTScore** on the single example above gave the following results:
- **Precision**: 0.900
- **Recall**: 0.820
- **F1 Score**: 0.858

These scores come from one prediction compared against one reference answer, so they illustrate how the metric is used rather than measure overall model quality. BERTScore reflects the semantic similarity between the two texts; it does not measure a false positive rate. The training log also reports an `eval_loss` of `nan` and a training loss of 0.0, so the training setup should be verified before drawing conclusions about performance.

Overall, the model has successfully demonstrated the potential of using pre-trained transformer models like BERT for domain-specific question answering tasks. However, further optimization, including fine-tuning on a larger and more diverse dataset, could improve its accuracy even further. Additionally, using techniques like **data augmentation** and **early stopping** might help mitigate overfitting and further enhance model performance.
