<a href="https://colab.research.google.com/github/AlinZohari/InformationExtraction/blob/main/003_SQuAD_TuneQAmodel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning QA model

This notebook is run in Google Colab to make use of its GPU.

References:
1. Hugging Face - [Question and Answering Task Guide](https://huggingface.co/docs/transformers/tasks/question_answering)
2. Creating Train and Validation Datasets - https://simpletransformers.ai/docs/qa-data-formats/

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

In [None]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

Setting CUDA_LAUNCH_BLOCKING=1 makes all CUDA operations synchronous, which means the CPU waits for the GPU to finish before executing the next line of code. This makes errors easier to identify and debug, because the stack trace shows exactly where the error occurred. However, it also makes the code run slower.

In [None]:
!pip install transformers[torch]

In [None]:
!pip install accelerate -U

In [None]:
!pip show accelerate

In [None]:
import torch
torch.cuda.is_available()

## Pretrained model capabilities

Let us first see how the pretrained deepset/roberta-base-squad2 model performs on our questions.

In [None]:
import torch
from transformers import RobertaTokenizer, RobertaForQuestionAnswering


# Load the tokenizer and model
tokenizer = RobertaTokenizer.from_pretrained("deepset/roberta-base-squad2")
model = RobertaForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")

# Read context from a .txt file
import requests

url = "https://raw.githubusercontent.com/AlinZohari/InformationExtraction/main/data/authorize_doc/Kuiper_FCC-20-102A1.txt"
response = requests.get(url)
context = response.text

# Dictionary of questions
questions = {
    "const_name": "What's the name of the satellite constellation the company seeks to deploy or operate?",
    "date_release": "On which date was the document released?",
    "date_50": "By which date must the company launch and operate half of its satellites?",
    "date_100": "By which date is the company expected to have all its satellites operational?",
    "total_sat_const": "How many satellites is the company authorized to deploy and operate for this constellation?",
    "altitude": "At which authorized altitudes will the company deploy its satellites?",
    "inclination": "What are the authorized satellite inclinations within the corresponding altitudes?",
    "number_orb_plane": "How many orbital planes, corresponding to given altitudes and inclinations, has the company been authorized for?",
    "total_sat_per_orb_plane": "How many satellites are allocated to each orbital plane?",
    "total_sat_per_alt_incl": "How many satellites, for each altitude and inclination, are there across all matching orbital planes?",
    "operational_lifetime": "What is the satellite's expected operational lifetime in years?"
}

# Loop through each question
for key, question in questions.items():
    # Prepare the input
    inputs = tokenizer.encode_plus(question, context, return_tensors="pt", max_length=512, truncation=True)


    # Get the model's prediction
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]
    output = model(input_ids, attention_mask=attention_mask)

    answer_start_scores = output.start_logits
    answer_end_scores = output.end_logits

    answer_start = torch.argmax(answer_start_scores)
    answer_end = torch.argmax(answer_end_scores)
    answer = tokenizer.decode(input_ids[0][answer_start:answer_end + 1])

    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print()


The warning message you are seeing is due to the truncation strategy used by the tokenizer. The 'longest_first' truncation strategy truncates tokens from the longest of the two sequences (question or context) until they fit within the specified max_length. The warning is informing you that the overflowing tokens, which are the tokens removed during truncation, are not being returned in the inputs. This is expected behavior, as we are not using the overflowing tokens in this case.
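To make the truncation rule concrete, here is a minimal pure-Python sketch of 'longest_first' as described above: one token at a time is dropped from whichever sequence is currently longer until the pair fits. The function name `truncate_longest_first` and the toy token lists are illustrative assumptions; this is not the tokenizer's actual implementation, and special tokens are ignored.

```python
def truncate_longest_first(question_tokens, context_tokens, max_length):
    """Toy model of 'longest_first': trim the longer sequence one token at a time."""
    question = list(question_tokens)
    context = list(context_tokens)
    overflow = []  # tokens removed during truncation (what the warning refers to)
    while len(question) + len(context) > max_length:
        if len(context) >= len(question):
            overflow.append(context.pop())
        else:
            overflow.append(question.pop())
    overflow.reverse()  # tokens were popped from the end, so restore reading order
    return question, context, overflow


# Hypothetical inputs: a short question and a much longer context
question = ["when", "was", "it", "released", "?"]
context = [f"tok{i}" for i in range(20)]

q, c, dropped = truncate_longest_first(question, context, max_length=12)
print(len(q), len(c), len(dropped))
```

With a short question and a long document, only the context loses tokens, which is why the end of the document never reaches the model.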

Answers that are empty or consist only of the `<s>` token indicate that the model could not find a suitable answer in the context for the given question. This can happen because the answer is not present in the context, or because the context is too long and the relevant portion was truncated.
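One common way to avoid losing the truncated part of a long context is to split it into overlapping windows and query each window separately. The sketch below shows the idea on a plain list of tokens; `sliding_windows`, `window_size` and `stride` are names chosen for illustration (here `stride` means the number of tokens shared by consecutive windows), and the code above does not do this.

```python
def sliding_windows(tokens, window_size, stride):
    """Split `tokens` into windows of at most `window_size`, overlapping by `stride` tokens."""
    if stride >= window_size:
        raise ValueError("stride must be smaller than window_size")
    step = window_size - stride
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # the last window reached the end of the sequence
        start += step
    return windows


tokens = list(range(10))  # stand-in for token ids
windows = sliding_windows(tokens, window_size=4, stride=2)
print(windows)
```

The overlap reduces the chance that an answer is cut in half at a window boundary.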

Given these weak answers from the pretrained model, let us fine-tune it for our purpose.

## Fine-tuning the model

We are using the deepset/roberta-base-squad2 model, which is designed for question answering tasks. It is based on the RoBERTa model, a variant of BERT (Bidirectional Encoder Representations from Transformers). BERT and RoBERTa are models designed to understand the context and relationships among words.
- RoBERTa: RoBERTa stands for "A Robustly Optimized BERT Pretraining Approach". It is an optimized version of BERT: it is trained on more data and for more iterations than BERT. RoBERTa also modifies key hyperparameters in BERT, including removing the next-sentence prediction pretraining objective and training with much larger mini-batches and learning rates.
- squad2: the model was fine-tuned on SQuAD 2.0. SQuAD stands for Stanford Question Answering Dataset; version 2.0 is an extension of SQuAD 1.1 that adds unanswerable questions. A model trained on this dataset therefore not only needs to answer questions but also has to determine whether a question is answerable from the provided context.
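To illustrate what answerable and unanswerable questions look like in a SQuAD 2.0 style record, here is a small sketch. The field names (`context`, `qas`, `question`, `is_impossible`, `answers`, `answer_start`, `text`) follow the format used for the training data later in this notebook; the record contents and the helper `check_record` are made up for illustration.

```python
# Hypothetical record: one answerable and one unanswerable question
example = {
    "context": "The satellite was launched in 2020 from Florida.",
    "qas": [
        {
            "id": "q1",
            "question": "When was the satellite launched?",
            "is_impossible": False,
            "answers": [{"text": "2020", "answer_start": 30}],
        },
        {
            "id": "q2",
            "question": "Who built the satellite?",
            "is_impossible": True,
            "answers": [],
        },
    ],
}


def check_record(record):
    """Return a list of problems found in one record (an empty list means it looks consistent)."""
    problems = []
    for qa in record["qas"]:
        if qa["is_impossible"]:
            if qa["answers"]:
                problems.append(f"{qa['id']}: unanswerable question has answers")
            continue
        for ans in qa["answers"]:
            start = ans["answer_start"]
            # answer_start is a character offset into the context
            if record["context"][start:start + len(ans["text"])] != ans["text"]:
                problems.append(f"{qa['id']}: answer_start does not match the answer text")
    return problems


print(check_record(example))
```

A check like this is worth running on hand-built datasets, since a wrong `answer_start` silently produces wrong training labels.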

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
from transformers import RobertaTokenizerFast

#Reference: https://huggingface.co/deepset/roberta-base-squad2

model_name = "deepset/roberta-base-squad2"

#Load model & tokenizer
#model = AutoModelForQuestionAnswering.from_pretrained(model_name)
#tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = RobertaTokenizerFast.from_pretrained(model_name)

# AutoModelForQuestionAnswering infers the correct model class from the checkpoint's configuration, so the same code works with other model architectures.
# RobertaTokenizerFast is the fast tokenizer for RoBERTa models. The "fast" tokenizers are implemented in Rust and are faster than the pure-Python tokenizers. They also provide extra functionality such as alignment (offset mapping) between the original and tokenized text.

In [None]:
#looking at RoBerta Question Answering
model

How to fine-tune a QA model:
- We need a GPU
- We need to build a training script


In [None]:
# Load our own training dataset
import requests
import json

url = "https://raw.githubusercontent.com/AlinZohari/InformationExtraction/main/data/QA_model/train.json"
response = requests.get(url)
train = response.json()

In [None]:
#looking at the train dataset
train

In [None]:
# Load our own validation dataset
import requests
import json

url = "https://raw.githubusercontent.com/AlinZohari/InformationExtraction/main/data/QA_model/validation.json"
response = requests.get(url)
validation = response.json()

In [None]:
# Look at the validation dataset
validation

## Preprocess the data

In [None]:
!pip install datasets

In [None]:
# The tokenizer must be defined before preprocessing (it was already loaded above):
#from transformers import RobertaTokenizerFast
#tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
# A fast tokenizer (BertTokenizerFast / RobertaTokenizerFast) is required because
# return_offsets_mapping is not available with the Python (slow) tokenizers.

In [None]:
import pandas as pd
from datasets import Dataset

def preprocess_function(examples):
    questions = []
    contexts = []
    answers = []

    for i in range(len(examples['context'])):
        context = examples['context'][i]
        qas = examples['qas'][i]

        for qa in qas:
            questions.append(qa['question'].strip())
            contexts.append(context)
            if not qa['is_impossible']:
                ans = qa['answers'][0]
                answers.append({'answer_start': [ans['answer_start']], 'text': [ans['text']]})
            else:
                answers.append({'answer_start': [None], 'text': [None]})

    inputs = tokenizer(
        questions,
        contexts,
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")

    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer['answer_start'][0]
        end_char = start_char + len(answer['text'][0]) if answer['text'][0] else None
        sequence_ids = inputs.sequence_ids(i)

        if start_char is None or end_char is None:
            start_positions.append(0)
            end_positions.append(0)
        else:
            idx = 0
            while sequence_ids[idx] != 1:
                idx += 1
            context_start = idx
            while sequence_ids[idx] == 1:
                idx += 1
            context_end = idx - 1

            if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
                start_positions.append(0)
                end_positions.append(0)
            else:
                idx = context_start
                while idx <= context_end and offset[idx][0] <= start_char:
                    idx += 1
                start_positions.append(idx - 1)

                idx = context_end
                while idx >= context_start and offset[idx][1] >= end_char:
                    idx -= 1
                end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions

    return inputs

# Convert lists to Dataset objects
train_dataset = Dataset.from_pandas(pd.DataFrame(train))
validation_dataset = Dataset.from_pandas(pd.DataFrame(validation))

# Apply preprocess_function
tokenized_train = train_dataset.map(preprocess_function, batched=True, remove_columns=train_dataset.column_names)
tokenized_validation = validation_dataset.map(preprocess_function, batched=True, remove_columns=validation_dataset.column_names)
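
The core of `preprocess_function` is converting an answer's character span into token indices using the offset mapping. The sketch below shows the same idea without a real tokenizer: `whitespace_offsets` is a stand-in that treats each whitespace-separated word as one token, and both helper names as well as the example sentence are assumptions made for illustration.

```python
import re


def whitespace_offsets(text):
    """Stand-in for a fast tokenizer's offset mapping: (start_char, end_char) per word."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(offsets, start_char, end_char):
    """Return (start_token, end_token) indices covering the characters [start_char, end_char)."""
    start_token = 0
    # Skip tokens that end before the answer starts
    while start_token < len(offsets) and offsets[start_token][1] <= start_char:
        start_token += 1
    end_token = len(offsets) - 1
    # Skip tokens that start after the answer ends
    while end_token >= 0 and offsets[end_token][0] >= end_char:
        end_token -= 1
    return start_token, end_token


context = "The operator may deploy 120 satellites in total."  # made-up sentence
answer = "120 satellites"
start_char = context.index(answer)
end_char = start_char + len(answer)

offsets = whitespace_offsets(context)
start_token, end_token = char_span_to_token_span(offsets, start_char, end_char)
words = context.split()
print(words[start_token:end_token + 1])
```

With a real tokenizer the offsets refer to subword tokens and the question tokens come first, which is why `preprocess_function` also uses `sequence_ids` to locate where the context starts and ends.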


In [None]:
tokenized_train

In [None]:
train_dataset

The DefaultDataCollator is a class from the transformers library that collates individual samples into batches for training or evaluation. A model is usually trained on mini-batches rather than on the entire dataset at once, and the data collator is responsible for combining the individual samples into these mini-batches.

The DefaultDataCollator will:

- Group the features of the samples in a batch by key (for example input_ids, attention_mask, start_positions and end_positions).
- Convert the batch into PyTorch tensors.

It does not pad the inputs. That is not a problem here, because preprocess_function already pads every sample with padding="max_length".
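
Collation itself can be sketched in plain Python. In this notebook the examples reach the collator already padded to `max_length=384` by the tokenizer, so the sketch only checks that the lengths agree and groups the values by key. The function `collate` and the toy features are illustrative assumptions; the real collator additionally converts the result to tensors.

```python
def collate(features):
    """Turn a list of per-example dicts into one dict of per-key lists (a batch)."""
    lengths = {len(f["input_ids"]) for f in features}
    if len(lengths) != 1:
        raise ValueError("examples must already have the same length")
    keys = features[0].keys()
    return {key: [f[key] for f in features] for key in keys}


# Two toy examples, already padded to length 4 (1 is used as the padding id here)
features = [
    {"input_ids": [0, 11, 2, 1], "attention_mask": [1, 1, 1, 0], "start_positions": 1, "end_positions": 1},
    {"input_ids": [0, 12, 13, 2], "attention_mask": [1, 1, 1, 1], "start_positions": 1, "end_positions": 2},
]

batch = collate(features)
print(batch["input_ids"])
print(batch["start_positions"])
```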

In [None]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

## Training

In [None]:
from transformers import TrainingArguments, Trainer
#model = AutoModelForQuestionAnswering.from_pretrained(model_name)

`metric = load_metric("squad")` loads the SQuAD (Stanford Question Answering Dataset) evaluation metric. This metric computes the Exact Match (EM) and F1 scores, which are commonly used for evaluating question answering models.

- Exact Match (EM): This is the simplest metric. It measures the percentage of predictions that match any one of the ground truth answers exactly.
- F1 Score: This is a more complex metric that considers the token overlap between the prediction and the ground truth answer. It is the harmonic mean of precision and recall.
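
The two scores can be sketched for a single prediction in plain Python. This follows the usual SQuAD-style recipe (lowercase, remove punctuation and articles, then compare tokens); the function names and example strings are chosen here for illustration, and the sketch is not the metric implementation loaded by `load_metric`.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return float(normalize(prediction) == normalize(truth))


def f1(prediction, truth):
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)  # share of predicted tokens that are correct
    recall = num_same / len(truth_tokens)    # share of true tokens that were predicted
    return 2 * precision * recall / (precision + recall)


print(exact_match("The 12 orbital planes", "12 orbital planes"))
print(f1("roughly 12 orbital planes total", "12 orbital planes"))
```

The second prediction contains the whole answer plus two extra words, so recall is 1.0, precision is 0.6, and the F1 score is 0.75 even though the exact match score would be 0.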

In [None]:
#defining training argument
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)


In [None]:
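# load_metric was removed from recent datasets releases; the evaluate package
# (pip install evaluate) provides the same "squad" metric
try:
    from evaluate import load as load_metric
except ImportError:
    from datasets import load_metric
import numpy as np


metric = load_metric("squad")

def compute_metrics(p):
    # Get the model's predictions
    start_logits, end_logits = p.predictions
    start_preds = np.argmax(start_logits, axis=1)
    end_preds = np.argmax(end_logits, axis=1)

    # Get the ground truth labels
    start_labels, end_labels = p.label_ids

    # Convert the predictions and labels to the format expected by the metric.
    # tokenized_validation no longer has an 'answers' column, so both the predicted
    # and the reference text are decoded from the token ids; the example index serves as id.
    predictions, references = [], []
    for i, input_ids in enumerate(tokenized_validation['input_ids']):
        pred_text = tokenizer.decode(input_ids[start_preds[i]:end_preds[i] + 1], skip_special_tokens=True).strip()
        ref_text = tokenizer.decode(input_ids[start_labels[i]:end_labels[i] + 1], skip_special_tokens=True).strip()
        predictions.append({'id': str(i), 'prediction_text': pred_text})
        # answer_start is required by the metric's schema but is not used for scoring
        references.append({'id': str(i), 'answers': {'answer_start': [0], 'text': [ref_text]}})

    # Compute the metric
    result = metric.compute(predictions=predictions, references=references)

    return result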

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_validation,
    data_collator=data_collator,
    #compute_metrics=compute_metrics,
)

trainer.train()


The TrainOutput object returned by `trainer.train()` contains some information about the training process. The values quoted below come from one recorded run:

- global_step: The total number of training steps completed (18).
- training_loss: The average training loss over the run (3.016).
- metrics: A dictionary containing some additional metrics:
  - train_runtime: The total runtime of the training in seconds (11.33).
  - train_samples_per_second: The number of samples processed per second (11.118).
  - train_steps_per_second: The number of training steps completed per second (1.588).
  - total_flos: The total number of floating-point operations performed during training (24,692,543,511,552).
  - train_loss: The same value as training_loss (3.016).
  - epoch: The total number of epochs completed (3, which suggests this run was recorded with num_train_epochs=3 rather than the 10 configured above).
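
The reported numbers can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation; the training set size is not stated anywhere in the output, so the value derived here is an inference from the reported throughput, not a reported figure.

```python
import math

# Values reported in the TrainOutput above
global_step = 18
epochs = 3
train_runtime = 11.33        # seconds
samples_per_second = 11.118
batch_size = 8               # per_device_train_batch_size, single device assumed

steps_per_epoch = global_step // epochs
samples_seen = round(samples_per_second * train_runtime)
samples_per_epoch = samples_seen // epochs  # inferred number of training examples

print(steps_per_epoch, samples_seen, samples_per_epoch)

# Consistency check: that many examples at batch size 8 need exactly this many steps per epoch
assert math.ceil(samples_per_epoch / batch_size) == steps_per_epoch
```

Under these assumptions the run saw about 126 samples in total, which corresponds to roughly 42 training examples per epoch and 6 steps per epoch.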

In [None]:
# Evaluate the model
results = trainer.evaluate()

print(results)

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')


# Save model and tokenizer
model.save_pretrained("/content/gdrive/MyDrive/tuned_model")
tokenizer.save_pretrained("/content/gdrive/MyDrive/tuned_model")



## Using tuned model

In [None]:
import os

# List the contents of the directory
os.listdir('/content/gdrive/MyDrive/tuned_model')


In [None]:
print(os.path.abspath('/content/gdrive/MyDrive/tuned_model'))


In [None]:
from transformers import AutoModelForQuestionAnswering, RobertaTokenizer, RobertaTokenizerFast, RobertaForQuestionAnswering
import torch


# Load the saved model and tokenizer
#model = RobertaForQuestionAnswering.from_pretrained("/content/gdrive/MyDrive/tuned_model")
model = AutoModelForQuestionAnswering.from_pretrained("/content/gdrive/MyDrive/tuned_model")
tokenizer = RobertaTokenizerFast.from_pretrained("/content/gdrive/MyDrive/tuned_model")

# Read context from a .txt file
import requests

url = "https://raw.githubusercontent.com/AlinZohari/InformationExtraction/main/data/authorize_doc/Kuiper_FCC-20-102A1.txt"
response = requests.get(url)
context = response.text

#define the questions
questions = [
    "What's the name of the satellite constellation the company seeks to deploy or operate?",
    "On which date was the document released?",
    "By which date must the company launch and operate half of its satellites?",
    "By which date is the company expected to have all its satellites operational?",
    "How many satellites is the company authorized to deploy and operate for this constellation?",
    "At which authorized altitudes will the company deploy its satellites?",
    "What are the authorized satellite inclinations within the corresponding altitudes?",
    "How many orbital planes, corresponding to given altitudes and inclinations, has the company been authorized for?",
    "How many satellites are allocated to each orbital plane?",
    "How many satellites, for each altitude and inclination, are there across all matching orbital planes?",
    "What is the satellite's expected operational lifetime in years?"
]

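# Function to ask a single question
def ask_question(question, context):
    # Let the tokenizer split the context into overlapping windows of at most
    # 512 tokens (question and special tokens included). Slicing the string
    # directly would cut by characters, not tokens.
    # stride=128 (the overlap between windows) is an arbitrary choice.
    encodings = tokenizer(
        question,
        context,
        max_length=512,
        truncation="only_second",
        stride=128,
        return_overflowing_tokens=True,
    )

    answers = []

    model.eval()
    for input_ids, attention_mask in zip(encodings["input_ids"], encodings["attention_mask"]):
        with torch.no_grad():
            outputs = model(
                input_ids=torch.tensor([input_ids]),
                attention_mask=torch.tensor([attention_mask]),
            )
        answer_start = int(torch.argmax(outputs.start_logits))
        answer_end = int(torch.argmax(outputs.end_logits))
        # Drop special tokens so that "no answer" windows (pointing at <s>) decode to an empty string
        answer = tokenizer.decode(input_ids[answer_start:answer_end + 1], skip_special_tokens=True).strip()
        if answer:
            answers.append(answer)

    # Combine the answers from each chunk
    full_answer = ' '.join(answers)

    return full_answer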

# Ask each question
answers = [ask_question(question, context) for question in questions]

# Print the answers
for question, answer in zip(questions, answers):
    print(f'Question: {question}')
    print(f'Answer: {answer}\n')

