# Fine-Tuning pre-trained T5 Question-Answering model by Christian Di Maio and Giacomo Nunziati

In [None]:
# Reinstall transformers and accelerate to work around a dependency error
# (-y skips the interactive confirmation prompt)
!pip uninstall -y transformers accelerate

!pip install transformers[torch]


In [1]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("MaRiOrOsSi/t5-base-finetuned-question-answering")
model = AutoModelForSeq2SeqLM.from_pretrained("MaRiOrOsSi/t5-base-finetuned-question-answering")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565



# Loading SQuAD v1.1 dataset from datasets library

In [None]:
# Install datasets, then restart the runtime to avoid a pyarrow dependency error
!pip install datasets

## Prepare SQuAD v1.1 dataset

In [2]:
from datasets import load_dataset

squad = load_dataset("squad")


Generating train split: 87599 examples

Generating validation split: 10570 examples

## Tokenize data

In [14]:
def preprocess_function(examples):
    # Build one input string per example: the question followed by its context
    inputs = [q + " " + c for q, c in zip(examples["question"], examples["context"])]
    model_inputs = tokenizer(inputs, max_length=512, padding="max_length", truncation=True)

    # Tokenize the targets, using the first reference answer of each example.
    # text_target replaces the deprecated tokenizer.as_target_tokenizer() context manager.
    targets = [answer["text"][0] for answer in examples["answers"]]
    labels = tokenizer(text_target=targets, max_length=64, padding="max_length", truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply the preprocessing function to the dataset
tokenized_squad = squad.map(preprocess_function, batched=True)

In [None]:
# Sanity check: decode a few tokenized examples to confirm the dataset looks as expected
# Function to decode a tokenized example back into text
def decode_example(tokenized_example):
    input_ids = tokenized_example['input_ids']
    labels = tokenized_example['labels']
    input_text = tokenizer.decode(input_ids, skip_special_tokens=True)
    target_text = tokenizer.decode(labels, skip_special_tokens=True)
    return input_text, target_text

# Inspect the first few examples
for i in range(3):
    input_text, target_text = decode_example(tokenized_squad['train'][i])
    print(f"Example {i+1}:")
    print(f"Input: {input_text}")
    print(f"Target: {target_text}")
    print("\n")

Example 1:
Input: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
Target: Saint Bernadette Soubirous


Example 2:
Input: What is in front of the Notre Dame Main Building? Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue o
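`preprocess_function` pads every target to 64 tokens and stores the padded ids directly as labels, so padding positions are scored like real answer tokens and may make the reported loss look lower than it is on the answer itself. A common remedy is to replace padding ids in the labels with -100, the value the loss ignores. The sketch below illustrates the idea in plain Python; the helper names are hypothetical, and the pad id of 0 is an assumption that should be replaced by `tokenizer.pad_token_id` in the notebook.

```python
# Assumption: the T5 tokenizer uses 0 as its pad token id; in the notebook,
# pass tokenizer.pad_token_id instead of hard-coding the value.
PAD_TOKEN_ID = 0
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in label_ids]


def unmask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Undo mask_padding so the ids can be passed to tokenizer.decode again."""
    return [tok if tok != IGNORE_INDEX else pad_token_id for tok in label_ids]


# Hypothetical label row: three answer tokens, the end-of-sequence id, then padding.
labels = [2788, 8942, 9, 1, 0, 0, 0, 0]
masked = mask_padding(labels)
print(masked)
print(unmask_padding(masked) == labels)
```

If labels are masked this way, a decoding helper such as `decode_example` needs the reverse step first, because -100 is not a valid token id.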

# Using only 5,000 training examples because of resource constraints

In [7]:
from datasets import load_dataset

# Load the SQuAD dataset
squad = load_dataset("squad")

# Shuffle the dataset
squad = squad.shuffle(seed=42)

# Take a subset of 5000 examples for training
train_dataset = squad["train"].select(range(5000))

# Take a subset of 1000 examples for validation
validation_dataset = squad["validation"].select(range(1000))

# Print sizes of subsets
print(f"Train dataset size: {len(train_dataset)}")
print(f"Validation dataset size: {len(validation_dataset)}")


Train dataset size: 5000
Validation dataset size: 1000


In [16]:
tokenized_train_dataset = train_dataset.map(preprocess_function, batched=True)
tokenized_validation_dataset = validation_dataset.map(preprocess_function, batched=True)


In [17]:
# Set up training arguments
training_args = TrainingArguments(
    # Relative path: checkpoints are written under the current working directory
    output_dir="content/drive/MyDrive/VivekaHackathon2024/QAmodels_trained",
    evaluation_strategy="epoch",      # Evaluate once per epoch (newer transformers releases call this eval_strategy)
    learning_rate=2e-5,
    per_device_train_batch_size=8,    # Adjust batch size as needed
    per_device_eval_batch_size=8,     # Adjust batch size as needed
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='content/drive/MyDrive/VivekaHackathon2024/QAlogs',
    logging_steps=100,                # Log the training loss every 100 steps
)


# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_validation_dataset,
)

# Fine-tune the model
trainer.train()

# Evaluate the model
results = trainer.evaluate()
print(results)



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.2988        | 0.219887        |
| 2     | 0.1121        | 0.074586        |
| 3     | 0.0681        | 0.068709        |


{'eval_loss': 0.06870898604393005, 'eval_runtime': 49.0059, 'eval_samples_per_second': 20.406, 'eval_steps_per_second': 2.551, 'epoch': 3.0}
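The evaluation above reports only the loss, which does not say how often the generated answer matches the reference. Question-answering results on SQuAD are usually reported as exact match and token-level F1. The following is a minimal sketch in the spirit of the SQuAD evaluation script, not a copy of it; the function names are chosen here for illustration, and the predictions would have to come from generating answers for the validation subset.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Saint Bernadette Soubirous", "saint bernadette soubirous."))
print(f1_score("Bernadette Soubirous", "Saint Bernadette Soubirous"))
```

Averaging both scores over the validation subset would give a picture of answer quality that the loss alone does not provide.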


# Inference

In [19]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Step 1: Load the fine-tuned checkpoint and a T5 tokenizer.
# output_dir was a relative path, so the checkpoint appears to sit under the
# working directory (/content) rather than on the mounted Google Drive.
model_path = '/content/content/drive/MyDrive/VivekaHackathon2024/QAmodels_trained/checkpoint-1500'
tokenizer = AutoTokenizer.from_pretrained('t5-base')
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
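The name `checkpoint-1500` refers to an optimizer step, and it is worth checking where that step falls in the run. The sketch below redoes the arithmetic for 5,000 training examples, a batch size of 8 and 3 epochs. It assumes a single device, no gradient accumulation, and a checkpoint every 500 steps (which I believe is the `TrainingArguments` default, since the notebook does not set `save_steps`); the function names are illustrative.

```python
import math


def optimizer_steps(num_examples, batch_size, epochs):
    """Steps per epoch and in total, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


def checkpoint_steps(total_steps, save_steps):
    """Steps at which a checkpoint is written when saving every save_steps steps."""
    return list(range(save_steps, total_steps + 1, save_steps))


steps_per_epoch, total = optimizer_steps(num_examples=5000, batch_size=8, epochs=3)
print(steps_per_epoch, total)               # 625 1875
print(checkpoint_steps(total, 500))         # [500, 1000, 1500]
print(1500 / steps_per_epoch)               # 2.4 epochs into training
```

Under these assumptions, `checkpoint-1500` is the last saved checkpoint but corresponds to roughly 2.4 epochs, not the end of the third epoch. Calling `trainer.save_model()` after training would store the final weights.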




In [20]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

In [21]:
import re

def clean_and_format_text(text):
    # Remove extra whitespace and newlines
    text = ' '.join(text.split())

    # Optionally, normalize punctuation (depends on tokenizer requirements)
    text = re.sub(r'([.,!?])', r' \1 ', text)
    text = re.sub(r'\s{2,}', ' ', text)  # Remove multiple spaces

    # Lowercase the text (if necessary)
    text = text.lower()

    return text
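To see what `clean_and_format_text` actually produces, it helps to run it on a short string. The sketch below repeats the same steps as the cell above so that it is self-contained; the sample sentence is made up for illustration.

```python
import re


def clean_and_format_text(text):
    # Same steps as the notebook cell: collapse whitespace, pad punctuation
    # with spaces, collapse repeated spaces, then lowercase.
    text = ' '.join(text.split())
    text = re.sub(r'([.,!?])', r' \1 ', text)
    text = re.sub(r'\s{2,}', ' ', text)
    return text.lower()


sample = "Asthma  is a chronic\ncondition. What are the symptoms?"
cleaned = clean_and_format_text(sample)
print(repr(cleaned))
print(repr(cleaned.strip()))  # strip() drops the trailing space left after the final "?"
```

Two side effects are worth keeping in mind: punctuation at the end of the text leaves a trailing space, and lowercasing changes the input relative to the cased text the model was fine-tuned on, which may affect the answers. The cells that follow pass the raw context and question to the model without calling this helper.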


In [43]:
context = '''
Title: Asthma Overview

Asthma is a chronic respiratory condition that affects the airways in the lungs. It is characterized by inflammation and narrowing of the airways, which can cause difficulty breathing, wheezing, coughing, and chest tightness. Asthma symptoms can vary in severity and may be triggered by allergens, respiratory infections, exercise, or environmental factors.

Treatment for asthma includes medications such as bronchodilators to relax the airway muscles and corticosteroids to reduce inflammation. Long-term management involves identifying triggers, maintaining good air quality indoors, and having an asthma action plan to handle exacerbations.

Severe asthma attacks may require emergency medical treatment with medications like epinephrine and immediate medical attention to restore normal breathing.

'''
question = "What are the symptoms of Asthma?"


In [44]:
input_text = f"question: {question} context: {context}"
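The prompt built here uses `question:` and `context:` prefixes, whereas `preprocess_function` joined the question and the context with a single space during fine-tuning. A model tuned on one layout may behave differently on the other, so using one formatter for both stages keeps them aligned. The helper below is a hypothetical sketch of such a formatter, not part of the original notebook.

```python
def build_input(question, context):
    """Format one example as 'question: ... context: ...' with whitespace collapsed."""
    question = " ".join(question.split())
    context = " ".join(context.split())
    return f"question: {question} context: {context}"


def build_inputs(questions, contexts):
    """Batched variant, suitable for a datasets.map(..., batched=True) function."""
    return [build_input(q, c) for q, c in zip(questions, contexts)]


example = build_input(
    "What are the symptoms of Asthma?",
    "\nTitle: Asthma Overview\n\nAsthma is a chronic respiratory condition.\n",
)
print(example)
```

The same function could replace the list comprehension in `preprocess_function` and the f-string above, so that training and inference see identically formatted text.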

In [45]:
# Generate the answer once (text2text generation), capping the output at 150 tokens
generated_text = pipe(input_text, max_length=150)

# Extract the generated answer from the output
answer = generated_text[0]['generated_text'].strip()

# Print the question, context, and answer
print("Question:", question)
print("Context:", context)
print("Answer:", answer)

Question: What are the symptoms of Asthma?
Context: 
Title: Asthma Overview

Asthma is a chronic respiratory condition that affects the airways in the lungs. It is characterized by inflammation and narrowing of the airways, which can cause difficulty breathing, wheezing, coughing, and chest tightness. Asthma symptoms can vary in severity and may be triggered by allergens, respiratory infections, exercise, or environmental factors.

Treatment for asthma includes medications such as bronchodilators to relax the airway muscles and corticosteroids to reduce inflammation. Long-term management involves identifying triggers, maintaining good air quality indoors, and having an asthma action plan to handle exacerbations.

Severe asthma attacks may require emergency medical treatment with medications like epinephrine and immediate medical attention to restore normal breathing.


Answer: difficulty breathing, wheezing, coughing, and chest tightness


# Thank You!!