<a href="https://colab.research.google.com/github/TanzeelAbbas/DL_Files/blob/main/GPT_2_on_Squad_2_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Load SQuAD dataset**

In [39]:
# ! pip install transformers datasets evaluate
from datasets import load_dataset

squad = load_dataset("squad", split="train[:600]")

In [40]:
# Split the dataset’s train split into a train and test set with the train_test_split method

squad = squad.train_test_split(test_size=0.2)

squad["train"][2]

{'id': '5733cbdad058e614000b628f',
 'title': 'University_of_Notre_Dame',
 'context': 'The "Notre Dame Victory March" is the fight song for the University of Notre Dame. It was written by two brothers who were Notre Dame graduates. The Rev. Michael J. Shea, a 1904 graduate, wrote the music, and his brother, John F. Shea, who earned degrees in 1906 and 1908, wrote the original lyrics. The lyrics were revised in the 1920s; it first appeared under the copyright of the University of Notre Dame in 1928. The chorus is, "Cheer cheer for old Notre Dame, wake up the echos cheering her name. Send a volley cheer on high, shake down the thunder from the sky! What though the odds be great or small, old Notre Dame will win over all. While her loyal sons are marching, onward to victory!"',
 'question': 'Who is responsible for writing the music for "Notre Dame Victory March?"',
 'answers': {'text': ['Rev. Michael J. Shea'], 'answer_start': [149]}}
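
The `answer_start` field is a character offset into `context`, so slicing the context at that offset for the length of the answer text should reproduce the answer. The following is a minimal check of that relationship using the first sentences of the record above; the variable names are illustrative and the context is shortened for brevity.

```python
# First three sentences of the example record's context (shortened for brevity).
context = (
    'The "Notre Dame Victory March" is the fight song for the University of '
    "Notre Dame. It was written by two brothers who were Notre Dame graduates. "
    "The Rev. Michael J. Shea, a 1904 graduate, wrote the music, and his "
    "brother, John F. Shea, who earned degrees in 1906 and 1908, wrote the "
    "original lyrics."
)
answers = {"text": ["Rev. Michael J. Shea"], "answer_start": [149]}

start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])  # exclusive end offset

# Slicing the context with the character offsets recovers the answer text.
recovered = context[start_char:end_char]
print(recovered)
```

The preprocessing step later converts these character offsets into token positions, which is what the model is trained to predict.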

In [3]:
from transformers import AutoTokenizer
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")

model = AutoModelForQuestionAnswering.from_pretrained("gpt2")

# GPT-2 has no padding token, and "[PAD]" is not in its vocabulary,
# so reuse the end-of-sequence token for padding.
tokenizer.pad_token = tokenizer.eos_token


Some weights of GPT2ForQuestionAnswering were not initialized from the model checkpoint at gpt2 and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    # Report the size of the batch passed in by map()
    print(f"Total examples: {len(offset_mapping)}")

    for i, offset in enumerate(offset_mapping):
        # Log progress for each example in the batch
        print(f"Processing example {i + 1}/{len(offset_mapping)}")
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the first and last tokens that belong to the context (sequence id 1)
        idx = 0
        while idx < len(sequence_ids) and sequence_ids[idx] != 1:
            idx += 1
        context_start = idx

        while idx < len(sequence_ids) and sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
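
The span-labeling logic can be tried without downloading a tokenizer. The sketch below is a toy illustration, not the notebook's pipeline: it assumes a whitespace "tokenizer" built with `re.finditer`, and the helper name `char_span_to_token_span` is hypothetical. It applies the strict containment test that the "not fully inside the context" comment describes, then runs the same two scans to find the start and end tokens.

```python
import re

def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map a character span in the context to inclusive token indices."""
    # Locate the first and last tokens that belong to the context (sequence id 1).
    context_start = sequence_ids.index(1)
    context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

    # Answer not fully inside the (possibly truncated) context: label (0, 0).
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Advance past every token that starts at or before start_char.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk back past every token that ends at or after end_char.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token

question = "Who wrote it?"
context = "The Rev. Michael J. Shea wrote the music"
answer = "Michael J. Shea"

# Toy whitespace "tokenizer": (start, end) character offsets of each token.
q_offsets = [m.span() for m in re.finditer(r"\S+", question)]
c_offsets = [m.span() for m in re.finditer(r"\S+", context)]
offsets = q_offsets + c_offsets
sequence_ids = [0] * len(q_offsets) + [1] * len(c_offsets)

start_char = context.index(answer)
end_char = start_char + len(answer)
span = char_span_to_token_span(offsets, sequence_ids, start_char, end_char)
tokens = question.split() + context.split()
print(span, tokens[span[0]:span[1] + 1])
```

Note that GPT-2 adds no special token at position 0, so a (0, 0) label points at the first question token rather than at a dedicated "no answer" token.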

In [5]:
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

Map:   0%|          | 0/480 [00:00<?, ? examples/s]

Total examples: 480
Processing example 1/480
Processing example 2/480
Processing example 3/480
Processing example 4/480
Processing example 5/480
Processing example 6/480
Processing example 7/480
Processing example 8/480
Processing example 9/480
Processing example 10/480
Processing example 11/480
Processing example 12/480
Processing example 13/480
Processing example 14/480
Processing example 15/480
Processing example 16/480
Processing example 17/480
Processing example 18/480
Processing example 19/480
Processing example 20/480
Processing example 21/480
Processing example 22/480
Processing example 23/480
Processing example 24/480
Processing example 25/480
Processing example 26/480
Processing example 27/480
Processing example 28/480
Processing example 29/480
Processing example 30/480
Processing example 31/480
Processing example 32/480
Processing example 33/480
Processing example 34/480
Processing example 35/480
Processing example 36/480
Processing example 37/480
Processing example 38/480
P

Map:   0%|          | 0/120 [00:00<?, ? examples/s]

Total examples: 120
Processing example 1/120
Processing example 2/120
Processing example 3/120
Processing example 4/120
Processing example 5/120
Processing example 6/120
Processing example 7/120
Processing example 8/120
Processing example 9/120
Processing example 10/120
Processing example 11/120
Processing example 12/120
Processing example 13/120
Processing example 14/120
Processing example 15/120
Processing example 16/120
Processing example 17/120
Processing example 18/120
Processing example 19/120
Processing example 20/120
Processing example 21/120
Processing example 22/120
Processing example 23/120
Processing example 24/120
Processing example 25/120
Processing example 26/120
Processing example 27/120
Processing example 28/120
Processing example 29/120
Processing example 30/120
Processing example 31/120
Processing example 32/120
Processing example 33/120
Processing example 34/120
Processing example 35/120
Processing example 36/120
Processing example 37/120
Processing example 38/120
P

In [6]:
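import os

# HF_HOME only sets the Hugging Face cache directory; it does not authenticate.
# Never hard-code an access token in a notebook. If you need to push to the Hub,
# export the token as the HF_TOKEN environment variable or run
# `huggingface-cli login` instead.
os.environ["HF_HOME"] = "/root/.huggingface"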

In [7]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

# **Model Training**

1. Define the training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save the model. To push the model to the Hub, also set push_to_hub=True (you need to be signed in to Hugging Face to upload the model); the arguments below leave it unset.
2. Pass the training arguments to Trainer along with the model, datasets, tokenizer, and data collator.
3. Call train() to fine-tune the model.


In [12]:
# !pip install accelerate -U

training_args = TrainingArguments(
    output_dir="/content",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=20,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 3.0099        | 3.84375         |
| 2     | 2.849         | 3.654203        |
| 3     | 2.7855        | 3.558265        |


TrainOutput(global_step=90, training_loss=2.9058094024658203, metrics={'train_runtime': 118.8954, 'train_samples_per_second': 12.111, 'train_steps_per_second': 0.757, 'total_flos': 282200497274880.0, 'train_loss': 2.9058094024658203, 'epoch': 3.0})
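
The step count and throughput in `TrainOutput` can be checked by hand. This short calculation assumes a single device and no gradient accumulation (neither is stated in the notebook); the variable names are illustrative.

```python
import math

num_train_examples = 480   # 80% of the 600 loaded examples
batch_size = 16            # per_device_train_batch_size
num_epochs = 3             # num_train_epochs
train_runtime = 118.8954   # seconds, taken from TrainOutput

# One optimizer step per batch, assuming one device and no gradient accumulation.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
global_step = steps_per_epoch * num_epochs

# Every example is seen once per epoch.
samples_per_second = num_train_examples * num_epochs / train_runtime

print(global_step)                   # 90
print(round(samples_per_second, 3))  # 12.111
```

Both values agree with the reported `global_step=90` and `train_samples_per_second` of 12.111.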

# **Evaluate Model**

In [13]:
# Define your data collator
data_collator = DefaultDataCollator()

# Define evaluation arguments
evaluation_args = TrainingArguments(
    per_device_eval_batch_size=16,  # Adjust batch size for evaluation if needed
    output_dir="./evaluation_results",  # Specify an output directory for evaluation results
)

# Create a Trainer for evaluation
eval_trainer = Trainer(
    model=model,
    args=evaluation_args,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Evaluate the model on the test dataset
eval_results = eval_trainer.evaluate(tokenized_squad["test"])

# Print the evaluation results
print(eval_results)

{'eval_loss': 3.55826473236084, 'eval_runtime': 2.9999, 'eval_samples_per_second': 40.001, 'eval_steps_per_second': 2.667}
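
The evaluation above reports only the loss. Extractive QA is usually also scored with exact match and token-level F1 between the predicted and reference answers. The sketch below is a simplified pure-Python version written in the spirit of the SQuAD metric, not the official scorer (the `evaluate` library ships one as `evaluate.load("squad")`); the function names are illustrative.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset overlap
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A prediction that contains the reference plus extra words scores 0 on
# exact match but earns partial credit on F1.
prediction = "Rev. Michael J. Shea, a 1904 graduate"
reference = "Rev. Michael J. Shea"
em = exact_match(prediction, reference)
f1 = f1_score(prediction, reference)
print(em, round(f1, 3))
```

Averaging these two scores over the test split gives a more interpretable picture of answer quality than the loss alone.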


# **Predict Answer**

In [41]:
context = 'The "Notre Dame Victory March" is the fight song for the University of Notre Dame. It was written by two brothers who were Notre Dame graduates. The Rev. Michael J. Shea, a 1904 graduate, wrote the music, and his brother, John F. Shea, who earned degrees in 1906 and 1908, wrote the original lyrics. The lyrics were revised in the 1920s; it first appeared under the copyright of the University of Notre Dame in 1928. The chorus is, "Cheer cheer for old Notre Dame, wake up the echos cheering her name. Send a volley cheer on high, shake down the thunder from the sky! What though the odds be great or small, old Notre Dame will win over all. While her loyal sons are marching, onward to victory!'
question = "Who is responsible for writing the music for 'Notre Dame Victory March?'"

# Tokenize the question and context
inputs = tokenizer(question, context, return_tensors="pt")

# Move the inputs to the same device as the model
inputs = {key: value.to(model.device) for key, value in inputs.items()}

# Generate predictions
with torch.no_grad():
    outputs = model(**inputs)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits

# Find the answer span
start_idx = torch.argmax(start_logits)
end_idx = torch.argmax(end_logits)

# Convert indices to Python integers
start_idx = start_idx.item()
end_idx = end_idx.item()

# Convert the input ids (question + context) back to tokens and extract the answer span
context_tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
answer_tokens = context_tokens[start_idx:end_idx + 1]
answer = tokenizer.convert_tokens_to_string(answer_tokens)

print(answer)


 Rev. Michael J. Shea, a 1904 graduate, wrote the music, and his brother, John F. Shea
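
The predicted span runs well past the reference answer because the start and end positions are chosen by two independent argmax calls, with no constraint tying them together. A common remedy is to score start/end pairs jointly, keeping only pairs where the end does not precede the start and the span is not too long. The following is a minimal sketch over plain Python lists; the function name `best_span` and the `max_answer_len` and `top_k` defaults are assumptions for illustration, not part of the notebook.

```python
def best_span(start_logits, end_logits, context_start, max_answer_len=30, top_k=20):
    """Return the (start, end) token pair with the highest combined score."""
    n = len(start_logits)
    candidates = range(context_start, n)  # only consider context tokens
    top_starts = sorted(candidates, key=lambda i: start_logits[i], reverse=True)[:top_k]
    top_ends = sorted(candidates, key=lambda i: end_logits[i], reverse=True)[:top_k]

    best, best_score = None, float("-inf")
    for s in top_starts:
        for e in top_ends:
            # Skip spans that end before they start or exceed the length limit.
            if e < s or e - s + 1 > max_answer_len:
                continue
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Toy logits: the independent argmax picks start=2 and end=5 (a 4-token span).
start_logits = [0.1, 0.2, 5.0, 0.3, 0.1, 0.0]
end_logits = [0.0, 0.1, 0.2, 4.0, 0.3, 4.5]

short = best_span(start_logits, end_logits, context_start=1, max_answer_len=2)
unconstrained = best_span(start_logits, end_logits, context_start=1)
print(short, unconstrained)
```

With the model above, the logits could be passed in as lists, for example `outputs.start_logits[0].tolist()`, with `context_start` set to the index of the first context token.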
