## HW3 Question Answering on SQuAD with BERT
The objective of this assignment is to introduce you to the BERT model and its application to the Question Answering (QA) task. Please note that all required code implementations are marked with "TODO".

Install some libraries first.

In [None]:
!pip install transformers[torch]
!pip install accelerate -U
!pip install datasets



### SQuAD Dataset
Download the SQuAD dataset. To keep training time short, we load only the first 20% of the train and validation splits.

In [None]:
from datasets import load_dataset

In [None]:
full_datasets = {"train": load_dataset("squad", split="train[:20%]"),
            "validation": load_dataset("squad", split="validation[:20%]")}

Below are the structure of the `full_datasets` object and some train/val examples from the dataset. Each data sample contains a unique id, a title, a context, a question, and the answers.

In [None]:
full_datasets

{'train': Dataset({
     features: ['id', 'title', 'context', 'question', 'answers'],
     num_rows: 17520
 }),
 'validation': Dataset({
     features: ['id', 'title', 'context', 'question', 'answers'],
     num_rows: 2114
 })}

In [None]:
print("Example from the training subset:")
print("Context: ", full_datasets["train"][0]["context"])
print("Question: ", full_datasets["train"][0]["question"])
print("Answer: ", full_datasets["train"][0]["answers"])

Example from the training subset:
Context:  Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
Question:  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Answer:  {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}


In [None]:
print("Example from the training subset:")
print("Context: ", full_datasets["train"][1]["context"])
print("Question: ", full_datasets["train"][1]["question"])
print("Answer: ", full_datasets["train"][1]["answers"])

Example from the training subset:
Context:  Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
Question:  What is in front of the Notre Dame Main Building?
Answer:  {'text': ['a copper statue of Christ'], 'answer_start': [188]}


In [None]:
print("Example from the validation subset:")
print("Context: ", full_datasets["validation"][0]["context"])
print("Question: ", full_datasets["validation"][0]["question"])
print("Answer: ", full_datasets["validation"][0]["answers"])

Example from the validation subset:
Context:  Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
Question:  Which NFL team represented the AFC at Super Bowl 50?
Answer:  {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answer_start': [177, 177, 177]}
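As the examples above suggest, `answer_start` is a character offset into `context`, so the answer text can be recovered by slicing. The snippet below is a minimal sketch of that relationship; the `sample` record and the `answer_span` helper are hypothetical and only mimic the layout of a SQuAD record.

```python
# Hypothetical record in the same layout as a SQuAD sample (not taken from the dataset).
sample = {
    "context": "BERT was introduced in 2018 by researchers at Google.",
    "question": "When was BERT introduced?",
    "answers": {"text": ["2018"], "answer_start": [23]},
}


def answer_span(sample, k=0):
    """Return the (start_char, end_char) span of the k-th reference answer."""
    start = sample["answers"]["answer_start"][k]
    end = start + len(sample["answers"]["text"][k])
    return start, end


start, end = answer_span(sample)
recovered = sample["context"][start:end]
print(start, end, repr(recovered))
```

The same slicing logic is what the pre-processing step relies on when it converts character spans into token positions.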


The next step is to pre-process the train/val data. The pre-processing includes word tokenization and offset mapping, which records the character span in the original text that each token came from.

For word tokenization, we need to build a tokenizer. Here, we use a well-established tokenizer from BERT.

In [None]:
model_name = "prajjwal1/bert-small"

In [None]:
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained(model_name)

Below, we define the `preprocess_training_examples` and `preprocess_validation_examples` functions for pre-processing the data. They are already implemented for you, so you can call them directly.

In [None]:
max_length = 384
stride = 128
def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
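The core of `preprocess_training_examples` is the conversion of a character span into a token span using the offset mapping. The sketch below illustrates the same two scanning loops on a toy whitespace tokenizer; `whitespace_offsets` and `char_span_to_token_span` are hypothetical helpers written for this illustration, and a real BERT tokenizer would split the text into subwords instead.

```python
import re


def whitespace_offsets(text):
    """Return (token, (start_char, end_char)) pairs for a toy whitespace tokenizer."""
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to inclusive (first_token, last_token) indices."""
    # Scan forward: the start token is the last one that begins at or before start_char.
    idx = 0
    while idx < len(offsets) and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Scan backward: the end token is the leftmost one that ends at or after end_char.
    idx = len(offsets) - 1
    while idx >= 0 and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


context = "Next to the Main Building is the Basilica of the Sacred Heart."
answer = "Basilica of the Sacred Heart"
start_char = context.index(answer)
end_char = start_char + len(answer)

pairs = whitespace_offsets(context)
tokens = [tok for tok, _ in pairs]
offsets = [off for _, off in pairs]

start_token, end_token = char_span_to_token_span(offsets, start_char, end_char)
# The last token keeps the trailing period because the toy tokenizer splits on whitespace only.
print(start_token, end_token, tokens[start_token : end_token + 1])
```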

In [None]:
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs

Then we use `preprocess_training_examples` and `preprocess_validation_examples` functions to convert the raw dataset to a tokenized dataset.

In [None]:
tokenized_datasets = {"train": full_datasets['train'].map(preprocess_training_examples, batched=True, remove_columns=full_datasets["train"].column_names),
                      "validation": full_datasets['validation'].map(preprocess_validation_examples, batched=True, remove_columns=full_datasets["validation"].column_names)}

The structure of the `tokenized_datasets` object:

In [None]:
tokenized_datasets

{'train': Dataset({
     features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
     num_rows: 17674
 }),
 'validation': Dataset({
     features: ['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'example_id'],
     num_rows: 2137
 })}
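The tokenized splits contain more rows than the raw ones (17674 versus 17520 for training) because a context that does not fit into `max_length` tokens is split into several overlapping features. The snippet below is a simplified sketch of that windowing idea, assuming the overlap between consecutive windows plays the role of `stride`; `split_with_overlap` is a hypothetical helper, and it ignores the question and special tokens that the real tokenizer also has to fit into each feature.

```python
def split_with_overlap(tokens, window, overlap):
    """Split tokens into windows of at most `window` items.

    Consecutive windows share `overlap` items, so the window start
    advances by `window - overlap` each time.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start : start + window])
        if start + window >= len(tokens):
            break  # the current window already reaches the end
        start += step
    return chunks


context_tokens = list(range(10))  # stand-in for 10 context tokens
chunks = split_with_overlap(context_tokens, window=6, overlap=2)
for chunk in chunks:
    print(chunk)
```

With a window of 6 and an overlap of 2, the 10 toy tokens produce two features, and tokens 4 and 5 appear in both. An answer that straddles a window boundary therefore has a chance of being fully contained in at least one feature.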

Below we show a training sample. The raw question and context are printed first, followed by the `input_ids` and, to make them interpretable, the same ids decoded back to words. As you can see, the `input_ids` contain both the question and the context. The last four rows are `token_type_ids`, `attention_mask`, and the start and end positions of the answer.

In [None]:
print("Question: ", full_datasets["train"][0]["question"])
print("Context: ", full_datasets["train"][0]["context"])
print("-----------------------------------------------")
print("input_ids:", tokenized_datasets["train"][0]["input_ids"])
print("decoded input_ids:", tokenizer.decode(tokenized_datasets["train"][0]["input_ids"]))
print("-----------------------------------------------")
print("token_type_ids: ", tokenized_datasets["train"][0]["token_type_ids"])
print("attention_mask: ", tokenized_datasets["train"][0]["attention_mask"])
print("start_positions: ", tokenized_datasets["train"][0]["start_positions"])
print("end_positions: ", tokenized_datasets["train"][0]["end_positions"])

Question:  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Context:  Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
-----------------------------------------------
input_ids: [101, 2000, 3183, 2106, 1996, 6261, 2984, 9382, 3711, 1999, 8517, 1999, 10223, 26371, 2605, 1029, 102, 6549, 2135, 1010, 1996, 2082, 2038, 1037, 3234, 2

We build `dataset` and `dataloader` below once the pre-processing is completed.

In [None]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

train_dataset = tokenized_datasets["train"]
train_dataset.set_format("torch")
eval_dataset = tokenized_datasets["validation"].remove_columns(["example_id", "offset_mapping"])
eval_dataset.set_format("torch")

train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    collate_fn=default_data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    eval_dataset,
    collate_fn=default_data_collator,
    batch_size=8
)

### Metrics
Before building the BERT model, we define the metrics used to measure the QA quality of any given model. The `compute_metrics` function first calls `modify_result` to convert the predictions and ground truths into a common format, then calls `compute_score` to compute F1 and exact match (EM). Your job is to implement these two metrics.

In [None]:
def compute_metrics(start_logits, end_logits, features, examples):
    predicted_answers, theoretical_answers = modify_result(start_logits, end_logits, features, examples)
    scores = compute_score(predicted_answers, theoretical_answers)

    return scores

In [None]:
import collections

import numpy as np  # needed by np.argsort in modify_result

n_best = 20
max_answer_length = 30
def modify_result(start_logits, end_logits, features, examples):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in examples:
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]

    predicted_answers = {prediction["id"]: prediction["prediction_text"] for prediction in predicted_answers}
    theoretical_answers = [{"answers": [{"text": answer_text} for answer_text in ref["answers"]["text"]],
                             "id": ref["id"]}
                           for ref in theoretical_answers]

    return predicted_answers, theoretical_answers
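The span search inside `modify_result` can be seen in isolation on a tiny example. The sketch below uses plain Python lists instead of NumPy arrays and a hypothetical `best_span` helper; it keeps the same two filters (end before start, and span longer than `max_answer_length`) and scores a span by the sum of its start and end logits.

```python
def best_span(start_logit, end_logit, n_best=2, max_answer_length=2):
    """Return (start, end, score) for the highest-scoring valid span, or None."""
    # Indices of the n_best largest logits, highest first.
    top_starts = sorted(range(len(start_logit)), key=lambda i: start_logit[i], reverse=True)[:n_best]
    top_ends = sorted(range(len(end_logit)), key=lambda i: end_logit[i], reverse=True)[:n_best]

    candidates = []
    for s in top_starts:
        for e in top_ends:
            # Skip spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            candidates.append((s, e, start_logit[s] + end_logit[e]))
    return max(candidates, key=lambda c: c[2]) if candidates else None


start_logit = [0.1, 2.0, 0.3, 1.5]
end_logit = [0.2, 0.1, 3.0, 1.0]
result = best_span(start_logit, end_logit)
print(result)  # (1, 2, 5.0)
```

Here the candidate pairs are (1, 2), (1, 3), (3, 2) and (3, 3). The pair (1, 3) is too long and (3, 2) ends before it starts, so the winner is (1, 2) with score 2.0 + 3.0.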

In [None]:
import collections
import re
import string
import sys
from collections import Counter

def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""

    def remove_articles(text):
        return re.sub(r"\b(a|an|the)\b", " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

def f1_score(prediction, ground_truth):
    prediction = normalize_answer(prediction)
    ground_truth = normalize_answer(ground_truth)
    prediction_tokens = prediction.split()
    ground_truth_tokens = ground_truth.split()
    common_tokens_count = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    common_count = sum(common_tokens_count.values())
    if not prediction_tokens or not ground_truth_tokens:
        return 0.0
    precision = common_count / len(prediction_tokens)
    recall = common_count / len(ground_truth_tokens)
    if precision + recall == 0:
        return 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1

def exact_match_score(prediction, ground_truth):
    prediction = normalize_answer(prediction)
    ground_truth = normalize_answer(ground_truth)
    em = (prediction == ground_truth)
    return em

def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    scores_for_ground_truths = []
    for ground_truth in ground_truths:
        score = metric_fn(prediction, ground_truth)
        scores_for_ground_truths.append(score)
    return max(scores_for_ground_truths)

def compute_score(predicted_answers, theoretical_answers):
    f1 = exact_match = total = 0
    for qa in theoretical_answers:
        total += 1
        if qa["id"] not in predicted_answers:
            message = "Unanswered question " + qa["id"] + " will receive score 0."
            print(message, file=sys.stderr)
            continue
        ground_truths = list(map(lambda x: x["text"], qa["answers"]))
        prediction = predicted_answers[qa["id"]]
        exact_match += metric_max_over_ground_truths(exact_match_score, prediction, ground_truths)
        f1 += metric_max_over_ground_truths(f1_score, prediction, ground_truths)

    exact_match = 100.0 * exact_match / total
    f1 = 100.0 * f1 / total

    return {"exact_match": exact_match, "f1": f1}
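To make the two metrics concrete, here is a small worked example. It is a self-contained sketch that re-implements the normalization and scoring in condensed form (the names `normalize`, `token_f1` and `exact_match` are chosen for this illustration), applied to one of the reference answers shown earlier.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, ground_truth):
    """Harmonic mean of token-level precision and recall after normalization."""
    pred = normalize(prediction).split()
    gold = normalize(ground_truth).split()
    if not pred or not gold:
        return 0.0
    common = sum((Counter(pred) & Counter(gold)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(gold)
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction, ground_truth):
    """True if the normalized strings are identical."""
    return normalize(prediction) == normalize(ground_truth)


gold = "a copper statue of Christ"
print(exact_match("The copper statue of Christ.", gold))  # True
print(round(token_f1("copper statue", gold), 4))          # 0.6667
```

The first prediction differs from the reference only in article, case and punctuation, so it counts as an exact match. The second prediction has precision 2/2 and recall 2/4, which gives F1 = 2 * 1.0 * 0.5 / 1.5 = 2/3.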

### BERT Model
The next and most exciting step is to instantiate a BERT model and fine-tune it for QA. The `BertForQuestionAnswering` class inherits from the pre-trained BERT model and adds a new projection head.

Your job is to:
1. Add a projection head in the `__init__` function. It can be a simple Linear layer or a 2-layer Multilayer Perceptron (MLP) with the ReLU activation function.
2. Modify the `forward` function so that the model predicts each answer's start and end indices from the features extracted by the BERT backbone.

In [None]:
from typing import List, Optional, Tuple, Union

import torch
from torch import nn
from torch.nn import CrossEntropyLoss, Linear
from transformers.models.bert.modeling_bert import BertPreTrainedModel, BertModel
from transformers.modeling_outputs import QuestionAnsweringModelOutput


class BertForQuestionAnswering(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config, add_pooling_layer=False)
        

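        # Optional: freeze the BERT backbone so that only the head is trained.
        # Disabled here because the outputs recorded below list the backbone
        # parameters as trainable.
        # for param in self.bert.parameters():
        #     param.requires_grad = False

        # Option 1: 2-layer MLP head with ReLU activation.
        # self.qa_outputs = nn.Sequential(
        #     nn.Linear(config.hidden_size, config.hidden_size),
        #     nn.ReLU(),
        #     nn.Linear(config.hidden_size, config.num_labels)
        # )

        # Option 2 (active): single Linear head. Only one head may be assigned
        # to self.qa_outputs; a second assignment would override the first.
        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)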


        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        start_positions: Optional[torch.Tensor] = None,
        end_positions: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor], QuestionAnsweringModelOutput]:
        r"""
        start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for position (index) of the start of the labelled span for computing the token classification loss.
            Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
            are not taken into account for computing the loss.
        end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for position (index) of the end of the labelled span for computing the token classification loss.
            Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
            are not taken into account for computing the loss.
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        sequence_output = outputs[0]

       

        logits = self.qa_outputs(sequence_output)

        # Split the logits to start and end logits
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)

        total_loss = None
        if start_positions is not None and end_positions is not None:
            loss_fct = CrossEntropyLoss()
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2

        if not return_dict:
            output = (start_logits, end_logits) + outputs[2:]
            return ((total_loss,) + output) if total_loss is not None else output

        return QuestionAnsweringModelOutput(
            loss=total_loss,
            start_logits=start_logits,
            end_logits=end_logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
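The loss in `forward` is the average of two cross-entropy terms, one over the start logits and one over the end logits. The snippet below is a plain-Python sketch of that arithmetic for a single example with four tokens; the `cross_entropy` helper is written for this illustration, whereas `CrossEntropyLoss` in the model additionally averages over the batch.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target index for one example."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


# Toy logits over a 4-token sequence.
start_logits = [0.0, 0.0, 0.0, 0.0]   # the model is undecided about the start
end_logits = [0.0, 0.0, 10.0, 0.0]    # the model is confident the end is token 2
start_position, end_position = 1, 2

start_loss = cross_entropy(start_logits, start_position)  # equals ln(4)
end_loss = cross_entropy(end_logits, end_position)        # close to 0
total_loss = (start_loss + end_loss) / 2
print(start_loss, end_loss, total_loss)
```

A uniform distribution over four tokens costs ln(4), while a confident and correct prediction costs almost nothing, so the averaged loss is dominated by the uncertain start prediction.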

Instantiate a BERT model and load the pre-trained parameters. Don't worry about the warning below. It appears whenever a pre-trained model is loaded with a newly initialized head that still has to be fine-tuned on a downstream task.

In [None]:
model = BertForQuestionAnswering.from_pretrained(model_name)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at prajjwal1/bert-small and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The architecture of `model`:

In [None]:
model

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 512, padding_idx=0)
      (position_embeddings): Embedding(512, 512)
      (token_type_embeddings): Embedding(2, 512)
      (LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-3): 4 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=512, out_features=512, bias=True)
              (key): Linear(in_features=512, out_features=512, bias=True)
              (value): Linear(in_features=512, out_features=512, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=512, out_features=512, bias=True)
              (LayerNorm): LayerNorm((512,), eps=1e-12, elemen

Print trainable parameters.

In [None]:
print("Trainable parameters:")
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name)

Trainable parameters:
bert.embeddings.word_embeddings.weight
bert.embeddings.position_embeddings.weight
bert.embeddings.token_type_embeddings.weight
bert.embeddings.LayerNorm.weight
bert.embeddings.LayerNorm.bias
bert.encoder.layer.0.attention.self.query.weight
bert.encoder.layer.0.attention.self.query.bias
bert.encoder.layer.0.attention.self.key.weight
bert.encoder.layer.0.attention.self.key.bias
bert.encoder.layer.0.attention.self.value.weight
bert.encoder.layer.0.attention.self.value.bias
bert.encoder.layer.0.attention.output.dense.weight
bert.encoder.layer.0.attention.output.dense.bias
bert.encoder.layer.0.attention.output.LayerNorm.weight
bert.encoder.layer.0.attention.output.LayerNorm.bias
bert.encoder.layer.0.intermediate.dense.weight
bert.encoder.layer.0.intermediate.dense.bias
bert.encoder.layer.0.output.dense.weight
bert.encoder.layer.0.output.dense.bias
bert.encoder.layer.0.output.LayerNorm.weight
bert.encoder.layer.0.output.LayerNorm.bias
bert.encoder.layer.1.attention.self

### Train & Test
Build an optimizer.

In [None]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

Use `Accelerator` to enable mixed-precision training and handle CPU/GPU device placement.

In [None]:
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Build a learning rate scheduler.

In [None]:
from transformers import get_scheduler

num_train_epochs = 5
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
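The `"linear"` schedule decays the learning rate from its initial value to zero over the course of training, after an optional warmup phase. The function below is a sketch of that rule as I understand it; `linear_schedule_lr` is a hypothetical helper, not part of `transformers`. The total of 11050 steps is the value shown by the training progress bar in this notebook.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        # Warmup: ramp up linearly from 0 to base_lr.
        return base_lr * step / max(1, num_warmup_steps)
    # Decay: shrink linearly so that the rate reaches 0 at num_training_steps.
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)


base_lr = 5e-5
total_steps = 11050
for step in (0, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step, base_lr, 0, total_steps))
```

With no warmup, the rate starts at 5e-5, is halved at the midpoint of training, and reaches zero at the final step.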

The training and validation loop.
Your job is:
1. Finish the training loop. Enable loss backpropagation, weight update, and learning rate update.
2. Finish the validation loop. Enable model inference and performance measurement.

In [None]:
import numpy as np
from tqdm.auto import tqdm

import torch


progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    
    model.train()

    for step, batch in enumerate(train_dataloader):
        batch = {k: v.to(accelerator.device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

        progress_bar.update(1)


    model.eval()
    accelerator.print("Evaluation!")
    all_start_logits = []
    all_end_logits = []
    with torch.no_grad():
        for batch in tqdm(eval_dataloader):
            batch = {k: v.to(accelerator.device) for k, v in batch.items()}
            outputs = model(**batch)
            start_logits = outputs.start_logits.detach().cpu().numpy()
            end_logits = outputs.end_logits.detach().cpu().numpy()
            all_start_logits.append(start_logits)
            all_end_logits.append(end_logits)

    start_logits = np.concatenate(all_start_logits, axis=0)
    end_logits = np.concatenate(all_end_logits, axis=0)

    metrics = compute_metrics(
        start_logits, end_logits, tokenized_datasets["validation"], full_datasets["validation"]
    )
    print(f"epoch {epoch}:", metrics)

Evaluation!
epoch 0: {'exact_match': 55.345316934720906, 'f1': 62.96278105045126}
Evaluation!
epoch 1: {'exact_match': 59.366130558183535, 'f1': 67.77400215832931}
Evaluation!
epoch 2: {'exact_match': 60.17029328287607, 'f1': 68.57782224695929}
Evaluation!
epoch 3: {'exact_match': 60.879848628193, 'f1': 69.40766776736409}
Evaluation!
epoch 4: {'exact_match': 60.17029328287607, 'f1': 68.69323529209969}
