**Downloading Packages**

Please change the runtime to a T4 GPU before running this Colab notebook.

In [27]:
!pip install transformers
!pip install accelerate -U
!pip install datasets rouge_score evaluate

Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Installing collected packages: responses, evaluate
Successfully installed evaluate-0.4.1 responses-0.18.0


In [28]:
import transformers
import torch

print(transformers.__version__)

4.35.2


**Importing Dataset from HuggingFace**

In [29]:
from datasets import load_dataset

cosmosDF = load_dataset("cosmos_qa")

In [30]:
cosmosDF

DatasetDict({
    train: Dataset({
        features: ['id', 'context', 'question', 'answer0', 'answer1', 'answer2', 'answer3', 'label'],
        num_rows: 25262
    })
    test: Dataset({
        features: ['id', 'context', 'question', 'answer0', 'answer1', 'answer2', 'answer3', 'label'],
        num_rows: 6963
    })
    validation: Dataset({
        features: ['id', 'context', 'question', 'answer0', 'answer1', 'answer2', 'answer3', 'label'],
        num_rows: 2985
    })
})

In [31]:
cosmosDF["train"][0]

{'id': '3Q9SPIIRWJKVQ8244310E8TUS6YWAC##34V1S5K3GTZMDUBNBIGY93FLDOB690##A1S1K7134S2VUC##Blog_1044056##q1_a1##3XU9MCX6VQQG7YPLCSAFDPQNH4GR20',
 'context': "Good Old War and person L : I saw both of these bands Wednesday night , and they both blew me away . seriously . Good Old War is acoustic and makes me smile . I really can not help but be happy when I listen to them ; I think it 's the fact that they seemed so happy themselves when they played .",
 'question': 'In the future , will this person go to see other bands play ?',
 'answer0': 'None of the above choices .',
 'answer1': 'This person likes music and likes to see the show , they will see other bands play .',
 'answer2': 'This person only likes Good Old War and Person L , no other bands .',
 'answer3': 'Other Bands is not on tour and this person can not see them .',
 'label': 1}

In [32]:
cosmosDF["validation"][0]

{'id': '3BFF0DJK8XA7YNK4QYIGCOG1A95STE##3180JW2OT5AF02OISBX66RFOCTG5J7##A2LTOS0AZ3B28A##Blog_56156##q1_a1##378G7J1SJNCDAAIN46FM2P7T6KZEW2',
 'context': 'Do i need to go for a legal divorce ? I wanted to marry a woman but she is not in the same religion , so i am not concern of the marriage inside church . I will do the marriage registered with the girl who i am going to get married . But legally will there be any complication , like if the other woman comes back one day , will the girl who i am going to get married now will be in trouble or Is there any complication ?',
 'question': 'Why is this person asking about divorce ?',
 'answer0': 'If he gets married in the church he wo nt have to get a divorce .',
 'answer1': 'He wants to get married to a different person .',
 'answer2': 'He wants to know if he does nt like this girl can he divorce her ?',
 'answer3': 'None of the above choices .',
 'label': 1}

**Creating a Function to better see the data**

In [33]:
def cleanOutput(example):
    print(f"Context: {example['context']}")
    print(f"Question: {example['question']}")
    print(f"  A - {example['answer0']}")
    print(f"  B - {example['answer1']}")
    print(f"  C - {example['answer2']}")
    print(f"  D - {example['answer3']}")
    print(f"\nGround truth: option {['A', 'B', 'C', 'D'][example['label']]}")

In [34]:
cleanOutput(cosmosDF["train"][0])

Context: Good Old War and person L : I saw both of these bands Wednesday night , and they both blew me away . seriously . Good Old War is acoustic and makes me smile . I really can not help but be happy when I listen to them ; I think it 's the fact that they seemed so happy themselves when they played .
Question: In the future , will this person go to see other bands play ?
  A - None of the above choices .
  B - This person likes music and likes to see the show , they will see other bands play .
  C - This person only likes Good Old War and Person L , no other bands .
  D - Other Bands is not on tour and this person can not see them .

Ground truth: option B


In [35]:
cleanOutput(cosmosDF["validation"][0])

Context: Do i need to go for a legal divorce ? I wanted to marry a woman but she is not in the same religion , so i am not concern of the marriage inside church . I will do the marriage registered with the girl who i am going to get married . But legally will there be any complication , like if the other woman comes back one day , will the girl who i am going to get married now will be in trouble or Is there any complication ?
Question: Why is this person asking about divorce ?
  A - If he gets married in the church he wo nt have to get a divorce .
  B - He wants to get married to a different person .
  C - He wants to know if he does nt like this girl can he divorce her ?
  D - None of the above choices .

Ground truth: option B


**Setting up Tokenizer**

In [36]:
modelName = "bert-base-uncased"
batchSize = 8

In [37]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(modelName, use_fast=True)

In [38]:
answerOptions = ["answer0", "answer1", "answer2", "answer3"]

# Build one (context, question + answer) pair for each of the four answer options (answer0 to answer3).
def preprocess_function(examples):
    # Repeat each context 4 times, once per answer option of the question.
    firstContext = [[context] * 4 for context in examples["context"]]
    # Grab all answers and combine with the question.
    questionHeaders = examples["question"]
    questionAnswer = [[f"{header} {examples[end][i]}" for end in answerOptions] for i, header in enumerate(questionHeaders)]

    # Flatten everything
    firstContext = sum(firstContext, [])
    questionAnswer = sum(questionAnswer, [])

    # Tokenize
    tokenizedExamples = tokenizer(firstContext, questionAnswer, truncation=True)
    # Un-flatten
    return {k: [v[i:i+4] for i in range(0, len(v), 4)] for k, v in tokenizedExamples.items()}

In [39]:
examples = cosmosDF["train"][:5]
features = preprocess_function(examples)
print(len(features["input_ids"]), len(features["input_ids"][0]), [len(x) for x in features["input_ids"][0]])

5 4 [89, 101, 98, 97]
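
The flatten and un-flatten steps in `preprocess_function` can be checked without a tokenizer. The snippet below is a minimal sketch on a made-up two-example batch; `toy_batch`, `build_pairs`, and `regroup` are hypothetical names introduced only for illustration and are not part of the notebook.

```python
# Toy batch with the same column layout as cosmos_qa (all values are made up).
answer_options = ["answer0", "answer1", "answer2", "answer3"]
toy_batch = {
    "context": ["ctx one", "ctx two"],
    "question": ["q one ?", "q two ?"],
    "answer0": ["a", "e"],
    "answer1": ["b", "f"],
    "answer2": ["c", "g"],
    "answer3": ["d", "h"],
}


def build_pairs(examples):
    """Return flat context and 'question answer' lists, 4 entries per example."""
    contexts = [c for c in examples["context"] for _ in range(4)]
    endings = [
        f"{question} {examples[option][i]}"
        for i, question in enumerate(examples["question"])
        for option in answer_options
    ]
    return contexts, endings


def regroup(flat, size=4):
    """Split a flat list back into consecutive groups of `size`."""
    return [flat[i:i + size] for i in range(0, len(flat), size)]


contexts, endings = build_pairs(toy_batch)
grouped = regroup(list(zip(contexts, endings)))

# 2 examples x 4 options = 8 flat pairs, regrouped into 2 groups of 4.
print(len(contexts), len(grouped), len(grouped[0]))
print(grouped[1][2])
```

Each group keeps the answer options in their original order, which is what allows the integer `label` to index the correct choice later on.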


Checking if the tokenizer works properly

In [40]:
idx = 3
[tokenizer.decode(features["input_ids"][idx][i]) for i in range(4)]

['[CLS] so, last day in seattle, and my flight was at 1 : 30. i got to chit chat with my old manager ( more like a mentor ), and left seattle feeling really good and inspired.. [SEP] why did i chit chat with my old manager? because my flight was at 1 : 30. [SEP]',
 '[CLS] so, last day in seattle, and my flight was at 1 : 30. i got to chit chat with my old manager ( more like a mentor ), and left seattle feeling really good and inspired.. [SEP] why did i chit chat with my old manager? because i left seattle feeling really good and inspired. [SEP]',
 "[CLS] so, last day in seattle, and my flight was at 1 : 30. i got to chit chat with my old manager ( more like a mentor ), and left seattle feeling really good and inspired.. [SEP] why did i chit chat with my old manager? because it's my last day in seattle. [SEP]",
 '[CLS] so, last day in seattle, and my flight was at 1 : 30. i got to chit chat with my old manager ( more like a mentor ), and left seattle feeling really good and inspired.. 

In [41]:
cleanOutput(cosmosDF["train"][3])

Context: So , last day in Seattle , and my flight was at 1:30 . I got to chit chat with my old manager ( more like a mentor ) , and left Seattle feeling really good and inspired . .
Question: Why did I chit chat with my old manager ?
  A - Because my flight was at 1:30 .
  B - Because I left Seattle feeling really good and inspired .
  C - Because it 's my last day in Seattle .
  D - Because I enjoy talking to him .

Ground truth: option D


In [42]:
encodedDF = cosmosDF.map(preprocess_function, batched=True)  # use the tokenizer to encode the whole dataset

**Importing BERT from HuggingFace**

In [43]:
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

model = AutoModelForMultipleChoice.from_pretrained(modelName)

Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [57]:
model_name = modelName.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-cosmos",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batchSize,
    per_device_eval_batch_size=batchSize,
    num_train_epochs=2,
    weight_decay=0.01,
)

Creating a Custom Data Collator

In [45]:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union

# The standard data collator cannot dynamically pad these nested per-choice inputs, so we create our own
@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features]
        flattened_features = sum(flattened_features, [])

        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch
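
The flatten, pad, and un-flatten logic of the collator can be illustrated with plain Python lists. This is only a sketch of the idea, not the collator itself: `pad_and_group` is a hypothetical helper, the token IDs are made up, and a pad ID of 0 is assumed.

```python
def pad_and_group(features, pad_id=0):
    """Pad every choice to the longest sequence in the batch.

    features: list of examples, each a list of per-choice token-id lists.
    Returns (input_ids, attention_mask), both shaped
    [batch_size][num_choices][longest].
    """
    num_choices = len(features[0])
    # Flatten: one row per (example, choice) pair.
    flat = [choice for example in features for choice in example]
    longest = max(len(ids) for ids in flat)
    padded = [ids + [pad_id] * (longest - len(ids)) for ids in flat]
    mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in flat]

    def group(rows):
        # Un-flatten: consecutive rows belong to the same example.
        return [rows[i:i + num_choices] for i in range(0, len(rows), num_choices)]

    return group(padded), group(mask)


# Two examples with two choices each (made-up token IDs).
toy = [
    [[101, 7, 102], [101, 7, 8, 9, 102]],
    [[101, 5, 102], [101, 5, 6, 102]],
]
input_ids, attention_mask = pad_and_group(toy)
print(input_ids[0][0])
print(attention_mask[1][1])
```

Padding to the longest sequence in the current batch, rather than to a fixed maximum, keeps short batches small, which is the point of dynamic padding.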

In [46]:
accepted_keys = ["input_ids", "attention_mask", "label"]
features = [{k: v for k, v in encodedDF["test"][i].items() if k in accepted_keys} for i in range(10)]
batch = DataCollatorForMultipleChoice(tokenizer)(features)

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Checking if the padding is correct

In [47]:
[tokenizer.decode(batch["input_ids"][8][i].tolist()) for i in range(4)]

["[CLS] bertrand berry has been announced as out for this sunday's game with the new york jets. of course that comes as no surprise as he left the washington game early and did not practice yesterday. his groin is now officially listed as partially torn. [SEP] what might happen if his groin is not healed in good time? none of the above choices. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]",
 "[CLS] bertrand berry has been announced as out for this sunday's game with the new york jets. of course that comes as no surprise as he left the washington game early and did not practice yeste

In [48]:
cleanOutput(cosmosDF["test"][8])

Context: Bertrand Berry has been announced as out for this Sunday 's game with the New York Jets . Of course that comes as no surprise as he left the Washington game early and did not practice yesterday . His groin is now officially listed as partially torn .
Question: What might happen if his groin is not healed in good time ?
  A - None of the above choices .
  B - He will be used regardless because he can play his position even with a light injury
  C - He will be benched for the rest of the season because of his injury
  D - He will play through the injury because he is essential to the team

Ground truth: option D


Importing the F1 score for metric evaluation

In [58]:
import evaluate
import numpy as np

metric = evaluate.load("f1")

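def compute_metrics(eval_pred):
    # eval_pred holds the per-choice logits and the ground-truth labels
    logits, labels = eval_pred
    # the predicted answer is the choice with the highest logit
    predictions = np.argmax(logits, axis=-1)

    f1 = metric.compute(predictions=predictions, references=labels, average="micro")["f1"]
    return {"f1": f1}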
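
Since each question has exactly one predicted and one reference label, the micro-averaged F1 used here should coincide with plain accuracy. The sketch below checks this on made-up predictions with hand-written helpers (`micro_f1` and `accuracy` are illustrative names, not functions from the `evaluate` library).

```python
def micro_f1(predictions, references):
    """Micro-averaged F1: pool TP, FP and FN over all labels, then compute F1."""
    labels = set(predictions) | set(references)
    tp = fp = fn = 0
    for label in labels:
        for p, r in zip(predictions, references):
            tp += int(p == label and r == label)
            fp += int(p == label and r != label)
            fn += int(p != label and r == label)
    return 2 * tp / (2 * tp + fp + fn)


def accuracy(predictions, references):
    """Fraction of positions where prediction and reference agree."""
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up labels for six questions with four answer options (0 to 3).
preds = [1, 0, 3, 1, 2, 3]
refs = [1, 2, 3, 0, 2, 3]

print(micro_f1(preds, refs), accuracy(preds, refs))
```

Every wrong prediction counts once as a false positive (for the predicted label) and once as a false negative (for the true label), so the pooled F1 reduces to correct / total.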

Tried the ROUGE score, but it did not work because the predictions were not in the proper format

In [52]:
'''import numpy as np
import evaluate

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)  #decoding the predicted values
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)  #decoding labels into normal form to compare with predictions

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)  #comparing the two to get the ROUGE score

    #returning all 4 types of outputs, rounded to 4 decimals
    return {k: round(v, 4) for k, v in result.items()}'''

In [59]:
trainer = Trainer(
    model,
    args,
    train_dataset=encodedDF["train"],
    eval_dataset=encodedDF["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer),
    compute_metrics=compute_metrics,
)

In [60]:
trainer.train()

Epoch,Training Loss,Validation Loss,F1
1,0.7346,0.975613,0.60067
2,0.4478,1.358816,0.61005


TrainOutput(global_step=6316, training_loss=0.46965995145946304, metrics={'train_runtime': 1600.1515, 'train_samples_per_second': 31.575, 'train_steps_per_second': 3.947, 'total_flos': 1.4475549628100736e+16, 'train_loss': 0.46965995145946304, 'epoch': 2.0})
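
The reported `global_step=6316` can be reproduced from the dataset size and the training arguments. This is a small worked check, assuming a single device, no gradient accumulation, and that the last partial batch of each epoch is kept.

```python
import math

train_rows = 25262  # num_rows of the train split shown earlier
batch_size = 8      # per_device_train_batch_size
epochs = 2          # num_train_epochs

# The final, smaller batch of an epoch still counts as one optimizer step.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```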

In [61]:
prediction = trainer.predict(encodedDF["validation"])  # using the validation set since the test split doesn't contain any labels

In [62]:
print(prediction.metrics)

{'test_loss': 1.3588157892227173, 'test_f1': 0.6100502512562814, 'test_runtime': 30.943, 'test_samples_per_second': 96.468, 'test_steps_per_second': 12.087}
