<a href="https://www.kaggle.com/code/aisuko/multiple-choice-nlp?scriptVersionId=164693074" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

A multiple choice task is similar to question answering, except that several candidate answers are provided along with a context, and the model is trained to select the correct one. Here we fine-tune BERT on the `regular` configuration of the SWAG dataset to pick the best ending given some context and four options.

In [1]:
%%capture
!pip install transformers==4.35.2
!pip install datasets==2.15.0
!pip install evaluate==0.4.1

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tuning BERT"
os.environ["WANDB_NOTES"] = "Fine tuning the BERT model"
os.environ["WANDB_NAME"] = "ft-bert-with-swag"

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# Loading Dataset

In [3]:
from datasets import load_dataset

swag=load_dataset("swag", "regular", split="train[:500]") # keep only the first 500 training examples to fit limited compute
swag

Generating train split:   0%|          | 0/73546 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/20006 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/20005 [00:00<?, ? examples/s]

Dataset({
    features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
    num_rows: 500
})

In [4]:
swag=swag.train_test_split(test_size=0.2)

train_dataset=swag['train']
validate_dataset=swag['test']

print(train_dataset)
print(validate_dataset)

Dataset({
    features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
    num_rows: 400
})
Dataset({
    features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
    num_rows: 100
})


In [5]:
swag["train"][0]

{'video-id': 'anetv_kZMDKbfIis0',
 'fold-ind': '9501',
 'startphrase': 'He begins playing a game of curling. Two people',
 'sent1': 'He begins playing a game of curling.',
 'sent2': 'Two people',
 'gold-source': 'gen',
 'ending0': 'sing with him and give reactions as he is snowboarding.',
 'ending1': 'are sitting on a couch next to him.',
 'ending2': 'are shown again teaching in the hockey.',
 'ending3': 'watch in a circle as they continue playing.',
 'label': 3}

The important fields:

* `sent1` and `sent2`: these fields show how a sentence starts; joined together, they form the `startphrase` field.

* `ending0` to `ending3`: four candidate endings for the sentence, only one of which is correct.

* `label`: the index (0 to 3) of the correct ending.
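
To make the field layout concrete, the sketch below rebuilds the four candidate sentences from the record printed above. It is plain Python with no library calls, and the helper name `build_candidates` is hypothetical rather than part of the notebook.

```python
# Record copied from the swag["train"][0] output above (only the fields used here).
example = {
    "sent1": "He begins playing a game of curling.",
    "sent2": "Two people",
    "ending0": "sing with him and give reactions as he is snowboarding.",
    "ending1": "are sitting on a couch next to him.",
    "ending2": "are shown again teaching in the hockey.",
    "ending3": "watch in a circle as they continue playing.",
    "label": 3,
}

def build_candidates(record):
    """Hypothetical helper: join sent1, sent2 and each ending into a full candidate."""
    start = record["sent1"] + " " + record["sent2"]
    return [start + " " + record["ending" + str(i)] for i in range(4)]

candidates = build_candidates(example)
for i, text in enumerate(candidates):
    # Flag the candidate whose index matches the label.
    marker = "<- correct" if i == example["label"] else ""
    print(i, text, marker)
```

The model never sees the joined string directly; as the next section shows, `sent1` and `sent2` plus ending are passed to the tokenizer as a sentence pair.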

# Preprocess

The next step is to load a BERT tokenizer to process the sentence starts and the four possible endings:

In [6]:
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer)

BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
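
As a rough illustration of what the tokenizer produces for a sentence pair, the sketch below lays out `[CLS] A [SEP] B [SEP]` with segment ids 0 and 1. It assumes whitespace splitting in place of BERT's WordPiece tokenization, and `encode_pair` is a hypothetical helper, so the tokens differ from the real tokenizer's output; only the layout carries over.

```python
def encode_pair(first, second):
    """Hypothetical stand-in for tokenizer(first, second); splits on whitespace."""
    a = first.lower().split()
    b = second.lower().split()
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    # Segment 0 covers [CLS], the first sentence and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    # No padding yet, so every position is attended to.
    attention_mask = [1] * len(tokens)
    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

encoded = encode_pair(
    "He begins playing a game of curling.",
    "Two people are sitting on a couch next to him.",
)
print(encoded["tokens"])
print(encoded["token_type_ids"])
```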


The preprocessing function needs to:
* Make four copies of the `sent1` field, one for each candidate ending.
* Combine `sent2` with each of the four possible endings to form the second sentence of each pair.
* Flatten these two lists so they can be tokenized, then unflatten the result so each example has corresponding `input_ids`, `token_type_ids`, and `attention_mask` entries alongside its `label`.

In [7]:
ending_names=["ending0","ending1","ending2","ending3"]

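def preprocess_function(examples):
    # Repeat each context four times, once per candidate ending.
    first_sentences=[[context]*4 for context in examples["sent1"]]
    question_headers=examples["sent2"]
    # Prefix every candidate ending with the start of the second sentence.
    second_sentences=[
        [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)
    ]

    # Flatten to plain lists of strings so the tokenizer can process them as pairs.
    first_sentences=sum(first_sentences, [])
    second_sentences=sum(second_sentences, [])

    tokenized_examples=tokenizer(first_sentences, second_sentences, truncation=True)
    # Regroup the flat outputs into chunks of four, one chunk per example.
    return {k:[v[i:i+4] for i in range(0, len(v),4)] for k,v in tokenized_examples.items()}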

tokenized_swag=swag.map(preprocess_function, batched=True)
tokenized_swag

DatasetDict({
    train: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 400
    })
    test: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 100
    })
})
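
The regrouping step at the end of `preprocess_function` can be checked on toy data. The sketch below uses made-up integer lists in place of real token ids (an assumption for illustration) and shows that slicing a flat list in steps of four restores one group per example.

```python
# Toy stand-in for tokenizer output: 2 examples x 4 choices = 8 flat encodings.
# The integers are made up; real values would be BERT token ids.
flat_input_ids = [[101, i, 102] for i in range(8)]

num_choices = 4
grouped = [
    flat_input_ids[i : i + num_choices]
    for i in range(0, len(flat_input_ids), num_choices)
]

print(len(grouped))     # number of examples
print(len(grouped[0]))  # choices per example
print(grouped[1][0])    # first choice of the second example
```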

We need to adapt `DataCollatorWithPadding` to build batches of multiple choice examples. It is more efficient to dynamically pad the sentences to the longest length in a batch during collation than to pad the whole dataset to the maximum length. The `DataCollatorForMultipleChoice` class defined below flattens all the model inputs, applies padding, and then unflattens the results:

In [8]:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch


@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int]=None
    pad_to_multiple_of: Optional[int]=None
    
    def __call__(self, features):
        label_name ="label" if "label" in features[0].keys() else "labels"
        labels=[feature.pop(label_name) for feature in features]
        batch_size=len(features)
        num_choices=len(features[0]["input_ids"])
        flattened_features=[
            [{k:v[i] for k,v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features=sum(flattened_features, [])
        
        batch=self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        
        batch={k:v.view(batch_size, num_choices, -1) for k,v in batch.items()}
        batch["labels"]=torch.tensor(labels, dtype=torch.int64)
        return batch
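
To see what flatten, pad, and unflatten do to the shapes, here is a dependency-free sketch of the same idea. It uses nested Python lists rather than tensors, pads with 0 (the `[PAD]` id shown in the tokenizer output above), and `pad_multiple_choice` is a hypothetical helper, not the collator itself.

```python
def pad_multiple_choice(batch_input_ids, pad_id=0):
    """Hypothetical helper: pad every choice to the longest sequence in the batch.

    batch_input_ids has shape [batch][num_choices][variable length].
    """
    num_choices = len(batch_input_ids[0])
    # Flatten to [batch * num_choices][variable length].
    flat = [choice for example in batch_input_ids for choice in example]
    # Dynamic padding: the target length comes from this batch only.
    max_len = max(len(seq) for seq in flat)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in flat]
    # Unflatten back to [batch][num_choices][max_len].
    return [padded[i : i + num_choices] for i in range(0, len(padded), num_choices)]

# Made-up token ids: two examples with two choices each.
toy_batch = [
    [[101, 7, 102], [101, 7, 8, 102]],
    [[101, 9, 102], [101, 9, 10, 11, 102]],
]
padded_batch = pad_multiple_choice(toy_batch)
print(padded_batch[0][0])
```

Because the target length is taken from the batch at hand, a batch of short sentences stays short instead of being padded out to the tokenizer's 512-token limit.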

# Evaluate


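Accuracy is the proportion of correct predictions among the total number of cases processed. For a binary classifier it can be written as:

$\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$

where

- TP: true positives
- TN: true negatives
- FP: false positives
- FN: false negatives

For the four-way choice in SWAG, the same idea reduces to the fraction of examples whose predicted ending index matches `label`.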

In [9]:
import evaluate
import numpy as np

accuracy=evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels=eval_pred
    predictions=np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
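
The metric is simple enough to spell out by hand. The sketch below mirrors `compute_metrics` in plain Python on made-up logits (the numbers and the `accuracy_from_logits` name are illustrative assumptions): take the argmax over the choice axis and count the matches.

```python
def accuracy_from_logits(logits, labels):
    """Hypothetical helper: fraction of rows whose highest-scoring choice equals the label."""
    predictions = [row.index(max(row)) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up scores for 4 examples with 4 choices each.
toy_logits = [
    [0.1, 0.2, 0.3, 2.0],  # argmax is index 3
    [1.5, 0.2, 0.1, 0.0],  # argmax is index 0
    [0.0, 0.9, 0.3, 0.1],  # argmax is index 1
    [0.4, 0.1, 0.8, 0.2],  # argmax is index 2
]
toy_labels = [3, 0, 2, 2]

print(accuracy_from_logits(toy_logits, toy_labels))
```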


# Training

In [10]:
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

model=AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")
print(model.config)

Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.35.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



In [11]:
training_args=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    gradient_checkpointing=True,
    fp16=True,
    # Weight decay used by the AdamW optimizer; it shrinks the weights at each step,
    # which helps reduce overfitting.
    weight_decay=0.01,
    push_to_hub=False,
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME"),
)


trainer=Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_swag["train"],
    eval_dataset=tokenized_swag["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)


trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33murakiny[0m ([33mcausal_language_trainer[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.16.3 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.16.1
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20240228_122323-adkpa3wg[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mft-bert-with-swag[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/causal_language_trainer/Fine-tuning%20BERT[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/causal_language_trainer/Fine-tuning%20BERT/runs/adkpa3wg[0m
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the

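| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | No log | 1.177856 | 0.5 |
| 2 | No log | 0.993526 | 0.62 |
| 3 | No log | 1.027028 | 0.58 |
| 4 | No log | 1.030641 | 0.58 |
| 5 | No log | 1.046641 | 0.62 |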




TrainOutput(global_step=65, training_loss=0.622317387507512, metrics={'train_runtime': 189.86, 'train_samples_per_second': 10.534, 'train_steps_per_second': 0.342, 'total_flos': 235942720440576.0, 'train_loss': 0.622317387507512, 'epoch': 5.0})
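
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes the run used two devices, which is an inference from `global_step=65` and not something the log states; with one device the same settings would give 125 steps.

```python
import math

train_examples = 400
per_device_batch = 16
epochs = 5
train_runtime = 189.86  # seconds, from TrainOutput

def total_steps(num_devices):
    """Optimizer steps over the whole run for a given device count (no gradient accumulation)."""
    effective_batch = per_device_batch * num_devices
    return math.ceil(train_examples / effective_batch) * epochs

print(total_steps(1))  # single device
print(total_steps(2))  # assumed two devices

# Throughput figures, recomputed from the runtime.
print(round(train_examples * epochs / train_runtime, 3))  # compare with train_samples_per_second
print(round(total_steps(2) / train_runtime, 3))           # compare with train_steps_per_second
```

The step count also explains the `No log` entries in the training loss column: the run finishes well before the Trainer's default logging interval of 500 steps.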

In [12]:
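import math

eval_results = trainer.evaluate()
# Note: for this classifier, exp(eval_loss) is the exponentiated cross-entropy over the
# four choices, not a language-modeling perplexity. Accuracy is in eval_results["eval_accuracy"].
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")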



Perplexity: 2.70
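
For a four-way classifier, the exponentiated cross-entropy is easier to read against the chance baseline. The sketch below takes the epoch-2 validation loss from the training table, on the assumption that `load_best_model_at_end=True` restored that checkpoint (it has the lowest validation loss), and compares it with uniform guessing over four endings.

```python
import math

best_eval_loss = 0.993526  # epoch 2 validation loss from the training table
num_choices = 4

# Cross-entropy of a model that assigns probability 1/4 to every ending.
chance_loss = math.log(num_choices)

print(f"exp(eval_loss):   {math.exp(best_eval_loss):.2f}")
print(f"chance loss:      {chance_loss:.3f}")
print(f"exp(chance loss): {math.exp(chance_loss):.2f}")
```

Read this way, a value below 4 means the model does better than uniform guessing, while a value well above 1 means it still spreads probability across several endings.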


In [13]:
kwargs={
    'model_name': f'{os.getenv("WANDB_NAME")}',
    'finetuned_from': 'bert-base-uncased',
    'tasks': 'multiple-choice',
    'dataset_tags':'text-classification',
    'dataset':'swag'
}

tokenizer.push_to_hub(os.getenv("WANDB_NAME"))
trainer.push_to_hub(**kwargs)

'https://huggingface.co/aisuko/ft-bert-with-swag/tree/main/'

# Inference

In [14]:
prompt = "France has a bread law, Le Décret Pain, with strict rules on what is allowed in a traditional baguette."
candidate1 = "The law does not apply to croissants and brioche."
candidate2 = "The law applies to baguettes."

In [15]:
import torch
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained(os.getenv("WANDB_NAME"))
inputs=tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="pt", padding=True)
labels=torch.tensor(0).unsqueeze(0)

In [16]:
from transformers import AutoModelForMultipleChoice

model=AutoModelForMultipleChoice.from_pretrained(os.getenv("WANDB_NAME"))
outputs=model(**{k: v.unsqueeze(0) for k,v in inputs.items()}, labels=labels)
logits=outputs.logits

In [17]:
predicted_class=logits.argmax().item()
predicted_class

1
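
The raw logits can be turned into per-candidate probabilities with a softmax. The sketch below uses made-up logit values, since the notebook does not print them, and a hypothetical `softmax` helper written in plain Python; with the real tensor, `torch.softmax(logits, dim=-1)` does the same job.

```python
import math

def softmax(scores):
    """Hypothetical helper: numerically stable softmax over a list of floats."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for candidate1 and candidate2; not taken from the run above.
toy_logits = [0.2, 1.1]
probs = softmax(toy_logits)
predicted = probs.index(max(probs))

print(probs)
print(predicted)
```

Index 0 corresponds to `candidate1` and index 1 to `candidate2`, matching the order in which the pairs were passed to the tokenizer.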