# KAIST AI605 Assignment 4: Sequence and Token Classification with BERT
Instructor: Minjoon Seo (minjoon@kaist.ac.kr)

TA in charge: Seokin Seo (tzs930@kaist.ac.kr)

**Due date**: May 29 (Wed) 11:00pm, 2021

Your name: Radhika Dua

Your student ID: 2024824

Your collaborators: Seongsu Bae (Student ID: 20204363)

## Assignment Objectives
- Use BERT for sequence classification (Assignment 1)
- Use BERT for token classification (Assignment 2)

## Your Submission
Your submission will be a link to a Colab notebook that has all written answers and is fully executable. You will submit your assignment via KLMS. Use in-line LaTeX (see below) for mathematical expressions. Collaboration among students is allowed but it is not a group assignment so make sure your answer and code are your own. Also make sure to mention your collaborators in your assignment with their names and their student ids.

## Grading
The entire assignment is out of 100 points. There are two bonus questions with 40 points altogether. Your final score can be higher than 100 points.


## Environment
You will only use Python 3.7 and PyTorch 1.8, which are already available on Colab:

In [None]:
from platform import python_version
import torch

print("python", python_version())
print("torch", torch.__version__)

python 3.7.9
torch 1.7.1


## 1. Hugging Face Transformers
In this assignment, you will use the `transformers` library by Hugging Face. The library provides an easy way to use diverse pretrained language models. You will be specifically asked to redo the sequence classification (sentiment analysis) and token classification (question answering) tasks that you already did in Assignments 1 and 2.

First, install both `transformers` and `datasets` packages:

In [None]:
!pip install transformers datasets



In Lecture 17, we walked through how we can use pretrained and finetuned BERT for sequence classification (https://huggingface.co/transformers/task_summary.html#sequence-classification) and token classification (https://huggingface.co/transformers/task_summary.html#extractive-question-answering).
Recall that `bert-base-cased-finetuned-mrpc` means that you load a pretrained `bert-base-cased` model and finetune it on the `mrpc` dataset.

**Problem 1.1** *(10 points)* Put your favorite emoji here 😇
https://getemoji.com/

Your favorite emoji:  🙌

## 2. Sequence Classification with BERT
**Problem 2.1** *(20 points)* Tutorial at https://huggingface.co/transformers/training.html#fine-tuning-in-native-pytorch shows you how you can finetune a sequence classification model from `bert-base-cased` for IMDB dataset. Repeat the same process with SST-2 dataset and report the accuracy here (i.e. it's fine to copy & paste code from the documentation).

Note that you can load the SST-2 dataset via `load_dataset('glue', 'sst2')` from the `datasets` package.


## $\color{blue}{\text{Solution 2.1}}$

In [None]:
######### Load the dataset ##########
from datasets import load_dataset
dataset_sst = load_dataset('glue', 'sst2')

Reusing dataset glue (/home/radhika/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


In [None]:
from transformers import AutoTokenizer
from torch.utils.data import DataLoader

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["sentence"], padding="max_length", truncation=True)

tokenized_datasets = dataset_sst.map(tokenize_function, batched=True)
# print(tokenized_datasets)
tokenized_datasets = tokenized_datasets.remove_columns(["sentence", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
# print(tokenized_datasets)
tokenized_datasets.set_format("torch")


small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(65000))
small_eval_dataset = tokenized_datasets["validation"].shuffle(seed=42)
# train_dataset = tokenized_datasets["train"]
# eval_dataset = tokenized_datasets["validation"]

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)


In [None]:
from transformers import AutoModelForSequenceClassification
from transformers import AdamW
from transformers import get_scheduler
import torch
from tqdm.auto import tqdm
from datasets import load_metric
import time
import os 

os.environ["CUDA_VISIBLE_DEVICES"]="2"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

device

device(type='cuda', index=0)

In [None]:
###### define the model ######
model1 = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

###### define optimizer, scheduler, and hyperparameters ########
optimizer = AdamW(model1.parameters(), lr=5e-5)
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model1.to(device)

######### Train the model ###########
progress_bar = tqdm(range(num_training_steps))
# Note: time.process_time() reports the CPU time of this process, not wall-clock time
start_time = time.process_time()

model1.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model1(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

print("\nTotal training time with BERT model", time.process_time() - start_time, "seconds")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b


Total training time with BERT model 3340.532257859 seconds
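
To make the learning-rate schedule used above concrete, here is a minimal plain-Python sketch of a linear decay with zero warmup, using the same settings as the training cell (`lr=5e-5`, 3 epochs, batch size 8, 65,000 training samples). The helper `linear_lr` is a hypothetical name written for this illustration; it is assumed to approximate what `get_scheduler("linear", ...)` computes and is not the library's implementation.

```python
import math

def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Illustrative linear warmup/decay schedule (hypothetical helper)."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to base_lr
        return base_lr * step / max(1, num_warmup_steps)
    # Linear decay from base_lr to 0 over the remaining steps
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)

num_samples, batch_size, num_epochs = 65_000, 8, 3
steps_per_epoch = math.ceil(num_samples / batch_size)  # plays the role of len(train_dataloader)
num_training_steps = num_epochs * steps_per_epoch

# Learning rate at the start, the midpoint, and the end of training
for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, 5e-5, num_training_steps))
```

With these settings the learning rate starts at `5e-5` and reaches zero on the last optimizer step, so later batches move the weights less than earlier ones.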


In [None]:
######## Evaluation ##########
metric= load_metric("accuracy")
model1.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model1(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.9036697247706422}

<font color='blue'> In the above cells, a **sequence classification model** from **bert-base-cased** is finetuned on the **SST-2 dataset**. The model is trained for 3 epochs on **65,000 samples from the train split**, which is equal to the number of training examples used in Assignment 1. The model is evaluated on the **validation dataset, which comprises 872 samples**. The **accuracy** of the model on the validation set is **90.37%**.
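
As a quick sanity check on the reported number, the accuracy can be converted back into a count of correct predictions. This is a small arithmetic sketch; `correct_count` is a hypothetical helper, and the only inputs are the accuracy printed by `metric.compute()` and the validation set size of 872.

```python
def correct_count(accuracy, num_examples):
    """Recover the number of correct predictions from an accuracy value."""
    return round(accuracy * num_examples)

num_validation = 872
accuracy = 0.9036697247706422  # value printed by metric.compute() above

num_correct = correct_count(accuracy, num_validation)
num_wrong = num_validation - num_correct
print(num_correct, "correct,", num_wrong, "wrong")
print(f"{100 * num_correct / num_validation:.2f}%")
```

Because the validation set is small, each misclassified sentence changes the accuracy by roughly 0.11 percentage points.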

The dataset does not have labels for `test` data so please use `validation` data as your test data. 


**Problem 2.2** *(10 points)* How does your accuracy with BERT compare to your accuracy with LSTM in Assignment 1? How about training speed?



## $\color{blue}{\text{Solution 2.2}}$

#### $\color{blue}{\text{1. Accuracy:}}$
<font color='blue'> The accuracy of the **BERT model** on the validation split of the SST-2 dataset is **90.37%**, while the accuracy of the **LSTM model** (trained in Assignment 1) on the test set of the SST-2 dataset is **78.21%**. Hence, the **BERT model outperforms the LSTM by 12.16 percentage points**.
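
The gap can be expressed in more than one way, which is worth keeping apart when comparing models. The sketch below uses only the two accuracies quoted above; the variable names are illustrative.

```python
bert_acc = 90.37  # BERT validation accuracy (%) quoted above
lstm_acc = 78.21  # LSTM accuracy (%) from Assignment 1, as quoted above

absolute_gain = bert_acc - lstm_acc                        # in percentage points
relative_gain = 100 * absolute_gain / lstm_acc             # relative to the LSTM accuracy
error_reduction = 100 * absolute_gain / (100 - lstm_acc)   # share of LSTM errors removed

print(f"absolute gain: {absolute_gain:.2f} percentage points")
print(f"relative gain: {relative_gain:.2f}%")
print(f"error reduction: {error_reduction:.2f}%")
```

The absolute gain is the figure reported in this solution; the other two are alternative views of the same pair of numbers.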

#### $\color{blue}{\text{2. Training speed:}}$
<font color='blue'> The training time of the BERT model for 3 epochs is approximately **3340.53** seconds (as reported by the training loop above), which is much more than the time taken to train an LSTM model on SST-2 classification. Hence, the **training speed of BERT is much slower than the training speed of an LSTM model**.
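
One caveat about the measurement: the training loop uses `time.process_time()`, which counts CPU time of the current process rather than elapsed wall-clock time. The sketch below illustrates the difference with a sleeping function standing in for any period in which the process is idle; `timed` is a hypothetical helper, and how closely the two clocks agree during GPU training is not something this example measures.

```python
import time

def timed(fn):
    """Run fn() and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start

# time.sleep stands in for time during which the process is idle
wall, cpu = timed(lambda: time.sleep(0.2))
print(f"wall-clock: {wall:.3f}s, CPU: {cpu:.3f}s")
```

For comparing training speed across models, `time.perf_counter()` is usually the safer choice, since it reports elapsed time regardless of how busy the CPU was.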

**Problem 2.3** *(10 points)* Try your own sentences and find three failure cases. Explain why you think the model got them wrong.

## $\color{blue}{\text{Solution 2.3}}$

In [None]:
examples = [
    "this is a good movie",
    "ThIs Is a gOoD mOvIe",
    "IT IS A GOOD MOVIE ...",
    "THIS IS NOT A BAD MOVIE",
    "this is not a bad movie",
    "my name is khan and i am not a terrorist",
    "MY NAME IS KHAN AND I AM NOT A TERRORIST",
    "The weather is good today",
    "Today, the weather is good.",
]

# model1 is already in eval mode after the evaluation cell above
for i, text in enumerate(examples, start=1):
    # Use the cased tokenizer that matches model1 (bert-base-cased)
    encoded_input = tokenizer(text, return_tensors='pt').to(device)
    with torch.no_grad():
        output = model1(**encoded_input)
    prediction = torch.argmax(output.logits, dim=-1)
    print(f"Example {i}:", text)
    print("prediction:", prediction.item(), "\n")

Example 1: this is a good movie
prediction: 1

Example 2: ThIs Is a gOoD mOvIe
prediction: 0
 
Example 3: IT IS A GOOD MOVIE ...
prediction: 1

Example 4: THIS IS NOT A BAD MOVIE
prediction: 0 

Example 5: this is not a bad movie
prediction: 1 

Example 6: my name is khan and i am not a terrorist
prediction: 1 

Example 7: MY NAME IS KHAN AND I AM NOT A TERRORIST
prediction: 0 

Example 8: The weather is good today
prediction: 1
 
Example 9: Today, the weather is good.
prediction: 0 



<font color='blue'> **Failure cases:**
 1. **The model is case-sensitive:** The prediction changes when the letters are in upper case versus lower case. For instance, examples 1, 2, and 3 are semantically the same and differ mainly in letter case, yet when they are passed to the model the predictions vary. Example pairs (4, 5) and (6, 7) show the same failure: each pair is semantically the same but differs in case, and the model predicts different outputs. The model made wrong predictions because it treats capital and small letters differently and hence generates different results.
 2. **The model sometimes does not focus on words like "not" that change the meaning:** For instance, consider examples 3 and 4 above, "IT IS A GOOD MOVIE ..." and "THIS IS NOT A BAD MOVIE". Both sentences are semantically the same, but the model provides different predictions for them. In my opinion, the model predicted the wrong answer for one of two reasons: (1) the model focused more on local information ("BAD MOVIE"), or (2) input in capital or small letters can act as noise and distract the model from the correct information.
 3. **The model is sensitive to word order in two semantically similar, grammatically correct sentences:** If we pass two sentences that have the same meaning and are grammatically correct but differ in structure, the model predicts different answers. Here too the model fails to capture the meaning of the sentence and focuses on its structure, which leads to an incorrect prediction. For instance, consider examples 8 and 9 above. The sentences "Today, the weather is good." and "The weather is good today" are semantically the same, but the model generates different predictions. This raises questions about the reliability of the model, since its prediction changes when only the structure of the sentence changes. In my opinion, the model fails in this case because it does not learn the correct information due to distraction from inputs in capital and small letters, which may act as noise for the model.
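
The failure cases above can be summarized as a consistency check: sentences that are treated as paraphrases should receive the same label. The sketch below replays the predictions printed above for `bert-base-cased`; the grouping of examples into paraphrase sets is an assumption taken from the analysis, and `inconsistent_groups` is a hypothetical helper.

```python
# Predictions copied from the bert-base-cased outputs listed above
predictions = {1: 1, 2: 0, 3: 1, 4: 0, 5: 1, 6: 1, 7: 0, 8: 1, 9: 0}

# Examples treated as paraphrases of each other in the analysis (assumption)
paraphrase_groups = {
    "good movie variants": [1, 2, 3],
    "not a bad movie (case variants)": [4, 5],
    "khan (case variants)": [6, 7],
    "weather (word order)": [8, 9],
}

def inconsistent_groups(groups, preds):
    """Return the names of groups whose members did not all get the same label."""
    return [
        name
        for name, example_ids in groups.items()
        if len({preds[i] for i in example_ids}) > 1
    ]

flagged = inconsistent_groups(paraphrase_groups, predictions)
print(flagged)
```

A check like this does not say which prediction in a group is wrong, only that the model is not invariant to the change that separates the sentences.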

**Problem 2.4 (bonus)** *(20 points)*  Try `bert-base-uncased` and analyze if it makes any difference. What is the difference between `cased` and `uncased` in English? How about in Korean?

## $\color{blue}{\text{Solution 2.4}}$

In [None]:
from transformers import AutoTokenizer
from torch.utils.data import DataLoader

tokenizer1 = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    return tokenizer1(examples["sentence"], padding="max_length", truncation=True)

tokenized_datasets = dataset_sst.map(tokenize_function, batched=True)
print(tokenized_datasets)
tokenized_datasets = tokenized_datasets.remove_columns(["sentence", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
print(tokenized_datasets)
tokenized_datasets.set_format("torch")


small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(35000))
small_eval_dataset = tokenized_datasets["validation"].shuffle(seed=42)
train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["validation"]

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)


DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence', 'token_type_ids'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence', 'token_type_ids'],
        num_rows: 872
    })
    test: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence', 'token_type_ids'],
        num_rows: 1821
    })
})
DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'token_type_ids'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'token_type_ids'],
        num_rows: 872
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'token_type_ids'],
        num_rows: 1821
    })
})


In [None]:
from transformers import AutoModelForSequenceClassification
from transformers import AdamW
from transformers import get_scheduler
import torch
from tqdm.auto import tqdm
from datasets import load_metric
import time

###### Model ######
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

###### define optimizer, scheduler, and hyperparameters
optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

######### Training the model ###########
progress_bar = tqdm(range(num_training_steps))
# Note: time.process_time() reports the CPU time of this process, not wall-clock time
start_time = time.process_time()

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

print("\nTotal training time with bert-base-uncased model", time.process_time() - start_time, "seconds")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at


Total training time with bert-base-uncased model 1792.8268981689998 seconds


In [None]:
######## Evaluation ##########
metric= load_metric("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.9059633027522935}

In [None]:
text = "this is a good movie"
encoded_input = tokenizer1(text, return_tensors='pt').to(device)
output = model(**encoded_input)
logit = output.logits
prediction = torch.argmax(logit, dim=-1)
print("Example 1:", text)
print("prediction:", prediction.item(), "\n")

text = "ThIs Is a gOoD mOvIe"
encoded_input = tokenizer1(text, return_tensors='pt').to(device)
output = model(**encoded_input)
logit = output.logits
prediction = torch.argmax(logit, dim=-1)
print("Example 2:", text)
print("prediction:", prediction.item(), "\n")

text = "IT IS A GOOD MOVIE ..."
encoded_input = tokenizer1(text, return_tensors='pt').to(device)
output = model(**encoded_input)
logit = output.logits
prediction = torch.argmax(logit, dim=-1)
print("Example 3:", text)
print("prediction:", prediction.item(), "\n")

text = "THIS IS NOT A BAD MOVIE"
encoded_input = tokenizer1(text, return_tensors='pt').to(device)
output = model(**encoded_input)
logit = output.logits
prediction = torch.argmax(logit, dim=-1)
print("Example 4:", text)
print("prediction:", prediction.item(), "\n")


text = "this is not a bad movie"
encoded_input = tokenizer1(text, return_tensors='pt').to(device)
output = model(**encoded_input)
logit = output.logits
prediction = torch.argmax(logit, dim=-1)
print("Example 5:", text)
print("prediction:", prediction.item(), "\n")


text = "my name is khan and i am not a terrorist"
encoded_input = tokenizer1(text, return_tensors='pt').to(device)
output = model(**encoded_input)
logit = output.logits
prediction = torch.argmax(logit, dim=-1)
print("Example 6:", text)
print("prediction:", prediction.item(), "\n")


text = "MY NAME IS KHAN AND I AM NOT A TERRORIST"
encoded_input = tokenizer1(text, return_tensors='pt').to(device)
output = model(**encoded_input)
logit = output.logits
prediction = torch.argmax(logit, dim=-1)
print("Example 7:", text)
print("prediction:", prediction.item(), "\n")


text = "The weather is good today"
encoded_input = tokenizer1(text, return_tensors='pt').to(device)
output = model(**encoded_input)
logit = output.logits
prediction = torch.argmax(logit, dim=-1)
print("Example 8:", text)
print("prediction:", prediction.item(), "\n")


text = "Today, the weather is good."
encoded_input = tokenizer1(text, return_tensors='pt').to(device)
output = model(**encoded_input)
logit = output.logits
prediction = torch.argmax(logit, dim=-1)
print("Example 9:", text)
print("prediction:", prediction.item(), "\n")

Example 1: this is a good movie
prediction: 1

Example 2: ThIs Is a gOoD mOvIe
prediction: 1
 
Example 3: IT IS A GOOD MOVIE ...
prediction: 1

Example 4: THIS IS NOT A BAD MOVIE
prediction: 1 

Example 5: this is not a bad movie
prediction: 1 

Example 6: my name is khan and i am not a terrorist
prediction: 0 

Example 7: MY NAME IS KHAN AND I AM NOT A TERRORIST
prediction: 0 

Example 8: The weather is good today
prediction: 1
 
Example 9: Today, the weather is good.
prediction: 1 



<font color='blue'> In the above cells, a **sequence classification model** from **bert-base-uncased** is finetuned on the **SST-2 dataset**. The model is trained for 3 epochs on **35,000 samples from the train split** (fewer than the 65,000 samples used for bert-base-cased in Solution 2.1). The model is evaluated on the **validation dataset, which comprises 872 samples**. The **accuracy** of the model on the validation set is **90.60%**.
    
<font color='blue'> Using bert-base-uncased instead of bert-base-cased for the sentiment classification task on the SST-2 dataset does not have a significant effect on accuracy: 90.37% with bert-base-cased versus 90.60% with bert-base-uncased.
    
<font color='blue'> However, on evaluating the bert-base-uncased model qualitatively on some of my own sentences, I observe that it works better than the bert-base-cased model and mitigates or eliminates the failure cases observed in Solution 2.3.
    
<font color='blue'>The primary difference between cased and uncased in English is that the cased model treats inputs in capital and small letters differently and, as a result, provides different predictions for them. The uncased model treats inputs in capital and small letters as the same (all inputs are converted to lower case before being passed to the model) and hence provides the same prediction irrespective of the case of the input.
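
The effect of lowercasing can be illustrated without loading a model. The sketch below uses Python's `str.lower()` as a rough stand-in for the lowercasing step of an uncased tokenizer; `normalize_uncased` is a hypothetical helper, and any other normalization the real tokenizer applies is not modelled here. The Korean sentence is included because Hangul has no upper or lower case, so lowercasing leaves it unchanged.

```python
def normalize_uncased(text):
    """Rough stand-in for the lowercasing step of an uncased tokenizer (illustrative only)."""
    return text.lower()

variants = ["this is a good movie", "ThIs Is a gOoD mOvIe", "THIS IS A GOOD MOVIE"]
normalized = {normalize_uncased(v) for v in variants}
print(normalized)  # all three variants collapse to the same string

korean = "이 영화는 좋다"
print(normalize_uncased(korean) == korean)
```

Since the case variants become identical before they reach the model, an uncased model has to assign them the same label.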
    

## 3. Token Classification with BERT
**Problem 3.1** *(30 points)* Finetune your `bert-base-cased` model for `squad` question answering dataset, following a similar procedure to Problem 2.1. Report your accuracy here. For now, if the input is longer than 256, take the first 256 words as the input and truncate the rest. You are allowed to copy any code from the documentation.  *Hint*: If you are having difficulty in implementation, take a peek at  (but do not copy!) https://github.com/huggingface/transformers/tree/master/examples/pytorch/question-answering, though keep in mind that the answer extraction module there is quite complex. It is okay to keep it simple here and sacrifice the accuracy a little.



## $\color{blue}{\text{Solution 3.1}}$

In [None]:
!pip install transformers datasets accelerate
!pip install easydict



In [None]:
import math
import random
import collections
from easydict import EasyDict
from tqdm import tqdm
from typing import Optional, Tuple
import torch
import os
from torch.utils.data.dataloader import DataLoader
from datasets import load_dataset, load_metric
import numpy as np
from transformers import set_seed
from transformers import default_data_collator, DataCollatorWithPadding
from transformers import AutoConfig, AutoTokenizer, AutoModel
from transformers import AutoModelForQuestionAnswering
from transformers import AdamW, get_scheduler
from transformers import EvalPrediction
from accelerate import Accelerator

In [None]:
######### Define arguments of the model and the data ###########
model_args = EasyDict({
    'model_name_or_path': 'bert-base-cased',
    'config_name': 'bert-base-cased',
    'tokenizer_name': 'bert-base-cased',
})

data_args = EasyDict({
    'dataset_name': 'squad',
    'dataset_config_name': None,
    'max_seq_length': 256,
    'pad_to_max_length': True,
    'max_train_samples': 30000,
    'max_eval_samples': None,
})

######### Define hyperparameters ############
training_args = EasyDict({
    'seed': 100,
    'do_train': True,
    'do_eval': False,
    'do_test': False,
    'learning_rate': 3e-5,
    'num_train_epochs': 3,
    'per_device_train_batch_size': 32,
    'per_device_eval_batch_size': 8,
    'weight_decay': 0.0,
    'gradient_accumulation_steps': 1,
    'max_train_steps': None,
    'lr_scheduler_type': 'linear',
    'num_warmup_steps': 0,
})

In [None]:
############ Define the accelerator #############
accelerator = Accelerator()
set_seed(training_args.seed)

############ Load the dataset and define config, model, and tokenizer ############
datasets = load_dataset(data_args.dataset_name)

config = AutoConfig.from_pretrained(model_args.config_name)
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name)
model3 = AutoModelForQuestionAnswering.from_pretrained(model_args.model_name_or_path, config=config)

Reusing dataset squad (/home/radhika/.cache/huggingface/datasets/squad/plain_text/1.0.0/4fffa6cf76083860f85fa83486ec3028e7e32c342c218ff2a620fc6b2868483a)
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model fro

In [None]:
########### function for data preprocessing #################
def prepare_train_features(examples):
    tokenized_examples = tokenizer(
        examples[question_column_name if pad_on_right else context_column_name],
        examples[context_column_name if pad_on_right else question_column_name],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_seq_length,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length" if data_args.pad_to_max_length else False,
    )

    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized_examples.pop("offset_mapping")

    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        sequence_ids = tokenized_examples.sequence_ids(i)

        sample_index = sample_mapping[i]
        answers = examples[answer_column_name][sample_index]
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

############# Function to preprocess the validation/test data ############
def prepare_validation_features(examples):
    tokenized_examples = tokenizer(
        examples[question_column_name if pad_on_right else context_column_name],
        examples[context_column_name if pad_on_right else question_column_name],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_seq_length,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length" if data_args.pad_to_max_length else False,
    )

    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
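
The core of `prepare_train_features` is the conversion of an answer's character span into token indices using the offset mapping. The sketch below isolates that step on a toy example. It assumes a whitespace tokenizer (`whitespace_offsets`, a hypothetical helper) instead of the BERT tokenizer, and it returns `None` where the function above falls back to the `[CLS]` index.

```python
import re

def whitespace_offsets(text):
    """Toy tokenizer: one token per whitespace-separated word, as (start, end) char offsets."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_span_to_token_span(offsets, start_char, end_char):
    """Map an answer's character span to (start_token, end_token).

    Returns None when the answer is not fully covered by the given tokens,
    for example because the context was truncated.
    """
    if not offsets:
        return None
    first, last = 0, len(offsets) - 1
    if not (offsets[first][0] <= start_char and offsets[last][1] >= end_char):
        return None
    # Move past every token that starts at or before the answer start, then step back one
    while first < len(offsets) and offsets[first][0] <= start_char:
        first += 1
    # Move before every token that ends at or after the answer end, then step forward one
    while last >= 0 and offsets[last][1] >= end_char:
        last -= 1
    return first - 1, last + 1

context = "BERT was released by Google in 2018"
answer = "Google in"
start_char = context.index(answer)
end_char = start_char + len(answer)

offsets = whitespace_offsets(context)
full_span = char_span_to_token_span(offsets, start_char, end_char)
truncated_span = char_span_to_token_span(offsets[:3], start_char, end_char)
print(full_span)       # token indices of "Google" and "in"
print(truncated_span)  # answer lies outside the truncated window
```

Slicing `offsets[:3]` mimics truncation: once the answer falls outside the kept tokens, there is no valid span to train on for that feature.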

In [None]:
if training_args.do_train:
    column_names = datasets["train"].column_names
else:
    column_names = datasets["validation"].column_names

question_column_name = "question" if "question" in column_names else column_names[0]
context_column_name = "context" if "context" in column_names else column_names[1]
answer_column_name = "answers" if "answers" in column_names else column_names[2]

pad_on_right = tokenizer.padding_side == "right"
max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length)

In [None]:
train_dataset = datasets['train']
if data_args.max_train_samples is not None:
    train_dataset = train_dataset.select(range(data_args.max_train_samples))

In [None]:
############ Preprocessing the training data #############
train_dataset = train_dataset.map(
    prepare_train_features,
    batched=True,
    remove_columns=column_names,
)


In [None]:
if data_args.max_train_samples is not None:
    train_dataset = train_dataset.select(range(data_args.max_train_samples))

In [None]:
############## Processing the validation data #############
eval_examples = datasets["validation"]
if data_args.max_eval_samples is not None:
    eval_examples = eval_examples.select(range(data_args.max_eval_samples))
eval_dataset = eval_examples.map(
    prepare_validation_features,
    batched=True,
    remove_columns=column_names,
)


In [None]:
if data_args.max_eval_samples is not None:
    eval_dataset = eval_dataset.select(range(data_args.max_eval_samples))

In [None]:
############ Data loaders #############
if data_args.pad_to_max_length:
    data_collator = default_data_collator
else:
    data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=(8 if accelerator.use_fp16 else None))

train_dataloader = DataLoader(
    train_dataset, shuffle=True, collate_fn=data_collator, batch_size=training_args.per_device_train_batch_size
)

eval_dataset_for_model = eval_dataset.remove_columns(["example_id", "offset_mapping"])
eval_dataloader = DataLoader(
    eval_dataset_for_model, collate_fn=data_collator, batch_size=training_args.per_device_eval_batch_size
)

In [None]:
def configure_optimizers(model, training_args, train_dataloader):
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": training_args.weight_decay,
        },
        {
            "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
        },
    ]
    optimizer = AdamW(
        optimizer_grouped_parameters,
        lr=training_args.learning_rate,
        )
    
    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / training_args.gradient_accumulation_steps)
    if training_args.max_train_steps is None:
        training_args.max_train_steps = training_args.num_train_epochs * num_update_steps_per_epoch
    else:
        training_args.num_train_epochs = math.ceil(training_args.max_train_steps / num_update_steps_per_epoch)

    lr_scheduler = get_scheduler(
        name=training_args.lr_scheduler_type,
        optimizer=optimizer,
        num_warmup_steps=training_args.num_warmup_steps,
        num_training_steps=training_args.max_train_steps,
    )
                
    return {"optimizer": optimizer, "lr_scheduler": lr_scheduler}

config_optims = configure_optimizers(model3, training_args, train_dataloader)

optimizer = config_optims['optimizer']
lr_scheduler = config_optims['lr_scheduler']

model3, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model3, optimizer, train_dataloader, eval_dataloader
)
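
The step arithmetic in `configure_optimizers` and the shape of the learning-rate schedule can be sketched in plain Python. This is a sketch under assumptions: it assumes `lr_scheduler_type` is `"linear"` (this section does not state the value), and all numbers below are hypothetical. The helper `linear_schedule_multiplier` is an illustrative name, not a library function.

```python
import math


def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate: linear warmup, then linear decay."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    return max(0.0, (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps))


# Hypothetical values, chosen only to keep the arithmetic easy to follow.
num_batches_per_epoch = 10
gradient_accumulation_steps = 4
num_train_epochs = 3
num_warmup_steps = 3

# Same formulas as in configure_optimizers above.
num_update_steps_per_epoch = math.ceil(num_batches_per_epoch / gradient_accumulation_steps)
max_train_steps = num_train_epochs * num_update_steps_per_epoch

multipliers = [
    linear_schedule_multiplier(s, num_warmup_steps, max_train_steps)
    for s in range(max_train_steps + 1)
]
print(num_update_steps_per_epoch, max_train_steps)
print([round(m, 2) for m in multipliers])
```

The multiplier rises from 0 to 1 during warmup and then falls back to 0 at `max_train_steps`, which is why the total number of optimizer updates has to be known before the scheduler is created.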

In [None]:
def postprocess_qa_predictions(
    examples,
    features,
    predictions: Tuple[np.ndarray, np.ndarray],
    n_best_size: int = 20,
    max_answer_length: int = 30,
):
    """
    Post-processes the predictions of a question-answering model to convert them to answers that are substrings of the
    original contexts. This is the base postprocessing functions for models that only return start and end logits.
    Args:
        examples: The non-preprocessed dataset (see the main script for more information).
        features: The processed dataset (see the main script for more information).
        predictions (:obj:`Tuple[np.ndarray, np.ndarray]`):
            The predictions of the model: two arrays containing the start logits and the end logits respectively. Its
            first dimension must match the number of elements of :obj:`features`.
        n_best_size (:obj:`int`, `optional`, defaults to 20):
            The total number of n-best predictions to generate when looking for an answer.
        max_answer_length (:obj:`int`, `optional`, defaults to 30):
            The maximum length of an answer that can be generated. This is needed because the start and end predictions
            are not conditioned on one another.
    """
    assert len(predictions) == 2, "`predictions` should be a tuple with two elements (start_logits, end_logits)."
    all_start_logits, all_end_logits = predictions

    assert len(predictions[0]) == len(features), f"Got {len(predictions[0])} predictions and {len(features)} features."

    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    all_predictions = collections.OrderedDict()
    all_nbest_json = collections.OrderedDict()

    for example_index, example in enumerate(tqdm(examples)):
        feature_indices = features_per_example[example_index]

        min_null_prediction = None
        prelim_predictions = []

        for feature_index in feature_indices:
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            offset_mapping = features[feature_index]["offset_mapping"]
            token_is_max_context = features[feature_index].get("token_is_max_context", None)

            feature_null_score = start_logits[0] + end_logits[0]
            if min_null_prediction is None or min_null_prediction["score"] > feature_null_score:
                min_null_prediction = {
                    "offsets": (0, 0),
                    "score": feature_null_score,
                    "start_logit": start_logits[0],
                    "end_logit": end_logits[0],
                }

            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue
                    if token_is_max_context is not None and not token_is_max_context.get(str(start_index), False):
                        continue
                    prelim_predictions.append(
                        {
                            "offsets": (offset_mapping[start_index][0], offset_mapping[end_index][1]),
                            "score": start_logits[start_index] + end_logits[end_index],
                            "start_logit": start_logits[start_index],
                            "end_logit": end_logits[end_index],
                        }
                    )
        
        predictions = sorted(prelim_predictions, key=lambda x: x["score"], reverse=True)[:n_best_size]

        context = example["context"]
        for pred in predictions:
            offsets = pred.pop("offsets")
            pred["text"] = context[offsets[0] : offsets[1]]

        if len(predictions) == 0 or (len(predictions) == 1 and predictions[0]["text"] == ""):
            predictions.insert(0, {"text": "empty", "start_logit": 0.0, "end_logit": 0.0, "score": 0.0})

        scores = np.array([pred.pop("score") for pred in predictions])
        exp_scores = np.exp(scores - np.max(scores))
        probs = exp_scores / exp_scores.sum()

        for prob, pred in zip(probs, predictions):
            pred["probability"] = prob

        all_predictions[example["id"]] = predictions[0]["text"]

        all_nbest_json[example["id"]] = [
            {k: (float(v) if isinstance(v, (np.float16, np.float32, np.float64)) else v) for k, v in pred.items()}
            for pred in predictions
        ]

    return all_predictions
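
The core of the post-processing above (pair the top start and end indices, discard invalid spans, rank by summed logits, then apply a softmax) can be illustrated on toy logits. This is a simplified sketch in pure Python: it leaves out the offset mapping and the handling of multiple features per example, and the names `best_spans` and `softmax` are illustrative helpers rather than part of the notebook.

```python
import math


def best_spans(start_logits, end_logits, n_best_size=2, max_answer_length=3):
    """Return the highest-scoring valid (start, end) spans built from the top-n logits."""
    top_starts = sorted(range(len(start_logits)), key=lambda i: start_logits[i], reverse=True)[:n_best_size]
    top_ends = sorted(range(len(end_logits)), key=lambda i: end_logits[i], reverse=True)[:n_best_size]
    candidates = []
    for s in top_starts:
        for e in top_ends:
            # Skip spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            candidates.append({"start": s, "end": e, "score": start_logits[s] + end_logits[e]})
    candidates.sort(key=lambda c: c["score"], reverse=True)
    return candidates[:n_best_size]


def softmax(scores):
    """Numerically stable softmax, subtracting the maximum before exponentiating."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [x / total for x in exps]


# Made-up logits for a four-token input.
start_logits = [0.1, 2.0, 0.3, 1.5]
end_logits = [0.0, 0.5, 2.5, 1.0]

spans = best_spans(start_logits, end_logits)
probs = softmax([c["score"] for c in spans])
print(spans)
print(probs)
```

Because start and end logits are predicted independently, the validity checks are what prevent a high-scoring but impossible pair, such as an end token that precedes the start token, from being returned.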

In [None]:
############# Post-processing################
def post_processing_function(examples, features, predictions, stage="eval"):
    predictions = postprocess_qa_predictions(
        examples=examples,
        features=features,
        predictions=predictions,
        n_best_size= 1, 
    )
    formatted_predictions = [{"id": k, "prediction_text": v} for k, v in predictions.items()]

    references = [{"id": ex["id"], "answers": ex[answer_column_name]} for ex in examples]
    return EvalPrediction(predictions=formatted_predictions, label_ids=references)

metric = load_metric("squad")

############## Create and fill numpy array of size len_of_validation_data * max_length_of_output_tensor ####################
def create_and_fill_np_array(start_or_end_logits, dataset, max_len):

    step = 0
    logits_concat = np.full((len(dataset), max_len), -100, dtype=np.float64)
    for i, output_logit in enumerate(start_or_end_logits): 

        batch_size = output_logit.shape[0]
        cols = output_logit.shape[1]

        if step + batch_size < len(dataset):
            logits_concat[step : step + batch_size, :cols] = output_logit
        else:
            logits_concat[step:, :cols] = output_logit[: len(dataset) - step]

        step += batch_size

    return logits_concat

In [None]:
############## Train the model #################
total_batch_size = training_args.per_device_train_batch_size * accelerator.num_processes * training_args.gradient_accumulation_steps

print("############# Training the model ##############")

progress_bar = tqdm(range(training_args.max_train_steps), disable=not accelerator.is_local_main_process)
steps = 0

for epoch in range(training_args.num_train_epochs):
    model3.train()
    for i, batch in enumerate(train_dataloader):
        outputs = model3(**batch)
        loss = outputs.loss
        loss = loss / training_args.gradient_accumulation_steps
        accelerator.backward(loss)
        # `i` is the batch index from enumerate(); the original referenced an undefined name `step` here.
        if i % training_args.gradient_accumulation_steps == 0 or i == len(train_dataloader) - 1:
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)
            steps += 1

        if steps >= training_args.max_train_steps:
            break


############# Training the model ##############

  1%|          | 19/2814 [00:05<13:14,  3.52it/s]


In [None]:
############ Testing on the validation split #############
all_start_logits = []
all_end_logits = []
for i, batch in enumerate(eval_dataloader):
    with torch.no_grad():
        outputs = model3(**batch)
        start_logits = outputs.start_logits
        end_logits = outputs.end_logits

        if not data_args.pad_to_max_length:  
            start_logits = accelerator.pad_across_processes(start_logits, dim=1, pad_index=-100)
            end_logits = accelerator.pad_across_processes(end_logits, dim=1, pad_index=-100)

        all_start_logits.append(accelerator.gather(start_logits).cpu().numpy())
        all_end_logits.append(accelerator.gather(end_logits).cpu().numpy())
        
max_len = max([x.shape[1] for x in all_start_logits])

start_logits_concat = create_and_fill_np_array(all_start_logits, eval_dataset, max_len)
end_logits_concat = create_and_fill_np_array(all_end_logits, eval_dataset, max_len)

outputs_numpy = (start_logits_concat, end_logits_concat)
prediction = post_processing_function(eval_examples, eval_dataset, outputs_numpy)
eval_metric = metric.compute(predictions=prediction.predictions, references=prediction.label_ids)
print(f"Evaluation on validation split: {eval_metric}")

100%|██████████| 10570/10570 [00:31<00:00, 331.01it/s]


Evaluation on validation split: {'exact_match': 68.58088930936613, 'f1': 77.19380556654335}
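
For reference, the two reported numbers can be understood through a per-example sketch of exact match and token-level F1. This follows the general recipe of SQuAD v1.1 style evaluation (normalize both strings, then compare); it is an illustrative reimplementation, not the code behind `load_metric("squad")`, and it scores a prediction against a single reference answer only.

```python
import collections
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(s):
    """Lowercase, remove punctuation and English articles, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in PUNCTUATION)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction, ground_truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))


def f1_score(prediction, ground_truth):
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("the Eiffel Tower", "Eiffel Tower")
f1 = f1_score("Eiffel Tower in Paris", "Eiffel Tower")
print(em, f1)
```

The second prediction contains the full reference plus two extra tokens, so it gets no credit under exact match but partial credit under F1. This is why F1 is always at least as high as exact match in the results above.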


<font color='blue'> In the above cells, a **token classification model** based on **bert-base-cased** is fine-tuned on the **SQuAD dataset**. The model is trained for 3 epochs on **30,000 samples from the train split**, which is far fewer than the number of training examples used in Assignment 2. The model is evaluated on the **validation split**, where it obtains an **F1** of **77.19** and an **exact_match** of **68.58**.

**Problem 3.2** *(10 points)* How does your question answering accuracy (F1 and EM) with BERT compare to your accuracy with LSTM and Attention in Assignment 2? How about training speed?


## $\color{blue}{\text{Solution 3.2}}$

#### $\color{blue}{\text{1. Accuracy:}}$
<font color='blue'> The **F1** and **exact_match** of the **BERT model** on the validation split of the SQuAD dataset are **77.19** and **68.58**, respectively (the BERT model is trained on 30,000 samples for 3 epochs). In contrast, the **F1** and **exact_match** of the **LSTM model** (trained in Assignment 2) on the same validation split are **13.6** and **4.642** (the LSTM model is trained on the entire train split for 10 epochs). Hence, the **BERT model significantly outperforms the LSTM with attention model**, even though it sees far less training data.

#### $\color{blue}{\text{2. Training speed:}}$
<font color='blue'> Training the BERT model for 3 epochs on just 30,000 samples took much longer than training the LSTM with attention model for 10 epochs on the entire train split. Hence, the **training speed of BERT is much slower than that of the LSTM with attention model**.


**Problem 3.3** *(10 points)* Try your own context/questions and find three failure cases. Explain why you think the model got them wrong.

## $\color{blue}{\text{Solution 3.3}}$

<font color='blue'>I tried various examples but could not find any failure case. The model seems to work well, at least on the inputs I tried.


**Problem 3.4 (bonus)** *(20 points)* Can we do better than truncating tokens if the input length is too long? Suggest (but do not code) a strategy for a problem like SQuAD when the input has an arbitrary length with a pretrained model like BERT that has a predefined input length.


## $\color{blue}{\text{Solution 3.4}}$

<font color='blue'>I believe that this is an **important issue in problems where we may want to summarize** **a large document or find a long or short answer in a long document**. There are many other problems where a solution to this issue would be of great help.

<font color='blue'>I can think of a **better strategy, but it assumes that the length of the answer is less than or equal to the predefined input length (of pretrained models like BERT)**.<br>
    

<font color='blue'>**Strategy:** Instead of truncating the input, we should **split the context into multiple overlapping inputs**. In other words, **for the same question, feed each of these inputs to the model and predict a start and an end for each one**. Then, **select the start and end with the maximum confidence (probability) of prediction**. The context is divided into multiple inputs using a stride of 1, similar to how a kernel slides over the input in CNNs (convolutional neural networks). <br>
Consider a context length of 50 and a predefined input length of 25; then our multiple inputs (1-indexed, inclusive) will be as follows:<br>
    1. inp1 = input[1:25] <br>
    2. inp2 = input[2:26]<br>
    3. inp3 = input[3:27]<br>
            .<br>
            .<br>
            .<br>
    n. inp_n = input[26:50]<br>
    
<font color='blue'>Then, we pass each input (one at a time) along with the question, and the model predicts the start and end. So, for n such inputs, we will have n starts and n ends. Now, we select the start with the maximum probability (confidence value) and, similarly, the end with the maximum probability. <br>

<font color='blue'>In this way, we do not miss any important information, and the model can select the answer using the complete context.
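
<font color='blue'>The windowing step of this strategy can be sketched in a few lines of pure Python. This is only an illustration under assumptions: `sliding_windows` and `pick_best_answer` are hypothetical helper names, the per-window predictions are made-up tuples rather than model outputs, and indices are 0-based. Note one deliberate variation: instead of choosing the start and end independently across windows, the sketch takes the whole span from the single highest-scoring window, so that both positions always come from the same window.

```python
def sliding_windows(context_tokens, window_size, stride):
    """Split a token list into overlapping windows, returned as (offset, window) pairs."""
    if len(context_tokens) <= window_size:
        return [(0, context_tokens)]
    windows = []
    start = 0
    while True:
        windows.append((start, context_tokens[start:start + window_size]))
        # Stop once the current window reaches the end of the context.
        if start + window_size >= len(context_tokens):
            break
        start += stride
    return windows


def pick_best_answer(window_predictions):
    """Pick the highest-scoring span and map it back to positions in the full context.

    window_predictions: list of (offset, start, end, score), where start and end
    are positions inside the window that begins at `offset`.
    """
    offset, start, end, _ = max(window_predictions, key=lambda p: p[3])
    return offset + start, offset + end


tokens = list(range(50))  # stand-in for 50 context tokens
windows = sliding_windows(tokens, window_size=25, stride=1)
print(len(windows))

# Made-up predictions for two windows: (offset, start, end, score).
best = pick_best_answer([(0, 3, 5, 1.2), (10, 1, 2, 3.4)])
print(best)
```

<font color='blue'>With a context of 50 tokens, a window of 25 and a stride of 1, this yields the same 26 windows as the enumeration above. A larger stride would reduce the number of forward passes at the cost of less overlap between neighbouring windows.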
