#  Assignment 2 - Transfer Learning and Data Augmentation 💬

Welcome to the **second assignment** for the **CS-552: Modern NLP course**!

> - 😀 Name: **< Félicie Giraud-Sauveur >**
> - ✉️ Email: **< felicie.giraud-sauveur >@epfl.ch**
> - 🪪 SCIPER: **284220**

<div style="padding:15px 20px 20px 20px;border-left:3px solid green;background-color:#e4fae4;border-radius: 20px;">

## **Assignment Description**
- In the first part of this assignment, you will implement the training (fine-tuning) and evaluation of a pre-trained language model ([DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)) on the natural language inference (NLI) task, also known as recognizing textual entailment (RTE).

- After this first fine-tuning task, you will identify the shortcuts (i.e., salient or toxic features) that the model has learned for this specific task.

- In Part 3, you will annotate 100 randomly assigned test datapoints to produce ground-truth labels. One or two other annotators should cross-annotate the same datapoints, and you will learn how to compute agreement statistics, an important indicator of the quality of a collected dataset.

- In Part 4, since human annotation is time-consuming and labor-intensive, you will use automatic labeling to obtain silver labels and scale up the dataset. We provide references to some simple methods (EDA and Back Translation), but you are encouraged to explore more advanced mechanisms. You will then evaluate how much your data augmentation method improves model performance.

For each part, you will need to complete the code in the corresponding `.py` files (`nli.py` for Part-1, `shortcut.py` for Part-2, `eda.py` for Part-4). You will be provided with the function descriptions and detailed instructions about the code snippet you need to write.


### Table of Contents
- **[PART 1: Model Finetuning for NLI](#1)**
    - [1.1 Data Processing](#11)
    - [1.2 Model Training and Evaluation](#12)
- **[PART 2: Identify Model Shortcut](#2)**
    - [2.1 Word-Pair Pattern Extraction](#21)
    - [2.2 Distill Potentially Useful Patterns](#22)
    - [2.3 Case Study](#23)
- **[PART 3: Annotate New Data](#3)**
    - [3.1 Write an Annotation Guideline](#31)
    - [3.2 Annotate Your 100 Datapoints with Partner(s)](#32)
    - [3.3 Agreement Measure](#33)
    - [3.4 Robustness Check](#34)
- **[PART 4: Data Augmentation](#4)**
    
### Deliverables

- ✅ This Jupyter notebook
- ✅ `nli.py` file
- ✅ `shortcut.py` file
- ✅ Finetuned DistilBERT models for NLI task (Part 1 and Part 4)
- ✅ Annotated and cross-annotated data files (Part 3)
- ✅ New dataset from data augmentation (Part 4)

</div>

### Google Colab Setup
If you are using Google Colab notebook for this assignment, you will need to run a few commands to set up our environment on Google Colab. If you are running this notebook on a local machine you can skip this section.

Run the following cell to mount your Google Drive. Follow the pop-up window and sign in to your Google account (the same account you used to store this notebook).

In [1]:
#from google.colab import drive
#drive.mount('/content/drive')

Next, click the fourth icon in the left sidebar (Files), then click the second button that appears under the Files column (Refresh). Under "/drive/MyDrive/", find the Assignment 2 folder that you uploaded to your Google Drive, copy its path, and fill it in below. If everything is working correctly, running the following cell should print the filenames from the assignment:

```
['Assignment2.ipynb', 'requirements.txt', 'runs', 'predictions', 'nli_data', 'testA2.py', 'nli.py', 'shortcut.py']
```

In [2]:
#import os
# TODO: Fill in the path where you downloaded the Assignment folder
#ROOT_PATH = "/content/drive/..." # Replace with your directory to A2 folder
#print(os.listdir(ROOT_PATH))

Before we start, we also need to run some boilerplate code to set up our environment, same as previous assignments. You'll need to rerun this setup code each time you start the notebook.

In [3]:
#requirements = ROOT_PATH + "/requirements.txt"
#!pip install -r {requirements}


Run this cell to load the autoreload extension. This allows us to edit .py source files, and re-import them into the notebook for a seamless editing and debugging experience.

In [4]:
#%load_ext autoreload
#%autoreload 2

In [5]:
#from copy import deepcopy
#import numpy as np 
#from tqdm import tqdm
#import jsonlines
#import sys
#import time
#import random

#import torch
#import torch.utils.data
#from torch import nn, optim
#from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
#from transformers import AdamW, get_constant_schedule_with_warmup
#from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

Once you have successfully mounted your Google Drive and located the path to this assignment, run the following cell to allow us to import from the `.py` files of this assignment. If it works correctly, it should print the message:

```
Hello A2!
```

In [6]:
#sys.path.append(ROOT_PATH)

#from testA2 import hello_A2
#hello_A2()

Note that if CUDA is not enabled, `torch.cuda.is_available()` will return False and this notebook will fall back to CPU mode.

In [7]:
#if torch.cuda.is_available():
    #print('Good to go!')
#else:
    #print('Please set GPU via Edit -> Notebook Settings.')

### Local Setup
If you skip the Google Colab setup, you still need to fill in the path where you downloaded the Assignment folder and install the required packages.

In [8]:
#ROOT_PATH = "..." # Replace with your directory to A2 folder

In [9]:
#requirements = "requirements.txt"  #ROOT_PATH + "/requirements.txt"
#!pip install -r {requirements}

In [10]:
%load_ext autoreload
%autoreload 2

In [11]:
%reload_ext autoreload

In [2]:
from copy import deepcopy
import numpy as np 
from tqdm import tqdm
import jsonlines
import sys
import time, os
import random
from collections import defaultdict
import json
import itertools

import torch
import torch.utils.data
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import AdamW, get_constant_schedule_with_warmup
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

from sklearn.metrics import cohen_kappa_score

<a name="1"></a>
## **PART 1: Finetuning DistilBERT for NLI**
---

### **What is the NLI task?🧐**
> Given a pair of sentences, a "premise" and a "hypothesis", NLI (or RTE) aims to determine their logical relationship, i.e., whether the hypothesis logically follows from the premise (entailment), contradicts it (contradiction), or is undetermined by it (neutral).

> Defined as a machine learning task, NLI can be considered a 3-class (entailment, contradiction, or neutral) classification task with a sentence-pair input ("premise" and "hypothesis").

> **You can run the following cell to take a first look at your data**. Each data sample is a Python dictionary with the following components:
- premise sentence (*'premise'*)
- hypothesis sentence (*'hypothesis'*)
- domain (*'domain'*): the topic of the premise and hypothesis sentences (e.g., government regulations, telephone talks, etc.)
- label (*'label'*): the logical relation between premise and hypothesis (i.e., entailment, contradiction, or neutral)

In [13]:
# If you use Google Colab, then data_dir = 'GOOGLE_DRIVE_PATH/nli_data'
data_dir = 'nli_data'  #ROOT_PATH+'/nli_data'
data_dev_path = os.path.join(data_dir, 'dev_in_domain.jsonl')
with jsonlines.open(data_dev_path, "r") as reader:
    for sid, sample in enumerate(reader.iter()):
        print(sample)
        if sid == 2:
            break

{'premise': 'The new rights are nice enough', 'hypothesis': 'Everyone really likes the newest benefits ', 'domain': 'slate', 'label': 'neutral'}
{'premise': 'This site includes a list of all award winners and a searchable database of Government Executive articles.', 'hypothesis': 'The Government Executive articles housed on the website are not able to be searched.', 'domain': 'government', 'label': 'contradiction'}
{'premise': "uh i don't know i i have mixed emotions about him uh sometimes i like him but at the same times i love to see somebody beat him", 'hypothesis': 'I like him for the most part, but would still enjoy seeing someone beat him.', 'domain': 'telephone', 'label': 'entailment'}


In [14]:
# Enter your Sciper number
SCIPER = '284220'
seed = int(SCIPER)

In [15]:
print('Your random seed is: ', seed)

Your random seed is:  284220


In [16]:
# We use the following pretrained tokenizer and model
model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier

### **1.1 Dataset Processing**
Our first step is to load the datasets for the NLI task by constructing a PyTorch Dataset. Specifically, we will need to implement tokenization and padding with a HuggingFace pre-trained tokenizer.

**Complete the `NLIDataset` class following the instructions in `nli.py`, and test it by running the following cell.**
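To make the padding step concrete, here is a minimal sketch that uses a toy whitespace tokenizer in place of the real DistilBERT tokenizer. The helpers `toy_encode_pair` and `pad_batch` and the special-token ids are hypothetical illustrations, not the interface expected by `nli.py`.

```python
from typing import Dict, List

PAD_ID, CLS_ID, SEP_ID = 0, 1, 2  # hypothetical special-token ids


def toy_encode_pair(premise: str, hypothesis: str, vocab: Dict[str, int]) -> List[int]:
    """Encode '[CLS] premise [SEP] hypothesis [SEP]' with a toy whitespace tokenizer."""
    def ids(text: str) -> List[int]:
        # Unseen words receive a fresh id so the example stays self-contained.
        return [vocab.setdefault(w, len(vocab) + 3) for w in text.lower().split()]
    return [CLS_ID] + ids(premise) + [SEP_ID] + ids(hypothesis) + [SEP_ID]


def pad_batch(sequences: List[List[int]]) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [PAD_ID] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


vocab: Dict[str, int] = {}
batch = pad_batch([
    toy_encode_pair("The new rights are nice enough", "Everyone likes them", vocab),
    toy_encode_pair("It rains", "It is dry", vocab),
])
```

With a real HuggingFace tokenizer, the encoding and padding are normally handled by the tokenizer itself; the sketch only shows what the resulting `input_ids` and `attention_mask` represent.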

In [17]:
from nli import NLIDataset

model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
dataset = NLIDataset("nli_data/dev_in_domain.jsonl", tokenizer)  #NLIDataset(ROOT_PATH+"/nli_data/dev_in_domain.jsonl", tokenizer)

from testA2 import test_NLIDataset

test_NLIDataset(dataset)

Building NLI Dataset...


9815it [00:07, 1378.52it/s]


NLIDataset test correct ✅


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\felic\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### **1.2 Model Training and Evaluation**
Next, we will implement the training and evaluation process to finetune the model. For model training, you will need to calculate the loss and update the model weights by stepping the optimizer. Additionally, we add a learning rate scheduler to adapt the learning rate over the course of training.
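As a rough illustration of a constant schedule with linear warmup (the kind of schedule that `get_constant_schedule_with_warmup` provides), the sketch below computes the learning rate at each step in plain Python. The step count is an assumed value, and how `train()` in `nli.py` converts `warmup_percent` into warmup steps is not shown in this notebook, so the rounding used here is only an assumption.

```python
def constant_with_warmup_factor(step: int, num_warmup_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 up to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Constant afterwards.
    return 1.0


# Assumed values, chosen only for illustration.
base_lr = 2e-5
total_steps = 1000
warmup_percent = 0.3
num_warmup_steps = round(warmup_percent * total_steps)

lrs = [base_lr * constant_with_warmup_factor(s, num_warmup_steps)
       for s in range(total_steps)]
```

The learning rate starts at zero, reaches `base_lr` once the warmup steps are over, and stays there for the rest of training.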

For evaluation, you will need to compute accuracy and F1 scores to assess the model performance. 

**Complete the `compute_metrics()`, `train()` and `evaluate()` functions following the instructions in the `nli.py` file. You can test `compute_metrics()` by running the following cell.**
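For reference, accuracy, per-class F1 and Macro-F1 can be computed from label and prediction lists as sketched below. The function name `accuracy_and_f1` and the class indexing (0 = entailment, 1 = neutral, 2 = contradiction) are assumptions made for this example; the signature required by `nli.py` may differ.

```python
from typing import List, Tuple


def accuracy_and_f1(labels: List[int], preds: List[int],
                    num_classes: int = 3) -> Tuple[float, List[float], float]:
    """Return (accuracy, per-class F1 scores, macro-F1)."""
    acc = sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / num_classes
    return acc, f1_scores, macro_f1


# Toy labels and predictions, for illustration only.
labels = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
acc, f1_scores, macro_f1 = accuracy_and_f1(labels, preds)
```

Macro-F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of how often it appears.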

In [18]:
from nli import compute_metrics, train, evaluate

from testA2 import test_compute_metrics
test_compute_metrics(compute_metrics)

compute_metric test correct ✅


#### **Start Training and Validation!**

Try the following different hyperparameter settings, compare and discuss the results. (Other hyperparameters should not be changed.)

> A. learning_rate 2e-5

> B. learning_rate 5e-5

**Note:** *Each training will take about 1 hour using a GPU, please keep your computer and notebook active during the training.*

**Questions: Which learning rate is better? Explain your answers.**

In [19]:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)
model.to(device)

train_dataset = NLIDataset("nli_data/train.jsonl", tokenizer) #NLIDataset(ROOT_PATH+"/nli_data/train.jsonl", tokenizer)
dev_dataset = NLIDataset("nli_data/dev_in_domain.jsonl", tokenizer) # NLIDataset(ROOT_PATH+"/nli_data/dev_in_domain.jsonl", tokenizer)

batch_size = 16
epochs = 4
max_grad_norm = 1.0
warmup_percent = 0.3
model_save_root = 'runs/' #ROOT_PATH+'/runs/'

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier

Building NLI Dataset...


98176it [01:13, 1333.96it/s]


Building NLI Dataset...


9815it [00:07, 1382.74it/s]


In [20]:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

learning_rate = 2e-5 # play around with this hyperparameter

train(train_dataset, dev_dataset, model, device, batch_size, epochs,
      learning_rate, warmup_percent, max_grad_norm, model_save_root)

Training: 100%|████████████████████████████████████████████████████████████████████| 6136/6136 [15:57<00:00,  6.41it/s]
Evaluation: 100%|████████████████████████████████████████████████████████████████████| 614/614 [00:26<00:00, 23.57it/s]


Epoch: 0 | Training Loss: 0.716 | Validation Loss: 0.591
Epoch 0 NLI Validation:
Accuracy: 76.07% | F1: (78.79%, 72.24%, 77.18%) | Macro-F1: 76.07%
Model Saved!


Training: 100%|████████████████████████████████████████████████████████████████████| 6136/6136 [15:44<00:00,  6.49it/s]
Evaluation: 100%|████████████████████████████████████████████████████████████████████| 614/614 [00:25<00:00, 23.86it/s]


Epoch: 1 | Training Loss: 0.465 | Validation Loss: 0.600
Epoch 1 NLI Validation:
Accuracy: 77.30% | F1: (80.44%, 72.62%, 78.49%) | Macro-F1: 77.18%
Model Saved!


Training: 100%|████████████████████████████████████████████████████████████████████| 6136/6136 [15:46<00:00,  6.48it/s]
Evaluation: 100%|████████████████████████████████████████████████████████████████████| 614/614 [00:25<00:00, 23.66it/s]


Epoch: 2 | Training Loss: 0.244 | Validation Loss: 0.758
Epoch 2 NLI Validation:
Accuracy: 76.95% | F1: (80.33%, 72.29%, 77.79%) | Macro-F1: 76.80%


Training: 100%|████████████████████████████████████████████████████████████████████| 6136/6136 [15:45<00:00,  6.49it/s]
Evaluation: 100%|████████████████████████████████████████████████████████████████████| 614/614 [00:25<00:00, 23.86it/s]


Epoch: 3 | Training Loss: 0.147 | Validation Loss: 1.024
Epoch 3 NLI Validation:
Accuracy: 76.96% | F1: (80.38%, 72.34%, 77.62%) | Macro-F1: 76.78%


In [21]:
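# Caution: `model` is not re-initialized in this cell. Unless `train()` resets the weights itself,
# this run starts from the model already fine-tuned at 2e-5 above, which would bias the comparison.
learning_rate = 5e-5 # play around with this hyperparameter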

train(train_dataset, dev_dataset, model, device, batch_size, epochs,
      learning_rate, warmup_percent, max_grad_norm, model_save_root)

Training: 100%|████████████████████████████████████████████████████████████████████| 6136/6136 [15:49<00:00,  6.46it/s]
Evaluation: 100%|████████████████████████████████████████████████████████████████████| 614/614 [00:25<00:00, 23.92it/s]


Epoch: 0 | Training Loss: 0.329 | Validation Loss: 0.708
Epoch 0 NLI Validation:
Accuracy: 76.11% | F1: (78.71%, 72.68%, 76.72%) | Macro-F1: 76.04%
Model Saved!


Training: 100%|████████████████████████████████████████████████████████████████████| 6136/6136 [15:38<00:00,  6.54it/s]
Evaluation: 100%|████████████████████████████████████████████████████████████████████| 614/614 [00:25<00:00, 24.02it/s]


Epoch: 1 | Training Loss: 0.222 | Validation Loss: 0.862
Epoch 1 NLI Validation:
Accuracy: 75.57% | F1: (79.03%, 69.97%, 77.13%) | Macro-F1: 75.38%


Training: 100%|████████████████████████████████████████████████████████████████████| 6136/6136 [15:41<00:00,  6.52it/s]
Evaluation: 100%|████████████████████████████████████████████████████████████████████| 614/614 [00:25<00:00, 23.87it/s]


Epoch: 2 | Training Loss: 0.192 | Validation Loss: 1.027
Epoch 2 NLI Validation:
Accuracy: 74.80% | F1: (77.21%, 70.42%, 76.66%) | Macro-F1: 74.76%


Training: 100%|████████████████████████████████████████████████████████████████████| 6136/6136 [15:39<00:00,  6.53it/s]
Evaluation: 100%|████████████████████████████████████████████████████████████████████| 614/614 [00:25<00:00, 23.84it/s]


Epoch: 3 | Training Loss: 0.170 | Validation Loss: 1.166
Epoch 3 NLI Validation:
Accuracy: 75.32% | F1: (78.15%, 71.23%, 76.27%) | Macro-F1: 75.22%


<div style="border-radius: 5px; border: 3px dashed#6699FF; padding: 10px;">

<b> <font color="#6699FF"> Conclusion: </font> </b>
<ul style="list-style-type:circle"> 
<li> <b> The learning rate 2e-5 seems better than the learning rate 5e-5. Looking at the validation scores, the best checkpoint at 2e-5 (epoch 1) reaches a higher accuracy (77.30% vs. 76.11%) and a higher Macro-F1 (77.18% vs. 76.04%) than the best checkpoint at 5e-5 (epoch 0), with higher F1 scores for entailment and contradiction and a comparable F1 score for neutral (72.62% vs. 72.68%). Thus, the model trained with the 2e-5 learning rate seems to generalize better.  </b></li>
<li> <b> Looking at the training, the run with a learning rate of 2e-5 has a higher training loss in the first epochs (0.716 vs. 0.329 at epoch 0), but its loss decreases faster and ends lower (0.147 vs. 0.170 at epoch 3). The validation loss is lower at every epoch with a learning rate of 2e-5, which again suggests that this model generalizes better. In both runs, the validation loss rises after the first epoch while the training loss keeps falling, which is a sign of overfitting. </b></li> 
</ul>
    
</div>
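The comparison can also be made explicit with a few lines of Python. The numbers below are copied from the training logs above; the selection rule (best validation accuracy per run) is an assumption about how one might pick the checkpoint to compare.

```python
# Validation accuracy (%) and loss per epoch, copied from the training logs above.
val_acc = {
    2e-5: [76.07, 77.30, 76.95, 76.96],
    5e-5: [76.11, 75.57, 74.80, 75.32],
}
val_loss = {
    2e-5: [0.591, 0.600, 0.758, 1.024],
    5e-5: [0.708, 0.862, 1.027, 1.166],
}

# Index of the epoch with the highest validation accuracy for each learning rate.
best = {lr: max(range(len(accs)), key=accs.__getitem__) for lr, accs in val_acc.items()}

for lr, epoch in best.items():
    print(f"lr={lr:g}: best epoch {epoch}, "
          f"accuracy {val_acc[lr][epoch]:.2f}%, loss {val_loss[lr][epoch]:.3f}")
```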

### **Fine-Grained Validation**

Use the model checkpoint saved under the first hyperparameter setting (learning_rate 2e-5) in 1.2 to check the model performance on each domain subset of the validation set. Report the validation loss, accuracy, F1 scores and Macro-F1 on each domain, then compare and discuss the results.

**Questions: On which domain does the model perform best? Worst? Give some possible explanations of why the best-performing domain is easier and why the worst-performing domain is more challenging. Use some examples to support your explanations.**

**Note:** To find examples for supporting your discussion, save the model prediction results on each domain under the './predictions/' folder, by specifying the *result_save_file* of the *evaluate* function.

In [22]:
batch_size = 16
learning_rate = 2e-5
warmup_percent = 0.3
checkpoint = 'runs/lr{}-warmup{}'.format(learning_rate, warmup_percent) #ROOT_PATH+'/runs/lr{}-warmup{}'.format(learning_rate, warmup_percent)

# Split the validation sets into subsets with different domains and save the subsets under './nli_data/'

# Group the validation samples by domain
split_domains = defaultdict(list)
with jsonlines.open("nli_data/dev_in_domain.jsonl", "r") as reader:
    for sample in tqdm(reader.iter()):
        split_domains[sample["domain"]].append(sample) 

# Save each domain subset as a JSON Lines file
for domain in split_domains.keys():
    with open("nli_data/"+domain+".jsonl", "w") as eval_domain:
        for ddict in split_domains[domain]:
            jout = json.dumps(ddict) + '\n'
            eval_domain.write(jout)

9815it [00:00, 208790.90it/s]


In [23]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
tokenizer = DistilBertTokenizer.from_pretrained(checkpoint)
model = DistilBertForSequenceClassification.from_pretrained(checkpoint)
model.to(device)

for domain in ["fiction", "government", "slate", "telephone", "travel"]:
    
    # Evaluate and save prediction results in each domain
    dev_domain_dataset = NLIDataset("nli_data/"+domain+".jsonl", tokenizer)
    dev_loss, acc, f1_ent, f1_neu, f1_con = evaluate(dev_domain_dataset, model, device, batch_size, no_labels=False, result_save_file="predictions/"+domain)
    macro_f1 = (f1_ent + f1_neu + f1_con)/3
    
    print(f'Domain: {domain}')
    print(f'Validation Loss: {dev_loss:.3f} | Accuracy: {acc*100:.2f}%')
    print(f'F1: ({f1_ent*100:.2f}%, {f1_neu*100:.2f}%, {f1_con*100:.2f}%) | Macro-F1: {macro_f1*100:.2f}%')

Building NLI Dataset...


1973it [00:01, 1732.69it/s]
Evaluation: 100%|████████████████████████████████████████████████████████████████████| 124/124 [00:03<00:00, 31.44it/s]


Domain: fiction
Validation Loss: 0.609 | Accuracy: 76.89%
F1: (79.41%, 72.18%, 78.77%) | Macro-F1: 76.79%
Building NLI Dataset...


1945it [00:02, 897.18it/s] 
Evaluation: 100%|████████████████████████████████████████████████████████████████████| 122/122 [00:05<00:00, 23.10it/s]


Domain: government
Validation Loss: 0.483 | Accuracy: 81.85%
F1: (84.45%, 77.97%, 82.68%) | Macro-F1: 81.70%
Building NLI Dataset...


1955it [00:02, 938.39it/s]
Evaluation: 100%|████████████████████████████████████████████████████████████████████| 123/123 [00:05<00:00, 23.86it/s]


Domain: slate
Validation Loss: 0.711 | Accuracy: 71.66%
F1: (74.58%, 67.31%, 72.83%) | Macro-F1: 71.58%
Building NLI Dataset...


1966it [00:02, 973.77it/s]
Evaluation: 100%|████████████████████████████████████████████████████████████████████| 123/123 [00:06<00:00, 20.15it/s]


Domain: telephone
Validation Loss: 0.596 | Accuracy: 77.82%
F1: (81.08%, 71.63%, 80.33%) | Macro-F1: 77.68%
Building NLI Dataset...


1976it [00:02, 935.44it/s]
Evaluation: 100%|████████████████████████████████████████████████████████████████████| 124/124 [00:05<00:00, 24.13it/s]

Domain: travel
Validation Loss: 0.588 | Accuracy: 78.19%
F1: (82.19%, 74.58%, 77.58%) | Macro-F1: 78.12%





<div style="border-radius: 5px; border: 3px dashed#6699FF; padding: 10px;">

<b> <font color="#6699FF"> Conclusion: </font> </b>
<ul style="list-style-type:circle"> 
<li> <b> In every domain, the F1 score is highest for "entailment", followed by "contradiction", and lowest for "neutral".  </b></li>
<li> <b> The model performs best on the "government" domain, with the highest accuracy (81.85%) and Macro-F1 (81.70%). </b></li> 
<li> <b> The model performs worst on the "slate" domain, with the lowest accuracy (71.66%) and Macro-F1 (71.58%); the gap to the other domains, which all exceed 76% accuracy, is clearly marked. </b></li> 
</ul>

</div>

**Let's look at some examples to see if we can propose an explanation for these results...**

In [3]:
# The "government" domain is the one where the model performs best.
# Let's look at some examples.

# Collect the correct and incorrect predictions for government
succeed_government = []
failed_government = []

with jsonlines.open("predictions/government", "r") as reader:
    for gov in reader.iter():
        if gov["label"] == gov["prediction"]:
            succeed_government.append(gov)
        else:
            failed_government.append(gov)

    # Print examples

print("Examples of success for government:")
for i in range(4):
    print("..................")
    print("Example{}: \n ---> premise: {} \n ---> hypothesis: {}".format(i, succeed_government[i]["premise"], succeed_government[i]["hypothesis"]))
    print("===> Label: {},  Prediction: {}".format(succeed_government[i]["label"], succeed_government[i]["prediction"]))

print("\n ########## \n")

print("Examples of failures for government:")
for i in range(4):
    print("..................")
    print("Example{}: \n ---> premise: {} \n ---> hypothesis: {}".format(i, failed_government[i]["premise"], failed_government[i]["hypothesis"]))
    print("===> Label: {},  Prediction: {}".format(failed_government[i]["label"], failed_government[i]["prediction"]))

Examples of success for government:
..................
Example0: 
 ---> premise: This site includes a list of all award winners and a searchable database of Government Executive articles. 
 ---> hypothesis: The Government Executive articles housed on the website are not able to be searched.
===> Label: contradiction,  Prediction: contradiction
..................
Example1: 
 ---> premise: 5 The share of gross national saving used to replace depreciated capital has increased over the past 40 years. 
 ---> hypothesis: Gross national saving was highest this year.
===> Label: neutral,  Prediction: neutral
..................
Example2: 
 ---> premise: So far, however, the number of mail pieces lost to alternative bill-paying methods is too small to have any material impact on First-Class volume. 
 ---> hypothesis: The amount of lost mail is huge and really impacts mail volume
===> Label: contradiction,  Prediction: contradiction
..................
Example3: 
 ---> premise: Conversely, an incr

In [4]:
# The "slate" domain is the one where the model performs worst.
# Let's look at some examples.

# Collect the correct and incorrect predictions for slate
succeed_slate = []
failed_slate = []

with jsonlines.open("predictions/slate", "r") as reader:
    for slt in reader.iter():
        if slt["label"] == slt["prediction"]:
            succeed_slate.append(slt)
        else:
            failed_slate.append(slt)

    # Print examples

print("Examples of success for slate:")
for i in range(4):
    print("..................")
    print("Example{}: \n ---> premise: {} \n ---> hypothesis: {}".format(i, succeed_slate[i]["premise"], succeed_slate[i]["hypothesis"]))
    print("===> Label: {},  Prediction: {}".format(succeed_slate[i]["label"], succeed_slate[i]["prediction"]))

print("\n ########## \n")

print("Examples of failures for slate:")
for i in range(4):
    print("..................")
    print("Example{}: \n ---> premise: {} \n ---> hypothesis: {}".format(i, failed_slate[i]["premise"], failed_slate[i]["hypothesis"]))
    print("===> Label: {},  Prediction: {}".format(failed_slate[i]["label"], failed_slate[i]["prediction"]))

Examples of success for slate:
..................
Example0: 
 ---> premise: The new rights are nice enough 
 ---> hypothesis: Everyone really likes the newest benefits 
===> Label: neutral,  Prediction: neutral
..................
Example1: 
 ---> premise: If that investor were willing to pay extra for the security of limited downside, she could buy put options with a strike price of $98, which would lock in her profit on the shares at $18, less whatever the options cost. 
 ---> hypothesis: THe strike price could be $8.
===> Label: contradiction,  Prediction: contradiction
..................
Example2: 
 ---> premise: 3)  Dare you rise to the occasion, like Raskolnikov, and reject the petty rules that govern lesser men? 
 ---> hypothesis: Would you rise up and defeaat all evil lords in the town?
===> Label: neutral,  Prediction: neutral
..................
Example3: 
 ---> premise: Blue says Blumenthal claimed Clinton had told him that Lewinsky had made unwanted sexual advances. 
 ---> hy

<div style="border-radius: 5px; border: 3px dashed#6699FF; padding: 10px;">

<b> <font color="#6699FF"> Conclusion: </font> </b>

<b> When the model is wrong, the error is rather small for the "government" domain, whereas for the "slate" domain the predicted label is often the opposite of the expected one. One hypothesis for why the model performs well on "government" but not on "slate" is that "government" texts use a vocabulary in which key words reveal the "orientation" of the sentence, whereas "slate" sentences more often contain negations that flip the "orientation" relative to the key words. For example, in success example 1 of "government", we have "increased" in the premise and "highest" in the hypothesis. In failure example 1 of "slate", by contrast, many words can change the "orientation" of the sentence ("never" and "let" in the premise, "never" and "held" in the hypothesis), which can confuse the model.
</b>
    
</div>
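One way to check the observation that "slate" errors are more often opposite labels is to tally the error types in each prediction file. The sketch below assumes records with string `label` and `prediction` fields, as printed above; `error_types` and the toy records are hypothetical and only illustrate the counting.

```python
from collections import Counter
from typing import Dict, Iterable


def error_types(records: Iterable[Dict[str, str]]) -> Counter:
    """Count correct predictions, confusions involving 'neutral', and opposite labels."""
    counts = Counter()
    for rec in records:
        gold, pred = rec["label"], rec["prediction"]
        if gold == pred:
            counts["correct"] += 1
        elif "neutral" in (gold, pred):
            # One side is neutral: a "small" error.
            counts["neutral_confusion"] += 1
        else:
            # entailment predicted as contradiction or vice versa.
            counts["opposite"] += 1
    return counts


# Toy records for illustration only (not taken from the prediction files).
toy_records = [
    {"label": "entailment", "prediction": "entailment"},
    {"label": "neutral", "prediction": "neutral"},
    {"label": "entailment", "prediction": "neutral"},
    {"label": "neutral", "prediction": "contradiction"},
    {"label": "contradiction", "prediction": "entailment"},
]
toy_counts = error_types(toy_records)
```

Running the same function over the records read from `predictions/government` and `predictions/slate` would show whether the share of opposite-label errors really differs between the two domains.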

<a name="2"></a>
## **PART 2: Identify Shortcuts**

We aim to find some shortcuts that the model in 1.2 (under the first hyperparameter setting) has learned.

### **2.1 Word-Pair Pattern Extraction**

We consider extracting simple word-pair patterns that the model may have learned from the NLI data.

For this, we assume that a pair of words that occur in a premise-hypothesis sentence pair (one occurs in premise and the other occurs in hypothesis) may serve as a key indicator of the logical relationship between the premise and hypothesis sentences. For example:

>- Premise: Consider the United States Postal Service.
>- Hypothesis: Forget the United States Postal Service.

Here the word-pair "consider" and "forget" determines that the premise and hypothesis have a *contradiction* relationship, so (consider, forget) --> *contradiction* might be a good pattern to learn.

**Note:** 
- We do not consider the naive word-pair patterns where the word from the premise and the word from the hypothesis are identical, e.g., (service, service) obtained from the above premise-hypothesis sentence pair.
- We also exclude stop words, punctuation, and words that contain the special prefix '##' (e.g., '##s') from the pattern extraction.
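The filtering rules above can be sketched for a single premise-hypothesis pair as follows. This is a minimal illustration under stated assumptions: the tokens are given as pre-tokenized lists, `TOY_STOP_WORDS` stands in for the NLTK stop word list, and `candidate_pairs` is a hypothetical helper, not the function required in `shortcut.py`.

```python
import string
from itertools import product
from typing import List, Set, Tuple

TOY_STOP_WORDS = {"the", "a", "an", "of"}  # stand-in for the NLTK stop word list
PUNCTUATION = set(string.punctuation)


def keep(token: str) -> bool:
    """Keep a token unless it is a stop word, punctuation, or a '##' word piece."""
    return (token not in TOY_STOP_WORDS
            and token not in PUNCTUATION
            and not token.startswith("##"))


def candidate_pairs(premise_tokens: List[str],
                    hypothesis_tokens: List[str]) -> Set[Tuple[str, str]]:
    """All (premise word, hypothesis word) pairs, excluding identical-word pairs."""
    p_words = {t for t in premise_tokens if keep(t)}
    h_words = {t for t in hypothesis_tokens if keep(t)}
    return {(p, h) for p, h in product(p_words, h_words) if p != h}


premise = ["consider", "the", "united", "states", "postal", "service", "."]
hypothesis = ["forget", "the", "united", "states", "postal", "service", "."]
pairs = candidate_pairs(premise, hypothesis)
```

Using sets means each distinct word pair is counted at most once per sentence pair, which is one possible design choice; counting repeated occurrences would be another.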

In [26]:
# stop words and punctuation to be excluded from the pattern extraction

import nltk
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')
stop_words.append('uh')

import string
puncs = string.punctuation

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\felic\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Complete `word_pair_extraction()` function in `shortcut.py` file.**

The keys of the returned dictionary *word_pairs* should be the **different word-pairs** that appear in premise-hypothesis sentence pairs, i.e., (a word from the premise, a word from the hypothesis).

The value of a word-pair key records the counts of entailment, neutral and contradiction predictions **made by the model** when the word-pair occurs, i.e., \[#entailment_predictions, #neutral_predictions,  #contradiction_predictions\].

**Note:** Remember to exclude naive word pairs (i.e., premise word identical to hypothesis word), stop words, punctuation and words with the special prefix '##'.
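The structure of the returned dictionary can be sketched as below. The example assumes that the word pairs of each sentence pair have already been extracted and that predictions are stored as label strings; `count_predictions`, `LABEL_INDEX` and the toy examples are hypothetical, and the real `word_pair_extraction()` additionally reads the prediction files and tokenizes the sentences.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]

# Position of each prediction in the count vector, following the order given above.
LABEL_INDEX = {"entailment": 0, "neutral": 1, "contradiction": 2}


def count_predictions(examples: Iterable[Tuple[Set[Pair], str]]) -> Dict[Pair, List[int]]:
    """Map each word pair to [#entailment, #neutral, #contradiction] model predictions."""
    word_pairs = defaultdict(lambda: [0, 0, 0])
    for pairs, prediction in examples:
        for pair in pairs:
            word_pairs[pair][LABEL_INDEX[prediction]] += 1
    return dict(word_pairs)


# Toy examples: (word pairs found in one sentence pair, model prediction).
toy_examples = [
    ({("consider", "forget")}, "contradiction"),
    ({("consider", "forget"), ("children", "kids")}, "contradiction"),
    ({("children", "kids")}, "entailment"),
]
toy_word_pairs = count_predictions(toy_examples)
```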

### **2.2 Distill Potentially Useful Patterns**

Find and print the **top-100** word-pairs that are associated with the **largest total number** of model predictions, which might contain frequently used patterns.

In [27]:
from shortcut import word_pair_extraction

In [28]:
# all your saved model prediction results in 1.2 Fine-Grained Validation
prediction_files = ["predictions/fiction", "predictions/government", "predictions/slate", "predictions/telephone", "predictions/travel"]

tokenizer = DistilBertTokenizer.from_pretrained(checkpoint)
word_pairs, pairs_pred_files = word_pair_extraction(prediction_files, tokenizer)

# find top-100 word-pairs associated with the largest total number of model predictions
sum_word_pairs = {k:sum(v) for k,v in word_pairs.items()}
top_100_freq_pairs = list(dict(sorted(sum_word_pairs.items(), key=lambda item: item[1], reverse=True)).keys())[0:100]

print(top_100_freq_pairs)

[('legal', 'services'), ('postal', 'service'), ('could', 'would'), ('children', 'kids'), ('like', 'lot'), ('know', 'time'), ('like', 'think'), ('one', 'year'), ('know', 'like'), ('one', 'two'), ('like', 'one'), ('know', 'think'), ('like', 'yeah'), ('one', 'people'), ('people', 'would'), ('get', 'know'), ('many', 'people'), ('last', 'year'), ('know', 'money'), ('know', 'would'), ('ca', 'da'), ('time', 'would'), ('like', 'really'), ('year', 'years'), ('know', 'people'), ('use', 'used'), ('many', 'one'), ('get', 'going'), ('think', 'would'), ('bad', 'good'), ('time', 'well'), ('never', 'yeah'), ('get', 'one'), ('like', 'people'), ('new', 'york'), ('l', 'state'), ('help', 'legal'), ('legal', 'state'), ('cost', 'costs'), ('one', 'yeah'), ('time', 'yeah'), ('one', 'would'), ('income', 'people'), ('last', 'years'), ('know', 'never'), ('go', 'know'), ('never', 'well'), ('said', 'told'), ('last', 'one'), ('good', 'one'), ('good', 'well'), ('get', 'well'), ('like', 'way'), ('well', 'would'), ('k

**Among the top-100 frequent word-pairs above**, find the **top-5** word-pairs whose occurrences **most likely** lead to *entailment* predictions (entailment patterns), and the **top-5** word-pairs whose occurrences **most likely** lead to *contradiction* predictions (contradiction patterns).

**Explain your rules for finding these word pairs.**

<div style="border-radius: 5px; border: 3px dashed#66CC66; padding: 10px;">

<b> <font color="#66CC66"> Method: </font> </b>
<ul style="list-style-type:circle"> 
<li> <b> We build the following dictionary: {word-pair: [#entailment_predictions, #neutral_predictions, #contradiction_predictions]}. </b></li>
<li> <b> To find the top-100 word-pairs, we compute #entailment_predictions+#neutral_predictions+#contradiction_predictions and keep the 100 word-pairs with the highest sums. </b></li> 
<li> <b> Then, among these 100 word-pairs, we keep the 5 with the highest #entailment_predictions and the 5 with the highest #contradiction_predictions. We take these as the top-5 entailment patterns and the top-5 contradiction patterns, respectively. Note that this rule ranks by raw counts rather than by proportions, so a very frequent word-pair can appear in both lists. </b></li> 
</ul>

</div>
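
Because the rule above ranks by raw counts, a frequent pair can rank high for several labels at once. One possible refinement, sketched below on made-up counts, ranks pairs by the share of their predictions that fall on a given label. The function name `rank_by_label_share`, the `min_total` parameter, and the toy dictionary are assumptions for illustration; only the `[#entailment, #neutral, #contradiction]` layout comes from the method described above.

```python
def rank_by_label_share(word_pairs, label_index, top_k=5, min_total=1):
    """Rank word pairs by the fraction of predictions that fall on one label.

    word_pairs maps a pair to [#entailment, #neutral, #contradiction].
    label_index selects the label: 0 = entailment, 2 = contradiction.
    """
    shares = {}
    for pair, counts in word_pairs.items():
        total = sum(counts)
        if total >= min_total:  # skip pairs with too few predictions
            shares[pair] = counts[label_index] / total
    return sorted(shares, key=shares.get, reverse=True)[:top_k]


# Toy counts (hypothetical, not taken from the real prediction files)
toy_pairs = {
    ("legal", "services"): [40, 30, 30],
    ("bad", "good"): [2, 3, 15],
    ("children", "kids"): [18, 1, 1],
}

print(rank_by_label_share(toy_pairs, label_index=0, top_k=2))
print(rank_by_label_share(toy_pairs, label_index=2, top_k=2))
```

With this ranking, the frequent but balanced toy pair no longer leads either list. Whether it yields more meaningful patterns on the real predictions would need to be checked.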

In [29]:
# find top-5 entailment and contradiction patterns
# (word_pairs[key] = [#entailment, #neutral, #contradiction])

# entailment: rank the top-100 pairs by their number of entailment predictions
entailment_t100 = {key: word_pairs[key][0] for key in top_100_freq_pairs}
top_5_entailment = sorted(entailment_t100, key=entailment_t100.get, reverse=True)[:5]

# contradiction: rank the top-100 pairs by their number of contradiction predictions
contradict_t100 = {key: word_pairs[key][2] for key in top_100_freq_pairs}
top_5_contradict = sorted(contradict_t100, key=contradict_t100.get, reverse=True)[:5]

# print
print("Entailment Patterns:")
print(top_5_entailment)
print("Contradiction Patterns:")
print(top_5_contradict)

Entailment Patterns:
[('legal', 'services'), ('like', 'lot'), ('postal', 'service'), ('children', 'kids'), ('like', 'think')]
Contradiction Patterns:
[('never', 'yeah'), ('legal', 'services'), ('bad', 'good'), ('know', 'never'), ('postal', 'service')]


<div style="border-radius: 5px; border: 3px dashed#6699FF; padding: 10px;">

<b> <font color="#6699FF"> Conclusion: </font> </b>
<ul style="list-style-type:circle"> 
<li> <b> While some patterns seem logical, such as ('bad', 'good') among the contradiction patterns, others are much less obvious, such as ('like', 'lot') among the entailment patterns. Moreover, ('postal', 'service') and ('legal', 'services') appear in both the entailment and the contradiction lists. </b></li>
<li> <b> These "weird" patterns could suggest that the model relies on shortcuts that are irrelevant to the prediction, although this remains difficult to judge without more context. </b></li>
</ul>

</div>

### **2.3 Case Study**

Find out and study **4 representative** cases where the pattern that you have found in 2.2 **fails**, e.g., the premise-hypothesis sentence pair contains ('good', 'bad'), but has an *entailment* gold label.

**Based on your case study, explain the limitations of the word-pair patterns.**

In [30]:
# Find 4 representative cases where the pattern in 2.2 fails
top_patterns = set(top_5_entailment + top_5_contradict)
failed_cases = []
for l in pairs_pred_files:
    if len(failed_cases) >= 4:
        break
    # keep samples that contain a top pattern and are misclassified,
    # ignoring neutral labels and neutral predictions
    if (set(l['pairs']) & top_patterns
            and l['label'] != l['prediction']
            and l['label'] != 'neutral'
            and l['prediction'] != 'neutral'):
        failed_cases.append(l)

In [31]:
# Case 1
failed_cases[0]  # see ('like', 'think')

{'premise': 'I feel, though, that I should like to point out to you once more the risks you are running, especially if you pursue the course you indicate.',
 'hypothesis': 'I do not think that you understand the risks you are taking.',
 'pairs': [('pursue', 'think'),
  ('point', 'understand'),
  ('risks', 'though'),
  ('especially', 'understand'),
  ('pursue', 'understand'),
  ('course', 'understand'),
  ('risks', 'understand'),
  ('like', 'think'),
  ('point', 'risks'),
  ('feel', 'taking'),
  ('especially', 'risks'),
  ('pursue', 'risks'),
  ('course', 'risks'),
  ('indicate', 'understand'),
  ('running', 'understand'),
  ('feel', 'think'),
  ('indicate', 'risks'),
  ('like', 'understand'),
  ('point', 'taking'),
  ('especially', 'taking'),
  ('think', 'though'),
  ('taking', 'though'),
  ('course', 'taking'),
  ('risks', 'taking'),
  ('risks', 'running'),
  ('point', 'think'),
  ('like', 'risks'),
  ('especially', 'think'),
  ('feel', 'understand'),
  ('course', 'think'),
  ('risks'

In [32]:
# Case 2
failed_cases[1]  # see ('bad', 'good')

{'premise': "Waldemar Szary, a food technician at the OSM 'Paziocha', was having a very bad day - the kind of a very bad day, which normally comes after one of those very good days.",
 'hypothesis': 'The kind of day that Waldemar Szary was having was not a good one, at all. ',
 'pairs': [('bad', 'day'),
  ('technician', 'wal'),
  ('good', 'kind'),
  ('bad', 'good'),
  ('day', 'food'),
  ('day', 'kind'),
  ('paz', 'wal'),
  ('comes', 'kind'),
  ('food', 'one'),
  ('one', 'wal'),
  ('good', 'wal'),
  ('bad', 'kind'),
  ('one', 'technician'),
  ('good', 'technician'),
  ('food', 'good'),
  ('comes', 'wal'),
  ('good', 'paz'),
  ('days', 'wal'),
  ('kind', 'os'),
  ('good', 'normally'),
  ('days', 'one'),
  ('kind', 'wal'),
  ('one', 'os'),
  ('good', 'os'),
  ('comes', 'day'),
  ('kind', 'technician'),
  ('day', 'os'),
  ('food', 'kind'),
  ('kind', 'paz'),
  ('day', 'wal'),
  ('good', 'one'),
  ('days', 'good'),
  ('kind', 'normally'),
  ('one', 'paz'),
  ('day', 'one'),
  ('day', 'techn

In [33]:
# Case 3
failed_cases[2]  # see ('like', 'lot')

{'premise': "If she didn't like her restaurant so much, the woman'd be high-up in Applied by now.",
 'hypothesis': 'She liked her restaurant a lot.',
 'pairs': [('lot', 'woman'),
  ('much', 'restaurant'),
  ('applied', 'restaurant'),
  ('liked', 'much'),
  ('liked', 'restaurant'),
  ('like', 'liked'),
  ('lot', 'much'),
  ('high', 'liked'),
  ('lot', 'restaurant'),
  ('like', 'lot'),
  ('high', 'lot'),
  ('applied', 'liked'),
  ('restaurant', 'woman'),
  ('like', 'restaurant'),
  ('liked', 'woman'),
  ('high', 'restaurant'),
  ('applied', 'lot')],
 'domain': 'fiction',
 'label': 'entailment',
 'prediction': 'contradiction'}

In [34]:
# Case 4
failed_cases[3]  # see ('legal', 'services')

{'premise': 'LASNNY is one of the oldest and most cost-effective legal services organizations in the United States.',
 'hypothesis': 'LASNNY is an old legal services organization.',
 'pairs': [('cost', 'old'),
  ('oldest', 'services'),
  ('one', 'organization'),
  ('las', 'united'),
  ('legal', 'united'),
  ('las', 'organization'),
  ('legal', 'organization'),
  ('effective', 'services'),
  ('las', 'old'),
  ('effective', 'legal'),
  ('las', 'legal'),
  ('cost', 'organization'),
  ('cost', 'legal'),
  ('effective', 'las'),
  ('old', 'one'),
  ('services', 'states'),
  ('organization', 'united'),
  ('organizations', 'services'),
  ('organization', 'organizations'),
  ('old', 'states'),
  ('one', 'services'),
  ('las', 'organizations'),
  ('legal', 'organizations'),
  ('old', 'oldest'),
  ('las', 'services'),
  ('legal', 'services'),
  ('cost', 'services'),
  ('old', 'united'),
  ('cost', 'las'),
  ('organization', 'services'),
  ('las', 'one'),
  ('legal', 'one'),
  ('oldest', 'organiza

<div style="border-radius: 5px; border: 3px dashed#6699FF; padding: 10px;">

<b> <font color="#6699FF"> Conclusion: </font> </b>

<b> The problem with shortcuts is that they rely on just two words and ignore the surrounding context, which can strongly change the meaning of the sentence. For example, a negation can reverse the meaning of a pattern, as in these two cases: "should like" vs. "do not think", and "very bad day" vs. "was not a good one". Context can also carry an implicit meaning, as in "If she didn't like" vs. "liked a lot". </b>

</div>

## **Task3: Annotate New Data**

To check the robustness of the developed model, **some additional sets of test data** are collected, which contain NLI samples that are outside the domains of the training and validation data.

However, the test data does not have gold labels for the relationships between premise and hypothesis sentences, i.e., all the labels are marked as *hidden*. **We therefore annotate the data ourselves.**

### **3.1 Write an Annotation Guideline**

Imagine that you are going to assign this annotation task to a crowdsourcing worker who is not at all familiar with computer science and NLP. Think about how you would explain the task to them so that they can do a decent job. Write an annotation guideline for such a worker.

**Note:** You should come up with your own guideline without the help of your partner(s) in later Task 3.2

<div style="border-radius: 5px; border: 3px dashed#66CC66; padding: 10px;">

<b> <font color="#66CC66"> Method: Annotation guideline: </font> </b>

<b> <i> Instructions: <br>
For each sample, two sentences are present each time: the "premise" sentence and the "hypothesis" sentence. The task is to label each sample with one of the three labels: "entailment" or "contradiction" or "neutral". <br>
In general, the sample should be labeled "entailment" if the "hypothesis" sentence follows from the "premise" sentence, "contradiction" if the "hypothesis" sentence opposes the "premise" sentence, and "neutral" if the "hypothesis" sentence has no direct link with the "premise" sentence. </i> </b> <br>

<b> More specifically: </b>

<ul style="list-style-type:circle"> 
<li> <b> Look to see if there is even one contradictory element between the "premise" and the "hypothesis". If there is, then the label will necessarily be "contradiction", regardless of the other part(s) of the sentence. </b> Example: "premise": "The sun is shining today and it is warm despite a few clouds"; "hypothesis": "The sun is shining today with its warmth, and there are no clouds on the horizon." => "contradiction".</li>
<li> <b> If no contradictory element is present but the "hypothesis" has one or more additions not directly deducible from the "premise" then the label will necessarily be "neutral" even if the other part(s) of the sentence are deducible. </b> Example: "premise": "The sun is shining today and it is warm despite a few clouds"; "hypothesis": "The sun is shining today with its warmth, and the birds are singing." => "neutral". </li>
<li> <b> If no contradictory elements and no additions are present in the "hypothesis", i.e. if everything in the "hypothesis" is deducible from the "premise", then the label will be "entailment". </b> Example: "premise": "The sun is shining today and it's warm despite a few clouds"; "hypothesis": "The sun is shining today with its warmth." => "entailment". </li>
</ul>

</div>

### **3.2 Annotate Your 100 Datapoints with Partner(s)**

Annotate your 100 test datapoints with your partner(s), by editing the value of the key "label_student1", "label_student2" and "label_student3" (if you are in a group of three students) in each datapoint.

**Note:** 
- You can download the assigned annotation file (`<your-testset-id>.jsonl`) by [this link](https://drive.google.com/drive/folders/146ExExmpnSUayu6ArGiN5gQzCPJp0myB?usp=share_link)
- Please find your annotation partner according to the "Student Pairing List for A2 Task3" shared on Ed.

**Name your annotated file as `<index>-<sciper_number>.jsonl`.** 

For example, if you get `01.jsonl` to annotate, you should name your deliverable as `01-<your_sciper_number>.jsonl`.

### **3.3 Agreement Measure**

Based on your and your partner's annotations on the 100 test datapoints in 3.2, calculate the [Cohen's Kappa](https://scikit-learn.org/stable/modules/model_evaluation.html#cohen-kappa) or [Krippendorff's Alpha](https://github.com/pln-fing-udelar/fast-krippendorff) (if you are in a group of three students) between the annotators. Discuss the agreement measure results.

**Note:** Cohen's Kappa or Krippendorff's Alpha interpretation

0: No Agreement

0 ~ 0.2: Slight Agreement

0.2 ~ 0.4: Fair Agreement

0.4 ~ 0.6: Moderate Agreement

0.6 ~ 0.8: Substantial Agreement

0.8 ~ 1.0: Near Perfect Agreement

1.0: Perfect Agreement
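
For reference, Cohen's Kappa compares the observed agreement $p_o$ with the agreement $p_e$ expected by chance from each annotator's label frequencies: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below computes it by hand on toy labels and maps the value to the bands listed above. The function names and the toy labels are hypothetical, and treating each band's upper bound as inclusive (and any non-positive value as "No Agreement") is an assumption, since the mapping does not specify its boundaries.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's Kappa between two annotators (lists of equal length)."""
    n = len(labels_a)
    # observed agreement: fraction of items with identical labels
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement: product of the two annotators' label frequencies
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    # note: undefined (division by zero) if both annotators use a single identical label
    return (p_o - p_e) / (1 - p_e)


def interpret(kappa):
    """Map a kappa value to the agreement bands of the mapping above."""
    if kappa <= 0:
        return "No Agreement"
    if kappa >= 1.0:
        return "Perfect Agreement"
    bands = [(0.2, "Slight"), (0.4, "Fair"), (0.6, "Moderate"), (0.8, "Substantial")]
    for upper, name in bands:
        if kappa <= upper:
            return name + " Agreement"
    return "Near Perfect Agreement"


# Toy annotations (hypothetical): 3 of 4 items agree
toy_a = ["entailment", "entailment", "neutral", "contradiction"]
toy_b = ["entailment", "neutral", "neutral", "contradiction"]
kappa = cohen_kappa(toy_a, toy_b)
print(f"{kappa:.3f} -> {interpret(kappa)}")
```

Here $p_o = 3/4$ and $p_e = 5/16$, so $\kappa = 7/11 \approx 0.636$.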

> **Questions**: What is your interpretation of the Cohen's Kappa or Krippendorff's Alpha value according to the above mapping? Which kind of disagreement happens most frequently between you and your partner(s), i.e., *entailment* vs. *neutral*, *entailment* vs. *contradiction*, or *neutral* vs. *contradiction*? For the second question, give some examples to explain why that is the case. Are there possible ways to address the disagreements between two annotators?

In [35]:
# Open jsonl file and record disagreements vs agreements

annotations = {"premise":[], "hypothesis":[], "domain":[], "label_student1":[], "label_student2":[]}

disagreements_number = {"eVSn":0, "eVSc":0, "nVSc":0}
agreements_number = {"e":0, "c":0, "n":0}

disagreements_eVSn = {"premise":[], "hypothesis":[], "label_student1":[], "label_student2":[]}
disagreements_eVSc = {"premise":[], "hypothesis":[], "label_student1":[], "label_student2":[]}
disagreements_nVSc = {"premise":[], "hypothesis":[], "label_student1":[], "label_student2":[]}

with jsonlines.open("nli_data/27-284220.jsonl", "r") as reader:
    for ann in reader.iter():
        
        annotations["premise"].append(ann["premise"])
        annotations["hypothesis"].append(ann["hypothesis"])
        annotations["domain"].append(ann["domain"])
        annotations["label_student1"].append(ann["label_student1"])
        annotations["label_student2"].append(ann["label_student2"])
        
        if (ann["label_student1"], ann["label_student2"]) in [("entailment", "neutral"), ("neutral", "entailment")]:
            disagreements_number["eVSn"] += 1
            disagreements_eVSn["premise"].append(ann["premise"])
            disagreements_eVSn["hypothesis"].append(ann["hypothesis"])
            disagreements_eVSn["label_student1"].append(ann["label_student1"])
            disagreements_eVSn["label_student2"].append(ann["label_student2"])
                
        elif (ann["label_student1"], ann["label_student2"]) in [("entailment", "contradictory"), ("contradictory", "entailment")]:
            disagreements_number["eVSc"] += 1
            disagreements_eVSc["premise"].append(ann["premise"])
            disagreements_eVSc["hypothesis"].append(ann["hypothesis"])
            disagreements_eVSc["label_student1"].append(ann["label_student1"])
            disagreements_eVSc["label_student2"].append(ann["label_student2"])
                        
        elif (ann["label_student1"], ann["label_student2"]) in [("neutral", "contradictory"), ("contradictory", "neutral")]:
            disagreements_number["nVSc"] += 1
            disagreements_nVSc["premise"].append(ann["premise"])
            disagreements_nVSc["hypothesis"].append(ann["hypothesis"])
            disagreements_nVSc["label_student1"].append(ann["label_student1"])
            disagreements_nVSc["label_student2"].append(ann["label_student2"])
                      
        elif (ann["label_student1"]==ann["label_student2"]) & (ann["label_student1"]=="entailment"):
            agreements_number["e"] += 1
                              
        elif (ann["label_student1"]==ann["label_student2"]) & (ann["label_student1"]=="contradictory"):
            agreements_number["c"] += 1
                                      
        elif (ann["label_student1"]==ann["label_student2"]) & (ann["label_student1"]=="neutral"):
            agreements_number["n"] += 1

# Calculate the Cohen's Kappa             
CohenKappa = cohen_kappa_score(annotations["label_student1"], annotations["label_student2"])

In [36]:
print("The Cohen's Kappa is: {:.2f}.".format(CohenKappa))
print("===============")
print("Disagreements:")
print("There is {} eVSn, {} eVSc, {} nVSc.".format(disagreements_number["eVSn"], disagreements_number["eVSc"], disagreements_number["nVSc"]))
print("===============")
print("Agreements:")
print("There is {} e, {} c, {} n.".format(agreements_number["e"], agreements_number["c"], agreements_number["n"]))

The Cohen's Kappa is: 0.64.
Disagreements:
There is 14 eVSn, 1 eVSc, 9 nVSc.
Agreements:
There is 20 e, 35 c, 21 n.


<div style="border-radius: 5px; border: 3px dashed#6699FF; padding: 10px;">

<b> <font color="#6699FF"> Conclusion: </font> </b>
<ul style="list-style-type:circle"> 
<li> <b> Cohen's Kappa is 0.64. It is thus between 0.6 and 0.8, meaning a "substantial agreement". We can therefore consider the agreement between the two annotators acceptable, even though some disagreements remain. </b></li>
<li> <b> Most disagreements are "entailment" vs. "neutral" (14 of 24). This seems quite logical because my own rule favors the "neutral" label whenever the hypothesis contains a non-deducible addition, even if everything else is deducible. The other annotator may reasonably follow a different rule, for example: "Put the label "entailment" if there is a deducible part and no contradictory part, even if a part of the sentence is not deducible.". </b></li>
<li> <b> The most common agreed label is "contradiction" (35 samples), which again makes sense because a contradiction is easily noticed, and the rule "You only need one contradiction to put the label "contradiction"." seems quite logical. </b></li>
</ul>

</div>
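
The agreement and disagreement counts discussed above can also be tallied more compactly with `collections.Counter`, by counting unordered label pairs. This is only a sketch on toy labels: the function name `tally_label_pairs` is hypothetical, and the label spelling "contradictory" is the one the annotation file appears to use in the cell above.

```python
from collections import Counter


def tally_label_pairs(labels_1, labels_2):
    """Count unordered label pairs, e.g. ('entailment', 'neutral')."""
    # sorting each pair makes (a, b) and (b, a) count as the same disagreement
    return Counter(tuple(sorted(pair)) for pair in zip(labels_1, labels_2))


# Toy labels (hypothetical, not the real annotations)
toy_1 = ["entailment", "neutral", "contradictory", "neutral"]
toy_2 = ["neutral", "entailment", "contradictory", "neutral"]

counts = tally_label_pairs(toy_1, toy_2)
for pair, count in counts.most_common():
    kind = "agreement" if pair[0] == pair[1] else "disagreement"
    print(pair, count, kind)
```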

**Let's look at some examples where the annotations disagree in the case of entailment VS neutral:**

In [37]:
# Examples of disagreements entailment VS neutral

print("Examples of disagreements eVSn:")
for i in range(4):
    print("..................")
    print("Example{}: \n ---> premise: {} \n ---> hypothesis: {}".format(i, disagreements_eVSn["premise"][i], disagreements_eVSn["hypothesis"][i]))
    print("===> Label other student: {},  My label: {}".format(disagreements_eVSn["label_student1"][i], disagreements_eVSn["label_student2"][i]))

Examples of disagreements eVSn:
..................
Example0: 
 ---> premise: Gregory uses stream in the way normal in England; Giles consistently refers to gens in the hills of central Australia, though glen is not current (outside place-names) in contemporary Australian English. 
 ---> hypothesis: Glen is not contemporary Australian English right now.
===> Label other student: neutral,  My label: entailment
..................
Example1: 
 ---> premise: A rough comparison yields  
 ---> hypothesis: A tough juxtaposition results in 
===> Label other student: entailment,  My label: neutral
..................
Example2: 
 ---> premise: At 8:47, seconds after the impact of American 11, United 175's transponder code changed, and then changed again. 
 ---> hypothesis: After impact of American 11, the transponder code was altered on United 175.
===> Label other student: neutral,  My label: entailment
..................
Example3: 
 ---> premise:  Copy that, sir. 
 ---> hypothesis: I understand w

<div style="border-radius: 5px; border: 3px dashed#6699FF; padding: 10px;">

<b> <font color="#6699FF"> Conclusion: </font> </b>

<b> The disagreements seem to occur on sentences whose label is quite ambiguous. The interpretation of certain words may differ slightly between the two annotators and lead to a different label. For example, can "comparison" and "juxtaposition" in Example 1 really be considered to have the same meaning? Here the disagreements are quite subtle. One way to reconcile the labels would be a shared and much more precise annotation guideline, since the differences concern small details. In particular, the choice between "entailment" and "neutral" should be clarified. </b>

</div>

### **3.4 Robustness Check**

Taking into account both your and your partner's annotations, determine the final labels of the 100 test datapoints by editing the value of the key "label" in each of your datapoints.

Evaluate the performance of your developed model in 1.4 (still under the first hyperparameter setting) on your annotated 100 test datapoints, and compare with the model performance on the validation set.

> **Question**: Do you think that your developed model is robust in handling out-of-domain NLI predictions?

<div style="border-radius: 5px; border: 3px dashed#66CC66; padding: 10px;">

<b> <font color="#66CC66"> Method: Determine the final labels: </font> </b>

<b> To fill in the final labels, I choose to use my own labels for every datapoint. Since Cohen's Kappa is high enough and I prefer my labels in the examples above, this choice seems reasonable. </b>

</div>

In [38]:
# Fill final labels with labels of student2 (my labels)

final_labels = []

with jsonlines.open("nli_data/27-284220.jsonl", "r") as reader:
    for ann in reader.iter():
        ann["label"] = ann["label_student2"]
        if ann["label"]=="contradictory":
            ann["label"] = "contradiction"
        final_labels.append(ann)

with open("nli_data/27-284220_final.jsonl", "w") as writer:
    for dict_ in final_labels:
        jout = json.dumps(dict_) + '\n'
        writer.write(jout)

In [39]:
# Evaluate the previous model with a learning_rate of 2e-5 on our new annotated file

batch_size = 16
learning_rate = 2e-5
warmup_percent = 0.3
checkpoint = 'runs/lr{}-warmup{}'.format(learning_rate, warmup_percent)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
tokenizer = DistilBertTokenizer.from_pretrained(checkpoint)
model = DistilBertForSequenceClassification.from_pretrained(checkpoint)
model.to(device)

annotation_dataset = NLIDataset("nli_data/27-284220_final.jsonl", tokenizer)
val_loss, acc, f1_ent, f1_neu, f1_con = evaluate(annotation_dataset, model, device, batch_size, no_labels=False, result_save_file="predictions/our-ann")
macro_f1 = (f1_ent + f1_neu + f1_con)/3
    
print(f'Validation Loss: {val_loss:.3f} | Accuracy: {acc*100:.2f}%')
print(f'F1: ({f1_ent*100:.2f}%, {f1_neu*100:.2f}%, {f1_con*100:.2f}%) | Macro-F1: {macro_f1*100:.2f}%')

Building NLI Dataset...


100it [00:00, 797.99it/s]
Evaluation: 100%|████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 14.89it/s]

Validation Loss: 0.466 | Accuracy: 81.00%
F1: (89.23%, 66.67%, 82.76%) | Macro-F1: 79.55%





<div style="border-radius: 5px; border: 3px dashed#6699FF; padding: 10px;">

<b> <font color="#6699FF"> Conclusion: </font> </b>
<ul style="list-style-type:circle"> 
<li> <b> When we evaluate our model on our newly annotated file, the accuracy (81.0%) and the macro-F1 (79.6%) are relatively high. They are slightly higher than those on the validation set (77.0% accuracy and 76.8% macro-F1 at epoch 3) and almost as high as those for the best domain, "government" (81.9% accuracy and 81.7% macro-F1). </b></li>
<li> <b> We can therefore deduce that our developed model has good robustness in handling out-of-domain NLI predictions. </b></li>
<li> <b> NB: as before, the best F1 is for "entailment", followed by "contradiction" and then "neutral". </b></li>
</ul>

</div>
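
With only 100 test datapoints, the accuracy estimate carries noticeable sampling uncertainty. As a rough check that is not part of the assignment, the sketch below computes a Wilson score interval for a proportion. The function name `wilson_interval` is hypothetical, and `z=1.96` corresponds to an approximate 95% confidence level.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 for about 95% confidence)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# 81 correct predictions out of 100 annotated datapoints
low, high = wilson_interval(81, 100)
print(f"{low:.3f} - {high:.3f}")
```

For 81 correct out of 100, the interval is roughly 72% to 87%, which contains the 77.0% validation accuracy. The apparent gain over the validation set may therefore be due to sampling noise.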

## **Task4: Data Augmentation**

Finally, we use a data augmentation method to create more training data, and use the augmented data to improve the model performance. The data augmentation method we are going to use is [EDA](https://aclanthology.org/D19-1670/).

### **4.1 EDA: Easy Data Augmentation algorithm for Text**

For this section, we will need to implement the simplest data augmentation techniques on textual sentences, including **SR** (Synonym Replacement), **RD** (Random Deletion), **RS** (Random Swap), **RI** (Random Insertion). 

You should complete all the functions in `eda.py` script, and you can test them with a simple testcase by running the following cell.

- **Synonym Replacement (SR)**
> In Synonym Replacement, we randomly replace some words in the sentence with their synonyms.

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
  
from nltk.corpus import wordnet

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\felic\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\felic\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


You can test whether you get the synonyms right and see an example with synonym replacement.

In [None]:
from eda import get_synonyms
from testA2 import test_get_synonyms

test_get_synonyms(get_synonyms)


The synonyms for the word "task" are:  ['job', 'labor', 'chore', 'undertaking', 'project', 'tax']


In [None]:
from eda import synonym_replacement

print(f" Example of Synonym Replacement: {synonym_replacement('hey man how are you doing',3)}")

 Example of Synonym Replacement: hey gentleman how are you doing


- **Random Deletion (RD)**

> In Random Deletion, we randomly delete a word if a uniformly generated number between 0 and 1 is smaller than a pre-defined threshold. This allows for a random deletion of some words of the sentence.

In [16]:
from eda import random_deletion

print(f" Example of Random Deletion: {random_deletion('hey man how are you doing', p=0.3, max_deletion_n=3)}")

 Example of Random Deletion: hey man are
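
As an illustration of the rule described above, here is a minimal sketch of random deletion. It is not the `eda.py` implementation: the name `random_deletion_sketch`, the `rng` parameter, and the choice to keep one random word when everything would be deleted are assumptions. Only the parameters `p` and `max_deletion_n` mirror the call shown above.

```python
import random


def random_deletion_sketch(sentence, p, max_deletion_n, rng=random):
    """Delete each word with probability p, up to max_deletion_n deletions."""
    words = sentence.split()
    if len(words) <= 1:
        return sentence  # nothing sensible to delete
    kept, deleted = [], 0
    for word in words:
        # draw a uniform number in [0, 1); delete the word if it is below p
        if deleted < max_deletion_n and rng.random() < p:
            deleted += 1
        else:
            kept.append(word)
    if not kept:  # never return an empty sentence
        kept = [rng.choice(words)]
    return " ".join(kept)


print(random_deletion_sketch('hey man how are you doing', p=0.3, max_deletion_n=3))
```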


- **Random Swap (RS)**
> In Random Swap, we randomly swap the order of two words in a sentence.

In [17]:
from eda import swap_word

print(f" Example of Random Swap: {swap_word('hey man how are you doing')}")

 Example of Random Swap: hey man are how you doing
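
A minimal sketch of random swap, under the assumption that any two distinct positions may be exchanged (the `eda.py` implementation may choose positions differently). The name `swap_word_sketch` and the `rng` parameter are hypothetical.

```python
import random


def swap_word_sketch(sentence, rng=random):
    """Swap the words at two distinct, randomly chosen positions."""
    words = sentence.split()
    if len(words) < 2:
        return sentence  # a swap needs at least two words
    i, j = rng.sample(range(len(words)), 2)  # two different indices
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


print(swap_word_sketch('hey man how are you doing'))
```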


- **Random Insertion (RI)**
> Finally, in Random Insertion, we randomly insert synonyms of a word at a random position.
> Data augmentation operations should not change the true label of a sentence, as that would introduce unnecessary noise into the data. Inserting a synonym of a word in the sentence, as opposed to a random word, is more likely to be relevant to the context and to retain the original label of the sentence.

In [None]:
from eda import random_insertion

print(f" Example of Random Insertion: {random_insertion('hey man how are you doing', n=2)}")

 Example of Random Insertion: hey man human beings how are valet you doing


<b> <font color='red'> NB: Since we use `sentence.split()` to get the words, the resulting tokens keep their attached punctuation, so the synonym lookup can fail for a word that actually has synonyms.  
For example, for the sentence `'Monitoring, reporting, and recordkeeping requirements. '` present in the training dataset, `sentence.split()` gives `['Monitoring,', 'reporting,', 'and', 'recordkeeping', 'requirements.']` and therefore no word will have any synonyms when in fact there are some! </font> </b>
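
One possible workaround, sketched below, is to separate the leading and trailing punctuation from each token, look up synonyms for the bare word, and then reattach the punctuation. The helper name `split_affixes` is hypothetical and this is not what `eda.py` currently does.

```python
import string


def split_affixes(token):
    """Split a token into (leading punctuation, core word, trailing punctuation)."""
    stripped_left = token.lstrip(string.punctuation)
    prefix = token[:len(token) - len(stripped_left)]
    core = stripped_left.rstrip(string.punctuation)
    suffix = stripped_left[len(core):]
    return prefix, core, suffix


sentence = 'Monitoring, reporting, and recordkeeping requirements. '
cores = [split_affixes(token)[1] for token in sentence.split()]
print(cores)

# After replacing the core word, the punctuation can be reattached:
prefix, core, suffix = split_affixes('requirements.')
print(prefix + 'demands' + suffix)  # 'demands' stands in for a looked-up synonym
```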

### **4.2 Augment Your Model**

Combining all the functions you have implemented in 4.1, you can come up with your own data augmentation pipeline with various p and n ;)

The next step is to expand the training data you used in Task1, re-train your model in 1.4 on your augmented data, and re-evaluate its performance on both the given validation set and your manually annotated 100 test datapoints. 

Discuss the improvements that your data augmentation brings to your model. ***Include some examples of old vs. new model predictions to demonstrate the improvements.***

**Warning: In terms of data size and training time control, we stipulate that your augmented training data should not be larger than 100M.** (Currently the training data train.jsonl is about 25M.)

In [46]:
def aug(sent, n, p):
    print(f" Original Sentence : {sent}")
    print(f" SR Augmented Sentence : {synonym_replacement(sent, n)}")
    print(f" RD Augmented Sentence : {random_deletion(sent, p, n)}")
    print(f" RS Augmented Sentence : {swap_word(sent)}")
    print(f" RI Augmented Sentence : {random_insertion(sent, n)}")
    
aug('hey man how are you doing', p=0.2, n=2)

 Original Sentence : hey man how are you doing
 SR Augmented Sentence : hey gentleman how are you doing
 RD Augmented Sentence : hey man are you
 RS Augmented Sentence : man hey how are you doing
 RI Augmented Sentence : hey man humankind how are you come doing


- Augment training dataset and Re-train your model
> Notes: you can decide on your own how much data you want to augment. But there are two pitfalls: i) with EDA, more augmentation means more noise, which does not necessarily increase performance; ii) more data means longer training time. Please balance your data scale and GPU time ;) 

In [24]:
# Data augmentation pipeline with various p and n on one sentence

def aug_sent(sent):

    # synonym_replacement
    sent = synonym_replacement(sent, n=3)
    
    # random_deletion
    sent = random_deletion(sent, p=0.4, max_deletion_n=6)
    
    # swap_word
    sent = swap_word(sent)
    
    # random_insertion
    sent = random_insertion(sent, n=3)

    return sent

In [48]:
# Expand the training data from Task1 and re-train the model on this augmented data
# Re-evaluate performance of the model on the given validation set

    # 1. Set-up
    
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)
model.to(device)

    # 2. Expand training data

augm_training_data = []
with jsonlines.open("nli_data/train.jsonl", "r") as reader:
    for augm in tqdm(reader.iter()):
        augm["premise"] = aug_sent(augm["premise"])
        augm["hypothesis"] = aug_sent(augm["hypothesis"])
        augm_training_data.append(augm)

with open("nli_data/train_augmented.jsonl", "w") as writer:
    for dict_ in augm_training_data:
        jout = json.dumps(dict_) + '\n'
        writer.write(jout)

    # 3. Get train_dataset and dev_dataset
    
train_dataset = NLIDataset("nli_data/train_augmented.jsonl", tokenizer)
dev_dataset = NLIDataset("nli_data/dev_in_domain.jsonl", tokenizer)

    # 4. Train and evaluate

batch_size = 16
epochs = 4
max_grad_norm = 1.0
warmup_percent = 0.3
model_save_root = 'runs_augm/'
learning_rate = 2e-5

train(train_dataset, dev_dataset, model, device, batch_size, epochs,
      learning_rate, warmup_percent, max_grad_norm, model_save_root)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier

Building NLI Dataset...


98176it [01:18, 1256.33it/s]


Building NLI Dataset...


9815it [00:07, 1370.47it/s]
Training: 100%|████████████████████████████████████████████████████████████████████| 6136/6136 [15:52<00:00,  6.44it/s]
Evaluation: 100%|████████████████████████████████████████████████████████████████████| 614/614 [00:25<00:00, 23.82it/s]


Epoch: 0 | Training Loss: 1.047 | Validation Loss: 0.850
Epoch 0 NLI Validation:
Accuracy: 63.09% | F1: (67.53%, 62.29%, 57.79%) | Macro-F1: 62.54%
Model Saved!


Training: 100%|████████████████████████████████████████████████████████████████████| 6136/6136 [15:44<00:00,  6.50it/s]
Evaluation: 100%|████████████████████████████████████████████████████████████████████| 614/614 [00:25<00:00, 23.70it/s]


Epoch: 1 | Training Loss: 0.934 | Validation Loss: 0.830
Epoch 1 NLI Validation:
Accuracy: 62.07% | F1: (68.04%, 58.97%, 54.80%) | Macro-F1: 60.60%


Training: 100%|████████████████████████████████████████████████████████████████████| 6136/6136 [15:49<00:00,  6.46it/s]
Evaluation: 100%|████████████████████████████████████████████████████████████████████| 614/614 [00:26<00:00, 23.53it/s]


Epoch: 2 | Training Loss: 0.723 | Validation Loss: 0.926
Epoch 2 NLI Validation:
Accuracy: 62.95% | F1: (68.77%, 55.05%, 61.49%) | Macro-F1: 61.77%


Training: 100%|████████████████████████████████████████████████████████████████████| 6136/6136 [15:49<00:00,  6.46it/s]
Evaluation: 100%|████████████████████████████████████████████████████████████████████| 614/614 [00:25<00:00, 23.63it/s]


Epoch: 3 | Training Loss: 0.404 | Validation Loss: 1.210
Epoch 3 NLI Validation:
Accuracy: 60.98% | F1: (67.47%, 51.03%, 60.31%) | Macro-F1: 59.60%
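
The per-epoch log above can be summarised programmatically. The following is a minimal sketch: the `history` list is transcribed by hand from the printed output (it is not returned by `train`), and `best_epoch` is a hypothetical helper introduced only for illustration.

```python
# Per-epoch metrics copied by hand from the training log above.
history = [
    {"epoch": 0, "train_loss": 1.047, "val_loss": 0.850, "acc": 63.09, "macro_f1": 62.54},
    {"epoch": 1, "train_loss": 0.934, "val_loss": 0.830, "acc": 62.07, "macro_f1": 60.60},
    {"epoch": 2, "train_loss": 0.723, "val_loss": 0.926, "acc": 62.95, "macro_f1": 61.77},
    {"epoch": 3, "train_loss": 0.404, "val_loss": 1.210, "acc": 60.98, "macro_f1": 59.60},
]


def best_epoch(history, key, higher_is_better=True):
    """Return the epoch whose value for `key` is best (hypothetical helper)."""
    pick = max if higher_is_better else min
    return pick(history, key=lambda row: row[key])["epoch"]


best_by_f1 = best_epoch(history, "macro_f1")
best_by_loss = best_epoch(history, "val_loss", higher_is_better=False)

# Gap between validation and training loss; a growing gap suggests overfitting.
gaps = [round(row["val_loss"] - row["train_loss"], 3) for row in history]

print(f"Best epoch by macro-F1: {best_by_f1}")
print(f"Best epoch by validation loss: {best_by_loss}")
print(f"Validation minus training loss per epoch: {gaps}")
```

On these numbers, macro-F1 is highest at epoch 0 and validation loss is lowest at epoch 1, while the gap between validation and training loss grows at every epoch, which is the usual sign of overfitting.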


In [49]:
# Re-evaluate performance of the model on the manually annotated 100 test datapoints

batch_size = 16
learning_rate = 2e-5
warmup_percent = 0.3
checkpoint = 'runs_augm/lr{}-warmup{}'.format(learning_rate, warmup_percent)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
tokenizer = DistilBertTokenizer.from_pretrained(checkpoint)
model = DistilBertForSequenceClassification.from_pretrained(checkpoint)
model.to(device)

annotation_dataset = NLIDataset("nli_data/27-284220_final.jsonl", tokenizer)
val_loss, acc, f1_ent, f1_neu, f1_con = evaluate(annotation_dataset, model, device, batch_size, no_labels=False, result_save_file="predictions/our-ann_augm-model")
macro_f1 = (f1_ent + f1_neu + f1_con)/3
    
print(f'Validation Loss: {val_loss:.3f} | Accuracy: {acc*100:.2f}%')
print(f'F1: ({f1_ent*100:.2f}%, {f1_neu*100:.2f}%, {f1_con*100:.2f}%) | Macro-F1: {macro_f1*100:.2f}%')

Building NLI Dataset...


100it [00:00, 975.04it/s]
Evaluation: 100%|████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 28.68it/s]

Validation Loss: 0.719 | Accuracy: 71.00%
F1: (73.24%, 62.07%, 76.06%) | Macro-F1: 70.45%





<div style="border-radius: 5px; border: 3px dashed#6699FF; padding: 10px;">

<b> <font color="#6699FF"> Conclusion: </font> </b>
<ul style="list-style-type:circle"> 
<li> <b> The model trained on the augmented dataset performs worse than the previous model. On the validation set, the epoch 3 scores are 61.0% accuracy and 59.6% macro-F1, compared with 77.0% and 76.8% previously. On our annotated dataset, the scores are 71.0% accuracy and 70.5% macro-F1, compared with 81.0% and 79.6% previously. </b></li>
<li> <b> A possible remedy is to tune the augmentation more carefully, for example by adjusting the parameters n and p of each function in the pipeline. </b></li>
</ul>

</div>
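
To make the role of a parameter such as p more concrete, here is a minimal sketch of EDA-style random deletion, where each word is dropped independently with probability p. The function `random_deletion` is a hypothetical stand-in written for illustration; it is not necessarily how `eda.py` or `aug_sent` implement the step, and the example sentence is a shortened version of one of the premises.

```python
import random


def random_deletion(sentence, p, rng=None):
    """Drop each word independently with probability p (illustrative sketch)."""
    if rng is None:
        rng = random.Random()
    words = sentence.split()
    # Nothing sensible to delete from an empty or one-word sentence.
    if len(words) <= 1:
        return sentence
    kept = [word for word in words if rng.random() >= p]
    # Never return an empty sentence: fall back to one randomly chosen word.
    if not kept:
        kept = [rng.choice(words)]
    return " ".join(kept)


rng = random.Random(0)  # fixed seed so that runs are repeatable
premise = "Controllers at centers rely heavily on transponder signals"
for p in (0.0, 0.1, 0.5):
    print(p, "->", random_deletion(premise, p, rng))
```

With p = 0 the sentence is returned unchanged, and larger values of p remove more words, so an aggressive setting can delete exactly the words that determine the NLI label.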

**Let's look at some examples where the model trained on the augmented dataset does worse or better than the previous model:**

In [5]:
# Examples of old vs. new model predictions with our annotation dataset

    # Gather old and new results
    
comparison_augm = {"premise":[], "hypothesis":[], "domain":[], "label":[], "pred_old":[], "pred_new":[]}

with jsonlines.open("predictions/our-ann", "r") as reader:
    for l in reader.iter():        
        comparison_augm["premise"].append(l["premise"])
        comparison_augm["hypothesis"].append(l["hypothesis"])
        comparison_augm["domain"].append(l["domain"])
        comparison_augm["label"].append(l["label"])
        comparison_augm["pred_old"].append(l["prediction"])

with jsonlines.open("predictions/our-ann_augm-model", "r") as reader:
    for l in reader.iter():  
        comparison_augm["pred_new"].append(l["prediction"])

        
    # Keep only results where exactly one of the old and new predictions matches the label

old_true_new_false = {"premise":[], "hypothesis":[], "domain":[], "label":[], "pred_old":[], "pred_new":[]}
old_false_new_true = {"premise":[], "hypothesis":[], "domain":[], "label":[], "pred_old":[], "pred_new":[]}

for i in range(len(comparison_augm["label"])):
    
    premise = comparison_augm["premise"][i]
    hypothesis = comparison_augm["hypothesis"][i]
    domain = comparison_augm["domain"][i]
    label = comparison_augm["label"][i]
    pred_old = comparison_augm["pred_old"][i]
    pred_new = comparison_augm["pred_new"][i]
    
    if label == pred_old and label != pred_new:
        old_true_new_false["premise"].append(premise)
        old_true_new_false["hypothesis"].append(hypothesis)
        old_true_new_false["domain"].append(domain)
        old_true_new_false["label"].append(label)
        old_true_new_false["pred_old"].append(pred_old)
        old_true_new_false["pred_new"].append(pred_new)
    
    elif label != pred_old and label == pred_new:
        old_false_new_true["premise"].append(premise)
        old_false_new_true["hypothesis"].append(hypothesis)
        old_false_new_true["domain"].append(domain)
        old_false_new_true["label"].append(label)
        old_false_new_true["pred_old"].append(pred_old)
        old_false_new_true["pred_new"].append(pred_new)

In [6]:
# Show examples where the old prediction is correct but the new prediction is wrong

print("old_label is right but new_label is false:")
for i in range(4):
    print(" \n ..................")
    print("Example{}:  \n \n ---> premise: {} \n ---> hypothesis: {}".format(i, old_true_new_false["premise"][i], old_true_new_false["hypothesis"][i]))
    print("===> Label: {},  old label: {}, new label: {}".format(old_true_new_false["label"][i], old_true_new_false["pred_old"][i], old_true_new_false["pred_new"][i]))

old_label is right but new_label is false:
 
 ..................
Example0:  
 
 ---> premise: Alamo , the site of the Texas defeat by Santa Ana; hoosegow from juzgado `court'; dinero `money,' a Spanish corruption of the Latin denarius; macho , from the same root as machete : he who wields a machete must be skillful and powerful, hence the word has come to mean `virile' and its associated noun, machismo , `virility. 
 ---> hypothesis: Alamo, the greatly revered site of the Texas defeat by Santa Ana. 
===> Label: neutral,  old label: neutral, new label: entailment
 
 ..................
Example1:  
 
 ---> premise: Controllers at centers rely so heavily on transponder signals that they usually do not display primary radar returns on their radar scopes. 
 ---> hypothesis:  Controllers at centers rely on sight the most.
===> Label: contradiction,  old label: contradiction, new label: entailment
 
 ..................
Example2:  
 
 ---> premise: Gregory uses stream in the way normal in Engla

In [7]:
# Show examples where the old prediction is wrong but the new prediction is correct

print("old_label is false but new_label is right:")
for i in range(4):
    print(" \n ..................")
    print("Example{}:  \n \n ---> premise: {} \n ---> hypothesis: {}".format(i, old_false_new_true["premise"][i], old_false_new_true["hypothesis"][i]))
    print("===> Label: {},  old label: {}, new label: {}".format(old_false_new_true["label"][i], old_false_new_true["pred_old"][i], old_false_new_true["pred_new"][i]))

old_label is false but new_label is right:
 
 ..................
Example0:  
 
 ---> premise: And, um, she read, I forgot about the nursery rhymes. 
 ---> hypothesis: She told "Peter Pan" so wonderfully, the kids didn't want to sleep.
===> Label: neutral,  old label: contradiction, new label: neutral
 
 ..................
Example1:  
 
 ---> premise: What a surprise to hear, first, an American commentator on a televised golf tourney describe a reverse-necked putter colloquially, and then to hear the Japanese broadcaster translate that description into a terse sentence or two ending with the expression  bassackawad putta.  
 ---> hypothesis: The American commentator's description was incorrect, and the translation fixed it.
===> Label: neutral,  old label: contradiction, new label: neutral
 
 ..................
Example2:  
 
 ---> premise: And none of the information conveyed in the White House video teleconference, at least in the first hour, was being passed to the NMCC. 
 ---> hypoth

<div style="border-radius: 5px; border: 3px dashed#6699FF; padding: 10px;">

<b> <font color="#6699FF"> Conclusion: </font> </b>

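<b> On these examples, the model trained on the augmented dataset predicts "neutral" more readily and correctly recovers neutral pairs that the previous model labelled as contradictions. On the other hand, it misclassifies some pairs that the previous model got right, for instance predicting "entailment" for a contradiction. </b>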

</div>
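
The qualitative impression above can be checked with simple counts. The following is a minimal sketch on toy data: the three lists are invented for illustration and are not the real predictions, and `prediction_shift` and `per_class_correct` are hypothetical helpers.

```python
from collections import Counter

# Toy data for illustration only (NOT the real labels or predictions).
labels = ["neutral", "contradiction", "neutral", "entailment", "contradiction"]
pred_old = ["contradiction", "contradiction", "neutral", "entailment", "contradiction"]
pred_new = ["neutral", "entailment", "neutral", "entailment", "neutral"]


def prediction_shift(pred_old, pred_new):
    """Count (old prediction, new prediction) pairs where the two models disagree."""
    return Counter((old, new) for old, new in zip(pred_old, pred_new) if old != new)


def per_class_correct(labels, preds):
    """Count correct predictions per gold label."""
    return Counter(gold for gold, pred in zip(labels, preds) if gold == pred)


print("Predicted label counts (old):", Counter(pred_old))
print("Predicted label counts (new):", Counter(pred_new))
print("Changed predictions:", prediction_shift(pred_old, pred_new))
print("Correct per class (old):", per_class_correct(labels, pred_old))
print("Correct per class (new):", per_class_correct(labels, pred_new))
```

With the real data, the lists `comparison_augm["label"]`, `comparison_augm["pred_old"]` and `comparison_augm["pred_new"]` built earlier could be passed in place of the toy lists.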

### **5 Upload Your Notebook, Data and Models**

Please **rename** your completed Jupyter notebook to **your SCIPER number** and upload it to your GitHub Classroom repository, **with all cells run and output results shown**.

**Note:** We are **not** responsible for re-running the cells in your notebook.

Please also submit all your processed (e.g., annotated and augmented) datasets, as well as all the models you trained in Task 1 and Task 4, to your GitHub Classroom repository.

The datasets and models that you need to submit include:

**1. The best model checkpoint you trained in Section 1.2 "Start Training and Validation!"**

**2. The best model prediction results in Section 1.2 "Fine-Grained Validation"**

**3. Your annotated test dataset in Section 3.2 "Annotate Your 100 Datapoints with Partner(s)"**

**4. Your augmented training data and best model checkpoint in Section 4.2 "Augment Your Model"**

**Note:** You may need to use [GitHub LFS](https://edstem.org/eu/courses/379/discussion/27240) for submitting large files.