#  Assignment 2 - Transfer Learning and Data Augmentation 💬

Welcome to the **second assignment** for the **CS-552: Modern NLP course**!

> - 😀 Name: **Abdurrahman Said Gürbüz**
> - ✉️ Email: **said.gurbuz@epfl.ch**
> - 🪪 SCIPER: **369141**

<div style="padding:15px 20px 20px 20px;border-left:3px solid green;background-color:#e4fae4;border-radius: 20px;color:#424242;">

## **Assignment Description**
- In the first part of this assignment, you will need to implement training (finetuning) and evaluation of a pre-trained language model ([RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)) on a **Sentiment Analysis (SA)** task, which aims to determine whether a product review's emotional tone is positive or negative.

- For part-2, following the first finetuning task, you will need to identify the shortcuts (i.e. some salient or toxic features) that the model learnt for the specific task.

- For part-3, you are supposed to annotate 80 randomly assigned new datapoints as ground-truth labels. Additionally, the cross annotation should be conducted by another one or two annotators, and you will learn about how to calculate the agreement statistics as a significant characteristic reflecting the quality of a collected dataset.

- For part-4, since the human annotation is quite time- and effort-consuming, there are plenty of ways to get silver-labels from automatic labeling to augment the dataset scale, e.g., paraphrasing each text input in different words without changing its meaning. You will use a [T5](https://huggingface.co/docs/transformers/en/model_doc/t5) paraphrase model to expand the training data of sentiment analysis, and evaluate the improvement of data augmentation.

For Part 1 and Part 2, you will need to complete the code in the corresponding `.py` files (`sa.py` for Part 1, `shortcut.py` for Part 2). You will be provided with the function descriptions and detailed instructions about the code snippets you need to write.


### Table of Contents
- **PART 1: Sentiment Analysis (33 pts)**
    - 1.1 Dataset Processing (10 pts)
    - 1.2 Model Training and Evaluation (18 pts)
    - 1.3 Fine-Grained Validation (5 pts)
- **PART 2: Identify Model Shortcuts (22 pts)**
    - 2.1 N-gram Pattern Extraction (6 pts)
    - 2.2 Distill Potentially Useful Patterns (8 pts)
    - 2.3 Case Study (8 pts)
- **PART 3: Annotate New Data (25 pts)**
    - 3.1 Write an Annotation Guideline (5 pts)
    - 3.2 Annotate Your Datapoints with Partner(s) (8 pts)
    - 3.3 Agreement Measure (12 pts)
- **PART 4: Data Augmentation (20 pts)**
    - 4.1 Data Augmentation with Paraphrasing (15 pts)
    - 4.2 Retrain RoBERTa Model with Data Augmentation (5 pts)
    
### Deliverables

- ✅ This jupyter notebook: `assignment2.ipynb`
- ✅ `sa.py` and `shortcut.py` files
- ✅ Checkpoints for RoBERTa models finetuned on original and augmented SA training data (Part 1 and Part 4), including:
    - `models/lr1e-05-warmup0.3/`
    - `models/lr2e-05-warmup0.3/`
    - `models/augmented/lr1e-05-warmup0.3/`
- ✅ Model prediction results on each domain data (Part 1.3 Fine-Grained Validation): `predictions/`
- ✅ Cross-annotated new SA data (Part 3), including:
    - `data/<your_assigned_dataset_id>-<your_sciper_number>.jsonl`
    - `data/<your_assigned_dataset_id>-<your_partner_sciper_number>.jsonl`
    - (for group of 3) `data/<your_assigned_dataset_id>-<your_second_partner_sciper_number>.jsonl`
- ✅ Paraphrase-augmented SA training data (Part 4), including:
    - `data/augmented_train_sa.jsonl`
- ✅ `./tensorboard` directory with logs for all trained/finetuned models, including:
    - `tensorboard/part1_lr1e-05/`
    - `tensorboard/part1_lr2e-05/`
    - `tensorboard/part4_lr1e-05/`

### How to implement this assignment

Please read the following points carefully. All the information on how to read, implement, and submit your assignment is explained in detail below:

1. For this assignment, you will need to implement and fill in the missing code snippets for both the **Jupyter Notebook `assignment2.ipynb`** and the **`sa.py`**, **`shortcut.py`** python files.

2. Along with the above files, you need to upload the model files under the **`models/`** dir for the following models:
    - finetuned RoBERTa models on original SA training data (PART 1)  
    - finetuned RoBERTa model on augmented SA training data (PART 4)

3. You also need to upload model prediction results in Part 1.3 Fine-Grained Validation, saved in **`predictions/`**.

4. You also need to upload new data files under the **`data/`** dir (along with our already provided data), including:
    - new SA data with your and your partner's annotations (Part 3)
    - paraphrase-augmented SA training data (Part 4)

5. Finally, you will need to log your training using Tensorboard. Please follow the instructions in the `README.md` of the **``tensorboard/``** directory.

**Note**: Large files such as model checkpoints and logs should be pushed to the repository with Git LFS. Training the models on a GPU can speed up the process; we recommend using Colab's free GPU service for this. A tutorial on how to use Git LFS and Colab can be found [here](https://github.com/epfl-nlp/cs-552-modern-nlp/blob/main/Exercises/tutorials.md).
    
</div>

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

## **Environment Setup**

### **Option 1: creating your own environment**

```
conda create --name mnlp-a2 python=3.10
conda activate mnlp-a2
pip install -r requirements.txt
```

**Note**: If some package versions in our suggested environment do not work, feel free to try other package versions suitable for your computer, but remember to update ``requirements.txt`` and explain the environment changes in your notebook (no penalty for this if necessary).

### **Option 2: using Google Colab**
If you are using Google Colab notebook for this assignment, you will need to run a few commands to set up our environment on Google Colab, as shown below:
    
</div>

In [1]:
# This cell makes sure modules are auto-loaded when you change external python files
%load_ext autoreload
%autoreload 2

In [None]:
# If you are working in Colab, then consider mounting your assignment folder to your drive
#from google.colab import drive
#drive.mount('/content/drive')

# Direct to your assignment folder.
#%cd /content/drive/MyDrive/path-to-your-assignment-folder

Install packages that are not included in the Colab base environment:

In [1]:
import os
os.environ["WANDB_DISABLED"] = "true"
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0" # limiting to one GPU

# Install dependencies
!pip install -r requirements.txt



In [2]:
import numpy as np
import jsonlines
import random
from sklearn.metrics import cohen_kappa_score

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# TODO: Enter your Sciper number
SCIPER = '369141'
seed = int(SCIPER)
torch.backends.cudnn.deterministic = True

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

  from .autonotebook import tqdm as notebook_tqdm


<torch._C.Generator at 0x156dc4c50>

In [4]:
# Check the availability of a GPU (proceed only if one is reported as available!)
if torch.cuda.is_available():
  print('Good to go!')
elif torch.backends.mps.is_available():
  print('MPS is enabled!')
else:
  print('Please set GPU via Edit -> Notebook Settings.')

MPS is enabled!


<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">
    
# PART 1: Sentiment Analysis (33 pts)

In this part, we will finetune a pretrained language model (RoBERTa) on a sentiment analysis (SA) task.

> Specifically, we will focus on a binary sentiment classification task for multi-domain product reviews. It requires the model to **classify a given paragraph of review by its sentiment polarity (positive or negative)**. 

</div>

### Load Training Dataset (`train_sa.jsonl`) 

**You can run the following cell to get a first glance at your data**. Each data sample is a Python dictionary, which consists of the following components:
- input review (*'review'*): a natural language sentence or a paragraph commenting about a product.
- domain (*'domain'*): describing the type of product being reviewed.
- label of sentiment (*'label'*): indicating whether the review states positive or negative views about the product.
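
To get a feel for how these three fields are distributed, one could tally the samples per domain and label. The sketch below is only an illustration: the sample dictionaries are hand-written (not taken from `train_sa.jsonl`) and the helper `count_by_domain_and_label` is a hypothetical name.

```python
from collections import Counter

def count_by_domain_and_label(samples):
    """Count how many samples fall into each (domain, label) pair."""
    return Counter((s["domain"], s["label"]) for s in samples)

# Hypothetical samples that only mimic the structure described above
toy_samples = [
    {"review": "Great read.", "domain": "books", "label": "positive"},
    {"review": "Fell apart in a week.", "domain": "housewares", "label": "negative"},
    {"review": "Could not put it down.", "domain": "books", "label": "positive"},
]

counts = count_by_domain_and_label(toy_samples)
for (domain, label), n in sorted(counts.items()):
    print(f"{domain:12s} {label:9s} {n}")
```

The same helper could be applied to the list of dictionaries read from the training file to check whether domains and labels are balanced.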

In [5]:
data_dir = 'data'
data_train_path = os.path.join(data_dir, 'train_sa.jsonl')
with jsonlines.open(data_train_path, "r") as reader:
    for sid, sample in enumerate(reader.iter()):
        if sid % 200 == 0:
            print(sample)

{'review': "THis book was horrible.  If it was possible to rate it lower than one star i would have.  I am an avid reader and picked this book up after my mom had gotten it from a friend.  I read half of it, suffering from a headache the entire time, and then got to the part about the relationship the 13 year old boy had with a 33 year old man and i lit this book on fire.  One less copy in the world...don't waste your money. I wish i had the time spent reading this book back so i could use it for better purposes.  THis book wasted my life", 'domain': 'books', 'label': 'negative'}
{'review': 'Sphere by Michael Crichton is an excellant novel. This was certainly the hardest to put down of all of the Crichton novels that I have read. The story revolves around a man named Norman Johnson. Johnson is a phycologist. He travels with 4 other civilans to a remote location in the Pacific Ocean to help the Navy in a top secret misssion. They quickly learn that under the ocean is a half mile long sp

In [6]:
# We use the following pretrained tokenizer and model
model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)

tokenizer_config.json: 100%|██████████| 25.0/25.0 [00:00<00:00, 91.9kB/s]
vocab.json: 100%|██████████| 899k/899k [00:00<00:00, 3.27MB/s]
merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 1.61MB/s]
tokenizer.json: 100%|██████████| 1.36M/1.36M [00:00<00:00, 12.2MB/s]
config.json: 100%|██████████| 481/481 [00:00<00:00, 4.50MB/s]
model.safetensors: 100%|██████████| 499M/499M [00:20<00:00, 24.3MB/s] 
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 🎯 Q1.1: **Dataset Processing (10 pts)**

Our first step is to construct a PyTorch Dataset for the SA task. Specifically, we will need to implement **tokenization** and **padding** using a HuggingFace pre-trained tokenizer.

**TODO🔻: Complete `SADataset` class following the instructions in `sa.py`, and test by running the following cell.**
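
As a rough illustration of what padding involves, the sketch below truncates or right-pads a list of token ids to a fixed length and builds the matching attention mask. It is not the `SADataset` implementation: the helper `pad_and_mask`, the token ids, and `pad_id=1` are assumptions made for the example (the actual pad id should be read from `tokenizer.pad_token_id`).

```python
def pad_and_mask(token_ids, max_length, pad_id):
    """Truncate or right-pad a list of token ids and build the matching attention mask."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)          # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

# Hypothetical token ids; pad_id=1 is an assumption for this example
input_ids, attention_mask = pad_and_mask([0, 713, 1040, 2], max_length=6, pad_id=1)
print(input_ids)       # [0, 713, 1040, 2, 1, 1]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
```

In practice a HuggingFace tokenizer can usually produce both lists directly when called with padding and truncation options, so a hand-written helper like this is mainly useful for understanding what those options do.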

In [9]:
from sa import SADataset
model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
dataset = SADataset("data/train_sa.jsonl", tokenizer)

Building SA Dataset...


1600it [00:00, 2207.36it/s]


In [20]:
from testA2 import test_SADataset
test_SADataset(dataset)

SADataset test correct ✅


## 🎯 Q1.2: **Model Training and Evaluation (18 pts)**

Next, we will implement the training and evaluation process to finetune the model. 

- For training: you will need to calculate the **loss** and update the model weights using the **Adam optimizer**. Additionally, we add a **learning rate scheduler** to adapt the learning rate over the course of training.

- For evaluation: you will need to compute the **confusion matrix** and **F1 scores** to assess the model performance.

**TODO🔻: Complete the `compute_metrics()`, `train()` and `evaluate()` functions following the instructions in the `sa.py` file. You can test `compute_metrics()` by running the following cell.**
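
The relation between a confusion matrix and the F1 scores can be sketched as follows. This is a minimal example and not the required `compute_metrics()` signature: the helper `f1_from_confusion` and the toy matrix are hypothetical, and rows are assumed to be true labels and columns predicted labels (F1 itself does not change if the matrix is transposed).

```python
import numpy as np

def f1_from_confusion(confusion):
    """Return per-class F1 scores and macro-F1 for a square confusion matrix."""
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)            # correct predictions per class
    fp = confusion.sum(axis=0) - tp    # predicted as the class, but wrong
    fn = confusion.sum(axis=1) - tp    # belong to the class, but missed
    f1 = 2 * tp / (2 * tp + fp + fn)   # assumes every class has at least one count
    return f1, f1.mean()

# Hypothetical 2x2 confusion matrix
toy_confusion = [[8, 2],
                 [1, 9]]
per_class_f1, macro_f1 = f1_from_confusion(toy_confusion)
print(per_class_f1, macro_f1)
```

Macro-F1 is the unweighted mean of the per-class F1 scores, so both classes contribute equally regardless of their sizes.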

In [6]:
from sa import compute_metrics, train, evaluate

from testA2 import test_compute_metrics
test_compute_metrics(compute_metrics)

compute_metric test correct ✅


#### **Start Training and Validation!**

TODO🔻: (1) [coding question] Train the model with the following two different learning rates (other hyperparameters should be kept consistent). 

> A. learning_rate = 1e-5

> B. learning_rate = 2e-5

**Note:** *Each training will take ~7-10 minutes using a T4 Colab GPU.*
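
To see how a warmup fraction interacts with the peak learning rate, here is a sketch of a linear warmup followed by a linear decay, which is one common schedule. Whether `train()` in `sa.py` uses exactly this shape is an assumption; the helper `lr_at_step` and the step counts are hypothetical.

```python
def lr_at_step(step, total_steps, warmup_percent, peak_lr):
    """Linear warmup to peak_lr, then linear decay to zero (one common schedule)."""
    warmup_steps = int(round(total_steps * warmup_percent))
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# Illustrative values: 1000 optimizer steps with a 30% warmup
total_steps = 1000
for step in (0, 150, 300, 650, 1000):
    print(step, lr_at_step(step, total_steps, warmup_percent=0.3, peak_lr=1e-5))
```

With such a schedule, doubling the peak learning rate doubles the step size at every point of training, which is why the two settings are compared with all other hyperparameters fixed.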

In [26]:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.mps.manual_seed(seed)
device = torch.device("cuda:0" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu"))

model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.to(device)

batch_size = 8
epochs = 4
max_grad_norm = 1.0
warmup_percent = 0.3

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [28]:
learning_rate = 1e-5  # play around with this hyperparameter

train_dataset = SADataset("data/train_sa.jsonl", tokenizer)
test_dataset = SADataset("data/test_sa.jsonl", tokenizer)

train(train_dataset, test_dataset, model, device, batch_size=batch_size, epochs=epochs, max_grad_norm=max_grad_norm, warmup_percent=warmup_percent, learning_rate=learning_rate,
      model_save_root='models/', tensorboard_path="./tensorboard/part1_lr{}".format(learning_rate))

Building SA Dataset...


0it [00:00, ?it/s]

1600it [00:00, 2223.85it/s]


Building SA Dataset...


6400it [00:02, 2688.61it/s]
Training:   0%|          | 0/200 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
Training: 100%|██████████| 200/200 [00:58<00:00,  3.41it/s]
Evaluation: 100%|██████████| 800/800 [00:50<00:00, 15.79it/s]


Epoch: 0 | Training Loss: 0.654 | Validation Loss: 0.392
Epoch 0 SA Validation:
Confusion Matrix:
[[2546  654]
 [ 197 3003]]
F1: (85.68%, 87.59%) | Macro-F1: 86.63%
Model Saved!


Training: 100%|██████████| 200/200 [00:54<00:00,  3.70it/s]
Evaluation: 100%|██████████| 800/800 [00:48<00:00, 16.48it/s]


Epoch: 1 | Training Loss: 0.339 | Validation Loss: 0.463
Epoch 1 SA Validation:
Confusion Matrix:
[[3059  141]
 [ 580 2620]]
F1: (89.46%, 87.90%) | Macro-F1: 88.68%
Model Saved!


Training: 100%|██████████| 200/200 [00:54<00:00,  3.69it/s]
Evaluation: 100%|██████████| 800/800 [00:48<00:00, 16.47it/s]


Epoch: 2 | Training Loss: 0.228 | Validation Loss: 0.406
Epoch 2 SA Validation:
Confusion Matrix:
[[2823  377]
 [ 233 2967]]
F1: (90.25%, 90.68%) | Macro-F1: 90.46%
Model Saved!


Training: 100%|██████████| 200/200 [00:54<00:00,  3.64it/s]
Evaluation: 100%|██████████| 800/800 [00:48<00:00, 16.57it/s]


Epoch: 3 | Training Loss: 0.126 | Validation Loss: 0.513
Epoch 3 SA Validation:
Confusion Matrix:
[[2920  280]
 [ 294 2906]]
F1: (91.05%, 91.01%) | Macro-F1: 91.03%
Model Saved!


In [30]:
# For 2e-5 learning rate
learning_rate = 2e-5
model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.to(device)

train_dataset = SADataset("data/train_sa.jsonl", tokenizer)
test_dataset = SADataset("data/test_sa.jsonl", tokenizer)

train(train_dataset, test_dataset, model, device, batch_size=batch_size, epochs=epochs, max_grad_norm=max_grad_norm, warmup_percent=warmup_percent, learning_rate=learning_rate,
      model_save_root='models/', tensorboard_path="./tensorboard/part1_lr{}".format(learning_rate))

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Building SA Dataset...


1600it [00:00, 2183.55it/s]


Building SA Dataset...


6400it [00:02, 2670.28it/s]
Training: 100%|██████████| 200/200 [00:59<00:00,  3.39it/s]
Evaluation: 100%|██████████| 800/800 [00:49<00:00, 16.22it/s]


Epoch: 0 | Training Loss: 0.545 | Validation Loss: 0.322
Epoch 0 SA Validation:
Confusion Matrix:
[[2930  270]
 [ 449 2751]]
F1: (89.07%, 88.44%) | Macro-F1: 88.76%
Model Saved!


Training: 100%|██████████| 200/200 [00:54<00:00,  3.64it/s]
Evaluation: 100%|██████████| 800/800 [00:48<00:00, 16.54it/s]


Epoch: 1 | Training Loss: 0.332 | Validation Loss: 0.365
Epoch 1 SA Validation:
Confusion Matrix:
[[2962  238]
 [ 375 2825]]
F1: (90.62%, 90.21%) | Macro-F1: 90.42%
Model Saved!


Training: 100%|██████████| 200/200 [00:55<00:00,  3.63it/s]
Evaluation: 100%|██████████| 800/800 [00:48<00:00, 16.57it/s]


Epoch: 2 | Training Loss: 0.169 | Validation Loss: 0.545
Epoch 2 SA Validation:
Confusion Matrix:
[[3043  157]
 [ 584 2616]]
F1: (89.15%, 87.59%) | Macro-F1: 88.37%


Training: 100%|██████████| 200/200 [00:53<00:00,  3.71it/s]
Evaluation: 100%|██████████| 800/800 [00:48<00:00, 16.58it/s]

Epoch: 3 | Training Loss: 0.095 | Validation Loss: 0.523
Epoch 3 SA Validation:
Confusion Matrix:
[[2987  213]
 [ 451 2749]]
F1: (90.00%, 89.22%) | Macro-F1: 89.61%





TODO🔻: (2) [textual question] compare and discuss the results. 

- Which learning rate is better? Explain your answers.

- Answer: Although the training loss is lower with learning rate 2e-5, the model trained with learning rate 1e-5 achieves a higher macro-F1 score after the final epoch (~0.91 versus ~0.895). This suggests that the model overfits slightly with learning rate 2e-5. However, both learning rates reach decent F1 scores, which may be due to the learning rate scheduling.

## 🎯 Q1.3: **Fine-Grained Validation (5 pts)**

TODO🔻: (1) [coding question] Using the model checkpoint trained with the first learning_rate setting (lr=1e-5), check the model performance on each domain subset of the validation set. You should report **the validation loss**, **confusion matrix**, **F1 scores** and **Macro-F1 on each domain**.

In [31]:
# Split the test sets into subsets with different domains
# Save the subsets under 'data/'
# Replace "..." with your code
domain_data = {}
with jsonlines.open("data/test_sa.jsonl", mode="r") as reader:
    for sample in reader:
        domain = sample["domain"]
        if domain not in domain_data:
            domain_data[domain] = []
        domain_data[domain].append(sample)

for domain, samples in domain_data.items():
    with jsonlines.open("data/test_sa_"+domain+".jsonl", mode="w") as writer:
        for sd in samples:
            writer.write(sd)

In [34]:
tokenizer = RobertaTokenizer.from_pretrained("models/lr1e-05-warmup0.3")

In [39]:
learning_rate = 1e-5
device = torch.device("cuda:0" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu"))
tokenizer = RobertaTokenizer.from_pretrained("models/lr1e-05-warmup0.3")
model = RobertaForSequenceClassification.from_pretrained("models/lr1e-05-warmup0.3")
model.to(device)

results_save_dir = 'predictions/'

# Evaluate and save prediction results in each domain
# Replace "..." with your code
for domain in domain_data:
    test_dataset = SADataset("data/test_sa_"+domain+".jsonl", tokenizer)
    dev_loss, confusion, f1_pos, f1_neg = evaluate(test_dataset, model, device, batch_size=batch_size,
                                                   result_save_file=os.path.join(results_save_dir, 'test_'+domain+'.jsonl'))
    macro_f1 = (f1_pos + f1_neg) / 2

    print(f'Domain: {domain}')
    print(f'Validation Loss: {dev_loss:.3f}')
    print(f'Confusion Matrix:')
    print(confusion)
    print(f'F1: ({f1_pos*100:.2f}%, {f1_neg*100:.2f}%) | Macro-F1: {macro_f1*100:.2f}%')

Building SA Dataset...


1600it [00:00, 1638.81it/s]
Evaluation: 100%|██████████| 200/200 [00:51<00:00,  3.87it/s]


Domain: books
Validation Loss: 0.509
Confusion Matrix:
[[735  65]
 [ 84 716]]
F1: (90.80%, 90.58%) | Macro-F1: 90.69%
Building SA Dataset...


1600it [00:00, 2051.59it/s]
Evaluation: 100%|██████████| 200/200 [00:52<00:00,  3.84it/s]


Domain: dvd
Validation Loss: 0.621
Confusion Matrix:
[[721  79]
 [ 86 714]]
F1: (89.73%, 89.64%) | Macro-F1: 89.69%
Building SA Dataset...


1600it [00:00, 3256.27it/s]
Evaluation: 100%|██████████| 200/200 [00:54<00:00,  3.65it/s]


Domain: electronics
Validation Loss: 0.497
Confusion Matrix:
[[727  73]
 [ 69 731]]
F1: (91.10%, 91.15%) | Macro-F1: 91.12%
Building SA Dataset...


1600it [00:00, 4173.76it/s]
Evaluation: 100%|██████████| 200/200 [00:54<00:00,  3.70it/s]

Domain: housewares
Validation Loss: 0.424
Confusion Matrix:
[[737  63]
 [ 55 745]]
F1: (92.59%, 92.66%) | Macro-F1: 92.62%





TODO🔻: (2) [textual question] compare and discuss the results. 

**Questions:**
- On which domain does the model perform the best? the worst?
- Give some possible explanations of why the model's best-performing domain is easier, and why the model's worst-performing domain is more challenging. Use some examples to support your explanations.

**Note:** To find examples for supporting your discussion, save the model prediction results on each domain under the `predictions/` folder, by specifying the `result_save_file` parameter in the *evaluate* function.


(Write your answer to the questions here.) \
**Answers:**
- The model performed best on the 'housewares' domain with a macro-F1 score of ~0.926, and worst on the 'dvd' domain with a macro-F1 score of ~0.897.
- One reason for this difference is that the reviews in the 'housewares' domain are easier for the model to predict because they are generally very straightforward and less ambiguous (e.g. predictions/test_housewares.jsonl, line-14 *"Easily the worst toaster ever. ... I'd give it a minus 5 stars, but they do not even have a 0 star vote"*). On the other hand, the 'dvd' domain is more challenging because the reviews in this domain are longer, more ambiguous, and contain more complicated statements (a movie can have many positive and negative aspects at the same time). For instance (I quote the whole review here to better show the complexity), predictions/test_dvd.jsonl, line-3: *"I've been waiting for this season, literally, for years.  It contained (notice the past tense) the funniest scene ever in the entire run of the show.  It was in the emergency clinic in the episode \"My Sister, My Sitter\" where the scene pans around showing the patients.  Smithers appears holding an empty gerbil cage and is the only patient not sitting down.  When I first saw that, many years ago, I literally rolled off of the couch with laughter as my wife stared at me like I needed a padded room.  She didn't even understand the joke when I explained it as an alternate lifestyle practice of \"hiding\" the rodent so as to make sitting down more than a little uncomfortable.  This DVD cut has been cut and makes Smithers reply about not letting Lisa ahead of him make no sense.  One of my favorite things about the Simpsons was that they didn't care who they offended.  Well they went PC and now I'm highly offended!  If they hadn't messed with it I would give them 5 stars, or more realistically I wouldn't have bothered to write this review.  Quit tampering with the show, FOX!!"*

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

# PART 2: Identify Model Shortcuts (22 pts)

In this part, we aim to find the shortcut features learnt by the sentiment analysis model we trained in Part 1. We will be using the model checkpoint trained with `learning rate=1e-5`.

</div>

## 🎯 Q2.1: **N-gram Pattern Extraction (6 pts)**
We hypothesize that `n-gram`s could be the potential shortcut features learnt by the SA model. An `n-gram` is defined as a sequence of n consecutive words appearing in a natural language sentence or paragraph.

An n-gram that appears in a review may serve as a key indicator of the polarity of the review's sentiment, so we aim to extract such n-grams. For example:

>- **Review 1**: This book was **horrible**. If it was possible to rate it **lower than one star** I would have.
>- **Review 2**: **Excellent** book, **highly recommended**. Helps to put a realistic perspective on millionaires.

For Review 1, the `1-gram "horrible"` and the `4-gram "lower than one star"` serve as two key indicators of negative sentiment. While for Review 2, the `1-gram "excellent"` and the `2-gram "highly recommended"` obviously indicate positive sentiment.

TODO🔻: (1) [coding question] Complete `ngram_extraction()` function in `shortcut.py` file.

The returned *ngrams* is a **list** of dictionaries. The `n-th` **dictionary** corresponds to the `n-grams` (n=1, 2, 3, 4).

The keys of each dictionary should be the **unique n-gram strings** appearing in the reviews, and the value of each n-gram key records the frequency of positive/negative predictions **made by the model** when the n-gram appears in the review, i.e., `[#positive_predictions, #negative_predictions]`.

> Example: **`ngrams`[0]['horrible']** holds the pair `[#positive_predictions, #negative_predictions]`, so **`ngrams`[0]['horrible'][0]** should return the number of positive predictions made by the model when the 1-gram token 'horrible' appears in the given review.

**Note:** (1) Sequences that contain punctuation should NOT be counted as n-grams (e.g. `it is great .` is NOT a 4-gram, but `it is great` is a 3-gram); (2) Stop-words should NOT be counted as 1-grams, but can appear in other n-gram sequences (e.g. `is` is NOT a 1-gram token, but `it is great` can be a 3-gram token.)
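
The two rules in the note can be sketched on whitespace-split tokens as below. This is an illustration under stated assumptions, not the `ngram_extraction()` implementation in `shortcut.py`: the helper `count_ngrams`, the tiny stop-word set, the whitespace tokenization, and the choice to count each n-gram at most once per review are all assumptions.

```python
import string
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption; a real list would be much longer)
STOP_WORDS = {"it", "is", "the", "a", "was", "this"}

def has_punctuation(token):
    """True if any character of the token is a punctuation mark."""
    return any(ch in string.punctuation for ch in token)

def count_ngrams(tokens, prediction, ngrams, max_n=4):
    """Update ngrams[n-1][ngram] = [#positive_predictions, #negative_predictions]."""
    idx = 0 if prediction == "positive" else 1
    for n in range(1, max_n + 1):
        seen = set()  # count each n-gram at most once per review (assumption)
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if any(has_punctuation(tok) for tok in window):
                continue  # rule (1): no punctuation inside an n-gram
            if n == 1 and window[0] in STOP_WORDS:
                continue  # rule (2): stop-words are not 1-grams
            seen.add(" ".join(window))
        for ngram in seen:
            ngrams[n - 1][ngram][idx] += 1

ngrams = [defaultdict(lambda: [0, 0]) for _ in range(4)]
count_ngrams("it is great .".split(), "positive", ngrams)
count_ngrams("it is horrible .".split(), "negative", ngrams)
print(dict(ngrams[0]))
print(dict(ngrams[1]))
```

Here `it is` is kept as a 2-gram although both words are stop-words, while `it is great .` is rejected as a 4-gram because of the final period.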

## 🎯 Q2.2: **Distill Potentially Useful Patterns (8 pts)**

TODO🔻: (2) [coding question] For each group of n-grams (n=1,2,3,4), find and **print** the **top-100 n-gram sequences** with the **greatest frequency of appearance**, which could contain frequent semantic features and would be used as our feature list.

In [11]:
from shortcut import ngram_extraction

In [8]:
# all your saved model prediction results from 1.3 Fine-Grained Validation
prediction_files = ["predictions/test_"+domain+".jsonl" for domain in domain_data.keys()]

# TODO: Define your tokenizer
tokenizer = RobertaTokenizer.from_pretrained("models/lr1e-05-warmup0.3")
ngrams = ngram_extraction(prediction_files, tokenizer)

top_100 = {}
for n, counts in enumerate(ngrams):
    # TODO: find top-100 n-grams (n=1,2,3 or 4) associated with the greatest frequency of appearance
    top_100_freq = sorted(counts.items(), key=lambda x: x[1][0] + x[1][1], reverse=True)[:100]

    print(f'Top-100 most frequent {n+1}-grams:')
    print(top_100_freq)

    top_100[n] = top_100_freq

100%|██████████| 1600/1600 [00:03<00:00, 446.27it/s]
100%|██████████| 1600/1600 [00:03<00:00, 486.47it/s]
100%|██████████| 1600/1600 [00:02<00:00, 788.84it/s]
100%|██████████| 1600/1600 [00:01<00:00, 957.96it/s] 


Top-100 most frequent 1-grams:
[('one', [2083, 1904]), ('book', [1895, 1856]), ('like', [1257, 1305]), ('would', [902, 1316]), ('good', [1146, 984]), ('movie', [919, 1002]), ('get', [872, 1035]), ('time', [986, 921]), ('great', [1336, 538]), ('well', [1085, 634]), ('even', [714, 878]), ('use', [913, 613]), ('much', [722, 781]), ('film', [797, 623]), ('really', [711, 701]), ('first', [707, 699]), ('also', [857, 527]), ('read', [688, 672]), ('j', [740, 540]), ('way', [570, 576]), ('work', [550, 595]), ('many', [618, 480]), ('better', [484, 582]), ('could', [458, 606]), ('k', [686, 374]), ('new', [552, 484]), ('two', [534, 500]), ('b', [559, 466]), ('people', [525, 497]), ('2', [487, 532]), ('make', [489, 523]), ('l', [615, 363]), ('little', [556, 417]), ('back', [409, 562]), ('story', [532, 430]), ('see', [524, 433]), ('r', [541, 407]), ('vd', [507, 441]), ('love', [696, 251]), ('g', [544, 393]), ('man', [526, 392]), ('think', [390, 516]), ('buy', [363, 541]), ('h', [465, 415]), ('never'

**Among each type of top-100 frequent n-grams above**, we aim to further find out the n-grams which **most likely** lead to *positive*/*negative* predictions (positive/negative shortcut features). 

TODO🔻: (3) [coding&text question] Design **two different methods to re-rank** the top-100 n-grams to extract shortcut features. For each method, you should extract **1** feature for each n-gram group (n=1, 2, 3, 4) for positive and negative predictions (1\*4\*2=8 features in total for 1 method).

Explain each of your design choices in natural language, and compare which method finds more reasonable patterns.


In [31]:
# TODO: [Method 1] find top-1 positive and negative patterns
top_1 = [[{}, {}, {}, {}], [{}, {}, {}, {}]]
scores = [[{}, {}, {}, {}], [{}, {}, {}, {}]]
for n, occurrences in top_100.items():
    for ngram, (pos, neg) in occurrences:
        pos_score = (pos + neg) * (pos / (neg + 1))
        neg_score = (pos + neg) * (neg / (pos + 1))
        scores[0][n][ngram] = pos_score
        scores[1][n][ngram] = neg_score

for n in range(len(top_100)):
    top_1[0][n] = max(scores[0][n], key=scores[0][n].get)
    top_1[1][n] = max(scores[1][n], key=scores[1][n].get)

print(f'Top-1 positive patterns:')
print(top_1[0])
print(f'Top-1 negative patterns:')
print(top_1[1])

# TODO: [Explanation of Method 1]
# The positive score of each n-gram is (total occurrences) * (ratio of positive to negative occurrences),
# and vice versa for the negative score. I add 1 to the denominator to avoid division by zero.
# As an analysis, only the top-1 positive 1-gram gives a decent shortcut feature ('great' for positive sentiment).

Top-1 positive patterns:
['great', 'k agan', 'b erg man', 'ris ben oit vs']
Top-1 negative patterns:
['book', 'customer service', 'ge h ry', 'dean ko ont z']


In [33]:
# TODO: [Method 2] find top-1 positive and negative patterns
top_1 = [[{}, {}, {}, {}], [{}, {}, {}, {}]]
scores = [[{}, {}, {}, {}], [{}, {}, {}, {}]]
for n, occurrences in top_100.items():
    for ngram, (pos, neg) in occurrences:
        pos_score = (pos + neg) * np.log((pos + 1) / (neg + 1))
        neg_score = (pos + neg) * np.log((neg + 1) / (pos + 1))
        scores[0][n][ngram] = pos_score
        scores[1][n][ngram] = neg_score

for n in range(len(top_100)):
    top_1[0][n] = max(scores[0][n], key=scores[0][n].get)
    top_1[1][n] = max(scores[1][n], key=scores[1][n].get)

print(f'Top-1 positive patterns:')
print(top_1[0])
print(f'Top-1 negative patterns:')
print(top_1[1])

# TODO: [Explanation of Method 2]
# In the first method, I observed that a very low or zero positive (or negative) count for an n-gram
# affects the score significantly. To dampen this effect, I take the log of the smoothed ratio:
# score = (total occurrences) * log((pos + 1) / (neg + 1)) for the positive score (and vice versa for the negative score).
# As an analysis, the top-1 positive 1-gram and 2-gram give decent shortcut features
# ('great' and 'highly recommend' for positive sentiment), so I prefer the second method over the first one.
# However, the other features are still not very useful for sentiment analysis.

Top-1 positive patterns:
['great', 'highly recommend', 'b erg man', 'ris ben oit vs']
Top-1 negative patterns:
['would', 'customer service', 'cu isin art', 'dean ko ont z']



TODO🔻: Compare and discuss the results from two methods above.
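
One way to compare the two scoring rules is to apply both to the same counts. The sketch below re-implements the two formulas from the cells above as hypothetical helpers `ratio_score` and `log_ratio_score`. The counts for 'great' are taken from the printed top-100 1-grams, while the rare one-sided n-gram and its counts are made up for illustration.

```python
import math

def ratio_score(pos, neg):
    """Method 1: total frequency times the smoothed positive/negative ratio."""
    return (pos + neg) * (pos / (neg + 1))

def log_ratio_score(pos, neg):
    """Method 2: total frequency times the log of the smoothed ratio."""
    return (pos + neg) * math.log((pos + 1) / (neg + 1))

counts = {
    "great": (1336, 538),              # from the top-100 1-gram output above
    "rare one-sided n-gram": (60, 0),  # hypothetical counts
}
for ngram, (pos, neg) in counts.items():
    print(f"{ngram:24s} ratio={ratio_score(pos, neg):9.1f}  "
          f"log-ratio={log_ratio_score(pos, neg):8.1f}")
```

Under these counts the frequent n-gram scores only slightly higher than the rare one with the plain ratio, whereas with the log ratio the gap is several times larger. This is the dampening effect described in the explanation of Method 2.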

## 🎯 Q2.3: **Case Study (8 pts)**

TODO🔻: Among the shortcut features you found in 2.1, find **4 representative** cases (pairs of `[review, n-gram feature]`) where the shortcut feature **will lead to a wrong prediction**.

For example, the 1-gram feature "excellent" has been considered as a shortcut for *positive* sentiment, while the ground-truth label of the given review containing "excellent" is *negative*.

**Questions:**
- Based on your case study, do you detect any limitations of the n-gram patterns?
- Which type of n-gram (1/2/3/4-gram) pattern is more robust to be used for sentiment prediction shortcut and why?

In [38]:
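# TODO: you can fill your code for finding cases here
# 4 cases in total: 1-grams and 2-grams for positive and negative sentiments
# Note: `reader` is a single-pass iterator, so each pattern is searched only in the
# samples that follow the previous match, not from the start of the file.
with jsonlines.open("data/test_sa.jsonl", mode="r") as reader:
    # positive shortcut features that appear in negative reviews
    for pattern in [top_1[0][0], top_1[0][1]]:
        for sample in reader:
            if pattern in sample["review"] and sample["label"] == "negative":
                print(sample)
                break
    # negative shortcut features that appear in positive reviews
    for pattern in [top_1[1][0], top_1[1][1]]:
        for sample in reader:
            if pattern in sample["review"] and sample["label"] == "positive":
                print(sample)
                break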

{'review': "When I bought this book, I didn't realize it was mostly just a consolidated writing of Friedman's collumns in the Times.  I think Friedman is a great author with lots of great insights, but he isn't able to go into his ideas in depth as much as I would have liked in a bunch of detached 750-1000 word segments.  Since the sections are arranged chronologically, there also isn't the opportunity to tie the themes together. If you really like Friedman, then it's worth a read, but if you are in it for just one, I'd read The Lexus and the Olive Tree first.  It is by far his best", 'domain': 'books', 'label': 'negative'}
{'review': 'The title is a misrepresentation.  This is not a handbook for driving a Porsche quickly or professionally.  I bought this book because in 2 months I am going to drive my rear engined 993 on a F1 circuit.  I have no circuit experience, and cannot get any coaching or any circuit experience in the country I live in.  Since this is a rear engined car which I

TODO🔻: (Write your case study discussions and answers to the questions here.)

**Answers:** 

- One of the most important limitations of n-gram patterns is that they cannot capture the global context of a review. As a result, even when the reviewer applies "great" to something other than the product under review (another movie, or the author rather than the book), the review still looks positive to the n-gram pattern. These results show the importance of contextual word embeddings. In addition, the number of possible n-grams grows exponentially with n, so each individual n-gram is observed less often and reliable frequency estimates become harder to obtain. It can therefore be hard to collect enough data to make a decision as n increases (in our case, the 3-gram and 4-gram patterns are nonsensical, which supports this claim).
- 1-gram or 2-gram patterns can be more robust because 3-grams and 4-grams suffer from data sparsity. Since 1-gram patterns are the most frequent in the reviews, their statistics are the most reliable. However, 2-grams can be more robust than 1-grams when there are enough samples, because they capture more context.
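
The counterexample search behind this case study can be sketched as follows. This version keeps the samples in a list so that every pattern is checked against the full set; the helper name `find_counterexample` and the three toy reviews are hypothetical stand-ins for `data/test_sa.jsonl`.

```python
def find_counterexample(samples, pattern, shortcut_label):
    """Return the first review that contains `pattern` but whose gold label
    disagrees with the sentiment the shortcut predicts, or None."""
    for sample in samples:
        if pattern in sample["review"] and sample["label"] != shortcut_label:
            return sample
    return None

# Toy samples (made up for illustration)
samples = [
    {"review": "A great read, I highly recommend it.", "label": "positive"},
    {"review": "The author is great, but this book is a letdown.", "label": "negative"},
    {"review": "I would buy this blender again.", "label": "positive"},
]

# (pattern, sentiment the shortcut predicts)
shortcuts = [("great", "positive"), ("would", "negative")]
for pattern, shortcut_label in shortcuts:
    case = find_counterexample(samples, pattern, shortcut_label)
    print(pattern, "->", case)
```

The second toy review mirrors the limitation discussed above: "great" describes the author, not the book being reviewed.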

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

## **Part 3: Annotate New Data (25 pts)**

In this part, you will **annotate** the gold labels of some **new** SA data samples, and measure the degree of **agreement** between your and **one or two partners'** annotations.
    
</div>

## 🎯 Q3.1: **Write an Annotation Guideline (5 pts)**

TODO🔻: Imagine that you are going to assign this annotation task to a crowdsourcing worker who is not at all familiar with computer science or NLP. Think about how you would explain the task to guide them toward doing a decent job, and write an annotation guideline for such a worker.

**Note:** You should come up with your own guideline without the help of your partner(s) in later Part 3.2

(Write your annotation guideline here.)

**Annotation Guideline:**
- **Description of the Task:** 
    - There is a set of reviews about different products from 4 domains: books, dvd, electronics, and housewares.
    - Your task is to read each review and label its emotional tone as positive or negative. If the review praises the product, or the reviewer is happy with the product, you should label it as positive.
    - On the other hand, if the review criticizes the product, or the reviewer expresses unhappiness with the product, you should label it as negative. Consider the whole review when deciding. For instance, if the reviewer criticizes some aspects of the product but praises others, label the review according to its overall sentiment rather than any single sentence.
- **Review Examples:** You can see two example reviews below to understand the task better: (These are from domain 'books' as an example)
    - Positive Review: "The book arrived as expected and was in great shape.  Thanks"
    - Negative Review: "This seemed too long and too drawn out"
- **Final Notes:** 
    - Please read the reviews carefully and try to understand the overall sentiment and emotion of the review. If you are not sure about the sentiment, please reread the review and try to understand the context better and label the review according to the overall emotion.
    - We will use your annotations to train an accurate model to predict the sentiment/emotion of the reviews, so please be careful while annotating the reviews.
    - If you have any questions or need help, feel free to ask.



## 🎯 Q3.2: **Annotate Your Datapoints with Partner(s) (8 pts)**

TODO🔻: Annotate 80 datapoints (20 in each domain of "books", "dvd", "electronics" and "housewares") assigned to you and your partner(s), by editing the value of the key **"label"** in each datapoint. You and your partner(s) should annotate **independently of each other**, i.e., each of you provide your own 80 annotations.

Please find your assigned annotation dataset **ID** and **your partner(s)** according to this [list](https://docs.google.com/spreadsheets/d/1hOwBUb8XE8fitYa4hlAwq8mARZe3ZsL4/edit?usp=sharing&ouid=108194779329215429936&rtpof=true&sd=true). Your annotation dataset can be found [here](https://drive.google.com/drive/folders/1IHXU_v3PDGbZG6r9T5LdjKJkHQ351Mb4?usp=sharing).

**Name your annotated file as `<your_assigned_dataset_id>-<your_sciper_number>.jsonl`.**

**You should also submit your partner's annotated file `<assigned_dataset_id>-<your_partner_sciper_number>.jsonl`.**

IMPORTANT NOTE: I contacted my assigned partner but did not receive a response within one week. Therefore, I contacted another student assigned the same dataset ID (Pablo Nicolas Soto Gomez (383334)) and used his annotations for comparison.

## 🎯 Q3.3: **Agreement Measure (12 pts)**

TODO🔻: Based on your and your partner's annotations in 3.2, calculate the [Cohen's Kappa](https://scikit-learn.org/stable/modules/model_evaluation.html#cohen-kappa) or [Krippendorff's Alpha](https://github.com/pln-fing-udelar/fast-krippendorff) (if you are in a group of three students) between the annotators on **each domain** and **across all domains**.

**Note:** Cohen's Kappa or Krippendorff's Alpha interpretation

0: No Agreement

0 ~ 0.2: Slight Agreement

0.2 ~ 0.4: Fair Agreement

0.4 ~ 0.6: Moderate Agreement

0.6 ~ 0.8: Substantial Agreement

0.8 ~ 1.0: Near Perfect Agreement

1.0: Perfect Agreement

**Questions:**
- What is the overall degree of agreement between you and your partner(s) according to the above interpretation of score ranges?
- In which domain do disagreements happen most and least frequently between you and your partner(s)? Give some examples to explain why that is the case.
- Are there possible ways to address the disagreements between annotators?

In [3]:
# Fill your code for calculating agreement scores here.
my_labels = []
partner_labels = []

with jsonlines.open("data/53-369141.jsonl", mode="r") as reader:
    for sample in reader:
        my_labels.append(sample["label"])

with jsonlines.open("data/53-383334.jsonl", mode="r") as reader:
    for sample in reader:
        partner_labels.append(sample["label"])

print(f'Agreement score: {cohen_kappa_score(my_labels, partner_labels)}')

Agreement score: 0.8439024390243902
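
The question also asks for agreement on each domain. A minimal sketch of that breakdown follows; it implements Cohen's Kappa by hand so the snippet is dependency-free (the notebook itself uses scikit-learn's `cohen_kappa_score`), and the helper names and the toy records are assumptions rather than the real annotation files.

```python
from collections import Counter, defaultdict

def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    # observed agreement
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement from each annotator's label distribution
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when both annotators only ever use one identical label
    return (p_o - p_e) / (1 - p_e)

def kappa_per_domain(records):
    """records: iterable of (domain, my_label, partner_label) tuples."""
    by_domain = defaultdict(lambda: ([], []))
    for domain, mine, partner in records:
        by_domain[domain][0].append(mine)
        by_domain[domain][1].append(partner)
    return {domain: cohen_kappa(a, b) for domain, (a, b) in by_domain.items()}

# Toy annotations (hypothetical, not the real data)
records = [
    ("books", "positive", "positive"),
    ("books", "positive", "negative"),
    ("books", "negative", "negative"),
    ("books", "negative", "negative"),
    ("dvd", "positive", "positive"),
    ("dvd", "negative", "negative"),
]
print(kappa_per_domain(records))
print(cohen_kappa([r[1] for r in records], [r[2] for r in records]))
```

With the real files, the records could be built by zipping the two lists of samples, assuming both files keep the same order and each sample carries a `domain` key.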


(Write your answers to the questions here.)

**Answers:**
- The overall degree of agreement between me and my partner is near perfect, with a Cohen's Kappa score of ~0.844.
- We disagree on 1 review in the dvd domain, 1 in electronics, and 2 in housewares, so disagreements are most frequent in housewares and absent in books. The main reason for the disagreements is the ambiguity of the reviews. For instance, in the 28th review (dvd domain), the reviewer first says "I still love this movie, so I'm giving it 4 stars" but also says "Actor's voices have lost the human richness" and "Even the music and sound effects sound a little "off."". Both positive and negative sentiment therefore occur in the same review, which makes it hard to label. In the 2 housewares disagreements, the reviewer first refers to other reviews and only then gives his/her own perspective, which can confuse annotators who do not pay enough attention to the context.
- One way to address the disagreements is to provide more detailed annotation guidelines and examples to the annotators. Also, when I checked the annotations, I observed that most of the disagreements are due to the ambiguity of the reviews. Therefore, it would be better to fix in advance a single criterion for labeling a review: its emotion, its sentiment, or its language style.
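
Counts like the ones quoted above can be reproduced by collecting the positions where the two label lists differ. The sketch below assumes both files list the same reviews in the same order and that each sample carries a `domain` key; the function name and toy data are illustrative.

```python
from collections import defaultdict

def disagreements_by_domain(my_samples, partner_samples):
    """Map each domain to the indices where the two annotators chose different labels."""
    result = defaultdict(list)
    for idx, (mine, partner) in enumerate(zip(my_samples, partner_samples)):
        if mine["label"] != partner["label"]:
            result[mine["domain"]].append(idx)
    return dict(result)

# Toy annotations (hypothetical)
my_samples = [
    {"domain": "dvd", "label": "positive"},
    {"domain": "dvd", "label": "negative"},
    {"domain": "housewares", "label": "negative"},
]
partner_samples = [
    {"domain": "dvd", "label": "negative"},
    {"domain": "dvd", "label": "negative"},
    {"domain": "housewares", "label": "positive"},
]
print(disagreements_by_domain(my_samples, partner_samples))
```

The returned indices point to the reviews that the annotators would need to re-read and discuss when resolving disagreements.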

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

## **Part 4: Data Augmentation (20 pts)**

Since we only used 20% of the whole dataset for training, model performance might be limited. In this final part, we will try to enlarge the training set by **data augmentation**.  

Specifically, we will **`Rephrase`** some current training samples using a pretrained paraphraser, so that the paraphrased synthetic samples preserve the semantic similarity while changing the surface form.

You can use the pretrained T5 paraphraser [here](https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base).

</div>

## 🎯 Q4.1: **Data Augmentation with Paraphrasing (15 pts)**
TODO🔻: Implement functions named `get_paraphrase_batch` and `get_paraphrase_dataset` with the details in the below two blocks. 

In [5]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# get the given pretrained paraphrase model and the corresponding tokenizer (https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base)
paraphrase_tokenizer = AutoTokenizer.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base")
paraphrase_model = AutoModelForSeq2SeqLM.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base").to(device)

def get_paraphrase_batch(
    model,
    tokenizer,
    input_samples,
    n,
    repetition_penalty=10.0,
    diversity_penalty=3.0,
    no_repeat_ngram_size=2,
    temperature=0.7,
    max_length=256,
    device='mps'):
    '''
    Input
      model: paraphraser
      tokenizer: paraphrase tokenizer
      input_samples: a batch (list) of real samples to be paraphrased
      n: number of paraphrases to get for each input sample
      for other parameters, please refer to:
          https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationConfig
    Output:
      synthetic_samples: a list of paraphrased samples
    '''

    # TODO: implement paraphrasing on a batch of input samples
    synthetic_samples = []
    reviews = [sample['review'] for sample in input_samples]
    inputs = tokenizer(reviews, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        # diverse beam search: n beams split into n groups, n sequences returned per input
        outputs = model.generate(
            **inputs,
            repetition_penalty=repetition_penalty,
            diversity_penalty=diversity_penalty,
            no_repeat_ngram_size=no_repeat_ngram_size,
            temperature=temperature,
            max_length=max_length,
            num_return_sequences=n,
            num_beams=n,
            num_beam_groups=n,
        )

    # generate() returns len(input_samples) * n sequences, ordered by input sample
    for i, sample in enumerate(input_samples):
        paraphrases = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs[i * n : (i + 1) * n]]
        # each paraphrase inherits the domain and label of its source review
        synthetic_samples.extend([{'review': paraphrase, 'domain': sample['domain'], 'label': sample['label']} for paraphrase in paraphrases])

    return synthetic_samples
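
The slicing `outputs[i * n : (i + 1) * n]` relies on `generate` returning the `n` sequences of each input contiguously, in input order. The bookkeeping can be sketched without a model; `attach_labels` and the strings below are made up for illustration.

```python
def attach_labels(decoded, input_samples, n):
    """Pair a flat list of len(input_samples) * n paraphrases with the
    domain and label of the input sample each one was generated from."""
    assert len(decoded) == len(input_samples) * n
    synthetic = []
    for i, sample in enumerate(input_samples):
        for text in decoded[i * n:(i + 1) * n]:
            synthetic.append({"review": text, "domain": sample["domain"], "label": sample["label"]})
    return synthetic

# Toy batch of two samples with n = 2 paraphrases each (strings are placeholders)
batch = [
    {"review": "Great book.", "domain": "books", "label": "positive"},
    {"review": "Broke in a week.", "domain": "electronics", "label": "negative"},
]
decoded = ["A1", "A2", "B1", "B2"]
synthetic = attach_labels(decoded, batch, n=2)
print(synthetic)
```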



In [6]:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
device = torch.device("cuda:0" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu"))

data_dir = 'data'
data_train_path = os.path.join(data_dir, 'train_sa.jsonl')
BATCH_SIZE = 4
N_PARAPHRASE = 2

def get_paraphrase_dataset(model, tokenizer, data_path, batch_size, n_paraphrase):
    '''
    Input
      model: paraphrase model
      tokenizer: paraphrase tokenizer
      data_path: path to the `jsonl` file of training data
      batch_size: number of input samples to be paraphrased in one batch
      n_paraphrase: number of paraphrased sequences for each sample
    Output:
      paraphrase_dataset: a list of all paraphrase samples. Do not include the original training data.
    '''
    paraphrase_dataset = []
    with jsonlines.open(data_path, "r") as reader:

        # TODO: get paraphrases for the whole training dataset using get_paraphrase_batch
        samples = []
        n_processed = 0  # number of original samples sent to the paraphraser so far
        for sample in reader:
            samples.append(sample)
            if len(samples) == batch_size:
                n_processed += len(samples)
                print(f"Getting paraphrases for {n_processed} samples")
                # pass the global `device` so the inputs land on the same device as the model
                paraphrase_dataset.extend(get_paraphrase_batch(model, tokenizer, samples, n_paraphrase, device=device))
                samples = []

        # paraphrase the last, incomplete batch
        if len(samples) > 0:
            paraphrase_dataset.extend(get_paraphrase_batch(model, tokenizer, samples, n_paraphrase, device=device))

    return paraphrase_dataset
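
The accumulate-and-flush loop above is a general batching pattern. As an alternative sketch (not the notebook's implementation), the same batches, including the shorter final one, can be produced by a small generator; the name `batched` is a local helper here, and `range(10)` stands in for the jsonlines reader.

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield lists of up to batch_size items; the last list may be shorter."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

batches = list(batched(range(10), 4))
print(batches)
```

Keeping the batching in a generator removes the need for a separate branch that handles the leftover samples after the loop.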

**Note:** Running the paraphrasing takes ~20-30 minutes on a T4 Colab GPU, but the running time depends on the implementation.

In [None]:
paraphrase_dataset = get_paraphrase_dataset(paraphrase_model, paraphrase_tokenizer, data_train_path, BATCH_SIZE, N_PARAPHRASE)

In [9]:
# Original training dataset
with jsonlines.open(data_train_path, "r") as reader:
    origin_data = [dt for dt in reader.iter()]

all_data = origin_data + paraphrase_dataset

# Write all the original and paraphrased data samples into training dataset
augmented_data_train_path = os.path.join(data_dir, 'augmented_train_sa.jsonl')
with jsonlines.open(augmented_data_train_path, "w") as writer:
    writer.write_all(all_data)

# every original sample contributes itself plus N_PARAPHRASE paraphrases
assert len(all_data) == (1 + N_PARAPHRASE) * len(origin_data)

## 🎯 Q4.2: **Retrain RoBERTa Model with Data Augmentation (5 pts)** 
TODO🔻: Retrain the sentiment analysis model with the augmented (original+paraphrased), larger dataset :)

**Note:** *Training on the augmented data will take about 15 minutes using a T4 Colab GPU.*

In [7]:
# Re-train a RoBERTa SA model on the augmented training dataset
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
device = torch.device("cuda:0" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu"))

model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.to(device)

batch_size = 8
epochs = 4
max_grad_norm = 1.0
warmup_percent = 0.3
learning_rate = 1e-5

train_dataset = SADataset("data/augmented_train_sa.jsonl", tokenizer)
test_dataset = SADataset("data/test_sa.jsonl", tokenizer)

train(train_dataset, test_dataset, model, device, batch_size=batch_size, epochs=epochs, max_grad_norm=max_grad_norm, warmup_percent=warmup_percent, learning_rate=learning_rate,
      model_save_root='models/augmented/', tensorboard_path="./tensorboard/part4_lr{}".format(learning_rate))

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Building SA Dataset...


4800it [00:01, 3904.73it/s]


Building SA Dataset...


6400it [00:02, 2678.75it/s]
Training:   0%|          | 0/600 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
Training: 100%|██████████| 600/600 [02:30<00:00,  3.99it/s]
Evaluation: 100%|██████████| 800/800 [00:46<00:00, 17.17it/s]


Epoch: 0 | Training Loss: 0.530 | Validation Loss: 0.302
Epoch 0 SA Validation:
Confusion Matrix:
[[2930  270]
 [ 413 2787]]
F1: (89.56%, 89.08%) | Macro-F1: 89.32%
Model Saved!


Training: 100%|██████████| 600/600 [02:17<00:00,  4.37it/s]
Evaluation: 100%|██████████| 800/800 [00:46<00:00, 17.10it/s]


Epoch: 1 | Training Loss: 0.347 | Validation Loss: 0.306
Epoch 1 SA Validation:
Confusion Matrix:
[[2854  346]
 [ 311 2889]]
F1: (89.68%, 89.79%) | Macro-F1: 89.73%
Model Saved!


Training: 100%|██████████| 600/600 [02:16<00:00,  4.38it/s]
Evaluation: 100%|██████████| 800/800 [00:46<00:00, 17.23it/s]


Epoch: 2 | Training Loss: 0.195 | Validation Loss: 0.588
Epoch 2 SA Validation:
Confusion Matrix:
[[2729  471]
 [ 212 2988]]
F1: (88.88%, 89.74%) | Macro-F1: 89.31%


Training: 100%|██████████| 600/600 [02:16<00:00,  4.39it/s]
Evaluation: 100%|██████████| 800/800 [00:46<00:00, 17.18it/s]


Epoch: 3 | Training Loss: 0.104 | Validation Loss: 0.555
Epoch 3 SA Validation:
Confusion Matrix:
[[2888  312]
 [ 306 2894]]
F1: (90.33%, 90.35%) | Macro-F1: 90.34%
Model Saved!
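
The F1 figures in the log can be re-derived from the confusion matrices. Below is a minimal sketch for the epoch 3 matrix, assuming rows are gold labels and columns are predictions (the F1 value is the same under the opposite convention, since false positives and false negatives enter symmetrically). The helper name is an illustrative assumption.

```python
def f1_per_class(confusion):
    """Per-class F1 from a square confusion matrix (rows: gold, columns: predicted)."""
    scores = []
    for k in range(len(confusion)):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                 # gold k, predicted as another class
        fp = sum(row[k] for row in confusion) - tp  # predicted k, gold is another class
        scores.append(2 * tp / (2 * tp + fp + fn))
    return scores

epoch3 = [[2888, 312],
          [306, 2894]]
f1_scores = f1_per_class(epoch3)
macro_f1 = sum(f1_scores) / len(f1_scores)
print(", ".join(f"{100 * f:.2f}%" for f in f1_scores), f"| Macro-F1: {100 * macro_f1:.2f}%")
```

This reproduces the 90.33%, 90.35% and 90.34% reported for epoch 3.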


TODO🔻: Discuss your results by answering the following questions

- Compare the performances of models in Part 1 and Part 4. Does the data augmentation help with the performance and why (give possible reasons)?
- No matter whether the data augmentation helps or not, list **three** possible ways to improve our current data augmentation method.

(Write your answers to the questions here.)

**Answers:**
- The performance after data augmentation is not better, and is even slightly worse, than that of the model trained on the original data. This may be due to the quality of the generated paraphrases: some of them are not complete sentences or do not make sense. I also observed that most paraphrases are much shorter than the original reviews, even though I set a large `max_length`, which sometimes results in incomplete paraphrases. Another possible reason for the slightly worse results is that the training hyperparameters were not re-tuned for the augmented data. A comparison made after finding the best hyperparameter configuration for each setting (e.g. with grid search) would be a more reliable way to decide whether data augmentation helps.
- Three possible ways to improve our current data augmentation method are:
    - Using more advanced paraphrasing models that can generate more diverse, complete, and longer paraphrases.
    - Finetuning the paraphrasing model on the specific domain data to generate more domain-specific paraphrases which can provide semantically/emotionally more accurate paraphrases.
    - Using a hyperparameter configuration that is optimized for the augmented data (such as a smaller learning rate, since we now have more data).
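
One inexpensive way to act on the observation about truncated paraphrases, before swapping the paraphraser, would be to discard paraphrases that are much shorter than their source review. The sketch below is hypothetical: the helper name, the 0.5 word-count ratio, and the toy strings are all assumptions, and the threshold would need tuning on the real data.

```python
def keep_paraphrase(original, paraphrase, min_ratio=0.5):
    """Keep a paraphrase only if it has at least min_ratio times the words of the original."""
    n_original = len(original.split())
    n_paraphrase = len(paraphrase.split())
    return n_original > 0 and n_paraphrase / n_original >= min_ratio

original = "The blender worked well for a month and then the motor burned out"
candidates = [
    "The blender ran fine for a month before its motor burned out",
    "The blender worked",
]
kept = [p for p in candidates if keep_paraphrase(original, p)]
print(kept)
```

A length filter like this only removes obviously truncated outputs; it does not check that the sentiment of the paraphrase still matches the label.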

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

### **5 Upload Your Notebook, Data and Models**

Please upload your filled jupyter notebook in your GitHub Classroom repository, **with all cells run and output results shown**.

**Note:** We are **not** responsible for re-running the cells in your notebook.

Please also submit all your **datasets** **(annotated and augmented)**, as well as **all your trained models** in Part 1 and Part 4, in your GitHub Classroom repository.
    
</div>