## Extra credit assignment, CS685 Fall 2021




---

##### *How to do this problem set:*

- Some questions require writing Python code and computing results, and the rest of them have written answers. For coding problems, you will have to fill out all code blocks that say `YOUR CODE HERE`.

- For text-based answers, you should replace the text that says "Write your answer here..." with your actual answer.
 
- This assignment is designed such that each cell takes a few minutes (if that) to run. If it is taking longer than that, you might have made a mistake in your code.

---

##### *How to submit this problem set:*
- Write all the answers in this Colab notebook. Once you are finished, generate a PDF via (File -> Print -> Save as PDF) and upload it to Gradescope.
  
- **Important:** check your PDF before you submit to Gradescope to make sure it exported correctly. If Colab gets confused about your syntax, it will sometimes terminate the PDF creation routine early.

- **Important:** on Gradescope, please make sure that you tag each page with the corresponding question(s). This makes it significantly easier for our graders to grade submissions, especially with the long outputs of many of these cells. We will take off points for submissions that are not tagged.

- When creating your final version of the PDF to hand in, please do a fresh restart and execute every cell in order. One handy way to do this is by clicking `Runtime -> Run All` in the notebook menu.

---

##### *Academic honesty*

- We will audit the Colab notebooks from a set number of students, chosen at random. The audits will check that the code you wrote actually generates the answers in your PDF. If you turn in correct answers on your PDF without code that actually generates those answers, we will consider this a serious case of cheating. See the course page for honesty policies.

- We will also run automatic checks of Colab notebooks for plagiarism. Copying code from others is also considered a serious case of cheating.

---

# Part 0: Setup

## Adding a hardware accelerator
The purpose of this homework is to get you acquainted with using large-scale pretrained language models, specifically in the context of transfer learning. Since we will be training large neural networks, we will attach a GPU; otherwise, training will take a very long time.

Please go to the menu and add a GPU as follows:

`Edit > Notebook Settings > Hardware accelerator > (GPU)`

Run the following cell to confirm that the GPU is detected.

In [None]:
import torch

# Confirm that the GPU is detected

assert torch.cuda.is_available()

# Get the GPU device name.
device_name = torch.cuda.get_device_name()
n_gpu = torch.cuda.device_count()
print(f"Found device: {device_name}, n_gpu: {n_gpu}")

Found device: Tesla K80, n_gpu: 1


## Installing Hugging Face's Transformers library
We will use Hugging Face's Transformers (https://github.com/huggingface/transformers), an open-source library that provides general-purpose architectures for natural language understanding and generation with a collection of various pretrained models made by the NLP community. This library will allow us to easily use pretrained models like `BERT` and perform experiments on top of them. We can use these models to solve downstream target tasks, such as text classification, question answering, and sequence labeling.

Run the following cell to install Hugging Face's Transformers library, download data and supporting code for the homework, and install some additional packages. Note that you will be asked to link with your Google Drive account to download some of these files. If you're concerned about security risks (there have not been any issues in previous semesters), feel free to make a new Google account and use it for this homework!

In [None]:
!pip install transformers==3.4.0
!pip install googletrans==3.1.0a0
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
!pip uninstall googletrans
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
print('success!')

import os
import zipfile

data_file = drive.CreateFile({'id': '1zeo8FcaNUnhN660mGMNEAPvxOE4DPOnE'})
data_file.GetContentFile('hw1.zip')

Collecting googletrans==3.1.0a0
  Using cached googletrans-3.1.0a0-py3-none-any.whl
Installing collected packages: googletrans
  Attempting uninstall: googletrans
    Found existing installation: googletrans 4.0.0rc1
    Uninstalling googletrans-4.0.0rc1:
      Successfully uninstalled googletrans-4.0.0rc1
Successfully installed googletrans-3.1.0a0


Found existing installation: googletrans 3.1.0a0
Uninstalling googletrans-3.1.0a0:
  Would remove:
    /usr/local/bin/translate
    /usr/local/lib/python3.7/dist-packages/googletrans-3.1.0a0.dist-info/*
    /usr/local/lib/python3.7/dist-packages/googletrans/*
Proceed (y/n)? y
  Successfully uninstalled googletrans-3.1.0a0


Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 33, in <module>
    from oauth2client.contrib.locked_file import LockedFile
ModuleNotFoundError: No module named 'oauth2client.contrib.locked_file'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 37, in <module>
    from oauth2client.locked_file import LockedFile
ModuleNotFoundError: No module named 'oauth2client.locked_file'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/googleapiclient/discovery_cache/__init__.py", line 44, in autodetect
    from . import file_cache
  File "/usr/local/lib/python3.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
    "file_cach

success!


In [None]:
# Extract data from the zipfile and put it into the current directory
with zipfile.ZipFile('hw1.zip', 'r') as zip_file:
    zip_file.extractall('./')
os.remove('hw1.zip')
# We will use hw1 as our working directory
os.chdir('hw1')
print("Data and supporting code downloaded!")

Data and supporting code downloaded!


In [None]:
pretrained_models_dir = './pretrained_models_dir'
if not os.path.isdir(pretrained_models_dir):
  os.mkdir(pretrained_models_dir)   # directory to save pretrained models
print('model directory created')
!pip install googletrans==4.0.0rc1
!pip install -r requirements.txt
print('everything set up!')

model directory created
Collecting googletrans==4.0.0rc1
  Using cached googletrans-4.0.0rc1-py3-none-any.whl
Installing collected packages: googletrans
Successfully installed googletrans-4.0.0rc1


everything set up!


# Part 1. Masked Language Modeling (15 points)

In this part, we will use large-scale pretrained language models (e.g., `BERT, XLNet, T5`) for different applications, including masked word completion, text generation, machine translation, and finally for solving downstream target tasks across several classes of problems, i.e., text classification, question answering, and sequence labeling. Let's begin!

We'll use `BERT` [(Devlin et al., 2019)](https://arxiv.org/pdf/1810.04805.pdf) for the task of masked word completion: given an input sentence with some words masked out, predict the masked word(s) based on its context. Run the following cell to download the pretrained `BERT` base model (cased) and tell PyTorch to use the GPU to run it.

In [None]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name_or_path = "bert-base-cased"
cache_dir = os.path.join(pretrained_models_dir, model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, cache_dir=cache_dir)
model = AutoModelForMaskedLM.from_pretrained(model_name_or_path, cache_dir=cache_dir)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
print('success!')

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


success!


### Question 1.1 (5 points)
The cell below passes a single sentence with a [MASK] token into BERT and returns the logits (i.e., the unnormalized scores that a softmax turns into probabilities) of the token prediction at each position of the sequence. Write some code in this cell that prints out the five most probable words for the masked position from the `token_logits` variable. If you did it right, you'll notice that most of these words make sense in the given context.

*Hints*

*   Use `torch.where` to find the index of a masked token within the input tensor (note that `tokenizer.mask_token_id` gives us the index of the mask token in the vocabulary).
*   Use `torch.topk` to get the `k` largest elements of a given tensor along a given dimension.
*   Use `tokenizer.decode([token_id])` to convert a single integer `token_id` to a token string.
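
The hints above boil down to three steps: locate the masked position, normalize the logits at that position, and keep the highest-scoring entries. The following is only a minimal pure-Python sketch of that logic, not the PyTorch solution itself; `toy_vocab`, `toy_input_ids`, `toy_logits`, and the helper functions are made-up names and values used for illustration.

```python
import math

def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(values, k):
    """Return the indices of the k largest values, highest first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]

# Hypothetical toy vocabulary and token ids for a 4-token input
toy_vocab = ["[MASK]", "mask", "hat", "scarf", "the"]
toy_mask_id = 0
toy_input_ids = [4, 1, 0, 3]  # the mask token sits at position 2

# One row of made-up logits per input position (sequence_length x vocab_size)
toy_logits = [
    [0.0, 0.1, 0.2, 0.3, 2.0],
    [0.0, 2.0, 0.5, 0.1, 0.3],
    [-1.0, 3.0, 2.0, 1.0, 0.0],
    [0.0, 0.2, 0.4, 2.5, 0.1],
]

# Plays the role of torch.where: positions whose id equals the mask id
mask_positions = [i for i, tok in enumerate(toy_input_ids) if tok == toy_mask_id]

# Normalize only the row at the masked position, then take the top 3 entries
probs = softmax(toy_logits[mask_positions[0]])
top_ids = top_k(probs, 3)
top_words = [toy_vocab[i] for i in top_ids]
print(top_words)
```

Because softmax preserves the ordering of the scores, taking the top-k of the raw logits would select the same tokens; the normalization only matters if you also want to report probabilities.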



In [None]:
sentence = f"""We know it’s hard, but the most effective way to get back to the 
normal sooner is to wear a {tokenizer.mask_token} over your nose 
and under your chin in public spaces (indoors and outdoors)."""

# Encode the input sentence and get the model's output
input = tokenizer.encode(sentence, return_tensors="pt").to(device)
# The model outputs the masked language modeling logits of shape 
# [batch_size, sequence_length, vocab_size] 
token_logits = model(input)[0]

# YOUR CODE HERE!
import torch.nn.functional as F

# Position of the [MASK] token in the (single) input sequence
mask_index = torch.where(input[0] == tokenizer.mask_token_id)[0]
# Normalize the logits into a probability distribution over the vocabulary
softmax = F.softmax(token_logits, dim=-1)
mask_word = softmax[0, mask_index, :]
# Vocabulary indices of the five most probable tokens at the masked position
top_5 = torch.topk(mask_word, 5, dim=1)[1][0]
for token in top_5:
    word = tokenizer.decode([token])
    print("The predicted word is:", word)
    print("*************************************************")

The predicted word is: mask
*************************************************
The predicted word is: hat
*************************************************
The predicted word is: scarf
*************************************************
The predicted word is: cap
*************************************************
The predicted word is: handkerchief
*************************************************


### Question 1.2 (5 points)
The below cell contains the same context as before but with an increasing number of contiguous [MASK] tokens (run the cell to print out each context).  For each input, replace each [MASK] token with the most probable token (i.e., the argmax of the probability distribution) as predicted by BERT, and then print out the resulting unmasked string. To be clear, your output should be six strings without any [MASK] tokens.
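
The replacement step described above can be sketched independently of BERT. This is a hedged toy illustration only: `fill_masks`, `toy_tokens`, and `toy_distributions` are hypothetical, and the probabilities are invented rather than produced by the model.

```python
def fill_masks(tokens, distributions, mask_token="[MASK]"):
    """Replace each mask token with the most probable word at its position.

    `distributions` maps a token position to a {word: probability} dict.
    """
    filled = []
    for position, token in enumerate(tokens):
        if token == mask_token:
            dist = distributions[position]
            token = max(dist, key=dist.get)  # argmax over the candidate words
        filled.append(token)
    return " ".join(filled)

# Made-up example with two contiguous masked positions
toy_tokens = ["wear", "a", "[MASK]", "[MASK]", "your", "nose"]
toy_distributions = {
    2: {"mask": 0.7, "hat": 0.2, "scarf": 0.1},
    3: {"over": 0.6, "under": 0.3, "on": 0.1},
}

unmasked = fill_masks(toy_tokens, toy_distributions)
print(unmasked)
```

Each masked position is filled from its own distribution, without looking at what was chosen for its neighbors.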


In [None]:
sentence = f"""We know it’s hard, but the most effective way to get back to the 
    normal sooner is to wear a {tokenizer.mask_token} over your nose 
    and under your chin in public spaces (indoors and outdoors)."""
logit_list = []
for idx in range(1,7):
  x = sentence.split()
  for mask_idx in range(idx):
      x[20+mask_idx] = tokenizer.mask_token
  x = ' '.join(x)
  print('input %d: %s' %(idx, x))
  input = tokenizer.encode(x, return_tensors="pt").to(device)
  token_logits = model(input)[0]
  logit_list.append((input, token_logits))
  # YOUR CODE HERE!
  # Positions of all [MASK] tokens in the (single) input sequence
  mask_indexes = torch.where(input[0] == tokenizer.mask_token_id)[0]
  softmax = F.softmax(token_logits, dim = -1)
  word_list = []
  for mask_position in mask_indexes:
    # Most probable token at this masked position
    mask_word = softmax[0, mask_position, :]
    word = tokenizer.decode([torch.argmax(mask_word)])
    word_list.append(word)
  print("Predicted Words are: ", word_list)
  print("****************************************************")

input 1: We know it’s hard, but the most effective way to get back to the normal sooner is to wear a [MASK] over your nose and under your chin in public spaces (indoors and outdoors).
Predicted Words are:  ['mask']
****************************************************
input 2: We know it’s hard, but the most effective way to get back to the normal sooner is to wear a [MASK] [MASK] your nose and under your chin in public spaces (indoors and outdoors).
Predicted Words are:  ['mask', 'under']
****************************************************
input 3: We know it’s hard, but the most effective way to get back to the normal sooner is to wear a [MASK] [MASK] [MASK] nose and under your chin in public spaces (indoors and outdoors).
Predicted Words are:  ['very', '-', 'your']
****************************************************
input 4: We know it’s hard, but the most effective way to get back to the normal sooner is to wear a [MASK] [MASK] [MASK] [MASK] and under your chin in public spaces (i

### Question 1.3 (5 points)
What do you notice about your outputs as the size of the masked span increases? Explain why this is happening.





***As the span grows, the predicted sentences make less sense. With a single [MASK], BERT predicts "mask"; with two [MASK] tokens, it predicts "mask under", which still fits the context. Beyond that, the meaning becomes less clear, and for the longest spans the predicted words start to repeat. This happens because BERT predicts each masked position independently of the other masked positions, conditioning only on the unmasked context, so it cannot coordinate its choices across a contiguous span. A longer span also removes more of the nearby context that the model relies on, and during pretraining only a small fraction of tokens (about 15%) is masked, so long contiguous masked spans are unlike most of what it saw.***
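
To make the independence argument concrete, here is a small sketch with an invented joint distribution over a two-word span (the words and probabilities are assumptions for illustration, not BERT output). Picking the most probable word for each slot separately can produce a pair that the joint distribution considers impossible.

```python
from collections import defaultdict

# Made-up joint distribution over a two-word span (not model output)
joint = {
    ("New", "York"): 0.4,
    ("Los", "Angeles"): 0.3,
    ("Los", "Alamos"): 0.3,
}

def marginal(joint_probs, slot):
    """Sum the joint probabilities down to a single slot of the span."""
    totals = defaultdict(float)
    for words, p in joint_probs.items():
        totals[words[slot]] += p
    return dict(totals)

def argmax(dist):
    """Return the key with the highest probability."""
    return max(dist, key=dist.get)

# Per-slot argmax, analogous to filling each [MASK] independently
first = argmax(marginal(joint, 0))
second = argmax(marginal(joint, 1))
independent_pair = (first, second)

# Argmax over whole spans, which per-position prediction cannot do
best_joint_pair = argmax(joint)

print(independent_pair, joint.get(independent_pair, 0.0))
print(best_joint_pair, joint[best_joint_pair])
```

In this toy case the per-slot choice is "Los York", which has zero probability under the joint distribution, while the best joint choice is "New York".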



# Part 2: Transfer learning with BERT (35 points)

With the advent of methods such as `BERT` [(Devlin et al., 2019)](https://arxiv.org/pdf/1810.04805.pdf), the dominant paradigm for developing NLP models has shifted to transfer learning: first, pretrain a large language model on large amounts of unlabeled data, and then fine-tune the resulting model on the downstream target task. In this section, we will use `BERT` to solve downstream target tasks across several classes of problems, including classification, question answering, and sequence labeling.

Run the cell below to import necessary packages and set some things up for fine-tuning `BERT`.

In [None]:
# coding=utf-8

import dataclasses
import logging
import math
import os
import timeit
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple, Optional

import numpy as np
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score
import torch
from torch.utils.data import DataLoader, SequentialSampler
from tqdm import tqdm

from transformers import (
    AutoConfig,
    AutoModelWithLMHead,
    AutoModelForSequenceClassification,
    AutoModelForQuestionAnswering,
    AutoModelForTokenClassification,
    AutoTokenizer,
    PreTrainedTokenizer,
    EvalPrediction
)
from transformers import (
    GlueDataset,
    SquadDataset,
    LineByLineTextDataset,
    TextDataset,
    DataCollatorForLanguageModeling,
)
from transformers import GlueDataTrainingArguments, SquadDataTrainingArguments
from transformers import (
    Trainer,
    TrainingArguments,
    glue_compute_metrics,
    glue_output_modes,
    glue_tasks_num_labels,
    set_seed,
)
from transformers.data.processors.squad import SquadResult
from transformers.data.metrics.squad_metrics import (
    compute_predictions_logits,
    squad_evaluate,
)
from tasks import NER
from utils_ner import Split, TokenClassificationDataset, TokenClassificationTask

from transformers import glue_processors
from transformers.data.processors.utils import InputExample
from langdetect import detect

logger = logging.getLogger(__name__)


@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """
    model_type: str = field(
        default="bert",
        metadata={"help": "Model type, e.g., bert."}
    )
    model_name_or_path: str = field(
        default="bert",
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models."}
    )
    do_lower_case: Optional[bool] = field(
        default=False,
        metadata={"help": "Whether you want to do lower case on input before tokenization."}
    )
    model_cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Where you want to store the pretrained models downloaded from s3."}
    )
    data_cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Where you want to store the cached features for the task."}
    )


@dataclass
class NerDataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """

    data_dir: str = field(
        metadata={"help": "The input data dir. Should contain data files for the task."}
    )
    labels: Optional[str] = field(
        default=None,
        metadata={"help": "Path to a file containing all labels for the task."},
    )
    max_seq_length: int = field(
        default=128,
        metadata={
            "help": "The maximum total input sequence length after tokenization. Sequences longer "
            "than this will be truncated, sequences shorter will be padded."
        },
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets."}
    )


@dataclass
class LMDataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """

    train_data_file: Optional[str] = field(
        default=None, metadata={"help": "The input training data file (a text file)."}
    )
    eval_data_file: Optional[str] = field(
        default=None,
        metadata={"help": "An optional input evaluation data file to evaluate the perplexity on (a text file)."},
    )
    line_by_line: bool = field(
        default=False,
        metadata={"help": "Whether distinct lines of text in the dataset are to be handled as distinct sequences."},
    )

    mlm: bool = field(
        default=False, metadata={"help": "Train with masked-language modeling loss instead of language modeling."}
    )
    mlm_probability: float = field(
        default=0.15, metadata={"help": "Ratio of tokens to mask for masked language modeling loss"}
    )
    block_size: int = field(
        default=-1,
        metadata={
            "help": "Optional input sequence length after tokenization. "
            "The training dataset will be truncated in blocks of this size for training. "
            "Defaults to the model max input length for single sentence inputs (take into account special tokens)."
        },
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
    )


def get_dataset(
    args: LMDataTrainingArguments,
    tokenizer: PreTrainedTokenizer,
    evaluate: bool = False,
    cache_dir: Optional[str] = None,
):
    file_path = args.eval_data_file if evaluate else args.train_data_file
    if args.line_by_line:
        return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, block_size=args.block_size)
    else:
        return TextDataset(
            tokenizer=tokenizer,
            file_path=file_path,
            block_size=args.block_size,
            overwrite_cache=args.overwrite_cache,
            cache_dir=cache_dir,
        )


DATA_TRAINING_ARGUMENTS = {
    "text_classification": GlueDataTrainingArguments,
    "question_answering": SquadDataTrainingArguments,
    "sequence_labeling": NerDataTrainingArguments,
}


AUTO_MODEL = {
    "text_classification": AutoModelForSequenceClassification,
    "question_answering": AutoModelForQuestionAnswering,
    "sequence_labeling": AutoModelForTokenClassification,
}


DATASET = {
    "text_classification": GlueDataset,
    "question_answering": SquadDataset,
    "sequence_labeling": TokenClassificationDataset,
}


# some functions for fine-tuning BERT on a downstream target task
def do_target_task_finetuning(model_name_or_path, task_type, output_dir, **kwargs):
    r""" Fine-tuning BERT on a downstream target task.
    Params:
        **model_name_or_path**: either:
            - a string with the `shortcut name` of a pre-trained model configuration to load from cache
                or download and cache if not already stored in cache (e.g. 'bert-base-uncased').
            - a path to a `directory` containing a configuration file saved
                using the `save_pretrained(save_directory)` method.
            - a path or url to a saved configuration `file`.
        **task_type**: string:
            The class of the task to train, selected in
            ["text_classification", "question_answering", "sequence_labeling"].
        **output_dir**: string:
            The output directory where the model predictions and checkpoints will be written.
        **kwargs**: (`optional`) dict:
            Dictionary of key/value pairs with which to update the configuration object after loading.
            - The values in kwargs of any keys which are configuration attributes will be used
            to override the loaded values.
    """
    # See all possible arguments in src/transformers/training_args.py

    assert task_type in DATA_TRAINING_ARGUMENTS
    model_args = ModelArguments(model_name_or_path=model_name_or_path)
    data_args_params = {}
    for param in ["task_name", "data_dir"]:
        if param in kwargs:
            data_args_params.update({param: kwargs[param]})

    data_args = DATA_TRAINING_ARGUMENTS[task_type](**data_args_params)
    training_args = TrainingArguments(output_dir=output_dir)

    # override the loaded configs
    configs = (model_args, data_args, training_args)
    for config in configs:
        for key, value in kwargs.items():
            if hasattr(config, key):
                setattr(config, key, value)

    if (
        os.path.exists(training_args.output_dir)
        and os.listdir(training_args.output_dir)
        and training_args.do_train
        and not training_args.overwrite_output_dir
    ):
        raise ValueError(
            f"Output directory ({training_args.output_dir}) already exists and is not empty. "
            f"Use --overwrite_output_dir to overcome."
        )

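    for p in [model_args.model_cache_dir, model_args.data_cache_dir, training_args.output_dir]:
        # The cache directories default to None, so only create paths that were actually provided
        if p is not None and not os.path.exists(p):
            os.makedirs(p)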

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
    )

    logger.info("Process device: %s, n_gpu: %s", training_args.device, training_args.n_gpu)
    logger.info("Training/evaluation parameters %s", training_args)


    # Set seed
    set_seed(training_args.seed)

    if task_type == "text_classification":
        try:
            data_args.task_name = data_args.task_name.lower()
            num_labels = glue_tasks_num_labels[data_args.task_name]
            output_mode = glue_output_modes[data_args.task_name]
        except KeyError:
            raise ValueError("Task not found: %s" % (data_args.task_name))
    elif task_type == "sequence_labeling":
        token_classification_task = NER() # You might want to change this to Chunk() or POS()
        # if you are working with a Chunk or POS task, respectively
        labels = token_classification_task.get_labels(data_args.labels)
        label_map: Dict[int, str] = {i: label for i, label in enumerate(labels)}
        num_labels = len(labels)

    # Load pretrained model and tokenizer

    AutoModel = AUTO_MODEL[task_type]
    auto_config_params = {
        'pretrained_model_name_or_path': model_args.model_name_or_path,
        'cache_dir': model_args.model_cache_dir,
    }

    if task_type == "text_classification":
        auto_config_params.update({
            "num_labels": num_labels,
            "finetuning_task": data_args.task_name,
        })
    elif task_type == "sequence_labeling":
        auto_config_params.update({
            "num_labels": num_labels,
            "id2label": label_map,
            "label2id": {label: i for i, label in enumerate(labels)},
        })

    config = AutoConfig.from_pretrained(**auto_config_params)

    auto_tokenizer_params = {
        "pretrained_model_name_or_path": model_args.model_name_or_path,
        "cache_dir": model_args.model_cache_dir,
        "do_lower_case": model_args.do_lower_case,
    }
    tokenizer = AutoTokenizer.from_pretrained(**auto_tokenizer_params)

    auto_model_params = {
        "pretrained_model_name_or_path": model_args.model_name_or_path,
        "from_tf": False,
        "config": config,
        "cache_dir": model_args.model_cache_dir,
    }

    if "model_load_mode" in kwargs and kwargs["model_load_mode"] == "base_model_only":
        WEIGHTS_NAME = "pytorch_model.bin"
        archive_file = os.path.join(model_args.model_name_or_path, WEIGHTS_NAME)
        # Use torch.load with map_location=torch.device() to map the pretrained model to our device.
        model_state_dict = torch.load(archive_file, map_location=torch.device(training_args.device))
        
        state_dict_with_prefix = {}
        for key, value in model_state_dict.items():
            if key.startswith(model_args.model_type):
                state_dict_with_prefix[key] = value

        auto_model_params.update({"state_dict": state_dict_with_prefix})
        
    model = AutoModel.from_pretrained(**auto_model_params)

    # Get datasets
    Dataset = DATASET[task_type]
    dataset_params = {
        "tokenizer": tokenizer,
    }
    if task_type == "sequence_labeling":
        dataset_params.update({
            "token_classification_task": token_classification_task,
            "data_dir": data_args.data_dir,
            "labels": labels,
            "model_type": model_args.model_type,
            "max_seq_length": data_args.max_seq_length
        })

    else:
        dataset_params.update({
            "args": data_args,
            "cache_dir": model_args.data_cache_dir,
        })

    train_dataset = (Dataset(**dataset_params) if training_args.do_train else None)

    dataset_params.update({"mode": Split.dev if task_type == "sequence_labeling" else "dev"})
    eval_dataset = (Dataset(**dataset_params) if training_args.do_eval else None)

    # Initialize our Trainer
    trainer_params = {
        "model": model,
        "args": training_args,
        "train_dataset": train_dataset,
        "eval_dataset": eval_dataset,
    }
    trainer = Trainer(**trainer_params)

    # Training
    if training_args.do_train:
        trainer.train(
            model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
        )
        trainer.save_model()
        # For convenience, we also re-save the tokenizer to the same directory
        tokenizer.save_pretrained(training_args.output_dir)

    # Evaluation
    eval_results = {}
    if training_args.do_eval:
        if task_type == "text_classification":
            def build_compute_metrics_fn(task_name: str) -> Callable[[EvalPrediction], Dict]:
                def compute_metrics_fn(p: EvalPrediction):
                    if output_mode == "classification":
                        preds = np.argmax(p.predictions, axis=1)
                    elif output_mode == "regression":
                        preds = np.squeeze(p.predictions)
                    return glue_compute_metrics(task_name, preds, p.label_ids)
                return compute_metrics_fn

            logger.info("*** Evaluate ***")
            # Loop to handle MNLI double evaluation (matched, mis-matched)
            eval_datasets = [eval_dataset]
            if data_args.task_name == "mnli":
                mnli_mm_data_args = dataclasses.replace(data_args, task_name="mnli-mm")
                eval_datasets.append(
                    Dataset(mnli_mm_data_args, tokenizer=tokenizer, mode="dev", cache_dir=model_args.data_cache_dir)
                )

            for eval_dataset in eval_datasets:
                trainer.compute_metrics = build_compute_metrics_fn(eval_dataset.args.task_name)
                eval_result = trainer.evaluate(eval_dataset=eval_dataset)

                output_eval_file = os.path.join(training_args.output_dir, f"eval_results.txt")
                with open(output_eval_file, "w") as writer:
                    logger.info("***** Eval results *****")
                    for key, value in eval_result.items():
                        logger.info("  %s = %s", key, value)
                        writer.write("%s = %s\n" % (key, value))

                eval_results.update(eval_result)

        elif task_type == "question_answering":
            # We don't use trainer.evaluate here since it currently does not support question answering tasks
            # (https://github.com/huggingface/transformers/issues/7032)
            model = AutoModel.from_pretrained(model_args.model_cache_dir)
            tokenizer = AutoTokenizer.from_pretrained(model_args.model_cache_dir, do_lower_case=model_args.do_lower_case)
            model.to(training_args.device)


            dataset = eval_dataset.dataset
            examples = eval_dataset.examples
            features = eval_dataset.features
            eval_batch_size = training_args.per_gpu_eval_batch_size * max(1, training_args.n_gpu)

            eval_sampler = SequentialSampler(dataset)
            eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=eval_batch_size)

            logger.info("*** Evaluate ***")
            description = "Evaluation"
            logger.info("***** Running %s *****", description)
            logger.info("  Num examples = %d", len(dataset))
            logger.info("  Batch size = %d", eval_batch_size)

            all_results = []
            start_time = timeit.default_timer()

            for batch in tqdm(eval_dataloader, desc=description):
                model.eval()
                batch = tuple(t.to(training_args.device) for t in batch)

                with torch.no_grad():
                    inputs = {
                        "input_ids": batch[0],
                        "attention_mask": batch[1],
                        "token_type_ids": batch[2],
                    }
                    feature_indices = batch[3]
                    outputs = model(**inputs)

                for i, feature_index in enumerate(feature_indices):
                    eval_feature = features[feature_index.item()]
                    unique_id = int(eval_feature.unique_id)
                    output = [output[i].detach().cpu().tolist() for output in outputs]
                    start_logits, end_logits = output
                    result = SquadResult(unique_id, start_logits, end_logits)
                    all_results.append(result)

            evalTime = timeit.default_timer() - start_time
            logger.info("  Evaluation done in total %f secs (%f sec per example)", evalTime, evalTime / len(dataset))

            # Compute predictions
            output_prediction_file = os.path.join(training_args.output_dir, "predictions.json")
            output_nbest_file = os.path.join(training_args.output_dir, "nbest_predictions.json")

            output_null_log_odds_file = os.path.join(training_args.output_dir, "null_odds.json") \
                if data_args.version_2_with_negative else None

            predictions = compute_predictions_logits(
                all_examples=examples,
                all_features=features,
                all_results=all_results,
                n_best_size=data_args.n_best_size,
                max_answer_length=data_args.max_answer_length,
                do_lower_case=model_args.do_lower_case,
                output_prediction_file=output_prediction_file,
                output_nbest_file=output_nbest_file,
                output_null_log_odds_file=output_null_log_odds_file,
                verbose_logging=False,
                version_2_with_negative=data_args.version_2_with_negative,
                null_score_diff_threshold=data_args.null_score_diff_threshold,
                tokenizer=tokenizer,
            )

            # Compute the F1 and exact scores.
            eval_result = squad_evaluate(examples, predictions)

            output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
            with open(output_eval_file, "w") as writer:
                logger.info("***** Eval results *****")
                for key, value in eval_result.items():
                    logger.info("  %s = %s", key, value)
                    writer.write("%s = %s\n" % (key, value))

            eval_results.update(eval_result)


        elif task_type == "sequence_labeling":
            def align_predictions(predictions: np.ndarray, label_ids: np.ndarray) -> Tuple[List[int], List[int]]:
                preds = np.argmax(predictions, axis=2)
                batch_size, seq_len = preds.shape
                label_list = [[] for _ in range(batch_size)]
                pred_list = [[] for _ in range(batch_size)]

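                # Look up the ignore index once instead of building a loss module per token.
                ignore_index = torch.nn.CrossEntropyLoss().ignore_index
                for i in range(batch_size):
                    for j in range(seq_len):
                        if label_ids[i, j] != ignore_index: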
                            label_list[i].append(label_map[label_ids[i][j]])
                            pred_list[i].append(label_map[preds[i][j]])
                return pred_list, label_list

            def compute_metrics_fn(p: EvalPrediction) -> Dict:
                pred_list, label_list = align_predictions(p.predictions, p.label_ids)
                return {
                    "accuracy_score": accuracy_score(label_list, pred_list),
                    "precision": precision_score(label_list, pred_list),
                    "recall": recall_score(label_list, pred_list),
                    "f1": f1_score(label_list, pred_list),
                }

            trainer.compute_metrics = compute_metrics_fn
            eval_result = trainer.evaluate(eval_dataset=eval_dataset)

            output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
            with open(output_eval_file, "w") as writer:
                logger.info("***** Eval results *****")
                for key, value in eval_result.items():
                    logger.info("  %s = %s", key, value)
                    writer.write("%s = %s\n" % (key, value))

            eval_results.update(eval_result)

        else:
            raise ValueError("Invalid task type.")
    return eval_results


print('setup complete')

setup complete


## Fine-tuning BERT for text classification
Now, let's use `BERT` to solve a sentiment classification task. Specifically, we'll be using the Stanford Sentiment Treebank [(Socher et al., 2013)](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf), which was constructed from movie review data. We provide code to fine-tune BERT in a separate ["useful code" Colab notebook](https://colab.research.google.com/drive/1nJWA9rPkPrjjjtwN_vKUSQoomdfWLAFV?usp=sharing), so check that out if you're interested. However, since training on the full `SST` dataset (67K examples) takes a while, we provide you with a fine-tuned model to save time. Run the following cell to download the model.

 

In [None]:
data_file = drive.CreateFile({'id': '1ZJ1_gWahH_OOBIrRm0aN9i8nvLB2olZC'})
data_file.GetContentFile('bert-base-cased-finetuned-sst.zip')

# Extract the data from the zipfile and put it into pretrained_models_dir
with zipfile.ZipFile('bert-base-cased-finetuned-sst.zip', 'r') as zip_file:
    zip_file.extractall(pretrained_models_dir)
os.remove('bert-base-cased-finetuned-sst.zip')
print("bert-base-cased-finetuned-sst downloaded!")

bert-base-cased-finetuned-sst downloaded!


Run the cell below to evaluate the trained model on the dev set. You should get an accuracy around 92%.

In [None]:
start_time = timeit.default_timer()
task_name = "SST"
data_dir = f"./data/small{task_name}"
model_name_or_path = "bert-base-cased-finetuned-sst"
model_cache_dir = os.path.join(pretrained_models_dir, model_name_or_path)
data_cache_dir = f"./data_cache/finetuning/small{task_name}"
output_dir = f"./output/finetuning/small{task_name}"

do_target_task_finetuning(
    model_name_or_path=model_cache_dir,
    task_name=f"{task_name}-2",
    task_type="text_classification",
    do_train=False,
    do_eval=True, 
    do_lower_case=True,
    data_dir=data_dir,
    max_seq_length=128,
    model_cache_dir=model_cache_dir,
    data_cache_dir=data_cache_dir,
    output_dir=output_dir,

)
elapsed_time = timeit.default_timer() - start_time
print(f"Time elapsed: {elapsed_time} seconds")

12/14/2021 18:17:39 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:17:39 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/finetuning/smallSST', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-17-39_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=Fa

12/14/2021 18:17:56 - INFO - __main__ -   ***** Eval results *****
12/14/2021 18:17:56 - INFO - __main__ -     eval_loss = 0.27018672227859497
12/14/2021 18:17:56 - INFO - __main__ -     eval_acc = 0.9197247706422018


Time elapsed: 17.14189815899954 seconds


### Question 2.1 (5 points)
Let's use the trained model to predict the sentiment of a given sentence. We will make a few predictions in the code below. Your task is to complete the code to print out the model's predicted probability distribution for each sentence.

*Hint:*

*   `model(inputs)[0]` gives you the logits of the model for `inputs`.
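
Logits are unnormalized scores, so they need to be passed through a softmax before they can be read as a probability distribution over the classes. The snippet below is a minimal, library-free sketch of that step; the `toy_logits` values are made up for illustration and are not model output.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["negative", "positive"]
toy_logits = [-3.0, 3.0]  # hypothetical scores, NOT produced by the model
probs = softmax(toy_logits)

for name, p in zip(classes, probs):
    print(f"{name}: {p:.4f}")
```

With PyTorch tensors, the same operation is `F.softmax(logits, dim=-1)`, where `dim=-1` normalizes over the class dimension.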

In [None]:
# Load the trained model and make a few predictions
model_name_or_path = "bert-base-cased-finetuned-sst"
pretrained_weights = os.path.join(pretrained_models_dir, model_name_or_path)
task_type = "text_classification"
model = AUTO_MODEL[task_type].from_pretrained(pretrained_weights)
tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)

classes = ["negative", "positive"]

sentence_1 = "the movie has something interesting to say"
sentence_2 = "it was so awful that i walked out after 30 minutes :("

inputs_1 = tokenizer.encode(sentence_1, add_special_tokens=True, return_tensors="pt")
inputs_2 = tokenizer.encode(sentence_2, add_special_tokens=True, return_tensors="pt")


# YOUR CODE HERE!
token_logits_1 = model(inputs_1)[0]
token_logits_2 = model(inputs_2)[0]
print("Predicted Probability distribution of the two sentences are as under: ")
print("For sentence 1: ")
print(F.softmax(token_logits_1, dim=-1))
print("For sentence 2: ")
print(F.softmax(token_logits_2, dim=-1))

Predicted Probability distribution of the two sentences are as under: 
For sentence 1: 
tensor([[0.0011, 0.9989]], grad_fn=<SoftmaxBackward0>)
For sentence 2: 
tensor([[0.9978, 0.0022]], grad_fn=<SoftmaxBackward0>)




### Question 2.2 (5 points)
Come up with a new sentence that the model gets wrong. The sentence must contain some sentiment (i.e., it cannot be neutral), and the model should place a higher probability on the wrong label than the correct one. Show the model's prediction on this new sentence.

In [None]:
your_sentence = 'Thine eyes I love, and they as pitying me, Knowing thy heart torment me with disdain, Have put on black, and loving mourners be, Looking with pretty ruth upon my pain.' # change to your sentence
your_sentence_sentiment = 'negative' # change to your sentence's ground-truth sentiment
#your_model_prediction =  [0.1, 0.9] # obviously, change this to the model's prediction on your sentence

# YOUR CODE HERE
your_model_prediction =  F.softmax(model(tokenizer.encode(your_sentence, add_special_tokens=True, return_tensors="pt"))[0],dim=-1)[0]
print('your sentence: "%s"\nground-truth label: %s\npredicted negative prob: %0.2f\npredicted positive prob: %0.2f'\
      % (your_sentence, your_sentence_sentiment, your_model_prediction[0], your_model_prediction[1]))

your sentence: "Thine eyes I love, and they as pitying me, Knowing thy heart torment me with disdain, Have put on black, and loving mourners be, Looking with pretty ruth upon my pain."
ground-truth label: negative
predicted negative prob: 0.22
predicted positive prob: 0.78


### Question 2.3 (5 points)
Provide a reasonable explanation as to why the model got your sentence wrong. Also provide a plausible method to improve the underlying sentiment model so that this kind of error stops happening.

********

*   There are two plausible reasons. First, the sentence is written in a Shakespearean style, while the model was most likely fine-tuned on sentences in ordinary modern English, so this input is far from what it saw during training. Second, the text strings together several clauses containing both positive words (e.g., "love", "loving", "pretty") and negative words (e.g., "torment", "disdain", "pain"); the overall context is negative, but the model fails to capture that.
*   One plausible fix is to augment the training data with the same sentences rendered in different styles, for example by applying text style transfer models to the existing dataset.



## Fine-tuning BERT for question answering
In this section, we will use `BERT` for a question answering task, namely `SQuAD` [(Rajpurkar et al., 2016)](https://nlp.stanford.edu/pubs/rajpurkar2016squad.pdf), whose dataset was built from Wikipedia. Training on the full `SQuAD` dataset (108K examples) would take a couple of hours, so we provide you with a trained model to save time. Run the following cell to download the model.

In [None]:
data_file = drive.CreateFile({'id': '19cnGSN88KlRJRcIqwxw3C4ylJftdkZ2W'})
data_file.GetContentFile('bert-base-cased-finetuned-squad.zip')

# Extract the data from the zipfile and put it into pretrained_models_dir
with zipfile.ZipFile('bert-base-cased-finetuned-squad.zip', 'r') as zip_file:
    zip_file.extractall(pretrained_models_dir)
os.remove('bert-base-cased-finetuned-squad.zip')
print("bert-base-cased-finetuned-squad downloaded!")

bert-base-cased-finetuned-squad downloaded!


### Question 2.4 (10 points)

Okay, same drill as before! Your task is to complete the code to show the model's predicted answer to each question. If you forgot how `BERT` solves extractive question answering tasks, check out Section 4.2 and Figure 1 / Figure 4(c) in the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf). Your output should be three strings, each corresponding to the answer to one of the three given questions.

*Hints*

*   `model(**inputs)` gives you the start and end logits of the model for `inputs`.
*   Use `tokenizer.convert_tokens_to_string` to convert a sequence of tokens (string) into a single string.
*   Use `tokenizer.convert_ids_to_tokens` to convert a sequence of indices into a sequence of tokens.
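
The hints above amount to a three-step decoding recipe: take the highest-scoring start position, take the highest-scoring end position, and turn the tokens in between back into a string. Here is a minimal sketch of that recipe in plain Python. The tokens and logits are invented for illustration, and `merge_wordpieces` is a hypothetical stand-in that only mimics how WordPiece continuation pieces (prefixed with `##`) are glued back together; it is not the tokenizer's own method.

```python
def argmax(scores):
    """Index of the largest value in a list of scores."""
    return max(range(len(scores)), key=scores.__getitem__)


def merge_wordpieces(tokens):
    """Join tokens with spaces and glue '##' pieces onto the previous token."""
    return " ".join(tokens).replace(" ##", "").strip()


# Hypothetical tokenization and logits, NOT real model output.
tokens = ["[CLS]", "who", "?", "[SEP]", "grad", "##uate", "students", "[SEP]"]
start_logits = [0.1, 0.0, 0.0, 0.0, 5.0, 0.2, 0.3, 0.0]
end_logits = [0.1, 0.0, 0.0, 0.0, 0.2, 0.4, 6.0, 0.0]

start = argmax(start_logits)  # most likely first token of the answer
end = argmax(end_logits)      # most likely last token of the answer
answer = merge_wordpieces(tokens[start:end + 1])
print(answer)
```

Note that the slice uses `end + 1` because the end index is inclusive.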

In [None]:
task_name = "SQuAD"
model_name_or_path = "bert-base-cased-finetuned-squad"
pretrained_weights = os.path.join(pretrained_models_dir, model_name_or_path)
task_type = "question_answering"
model = AUTO_MODEL[task_type].from_pretrained(pretrained_weights)
tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)

context = """This course will broadly focus on deep learning methods for 
natural language processing. Most of the semester will focus on very recent 
transfer learning methods that have significantly pushed forward the state of 
the art. It is intended for graduate students in computer science and 
linguistics who are (1) interested in learning about cutting-edge research 
progress in NLP and (2) familiar with machine learning fundamentals. We will 
cover modeling architectures, training objectives, and downstream tasks (e.g., 
text classification, question answering, and text generation). Coursework 
includes reading recent research papers, programming assignments, and a final 
project. This class will be asynchronous: lectures will be prerecorded and 
posted on a weekly basis, along with accompanying readings and assignments."""

questions = [
    "What is the focus of this course?",
    "Who is this course intended for?",
    "What is the coursework?",
]

for question in questions:
    inputs = tokenizer.encode_plus(question, context, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    # YOUR CODE HERE!    

    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    sep_index = input_ids.index(tokenizer.sep_token_id)
    num_seg_a = sep_index + 1
    num_seg_b = len(input_ids) - num_seg_a
    segment_ids = [0]*num_seg_a + [1]*num_seg_b
    assert len(segment_ids) == len(input_ids)

    start_scores, end_scores = model(torch.tensor([input_ids]), # The tokens representing our input text.
                                 token_type_ids=torch.tensor([segment_ids])) # The segment IDs to differentiate question from answer_text
    answer_start = torch.argmax(start_scores)
    answer_end = torch.argmax(end_scores)

    answer = ' '.join(tokens[answer_start:answer_end+1])
    print("Ques: ", question)
    print("Ans: ", answer)
    print("*************************************************")



Ques:  What is the focus of this course?
Ans:  deep learning methods for natural language processing
*************************************************
Ques:  Who is this course intended for?
Ans:  graduate students in computer science and linguistics
*************************************************
Ques:  What is the coursework?
Ans:  reading recent research papers , programming assignments , and a final project
*************************************************


### Question 2.5 (5 points)
Come up with a new question about this passage that the model gets wrong. The question must be answerable by the passage (i.e., its ground-truth answer should be a span of text within the passage). Show the model's predicted answer on this new question.

In [None]:
your_question =  "What category of students will benefit from the course this semester?" # change to your question
your_answer = 'graduate students in computer science and linguistics' # change to your sentence's ground-truth answer
# Model prediction changed at the end
# your_model_prediction = 'blah blah' # obviously, change this to the model's predicted answer span


# YOUR CODE HERE

inputs = tokenizer.encode_plus(your_question, context, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]
# YOUR CODE HERE!    

tokens = tokenizer.convert_ids_to_tokens(input_ids)
sep_index = input_ids.index(tokenizer.sep_token_id)
num_seg_a = sep_index + 1
num_seg_b = len(input_ids) - num_seg_a
segment_ids = [0]*num_seg_a + [1]*num_seg_b
assert len(segment_ids) == len(input_ids)

start_scores, end_scores = model(torch.tensor([input_ids]), # The tokens representing our input text.
                              token_type_ids=torch.tensor([segment_ids])) # The segment IDs to differentiate question from answer_text
answer_start = torch.argmax(start_scores)
answer_end = torch.argmax(end_scores)

your_model_prediction = ' '.join(tokens[answer_start:answer_end+1])
  


print('your question: "%s"\nground-truth answer: %s\npredicted answer: %s'\
      % (your_question, your_answer, your_model_prediction))

your question: "What category of students will benefit from the course this semester?"
ground-truth answer: graduate students in computer science and linguistics
predicted answer: [CLS]


### Question 2.6 (5 points)
Provide a reasonable explanation as to why the model got your question wrong. Also provide a plausible method to improve the underlying QA model so that this kind of error stops happening.

***WRITE YOUR ANSWER HERE IN A FEW SENTENCES***

*   I paraphrased a question from 2.4, "Who is this course intended for?", in a more indirect way: "What category of students will benefit from the course this semester?". A human can easily see that the two questions ask the same thing, but the model would need some reasoning to connect them, which BERT was not able to do in this case. Because the model was unsure, it predicted `[CLS]` as both the start and the end token.
*   A suitable approach would be to augment the data, or simply to use more data for fine-tuning. Data augmentation can be carried out with backtranslation, the approach used in the upcoming questions 3.5 and 3.6.
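
Separately from retraining, a `[CLS]` prediction like the one above can also be guarded against at decoding time. The sketch below, under stated assumptions, restricts the search to context tokens and to spans whose end does not precede their start, then scores each candidate by the sum of its start and end logits. The function `best_span`, its parameters, and the toy logits are all hypothetical; this is an illustration of the idea, not the post-processing used by the notebook's `compute_predictions_logits`.

```python
def best_span(start_logits, end_logits, context_start, context_end, max_len=30):
    """Return the highest-scoring (start, end) pair inside the context.

    Only spans with context_start <= start <= end <= context_end and at most
    max_len tokens are considered.
    """
    best, best_score = None, float("-inf")
    for s in range(context_start, context_end + 1):
        for e in range(s, min(s + max_len, context_end + 1)):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score


# Hypothetical logits where position 0 ([CLS]) wins a plain argmax.
start_logits = [4.0, 0.0, 0.0, 0.0, 3.0, 0.1, 0.2, 0.0]
end_logits = [4.0, 0.0, 0.0, 0.0, 0.1, 0.2, 2.5, 0.0]

naive = (
    max(range(len(start_logits)), key=start_logits.__getitem__),
    max(range(len(end_logits)), key=end_logits.__getitem__),
)
# Context tokens are assumed to occupy positions 4..6 in this toy input.
span, score = best_span(start_logits, end_logits, context_start=4, context_end=6)
print("plain argmax:", naive)
print("constrained :", span, score)
```

This only changes which span is reported; it does not make the model any better at understanding the paraphrased question, so it complements rather than replaces the data-side fixes above.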



# Part 3: Low-resource NLP (50 points)

In this second part of the homework, we will experiment with an extremely low-resource setting for which there are only a few training examples available for the downstream target task. We provide you with a tiny version of the `SST` dataset called `tinySST` (located at `data/tinySST`) with only 20 training examples (10 examples per class). We will explore various data augmentation and fine-tuning approaches to improve the target task performance.

`BERT` is unstable and prone to degenerate performance on tasks with small training sets. The cell below fine-tunes `BERT` on `tinySST` using some default hyperparameters and reports the mean and standard deviation of the dev set accuracy across 4 random seeds. Run the cell to obtain these baseline numbers, which should be around 50% average accuracy (it might take a couple of minutes to finish).

In [None]:
start_time = timeit.default_timer()
task_name = "SST"
data_dir = f"./data/tiny{task_name}"
model_name_or_path = "bert-base-cased"
model_cache_dir = os.path.join(pretrained_models_dir, model_name_or_path)
data_cache_dir = f"./data_cache/finetuning/tiny{task_name}"

# Fine-tune BERT with default hyperparameters using 4 random seeds
for seed in [1234, 2341, 3412, 4123]:
  output_dir = f"./output/tiny{task_name}-{seed}"
  do_target_task_finetuning(
      seed=seed,
      model_name_or_path=model_name_or_path,
      task_name=f"{task_name}-2",
      task_type="text_classification",
      do_train=True,
      do_eval=False, 
      do_lower_case=True,
      data_dir=data_dir,
      max_seq_length=128,
      per_device_train_batch_size=32,
      learning_rate=2e-5,
      num_train_epochs=3.0,
      model_cache_dir=model_cache_dir,
      data_cache_dir=data_cache_dir,
      output_dir=output_dir,
      overwrite_output_dir=True
  )

# Evaluate BERT on the dev set
results = []
for seed in [1234, 2341, 3412, 4123]:
  model_dir = f"./output/tiny{task_name}-{seed}"
  result = do_target_task_finetuning(
      seed=seed,
      model_name_or_path=model_dir,
      task_name=f"{task_name}-2",
      task_type="text_classification",
      do_train=False,
      do_eval=True, 
      do_lower_case=True,
      data_dir=data_dir,
      max_seq_length=128,
      model_cache_dir=model_cache_dir,
      data_cache_dir=data_cache_dir,
      output_dir=model_dir
  )
  results.append(result["eval_acc"])

results = np.array(results)
mean = np.mean(results)
std = np.std(results)

print(f"Accuracy on TinySST dev set: {mean} +/- {std}")
elapsed_time = timeit.default_timer() - start_time
print(f"Time elapsed: {elapsed_time} seconds")

12/14/2021 18:21:44 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:21:44 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-1234', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=32, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=2e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-21-44_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=1234, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, 

Step,Training Loss


12/14/2021 18:21:52 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:21:52 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-2341', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=32, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=2e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-21-52_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=2341, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, 

Step,Training Loss


12/14/2021 18:22:01 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:22:01 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-3412', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=32, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=2e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-22-01_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=3412, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, 

Step,Training Loss


12/14/2021 18:22:09 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:22:09 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-4123', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=32, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=2e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-22-09_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=4123, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, 

Step,Training Loss


12/14/2021 18:22:17 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:22:17 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-1234', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-22-17_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=1234, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, 

12/14/2021 18:22:35 - INFO - __main__ -   ***** Eval results *****
12/14/2021 18:22:35 - INFO - __main__ -     eval_loss = 0.7130435705184937
12/14/2021 18:22:35 - INFO - __main__ -     eval_acc = 0.4908256880733945
12/14/2021 18:22:35 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:22:35 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-2341', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-22-35_b9600159aec5', 

12/14/2021 18:22:52 - INFO - __main__ -   ***** Eval results *****
12/14/2021 18:22:52 - INFO - __main__ -     eval_loss = 0.6689030528068542
12/14/2021 18:22:52 - INFO - __main__ -     eval_acc = 0.6559633027522935
12/14/2021 18:22:52 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:22:52 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-3412', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-22-52_b9600159aec5', 

12/14/2021 18:23:09 - INFO - __main__ -   ***** Eval results *****
12/14/2021 18:23:09 - INFO - __main__ -     eval_loss = 0.7283246517181396
12/14/2021 18:23:09 - INFO - __main__ -     eval_acc = 0.5091743119266054
12/14/2021 18:23:09 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:23:09 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-4123', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-23-09_b9600159aec5', 

12/14/2021 18:23:26 - INFO - __main__ -   ***** Eval results *****
12/14/2021 18:23:26 - INFO - __main__ -     eval_loss = 0.70796138048172
12/14/2021 18:23:26 - INFO - __main__ -     eval_acc = 0.41628440366972475


Accuracy on TinySST dev set: 0.5180619266055045 +/- 0.08688534989021443
Time elapsed: 102.05716707899956 seconds
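
As a sanity check, the summary line above can be reproduced from the four `eval_acc` values in the log. The sketch below uses only the standard library. Note that `np.std` defaults to the population standard deviation (`ddof=0`), which corresponds to `statistics.pstdev`; the sample standard deviation (`statistics.stdev`) is shown alongside for comparison and is larger with only four runs.

```python
import statistics

# eval_acc values copied from the log above (seeds 1234, 2341, 3412, 4123).
accuracies = [
    0.4908256880733945,
    0.6559633027522935,
    0.5091743119266054,
    0.41628440366972475,
]

mean_acc = statistics.fmean(accuracies)
pop_std = statistics.pstdev(accuracies)    # matches np.std(results), ddof=0
sample_std = statistics.stdev(accuracies)  # ddof=1, larger for small samples

print(f"mean           : {mean_acc:.4f}")
print(f"population std : {pop_std:.4f}")
print(f"sample std     : {sample_std:.4f}")
```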


### Question 3.1 (10 points)
These default fine-tuning hyperparameters are not optimal for such a small dataset. Some recent work has proposed simple tweaks to improve training stability and model performance in these settings ([Mosbach et al., 2020](https://arxiv.org/pdf/2006.04884.pdf), [Zhang et al., 2020](https://arxiv.org/pdf/2006.05987.pdf)). After looking through these papers, make some modifications to the arguments of the training command in the cell below (which currently just contains the previous cell's code) that result in a higher mean accuracy and a lower standard deviation than what we observed above.
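
One way to see why the baseline stays near chance is to count how many gradient updates it actually performs. The sketch below is a back-of-the-envelope calculation, assuming a single device and no gradient accumulation (both consistent with the logged `n_gpu: 1` and `gradient_accumulation_steps=1`) and assuming the final partial batch is kept. The helper `optimizer_steps` is hypothetical, and the second setting (batch size 16, 40 epochs) is just one possible modification, not a recommended answer.

```python
import math


def optimizer_steps(num_examples, batch_size, num_epochs):
    """Number of gradient updates when the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * int(num_epochs)


num_train = 20  # tinySST has 20 training examples

baseline = optimizer_steps(num_train, batch_size=32, num_epochs=3)
longer = optimizer_steps(num_train, batch_size=16, num_epochs=40)

print(f"baseline (bs=32, 3 epochs) : {baseline} updates")
print(f"example  (bs=16, 40 epochs): {longer} updates")
```

Under these assumptions the baseline makes only a handful of updates at a learning rate of 2e-5, so training for more iterations is a natural first thing to try.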

In [None]:
start_time = timeit.default_timer()
task_name = "SST"
data_dir = f"./data/tiny{task_name}"
model_name_or_path = "bert-base-cased"
model_cache_dir = os.path.join(pretrained_models_dir, model_name_or_path)
data_cache_dir = f"./data_cache/finetuning/tiny{task_name}"

# Fine-tune BERT with your hyperparameters using 4 random seeds
for seed in [1234, 2341, 3412, 4123]:
  output_dir = f"./output/tiny{task_name}-{seed}"

  ### CHANGE ONLY THE ARGUMENTS TO THE BELOW FUNCTION
  do_target_task_finetuning(
      seed=seed,
      model_name_or_path=model_name_or_path,
      task_name=f"{task_name}-2",
      task_type="text_classification",
      do_train=True,
      do_eval=False, 
      do_lower_case=True,
      data_dir=data_dir,
      max_seq_length=128,
      per_device_train_batch_size=16,
      learning_rate=2e-5,
      num_train_epochs=40.0,
      model_cache_dir=model_cache_dir,
      data_cache_dir=data_cache_dir,
      output_dir=output_dir,
      overwrite_output_dir=True
  )

# Evaluate BERT on the dev set
results = []
for seed in [1234, 2341, 3412, 4123]:
  model_dir = f"./output/tiny{task_name}-{seed}"
  result = do_target_task_finetuning(
      seed=seed,
      model_name_or_path=model_dir,
      task_name=f"{task_name}-2",
      task_type="text_classification",
      do_train=False,
      do_eval=True, 
      do_lower_case=True,
      data_dir=data_dir,
      max_seq_length=128,
      model_cache_dir=model_cache_dir,
      data_cache_dir=data_cache_dir,
      output_dir=model_dir
  )
  results.append(result["eval_acc"])

results = np.array(results)
mean = np.mean(results)
std = np.std(results)

print(f"Accuracy on TinySST dev set: {mean} +/- {std}")
elapsed_time = timeit.default_timer() - start_time
print(f"Time elapsed: {elapsed_time} seconds")

12/14/2021 18:25:37 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:25:37 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-1234', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=16, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=2e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=40.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-25-37_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=1234, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False,



12/14/2021 18:26:21 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:26:21 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-2341', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=16, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=2e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=40.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-26-21_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=2341, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False,



12/14/2021 18:27:06 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:27:06 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-3412', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=16, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=2e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=40.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-27-06_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=3412, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False,



12/14/2021 18:27:51 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:27:51 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-4123', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=16, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=2e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=40.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-27-51_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=4123, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False,



12/14/2021 18:28:36 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:28:36 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-1234', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-28-36_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=1234, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, 

12/14/2021 18:28:54 - INFO - __main__ -   ***** Eval results *****
12/14/2021 18:28:54 - INFO - __main__ -     eval_loss = 0.9202939867973328
12/14/2021 18:28:54 - INFO - __main__ -     eval_acc = 0.6467889908256881
12/14/2021 18:28:54 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:28:54 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-2341', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-28-54_b9600159aec5', 

12/14/2021 18:29:13 - INFO - __main__ -   ***** Eval results *****
12/14/2021 18:29:13 - INFO - __main__ -     eval_loss = 0.5446656346321106
12/14/2021 18:29:13 - INFO - __main__ -     eval_acc = 0.7350917431192661
12/14/2021 18:29:13 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:29:13 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-3412', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-29-13_b9600159aec5', 

12/14/2021 18:29:32 - INFO - __main__ -   ***** Eval results *****
12/14/2021 18:29:32 - INFO - __main__ -     eval_loss = 0.6850396990776062
12/14/2021 18:29:32 - INFO - __main__ -     eval_acc = 0.6272935779816514
12/14/2021 18:29:32 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:29:32 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-4123', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-29-32_b9600159aec5', 

12/14/2021 18:29:50 - INFO - __main__ -   ***** Eval results *****
12/14/2021 18:29:50 - INFO - __main__ -     eval_loss = 1.2794135808944702
12/14/2021 18:29:50 - INFO - __main__ -     eval_acc = 0.5779816513761468


Accuracy on TinySST dev set: 0.6467889908256881 +/- 0.05681526455694989
Time elapsed: 253.80112847500004 seconds
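
The spread printed above comes from `np.std`, which by default computes the population standard deviation (`ddof=0`). With only four seeds, the sample standard deviation (`ddof=1`) is noticeably larger, so it can be worth checking both. The following is a minimal sketch that uses only the standard library and the four `eval_acc` values copied from the logs above; it is an illustration, not part of the assignment scaffolding.

```python
import statistics

# eval_acc values copied from the evaluation logs above
# (seeds 1234, 2341, 3412, 4123, in that order)
accuracies = [
    0.6467889908256881,
    0.7350917431192661,
    0.6272935779816514,
    0.5779816513761468,
]

mean_acc = statistics.fmean(accuracies)
population_std = statistics.pstdev(accuracies)  # same convention as np.std(x), i.e. ddof=0
sample_std = statistics.stdev(accuracies)       # same convention as np.std(x, ddof=1)

print(f"mean           = {mean_acc:.4f}")
print(f"population std = {population_std:.4f}")
print(f"sample std     = {sample_std:.4f}")
```

The population value matches the `+/-` figure printed by the cell; the sample value is the more conservative estimate when the seeds are treated as a small sample of possible runs.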


### Question 3.2 (5 points)
Explain the changes that you made and provide some justification as to why they resulted in an improvement over the default hyperparameters.

**Following the papers mentioned above, I kept the learning rate at 2e-5. For the other parameters, the authors note that when very little training data is available, overfitting to the point of zero training loss is not a problem, so it is preferable to increase the number of training iterations. I did this in two ways: decreasing the batch size to 16 and increasing the number of epochs to 40. Together, these changes led to an improvement over the default hyperparameters.**
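
To make the "more iterations" argument concrete, the sketch below counts optimizer steps for the default and the tuned settings. It assumes 20 training examples in `tinySST` (the size suggested by the `20/20` progress bar in the back-translation cell), no gradient accumulation (as in the logged `TrainingArguments`), and that the last partial batch is kept. The helper name `num_update_steps` is hypothetical.

```python
import math


def num_update_steps(num_examples: int, batch_size: int, num_epochs: int) -> int:
    """Count optimizer steps, assuming no gradient accumulation
    and that the final partial batch of each epoch is still used."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Assumption: tinySST has 20 training examples.
NUM_TRAIN_EXAMPLES = 20

# Default settings used elsewhere in this notebook: batch size 32, 3 epochs.
default_steps = num_update_steps(NUM_TRAIN_EXAMPLES, batch_size=32, num_epochs=3)

# Tuned settings from Question 3.1: batch size 16, 40 epochs.
tuned_steps = num_update_steps(NUM_TRAIN_EXAMPLES, batch_size=16, num_epochs=40)

print(f"default: {default_steps} steps, tuned: {tuned_steps} steps")
```

Under these assumptions the default configuration performs only 3 parameter updates, while the tuned configuration performs 80.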



## Intermediate task fine-tuning
 [Phang et al. (2019)](https://arxiv.org/pdf/1811.01088.pdf) proposed the paradigm of *intermediate-task fine-tuning*: first, fine-tune `BERT` on an intermediate task, and then fine-tune the resulting model on the target task. They showed that using data-rich supervised tasks as intermediate tasks can substantially improve `BERT`'s performance on the target task. However, the conditions for successful intermediate-task fine-tuning (i.e., which tasks make good intermediate tasks) remain unclear. [Pruksachatkun et al. 2020](https://arxiv.org/pdf/2005.00628.pdf) observe that intermediate tasks that require high-level inference and reasoning abilities tend to work best, while [Vu et al. 2020](https://arxiv.org/pdf/2005.00770.pdf) indicate that the similarity between the intermediate task and the target task is crucial for successful intermediate-task fine-tuning.

In this question, we will use intermediate fine-tuning to improve our `tinySST` accuracy by first fine-tuning `BERT` on a data-rich supervised task. Here, we'll consider two tasks: natural language inference via the `MNLI` dataset [(Williams et al., 2018)](https://www.aclweb.org/anthology/N18-1101.pdf), which has 393K training examples, and `Yelp Review Full` [(Zhang et al., 2015)](https://arxiv.org/pdf/1509.01626.pdf), which is a 5-way sentiment classification task that has 650K training examples.

Since fine-tuning `BERT` on these datasets takes several hours, we will provide you with trained models to save your time. Run the following cell to download the models.

In [None]:
data_file = drive.CreateFile({'id': '1BGJYmTEq7PLmree42MfFsvYdDQPLaNR1'})
data_file.GetContentFile('bert-base-cased-finetuned-mnli.zip')

# Extract the data from the zipfile and put it into the data directory
with zipfile.ZipFile('bert-base-cased-finetuned-mnli.zip', 'r') as zip_file:
    zip_file.extractall(pretrained_models_dir)
os.remove('bert-base-cased-finetuned-mnli.zip')
print("bert-base-cased-finetuned-mnli downloaded!")

data_file = drive.CreateFile({'id': '1stDkJtL9xczoHH-iQnQ9GSZTseLILD-b'})
data_file.GetContentFile('bert-base-cased-finetuned-yelp.zip')

# Extract the data from the zipfile and put it into the data directory
with zipfile.ZipFile('bert-base-cased-finetuned-yelp.zip', 'r') as zip_file:
    zip_file.extractall(pretrained_models_dir)
os.remove('bert-base-cased-finetuned-yelp.zip')
print("bert-base-cased-finetuned-yelp downloaded!")

bert-base-cased-finetuned-mnli downloaded!
bert-base-cased-finetuned-yelp downloaded!


### Question 3.3 (10 points)
In the cell below, you should write code to fine-tune the `bert-base-cased-finetuned-mnli` model for `tinySST` and then evaluate the resulting model on the `tinySST` dev set. Unlike in previous problems, here we will just do the fine-tuning once (not with multiple random seeds). We don't provide any scaffolding code here, but you should have enough from previous cells to complete this fairly easily. Please use the improved hyperparameters you found in problem 3.1. This cell should print out the `tinySST` dev accuracy.

*Hint*
*   Since `MNLI` has three classes while `SST` has two, we need to discard its final classification layer. You can do this by calling `do_target_task_finetuning` with the argument `model_load_mode = "base_model_only"`. Please look at the cell above that defines the `do_target_task_finetuning` function to understand what this does. 
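
As a rough illustration of what discarding the classification layer means, the sketch below filters a toy checkpoint so that only the encoder parameters survive and a fresh 2-way head is attached. This is not the implementation of `do_target_task_finetuning`: the parameter names (`bert.…`, `classifier.…`), the helper `keep_encoder_weights`, and the use of shape tuples in place of real tensors are all assumptions made for the example.

```python
def keep_encoder_weights(state_dict, head_prefix="classifier."):
    """Return a copy of the checkpoint without the task-specific head parameters."""
    return {
        name: value
        for name, value in state_dict.items()
        if not name.startswith(head_prefix)
    }


# Toy checkpoint: shape tuples stand in for real tensors (names are hypothetical).
mnli_checkpoint = {
    "bert.encoder.layer.0.attention.self.query.weight": (768, 768),
    "bert.pooler.dense.weight": (768, 768),
    "classifier.weight": (3, 768),  # 3-way MNLI head
    "classifier.bias": (3,),
}

# Drop the MNLI head, keep everything the encoder learned.
encoder_only = keep_encoder_weights(mnli_checkpoint)

# Attach a freshly initialized 2-way head for SST before fine-tuning.
sst_checkpoint = dict(encoder_only)
sst_checkpoint["classifier.weight"] = (2, 768)
sst_checkpoint["classifier.bias"] = (2,)

print(sorted(encoder_only))
```

The point of the design is that the 3-way output layer cannot be reused for a 2-way task, while the encoder weights underneath it can.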

In [None]:
start_time = timeit.default_timer()
task_name = "SST"
data_dir = f"./data/tiny{task_name}"
# model_name_or_path = "bert-base-cased-finetuned-mnli"
# model_cache_dir = os.path.join(pretrained_models_dir, model_name_or_path)
model_name_or_path = "./pretrained_models_dir/bert-base-cased-finetuned-mnli"
model_cache_dir = "./pretrained_models_dir/bert-base-cased-finetuned-mnli"
data_cache_dir = f"./data_cache/finetuning/tiny{task_name}"

# Fine-tune the MNLI-fine-tuned BERT with your hyperparameters using a single random seed
for seed in [4123]:
  output_dir = f"./output/tiny{task_name}-{seed}"

  ### CHANGE ONLY THE ARGUMENTS TO THE BELOW FUNCTION
  do_target_task_finetuning(
      seed=seed,
      model_name_or_path=model_name_or_path,
      task_name=f"{task_name}-2",
      task_type="text_classification",
      do_train=True,
      do_eval=False, 
      do_lower_case=True,
      data_dir=data_dir,
      max_seq_length=128,
      per_device_train_batch_size=16,
      learning_rate=2e-5,
      num_train_epochs=40.0,
      model_cache_dir=model_cache_dir,
      data_cache_dir=data_cache_dir,
      output_dir=output_dir,
      overwrite_output_dir=True,
      model_load_mode="base_model_only",
  )

# Evaluate BERT on the dev set
results = []
for seed in [4123]:
  model_dir = f"./output/tiny{task_name}-{seed}"
  result = do_target_task_finetuning(
      seed=seed,
      model_name_or_path=model_dir,
      task_name=f"{task_name}-2",
      task_type="text_classification",
      do_train=False,
      do_eval=True, 
      do_lower_case=True,
      data_dir=data_dir,
      max_seq_length=128,
      model_cache_dir=model_cache_dir,
      data_cache_dir=data_cache_dir,
      output_dir=model_dir,
      model_load_mode="base_model_only",
  )
  results.append(result["eval_acc"])

results = np.array(results)
mean = np.mean(results)
std = np.std(results)

print(f"Accuracy on TinySST dev set: {mean} +/- {std}")
elapsed_time = timeit.default_timer() - start_time
print(f"Time elapsed: {elapsed_time} seconds")

12/14/2021 18:32:07 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:32:07 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-4123', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=16, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=2e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=40.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-32-07_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=4123, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False,



12/14/2021 18:32:51 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:32:51 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-4123', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-32-51_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=4123, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, 

12/14/2021 18:33:08 - INFO - __main__ -   ***** Eval results *****
12/14/2021 18:33:08 - INFO - __main__ -     eval_loss = 0.7349459528923035
12/14/2021 18:33:08 - INFO - __main__ -     eval_acc = 0.7717889908256881


Accuracy on TinySST dev set: 0.7717889908256881 +/- 0.0
Time elapsed: 60.94784784799958 seconds


In the below cell, do the same thing as you did in the previous cell, except fine-tune the `bert-base-cased-finetuned-yelp` model instead of the MNLI model. This cell should again print the `tinySST` dev accuracy. 

In [None]:
### YOUR CODE HERE
start_time = timeit.default_timer()
task_name = "SST"
data_dir = f"./data/tiny{task_name}"
# Load the Yelp-fine-tuned checkpoint extracted by the download cell above
model_name_or_path = "./pretrained_models_dir/bert-base-cased-finetuned-yelp"
model_cache_dir = "./pretrained_models_dir/bert-base-cased-finetuned-yelp"
data_cache_dir = f"./data_cache/finetuning/tiny{task_name}"

for seed in [4123]:
  output_dir = f"./output/tiny{task_name}-{seed}"

  ### CHANGE ONLY THE ARGUMENTS TO THE BELOW FUNCTION
  do_target_task_finetuning(
      seed=seed,
      model_name_or_path=model_name_or_path,
      task_name=f"{task_name}-2",
      task_type="text_classification",
      do_train=True,
      do_eval=False, 
      do_lower_case=True,
      data_dir=data_dir,
      max_seq_length=128,
      per_device_train_batch_size=16,
      learning_rate=2e-5,
      num_train_epochs=40.0,
      model_cache_dir=model_cache_dir,
      data_cache_dir=data_cache_dir,
      output_dir=output_dir,
      overwrite_output_dir=True,
      model_load_mode="base_model_only",
  )

# Evaluate BERT on the dev set
results = []
for seed in [4123]:  # must match the seed used for training above
  model_dir = f"./output/tiny{task_name}-{seed}"
  result = do_target_task_finetuning(
      seed=seed,
      model_name_or_path=model_dir,
      task_name=f"{task_name}-2",
      task_type="text_classification",
      do_train=False,
      do_eval=True, 
      do_lower_case=True,
      data_dir=data_dir,
      max_seq_length=128,
      model_cache_dir=model_cache_dir,
      data_cache_dir=data_cache_dir,
      output_dir=model_dir,
      model_load_mode="base_model_only",
  )
  results.append(result["eval_acc"])

results = np.array(results)
mean = np.mean(results)
std = np.std(results)

print(f"Accuracy on TinySST dev set: {mean} +/- {std}")
elapsed_time = timeit.default_timer() - start_time
print(f"Time elapsed: {elapsed_time} seconds")

12/14/2021 18:33:08 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:33:08 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-4123', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=16, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=2e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=40.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-33-08_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=4123, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False,



12/14/2021 18:33:51 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 18:33:51 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-1234', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_18-33-51_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=1234, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, 

12/14/2021 18:34:09 - INFO - __main__ -   ***** Eval results *****
12/14/2021 18:34:09 - INFO - __main__ -     eval_loss = 0.8723371028900146
12/14/2021 18:34:09 - INFO - __main__ -     eval_acc = 0.6502293577981652


Accuracy on TinySST dev set: 0.6502293577981652 +/- 0.0
Time elapsed: 61.2154324739995 seconds


### Question 3.4 (5 points)
Compare your results to the mean result you got in problem 3.1 without any intermediate fine-tuning. What was your best model, and why do you think it outperformed the others? 

***In problem 3.1, without any intermediate fine-tuning, I got a mean accuracy of ~0.65. With intermediate fine-tuning on MNLI, the accuracy rose to ~0.77, which is a large jump. This shows that fine-tuning on an intermediate task can help, especially when it is similar in nature to the target task: when the tasks are similar, the weights are updated in a direction that also benefits the final task. The paper cited in the explanation section above also reports that, because MNLI is a reasoning-based task, it is particularly helpful as an intermediate task. For these reasons, the MNLI model outperformed the others.***

---



## Data augmentation using back-translation
In this part, we will explore another approach to improve `BERT`'s performance on the target task: creating more training data (*data augmentation*) using *backtranslation*. Backtranslation refers to the process of translating a sentence from language `X` into another language `Y` (called the *pivot language*) and then translating the resulting sentence back into language `X`. Often, the final sentence contains significant lexical and syntactic variation compared to the original sentence, while roughly preserving its meaning. Here, we will use backtranslation to obtain paraphrases of the training data of the `tinySST` dataset.
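
The round trip described above can be sketched independently of any particular translation service. In the example below, `backtranslate`, `TranslateFn`, and the toy lookup table are all hypothetical names introduced for illustration; the lookup table stands in for a real translator so that the snippet runs offline.

```python
from typing import Callable

# A translation function takes (text, source_language, destination_language).
TranslateFn = Callable[[str, str, str], str]


def backtranslate(text: str, pivot: str, translate: TranslateFn, source: str = "en") -> str:
    """Translate source -> pivot -> source and return the resulting paraphrase."""
    forward = translate(text, source, pivot)
    return translate(forward, pivot, source)


# Toy stand-in for a real translation service (hypothetical data).
_TOY_TABLE = {
    ("en", "fr", "the film is great"): "le film est super",
    ("fr", "en", "le film est super"): "the movie is great",
}


def toy_translate(text: str, src: str, dest: str) -> str:
    return _TOY_TABLE[(src, dest, text)]


paraphrase = backtranslate("the film is great", pivot="fr", translate=toy_translate)
print(paraphrase)  # the movie is great
```

With the `googletrans` client used in this notebook, the same helper could plausibly be driven by `lambda t, s, d: translator.translate(t, src=s, dest=d).text`.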

Run the following cell to create a Google Translate client (via the `googletrans` package) and run it on a toy example.

In [None]:
import googletrans
# Run print(googletrans.LANGUAGES) to see available languages
from googletrans import Translator
translator = Translator()

# translate from English to French
output = translator.translate("I love natural language processing", src='en', dest='fr')
output.text

"J'aime le traitement des langues naturelles"

### Question 3.5 (15 points)
Complete the following cell to paraphrase the training data of `tinySST` using backtranslation. We have intentionally left this problem open-ended: feel free to use as many pivot languages as you like, and also write any postprocessing code you think might help. The cell after this one will fine-tune BERT on the augmented training data, so you can use its output to validate your backtranslation strategy. To obtain full points, the model fine-tuned on your augmented data must achieve a higher average accuracy than the model without any augmentation, trained with the same hyperparameters. 
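
One possible form of postprocessing is to drop paraphrases that came back empty, identical to the original sentence, or identical to each other, since these add no new variation to the training set. The helper below is a minimal sketch under that assumption; `filter_paraphrases` is a hypothetical name, and the comparison is deliberately simple (case-insensitive, whitespace-normalized).

```python
def filter_paraphrases(original, candidates):
    """Keep candidates that are non-empty, differ from the original,
    and are not duplicates of an earlier candidate."""

    def normalize(text):
        # Case-insensitive comparison with collapsed whitespace.
        return " ".join(text.lower().split())

    seen = {normalize(original)}
    kept = []
    for candidate in candidates:
        # Collapsing whitespace also removes tabs/newlines that would break a TSV row.
        cleaned = " ".join(candidate.split())
        key = normalize(cleaned)
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(cleaned)
    return kept


examples = filter_paraphrases(
    "A great movie",
    ["a great  movie", "A wonderful film", "a wonderful film", "", "A superb\tmovie"],
)
print(examples)  # ['A wonderful film', 'A superb movie']
```

Filtering before writing `train.tsv` keeps the augmented set from being dominated by exact copies of the original sentences.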

In [None]:
# -f keeps these commands from failing on a fresh runtime where the directories do not exist yet
!rm -rf data/tinySST-bt/
!rm -rf data_cache/finetuning/tinySST-bt/

In [None]:
import random

In [None]:
from tqdm import tqdm
task_name = "SST"
data_dir = f"./data/tiny{task_name}"
task_processor = glue_processors[f"{task_name.lower()}-2"]()
train_examples = task_processor.get_train_examples(data_dir)

train_examples_augmented = []

### (incomplete) list of languages you can use
languages = [
    'en',     # English
    'cs',     # Czech
    'de',     # German
    'es',     # Spanish
    'pa',     # Punjabi
    'pl',     # Polish
    'or',     # Oriya
    'mr',     # Marathi
    'mn',     # Mongolian
    'my',     # Myanmar (Burmese)
    'ne',     # Nepali
    'no',     # Norwegian
    'ny',     # Nyanja (Chichewa)
    'fi',     # Finnish
    'fr',     # French
    'hi',     # Hindi
    'it',     # Italian
    'ja',     # Japanese
    'pt',     # Portuguese
    'ru',     # Russian
    'vi',     # Vietnamese
    'zh-cn',  # Chinese
    ]

# Never pivot through the source language itself
pivot_languages = [lang for lang in languages if lang != 'en']

# generate some augmented examples for each training example
for example in tqdm(train_examples):
    train_examples_augmented.append(example) # always include the original example

    for idx in range(12):
        # Single-pivot backtranslation: en -> pivot -> en
        target_language = random.choice(pivot_languages)
        output1 = translator.translate(example.text_a, src='en', dest=target_language)
        output2 = translator.translate(output1.text, src=target_language, dest='en')
        # The paraphrase must go in text_a: only text_a is written to train.tsv below.
        # idx keeps the guid unique even if the same pivot language is drawn twice.
        train_examples_augmented.append(InputExample(guid=f"{example.guid}-aug-{idx}-{target_language}",
                                                     text_a=output2.text,
                                                     text_b=None,
                                                     label=example.label))

        # Double-pivot backtranslation: en -> pivot 1 -> pivot 2 -> en
        target_language2, target_language3 = random.sample(pivot_languages, 2)
        output1 = translator.translate(example.text_a, src='en', dest=target_language2)
        output2 = translator.translate(output1.text, src=target_language2, dest=target_language3)
        output3 = translator.translate(output2.text, src=target_language3, dest='en')
        train_examples_augmented.append(InputExample(guid=f"{example.guid}-aug-{idx}-{target_language2}-{target_language3}",
                                                     text_a=output3.text,
                                                     text_b=None,
                                                     label=example.label))


output_dir = f"./data/tiny{task_name}-bt"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
    
with open(os.path.join(output_dir, "train.tsv"), "w") as writer:
    writer.write("sentence\tlabel\n")
    for example in train_examples_augmented:
        writer.write(f"{example.text_a}\t{example.label}\n")

# Copy the original tinySST's dev set to the new directory
import shutil
shutil.copyfile(f"{data_dir}/dev.tsv", f"{output_dir}/dev.tsv")

100%|██████████| 20/20 [16:59<00:00, 50.97s/it]


'./data/tinySST-bt/dev.tsv'

The below cell fine-tunes BERT `bert-base-cased` with the combined training data (real + synthetic training examples) and then evaluates the resulting model on tinySST's dev set. Note that it uses the default fine-tuning hyperparameters, not the improved ones that you found earlier. You should observe a significantly higher accuracy than 50% when you run this cell on the augmented data (our reference implementation reaches 64%). ***Do NOT modify any code in this cell!***

In [None]:
start_time = timeit.default_timer()
task_name = "SST"
data_dir = f"./data/tiny{task_name}-bt"
model_name_or_path = "bert-base-cased"
model_cache_dir = os.path.join(pretrained_models_dir, model_name_or_path)
data_cache_dir = f"./data_cache/finetuning/tiny{task_name}-bt/"
output_dir = model_cache_dir

mean = None
std = None

# Fine-tune BERT using 4 random seeds
for seed in [1234, 2341, 3412, 4123]:
  output_dir = f"./output/tiny{task_name}-bt-{seed}"
  do_target_task_finetuning(
      seed=seed,
      model_name_or_path=model_name_or_path,
      task_name=f"{task_name}-2",
      task_type="text_classification",
      do_train=True,
      do_eval=False, 
      do_lower_case=True,
      data_dir=data_dir,
      max_seq_length=128,
      per_device_train_batch_size=32,
      learning_rate=2e-5,
      num_train_epochs=3.0,
      model_cache_dir=model_cache_dir,
      data_cache_dir=data_cache_dir,
      output_dir=output_dir,
      overwrite_output_dir=True
  )

# Evaluate BERT on the dev set
results = []
for seed in [1234, 2341, 3412, 4123]:
  model_dir = f"./output/tiny{task_name}-bt-{seed}"
  result = do_target_task_finetuning(
      seed=seed,
      model_name_or_path=model_dir,
      task_name=f"{task_name}-2",
      task_type="text_classification",
      do_train=False,
      do_eval=True, 
      do_lower_case=True,
      data_dir=data_dir,
      max_seq_length=128,
      model_cache_dir=model_cache_dir,
      data_cache_dir=data_cache_dir,
      output_dir=model_dir,
  )
  results.append(result["eval_acc"])

results = np.array(results)
mean = np.mean(results)
std = np.std(results)

print("===== Data augmentation using back-translation =====")
print(f"Performance when fine-tuning BERT: {mean} +/- {std}")
elapsed_time = timeit.default_timer() - start_time
print(f"Time elapsed: {elapsed_time} seconds")

12/14/2021 17:33:31 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 17:33:31 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-bt-1234', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=32, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=2e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_17-33-30_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=1234, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=Fals



12/14/2021 17:34:38 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 17:34:38 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-bt-2341', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=32, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=2e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_17-34-38_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=2341, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=Fals



12/14/2021 17:35:46 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 17:35:46 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-bt-3412', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=32, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=2e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_17-35-46_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=3412, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=Fals



12/14/2021 17:36:53 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 17:36:53 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-bt-4123', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=32, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=2e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_17-36-53_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=4123, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=Fals



12/14/2021 17:38:01 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 17:38:01 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-bt-1234', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_17-38-01_b9600159aec5', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=1234, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=Fals

12/14/2021 17:38:19 - INFO - __main__ -   ***** Eval results *****
12/14/2021 17:38:19 - INFO - __main__ -     eval_loss = 0.7848482728004456
12/14/2021 17:38:19 - INFO - __main__ -     eval_acc = 0.6330275229357798
12/14/2021 17:38:19 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 17:38:19 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-bt-2341', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_17-38-19_b9600159aec5

12/14/2021 17:38:36 - INFO - __main__ -   ***** Eval results *****
12/14/2021 17:38:36 - INFO - __main__ -     eval_loss = 0.5564491748809814
12/14/2021 17:38:36 - INFO - __main__ -     eval_acc = 0.7121559633027523
12/14/2021 17:38:36 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 17:38:36 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-bt-3412', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_17-38-36_b9600159aec5

12/14/2021 17:38:53 - INFO - __main__ -   ***** Eval results *****
12/14/2021 17:38:53 - INFO - __main__ -     eval_loss = 0.7340466976165771
12/14/2021 17:38:53 - INFO - __main__ -     eval_acc = 0.5974770642201835
12/14/2021 17:38:53 - INFO - __main__ -   Process device: cuda:0, n_gpu: 1
12/14/2021 17:38:53 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./output/tinySST-bt-4123', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec14_17-38-53_b9600159aec5

12/14/2021 17:39:10 - INFO - __main__ -   ***** Eval results *****
12/14/2021 17:39:10 - INFO - __main__ -     eval_loss = 0.8820002675056458
12/14/2021 17:39:10 - INFO - __main__ -     eval_acc = 0.5871559633027523


===== Data augmentation using back-translation =====
Performance when fine-tuning BERT: 0.632454128440367 +/- 0.04906126284548676
Time elapsed: 339.2624342959998 seconds
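
The reported score appears to be the mean of the four `eval_acc` values logged above (seeds 1234, 2341, 3412, 4123), with the spread matching a population standard deviation. That reading is an inference from the numbers, not something the notebook states. A minimal check, using only the values printed in the log:

```python
from statistics import mean, pstdev

# eval_acc values copied from the four evaluation runs in the log above
eval_accs = [
    0.6330275229357798,  # seed 1234
    0.7121559633027523,  # seed 2341
    0.5974770642201835,  # seed 3412
    0.5871559633027523,  # seed 4123
]

avg = mean(eval_accs)
# Assumption: population standard deviation (divide by N, not N - 1),
# which is what reproduces the reported +/- value.
spread = pstdev(eval_accs)

print(f"Performance when fine-tuning BERT: {avg:.4f} +/- {spread:.4f}")
```

With only four seeds the spread is large relative to the mean, so small differences between augmentation strategies should be read with caution.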


### Question 3.6 (5 points)
Briefly explain your backtranslation strategy here. Why do you think it resulted in an improvement?



*   I follow a simple backtranslation strategy. First, I add 9 new languages to the existing language list above. I then run the augmentation loop 12 times, and in each iteration I augment the sentence in two ways: a one-pivot approach and a two-pivot approach. In the one-pivot approach, I choose a pivot language at random, translate the English sentence into the pivot language and then back into English, and append the result to the list of sentences used for training. The two-pivot approach is similar, except that the sentence is translated from English into pivot language 1, then from pivot language 1 into pivot language 2, and finally from pivot language 2 back into English.
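
The one-pivot and two-pivot strategy described above can be sketched as follows. This is an illustrative outline, not the code used for the reported run. The `translate(text, src, tgt)` callable, the function names, the language codes, and the choice to loop over every sentence in each of the 12 rounds are all assumptions made for the example. In the actual assignment each augmented sentence would also need to keep the label of its source sentence.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical translator signature: (text, source_lang, target_lang) -> text
Translator = Callable[[str, str, str], str]


def backtranslate(sentence: str, pivots: Sequence[str], translate: Translator) -> str:
    """Translate English -> pivot 1 -> ... -> pivot k -> English."""
    text, src = sentence, "en"
    for lang in pivots:
        text = translate(text, src, lang)
        src = lang
    return translate(text, src, "en")


def augment(
    sentences: Sequence[str],
    languages: Sequence[str],
    translate: Translator,
    rounds: int = 12,
    seed: int = 0,
) -> List[str]:
    """Return the original sentences plus one- and two-pivot paraphrases."""
    rng = random.Random(seed)
    augmented = list(sentences)
    for _ in range(rounds):
        for sentence in sentences:
            one_pivot = [rng.choice(languages)]       # en -> p1 -> en
            two_pivots = rng.sample(languages, 2)     # en -> p1 -> p2 -> en
            augmented.append(backtranslate(sentence, one_pivot, translate))
            augmented.append(backtranslate(sentence, two_pivots, translate))
    return augmented


def echo_translate(text: str, src: str, tgt: str) -> str:
    """Stand-in translator for the demo; a real system would call an MT model."""
    return text


demo = augment(["a fine film", "a dull plot"], ["de", "fr", "es"], echo_translate)
print(len(demo))  # 2 originals + 12 rounds * 2 sentences * 2 variants
```

Sampling two distinct pivots with `rng.sample` avoids a degenerate two-pivot chain in which both hops use the same language.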


In [None]:
# END OF ASSIGNMENT 