# Lab 9: Finetuning BERT-based models

In this lab, we will explore finetuning a BERT model for classification. Your task will be to classify movie reviews as positive or negative (i.e. the task from HW2). Supporting code will help you with the actual training. Your job is to focus on formatting the input and evaluating the accuracy of your final model.

Once you finish working on this lab, please download it as a .ipynb notebook and submit the notebook to Moodle.

#### **What should I do if I run out of RAM?**

The free GPUs that Colab assigns might not always be reliable. Sometimes your code will run without issues, and other times you might run into RAM errors. For this reason, try to train your models on as much data as possible, but do not worry if you are not able to train on all of the data. You can also try running the models on your personal computer without a GPU!


### Guiding Questions

1. How do we use pretrained transformer models?
1. How do we finetune a neural model of language for classification?

### Learning Objectives

1. Understand how to map classification to the context of transformers
1. Gain hands-on experience using pretrained models
1. Build a classifier on top of a pretrained model
1. Reason about your model and its abilities

### Rubric

| Question | Points |
| ------| ----- |
| load_data | 25 Points |
| Reflection | 75 Points |

### Deadline:

November 12, 11PM EST

### Submission format:
ipynb file saved after running your code cells and submitted to Moodle

## Overview

We will use HuggingFace throughout this lab. Finetuning involves six steps:

1. Loading a pretrained model
2. Loading our finetuning data
3. Searching for good hyperparameters for our model
4. Training our model
5. Testing our model
6. Saving our model

HuggingFace comes with tools for doing all of these. I will give very brief overviews of them below and point you to their code base. This will be useful for at least some of your final projects, so please look over these materials on your own.

### The model and the tokenizer

HuggingFace hosts a large number of pretrained models. You can find them [here](https://huggingface.co/). The main library for working with these models is transformers (documentation [link](https://huggingface.co/docs/transformers/index)). Each model comes with a tokenizer which maps from words to ids for the relevant model.

### The data format

To finetune our model with HuggingFace, we need our data formatted in a particular way. We will use their datasets library (documentation [link](https://huggingface.co/docs/datasets/index)). Consider the following (modified) sample from the movie review dataset we are using:

    [{'text': 'Note that I did not say that it is',
      'label': 0},
     {'text': 'In what is arguably the best outdoor adventure film of all time, '
              'four city guys confront nature\'s wrath, in a story of survival.',
      'label': 1}]

Notice that it is a list of dictionaries, each pairing a review's text with its label. You will write a data loader that produces this format.
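Before handing data to HuggingFace, it can help to confirm that every record has the shape shown above. The helper below is a hypothetical sketch: the name `check_records` is not part of the lab or of any HuggingFace library, and it only inspects structure without reading any files.

```python
def check_records(records: list[dict]) -> bool:
    """Return True if every record looks like {'text': str, 'label': int}."""
    for record in records:
        # Each record must have exactly the two expected keys
        if set(record) != {"text", "label"}:
            return False
        # The text should be a string and the label an integer id
        if not isinstance(record["text"], str) or not isinstance(record["label"], int):
            return False
    return True


sample = [
    {"text": "Note that I did not say that it is", "label": 0},
    {"text": "four city guys confront nature's wrath", "label": 1},
]
print(check_records(sample))                   # True
print(check_records([{"text": "no label"}]))   # False
```

Running a check like this on the output of your loader is a cheap way to catch a missing key or a string label before training starts.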

### Hyperparameter search with Trainer

Recall from HW2 that one thing we have to do is find good hyperparameters. Luckily, there exist libraries that facilitate this. We will use [optuna](https://optuna.org/) coupled with HuggingFace's libraries to find optimal hyperparameters for our model.

### Training with Trainer

To finetune our model, we need a way of training the model. HuggingFace has a utility called [Trainer](https://huggingface.co/docs/evaluate/main/en/transformers_integrations#trainer) that will handle this for us.

### Testing with Evaluate

Finally, we need to evaluate our model to see if it is any good. You've already computed accuracy, precision, recall, and F1-scores yourself before. Here we will make use of HuggingFace's [Evaluate](https://huggingface.co/docs/evaluate/main/en/index) library to do the hard things for us.
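As a reminder of what these four numbers mean, here is a minimal pure-Python sketch. The function `binary_metrics` and the toy labels are made up for illustration and are not how the Evaluate library is implemented. The sketch treats label id 1 as the "positive" class for precision and recall, which is an assumption you should check against the label mapping you use.

```python
def binary_metrics(predictions: list[int], references: list[int],
                   pos_label: int = 1) -> dict:
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(predictions, references))
    tp = sum(p == pos_label and r == pos_label for p, r in pairs)
    fp = sum(p == pos_label and r != pos_label for p, r in pairs)
    fn = sum(p != pos_label and r == pos_label for p, r in pairs)
    correct = sum(p == r for p, r in pairs)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example with five made-up predictions and gold labels
preds = [1, 0, 1, 1, 0]
golds = [1, 0, 0, 1, 1]
print(binary_metrics(preds, golds))
```

Keep in mind that "positive class" here means whichever label has id 1, which need not be the positive-sentiment label.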


## Setup

In [None]:
!pip install optuna transformers datasets accelerate evaluate

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch, evaluate, accelerate
from transformers import TrainingArguments, Trainer
import glob
import numpy as np
from datasets import Dataset, load_dataset
import optuna
import random

In [None]:
# Set device depending on whether or not you have access to GPUs
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
device

## Load Model

In [None]:
modelname = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(modelname, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(modelname,
                                                            num_labels=2).to(device)

## Load Data

In [None]:
!gdown 18iyGEGz4csxVDUee5gqUafbmvovUBzvr
!unzip -o sentiment_data.zip
!rm -rf __MACOSX/

In [None]:
#TODO: 25pts
def load_data(path: str, label2id: dict) -> list[dict]:
    """ Loads movie review data into a list of dictionaries
    Args:
        path (str): Path to sentiment data directory
                    (e.g., sentiment_data)
        label2id (dict): Dict mapping label (i.e. pos, neg)
                        to numbers (e.g., {'pos': 0, 'neg': 1})
    Returns:
        data (list[dict]): List of dictionaries, see the markdown block above
                            this cell.
    """
    raise NotImplementedError

In [None]:
def getDataset(path: str, label2id: dict,
               tokenizer: AutoTokenizer = None,
               tokenize: bool = True,
               percent: float = 0.25) -> Dataset:
    """ Return HuggingFace Dataset instance
    Args:
        path (str): path to directory
        label2id (dict): Dictionary mapping classification labels to id
        tokenizer (AutoTokenizer): A HuggingFace pre-trained tokenizer
        tokenize (bool): Whether to tokenize data. Default True
        percent (float): Fraction of the shuffled data to keep. Default 0.25
    Returns:
        (Dataset): HuggingFace Dataset instance
    """
    data = load_data(path, label2id)
    # Shuffle the data, then keep the first `percent` of it
    random.shuffle(data)
    data = data[:int(len(data)*percent)]
    data = Dataset.from_list(data)
    # Tokenize
    if tokenize:
        if tokenizer is None:
            # Fail loudly instead of silently returning None
            raise ValueError("Pass a tokenizer when tokenize=True")
        data = data.map(lambda examples: tokenizer(examples["text"],
                                                   return_tensors="pt",
                                                   padding=True, truncation=True),
                        batched=True).with_format("torch")
    return data

In [None]:
train_small_dataset = getDataset("sentiment_data/train", {"pos": 0, "neg": 1}, tokenizer, percent=0.05)
train_dataset = getDataset("sentiment_data/train", {"pos": 0, "neg": 1}, tokenizer, percent=0.25)
eval_dataset = getDataset("sentiment_data/eval", {"pos": 0, "neg": 1}, tokenizer)
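Note that `getDataset` shuffles with the global `random` module and no seed, so each call (and each notebook restart) can select a different subset of reviews. If you want repeatable subsets while debugging, one option is a seeded helper along these lines. This is a sketch only: `subsample` and its `seed` parameter are hypothetical names, not part of the lab's supporting code.

```python
import random


def subsample(items: list, percent: float, seed: int = 0) -> list:
    """Shuffle a copy of items and keep the first `percent` fraction."""
    rng = random.Random(seed)      # local RNG so the global state is untouched
    shuffled = list(items)         # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    return shuffled[:int(len(shuffled) * percent)]


reviews = [f"review {i}" for i in range(100)]
small = subsample(reviews, percent=0.05)
print(len(small))                                   # 5
print(small == subsample(reviews, percent=0.05))    # True: same seed, same subset
```

Using a local `random.Random` instance keeps the choice reproducible without changing the behavior of other code that relies on the global generator.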

## Hyperparameter search

Recall from HW2 that models involve hyperparameters. We often want to find optimal hyperparameters that will result in better models. We can automate this process using HuggingFace. What we will need to do is install a hyperparameter optimization library. We will use [optuna](https://optuna.org/).

At its core, hyperparameter optimization is about trying different configurations and comparing model performance. To facilitate this, we will need a way of resetting our model so we can try new hyperparameters. The model_init function below does just that. Additionally, we will want to sample a smaller amount of data, since tuning can take a long time! The code below creates a setup that does this. Look over the code and make sure you understand its aims!

In [None]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(modelname, num_labels=2).to(device)

# Use accuracy as the metric at eval steps
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Set some initial parameters
batch_size=20
args = TrainingArguments(
        f"{modelname}-finetuned-movie-reviews",
        evaluation_strategy = "epoch",
        save_strategy = "epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=2,
        weight_decay=0.01
)

# Set up a trainer with less data
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=train_small_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

# Find the best hyperparameters over n_trials runs (kept small here so the search finishes quickly)
best_run = trainer.hyperparameter_search(n_trials=2, direction="maximize",
                                        backend="optuna")

## Train

Now that we have some (hopefully) good hyperparameters, let's train our model on our full training data! This may take some time, so look over the reflection questions in the meantime!

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(modelname,
                                                            num_labels=2).to(device)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)
trainer.train()
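The `setattr` loop in the cell above copies each tuned value onto the trainer's arguments by name. The toy example below illustrates the same pattern on a hypothetical `ToyArgs` dataclass with made-up search results; it does not use the real training arguments class.

```python
from dataclasses import dataclass


@dataclass
class ToyArgs:
    """Hypothetical stand-in for a training-arguments object."""
    learning_rate: float = 2e-5
    num_train_epochs: int = 2
    weight_decay: float = 0.01


toy_args = ToyArgs()
toy_best = {"learning_rate": 5e-5, "num_train_epochs": 3}  # made-up search result

for name, value in toy_best.items():
    # Equivalent to toy_args.learning_rate = 5e-5, and so on
    setattr(toy_args, name, value)

print(toy_args)
```

Any setting the search did not tune (here, `weight_decay`) keeps its original value.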

## Test

Now that we've trained our model, we need to see if it's any good. Let's evaluate it on test data. First we load that data, then we make use of HuggingFace's evaluate library. We are interested in text classification here, so we use that task. In particular, we return the accuracy, precision, recall, and F1-score for our trained model on our test data.

If you ran into memory issues or want to evaluate a model trained for longer on more data, run the following code block.

In [None]:
!gdown 1n6M1LasX02kEe4KYMiNHbbH0-ZIH1Zio
!unzip -o distilbert-for-movie-reviews.zip
!rm -rf __MACOSX/

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-for-movie-reviews",
                                                           num_labels=2).to(device)
tokenizer = AutoTokenizer.from_pretrained("distilbert-for-movie-reviews")

The following gets your test data and returns metrics for the model.

In [None]:
test_dataset = getDataset("sentiment_data/test", {"pos": 0, "neg": 1}, tokenizer, tokenize=False)

In [None]:
task_evaluator = evaluate.evaluator("text-classification")
model.eval()
model.to("cpu")
results = task_evaluator.compute(
    model_or_pipeline=model,
    tokenizer=tokenizer,
    data=test_dataset,
    metric=evaluate.combine(["accuracy", "recall", "precision", "f1"]),
    label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0})
print(results)

## Save Model

Finally, we may want to save our final model. We can do that as below. I also show how we can load our saved model for later use.

In [None]:
# Save the model
trainer.save_model("distilbert-for-movie-reviews")

In [None]:
# Load the saved (finetuned) model
model = AutoModelForSequenceClassification.from_pretrained("distilbert-for-movie-reviews",
                                                           num_labels=2).to(device)
tokenizer = AutoTokenizer.from_pretrained("distilbert-for-movie-reviews")

## Reflection

1. [25pts] Below, evaluate a non-finetuned version of DistilBERT on our task. Compare the accuracy of that model with your final model. Reflect on your precision, recall, and F1-score. What is your model doing better or worse at?

In [None]:
# Write your code here


2. [10pts] How does the model's performance compare to your Naive Bayes classifier? What do you think might contribute to the differences between these two models?

[Write your answer here]

3. [5pts] For hyperparameter tuning, what parameters in your model were hyperparameters?

[Write your answer here]

4. [35pts] Run the following code snippet, which evaluates your model on a movie review I wrote. Try out some cases of your own. Does your model work well? Can you come up with cases that trick it?  

In [None]:
@torch.no_grad()
def getScore(review: str, model: AutoModelForSequenceClassification,
            tokenizer: AutoTokenizer) -> str:
    """ Returns the label of the movie review"""
    model.eval()
    model = model.to("cpu")
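    # The tokenizer returns the input ids together with an attention mask
    inputs = tokenizer(review, return_tensors="pt", padding=True,
                       truncation=True)
    logits = model(**inputs).logits
    # Index of the highest logit for the single review in the batch
    pred = torch.argmax(logits, dim=-1).item()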
    if pred == 0:
        return "Positive"
    else:
        return "Negative"

review = "Sunset Boulevard is an eery movie that deeply upset me. I did love it though."
getScore(review, model, tokenizer)
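`getScore` returns only the winning label. When you look for reviews that trick the model, it can also help to see how confident the model is, which you can estimate by passing the two logits through a softmax. The sketch below uses made-up logits and a hypothetical `softmax` helper written in plain Python; it does not call the model.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one review: index 0 = positive, index 1 = negative
probs = softmax([1.2, 0.8])
print(probs)
```

Probabilities close to 0.5 suggest the model is unsure, which is often the case for reviews with mixed sentiment.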