# Homework 3, Part 2

For this part of the homework, we'll work with the data we labeled in Part 1 on estimating human values to fine-tune a deep learning classifier based on a pre-trained language model. Pre-trained language models like BERT or GPT4 have all been trained to do the task of language modeling in some form and their parameters "know" about language through doing the task, much like how in Homework 2 the word vectors end up "knowing" about word meaning through the task of context word prediction. By starting from these parameters, we can often build a much more effective model for some NLP tasks, such as a classifier. Often, we speak of _fine-tuning_ the parameters for a specific task to distinguish from the _pre-training_ step. Both are training the model, but the former is what we'll do as practitioners to adapt the model to get it to do what we want.

To fine-tune our classifier, we'll be using the Huggingface [`transformers`](https://huggingface.co/) library. This library provides many useful functions for working with pre-trained language models and its [model repository](https://huggingface.co/models) contains many pre-trained models that companies and individuals have shared for doing different tasks. With their code, we can often quickly get a basic model up and running.

In particular, we'll be going through how to use the Huggingface [`Trainer`](https://huggingface.co/docs/transformers/training) class, which is a powerful, high-level abstraction for how to fine-tune models for many types of different NLP tasks. We'll look at two types of tasks:
- Training a regular classifier
- Training a _multilabel_ classifier (i.e., one that predicts multiple different labels at the same time)

The first type is similar to what we saw in Homeworks 1 and 2. The second type is what we'll need for our annotated data, where zero or more values can be present in any given instance: we want to predict all of the values that are present at once, not just one at a time.
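
To make the difference concrete, here is a minimal sketch of what the targets look like in each setup. The list of value names is a small illustrative subset and the helper `to_multi_hot` is hypothetical; neither comes from the homework code.

```python
# Illustrative subset of value names (assumption); the real data has more columns.
VALUES = ["Security: personal", "Security: societal", "Achievement"]

def to_multi_hot(present, values=VALUES):
    """Return one 0/1 entry per value, marking which values are present."""
    return [1 if value in present else 0 for value in values]

# Regular classification: each instance gets exactly one class ID.
single_label_target = 1

# Multilabel classification: zero or more values can be present at once.
print(to_multi_hot({"Security: personal", "Achievement"}))  # [1, 0, 1]
print(to_multi_hot(set()))                                  # [0, 0, 0]
```

A regular classifier picks one class per instance, while a multilabel classifier makes an independent yes/no decision for every position in this vector.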

We'll start by going through the basics of how to load data and get it prepared for use with `Trainer` and then extend it to the multilabel case.

## Summary of the Training Task and Learning Goals

Homework 3 focuses on a subjective annotation/classification task: What values are signaled by an author in their writing? For this part of the homework we've provided example data from the recent SemEval shared task: [SemEval-2023 Task 4: ValueEval: Identification of Human Values Behind Arguments](https://aclanthology.org/2023.semeval-1.313/). This dataset and format will be similar to what we have to annotate in Part 1, so this is just to get you started.

Learning goals
- Gain experience in working with the `torch` and `transformers` libraries
- Learn how to tokenize text and create batches
- Learn how to fine-tune pre-trained models for different tasks using the `Trainer` class
- Learn how to evaluate models
- Learn how to train models on a GPU 
- Learn how to use Great Lakes to submit jobs 

## How to do this part of the assignment

Huggingface makes it easy to fine-tune models. We'll be working with one very small model in this case. To get things started, you can get most of the code working in this notebook, including doing some very small-scale training (e.g., a few training steps) which will verify that all the steps work. Once you get it working, you'll then convert this notebook to a script and run it on Great Lakes to train the whole model and save the model. You only need to use Great Lakes for training.

Once you have the final model, you can run the experiments with it on your laptop in the next notebook (Part 3). 

## Important Notes on Training on Great Lakes

This homework requires that you use a GPU for fine-tuning your model. You can use your own or you can use the GPUs on Great Lakes for free.  We have provided you with a course account for Great Lakes, `si630w24`, which will give you access to many hours of GPU time on the cluster using state-of-the-art GPUs. Your Great Lakes jobs are limited to 4 hours of training, which is more than enough to train your model for this assignment. 

We've put together a [guide](https://docs.google.com/document/d/1YtOkxSGUyX0siaOtKhry-dMgKDqPAycPvlE62wbrBsM/edit?usp=sharing) on how to access Great Lakes, set up your environment, and submit your jobs. We strongly encourage learning to use Great Lakes for this assignment as (1) you have free access to substantial computational resources and (2) you would want to use it for the next homework and the final project.

Great Lakes uses a queueing system where users (like you) submit requests for their programs to run (often called "jobs"). Great Lakes supports two kinds of job requests: (1) interactive mode jobs that give you a Jupyter notebook and (2) running a script as a job. **We strongly encourage using scripts and discourage interactive jobs for the GPUs.** To get a GPU, you'll need to submit a job to the cluster, which uses [SLURM for scheduling](https://arc.umich.edu/greatlakes/slurm-user-guide/). If you queue for an interactive job, you will have no control over when it starts, so your notebook may end up running for 4 hours from 3am to 7am and then ending, at which point you have to get back in the queue. If you submit a job as a script (i.e., a .py file that runs the code in this notebook), it will run for the specified amount of time and save your fine-tuned model without you having to interact with anything.

SLURM and cluster scheduling are very common in some industries where there is a single cluster resource and people share it by submitting jobs to run so that no one can monopolize the system and that jobs can run in parallel. Given that Great Lakes will be useful to you in future assignments and projects, we strongly encourage you to learn how to use it effectively in this assignment.

Depending on how you're working on this file, there are a few ways to directly convert the notebook to a file if you use [Jupyter or the command line](https://mljar.com/blog/convert-jupyter-notebook-python/) or [VSCode](https://stackoverflow.com/questions/64297272/best-way-to-convert-ipynb-to-py-in-vscode). Once you convert it, you'll modify the file slightly to change the epochs and text file as specified in the PDF. **We also strongly recommend having your script save the model at the end of every epoch.** That way, if your script takes longer than 4 hours and gets killed, you still have the best model you could get from the training that did finish.

In [1]:
import pandas as pd
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorWithPadding, AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np
from tqdm import tqdm
import os
from datasets import load_dataset
from collections import defaultdict
from datasets import load_metric
from datasets import Dataset, DatasetDict

from transformers import EvalPrediction
from sklearn.metrics import precision_recall_fscore_support


# import torch
torch.manual_seed(42)
np.random.seed(42)

[2024-04-04 20:47:45,636] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)


# Task 3.1.1: Load Dataset 

We'll start by loading the dataset for the [SemEval 2023 Values prediction task](https://aclanthology.org/2023.semeval-1.313/) and converting this into a `Dataset` that we can use to train our model. 

For the `Trainer` code to work, it expects a column called "labels" (unless you want to configure it more heavily). In our case, we've already given you one column to rename as "labels". 

- Start by [loading](https://huggingface.co/docs/datasets/v1.2.1/loading_datasets.html) the training, dev, and test data into a pandas `DataFrame`.
- Add or rename a column so that each `DataFrame` has a "labels" column corresponding to the values of the "Security: personal" column
- Then convert all of these into a `Dataset`.
- Then wrap all of these in a `DatasetDict`

We recommend reading the [Huggingface documentation](https://huggingface.co/docs/datasets/index) on how to load and interact with datasets, as you'll end up doing this a lot as a practitioner!

In [2]:
# TODO: See above instructions

# Load data into pandas DataFrames
train_df = pd.read_csv('si630-w24-train.tsv', sep='\t')
dev_df = pd.read_csv('si630-w24-dev.tsv', sep='\t')
test_df = pd.read_csv('si630-w24-test.tsv', sep='\t')

# Rename the relevant column to 'labels'
train_df = train_df.rename(columns={'Security: personal': 'labels'})
dev_df = dev_df.rename(columns={'Security: personal': 'labels'})
test_df = test_df.rename(columns={'Security: personal': 'labels'})

# Convert DataFrames to Huggingface Datasets
train_dataset = Dataset.from_pandas(train_df)
dev_dataset = Dataset.from_pandas(dev_df)
test_dataset = Dataset.from_pandas(test_df)

# Create a DatasetDict
dataset_dict = DatasetDict({
    'train': train_dataset,
    'validation': dev_dataset,
    'test': test_dataset
})

In [3]:
display(train_df)

Unnamed: 0,inst_id,text,labels
0,A01002,We should ban human cloning because as it will...,0
1,A01005,We should ban fast food because fast food shou...,1
2,A01006,We not should end the use of economic sanction...,0
3,A01007,We not should abolish capital punishment becau...,0
4,A01008,We not should ban factory farming because fact...,1
...,...,...,...
5388,E08016,The EU should integrate the armed forces of it...,0
5389,E08017,Food whose production has been subsidized with...,1
5390,E08018,Food whose production has been subsidized with...,0
5391,E08019,Food whose production has been subsidized with...,1


# Task 3.2 Preparing the Data

Once we have our Dataset created, we need to turn it into features and labels so we can train the model with it.

## Task 3.2.1 Tokenizing

Pre-trained language models are each associated with a specific method for tokenizing data, much like how in Homework 1 you wrote the `tokenize` function that turns strings into discrete features. Huggingface has a fast `tokenizers` package that we will use here through the `AutoTokenizer` class.

On Huggingface, each model is associated with a particular unique string. For example, the original BERT model is `bert-base-uncased`. You can use this string to look up the corresponding tokenizer with `AutoTokenizer`.

Be sure to set the `model_max_length` argument to indicate what's the maximum length sequence this model can handle.


In [4]:
# NOTE: you can use this smaller model if you want to get started
# model_name = 'microsoft/MiniLM-L12-H384-uncased'
model_name = "google-bert/bert-base-cased"

# TODO: Load the tokenizer using AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=512)

## Task 3.2.2 Converting the labels 

Each label needs to also be associated with an ID (starting at 0). In 3.2.2, we'll do a _very simple_ pass and **use only one label column** (you'll use more columns later). Your tasks are

- Create a list of the labels in the dataset.  For Task 3.2.2, *this list should contain only a single label* (for now)
- Create a mapping from ID to label, and the reverse 


In [5]:
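labels = train_df['labels'].unique()
# Map each label to an ID (starting at 0), then invert that mapping.
# Keep the two dictionaries separate so they do not alias each other.
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}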


print("Label to ID:", label2id)
print("ID to Label:", id2label)

Label to ID: {0: 0, 1: 1}
ID to Label: {0: 0, 1: 1}
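
Because the labels in this column are already 0 and 1, the two printed mappings look identical, which can hide mistakes in how they are built. A small sketch with hypothetical string labels (not from the dataset) makes the two directions easier to tell apart:

```python
# Hypothetical string labels, used only to make the two directions visible.
labels = ["absent", "present"]

# label -> ID, with IDs starting at 0
label2id = {label: idx for idx, label in enumerate(labels)}
# ID -> label, built by inverting the first mapping
id2label = {idx: label for label, idx in label2id.items()}

print(label2id)  # {'absent': 0, 'present': 1}
print(id2label)  # {0: 'absent', 1: 'present'}
```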


## Task 3.2.3: Preprocessing the data

Models are typically fine-tuned using _batches_ rather than single instances. For the attention-based classifier in Homework 2 Part 2, we used a batch size of 1 because of the key challenge in batching over sequences: texts have different lengths, but all items in a batch need to be the same length. 

Here, we'll use a function to _create_ batches of the same length by padding them with extra tokens typically labeled as `[PAD]`. The underlying model code knows to avoid doing computation with these tokens so they don't have any effect on the model's output, other than making the tensors all the same size; if you're curious, look around for details of the "attention mask" to see how the `[PAD]` token gets ignored (you'll see this in the tokenized dataset object too!).

One of the big reasons we use a `Dataset` is that it supports easy preprocessing to turn the text into IDs and do this truncation for us. We'll define the function below that says how to transform the instances and then call `map` on the dataset to get the preprocessed/tokenized data back out.

**Important Note**: One key hyperparameter to deal with is the maximum length of the sequence. If you recall from our discussion of attention, the attention mechanism is $O(n^2)$ for the length $n$ of a sequence. This means long sequences get very expensive in the kinds of models we'll use here. Since all the items in a batch get padded to the length of either the longest sequence in a batch or the maximum length of the model, one long sequence can mean we use a lot more memory just to hold empty `[PAD]` tokens. As a result, we often truncate very long sequences to make them fit in memory, under the assumption that the extra tokens aren't that informative. In the code below, we pad to the longest sequence in a batch, which hopefully helps keep things smaller. However, in your projects, you may explore other truncation strategies.
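
As a rough illustration of what padding and the attention mask look like, here is a plain-Python sketch. The helper `pad_batch`, the token IDs, and the pad ID of 0 are assumptions for illustration; in practice the tokenizer does all of this for you, and its truncation also takes care of special tokens, which this sketch does not.

```python
PAD_ID = 0  # assumed ID for the [PAD] token; the real value comes from the tokenizer

def pad_batch(sequences, pad_id=PAD_ID, max_length=None):
    """Truncate to max_length (if given), then pad every sequence to the longest one."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Made-up token IDs for three texts of different lengths
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 7, 8, 9, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)

# How much of the padded batch is just padding?
total = sum(len(row) for row in input_ids)
real = sum(sum(row) for row in attention_mask)
print(f"{total - real} of {total} positions are padding")
```

A single long sequence forces every other row in the batch to grow to its length, which is why truncation and the choice of maximum length matter for memory.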

In [7]:
def tokenize_function(examples, padding="longest", truncation=True):

    ### TODO: Tokenize the text using the tokenizer using the 
    # specified function arguments
    # Pass the arguments through instead of hard-coding their values
    return tokenizer(examples["text"], padding=padding, truncation=truncation)
    
# TODO: Call .map on the dataset to tokenize the data
tokenized_datasets = dataset_dict.map(tokenize_function, batched=True)

Map:   0%|          | 0/5393 [00:00<?, ? examples/s]

Map:   0%|          | 0/1896 [00:00<?, ? examples/s]

Map:   0%|          | 0/1576 [00:00<?, ? examples/s]

In [8]:
# Let's see what all got added to the tokenized_datasets
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['inst_id', 'text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5393
    })
    validation: Dataset({
        features: ['inst_id', 'text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1896
    })
    test: Dataset({
        features: ['inst_id', 'text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1576
    })
})

In [9]:
# Let's take a look at what we have
example = tokenized_datasets['train'][0]

# We can reverse the tokenization to see the original text too
tokenizer.decode(example['input_ids'])

'[CLS] We should ban human cloning because as it will only cause huge issues when you have a bunch of the same humans running around all acting the same. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'

### Decide on where the computation will be

The deep learning library under all the Huggingface code is `torch`, which is a set of libraries for doing matrix math quickly. Different hardware has different capabilities for doing math, so `torch` lets you specify where you want to do the computation with the `device` argument. GPUs are designed to do fast matrix multiplication and so are ideal for our purposes. As you might have noticed in Homework 2, though, `torch` can change the device and will run just fine (but slower) on the CPU. The code snippet below shows how to choose the device.

**Important Note:** When doing the computation, both the data and the parameters need to be on the same device. This means if you move the pre-trained model to the GPU but the data is sitting on the CPU, the model can't see the data (or vice-versa). This is why you'll see a lot of `something.to(device)` calls to make sure everything is on the same device.

In [6]:
# check if gpu is available
device = 'cpu' 
if torch.backends.mps.is_available():
    device = 'mps'
if torch.cuda.is_available():
    device = 'cuda'
print(f"Using '{device}' device")

Using 'cuda' device


## Task 3.3.1: Getting the model and `Trainer` setup

Now it's time to bring in the model. Just like with the tokenizer, we can use Huggingface's `AutoModelFor[TASK_TYPE]` to load the model for the appropriate task type. In this case, we want to classify a sequence of tokens, so we'll use the `AutoModelForSequenceClassification` class, but there are many other options after "For" that you can check out (e.g., for your course project).

In this case, we need to specify how many classes/labels there are in our data.
Calculate that from the data above and provide it as an argument when loading the parameters for the pre-trained model we specified.

In [11]:
# TODO: Load Pre-trained model from HuggingFace Model Hub
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(label2id))

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
## Let's see how many parameters we are going to be changing

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

print_trainable_parameters(model)

trainable params: 108311810 || all params: 108311810 || trainable%: 100.0


## Task 3.3.2 Setting up the training arguments

There are a lot of options when you fine-tune a model! In homework 2, you saw a few of these when we set the learning rate and number of epochs. When using the huggingface `Trainer` class, we specify all of the training arguments at once in a [`TrainingArguments`](https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/trainer#transformers.TrainingArguments) object. There are a _lot_ of arguments you can specify and for the most part, you do not need to set all of them. Reading that class's documentation can be overwhelming, so don't worry if you don't know what all of them mean.

Here are a few useful arguments that you'll need to set:
- `output_dir` - where to save the models. This directory can get very large if you save all the checkpoints!
- `overwrite_output_dir` - whether to overwrite the previously saved models
- `learning_rate` - the learning rate for the optimizer. Use 2e-5.
- `per_device_train_batch_size` - how many items per batch. You usually want this to be a power of 2 due to how computers work. Common sizes with GPUs are between 64 and 256, but it will depend a lot on how much memory the GPU has (and how big your sequences are). 
- `per_device_eval_batch_size` - same as above, but because you're not doing gradient descent on these (just eval), there's less "stuff" needed in memory and this can be a bit larger.
- `do_eval` - whether to evaluate on the development/validation data periodically during training
- `seed` - the random seed to use. Use 12345
- `evaluation_strategy` - when to evaluate the model during training. "epoch" evaluates after the end of every epoch, while "steps" evaluates every `eval_steps`. If you have a very large dataset, you probably want to use "steps" so that you can get periodic updates on how the model is doing. Even though our dataset is relatively small, let's use "steps".
- `eval_steps` - how many steps between an evaluation on the dev data. For this assignment, use 50 so we can see how our model trains. In real-world scenarios, you'll often have this larger (e.g., 1000) so that you're not spending more of your GPU time evaluating instead of training
- `save_strategy` - how often to save your model's parameters during training. This is either "epoch" or "steps", where `save_steps` is used. The logic for setting the argument is similar to that for `evaluation_strategy`, so use "steps" here as well.
- `save_steps` - how many steps between saving the model's parameters. Typically you set this the same as `eval_steps` which we'll do here.
- `num_train_epochs` - how many epochs to train for. Use 10 for now.
- `logging_dir` - where to save the log files.
- `load_best_model_at_end` -  Whether or not to load the best model found during training at the end of training. When this option is enabled, the best checkpoint will always be saved. This is kind of a sneakily-important argument. If you set this to `True`, the `Trainer` will automatically keep track of what the best model is so far (checked at every `save_strategy`) so you always have a copy of the parameters on disk in the checkpoint file for the best version. If you don't set this, you might keep training and never save your best model! Set this to `True`.
- `metric_for_best_model` - another important argument: this says how we should define our "best" model in terms of a metric. We could use the loss with "loss", but since we have an evaluation dataset, we'll choose based on classification performance on that dataset. When looking at the metrics on the dev/evaluation/validation dataset, all of the metrics get prefixed with "eval_". In this case, use "eval_f1".
- `greater_is_better` - if we're setting `metric_for_best_model` we need to tell the `Trainer` which direction is better, e.g., lower is better for "loss" but greater is better for metrics like "f1". 
- `report_to` - the `Trainer` code is hooked into common logging libraries. We'll use `wandb` like in Homework 2. You might not even need to do anything for it to log but you'll need to make sure you can get plots showing up on Weights & Biases for the homework.
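
To get a feel for how the batch size, epochs, and `eval_steps` interact, here is a small back-of-the-envelope calculation. It uses the 5,393 training rows in this dataset and assumes a batch size of 64 on a single GPU with no gradient accumulation.

```python
import math

num_train_examples = 5393          # rows in the training split
per_device_train_batch_size = 64   # assumed batch size
num_train_epochs = 10
eval_steps = 50

# Assumes one GPU, no gradient accumulation, and that the last partial batch is kept.
steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
num_evaluations = total_steps // eval_steps

print(steps_per_epoch)   # 85
print(total_steps)       # 850
print(num_evaluations)   # 17
```

Under these assumptions, training takes 850 optimizer steps and evaluates (and saves a checkpoint) 17 times, which is why `output_dir` can grow quickly.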

Here are a few useful arguments you won't need here but you might want to try or explore later:
- `fp16` - Most floating-point computation is done with 32 bits. However, some modern GPUs and even CPUs can support floating-point operations with fewer bits. These operations are faster, though less precise. Because of the sheer number of calculations, it's often useful to prioritize speed and you can turn on 16-bit floating point by setting `fp16=True`. There are a bunch of other options like this if you're curious (good office hours discussion too).
- `gradient_accumulation_steps` - For complex NLP tasks, sometimes we only have enough GPU memory for very small batches (e.g., 2 or 4 items). However, we can simulate big batches by asking the trainer to _accumulate_ (i.e., sum) the gradients across batches before taking the update step, which allows us to have arbitrarily large "accumulated batches". 
- `lr_scheduler_type` - In our previous homework, our SGD/AdamW optimizers all used the same learning rate at each step. But our model gets better every step, so sometimes we might want to make smaller updates so we don't change too much and make the model worse. This argument lets you choose different learning rate [schedulers](https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/optimizer_schedules#transformers.SchedulerType) that change the learning rate over the course of training.
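
The idea behind gradient accumulation can be checked with a toy example. The sketch below uses a made-up one-parameter model `y = w * x` with squared error, and a hypothetical helper `grad_sum`. It only shows that summing gradients over equal-sized micro-batches gives the same result as one large batch; it is not how the `Trainer` implements accumulation internally.

```python
def grad_sum(w, xs, ys):
    """Sum (not mean) of squared-error gradients for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))

# Made-up data, roughly following y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

# One big batch of 8 examples
full_batch_grad = grad_sum(w, xs, ys) / len(xs)

# Four micro-batches of 2 examples, accumulated before a single update
micro_batch_size = 2
accumulated = 0.0
for start in range(0, len(xs), micro_batch_size):
    end = start + micro_batch_size
    accumulated += grad_sum(w, xs[start:end], ys[start:end])
accumulated_grad = accumulated / len(xs)

print(full_batch_grad, accumulated_grad)
```

The two gradients agree up to floating-point rounding, so the update is the same even though only two examples need to be in memory at a time.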


In [13]:
# NOTE: when debugging the evaluation code, feel free to turn down the eval_steps
# to a small number so that training evaluates right away.

training_args = TrainingArguments(
    output_dir="/nfs/turbo/coe-mihalcea/longju/SI630HW3",  # Directory for saving models
    overwrite_output_dir=True,  # Overwrite the content of the output directory
    learning_rate=2e-5,  # Learning rate
    per_device_train_batch_size=64,  # Batch size for training
    per_device_eval_batch_size=64,  # Batch size for evaluation
    do_eval=True,  # Perform evaluation during training
    seed=12345,  # Random seed
    evaluation_strategy="steps",  # Evaluate every `eval_steps`
    eval_steps=50,  # Number of steps to run evaluation
    save_strategy="steps",  # Save the model every `save_steps`
    save_steps=50,  # Number of steps to save the model
    num_train_epochs=10,  # Number of training epochs
    logging_dir='./logs',  # Directory for storing logs
    load_best_model_at_end=True,  # Load the best model at the end
    metric_for_best_model="eval_f1",  # Use eval_f1 to evaluate the best model
    greater_is_better=True,  # Higher eval_f1 indicates a better model
    report_to="wandb"  # Use Weights & Biases for logging
)

print(training_args)

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=50,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=True,
group_by_le

## Task 3.3.3 Defining some evaluation metrics

How good is our model? We need to give the `Trainer` a function that, given some predictions, evaluates how good they are. When the `Trainer` instance calls this function, it passes in a tuple that contains the logits of the model's predictions along with the true labels. The logits are the unnormalized scores for each of the labels (the logit is the inverse function of the sigmoid). We _could_ normalize these with a softmax, but instead we can simply figure out which dimension's value is largest and say that dimension is the predicted label.
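
The claim that we can skip the softmax is easy to check. In this plain-Python sketch, the logits are made up and `softmax` and `argmax` are small helper functions written for illustration:

```python
import math

def softmax(logits):
    """Normalize logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Made-up logits for three instances and two classes
batch_logits = [[0.4, -1.2], [-2.0, 1.1], [0.3, 0.2]]

for logits in batch_logits:
    probs = softmax(logits)
    # Softmax preserves the ordering of the logits, so both pick the same class
    assert argmax(logits) == argmax(probs)
    print(argmax(logits), [round(p, 3) for p in probs])
```

Since the ordering never changes, taking the argmax of the raw logits is enough when all we need is the predicted class.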

The `compute_metrics` function can return a dictionary that maps a metric name to its value. This will let us track multiple metrics over time. All of these metrics also get recorded with `wandb` by `Trainer`, so we'll see how the model trains. **Important Note:** When the `Trainer` class evaluates on the development data, the key names for the metrics get prefixed with "eval_", so if we reported a dictionary with "f1" as a key, we'd see a corresponding metric of "eval_f1" in our logs.

For our model, we'll compute binary F1. This only keeps track of the positive class, which is appropriate in our case where we want to know whether the model is good at finding the positive class in terms of its precision and recall. Use `sklearn` to calculate these.
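
To see what "only keeps track of the positive class" means, here is a hand-rolled sketch of the three metrics on toy data. The helper `binary_prf` is hypothetical and only meant to show the arithmetic; the assignment itself should use `sklearn`.

```python
def binary_prf(labels, predictions):
    """Precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy data: 4 positives; the model finds 3 of them and raises 1 false alarm
labels      = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
print(binary_prf(labels, predictions))  # (0.75, 0.75, 0.75)
```

Note that the three correctly rejected negatives never enter the calculation: a model could get many negatives right and still score poorly if it misses the positives.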

**NOTE:** Earlier we had to convert all the labels/classes into IDs starting from 0 and this code helps explain why--each of the classes has its own dimension!

**NOTE:** The labels we get in the `eval_pred` tuple are the same labels that we specified in our dataset. `Trainer` looks for this column so that it can pass it through the `AutoModel` and have it reported here!

In [14]:
# Define the metric to use for evaluation

def compute_metrics(eval_pred: EvalPrediction):
    # TODO: 
    # 1. Get the logits and labels from the eval_pred
    # 2. Compute the predictions from the logits
    # 3. Calculate binary precision, recall, and F1
    # 4. Return the values as a dictionary with key names for
    #    indicating the metric
    
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    
    # Calculate precision, recall, and F1 score
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='binary')
    
    # Return the metrics as a dictionary
    return {
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

## Task 3.3.4: Setup the `Trainer`

Finally! Let's specify the `Trainer` that's going to run the training. Most of our arguments and hyperparameters have already been specified in the `TrainingArguments` but there are still a few things we need to specify:

- `model` - A pre-trained model to fine-tune
- `args` - the `TrainingArguments` we just defined
- `train_dataset` - the training portion of the tokenized dataset
- `eval_dataset` - the portion of the tokenized dataset that we'll use during training to evaluate 
- `tokenizer` - the tokenizer model used to turn text into sequences
- `compute_metrics` - the `compute_metrics` function we just defined.


In [15]:
# TODO: Fill in the Trainer object's arguments
trainer = Trainer(
    model=model,  # The pre-trained model
    args=training_args,  # The TrainingArguments we defined earlier
    train_dataset=tokenized_datasets["train"],  # The training dataset
    eval_dataset=tokenized_datasets["validation"],  # The evaluation dataset
    tokenizer=tokenizer,  # The tokenizer
    compute_metrics=compute_metrics  # The metric function
)

In [16]:
# Now let's train!
trainer.train()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: longju. Use `wandb login --relogin` to force relogin


| Step | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 50 | No log | 0.562582 | 0.617391 | 0.748353 | 0.676593 |
| 100 | No log | 0.507099 | 0.731672 | 0.657444 | 0.692575 |
| 150 | No log | 0.482514 | 0.691228 | 0.778656 | 0.732342 |
| 200 | No log | 0.506046 | 0.683223 | 0.815547 | 0.743544 |
| 250 | No log | 0.493626 | 0.710828 | 0.735178 | 0.722798 |
| 300 | No log | 0.579007 | 0.77237 | 0.648221 | 0.704871 |
| 350 | No log | 0.600184 | 0.679669 | 0.757576 | 0.716511 |
| 400 | No log | 0.658995 | 0.765203 | 0.596838 | 0.670614 |
| 450 | No log | 0.794205 | 0.745033 | 0.592885 | 0.660308 |
| 500 | 0.312100 | 0.804946 | 0.740219 | 0.623188 | 0.676681 |


TrainOutput(global_step=850, training_loss=0.20993914211497589, metrics={'train_runtime': 523.1698, 'train_samples_per_second': 103.083, 'train_steps_per_second': 1.625, 'total_flos': 4794525789634200.0, 'train_loss': 0.20993914211497589, 'epoch': 10.0})

In [18]:
# Once we finish training, we can evaluate the model on the dev set. Note that
# since we specified the trainer to load the best model at the end, the 
# trainer will automatically load the best model for us to use here.
evaluation_results = trainer.evaluate()
print(evaluation_results)

{'eval_loss': 0.506045937538147, 'eval_precision': 0.6832229580573952, 'eval_recall': 0.8155467720685112, 'eval_f1': 0.7435435435435436, 'eval_runtime': 3.1623, 'eval_samples_per_second': 599.562, 'eval_steps_per_second': 9.487, 'epoch': 10.0}
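
The `eval_f1` and `eval_loss` printed here equal the step-200 row of the training log, which is what we would expect if `load_best_model_at_end` restored the checkpoint with the highest `eval_f1`. Below is a rough sketch of that selection using only the rows visible in the log above (the log shown stops at step 500). The helper `best_checkpoint` is hypothetical and is not the `Trainer`'s actual implementation.

```python
# (step, eval_f1) pairs copied from the rows of the training log shown above
logged = [
    (50, 0.676593), (100, 0.692575), (150, 0.732342), (200, 0.743544),
    (250, 0.722798), (300, 0.704871), (350, 0.716511), (400, 0.670614),
    (450, 0.660308), (500, 0.676681),
]

def best_checkpoint(history, greater_is_better=True):
    """Pick the (step, metric) pair with the best metric value."""
    pick = max if greater_is_better else min
    return pick(history, key=lambda pair: pair[1])

print(best_checkpoint(logged))  # (200, 0.743544)
```

Setting `greater_is_better=False` would instead pick the lowest value, which is the direction you would want for a metric like loss.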


In [19]:
# TODO: Use the trainer to predict() on the test set and then score the predictions
test_preds = trainer.predict(tokenized_datasets["test"])

In [22]:
test_preds

PredictionOutput(predictions=array([[-0.33226264, -0.76825434],
       [ 1.518177  , -2.2027586 ],
       [-2.2347767 ,  0.9374685 ],
       ...,
       [-0.23518634, -0.8393739 ],
       [ 1.370341  , -1.9539418 ],
       [-1.5006099 ,  0.8468643 ]], dtype=float32), label_ids=array([1, 0, 1, ..., 0, 0, 1]), metrics={'test_loss': 0.46613338589668274, 'test_precision': 0.6413199426111909, 'test_recall': 0.8324022346368715, 'test_f1': 0.7244732576985412, 'test_runtime': 2.693, 'test_samples_per_second': 585.219, 'test_steps_per_second': 9.283})
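
To score predictions like these, take the argmax over each row of logits and compare it with `label_ids`. The sketch below does this for just the six rows visible in the truncated output above, assuming the first three and last three labels line up with the first three and last three logit rows. The same idea applies to the full `predictions` and `label_ids` arrays.

```python
# The six rows visible in the truncated output above: (logits, true label)
visible = [
    ([-0.33226264, -0.76825434], 1),
    ([1.518177, -2.2027586], 0),
    ([-2.2347767, 0.9374685], 1),
    ([-0.23518634, -0.8393739], 0),
    ([1.370341, -1.9539418], 0),
    ([-1.5006099, 0.8468643], 1),
]

correct = 0
for logits, label in visible:
    predicted = logits.index(max(logits))  # argmax over the two classes
    correct += int(predicted == label)
    print(predicted, label)

print(f"{correct} of {len(visible)} visible rows are predicted correctly")
```

Six rows are far too few to say anything about the model; the `test_precision`, `test_recall`, and `test_f1` entries in `metrics` summarize the whole test set.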