# Exercise "Natural Language Processing" -- Text Classification with Huggingface

For this course, save a COPY to your Google Drive for the tutorial (File -> Save copy in Drive). Then complete the tasks in your saved copy. If you're done, submit your notebook with the name `{yourFirstName_yourLastName}.ipynb` via moodle **by sharing a link** with the full (i.e. read + write) permissions. We will do the grading in the Colab Notebook copy. Please avoid uploading `ipynb` files and rather, share the link to your notebook with us.

This is an individual assignment, i.e., submit your solutions individually.

This assignment is **mandatory for participation in the exam**. You are required to obtain at least 50% of the total points in this assignment to become eligible for participating in the final exam.


**Due date: 15.06.2023, 9:15 a.m.(CEST)**

----

In this assignment we will revisit the text classification task from the previous assignment but we will solve it using the Transformer model architecture implemented with the HuggingFace `transformers` library.

In particular, we will:
- Take a look at the Transformer model architecture
- Define data loading and processing pipelines for the `AG_NEWS` dataset
- Train a BERT-style Transformer model for classifiying the news articles in `AG_NEWS`
- Use Transfer Learning to boost performance on the text classification task





First, we need to install a few packages and we are ready to go.

In [19]:
!pip install -q transformers datasets evaluate accelerate scikit-learn

## Background on Transformer Models

In this assignment, we will use the Transformer neural network architecture. This architecture is much more powerful than the simple MLP from last assignment!


You might find different styles of Transformer models online. For this assignment, we will use BERT-style Transformer models (sometimes also called "encoder-only" Transformers). 

Developing an intuitive understanding of how Transformers work is important but not trivial. Besides the lecture content, we recommend this excellent write-up for Transformers: https://jalammar.github.io/illustrated-transformer/.

Now, let's take a look at single Transformer model. We will use the `transformers` library by HuggingFace, which makes training Transformer models really easy and convenient (and is used by many researchers around the world).

**Task 1 (1 point)**: Load and instantiate the `distilroberta-base` model with `transformers`; then print the model object using Python's native `print` function. 

`distilroberta-base` is a BERT-style Transformer model derived from the RoBERTa architecture (Robustly optimized BERT Pretraining approach). As you can see, NLP researchers like puns. It has been "distilled", e.g. approximated using a smaller model size. We are using it so that training using Colab's single GPU is faster, since the model is smaller than the original variants.

HINT: `transformers.AutoModel.from_pretrained()`

In [20]:
from transformers import AutoModel # hint

model = AutoModel.from_pretrained("distilroberta-base")
print(model)

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(50265, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0-5): 6 x RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout)

The `print` statement should have given you a nice overview over the entire RoBERTa Transformer model. You should be able to identify the layers that are responsible for the attention mechanism. RoBERTa is a highly optimized model architecture (although nowadays even better architectures exist). That is why there are many different modules, it is okay if you do not understand the role of every last one for now.

In the `embedding` module of `distilroberta-base`, you should also be able to see a module called `position_embeddings`.



**Task 2 (2 points):** What is the role of the positional embedding in BERT-style Transformer models. Why is it necessary? (roughly 2-4 sentences)

In [21]:
# What is the role of the positional embedding in BERT-style Transformer models. Why is it necessary? (roughly 2-4 sentences)# Text answer below

# The positional embedding is necessary because the model does not understand the order of words. It is a vector that is added to the word embedding to give the model information about the (relative) position of the word in the sentence.
# This is necessary because the same word can have different meanings depending on its position in the sentence; e.g. "The leaves fall in fall"

**Task 3 (2 points)**: What is the role of the attention mechanism in BERT-style Transformer models? (roughly 2-4 sentences)

In [22]:
# What is the role of the attention mechanism in BERT-style Transformer models? (roughly 2-4 sentences)
# Text answer below

# The attention mechanism in a model gives contextual information to each word by looking other words in the sentence. It gives more weight to words that are more relevant to the word it is looking at; e.g. "The apple tastes good because it is ripe." The model will know that "it" refers to "apple" because of the attention mechanism.

## Data Pipeline

Now, let's prepare for training our own Transformer model. As always, we need to set up a data processing pipeline.

We will still use the `AG_NEWS` dataset. Last time, we used `torch.datasets` to load `AG_NEWS`. Besides `transformers`, HuggingFace also has a library called `datasets` that is widely used and supports most popular (and even unpopular) datasets. So this time, we will use `datasets` to load `AG_NEWS`!


**Task 4 (3 points)**: 
- Load the `AG_NEWS` dataset using HuggingFace `datasets`.
- Transformer models need a lot of compute and using the entire dataset will take too much of your time for this assignment. Shuffle the train split of the dataset and select only the first `4000` samples after shuffling; discard the rest.
- We still need to create a dev split. Take a random sample of 10% of the samples from the (reduced) train set (hint: `datasets.train_test_split`). For compatibility with `transformers`, the dev split should be assigned under the `"val"` keyword (for validation). Background: validation and dev split are two names for the same thing. There is a fierce debate between ML researchers over which one is correct.

Use a fixed random seed of 42 whenever random number generation is involved to ensure reproducibility. 



In [24]:
from datasets import load_dataset  # hint
import torch

SEED = 42

# Wow, we already did it for you!
dataset = load_dataset("ag_news")
dataset = dataset.shuffle(SEED)

# TODO: select the first 4000 samples from the training set to make the dataset manageable for this assignment

dataset["train"] = dataset["train"].select(range(4000))


assert len(dataset["train"]) == 4000, f'you should have 4000 samples but you have {len(dataset["train"])} samples'

# TODO: generate the validation set from the training data.
# Assume the training set was not shuffled, perform a shuffle before splitting. This is good practice, so we practice it here as well.
# hint: you can use the train_test_split method from HF. Remember the seed!

temp = dataset["train"].train_test_split(test_size=0.1, shuffle=True, seed=SEED)
dataset["train"] = temp["train"]
dataset["val"] = temp["test"]

assert len(dataset["test"]) == 7600, f'test set should not be touched but you have {len(dataset["test"])} samples'
assert len(dataset["train"]) == 3600, f'train set should be 3600 samples but you have {len(dataset["train"])} samples'
assert len(dataset["val"]) == 400, f'val set should be 400 samples but you have {len(dataset["val"])} samples'

Found cached dataset ag_news (/home/jerome/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548)
100%|██████████| 2/2 [00:00<00:00, 1127.05it/s]
Loading cached shuffled indices for dataset at /home/jerome/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548/cache-dd0ff9596fea92b0.arrow
Loading cached shuffled indices for dataset at /home/jerome/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548/cache-12f3c4e4bf422cce.arrow
Loading cached split indices for dataset at /home/jerome/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548/cache-b0522893c60a5a50.arrow and /home/jerome/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548/cache-291482d2afab481f.arrow


**Task 5 (2 points):** For each split, calculate the distribution of class labels and print them (e.g. train: 0 ->  0.23 % | 1 ->  0.27 % | 2 ->  0.24 % | 3 ->  0.26%). Are the labels (roughly) balanced? What could we do to guarantee a balanced random split?

In [45]:
import torch
import numpy as np

# TODO: for each split, calculate the class label distribution and print it

for split in ["train", "val", "test"]:
    print(split)
    for label_index in range(4):
        print(f"{label_index} -> {np.sum(np.array(dataset[split]['label']) == label_index) / len(dataset[split]['label']) * 100:.2f} %")



# EXTRA TODO: store the number of class labels in `num_labels` (we will use it later)
num_labels: int = 4

train
0 -> 25.36 %
1 -> 25.67 %
2 -> 23.19 %
3 -> 25.78 %
val
0 -> 23.50 %
1 -> 19.00 %
2 -> 26.75 %
3 -> 30.75 %
test
0 -> 25.00 %
1 -> 25.00 %
2 -> 25.00 %
3 -> 25.00 %


In [26]:
# Are the labels (roughly) balanced? What could we do to guarantee a balanced random split?

# The labels are roughly balanced for the train and test set, but not for the validation set. We could use stratified sampling to guarantee a balanced random split (stratify_by_column for the HF train_test_split method).

**Task 6 (7 points)**: 

Similarly to the assignment-3, write a *preprocess_function* which:
- tokenizes the text with the DistilRoBERTa tokenizer (hint: `transformers.AutoTokenizer.from_pretrained()`)
- truncates the result to a maximum of max_sequence_length tokens, if longer
- pads the result to max_sequence_length using the padding token from DistilRoBERTa
- converts all tokens into token IDs (expected output: list of integers)

HINT: You can implement all these things manually, but they can also be done in a single line using the tokenizer implementation from HuggingFace.

Finally call this function for each sample of your train, test and validation set using the `map` method from the HuggingFace `datasets` library.





In [52]:
from transformers import AutoTokenizer  # hint
from typing import Dict, Any
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")


# TODO: define a preprocessing function to tokenize a sample
def preprocess_function(sample: Dict[str, Any], seq_len: int):
    """
    Function applied to all the examples in the Dataset (individually or in batches). 
    It accepts as input a sample as a dictionary and return a new dictionary with the BERT tokens for that sample

    Args:
        sample Dict[str, Any]:
            Dictionary of sample
            
    Returns:
        Dict: Dictionary of tokenized sample in the following style:
        {
          "input_ids": list[int] # token ids
          "attention_mask": list[int] # Mask for self-attention (padding tokens are ignored).
        }
        Hint: if your are using the Huggingface tokenizer implementation, this is the default output format but check it yourself to be sure!
    """
    Dict = tokenizer(sample["text"], truncation=True, padding="max_length", max_length=seq_len)
    return Dict


# TEST for truncation
mock_example_long = Dataset.from_list([{"text": ("lorem ipsum dolar sonet " * 10_000) }]).map(
    preprocess_function, batched=True, fn_kwargs={"seq_len": 256}
)
assert len(mock_example_long[0]["input_ids"]) == 256

# TEST for padding
mock_example_short = Dataset.from_list([{"text": ("lorem ipsum dolar sonet " * 2) }]).map(
    preprocess_function, batched=True, fn_kwargs={"seq_len": 256}
)
assert len(mock_example_short[0]["input_ids"]) == 256
assert mock_example_short[0]["input_ids"][-1] == tokenizer.pad_token_id

# TODO: use the `map` function to tokenize your dataset. store the results in `encoded_ds`
encoded_ds = dataset.map(preprocess_function, batched=True, fn_kwargs={"seq_len": 256})


# TEST  dataset
assert len(encoded_ds["train"][0]["input_ids"]) == 256
assert len(encoded_ds["val"][0]["input_ids"]) == 256
assert len(encoded_ds["test"][0]["input_ids"]) == 256
assert len(encoded_ds["train"][50]["input_ids"]) == 256

Loading cached processed dataset at /home/jerome/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548/cache-8a8c13ff1774fe94.arrow
Loading cached processed dataset at /home/jerome/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548/cache-8c2a838058f67ead.arrow
Loading cached processed dataset at /home/jerome/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548/cache-1d99c0de6612b0e0.arrow


**Task 7 (3 points)**: Define a function to evaluate the model during training. The function is automatically called by the trainer on the validation or test set, it must take an `EvalPrediction` object 
(see https://huggingface.co/docs/transformers/main/en/internal/trainer_utils#transformers.EvalPrediction) 
and return a dictionary `dict[str, float]` mapping metric names (`str`) to metric values (`float`).


HINT: Again, HuggingFace has a convenient library for evaluation called `evaluate`.


In [78]:
import evaluate # hint
import numpy as np
from transformers import EvalPrediction


# TODO
def compute_metrics(eval_pred: EvalPrediction):
    """
    The function that will be used to compute metrics at evaluation. Must take a EvalPrediction and return a dictionary string to metric values.

    Args:
        eval_pred EvalPrediction:
            Evaluation output (always contains labels), to be used to compute metrics.
            It has one Numpy array with predictions and one with labels.
            
    Returns:
        Dict: Dictionary of metric values
    """
    predictions = np.argmax(eval_pred.predictions, axis=1)
    labels = eval_pred.label_ids
    precision_metric = evaluate.load("precision")
    recall_metric = evaluate.load("recall")
    f1_metric = evaluate.load("f1")
    accuracy_metric = evaluate.load("accuracy")

    precision = precision_metric.compute(predictions=predictions, references=labels, average="macro")
    recall = recall_metric.compute(predictions=predictions, references=labels, average="macro")
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="macro")
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    Dict = {}
    Dict.update(precision)
    Dict.update(recall)
    Dict.update(f1)
    Dict.update(accuracy)
    return Dict


# TEST
predictions = np.array([[0.7, 0.3, 0.9]])
labels = np.array([2])
eval_pred = EvalPrediction(predictions=predictions, label_ids=labels)
assert compute_metrics(eval_pred)["accuracy"] == 1

predictions = np.array([[0.7, 0.3, 0.9], [0.9, 0.1, 0.1]])
labels = np.array([2, 1])
eval_pred = EvalPrediction(predictions=predictions, label_ids=labels)
assert compute_metrics(eval_pred)["accuracy"] == 0.5

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


# Training from scratch

Now, we will start training our first Transformer model. We will begin by training a model **from scratch**, e.g. from randomly initialized weights.

In the last assignment, we coded the entire training loop from scratch as well. In practice, it is often easier to use pre-implemented `Trainer` classes instead. Suprise: Huggingface provides such a `Trainer` implementation. 

The Huggingface `Trainer` abstracts away a lot of the complexity of training models! But to really understand what's going on and debug errors, it is crucial to know what's happening under the hood. 

**Task 8 (10 points)**: 
- Initialize the `distilroberta-base` model from HuggingFace **with random weights** and train it from scratch for 5 epochs. Use the `Trainer` class from HuggingFace for the training loop implementation and `AutoModelForSequenceClassification` instead of `AutoModel` to load the model. HINT: You definitely want to train on GPU this time, otherwise it will be very slow. 

* Why is it important to use `AutoModelForSequenceClassification` instead of `AutoModel`? How is the Transformer model architecture modified to predict the class labels? Describe in 3-6 sentences (rough guideline). HINT: You will need to resarch a bit on your own for this.


You do not have to reach a specific perofrmance goal for this task. It is rather about building an understanding of how to work with Transformer models. An accuracy of below 60% after 5 epochs means things are probably not working as intended.

In [79]:
from transformers import AutoModelForSequenceClassification, AutoConfig  # hint
from transformers import TrainingArguments, Trainer  # hint

# TODO: Create `TrainingArguments`. You can mostly use default values, but setting `per_device_train_batch_size` and `per_device_eval_batch_size` to 32 and a learning rate of 2e-5 worked well for us.
# Set arguments to do the following: Evaluate and save a checkpoint every epoch. At the end of training, load the weights of the best checkpoint (measured by loss on the validation set). Set the seed to 1944 (Hasso's birthyear).
# Don't forget to set the number of training epochs!

TrainingArguments = TrainingArguments(
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    output_dir="./results_scratch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    seed=1944,
    num_train_epochs=5,
)

# TODO: Load the model with *random weights*. HINT: number of class labels! HINT2: AutoConfig.from_pretrained and AutoModelForSequenceClassification.from_config
# You'll get some warnings here, but usually they can just be ignored / are expected if you read them exactly

config = AutoConfig.from_pretrained("distilroberta-base", num_labels=num_labels)
model = AutoModelForSequenceClassification.from_config(config)

# TODO: Initialize the `Trainer` and start training! Don't forget passing the `compute_metrics` method!

trainer = Trainer(
    model=model,
    args=TrainingArguments,
    train_dataset=encoded_ds["train"],
    eval_dataset=encoded_ds["val"],
    compute_metrics=compute_metrics,
)

trainer.train()

# TODO: Final evaluation after training

trainer.evaluate()



Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.386652,0.297859,0.257979,0.095802,0.1975
2,No log,1.306614,0.534229,0.425887,0.335615,0.4275
3,No log,0.936076,0.625624,0.641894,0.626027,0.6225
4,No log,0.830387,0.698413,0.670678,0.673123,0.665
5,1.147100,0.791492,0.692867,0.704645,0.696616,0.69


  _warn_prf(average, modifier, msg_start, len(result))


{'eval_loss': 0.7914922833442688,
 'eval_precision': 0.692866589083433,
 'eval_recall': 0.7046449484730115,
 'eval_f1': 0.6966160073789975,
 'eval_accuracy': 0.69,
 'eval_runtime': 7.9519,
 'eval_samples_per_second': 50.303,
 'eval_steps_per_second': 1.635,
 'epoch': 5.0}

In [None]:
# Answer: AutoModelForSequenceClassification vs. AutoModel

# AutoModelForSequenceClassification is, as the name says, specifically designed and optimized for sequence classification. It has a classification head on top of the encoder that maps the encoded sequence representation to the number of output classes. The activation function used in the head, for example softmax, helps with the clasification task.

# Transfer Learning: Finetuning a pretrained model

Nowadays, barely anyone trains a Transformer model from scratch for NLP. Instead, we initialize our model with **pretrained weights**. Usually these weights have been trained with massive amounts of data and compute and publicly released by big players like Google, Meta, Microsoft, etc... or SAP ;).

HuggingFace `transformers` makes loading pretrained weights very easy with the `AutoModelFor<TaskDescription>.from_pretrained(<model-name>)` method.

**Task 9 (5 points)**:
* Load `distilroberta-base` with **pretrained weights** and finetune the model on our task for 5 epochs. Set a different `output_dir` than for the previous training. Otherwise, use the same `TrainingArguments` as before.
* Which weights of the model were initialized from pretrained weights and which weights were still randomly initialized?
* Briefly describe the differences you observe to training from scratch. What might be the reasons for these differences? (roughly 3-6 sentences).


Again, you do not have to hit specific performance goals here.

In [80]:
from transformers import TrainingArguments, Trainer, BertForSequenceClassification  # hint

# TODO: Create `TrainingArguments`

TrainingArguments = TrainingArguments(
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    output_dir="./results_pretrained",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    seed=1944,
    num_train_epochs=5,
)

# TODO: Load the model with pretrained weights 

model = BertForSequenceClassification.from_pretrained("distilroberta-base", num_labels=num_labels)

# TODO: Initialize the `Trainer` and start training!

trainer = Trainer(
    model=model,
    args=TrainingArguments,
    train_dataset=encoded_ds["train"],
    eval_dataset=encoded_ds["val"],
    compute_metrics=compute_metrics,
)
trainer.train()

# TODO: Final evaluation after training

trainer.evaluate()

You are using a model of type roberta to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at distilroberta-base were not used when initializing BertForSequenceClassification: ['roberta.encoder.layer.2.attention.self.query.bias', 'roberta.encoder.layer.2.intermediate.dense.bias', 'roberta.encoder.layer.1.attention.output.dense.weight', 'roberta.encoder.layer.3.output.dense.weight', 'roberta.embeddings.LayerNorm.weight', 'roberta.encoder.layer.5.attention.output.dense.weight', 'roberta.encoder.layer.5.output.dense.weight', 'roberta.encoder.layer.2.intermediate.dense.weight', 'roberta.encoder.layer.3.output.LayerNorm.weight', 'roberta.encoder.layer.5.attention.output.LayerNorm.weight', 'roberta.encoder.layer.1.attention.self.value.bias', 'roberta.encoder.layer.4.attention.self.value.weight', 'roberta.encoder.layer.3.attention.self.value.weight', 'roberta.encoder.layer.2.output.dense.bias', '

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.387412,0.160794,0.331817,0.187244,0.2675
2,No log,1.20958,0.630941,0.528605,0.544544,0.545
3,No log,0.889476,0.65587,0.664619,0.652405,0.6475
4,No log,0.801108,0.727384,0.693891,0.691628,0.685
5,1.091500,0.743776,0.723812,0.730235,0.723373,0.7175


  _warn_prf(average, modifier, msg_start, len(result))


{'eval_loss': 0.7437756061553955,
 'eval_precision': 0.7238121024281942,
 'eval_recall': 0.7302346486654681,
 'eval_f1': 0.7233728751380412,
 'eval_accuracy': 0.7175,
 'eval_runtime': 6.1758,
 'eval_samples_per_second': 64.769,
 'eval_steps_per_second': 2.105,
 'epoch': 5.0}

In [None]:
# Text Answers
# randomly initialized weights: the weights of the classification
# pretrained weights: the weights of the encoding

# Using pretrained weights we start (and end) with higher accuracy than training from scratch. This might be because the pretrained weights already contain a lot of information about the language which was probably learned on a big number of texts. Still, we need to train the classification head so that the model learns to classify the texts correctly for this specific task.

# Testing

**Task 10 (1 point)**: Evaluate the trained model from Task 9 (finetuning a pretrained model) on the test set. Is there a performance gap between validation and test set? 

In [81]:
# TODO: evaluate on testset

trainer.evaluate(encoded_ds["test"])

{'eval_loss': 0.7198035717010498,
 'eval_precision': 0.7220683785668043,
 'eval_recall': 0.7213157894736841,
 'eval_f1': 0.719821669580131,
 'eval_accuracy': 0.7213157894736842,
 'eval_runtime': 13.323,
 'eval_samples_per_second': 570.442,
 'eval_steps_per_second': 17.864,
 'epoch': 5.0}

In [None]:
# Answer: Is there a performance gap between validation and test set?

# Indeed there is a performance gap between the validation and test set, and the model performs better on the latter. This might stem from the fact that all the way in the beginning we realized that the validation set specifically was not balanced. Maybe the model has an easier time with a more equal distribution of class labels.