In [None]:
# common imports
import numpy as np
import os
import random

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

# Path variables
PROJECT_ROOT_DIR = "."
MODEL_PATH = os.path.join(PROJECT_ROOT_DIR, "models")
DATA_PATH = os.path.join(PROJECT_ROOT_DIR, "data")
logging_dir = os.path.join(PROJECT_ROOT_DIR, "my_logs_3")
os.makedirs(MODEL_PATH, exist_ok=True)
os.makedirs(DATA_PATH, exist_ok=True)

# torch imports
import torch
# import torch.nn as nn
# import torch.utils.data as data
# import torch.nn.functional as F
# import torch.optim as optim


# Tutorial 4: Fine-Tuning a Pretrained Transformer Model with Hugging Face

In this tutorial we use the Hugging Face libraries for sentiment analysis of movie reviews: we take a pretrained transformer model and fine-tune it to classify [Rotten Tomatoes reviews](https://huggingface.co/datasets/rotten_tomatoes) as positive or negative. See the dataset card behind the link for details.

The model we want to use is: 

In [None]:
model_checkpoint = "google-bert/bert-base-cased"

In [None]:
!pip install datasets evaluate transformers

In [None]:
!pip install tf-keras

In [None]:
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")

dataset["train"][0]

Get the text of the review of an instance:

In [None]:
dataset["train"][0]["text"]

## 1. Tokenizer

The first step for any NLP task is to use a tokenizer to transform text into a sequence of integer token IDs.
Because reviews differ in length, you also need strategies for truncation and padding.

Pretrained tokenizers are available in the Hugging Face `transformers` library via [`AutoTokenizer`](https://huggingface.co/docs/transformers/v4.41.0/en/model_doc/auto#transformers.AutoTokenizer).

Find out how to get an instance of the pretrained tokenizer that belongs to the model we want to use.

In [None]:
from transformers import AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Try the tokenizer out on the sentence "I totally loved the movie Harry Potter!" and look at the output.
Then convert the resulting sequence of IDs back into tokens to find out how this tokenizer splits up the sentence!

In [None]:
id_seq = tokenizer("I totally loved the movie Harry Potter!")
print(id_seq)
print(tokenizer.convert_ids_to_tokens(id_seq.input_ids))
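
BERT checkpoints use a WordPiece tokenizer, which splits words that are not in the vocabulary into smaller pieces and marks continuation pieces with a `##` prefix. The following is a minimal sketch of the greedy longest-match-first idea only. `TOY_VOCAB` and `wordpiece_split` are hypothetical names for illustration, and the toy vocabulary is not the real BERT vocabulary, so the actual tokenizer may split the same words differently.

```python
# Toy greedy longest-match-first subword split (WordPiece-style).
# TOY_VOCAB is a hypothetical vocabulary, NOT the real BERT vocabulary.
TOY_VOCAB = {"I", "love", "##d", "Harry", "Pot", "##ter", "!", "[UNK]"}


def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate substring until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # nothing matched: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces


words = ["I", "loved", "Harry", "Potter", "!"]
tokens = [p for w in words for p in wordpiece_split(w, TOY_VOCAB)]
print(tokens)
```

The real tokenizer additionally pre-splits the text on whitespace and punctuation and adds the special tokens `[CLS]` and `[SEP]`, which this sketch leaves out.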


Write a function `tokenize_fct(instances, max_length=512)` that applies the tokenizer to the review text of `instances`, truncates each sequence to `max_length`, pads shorter sequences to `max_length`, and returns the result:

In [None]:
def tokenize_fct(instances, max_length=512):
    model_inputs = tokenizer(
        instances["text"],
        max_length=max_length,
        truncation=True,
        padding='max_length'
    )
    return model_inputs
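
To make the effect of `truncation=True` and `padding='max_length'` concrete, here is a small pure-Python sketch of what happens to a single ID sequence. `pad_or_truncate`, the example IDs, and `pad_id=0` are assumptions made for illustration. The real tokenizer is more careful, for example it keeps the closing special token when it truncates.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    ids = ids[:max_length]                  # truncation: drop everything past max_length
    n_pad = max_length - len(ids)           # number of padding positions still needed
    input_ids = ids + [pad_id] * n_pad      # right-padding with the pad ID
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Made-up ID sequences, one shorter and one longer than max_length.
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 7, 8, 9, 10, 11, 12, 102], max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padding positions, which is why the tokenizer returns it alongside `input_ids`.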

Now apply this function to the entire dataset in batches using [`.map`](https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/main_classes#datasets.Dataset.map) and call the result `tokenized_datasets`.
For other preprocessing operations in Hugging Face `datasets`, see:
https://huggingface.co/docs/datasets/process


In [None]:
tokenized_datasets = dataset.map(tokenize_fct, batched=True)

We select a smaller random subset of the training and test splits to reduce the training time.

In [None]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))

small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [None]:
print(small_train_dataset[0])

Load the pretrained model for sequence classification via [`AutoModelForSequenceClassification`](https://huggingface.co/docs/transformers/v4.41.0/en/model_doc/auto#transformers.AutoModelForSequenceClassification) and specify the number of labels. The Rotten Tomatoes dataset has two labels (negative and positive).

In [None]:
from transformers import AutoModelForSequenceClassification

# rotten_tomatoes is a binary task (negative / positive), so the head needs 2 outputs
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)

We now want to fine-tune this model. Do the following: 
- specify training hyperparameters using [`TrainingArguments`](https://huggingface.co/docs/transformers/v4.41.0/en/main_classes/trainer#transformers.TrainingArguments): set the output directory to the subfolder `finetuned_hf_model` of `MODEL_PATH`, `eval_strategy="epoch"`, `num_train_epochs=20`, and `optim='adamw_torch'`
- load the accuracy metric from the [`evaluate`](https://huggingface.co/docs/evaluate/index) library
- write a function `compute_metrics(eval_pred)` that unpacks `eval_pred` into the prediction logits and the true labels and computes the metric from them. The logits are the outputs of the last classification layer BEFORE the softmax, so the predicted class is the index of the largest logit (argmax).
- create a [`Trainer`](https://huggingface.co/docs/transformers/v4.41.0/en/main_classes/trainer#transformers.Trainer) object `trainer` with all necessary information above 
- and train.

In [None]:
from transformers import TrainingArguments

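# `eval_strategy` is the argument name since transformers v4.41; older releases use `evaluation_strategy`
training_args = TrainingArguments(output_dir=os.path.join(MODEL_PATH, "finetuned_hf_model"), overwrite_output_dir=True, eval_strategy="epoch", num_train_epochs=20, optim='adamw_torch')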

In [None]:
import evaluate

metric = evaluate.load("accuracy")

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
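
The argmax can be taken directly on the logits because the softmax is monotone and does not change which entry is largest. The sketch below checks this on made-up logits and computes the accuracy by hand with NumPy. The arrays `toy_logits` and `toy_labels` and the helper `softmax` are illustrative assumptions, not outputs of the model.

```python
import numpy as np

# Made-up logits for 4 reviews and 2 classes, plus made-up true labels.
toy_logits = np.array([[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 1.1]])
toy_labels = np.array([0, 1, 0, 0])


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


pred_from_logits = np.argmax(toy_logits, axis=-1)
pred_from_probs = np.argmax(softmax(toy_logits), axis=-1)

# Accuracy is the fraction of predictions that match the labels.
accuracy = float((pred_from_logits == toy_labels).mean())
print(pred_from_logits, pred_from_probs, accuracy)
```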

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

Next, evaluate the trained model on the eval_dataset using `trainer`. 

In [None]:
trainer.evaluate()

## Theoretical Questions


**Question 1** 
Describe all three variants of attention used in an encoder-decoder transformer and where in the model each of them is used.

see lecture

**Question 2**
Describe how an autoregressive encoder-decoder architecture works, using the example of a seq-to-seq RNN.

see lecture

**Question 3**
What is positional encoding and why is it necessary? 

see lecture
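
As one possible illustration, the sketch below implements the fixed sinusoidal positional encoding from the original Transformer paper in NumPy. The lecture may use a different variant (BERT, for example, learns its position embeddings). The function name and the assumption that `d_model` is even are choices made for this example.

```python
import numpy as np


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings.

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions use cosine
    return pe


pe = sinusoidal_positional_encoding(seq_len=4, d_model=8)
print(pe.shape)
```

Each position gets a different vector, which is added to the token embedding so that the otherwise order-agnostic attention layers can distinguish positions.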

**Question 4**
What is teacher forcing and when/why is it used?

see lecture
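
A minimal sketch of the data side of teacher forcing: during training the decoder receives the ground-truth target shifted right by one position as input and has to predict the unshifted target. The token strings `<bos>` and `<eos>`, the function name, and the example sentence are hypothetical.

```python
BOS, EOS = "<bos>", "<eos>"  # hypothetical start and end tokens


def teacher_forcing_pairs(target_tokens):
    """Build decoder inputs and decoder targets from a ground-truth sequence."""
    decoder_inputs = [BOS] + target_tokens    # ground truth shifted right by one
    decoder_targets = target_tokens + [EOS]   # what the decoder should predict
    return decoder_inputs, decoder_targets


dec_in, dec_out = teacher_forcing_pairs(["das", "ist", "gut"])
for step, (given, expected) in enumerate(zip(dec_in, dec_out)):
    # At each step the decoder sees the true previous token, not its own prediction.
    print(step, given, "->", expected)
```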

**Question 5**
Describe parametric Cross-Attention. 

see lecture

**Question 6**
Suppose the input of a non-parametric Cross-Attention layer is 20-dimensional, keys are 20-dimensional, and values are 30-dimensional. What is the dimension of the output?

30: each output is a weighted average of the values, so it has the dimension of the values.
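
The dimension can be checked with a small NumPy sketch of non-parametric (projection-free) attention on random data. The numbers of queries and keys and the scaling by the square root of the key dimension are assumptions for this example. Neither affects the output dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
n_queries, n_keys = 3, 5                  # arbitrary sequence lengths

Q = rng.normal(size=(n_queries, 20))      # queries (the layer input), 20-dimensional
K = rng.normal(size=(n_keys, 20))         # keys, 20-dimensional
V = rng.normal(size=(n_keys, 30))         # values, 30-dimensional

# Similarity of every query with every key, shape (n_queries, n_keys).
scores = Q @ K.T / np.sqrt(K.shape[-1])

# Row-wise softmax turns the scores into attention weights.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each output row is a weighted average of the value vectors.
output = weights @ V
print(output.shape)
```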