# Introduction to Natural Language Processing: Assignment 4

In this exercise we'll practice fine-tuning LLMs to predict one or more labels for a given text using Hugging Face and PyTorch.

- You can use any Python package you need.
- Please comment your code.
- Submissions are due Sunday at 23:59 **only** on eCampus: **Assignments >> Student Submissions >> Assignment 4 (Deadline: 11.06.2023, at 23:59)**

- Name the file appropriately: "Assignment_4_\<Your_Name\>.ipynb" and submit only the Jupyter Notebook file.

### Task 1 (1 point)

The goal of this task is to download a multi-label text classification dataset from the [Hugging Face Hub](https://huggingface.co/datasets) and load it.

a) Select the `Text Classification` tag on the left, then `multi-label-classification`, as well as the "1K<n<10K" tag, to find a relatively small dataset (e.g., sem_eval_2018_task_1 >> subtask5.english).

b) Load your dataset using `load_dataset` and check the last data point in the validation set.

**Hint:** If you don't have access to GPU, you can downsample the dataset.
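The downsampling mentioned in the hint can be sketched as picking a reproducible random subset of row indices. This is only a minimal illustration using the standard library: the helper name `downsample_indices`, the sample size of 1000, and the seed are assumptions, not part of the assignment. The resulting indices could then be passed to the `select` method of a Hugging Face `Dataset`.

```python
import random

def downsample_indices(num_rows, sample_size, seed=42):
    """Return a sorted, reproducible random subset of row indices."""
    sample_size = min(sample_size, num_rows)  # never ask for more rows than exist
    rng = random.Random(seed)                 # local RNG, global random state is untouched
    return sorted(rng.sample(range(num_rows), sample_size))

# Hypothetical usage once a dataset is loaded (not executed here):
#   small_train = dataset["train"].select(downsample_indices(len(dataset["train"]), 1000))
indices = downsample_indices(num_rows=6838, sample_size=1000)
print(len(indices), indices[:5])
```

Fixing the seed keeps the subset identical across runs, which makes results comparable when you re-run the notebook.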

In [2]:
# Here comes your code
from datasets import load_dataset

dataset = load_dataset("sem_eval_2018_task_1", "subtask5.english")



In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 6838
    })
    test: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 3259
    })
    validation: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 886
    })
})

In [4]:
from datasets import DatasetDict

In [5]:
labels = [label for label in dataset['train'].features.keys() if label not in ['ID', 'Tweet']]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels

['anger',
 'anticipation',
 'disgust',
 'fear',
 'joy',
 'love',
 'optimism',
 'pessimism',
 'sadness',
 'surprise',
 'trust']

### Task 2 (3 points)

a) Write a function `tokenize_data(dataset)` that takes the loaded dataset as input and returns the encoded dataset for both text and labels.


**Hints:**

1. You should tokenize the text using the BERT tokenizer `bert-base-uncased`
2. You also need to provide labels to the model as numbers. For multi-label text classification, this is a matrix of shape (batch_size, num_labels). This should be a tensor of floats rather than integers.
3. You can apply the function `tokenize_data(dataset)` to the dataset using `map()`. (You can check out the exercise!)
4. You should set the format of the data to PyTorch tensors using `encoded_dataset.set_format("torch")`. This will turn the training, validation and test sets into standard PyTorch datasets.

b) Print the `keys()` of the last data point in the validation set in the encoded dataset.

**Hint:** The output should be as follows:

`dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])`
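Hint 2 above asks for labels as a float matrix of shape (batch_size, num_labels). A minimal sketch of that conversion on a toy batch is shown here; the helper name `build_label_matrix`, the toy tweets, and the two-label subset are illustrative assumptions. It uses plain Python lists, whereas a real solution would typically use NumPy or PyTorch.

```python
def build_label_matrix(batch, label_names):
    """Turn per-label boolean columns into one row of floats per example."""
    num_examples = len(batch[label_names[0]])
    return [
        [float(batch[name][i]) for name in label_names]
        for i in range(num_examples)
    ]

# Toy batch in the column-oriented layout that `map(..., batched=True)` passes in.
toy_batch = {
    "Tweet": ["so happy today", "this is awful"],
    "anger": [False, True],
    "joy": [True, False],
}
toy_labels = ["anger", "joy"]  # hypothetical subset of the emotion labels
label_matrix = build_label_matrix(toy_batch, toy_labels)
print(label_matrix)  # [[0.0, 1.0], [1.0, 0.0]]
```

Each row corresponds to one tweet and each column to one label, so a tweet with several emotions simply has several 1.0 entries in its row.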

In [6]:
from transformers import AutoTokenizer
import pandas as pd
import numpy as np

In [7]:
df_train = pd.DataFrame(dataset['train'])

In [8]:
df_train

| | ID | Tweet | anger | anticipation | disgust | fear | joy | love | optimism | pessimism | sadness | surprise | trust |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-En-21441 | “Worry is a down payment on a problem you may ... | False | True | False | False | False | False | True | False | False | False | True |
| 1 | 2017-En-31535 | Whatever you decide to do make sure it makes y... | False | False | False | False | True | True | True | False | False | False | False |
| 2 | 2017-En-21068 | @Max_Kellerman it also helps that the majorit... | True | False | True | False | True | False | True | False | False | False | False |
| 3 | 2017-En-31436 | Accept the challenges so that you can literall... | False | False | False | False | True | False | True | False | False | False | False |
| 4 | 2017-En-22195 | My roommate: it's okay that we can't spell bec... | True | False | True | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6833 | 2017-En-21383 | @nicky57672 Hi! We are working towards your hi... | False | False | False | False | False | False | False | False | False | False | False |
| 6834 | 2017-En-41441 | @andreamitchell said @berniesanders not only d... | False | True | False | False | False | False | False | False | False | True | False |
| 6835 | 2017-En-10886 | @isthataspider @dhodgs i will fight this guy! ... | True | False | True | False | False | False | False | True | False | False | False |
| 6836 | 2017-En-40662 | i wonder how a guy can broke his penis while h... | False | False | False | False | False | False | False | False | False | True | False |


In [9]:
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [44]:
def tokenize_function(examples):
    
    """ This function tokenizes the text in the examples dictionary.
        We pass it to the map function of the dataset so that we can batch the tokenization for efficiency by
        tokenizing batches in parallel.
    """
    text = examples["Tweet"]
    encoding = bert_tokenizer(text, padding="max_length", truncation=True)
    # add labels
    labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
    # create numpy array of shape (batch_size, num_labels)
    labels_matrix = np.zeros((len(text), len(labels)))
    # fill numpy array
    for idx, label in enumerate(labels):
        labels_matrix[:, idx] = labels_batch[label]

    encoding["labels"] = labels_matrix.tolist()
    return encoding

In [45]:
def tokenize_data(dataset):
    # here comes your code
    encoded_dataset = dataset.map(tokenize_function, batched=True, remove_columns=dataset['train'].column_names)
    return encoded_dataset

In [46]:
encoded_data = tokenize_data(dataset)



In [47]:
encoded_data

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 6838
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3259
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 886
    })
})

In [48]:
example = encoded_data['train'][10]
print(example.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])


In [50]:
print(example['input_ids'])
print(example['labels'])

[101, 2437, 2008, 12142, 6653, 2013, 7568, 1998, 17772, 2267, 2709, 2121, 2000, 5305, 1998, 9069, 21877, 18719, 23738, 1012, 1001, 2267, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [18]:
encoded_data.set_format("torch")

### Task 3 (3 points)

a) Define a model that includes a pre-trained base (`bert-base-uncased`) using `AutoModelForSequenceClassification`

**Hints:**

1. You need 2 dictionaries that map labels to integers and back for the `label2id` and `id2label` parameters of the `.from_pretrained` function.
2. You should set the `problem_type` to be "multi_label_classification", because this makes sure the appropriate loss function is used.
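To see why the loss function matters in hint 2: for multi-label problems each label is treated as an independent yes/no decision, and the usual choice is binary cross-entropy on the raw logits (in PyTorch, `BCEWithLogitsLoss`). The following is a rough sketch of that computation using only the standard library; the function name `bce_with_logits` and the toy values are assumptions for illustration, not the library's implementation.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

targets = [1.0, 0.0, 1.0]                                  # float labels, one per class
loss_uncertain = bce_with_logits([0.0, 0.0, 0.0], targets)   # model is undecided
loss_confident = bce_with_logits([6.0, -6.0, 6.0], targets)  # model is right and confident
print(loss_uncertain, loss_confident)
```

Because the targets enter the formula as real numbers, the labels must be floats, which is the reason for the float requirement in Task 2.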



b) Train the model using HuggingFace's Trainer API.

**Hints:**

1. You need to use the `TrainingArguments` and `Trainer` classes.
2. While training, we need to compute metrics. To do so, we should define a `compute_metrics` function, that returns a dictionary with the desired metric values.
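As an illustration of hint 2, a metric function for multi-label data compares 0/1 matrices rather than single class indices. The sketch below computes a micro-averaged F1 score and subset accuracy in plain Python and returns them as a dictionary; the names `micro_f1` and `toy_compute_metrics` and the toy matrices are hypothetical. A real `compute_metrics` would first convert logits to 0/1 predictions and could use scikit-learn instead.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (example, label) cells of two 0/1 matrices."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def toy_compute_metrics(y_true, y_pred):
    """Return metric values as a dictionary keyed by metric name."""
    # Subset accuracy: an example counts only if every label matches.
    exact = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"f1_micro": micro_f1(y_true, y_pred), "subset_accuracy": exact}

y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
metrics = toy_compute_metrics(y_true, y_pred)
print(metrics)
```

Subset accuracy is strict for multi-label data, so micro F1 is often the more informative number to watch during training.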

In [19]:
# Here comes your code
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer


In [26]:
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
    problem_type="multi_label_classification",
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

In [27]:
training_args = TrainingArguments(
    output_dir="my_model",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

In [28]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import EvalPrediction
import torch
import evaluate

In [29]:

def compute_metrics(eval_pred):
    # Multi-label setting: argmax would keep only a single class per example,
    # so apply a sigmoid to every logit and threshold at 0.5 instead.
    logits, labels = eval_pred
    logits = logits[0] if isinstance(logits, tuple) else logits
    probs = 1 / (1 + np.exp(-logits))
    predictions = (probs >= 0.5).astype(int)
    # Return a dictionary mapping metric names to values, as Trainer expects.
    return {
        "f1": f1_score(labels, predictions, average="micro"),
        "roc_auc": roc_auc_score(labels, probs, average="micro"),
        "accuracy": accuracy_score(labels, predictions),
    }

In [None]:
outputs = model(input_ids=encoded_data['train']['input_ids'][0].unsqueeze(0), labels=encoded_data['train'][0]['labels'].unsqueeze(0))
outputs

In [31]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_data["train"],
    eval_dataset=encoded_data["validation"],
    compute_metrics=compute_metrics,
)

In [32]:
trainer.train()



RuntimeError: stack expects each tensor to be equal size, but got [1] at entry 0 and [2] at entry 2

### Task 4 (3 points)

a) Evaluate your model on the validation set.

b) Test your model on a new example.
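For part b), the model's raw outputs for a new text still have to be turned into label names. One possible way to do this, sketched under the assumption of a 0.5 threshold, is shown below; the function name `logits_to_labels`, the toy logits, and the three-label mapping are made up for illustration and do not come from a trained model.

```python
import math

def logits_to_labels(logits, id2label, threshold=0.5):
    """Return names of labels whose sigmoid probability reaches the threshold."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [id2label[i] for i, p in enumerate(probs) if p >= threshold]

toy_id2label = {0: "anger", 1: "joy", 2: "optimism"}  # hypothetical mapping
toy_logits = [-2.3, 1.7, 0.4]                         # hypothetical model outputs
predicted = logits_to_labels(toy_logits, toy_id2label)
print(predicted)  # ['joy', 'optimism']
```

Unlike argmax, this can return zero, one, or several labels for the same text, which matches the multi-label setup.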

In [None]:
# Here comes your code