<a href="https://colab.research.google.com/github/EmiliaFidler/Intro_to_Comp_Ling_WS24/blob/main/homeexercise3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Home Exercise 3: Hyperparameters and Evaluation
In this third home exercise, you will use the knowledge from Tutorial 4 to experiment with hyperparameters, create a test set, and evaluate your final model on the created test set.

In this notebook, please complete every instruction marked with 👋 ⚒, either in the code cell that follows the marker or, where an analysis is requested, in the text cell that follows it.

## **DistilBERT: Hyperparameters and Evaluation**

Use the code of Tutorial 4 to load and fine-tune the `distilbert-base-cased` model on the small subset of the `imdb` Movie Review Dataset. For convenience, the code of Tutorial 4 required for this exercise is already provided in the code cells below.

👋 ⚒ When creating the dataset splits in the code cell below, additionally create a test set to be used after the training. Make sure that your test set does not contain any of the sentences contained in the training or validation set and is approximately the same size as the validation set.

In [None]:
!pip install transformers
!pip install datasets
!pip install evaluate
!pip install accelerate --upgrade


In [None]:
from datasets import load_dataset, DatasetDict
from transformers import DataCollatorWithPadding, AutoTokenizer

# Load the IMDb dataset (comment this line out if it is already loaded in your session)
imdb_dataset = load_dataset("imdb")
# Make sure the tokenizer matches the model checkpoint
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")


# Keep only the first 50 whitespace-separated words of each review for speed on CPU
def truncate(example):
    return {
        'text': " ".join(example['text'].split()[:50]),
        'label': example['label']
    }



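# Take 128 shuffled examples for training and 32 each for validation and testing.
# Train and validation use non-overlapping index ranges of the same shuffle;
# the test examples are drawn from IMDb's separate test split.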
small_imdb_dataset = DatasetDict(
    train=imdb_dataset['train'].shuffle(seed=24).select(range(128)).map(truncate),
    val=imdb_dataset['train'].shuffle(seed=24).select(range(128, 160)).map(truncate),
    test=imdb_dataset['test'].shuffle(seed=24).select(range(160, 192)).map(truncate)
)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True)

small_tokenized_dataset = small_imdb_dataset.map(tokenize_function, batched=True, batch_size=16)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
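
One way to check the requirement that the test set shares no reviews with the training or validation set is to compare the texts directly. The following is a minimal sketch, not part of the Tutorial 4 code: `shared_texts` is a hypothetical helper (it does not exist in `datasets`), and the toy splits are made-up stand-ins for `small_imdb_dataset`.

```python
def shared_texts(split_a, split_b):
    """Return the set of review texts that occur in both splits."""
    texts_a = {example["text"] for example in split_a}
    texts_b = {example["text"] for example in split_b}
    return texts_a & texts_b

# Toy stand-ins for the real splits (hypothetical data)
toy_train = [{"text": "Great film.", "label": 1}, {"text": "Dull plot.", "label": 0}]
toy_val = [{"text": "Loved it.", "label": 1}]
toy_test = [{"text": "Dull plot.", "label": 0}, {"text": "Not for me.", "label": 0}]

print(shared_texts(toy_train, toy_val))   # empty set: no overlap
print(shared_texts(toy_train, toy_test))  # one shared review
```

Applied to the real data, a call such as `shared_texts(small_imdb_dataset["train"], small_imdb_dataset["test"])` returning an empty set would confirm that the two splits are disjoint.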

👋 ⚒ For this exercise, we will use the Hugging Face Trainer class to play with hyperparameters. Try to find a set of hyperparameter settings that achieves the highest possible accuracy on the **validation set** with the small dataset and model in this setup.

**Optional:** If you want to follow a more systematic route, feel free to use available frameworks for hyperparameter optimization, such as [Optuna](https://optuna.org/).
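
A systematic search does not strictly require a framework. The sketch below shows a plain grid search, assuming you wrap one training run in a function that returns the validation accuracy. `grid_search`, `toy_objective`, and the search space values are hypothetical and only illustrate the idea; the toy objective returns a made-up score so that the sketch runs without training anything.

```python
from itertools import product

def grid_search(evaluate_fn, search_space):
    """Try every combination in search_space; return (best_score, best_params)."""
    names = list(search_space)
    best_score, best_params = float("-inf"), None
    for values in product(*(search_space[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate_fn(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Hypothetical search space over two TrainingArguments fields
search_space = {
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "per_device_train_batch_size": [8, 16],
}

def toy_objective(learning_rate, per_device_train_batch_size):
    # Placeholder score so the sketch runs; replace it with a function that
    # trains with these settings and returns the validation accuracy.
    return -abs(learning_rate - 3e-5) - 0.001 * per_device_train_batch_size

best_score, best_params = grid_search(toy_objective, search_space)
print(best_params)
```

Grid search trains once per combination, so the cost grows quickly with the number of values. This is one reason frameworks such as Optuna sample the space instead of enumerating it.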

In [None]:
!pip install optuna



In [None]:
import numpy as np
import evaluate
from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSequenceClassification
from transformers import EarlyStoppingCallback

model = AutoModelForSequenceClassification.from_pretrained('distilbert/distilbert-base-cased', num_labels=2)
accuracy = evaluate.load("accuracy")

arguments = TrainingArguments(
    output_dir="sample_cl_trainer", # where checkpoints and other outputs are written
    per_device_train_batch_size=8,  # number of examples per device in each training step (one weight update)
    per_device_eval_batch_size=8,
    logging_steps=8,                # log the training loss every 8 update steps
    num_train_epochs=10,
    eval_strategy="epoch",          # run validation at the end of each epoch
    save_strategy="epoch",          # save a checkpoint at the end of each epoch
    learning_rate=1e-5,             # step size of each weight update
    weight_decay=0.005,             # regularization that shrinks the weights and so penalizes large values
    load_best_model_at_end=True,    # reload the best checkpoint once training ends
    report_to='none',
    seed=224
)

def compute_metrics(eval_pred):
    """Called at the end of validation. Gives accuracy"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # calculates the accuracy
    return accuracy.compute(predictions=predictions, references=labels)


trainer = Trainer(
    model=model,
    args=arguments,
    train_dataset=small_tokenized_dataset['train'],
    eval_dataset=small_tokenized_dataset['val'], # change to test when you do your final evaluation!
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=2))

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
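
The `early_stopping_patience=2` setting above can be pictured as a counter of consecutive evaluations without improvement. The sketch below is a simplified model of that idea, not the library's implementation; `epochs_until_stop` is a hypothetical helper, and the loss values are invented for illustration.

```python
def epochs_until_stop(metric_values, patience, greater_is_better=False):
    """Return the epoch at which training would stop, or None if it never stops."""
    best = None
    bad_evals = 0  # consecutive evaluations without improvement
    for epoch, value in enumerate(metric_values, start=1):
        if best is None:
            improved = True
        elif greater_is_better:
            improved = value > best
        else:
            improved = value < best
        if improved:
            best = value
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return epoch
    return None

# Hypothetical validation losses: the loss stops improving after epoch 2
toy_losses = [0.70, 0.65, 0.66, 0.67, 0.60]
stop_epoch = epochs_until_stop(toy_losses, patience=2)
print(stop_epoch)  # 4
```

Under this model, a run whose monitored metric improves at every evaluation is never stopped early and uses all of `num_train_epochs`.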


In [None]:
trainer.train()

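| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.6822 | 0.693174 | 0.46875 |
| 2 | 0.6841 | 0.688901 | 0.46875 |
| 3 | 0.6637 | 0.672419 | 0.5 |
| 4 | 0.6124 | 0.644293 | 0.78125 |
| 5 | 0.5624 | 0.607863 | 0.8125 |
| 6 | 0.4624 | 0.553763 | 0.84375 |
| 7 | 0.3756 | 0.513987 | 0.8125 |
| 8 | 0.3142 | 0.478871 | 0.875 |
| 9 | 0.2673 | 0.463749 | 0.875 |
| 10 | 0.2725 | 0.459768 | 0.84375 |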


TrainOutput(global_step=160, training_loss=0.5006450787186623, metrics={'train_runtime': 1239.8815, 'train_samples_per_second': 1.032, 'train_steps_per_second': 0.129, 'total_flos': 36337463293824.0, 'train_loss': 0.5006450787186623, 'epoch': 10.0})
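
To decide which checkpoint to load later, it can help to map the logged epochs to checkpoint folders. The sketch below uses the validation values from the log above. It assumes that checkpoints are named `checkpoint-<global step>`, that one is saved per epoch (`save_strategy="epoch"`), and that training runs on a single device without gradient accumulation, so one epoch is 128 / 8 = 16 steps (consistent with `global_step=160` after 10 epochs).

```python
# Validation values copied from the training log above
val_loss = [0.693174, 0.688901, 0.672419, 0.644293, 0.607863,
            0.553763, 0.513987, 0.478871, 0.463749, 0.459768]
val_acc = [0.46875, 0.46875, 0.5, 0.78125, 0.8125,
           0.84375, 0.8125, 0.875, 0.875, 0.84375]

steps_per_epoch = 128 // 8  # 128 training examples, batch size 8

# Epochs are 1-indexed; max() returns the first epoch that reaches the top accuracy
best_loss_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_acc_epoch = max(range(len(val_acc)), key=val_acc.__getitem__) + 1

for name, epoch in [("lowest validation loss", best_loss_epoch),
                    ("highest accuracy", best_acc_epoch)]:
    print(f"{name}: epoch {epoch} -> checkpoint-{epoch * steps_per_epoch}")
```

The two criteria disagree here: the validation loss is lowest after epoch 10, while accuracy peaks at epochs 8 and 9. Under the same assumptions, `checkpoint-144` corresponds to epoch 9. Note that the Trainer documentation describes validation loss as the default criterion for `load_best_model_at_end=True` when `metric_for_best_model` is not set.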

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
results = trainer.predict(small_tokenized_dataset['val'])
print(results)

PredictionOutput(predictions=array([[ 0.6254548 , -0.24747477],
       [-1.1510874 ,  0.90899223],
       [ 0.24678981,  0.26977423],
       [-0.71183646,  0.71441317],
       [-0.46013474,  0.5440769 ],
       [-0.96429515,  0.80689865],
       [ 0.5574993 , -0.399373  ],
       [ 0.7924975 , -0.6476236 ],
       [ 0.02069771,  0.24121338],
       [ 0.11433366, -0.02530708],
       [-0.1796627 ,  0.63809395],
       [ 0.6276436 , -0.4834797 ],
       [ 0.13189234, -0.30014634],
       [-0.7639835 ,  0.64128625],
       [-0.48282862,  0.6144321 ],
       [-0.32334077,  0.38191673],
       [ 0.00915378,  0.10162576],
       [ 0.14100838,  0.28745154],
       [-0.2988384 ,  0.41972435],
       [ 0.77839196, -0.5631106 ],
       [ 0.07537009,  0.380215  ],
       [ 0.5557085 , -0.31079784],
       [ 0.556118  , -0.1604549 ],
       [ 0.7119608 , -0.48887908],
       [-0.8648093 ,  0.8594225 ],
       [ 0.95808923, -0.7401277 ],
       [ 0.30001524, -0.08070067],
       [-0.05270683,  0.47
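
Each row of `predictions` holds two logits, one per class, and the predicted label is the index of the larger one. The sketch below reproduces in plain Python what `compute_metrics` does with `np.argmax`, using the first three logit rows printed above. The gold labels are hypothetical and serve only to show the accuracy computation.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

# First three logit rows from the output above
logits = [
    [0.6254548, -0.24747477],
    [-1.1510874, 0.90899223],
    [0.24678981, 0.26977423],
]
# Hypothetical gold labels, for illustration only (0 = negative, 1 = positive)
toy_labels = [0, 1, 0]

toy_predictions = [argmax(row) for row in logits]
toy_accuracy = sum(p == y for p, y in zip(toy_predictions, toy_labels)) / len(toy_labels)
print(toy_predictions)  # [0, 1, 1]
print(toy_accuracy)
```

The third row shows a low-confidence case: the two logits are almost equal, so a small change in the model could flip the predicted label.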

👋 ⚒ Change the following code cell so that your trained model (make sure to use the correct checkpoint!) is evaluated not just on a single sentence but on the entire newly created test set.

This might also be a good occasion to get familiar with the [Hugging Face documentation and tutorials](https://huggingface.co/docs/transformers/index).

In [None]:
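import torch

test_str = "I hate this movie!"

# Load the fine-tuned weights from the checkpoint saved to Google Drive
fine_tuned_model = AutoModelForSequenceClassification.from_pretrained("drive/MyDrive/checkpoint-144")
model_inputs = tokenizer(test_str, return_tensors="pt")
# Inference only, so no gradients are needed
with torch.no_grad():
    logits = fine_tuned_model(**model_inputs).logits
prediction = torch.argmax(logits, dim=-1).item()  # 0 = negative, 1 = positive
print(["NEGATIVE", "POSITIVE"][prediction])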

NEGATIVE


In [None]:
test_dataset = small_imdb_dataset['test']

test_results = trainer.predict(small_tokenized_dataset['test'])
print("Test set evaluation results:", test_results)

Test set evaluation results: PredictionOutput(predictions=array([[ 0.87845576, -0.61240655],
       [ 0.8294658 , -0.6899486 ],
       [ 0.46415454, -0.18060079],
       [-0.22983849,  0.15763688],
       [ 0.614255  , -0.62235266],
       [-0.80942875,  0.89223933],
       [-0.6827831 ,  0.68266195],
       [ 0.584398  , -0.16275622],
       [-0.62700206,  0.7141466 ],
       [ 0.13370642,  0.24209908],
       [ 0.31586063,  0.12510666],
       [-0.8778178 ,  0.78590673],
       [ 0.51914144, -0.02581131],
       [ 0.4363265 , -0.20913824],
       [-0.74065256,  0.73400635],
       [-0.56064963,  0.5335861 ],
       [-0.4799466 ,  0.6638038 ],
       [-0.9617985 ,  0.7026859 ],
       [ 0.86578906, -0.51373667],
       [-0.36208847,  0.47121   ],
       [ 0.13761687,  0.05034046],
       [ 0.04363983,  0.24811244],
       [ 1.0411247 , -0.95407206],
       [ 0.65384215, -0.3879072 ],
       [-0.10314952,  0.10691467],
       [-0.3827263 ,  0.56766206],
       [ 0.7845536 , -0.5548713 

In [None]:
# Tokenize the raw test split with the same preprocessing as train/val
tokenized_test_dataset = test_dataset.map(
    tokenize_function,
    batched=True,
    batch_size=16
)

# New Trainer that wraps the checkpoint loaded above and is used for evaluation only;
# compute_metrics is reused from the training cell
trainer = Trainer(
    model=fine_tuned_model,
    args=arguments,
    eval_dataset=tokenized_test_dataset,  # final evaluation on the held-out test set
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

test_results = trainer.evaluate(eval_dataset=tokenized_test_dataset)
print(test_results)

{'eval_loss': 0.5175219774246216, 'eval_model_preparation_time': 0.0032, 'eval_accuracy': 0.84375, 'eval_runtime': 10.1167, 'eval_samples_per_second': 3.163, 'eval_steps_per_second': 0.395}
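
With only 32 test reviews, a single accuracy value is a coarse estimate. As a rough illustration, the sketch below converts the reported `eval_accuracy` into a count of correct predictions and an approximate 95% Wilson score interval. The interval is an addition that the exercise does not ask for, and `wilson_interval` is a hypothetical helper written for this sketch.

```python
import math

def wilson_interval(accuracy, n, z=1.96):
    """Approximate 95% Wilson score interval for an accuracy measured on n examples."""
    denom = 1 + z**2 / n
    center = (accuracy + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

n_test = 32
test_accuracy = 0.84375                     # eval_accuracy reported above
n_correct = round(test_accuracy * n_test)   # 27 of 32 reviews classified correctly
low, high = wilson_interval(test_accuracy, n_test)
print(f"{n_correct}/{n_test} correct, 95% interval roughly {low:.2f} to {high:.2f}")
```

The wide interval suggests that small differences in accuracy between hyperparameter settings on a set of this size should be interpreted with caution.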
