# HW 2: Efficient Fine-Tuning with BitFit?
**Due: March 13, 11:30 AM**

In this homework assignment, you will replicate [the BitFit experiments (Ben Zaken et al., 2022)](https://aclanthology.org/2022.acl-short.1/). You will first use the [🤗 Transformers framework](https://huggingface.co/docs/transformers/index) to fine-tune a [BERT$_\text{tiny}$ model](https://huggingface.co/prajjwal1/bert-tiny) ([Turc et al., 2019](https://arxiv.org/abs/1908.08962); [Bhargava et al., 2021](https://aclanthology.org/2021.insights-1.18/)) on the IMDb dataset. You will then fine-tune the same model, but with all parameters frozen other than the bias terms. You will compare the two models on the following metrics: (1) their accuracy on the IMDb test set and (2) the number of parameters trained during fine-tuning.

## Important: Read Before Starting

In the following exercises, you will need to implement functions defined in the `train_model.py` and `test_model.py` scripts. **Please write all your code in those files.** You should not submit this notebook with your solutions, and we will not grade it if you do. Please be aware that code written in a Jupyter notebook may run differently when copied into Python modules.

The outputs shown in this notebook are the outputs that you should get **when all problems have been completed correctly**. You may obtain different results if you attempt to run the code cells before you have completed the problem set, or if you have completed one or more problems incorrectly.

For part of this assignment, you will be asked to fine-tune a BERT$_\text{tiny}$ model on the IMDb dataset with hyperparameter tuning. **This will take several hours to run on a laptop with a CPU.** You may want to instead run your code on [Google Colaboratory](https://colab.research.google.com/) using a free GPU.

To begin, please run the following `import` statements.

In [None]:
!pip install datasets evaluate optuna --quiet  # install these libraries if they are not already in your environment

In [None]:
"""
Code for Problem 1 of HW 2.
"""
import pickle
from typing import Any, Dict

import evaluate
import numpy as np
import optuna
from datasets import Dataset, load_dataset
from transformers import BertTokenizerFast, BertForSequenceClassification, \
    Trainer, TrainingArguments, EvalPrediction, EarlyStoppingCallback


def preprocess_dataset(dataset: Dataset, tokenizer: BertTokenizerFast) \
        -> Dataset:
    """
    Problem 1d: Implement this function.

    Preprocesses a dataset using a Hugging Face Tokenizer and prepares
    it for use in a Hugging Face Trainer.

    :param dataset: A dataset
    :param tokenizer: A tokenizer
    :return: The dataset, preprocessed using the tokenizer
    """

    def tokenize_function(examples):
        # Pad and truncate every review to BERT's maximum input length
        return tokenizer(
            examples["text"],
            padding="max_length",
            truncation=True,
            max_length=512,
        )

    return dataset.map(tokenize_function, batched=True)


def init_model(trial: Any, model_name: str, use_bitfit: bool = False) -> \
        BertForSequenceClassification:
    """
    Problem 2a: Implement this function.

    This function should be passed to your Trainer's model_init keyword
    argument. It will be used by the Trainer to initialize a new model
    for each hyperparameter tuning trial. Your implementation of this
    function should support training with BitFit by freezing all non-
    bias parameters of the initialized model.

    :param trial: This parameter is required by the Trainer, but it will
        not be used for this problem. Please ignore it
    :param model_name: The identifier listed in the Hugging Face Model
        Hub for the pre-trained model that will be loaded
    :param use_bitfit: If True, then all parameters will be frozen other
        than bias terms
    :return: A newly initialized pre-trained Transformer classifier
    """
    model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

    if use_bitfit:
        for name, param in model.named_parameters():
            if "bias" not in name:
                param.requires_grad = False

    return model


def compute_metrics(eval_pred: EvalPrediction) -> Dict[str, float]:
    """Computes accuracy for evaluation."""
    metric = evaluate.load("accuracy")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


def init_trainer(model_name: str, train_data: Dataset, val_data: Dataset,
                 use_bitfit: bool = False) -> Trainer:
    """
    Problem 2b: Implement this function.

    Creates a Trainer object that will be used to fine-tune a BERT-tiny
    model on the IMDb dataset. The Trainer should fulfill the criteria
    listed in the problem set.

    :param model_name: The identifier listed in the Hugging Face Model
        Hub for the pre-trained model that will be fine-tuned
    :param train_data: The training data used to fine-tune the model
    :param val_data: The validation data used for hyperparameter tuning
    :param use_bitfit: If True, then all parameters will be frozen other
        than bias terms
    :return: A Trainer used for training
    """
    # Evaluate and save a checkpoint once per epoch so that the best model
    # (by validation accuracy) can be reloaded at the end of training
    training_args = TrainingArguments(
        output_dir="./checkpoints",
        evaluation_strategy="epoch",
        save_strategy="epoch",
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=5,
        weight_decay=0.01,
        load_best_model_at_end=True,
        save_total_limit=3,
        metric_for_best_model="accuracy",
        greater_is_better=True,
        logging_dir="./logs",
        logging_steps=10,
    )

    trainer = Trainer(
        model_init=lambda trial: init_model(trial, model_name, use_bitfit),
        args=training_args,
        train_dataset=train_data,
        eval_dataset=val_data,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    )

    return trainer


def hyperparameter_search_settings() -> Dict[str, Any]:
    """
    Problem 2c: Implement this function.

    Returns keyword arguments passed to Trainer.hyperparameter_search.
    Your hyperparameter search must satisfy the criteria listed in the
    problem set.

    :return: Keyword arguments for Trainer.hyperparameter_search
    """

    return {
        "direction": "maximize",
        "n_trials": 20,  # Increased trials for better tuning
        "hp_space": lambda trial: {
            "learning_rate": trial.suggest_float(
                "learning_rate", 1e-6, 5e-5, log=True),
            "per_device_train_batch_size": trial.suggest_categorical(
                "per_device_train_batch_size", [8, 16, 32]),
            "num_train_epochs": trial.suggest_int("num_train_epochs", 3, 6),
            "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
        },
    }


if __name__ == "__main__":  # Use this script to train your model
    model_name = "prajjwal1/bert-tiny"

    # Load IMDb dataset and create validation split
    imdb = load_dataset("imdb")
    split = imdb["train"].train_test_split(.2, seed=3463)
    imdb["train"] = split["train"]
    imdb["val"] = split["test"]
    del imdb["unsupervised"]
    del imdb["test"]

    # Preprocess the dataset for the trainer
    tokenizer = BertTokenizerFast.from_pretrained(model_name)

    imdb["train"] = preprocess_dataset(imdb["train"], tokenizer)
    imdb["val"] = preprocess_dataset(imdb["val"], tokenizer)

    # Set up trainer
    trainer = init_trainer(model_name, imdb["train"], imdb["val"],
                           use_bitfit=True)

    # Train and save the best hyperparameters
    best = trainer.hyperparameter_search(**hyperparameter_search_settings())
    with open("train_results.p", "wb") as f:
        pickle.dump(best, f)




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[I 2025-03-15 19:10:17,430] A new study created in memory with name: no-name-f087f4be-11cc-45c4-8136-558af571f48d


Epoch,Training Loss,Validation Loss
1,0.7098,0.712794


[I 2025-03-15 19:14:01,324] Trial 0 finished with value: 0.71279376745224 and parameters: {'learning_rate': 2.239985538470037e-06, 'num_train_epochs': 1, 'seed': 26, 'per_device_train_batch_size': 32}. Best is trial 0 with value: 0.71279376745224.

0,1
eval/loss,0.71279
eval/runtime,20.3165
eval/samples_per_second,246.105
eval/steps_per_second,30.763
total_flos,6352435200000.0
train/epoch,1.0
train/global_step,625.0
train/grad_norm,0.10848
train/learning_rate,0.0
train/loss,0.7098


Epoch,Training Loss,Validation Loss
1,0.7006,0.699346


[I 2025-03-15 19:17:54,400] Trial 1 finished with value: 0.6993458271026611 and parameters: {'learning_rate': 1.8654145377087265e-06, 'num_train_epochs': 1, 'seed': 1, 'per_device_train_batch_size': 8}. Best is trial 0 with value: 0.71279376745224.

0,1
eval/loss,0.69935
eval/runtime,20.279
eval/samples_per_second,246.561
eval/steps_per_second,30.82
total_flos,6352435200000.0
train/epoch,1.0
train/global_step,2500.0
train/grad_norm,0.33003
train/learning_rate,0.0
train/loss,0.7006


Epoch,Training Loss,Validation Loss
1,0.6977,0.698429


[I 2025-03-15 19:22:19,541] Trial 2 finished with value: 0.6984289884567261 and parameters: {'learning_rate': 2.2667536963828638e-05, 'num_train_epochs': 1, 'seed': 3, 'per_device_train_batch_size': 4}. Best is trial 0 with value: 0.71279376745224.

0,1
eval/loss,0.69843
eval/runtime,21.229
eval/samples_per_second,235.527
eval/steps_per_second,29.441
total_flos,6352435200000.0
train/epoch,1.0
train/global_step,5000.0
train/grad_norm,0.16421
train/learning_rate,0.0
train/loss,0.6977


Epoch,Training Loss,Validation Loss
1,No log,0.6958
2,0.695800,0.694546


[I 2025-03-15 19:29:15,951] Trial 3 finished with value: 0.694546103477478 and parameters: {'learning_rate': 8.577376442502214e-05, 'num_train_epochs': 2, 'seed': 27, 'per_device_train_batch_size': 64}. Best is trial 0 with value: 0.71279376745224.

0,1
eval/loss,0.69455
eval/runtime,21.7832
eval/samples_per_second,229.535
eval/steps_per_second,28.692
total_flos,12704870400000.0
train/epoch,2.0
train/global_step,626.0
train/grad_norm,0.08006
train/learning_rate,2e-05
train/loss,0.6958


Epoch,Training Loss,Validation Loss
1,0.7016,0.702047
2,0.6986,0.698858
3,0.6955,0.697093
4,0.694,0.696521


[I 2025-03-15 19:43:01,305] Trial 4 finished with value: 0.6965214014053345 and parameters: {'learning_rate': 1.346226900769052e-05, 'num_train_epochs': 4, 'seed': 11, 'per_device_train_batch_size': 32}. Best is trial 0 with value: 0.71279376745224.

0,1
eval/loss,0.69652
eval/runtime,21.7063
eval/samples_per_second,230.348
eval/steps_per_second,28.793
total_flos,25409740800000.0
train/epoch,4.0
train/global_step,2500.0
train/grad_norm,0.11075
train/learning_rate,0.0
train/loss,0.694


Epoch,Training Loss,Validation Loss
1,0.6941,0.695462


[I 2025-03-15 19:46:54,331] Trial 5 pruned. 

0,1
eval/loss,0.69546
eval/runtime,21.8346
eval/samples_per_second,228.995
eval/steps_per_second,28.624
train/epoch,1.0
train/global_step,2500.0
train/grad_norm,0.27045
train/learning_rate,0.0
train/loss,0.6941


Epoch,Training Loss,Validation Loss
1,0.7144,0.710316
2,0.7159,0.70767
3,0.7152,0.706895


[I 2025-03-15 19:58:30,208] Trial 6 finished with value: 0.7068954110145569 and parameters: {'learning_rate': 1.982229058393526e-06, 'num_train_epochs': 3, 'seed': 24, 'per_device_train_batch_size': 8}. Best is trial 0 with value: 0.71279376745224.

0,1
eval/loss,0.7069
eval/runtime,20.5075
eval/samples_per_second,243.813
eval/steps_per_second,30.477
total_flos,19057305600000.0
train/epoch,3.0
train/global_step,7500.0
train/grad_norm,0.3868
train/learning_rate,0.0
train/loss,0.7152


Epoch,Training Loss,Validation Loss
1,0.699,0.696094


[I 2025-03-15 20:02:09,924] Trial 7 pruned. 

0,1
eval/loss,0.69609
eval/runtime,20.5314
eval/samples_per_second,243.529
eval/steps_per_second,30.441
train/epoch,1.0
train/global_step,1250.0
train/grad_norm,0.21838
train/learning_rate,0.0
train/loss,0.699


Epoch,Training Loss,Validation Loss
1,No log,0.700698
2,0.702300,0.700479
3,0.702300,0.700338
4,0.701400,0.700295


[I 2025-03-15 20:15:59,056] Trial 8 finished with value: 0.700294554233551 and parameters: {'learning_rate': 3.231119950187164e-06, 'num_train_epochs': 4, 'seed': 8, 'per_device_train_batch_size': 64}. Best is trial 0 with value: 0.71279376745224.

0,1
eval/loss,0.70029
eval/runtime,20.8462
eval/samples_per_second,239.852
eval/steps_per_second,29.982
total_flos,26680227840000.0
train/epoch,4.0
train/global_step,1252.0
train/grad_norm,0.0631
train/learning_rate,0.0
train/loss,0.7014


Epoch,Training Loss,Validation Loss
1,No log,0.692833


[I 2025-03-15 20:19:29,827] Trial 9 pruned. 


In [None]:

"""
Code for Problem 1 of HW 2.
"""
import pickle

import evaluate
import numpy as np  # needed by compute_metrics below
from datasets import load_dataset
from transformers import BertTokenizerFast, BertForSequenceClassification, \
    Trainer, TrainingArguments

from train_model import preprocess_dataset

def compute_metrics(eval_pred):
    """Computes accuracy for evaluation."""
    metric = evaluate.load("accuracy")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

def init_tester(directory: str) -> Trainer:
    """
    Problem 2b: Implement this function.

    Creates a Trainer object that will be used to test a fine-tuned
    model on the IMDb test set. The Trainer should fulfill the criteria
    listed in the problem set.

    :param directory: The directory where the model being tested is
        saved
    :return: A Trainer used for testing
    """
    model = BertForSequenceClassification.from_pretrained(directory,
                                                          num_labels=2)
    training_args = TrainingArguments(
        output_dir="./test_results",
        per_device_eval_batch_size=8,
        do_train=False,  # Disable training
        do_eval=True,    # Enable evaluation
        evaluation_strategy="no",
        logging_dir="./logs",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
    )

    return trainer


if __name__ == "__main__":  # Use this script to test your model
    model_name = "prajjwal1/bert-tiny"

    # Load IMDb dataset
    imdb = load_dataset("imdb")
    del imdb["train"]
    del imdb["unsupervised"]

    # Preprocess the dataset for the tester
    tokenizer = BertTokenizerFast.from_pretrained(model_name)
    imdb["test"] = preprocess_dataset(imdb["test"], tokenizer)

    # Set up tester
    tester = init_tester("path_to_your_best_model")

    # Test
    results = tester.predict(imdb["test"])
    with open("test_results.p", "wb") as f:
        pickle.dump(results, f)



In [None]:
import torch
from collections.abc import Iterable
from datasets import load_dataset

# Model and tokenizer from 🤗 Transformers
from transformers import AutoModelForSequenceClassification, \
    BertForSequenceClassification, BertTokenizerFast



In [None]:
# Code you will write for this assignment
from train_model import init_model, preprocess_dataset, init_trainer


In [None]:
from test_model import init_tester

## Problem 1: Setup (30 Points in Total)

In this assignment, you will fine-tune a pre-trained Transformer model using libraries provided by [Hugging Face](https://huggingface.co/) (whose name is usually styled using the emoji 🤗). You have already been exposed to Hugging Face in lab, where you used the [🤗 Datasets](https://huggingface.co/docs/datasets/index) library to load the IMDb dataset and the [🤗 Transformers](https://huggingface.co/docs/transformers/index) library to load a pre-trained BERT$_\text{tiny}$ model. In the following problems, you will additionally use the [🤗 Evaluate](https://huggingface.co/docs/evaluate/index) library, which provides evaluation metrics such as accuracy and F1.

For several parts of this problem, you will need to refer to the [Hugging Face fine-tuning tutorial](https://huggingface.co/docs/transformers/training) for guidance.

### Problem 1a: Understand the 🤗 Transformers Library (No Submission, 0 Points)

🤗 Transformers is imported into Python via the name `transformers`. Please find the import statements from 🤗 Transformers in the code cell above.

🤗 Transformers comes with a number of different Transformer architectures, as well as [the Model Hub, a repository of pre-trained model parameters](https://huggingface.co/models). A pre-trained model is loaded by calling the model architecture's `.from_pretrained` method.

In [None]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The code above loads a Transformer classifier consisting of a pre-trained BERT$_\text{base}$ encoder with case-insensitive vocabulary and a randomly initialized 2-layer MLP decoder with tanh activation. The choice of this particular set of pre-trained parameters is specified by the identifier `'bert-base-uncased'`, which is passed to the first parameter of `.from_pretrained`. Different pre-trained weights can be loaded by passing a different identifier to `.from_pretrained`. The following code loads the BERT$_\text{tiny}$ model from [Turc et al. (2019)](https://arxiv.org/abs/1908.08962) and [Bhargava et al. (2021)](https://aclanthology.org/2021.insights-1.18/), which you will be fine-tuning in this assignment. (The `/` indicates that this is a user-submitted model, uploaded by the user [`prajjwal1`](https://huggingface.co/prajjwal1).)

In [None]:
model = BertForSequenceClassification.from_pretrained(
    "prajjwal1/bert-tiny", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In order to load a model using the code above, you would have to know that BERT$_{\text{tiny}}$'s architecture is implemented using the same class as BERT$_{\text{base}}$. This is not true in general, however. For instance, if you wanted to initialize a RoBERTa classifier instead of a BERT classifier, you would need to call `RobertaForSequenceClassification.from_pretrained` instead of `BertForSequenceClassification.from_pretrained`. When you don't know which class implements the architecture of the pre-trained model you want to load, you can use the `AutoModelForSequenceClassification` class ([and equivalent classes for other tasks](https://huggingface.co/docs/transformers/model_doc/auto)), which will figure out which class to instantiate based on the pre-trained weights you would like to load.

In [None]:
# This code does exactly the same thing as the previous code cell
model = AutoModelForSequenceClassification.from_pretrained(
    "prajjwal1/bert-tiny", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In addition to models, 🤗 Transformers also provides tokenizers that implement a full processing pipeline similar to what you implemented in HW 2. You can load the appropriate tokenizer for your model using a `.from_pretrained` method, just as you did with the model.

In [None]:
tokenizer = BertTokenizerFast.from_pretrained("prajjwal1/bert-tiny")

As we saw in lab, the tokenizer object can be called as a function. Doing so will return a fully processed input, ready to be passed to the model.

In [None]:
# Because 🤗 Transformers supports multiple deep learning libraries, you will
# need to use the keyword parameter return_tensors in order to indicate that
# you want your inputs to be returned in PyTorch format.
inputs = tokenizer(["Hello world!", "How are you?"], padding=True,
                   return_tensors="pt")
inputs

{'input_ids': tensor([[ 101, 7592, 2088,  999,  102,    0],
        [ 101, 2129, 2024, 2017, 1029,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1]])}

The inputs returned by the tokenizer are passed to the model via [dictionary unpacking](https://realpython.com/python-kwargs-and-args/). The output of the model is structured, with various kinds of information provided depending on keyword arguments passed to the model.

In [None]:
model.eval()
with torch.no_grad():
    outputs = model(**inputs)

print(outputs, end="\n\n")

# Use the dot operator to access parts of the output
print(outputs.logits)

SequenceClassifierOutput(loss=None, logits=tensor([[-0.0604,  0.3207],
        [-0.1486,  0.2225]]), hidden_states=None, attentions=None)

tensor([[-0.0604,  0.3207],
        [-0.1486,  0.2225]])


### Problem 1b: Understand BERT Inputs (Written, 10 Points)

Look at the tokenized inputs from two code cells above. The inputs are represented as a dict with three keys: `'input_ids'`, `'token_type_ids'`, and `'attention_mask'`. What does each of those three inputs represent? Please consult the [original BERT paper (Devlin et al., 2018)](https://arxiv.org/abs/1810.04805) for guidance.

### Problem 1c: Understand BERT Hyperparameters (Written, 10 Points)

For this assignment, you will perform hyperparameter tuning for the BERT$_\text{tiny}$ model using the same procedure as in the [original paper](https://arxiv.org/abs/1908.08962). Their hyperparameter tuning procedure is documented in the [official BERT GitHub repository](https://github.com/google-research/bert) under the heading "**\*\*\*\*\*New March 11th, 2020: Smaller BERT Models\*\*\*\*\***." Please read this documentation and describe how hyperparameter tuning was performed for the GLUE benchmark.
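If the procedure you find amounts to trying every combination of a few candidate values (a grid search), the set of configurations can be enumerated as in the sketch below. This is only an illustration: the candidate values are hypothetical placeholders, not the values from the BERT repository, and the variable names `batch_sizes`, `learning_rates`, and `grid` are chosen for the example.

```python
from itertools import product

# Placeholder candidate values (hypothetical; substitute the values you
# find in the documentation for Problem 1c).
batch_sizes = [16, 32]
learning_rates = [1e-4, 1e-5]

# A grid search evaluates every combination exactly once
grid = [{"per_device_train_batch_size": b, "learning_rate": lr}
        for b, lr in product(batch_sizes, learning_rates)]

for config in grid:
    print(config)
print(len(grid), "configurations")
```

The number of training runs grows multiplicatively with each hyperparameter added to the grid, which is why the candidate lists are usually kept short.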

### Problem 1d: Prepare Dataset (Code, 10 Points)

As in lab, we will be using the IMDb dataset provided by 🤗 Datasets.

In [None]:
# Load IMDb dataset and create validation split
imdb = load_dataset("imdb")
split = imdb["train"].train_test_split(.2, seed=3463)
imdb["train"] = split["train"]
imdb["val"] = split["test"]
del imdb["unsupervised"]

The 🤗 Transformers fine-tuning API expects datasets to be pre-processed through the following steps.
- All input texts should be tokenized.
- BERT models have a maximum input length, and all inputs need to be truncated to this length.
- Inputs shorter than the maximum input length should be padded to this length.
- The pre-processed inputs do not need to be in the form of PyTorch tensors.

These steps are performed by the `preprocess_dataset` function in `train_model.py`, which you will implement for this problem.
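The truncation and padding steps can be sketched in plain Python, independent of any tokenizer. This is an illustration under stated assumptions, not the required implementation: the function name `pad_or_truncate`, the toy maximum length of 6, and the extra token ID in the second example are all made up for the sketch. The pad ID of 0 and the first list of IDs match the tokenizer output shown in Problem 1a.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int,
                    pad_id: int = 0) -> Dict[str, List[int]]:
    """Truncate or right-pad a list of token IDs to exactly max_length."""
    # Truncation: drop tokens past the limit. (A real tokenizer also keeps
    # the final [SEP] token when truncating; this sketch simply slices.)
    kept = token_ids[:max_length]
    num_pad = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * num_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * num_pad,
    }


short_example = pad_or_truncate([101, 7592, 2088, 999, 102], max_length=6)
long_example = pad_or_truncate([101, 2129, 2024, 2017, 1029, 2651, 102],
                               max_length=6)
print(short_example)
print(long_example)
```

In your actual solution, the tokenizer performs both steps for you when it is called with the appropriate keyword arguments, so you should not need to write this logic by hand.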

In [None]:
imdb["train"] = preprocess_dataset(imdb["train"], tokenizer)
imdb["val"] = preprocess_dataset(imdb["val"], tokenizer)
imdb["test"] = preprocess_dataset(imdb["test"], tokenizer)

# Visualize the preprocessed dataset
for k, v in imdb["val"][:2].items():
    print("{}:\n{}\n{}\n".format(k, type(v),
                                 [item[:20] if isinstance(item, Iterable) else
                                 item for item in v[:5]]))


Please base your implementation on the [Hugging Face fine-tuning tutorial](https://huggingface.co/docs/transformers/training), and please consult [Appendix A.2 of the BERT paper](https://arxiv.org/abs/1810.04805) to find out what the maximum input length should be.

## Problem 2: Implement Experiment (50 Points in Total)
### Problem 2a: Freeze Non-Bias Weights (Code, 10 Points)

At the end of this assignment, you will be comparing a BERT$_{\text{tiny}}$ model fine-tuned using BitFit to a BERT$_{\text{tiny}}$ model fine-tuned _without_ BitFit. To run that experiment, you will need to support freezing all non-bias parameters of the model. To do this, please implement the `init_model` function, illustrated below. This function should load a pre-trained BERT classifier model from the Hugging Face Model Hub and optionally freeze non-bias parameters.

In [None]:
# The first parameter is unused; we just pass None to it
model = init_model(None, "prajjwal1/bert-tiny", use_bitfit=True)

# Check if weight matrix is frozen (a frozen parameter has requires_grad == False)
print(model.bert.encoder.layer[0].attention.self.query.weight.requires_grad)

# Check if bias term is frozen (a trainable parameter has requires_grad == True)
print(model.bert.encoder.layer[0].attention.self.query.bias.requires_grad)

**Hint:** Please consult the [documentation for the function `nn.Module.named_parameters`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.named_parameters).
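
As a rough illustration of how `named_parameters` can be used to inspect and freeze parameters, the sketch below works on a toy two-layer network rather than on BERT. The helper name `freeze_non_bias` and the rule "a parameter is a bias term if its name ends in `bias`" are assumptions made for this example; deciding which of BERT's parameters count as bias terms is part of the problem.

```python
import torch.nn as nn


def freeze_non_bias(module: nn.Module) -> nn.Module:
    """Freeze every parameter whose name does not end in 'bias'.

    Hypothetical helper; the name-based rule is an assumption.
    """
    for name, param in module.named_parameters():
        # requires_grad=False excludes the parameter from gradient updates
        param.requires_grad = name.endswith("bias")
    return module


# Toy model standing in for a pre-trained network
toy = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))
freeze_non_bias(toy)

for name, param in toy.named_parameters():
    print(name, param.requires_grad)
```

Parameter names in a `Sequential` are prefixed by the layer index (for example `0.weight`), whereas names in a BERT model are longer dotted paths.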

### Problem 2b: Set Up Trainer and Tester (Code, 20 Points)

🤗 Transformers provides a [`Trainer` object](https://huggingface.co/docs/transformers/main_classes/trainer) that implements training and testing a neural network. For this problem, please implement the functions `init_trainer` in `train_model.py` and `init_tester` in `test_model.py`, which will set up the `Trainer`s used to train and test your model, respectively.

In [None]:
# Creates a Trainer from a Hugging Face Model Hub identifier
trainer = init_trainer("prajjwal1/bert-tiny", imdb["train"], imdb["val"])

# Train using the trainer
trainer.train()

# Change this to whichever checkpoint you want to evaluate
eval_checkpoint_directory = "checkpoints/run-13/checkpoint-1252"

# Creates a Trainer to test a Hugging Face saved model
tester = init_tester(eval_checkpoint_directory)

Your `init_trainer` function needs to support the following.
- The training configuration (total number of epochs, early stopping criteria if any) must match your answer for Problem 1c.
- Your `Trainer` needs to save the model obtained during each training run to a folder called `checkpoints`.
- You should leave the `model` keyword parameter blank and instead pass an argument to the `model_init` keyword parameter.
- It should evaluate models based on accuracy.

Your `init_tester` function needs to support the following.
- The `Trainer` should only support testing and not training.
- It should evaluate models based on accuracy.


Please use the [Hugging Face fine-tuning tutorial](https://huggingface.co/docs/transformers/training) as well as [this forum post](https://discuss.huggingface.co/t/using-trainer-at-inference-time/9378/3) for guidance. You may need to create new functions for this problem, and you may find it useful to learn about [lambda expressions](https://realpython.com/python-lambda/) if you don't know about them already.
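
Two of the requirements above can be sketched without loading any model: an accuracy metric computed from logits, and a lambda expression that turns a function with arguments into the zero-argument callable that `model_init` accepts. The names `compute_accuracy`, `load_model`, and `make_model_init` are hypothetical, and `load_model` is only a stand-in that returns a dictionary instead of a real model.

```python
import numpy as np


def compute_accuracy(eval_pred):
    """Compute accuracy from a (logits, labels) pair.

    Hypothetical metric function in the style of the fine-tuning tutorial.
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # most probable class per row
    return {"accuracy": float((predictions == labels).mean())}


def load_model(name, use_bitfit):
    """Stand-in for a real model loader (assumption for this sketch)."""
    return {"name": name, "use_bitfit": use_bitfit}


def make_model_init(name, use_bitfit):
    # The lambda captures name and use_bitfit, so the caller can
    # invoke it later with no arguments to build a fresh model.
    return lambda: load_model(name, use_bitfit)


toy_logits = np.array([[0.1, 0.9], [2.0, -1.0], [0.3, 0.4], [1.0, 0.0]])
toy_labels = np.array([1, 0, 0, 0])
metrics = compute_accuracy((toy_logits, toy_labels))
print(metrics)

model_init = make_model_init("toy-model", use_bitfit=True)
print(model_init())
```

Building the model inside a callable means a new copy can be created at the start of every run, which matters once several hyperparameter trials are trained in sequence.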

### Problem 2c: Set Up Hyperparameter Tuning (Code, 20 Points)

Finally, to complete the experiment setup, you will implement hyperparameter tuning using the [Optuna](https://optuna.org/) framework. Optuna is integrated with 🤗 Transformers, and it can be invoked via the `Trainer.hyperparameter_search` method. Please implement the function `hyperparameter_search_settings` in `train_model.py` by returning the correct keyword arguments for `Trainer.hyperparameter_search`. (Observe that, at the end of `train_model.py`, these keyword arguments are passed to `Trainer.hyperparameter_search` via dictionary unpacking.)

Your code should support the following requirements.
- Your hyperparameter tuning configuration must match your answer for Problem 1c.
- You must use Optuna for hyperparameter tuning.
- You must indicate to Optuna that the hyperparameter search should maximize accuracy.

Please use the following resources for guidance.
- [The Hugging Face tutorial on hyperparameter tuning](https://huggingface.co/docs/transformers/hpo_train)
- [The documentation for `Trainer.hyperparameter_search`](https://huggingface.co/docs/transformers/v4.26.1/en/main_classes/trainer#transformers.Trainer.hyperparameter_search)
- [The documentation for Optuna's `GridSampler`](https://optuna.readthedocs.io/en/v2.0.0/reference/generated/optuna.samplers.GridSampler.html)

## Problem 3: Run Experiment (20 Points in Total)

To complete the assignment, you will now run your code and report on the results. It is recommended that you run your code on [Google Colaboratory](https://colab.research.google.com/) using a free GPU.

### Problem 3a: Train Models (Code and Written, 10 Points)

Please now run the following experimental procedure by running `train_model.py` as a Python script:
- first, fine-tune a BERT$_{\text{tiny}}$ model on the IMDb dataset _with_ BitFit;
- then, fine-tune a BERT$_{\text{tiny}}$ model on the IMDb dataset _without_ BitFit.

The `train_model.py` script should write a pickle file containing information about the best hyperparameters found during hyperparameter tuning. Please submit this file, using the filenames `train_results_with_bitfit.p` and `train_results_without_bitfit.p` for your two training runs, respectively. Please also report the highest validation accuracy attained in each of your two training runs, as well as the hyperparameters used in those trials. Please format these results as a table such as the following.

| | Validation Accuracy | Learning Rate | Batch Size |
|---|---|---|---|
| Without BitFit | | | |
| With BitFit | | | |

### Problem 3b: Test Models and Report Results (Code and Written, 10 Points)

For each of your two training runs, please test the model that attained the best validation accuracy across all hyperparameter tuning trials. You may do so by running the `test_model.py` script. Once testing is complete, please report your results in the form of a table such as the following.

| | # Trainable Parameters | Test Accuracy |
|---|---|---|
| Without BitFit | | |
| With BitFit | | |

The `test_model.py` script should write a pickle file containing information about the test results. Please submit this file, using the filenames `test_results_with_bitfit.p` and `test_results_without_bitfit.p` for your two tests, respectively.
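
One possible way to obtain the number of trainable parameters is to sum the sizes of all parameters that still require gradients. The sketch below does this for a toy linear layer; the helper name `count_trainable_parameters` is an assumption, and the same idea should carry over to a Hugging Face model, since it is also an `nn.Module`.

```python
import torch.nn as nn


def count_trainable_parameters(module: nn.Module) -> int:
    """Count scalar parameters that will be updated during training."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


toy = nn.Linear(4, 3)                       # 4 * 3 weights + 3 biases
n_all = count_trainable_parameters(toy)
print(n_all)                                # 15

toy.weight.requires_grad = False            # freeze the weight matrix
n_bias_only = count_trainable_parameters(toy)
print(n_bias_only)                          # 3
```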

Finally, please comment on your results. How do they compare to the results reported by Zaken et al. (2020)? What does this say about BitFit and its applicability to other pre-trained Transformers?