# HW 2: Efficient Fine-Tuning with BitFit
**Due: March 13, 11:30 AM**

In this homework assignment, you will replicate [the BitFit experiments (Ben Zaken et al., 2022)](https://aclanthology.org/2022.acl-short.1/). You will first use the [🤗 Transformers framework](https://huggingface.co/docs/transformers/index) to fine-tune a [BERT$_\text{tiny}$ model](https://huggingface.co/prajjwal1/bert-tiny) ([Turc et al., 2019](https://arxiv.org/abs/1908.08962); [Bhargava et al., 2021](https://aclanthology.org/2021.insights-1.18/)) on the IMDb dataset. You will then fine-tune the same model, but with all parameters frozen other than the bias terms. You will compare the two models on the following metrics: (1) their accuracy on the IMDb test set and (2) the number of parameters trained during fine-tuning.

## Important: Read Before Starting

In the following exercises, you will need to implement functions defined in the `train_model.py` and `test_model.py` scripts. **Please write all your code in those files.** You should not submit this notebook with your solutions, and we will not grade it if you do. Please be aware that code written in a Jupyter notebook may run differently when copied into Python modules.

The outputs shown in this notebook are the outputs that you should get **when all problems have been completed correctly**. You may obtain different results if you attempt to run the code cells before you have completed the problem set, or if you have completed one or more problems incorrectly.

For part of this assignment, you will be asked to fine-tune a BERT$_\text{tiny}$ model on the IMDb dataset with hyperparameter tuning. **This will take several hours to run on a laptop with a CPU.** You may want to instead run your code on [Google Colaboratory](https://colab.research.google.com/) using a free GPU.

To begin, please run the following `import` statements.

In [1]:
! pip install datasets evaluate optuna --quiet # install datasets if it is not included in your environment


In [2]:
!pip install evaluate



In [3]:
!pip install optuna



In [4]:
import torch
from collections.abc import Iterable
from datasets import load_dataset

# Model and tokenizer from 🤗 Transformers
from transformers import AutoModelForSequenceClassification, \
    BertForSequenceClassification, BertTokenizerFast



In [5]:
from google.colab import drive
drive.mount('/content/drive')
import sys
sys.path.append('/content/drive/MyDrive/NLU/HW2/')


Mounted at /content/drive


In [6]:
# Code you will write for this assignment
from train_model import init_model, preprocess_dataset, init_trainer

In [7]:
from test_model import init_tester

## Problem 1: Setup (30 Points in Total)

In this assignment, you will fine-tune a pre-trained Transformer model using libraries provided by [Hugging Face](https://huggingface.co/) (whose name is usually styled using the emoji 🤗). You have already been exposed to Hugging Face in lab, where you used the [🤗 Datasets](https://huggingface.co/docs/datasets/index) library to load the IMDb dataset and the [🤗 Transformers](https://huggingface.co/docs/transformers/index) library to load a pre-trained BERT$_\text{tiny}$ model. In the following problems, you will additionally use the [🤗 Evaluate](https://huggingface.co/docs/evaluate/index) library, which provides evaluation metrics such as accuracy and F1.

For several parts of this problem, you will need to refer to the [Hugging Face fine-tuning tutorial](https://huggingface.co/docs/transformers/training) for guidance.

### Problem 1a: Understand the 🤗 Transformers Library (No Submission, 0 Points)

🤗 Transformers is imported into Python via the name `transformers`. Please find the import statements from 🤗 Transformers in the code cell above.

🤗 Transformers comes with a number of different Transformer architectures, as well as [the Model Hub, a repository of pre-trained model parameters](https://huggingface.co/models). A pre-trained model is loaded by calling the model architecture's `.from_pretrained` method.

In [8]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The code above loads a Transformer classifier consisting of a pre-trained BERT$_\text{base}$ encoder with case-insensitive vocabulary and a randomly initialized 2-layer MLP decoder with tanh activation. The choice of this particular set of pre-trained parameters is specified by the identifier `'bert-base-uncased'`, which is passed to the first parameter of `.from_pretrained`. Different pre-trained weights can be loaded by passing a different identifier to `.from_pretrained`. The following code loads the BERT$_\text{tiny}$ model from [Turc et al. (2019)](https://arxiv.org/abs/1908.08962) and [Bhargava et al. (2021)](https://aclanthology.org/2021.insights-1.18/), which you will be fine-tuning in this assignment. (The `/` indicates that this is a user-submitted model, uploaded by the user [`prajjwal1`](https://huggingface.co/prajjwal1).)

In [9]:
model = BertForSequenceClassification.from_pretrained(
    "prajjwal1/bert-tiny", num_labels=2)

config.json:   0%|          | 0.00/285 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/17.8M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In order to load a model using the code above, you would have to know that BERT$_{\text{tiny}}$'s architecture is implemented using the same class as BERT$_{\text{base}}$. This is not true in general, however. For instance, if you wanted to initialize a RoBERTa classifier instead of a BERT classifier, you would need to call `RobertaForSequenceClassification.from_pretrained` instead of `BertForSequenceClassification.from_pretrained`. When you don't know which class implements the architecture of the pre-trained model you want to load, you can use the `AutoModelForSequenceClassification` class ([and equivalent classes for other tasks](https://huggingface.co/docs/transformers/model_doc/auto)), which will figure out which class to instantiate based on the pre-trained weights you would like to load.

In [10]:
# This code does exactly the same thing as the previous code cell
model = AutoModelForSequenceClassification.from_pretrained(
    "prajjwal1/bert-tiny", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In addition to models, 🤗 Transformers also provides tokenizers that implement a full processing pipeline similar to what you implemented in HW 2. You can load the appropriate tokenizer for your model using a `.from_pretrained` method, just as you did with the model.

In [11]:
tokenizer = BertTokenizerFast.from_pretrained("prajjwal1/bert-tiny")

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

As we saw in lab, the tokenizer object can be called as a function. Doing so will return a fully processed input, ready to be passed to the model.

In [12]:
# Because 🤗 Transformers supports multiple deep learning libraries, you will
# need to use the keyword parameter return_tensors in order to indicate that
# you want your inputs to be returned in PyTorch format.
inputs = tokenizer(["Hello world!", "How are you?"], padding=True,
                   return_tensors="pt")
inputs

{'input_ids': tensor([[ 101, 7592, 2088,  999,  102,    0],
        [ 101, 2129, 2024, 2017, 1029,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1]])}

The inputs returned by the tokenizer are passed to the model via [dictionary unpacking](https://realpython.com/python-kwargs-and-args/). The output of the model is structured, with various kinds of information provided depending on keyword arguments passed to the model.

In [13]:
model.eval()
with torch.no_grad():
    outputs = model(**inputs)

print(outputs, end="\n\n")

# Use the dot operator to access parts of the output
print(outputs.logits)

SequenceClassifierOutput(loss=None, logits=tensor([[-0.1833,  0.0153],
        [-0.3145,  0.0678]]), hidden_states=None, attentions=None)

tensor([[-0.1833,  0.0153],
        [-0.3145,  0.0678]])
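The logits printed above are unnormalized scores, one column per class. As a minimal sketch of how they map to predictions, the code below applies a softmax and an argmax in plain Python to the values copied from the output above (rather than to the tensor itself), so it runs without any dependencies. The helper name `softmax` is introduced here for illustration only.

```python
import math

# Logits copied from the output above: one row per input sentence,
# one column per class.
logits = [[-0.1833, 0.0153],
          [-0.3145, 0.0678]]


def softmax(row):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(row)
    exps = [math.exp(x - largest) for x in row]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


probabilities = [softmax(row) for row in logits]

# The predicted class is the index of the largest logit in each row.
predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]

print(probabilities)
print(predictions)
```

With a PyTorch tensor, the same predictions would typically be obtained with `outputs.logits.argmax(dim=-1)`. Because the classifier weights are newly initialized at this point, these predictions are not yet meaningful.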


### Problem 1b: Understand BERT Inputs (Written, 10 Points)

Look at the tokenized inputs from two code cells above. The inputs are represented as a dict with three keys: `'input_ids'`, `'token_type_ids'`, and `'attention_mask'`. What does each of those three inputs represent? Please consult the [original BERT paper (Devlin et al., 2018)](https://arxiv.org/abs/1810.04805) for guidance.

### Problem 1c: Understand BERT Hyperparameters (Written, 10 Points)

For this assignment, you will perform hyperparameter tuning for the BERT$_\text{tiny}$ model using the same procedure as in the [original paper](https://arxiv.org/abs/1908.08962). Their hyperparameter tuning procedure is documented in the [official BERT GitHub repository](https://github.com/google-research/bert) under the heading "**\*\*\*\*\*New March 11th, 2020: Smaller BERT Models\*\*\*\*\***." Please read this documentation and describe how hyperparameter tuning was performed for the GLUE benchmark.
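As a general illustration of grid-style tuning (not an answer to this problem), the sketch below enumerates every combination of a few candidate values and keeps the configuration with the highest validation accuracy. The candidate values are arbitrary placeholders chosen only for illustration, and `fake_validation_accuracy` is a hypothetical stand-in; in a real run, that step would fine-tune the model and evaluate it on the validation set.

```python
from itertools import product

# Arbitrary placeholder values, chosen only so that the sketch runs.
batch_sizes = [16, 32]
learning_rates = [1e-4, 1e-5]
num_epochs = 2


def fake_validation_accuracy(batch_size, learning_rate, epochs):
    """Hypothetical stand-in for 'fine-tune, then evaluate on validation data'.

    Returns a made-up deterministic score so the sketch is self-contained.
    """
    return 0.5 + 1000 * learning_rate + 0.001 * batch_size + 0.01 * epochs


results = []
for batch_size, learning_rate in product(batch_sizes, learning_rates):
    accuracy = fake_validation_accuracy(batch_size, learning_rate, num_epochs)
    results.append({"batch_size": batch_size,
                    "learning_rate": learning_rate,
                    "accuracy": accuracy})

# Accuracy is a higher-is-better metric, so the best configuration is the maximum.
best = max(results, key=lambda r: r["accuracy"])
print(best)
```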

### Problem 1d: Prepare Dataset (Code, 10 Points)

As in lab, we will be using the IMDb dataset provided by 🤗 Datasets.

In [14]:
# Load IMDb dataset and create validation split
imdb = load_dataset("imdb")
split = imdb["train"].train_test_split(.2, seed=3463)
imdb["train"] = split["train"]
imdb["val"] = split["test"]
del imdb["unsupervised"]

README.md:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

The 🤗 Transformers fine-tuning API expects datasets to be pre-processed through the following steps.
- All input texts should be tokenized.
- BERT models have a maximum input length, and all inputs need to be truncated to this length.
- Inputs shorter than the maximum input length should be padded to this length.
- The pre-processed inputs do not need to be in the form of PyTorch tensors.

These steps are performed by the `preprocess_dataset` function in `train_model.py`, which you will implement for this problem.
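To make the truncation and padding steps concrete, here is a minimal sketch in plain Python that operates on a list of token IDs. The helper `pad_or_truncate` is hypothetical and is not part of 🤗 Transformers, and `max_length=8` is an arbitrary small value for illustration, not the maximum input length you are asked to look up. The sketch also simplifies matters by cutting off the end of a long input, whereas a real BERT tokenizer keeps the final `[SEP]` token when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    input_ids = list(token_ids[:max_length])   # truncate long inputs
    attention_mask = [1] * len(input_ids)      # 1 marks a real token
    num_padding = max_length - len(input_ids)
    input_ids += [pad_id] * num_padding        # pad short inputs
    attention_mask += [0] * num_padding        # 0 marks a padding position
    return input_ids, attention_mask


# Token IDs for "Hello world!" taken from the tokenizer output shown earlier
short_ids, short_mask = pad_or_truncate([101, 7592, 2088, 999, 102], max_length=8)
print(short_ids)
print(short_mask)

# An input longer than max_length is cut down to max_length
long_ids, long_mask = pad_or_truncate(list(range(1, 21)), max_length=8)
print(long_ids)
print(long_mask)
```

With the 🤗 tokenizer itself, the same behavior is controlled through the `truncation`, `padding`, and `max_length` keyword arguments of the tokenizer call.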

In [15]:
imdb["train"] = preprocess_dataset(imdb["train"], tokenizer)
imdb["val"] = preprocess_dataset(imdb["val"], tokenizer)
imdb["test"] = preprocess_dataset(imdb["test"], tokenizer)

# Visualize the preprocessed dataset
for k, v in imdb["val"][:2].items():
    print("{}:\n{}\n{}\n".format(k, type(v),
                                 [item[:20] if isinstance(item, Iterable) else
                                 item for item in v[:5]]))

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

text:
<class 'list'>
['As so many others ha', 'When converting a bo']

label:
<class 'list'>
[1, 0]

input_ids:
<class 'list'>
[[101, 2004, 2061, 2116, 2500, 2031, 2517, 1010, 2023, 2003, 1037, 6919, 4516, 1012, 2182, 2003, 1037, 2862, 1997, 1996], [101, 2043, 16401, 1037, 2338, 2000, 2143, 1010, 2009, 2003, 3227, 1037, 2204, 2801, 2000, 2562, 2012, 2560, 2070, 1997]]

token_type_ids:
<class 'list'>
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

attention_mask:
<class 'list'>
[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]



Please base your implementation on the [Hugging Face fine-tuning tutorial](https://huggingface.co/docs/transformers/training), and please consult [Appendix A.2 of the BERT paper](https://arxiv.org/abs/1810.04805) to find out what the maximum input length should be.

## Problem 2: Implement Experiment (50 Points in Total)
### Problem 2a: Freeze Non-Bias Weights (Code, 10 Points)

At the end of this assignment, you will be comparing a BERT$_{\text{tiny}}$ model fine-tuned using BitFit to a BERT$_{\text{tiny}}$ model fine-tuned _without_ BitFit. To run that experiment, you will need to support freezing all non-bias parameters of the model. To do this, please implement the `init_model` function, illustrated below. This function should load a pre-trained BERT classifier model from the Hugging Face Model Hub and optionally freeze non-bias parameters.

In [16]:
# The first parameter is unused; we just pass None to it
model = init_model(None, "prajjwal1/bert-tiny", use_bitfit=True)

# Check if weight matrix is frozen
print(model.bert.encoder.layer[0].attention.self.query.weight.requires_grad)

# Check if bias term is frozen
print(model.bert.encoder.layer[0].attention.self.query.bias.requires_grad)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


False
True


**Hint:** Please consult the [documentation for the function `nn.Module.named_parameters`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.named_parameters).
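As a sketch of how `named_parameters` can be used for this kind of freezing, the code below works on a toy PyTorch model rather than on BERT. The toy model and the name-based rule (`name.endswith("bias")`) are assumptions made for illustration; which BERT parameters should count as bias terms, and how the classification head should be treated, is left to your implementation.

```python
from torch import nn

# A toy model standing in for BERT; only the parameter names matter here.
toy_model = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 2))

for name, param in toy_model.named_parameters():
    # Names look like "0.weight" and "0.bias"; keep only bias terms trainable.
    param.requires_grad = name.endswith("bias")

# Count parameters that will receive gradient updates, and all parameters.
trainable = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in toy_model.parameters())
print(f"trainable parameters: {trainable} / {total}")
```

The same two sums can be used to report the number of parameters trained with and without freezing.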

### Problem 2b: Set Up Trainer and Tester (Code, 20 Points)

🤗 Transformers provides a [`Trainer` object](https://huggingface.co/docs/transformers/main_classes/trainer) that implements training and testing a neural network. For this problem, please implement the functions `init_trainer` in `train_model.py` and `init_tester` in `test_model.py`, which will set up the `Trainer`s used to train and test your model, respectively.
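`Trainer` accepts a `compute_metrics` callable, which receives the model's predictions together with the gold labels and returns a dict of metric values. The sketch below computes accuracy directly with NumPy so that it runs without downloading anything. The function body is an illustrative assumption; the assignment may instead expect the accuracy metric from 🤗 Evaluate.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Compute accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # most likely class per example
    return {"accuracy": float((predictions == labels).mean())}


# Toy check: three examples, two classes
toy_logits = np.array([[0.2, 0.8], [1.5, -0.3], [0.1, 0.4]])
toy_labels = np.array([1, 0, 0])
print(compute_metrics((toy_logits, toy_labels)))
```

A function of this shape would be passed to the `Trainer` constructor through its `compute_metrics` argument.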


In [17]:
# Creates a Trainer from a Hugging Face Model Hub identifier
trainer = init_trainer("prajjwal1/bert-tiny", imdb["train"], imdb["val"])

# Train using the trainer
trainer.train()

# Change this to whichever checkpoint you want to evaluate


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3684,0.35852,0.8646
2,0.2311,0.407939,0.87
3,0.3663,0.525805,0.8592
4,0.2641,0.446904,0.8844


Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

TrainOutput(global_step=10000, training_loss=0.32413340103626254, metrics={'train_runtime': 308.3359, 'train_samples_per_second': 259.457, 'train_steps_per_second': 32.432, 'total_flos': 101638963200000.0, 'train_loss': 0.32413340103626254, 'epoch': 4.0})
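A note on the search that follows: `Trainer.hyperparameter_search` takes a `direction` argument whose default is `"minimize"`. When the objective is a higher-is-better metric such as validation accuracy, the search needs `direction="maximize"` in order to report the most accurate trial as the best one. The sketch below illustrates the difference in plain Python, using the trial values from the Optuna log lines in the output of the next cell. Treating those values as validation accuracies is an assumption, based on their matching the final accuracy in each per-epoch table.

```python
# Final values of the finished trials, copied from the Optuna log lines
# in the search output below (pruned trials 6 and 7 are omitted).
trial_values = {
    0: 0.8754, 1: 0.8734, 2: 0.8232, 3: 0.8432, 4: 0.8896, 5: 0.8708,
    8: 0.6652, 9: 0.649, 10: 0.5018, 11: 0.4934,
}

# With a "minimize" objective, the trial with the lowest value is best.
best_if_minimizing = min(trial_values, key=trial_values.get)

# With a "maximize" objective, the trial with the highest value is best.
best_if_maximizing = max(trial_values, key=trial_values.get)

print(best_if_minimizing, trial_values[best_if_minimizing])
print(best_if_maximizing, trial_values[best_if_maximizing])
```

In the call itself, this would correspond to passing `direction="maximize"` to `trainer.hyperparameter_search`.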

In [18]:
best_trian_without_bitfit = trainer.hyperparameter_search()

# Save best hyperparameters


[I 2025-03-19 16:02:49,164] A new study created in memory with name: no-name-06088dc3-5bcc-4dda-80fb-abd086f3c2b3
wandb run summary:
eval/accuracy,0.8844
eval/loss,0.4469
eval/runtime,9.3073
eval/samples_per_second,537.212
eval/steps_per_second,67.151
total_flos,101638963200000.0
train/epoch,4.0
train/global_step,10000.0
train/grad_norm,67.72562
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2015,0.369717,0.8654
2,0.3773,0.362254,0.8754


[I 2025-03-19 16:05:10,626] Trial 0 finished with value: 0.8754 and parameters: {'learning_rate': 6.599438342652606e-05, 'num_train_epochs': 2, 'seed': 22, 'per_device_train_batch_size': 8}. Best is trial 0 with value: 0.8754.
wandb run summary:
eval/accuracy,0.8754
eval/loss,0.36225
eval/runtime,9.1685
eval/samples_per_second,545.343
eval/steps_per_second,68.168
total_flos,50819481600000.0
train/epoch,2.0
train/global_step,5000.0
train/grad_norm,31.67848
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.436,0.361945,0.8452
2,0.3105,0.327455,0.867
3,0.26,0.325276,0.8724
4,0.2588,0.331959,0.8734


[I 2025-03-19 16:09:09,812] Trial 1 finished with value: 0.8734 and parameters: {'learning_rate': 3.793728770657766e-05, 'num_train_epochs': 4, 'seed': 12, 'per_device_train_batch_size': 16}. Best is trial 1 with value: 0.8734.
wandb run summary:
eval/accuracy,0.8734
eval/loss,0.33196
eval/runtime,7.8442
eval/samples_per_second,637.416
eval/steps_per_second,79.677
total_flos,101638963200000.0
train/epoch,4.0
train/global_step,5000.0
train/grad_norm,2.47356
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.5123,0.53593,0.739
2,0.4319,0.44879,0.7976
3,0.36,0.418887,0.8144
4,0.3517,0.400156,0.8232


[I 2025-03-19 16:13:08,765] Trial 2 finished with value: 0.8232 and parameters: {'learning_rate': 1.027469477669979e-05, 'num_train_epochs': 4, 'seed': 24, 'per_device_train_batch_size': 16}. Best is trial 2 with value: 0.8232.
wandb run summary:
eval/accuracy,0.8232
eval/loss,0.40016
eval/runtime,7.7671
eval/samples_per_second,643.745
eval/steps_per_second,80.468
total_flos,101638963200000.0
train/epoch,4.0
train/global_step,5000.0
train/grad_norm,23.78348
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3199,0.462979,0.8286
2,0.4626,0.517468,0.8432


[I 2025-03-19 16:16:17,174] Trial 3 finished with value: 0.8432 and parameters: {'learning_rate': 1.6394864749315943e-05, 'num_train_epochs': 2, 'seed': 29, 'per_device_train_batch_size': 4}. Best is trial 2 with value: 0.8232.
wandb run summary:
eval/accuracy,0.8432
eval/loss,0.51747
eval/runtime,9.6924
eval/samples_per_second,515.869
eval/steps_per_second,64.484
total_flos,50819481600000.0
train/epoch,2.0
train/global_step,10000.0
train/grad_norm,67.49819
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2425,0.378494,0.8742
2,0.3735,0.379259,0.8856
3,0.0882,0.465184,0.8862
4,0.1096,0.48652,0.8896


[I 2025-03-19 16:20:59,521] Trial 4 finished with value: 0.8896 and parameters: {'learning_rate': 9.170857069053858e-05, 'num_train_epochs': 4, 'seed': 22, 'per_device_train_batch_size': 8}. Best is trial 2 with value: 0.8232.
wandb run summary:
eval/accuracy,0.8896
eval/loss,0.48652
eval/runtime,7.6515
eval/samples_per_second,653.466
eval/steps_per_second,81.683
total_flos,101638963200000.0
train/epoch,4.0
train/global_step,10000.0
train/grad_norm,118.09566
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.4205,0.484203,0.851
2,0.3273,0.584218,0.8482
3,0.3392,0.489666,0.8706
4,0.053,0.524114,0.8708


[I 2025-03-19 16:27:11,958] Trial 5 finished with value: 0.8708 and parameters: {'learning_rate': 2.2254777772526316e-05, 'num_train_epochs': 4, 'seed': 8, 'per_device_train_batch_size': 4}. Best is trial 2 with value: 0.8232.
wandb run summary:
eval/accuracy,0.8708
eval/loss,0.52411
eval/runtime,7.6549
eval/samples_per_second,653.18
eval/steps_per_second,81.648
total_flos,101638963200000.0
train/epoch,4.0
train/global_step,20000.0
train/grad_norm,0.10277
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.452,0.399209,0.8286
2,0.3659,0.356896,0.8496


[I 2025-03-19 16:29:06,401] Trial 6 pruned. 
wandb run summary:
eval/accuracy,0.8496
eval/loss,0.3569
eval/runtime,9.0617
eval/samples_per_second,551.773
eval/steps_per_second,68.972
train/epoch,2.0
train/global_step,1250.0
train/grad_norm,9.5109
train/learning_rate,1e-05
train/loss,0.3659


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2929,0.339949,0.8648
2,0.2651,0.405239,0.8612
3,0.4546,0.396552,0.8802
4,0.398,0.418643,0.8806


[I 2025-03-19 16:33:55,360] Trial 7 pruned. 
wandb run summary:
eval/accuracy,0.8806
eval/loss,0.41864
eval/runtime,9.2056
eval/samples_per_second,543.145
eval/steps_per_second,67.893
train/epoch,4.0
train/global_step,10000.0
train/grad_norm,63.53713
train/learning_rate,0.0
train/loss,0.398


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6622,0.655207,0.641
2,0.6543,0.635273,0.6582
3,0.6193,0.627763,0.6652


[I 2025-03-19 16:38:37,380] Trial 8 finished with value: 0.6652 and parameters: {'learning_rate': 1.160928293027596e-06, 'num_train_epochs': 3, 'seed': 4, 'per_device_train_batch_size': 4}. Best is trial 8 with value: 0.6652.
wandb run summary:
eval/accuracy,0.6652
eval/loss,0.62776
eval/runtime,9.3865
eval/samples_per_second,532.68
eval/steps_per_second,66.585
total_flos,76229222400000.0
train/epoch,3.0
train/global_step,15000.0
train/grad_norm,5.24427
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6909,0.669095,0.611
2,0.6574,0.654137,0.6428
3,0.6493,0.64883,0.649


[I 2025-03-19 16:43:13,199] Trial 9 finished with value: 0.649 and parameters: {'learning_rate': 1.0390014028394358e-06, 'num_train_epochs': 3, 'seed': 35, 'per_device_train_batch_size': 4}. Best is trial 9 with value: 0.649.
wandb run summary:
eval/accuracy,0.649
eval/loss,0.64883
eval/runtime,9.4648
eval/samples_per_second,528.272
eval/steps_per_second,66.034
total_flos,76229222400000.0
train/epoch,3.0
train/global_step,15000.0
train/grad_norm,7.27992
train/learning_rate,0.0


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.6921 | 0.692203 | 0.5018 |


[I 2025-03-19 16:44:10,086] Trial 10 finished with value: 0.5018 and parameters: {'learning_rate': 1.5932966510590878e-06, 'num_train_epochs': 1, 'seed': 40, 'per_device_train_batch_size': 64}. Best is trial 10 with value: 0.5018.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Metric | Value |
|---|---|
| eval/accuracy | 0.5018 |
| eval/loss | 0.6922 |
| eval/runtime | 8.0999 |
| eval/samples_per_second | 617.293 |
| eval/steps_per_second | 77.162 |
| total_flos | 25409740800000.0 |
| train/epoch | 1.0 |
| train/global_step | 313.0 |
| train/grad_norm | 0.82419 |
| train/learning_rate | 0.0 |


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.6943 | 0.694636 | 0.4934 |


[I 2025-03-19 16:45:06,829] Trial 11 finished with value: 0.4934 and parameters: {'learning_rate': 1.0726564771392724e-06, 'num_train_epochs': 1, 'seed': 40, 'per_device_train_batch_size': 64}. Best is trial 11 with value: 0.4934.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Metric | Value |
|---|---|
| eval/accuracy | 0.4934 |
| eval/loss | 0.69464 |
| eval/runtime | 7.6616 |
| eval/samples_per_second | 652.608 |
| eval/steps_per_second | 81.576 |
| total_flos | 25409740800000.0 |
| train/epoch | 1.0 |
| train/global_step | 313.0 |
| train/grad_norm | 0.83996 |
| train/learning_rate | 0.0 |


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.6878 | 0.6875 | 0.5302 |


[I 2025-03-19 16:46:02,822] Trial 12 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Metric | Value |
|---|---|
| eval/accuracy | 0.5302 |
| eval/loss | 0.6875 |
| eval/runtime | 8.0416 |
| eval/samples_per_second | 621.769 |
| eval/steps_per_second | 77.721 |
| train/epoch | 1.0 |
| train/global_step | 313.0 |
| train/grad_norm | 0.79506 |
| train/learning_rate | 0.0 |
| train/loss | 0.6878 |


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.6777 | 0.67496 | 0.614 |


[I 2025-03-19 16:46:59,666] Trial 13 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Metric | Value |
|---|---|
| eval/accuracy | 0.614 |
| eval/loss | 0.67496 |
| eval/runtime | 8.9572 |
| eval/samples_per_second | 558.213 |
| eval/steps_per_second | 69.777 |
| train/epoch | 1.0 |
| train/global_step | 313.0 |
| train/grad_norm | 1.03705 |
| train/learning_rate | 0.0 |
| train/loss | 0.6777 |


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.6857 | 0.683368 | 0.5592 |


[I 2025-03-19 16:47:57,712] Trial 14 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Metric | Value |
|---|---|
| eval/accuracy | 0.5592 |
| eval/loss | 0.68337 |
| eval/runtime | 9.2865 |
| eval/samples_per_second | 538.418 |
| eval/steps_per_second | 67.302 |
| train/epoch | 1.0 |
| train/global_step | 313.0 |
| train/grad_norm | 1.63987 |
| train/learning_rate | 0.0 |
| train/loss | 0.6857 |


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.6678 | 0.668602 | 0.6184 |


[I 2025-03-19 16:48:54,696] Trial 15 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Metric | Value |
|---|---|
| eval/accuracy | 0.6184 |
| eval/loss | 0.6686 |
| eval/runtime | 9.3243 |
| eval/samples_per_second | 536.232 |
| eval/steps_per_second | 67.029 |
| train/epoch | 1.0 |
| train/global_step | 313.0 |
| train/grad_norm | 1.14476 |
| train/learning_rate | 0.0 |
| train/loss | 0.6678 |


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.7001 | 0.687196 | 0.5544 |


[I 2025-03-19 16:49:52,332] Trial 16 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Metric | Value |
|---|---|
| eval/accuracy | 0.5544 |
| eval/loss | 0.6872 |
| eval/runtime | 9.7607 |
| eval/samples_per_second | 512.258 |
| eval/steps_per_second | 64.032 |
| train/epoch | 1.0 |
| train/global_step | 313.0 |
| train/grad_norm | 1.15565 |
| train/learning_rate | 0.0 |
| train/loss | 0.7001 |


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.6796 | 0.67691 | 0.627 |


[I 2025-03-19 16:50:51,659] Trial 17 finished with value: 0.627 and parameters: {'learning_rate': 2.054859347583799e-06, 'num_train_epochs': 1, 'seed': 29, 'per_device_train_batch_size': 32}. Best is trial 11 with value: 0.4934.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Metric | Value |
|---|---|
| eval/accuracy | 0.627 |
| eval/loss | 0.67691 |
| eval/runtime | 8.3182 |
| eval/samples_per_second | 601.092 |
| eval/steps_per_second | 75.136 |
| total_flos | 25613018726400.0 |
| train/epoch | 1.0 |
| train/global_step | 625.0 |
| train/grad_norm | 1.64232 |
| train/learning_rate | 0.0 |


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.6806 | 0.670438 | 0.613 |


[I 2025-03-19 16:51:48,323] Trial 18 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Metric | Value |
|---|---|
| eval/accuracy | 0.613 |
| eval/loss | 0.67044 |
| eval/runtime | 7.7259 |
| eval/samples_per_second | 647.175 |
| eval/steps_per_second | 80.897 |
| train/epoch | 1.0 |
| train/global_step | 313.0 |
| train/grad_norm | 0.92187 |
| train/learning_rate | 0.0 |
| train/loss | 0.6806 |


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.6728 | 0.675728 | 0.5984 |


[I 2025-03-19 16:52:46,086] Trial 19 pruned. 


NameError: name 'pickle' is not defined

In [19]:
import pickle
with open("train_results_without_bitfit.p", "wb") as f:
  pickle.dump(best_trian_without_bitfit, f)

In [None]:
eval_checkpoint_directory = "/content/checkpoints/1/checkpoint-5000"
#/content/checkpoints/run-1742097961/checkpoint-5000
# Creates a Trainer to test a Hugging Face saved model
tester = init_tester(eval_checkpoint_directory)



Your `init_trainer` function needs to support the following.
- The training configuration (total number of epochs, early stopping criteria if any) must match your answer for Problem 1c.
- Your `Trainer` needs to save the model obtained during each training run to a folder called `checkpoints`.
- You should leave the `model` keyword parameter blank and instead pass an argument to the `model_init` keyword parameter.
- It should evaluate models based on accuracy.

Your `init_tester` function needs to support the following.
- The `Trainer` should support only testing, not training.
- It should evaluate models based on accuracy.
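
Both functions must evaluate models based on accuracy. As a rough illustration of what such a metric function could look like, here is a minimal sketch written in plain Python so that it runs without any extra libraries. The name `compute_accuracy` and the toy logits are assumptions made for this example; in practice you would pass a function of this shape to the `compute_metrics` keyword parameter of `Trainer`, and you may prefer to build it on NumPy or the 🤗 Evaluate library.

```python
def compute_accuracy(eval_pred):
    """Return {"accuracy": fraction of examples whose highest logit matches the label}.

    `eval_pred` is assumed to unpack into (logits, labels), as in the
    Hugging Face fine-tuning tutorial.
    """
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy data (hypothetical): four examples, two classes.
toy_logits = [[0.2, 0.8], [1.5, -0.3], [0.1, 0.4], [2.0, 1.0]]
toy_labels = [1, 0, 0, 0]
result = compute_accuracy((toy_logits, toy_labels))
print(result)
```

When a function like this is used during evaluation, `Trainer` reports the metric with an `eval_` prefix, i.e. as `eval_accuracy`.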


Please use the [Hugging Face fine-tuning tutorial](https://huggingface.co/docs/transformers/training) as well as [this forum post](https://discuss.huggingface.co/t/using-trainer-at-inference-time/9378/3) for guidance. You may need to create new functions for this problem, and you may find it useful to learn about [lambda expressions](https://realpython.com/python-lambda/) if you don't know about them already.
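
To see why `model_init` takes a function rather than a model, recall that hyperparameter search needs a freshly initialized model for every trial. The sketch below illustrates the idea with a lambda expression. It uses a hypothetical stand-in (`load_stand_in_model`, which returns a plain dictionary) instead of a real pre-trained model, so it only demonstrates the pattern and is not a solution to this problem.

```python
def load_stand_in_model(use_bitfit):
    """Hypothetical stand-in for loading a pre-trained model.

    Returns a brand-new object on every call, like a real loader would.
    """
    return {"use_bitfit": use_bitfit, "classifier_weights": [0.0, 0.0]}


def make_model_init(use_bitfit):
    # The lambda takes no arguments and captures `use_bitfit`, so whoever
    # calls it later (e.g. once per trial) needs no extra information.
    return lambda: load_stand_in_model(use_bitfit)


model_init = make_model_init(use_bitfit=True)

first_model = model_init()
first_model["classifier_weights"][0] = 1.0  # pretend the first trial trained it

second_model = model_init()  # the next trial starts from a clean copy
print(second_model)
```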

### Problem 2c: Set Up Hyperparameter Tuning (Code, 20 Points)

Finally, to complete the experiment setup, you will implement hyperparameter tuning using the [Optuna](https://optuna.org/) framework. Optuna is integrated with 🤗 Transformers, and it can be invoked via the `Trainer.hyperparameter_search` method. Please implement the function `hyperparameter_search_settings` in `train_model.py` by returning the correct keyword arguments for `Trainer.hyperparameter_search`. (Observe that, at the end of `train_model.py`, these keyword arguments are passed to `Trainer.hyperparameter_search` via dictionary unpacking.)

Your code should satisfy the following requirements.
- Your hyperparameter tuning configuration must match your answer for Problem 1c.
- You must use Optuna for hyperparameter tuning.
- You must indicate to Optuna that the hyperparameter search should maximize accuracy.

Please use the following resources for guidance.
- [The Hugging Face tutorial on hyperparameter tuning](https://huggingface.co/docs/transformers/hpo_train)
- [The documentation for `Trainer.hyperparameter_search`](https://huggingface.co/docs/transformers/v4.26.1/en/main_classes/trainer#transformers.Trainer.hyperparameter_search)
- [The documentation for Optuna's `GridSampler`](https://optuna.readthedocs.io/en/v2.0.0/reference/generated/optuna.samplers.GridSampler.html)
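
The dictionary-unpacking mechanism mentioned above can be sketched as follows. To keep the example runnable without a GPU or any installed libraries, `stand_in_search` and `StandInTrial` are hypothetical stand-ins that only mimic the keyword parameters of `Trainer.hyperparameter_search` and the `suggest_categorical` method of an Optuna trial. The search-space values are placeholders, not the configuration required by Problem 1c, and the sampler is omitted.

```python
def hyperparameter_search_settings():
    """Return keyword arguments intended for Trainer.hyperparameter_search."""

    def hp_space(trial):
        # Placeholder grids (assumptions): replace with your Problem 1c answer.
        return {
            "learning_rate": trial.suggest_categorical(
                "learning_rate", [1e-4, 1e-5]),
            "per_device_train_batch_size": trial.suggest_categorical(
                "per_device_train_batch_size", [16, 32]),
        }

    return {
        "direction": "maximize",  # the default direction is "minimize"
        "backend": "optuna",
        "n_trials": 4,  # placeholder value
        "hp_space": hp_space,
        # Optimize validation accuracy, which Trainer logs as "eval_accuracy".
        "compute_objective": lambda metrics: metrics["eval_accuracy"],
    }


class StandInTrial:
    """Hypothetical stand-in for an Optuna trial: always picks the first choice."""

    def suggest_categorical(self, name, choices):
        return choices[0]


def stand_in_search(hp_space=None, compute_objective=None, n_trials=20,
                    direction="minimize", backend=None, **kwargs):
    """Hypothetical stand-in that echoes what it received."""
    params = hp_space(StandInTrial())
    objective = compute_objective({"eval_accuracy": 0.5, "eval_loss": 0.69})
    return {"direction": direction, "backend": backend, "n_trials": n_trials,
            "params": params, "objective": objective}


# Dictionary unpacking: each key becomes a keyword argument.
outcome = stand_in_search(**hyperparameter_search_settings())
print(outcome)
```

Because the default direction is `"minimize"`, leaving `direction` out would make the search treat the trial with the lowest accuracy as the best one.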

## Problem 3: Run Experiment (20 Points in Total)

To complete the assignment, you will now run your code and report on the results. It is recommended that you run your code on [Google Colaboratory](https://colab.research.google.com/) using a free GPU.

### Problem 3a: Train Models (Code and Written, 10 Points)

Please now run the following experimental procedure by running `train_model.py` as a Python script:
- first, fine-tune a BERT$_{\text{tiny}}$ model on the IMDb dataset _with_ BitFit;
- then, fine-tune a BERT$_{\text{tiny}}$ model on the IMDb dataset _without_ BitFit.

The `train_model.py` script should create a Pickle object containing information about the best hyperparameters found during hyperparameter tuning. Please submit this object, using the filenames `train_results_with_bitfit.p` and `train_results_without_bitfit.p` for your two training runs, respectively. Please also report the highest validation accuracy attained in each of your two training runs, as well as the hyperparameters used in those trials. Please format these results as a table such as the following.

| | Validation Accuracy | Learning Rate | Batch Size |
|---|---|---|---|
| Without BitFit | | | |
| With BitFit | | | |
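
Before submitting, you may want to check that a saved Pickle file can be read back and turned into a table row. The sketch below shows the round trip using a hypothetical stand-in object with `objective` and `hyperparameters` attributes, which are the attributes of the best-run object returned by `Trainer.hyperparameter_search`. The file name, the temporary directory, and all values are made up for illustration.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace

# Hypothetical stand-in for the best-run object saved by train_model.py.
best_run = SimpleNamespace(
    run_id="0",
    objective=0.5,
    hyperparameters={"learning_rate": 1e-4, "per_device_train_batch_size": 32},
)

# Write to a temporary directory so that no real results file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "train_results_example.p")
with open(path, "wb") as f:
    pickle.dump(best_run, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

# Format one row of the results table.
row = (f"| Example | {loaded.objective:.4f} "
       f"| {loaded.hyperparameters['learning_rate']} "
       f"| {loaded.hyperparameters['per_device_train_batch_size']} |")
print(row)
```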

### Problem 3b: Test Models and Report Results (Code and Written, 10 Points)

For each of your two training runs, please test the model that attained the best validation accuracy across all hyperparameter tuning trials. You may do so by running the `test_model.py` script. Once testing is complete, please report your results in the form of a table such as the following.

| | # Trainable Parameters | Test Accuracy |
|---|---|---|
| Without BitFit | | |
| With BitFit | | |

The `test_model.py` script should create a Pickle object containing information about test results. Please submit this object, using the filenames `test_results_with_bitfit.p` and `test_results_without_bitfit.p` for your two tests.
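
The "# Trainable Parameters" column counts only the parameters that receive gradient updates. The sketch below illustrates one way to count them before and after freezing everything except the bias terms. It relies on a hypothetical `StandInParam` class that imitates the two members of a PyTorch parameter used here (`requires_grad` and `numel()`), and it assumes that bias terms can be identified by names ending in `bias`. With a real model you would iterate over `model.named_parameters()` instead of the toy list.

```python
class StandInParam:
    """Hypothetical stand-in for a PyTorch parameter."""

    def __init__(self, size):
        self.size = size
        self.requires_grad = True

    def numel(self):
        return self.size


def freeze_all_but_bias(named_params):
    # BitFit-style freezing: only parameters whose name ends in "bias" stay trainable.
    for name, param in named_params:
        param.requires_grad = name.endswith("bias")


def count_trainable(named_params):
    # Sum the number of elements of every parameter that is still trainable.
    return sum(p.numel() for _, p in named_params if p.requires_grad)


# Toy parameter list (hypothetical shapes).
toy_params = [
    ("encoder.dense.weight", StandInParam(128 * 128)),
    ("encoder.dense.bias", StandInParam(128)),
    ("classifier.weight", StandInParam(128 * 2)),
    ("classifier.bias", StandInParam(2)),
]

full_count = count_trainable(toy_params)
freeze_all_but_bias(toy_params)
bitfit_count = count_trainable(toy_params)
print(full_count, bitfit_count)
```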

Finally, please comment on your results. How do they compare to the results reported by Zaken et al. (2020)? What does this say about BitFit and its applicability to other pre-trained Transformers?

In [None]:
!python "/content/drive/MyDrive/NLU/HW2/train_model.py"

2025-03-18 22:37:16.443801: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1742337436.463907    7248 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1742337436.469906    7248 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Map: 100% 20000/20000 [00:23<00:00, 864.16 examples/s]
Map: 100% 5000/5000 [00:04<00:00, 1032.92 examples/s]
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):


In [20]:
import pickle

# Load the best trial found in each hyperparameter search
with open("train_results_with_bitfit.p", "rb") as f:
    best_bitfit = pickle.load(f)

with open("train_results_without_bitfit.p", "rb") as f:
    best_no_bitfit = pickle.load(f)

# Report the best validation accuracy and hyperparameters, one labeled row per run
results_table = f"""
| | Validation Accuracy | Learning Rate | Batch Size |
|---|---|---|---|
| Without BitFit | {best_no_bitfit.objective:.4f} | {best_no_bitfit.hyperparameters['learning_rate']} | {best_no_bitfit.hyperparameters['per_device_train_batch_size']} |
| With BitFit | {best_bitfit.objective:.4f} | {best_bitfit.hyperparameters['learning_rate']} | {best_bitfit.hyperparameters['per_device_train_batch_size']} |
"""

print(results_table)


| | Validation Accuracy | Learning Rate | Batch Size |
|---|---|---|---|
| Without BitFit | 0.4934 | 1.0726564771392724e-06 | 64 |
| With BitFit | 0.5038 | 1.1800784767230618e-06 | 32 |



In [None]:
!python "/content/drive/MyDrive/NLU/HW2/test_model.py"