# HW 2: Efficient Fine-Tuning with BitFit?
**Due: March 13, 11:30 AM**

In this homework assignment, you will replicate [the BitFit experiments (Zaken et al., 2020)](https://aclanthology.org/2022.acl-short.1/). You will first use the [🤗 Transformers framework](https://huggingface.co/docs/transformers/index) to fine-tune a [BERT$_\text{tiny}$ model](https://huggingface.co/prajjwal1/bert-tiny) ([Turc et al., 2019](https://arxiv.org/abs/1908.08962); [Bhargava et al., 2021](https://aclanthology.org/2021.insights-1.18/)) on the IMDb dataset. You will then fine-tune the same model, but with all parameters frozen other than the bias terms. You will compare the two models on the following metrics: (1) their accuracy on the IMDb test set and (2) the number of parameters trained during fine-tuning.

## Important: Read Before Starting

In the following exercises, you will need to implement functions defined in the `train_model.py` and `test_model.py` scripts. **Please write all your code in those files.** You should not submit this notebook with your solutions, and we will not grade it if you do. Please be aware that code written in a Jupyter notebook may run differently when copied into Python modules.

The outputs shown in this notebook are the outputs that you should get **when all problems have been completed correctly**. You may obtain different results if you attempt to run the code cells before you have completed the problem set, or if you have completed one or more problems incorrectly.

For part of this assignment, you will be asked to fine-tune a BERT$_\text{tiny}$ model on the IMDb dataset with hyperparameter tuning. **This will take several hours to run on a laptop with a CPU.** You may want to instead run your code on [Google Colaboratory](https://colab.research.google.com/) using a free GPU.

To begin, please run the following `import` statements.

In [None]:
! pip install datasets evaluate optuna --quiet # install datasets if it is not included in your environment

In [None]:
!pip install evaluate

In [None]:
!pip install optuna

In [4]:
import torch
from collections.abc import Iterable
from datasets import load_dataset

# Model and tokenizer from 🤗 Transformers
from transformers import AutoModelForSequenceClassification, \
    BertForSequenceClassification, BertTokenizerFast



In [5]:
from google.colab import drive
drive.mount('/content/drive')
import sys
sys.path.append('/content/drive/MyDrive/NLU/HW2/')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
# Code you will write for this assignment
from train_model import init_model, preprocess_dataset, init_trainer

In [7]:
from test_model import init_tester

## Problem 1: Setup (30 Points in Total)

In this assignment, you will fine-tune a pre-trained Transformer model using libraries provided by [Hugging Face](https://huggingface.co/) (whose name is usually styled using the emoji 🤗). You have already been exposed to Hugging Face in lab, where you used the [🤗 Datasets](https://huggingface.co/docs/datasets/index) library to load the IMDb dataset and the [🤗 Transformers](https://huggingface.co/docs/transformers/index) library to load a pre-trained BERT$_\text{tiny}$ model. In the following problems, additionally use the [🤗 Evaluate](https://huggingface.co/docs/evaluate/index) library, which provides evaluation metrics such as accuracy and F1.

For several parts of this problem, you will need to refer to the [Hugging Face fine-tuning tutorial](https://huggingface.co/docs/transformers/training) for guidance.

### Problem 1a: Understand the 🤗 Transformers Library (No Submission, 0 Points)

🤗 Transformers is imported into Python via the name `transformers`. Please find the import statements from 🤗 Transformers in the code cell above.

🤗 Transformers comes with a number of different Transformer architectures, as well as [the Model Hub, a repository of pre-trained model parameters](https://huggingface.co/models). A pre-trained model is loaded by calling the model architecture's `.from_pretrained` method.

In [10]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The code above loads a Transformer classifier consisting of a pre-trained BERT$_\text{base}$ encoder with case-insensitive vocabulary and a randomly initialized 2-layer MLP decoder with tanh activation. The choice of this particular set of pre-trained parameters is specified by the identifier `'bert-base-uncased'`, which is passed to the first parameter of `.from_pretrained`. Different pre-trained weights can be loaded by passing a different identifier to `.from_pretrained`. The following code loads the BERT$_\text{tiny}$ model from [Turc et al. (2019)](https://arxiv.org/abs/1908.08962) and [Bhargava et al. (2021)](https://aclanthology.org/2021.insights-1.18/), which you will be fine-tuning in this assignment. (The `/` indicates that this is a user-submitted model, uploaded by the user [`prajjwal1`](https://huggingface.co/prajjwal1).)

In [11]:
model = BertForSequenceClassification.from_pretrained(
    "prajjwal1/bert-tiny", num_labels=2)

config.json:   0%|          | 0.00/285 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/17.8M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In order to load a model using the code above, you would have to know that BERT$_{\text{tiny}}$'s architecture is implemented using the same class as BERT$_{\text{base}}$. This is not true in general, however. For instance, if you wanted to initialize a RoBERTa classifier instead of a BERT classifier, you would need to call `RobertaForSequenceClassification.from_pretrained` instead of `BertForSequenceClassification.from_pretrained`. When you don't know which class implements the architecture of pre-trained model you want to load, you can use the `AutoModelForSequenceClassification` class ([and equivalent classes for other tasks](https://huggingface.co/docs/transformers/model_doc/auto)), which will figure out which class to instantiate based on the pre-trained weights you would like to load.

In [12]:
# This code does exactly the same thing as the previous code cell
model = AutoModelForSequenceClassification.from_pretrained(
    "prajjwal1/bert-tiny", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In addition to models, 🤗 Transformers also provides tokenizers that implement a full processing pipeline similar to what you implemented in HW 2. You can load the appropriate tokenizer for your model using a `.from_pretrained` method, just as you did with the model.

In [13]:
tokenizer = BertTokenizerFast.from_pretrained("prajjwal1/bert-tiny")

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

As we saw in lab, the tokenizer object can be called as a function. Doing so will return a fully processed input, ready to be passed to the model.

In [14]:
# Because 🤗 Transformers supports multiple deep learning libraries, you will
# need to use the keyword parameter return_tensors in order to indicate that
# you want your inputs to be returned in PyTorch format.
inputs = tokenizer(["Hello world!", "How are you?"], padding=True,
                   return_tensors="pt")
inputs

{'input_ids': tensor([[ 101, 7592, 2088,  999,  102,    0],
        [ 101, 2129, 2024, 2017, 1029,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1]])}

The inputs returned by the tokenizer are passed to the model via [dictionary unpacking](https://realpython.com/python-kwargs-and-args/). The output of the model is structured, with various kinds of information provided depending on keyword arguments passed to the model.

In [15]:
model.eval()
with torch.no_grad():
    outputs = model(**inputs)

print(outputs, end="\n\n")

# Use the dot operator to access parts of the output
print(outputs.logits)

SequenceClassifierOutput(loss=None, logits=tensor([[ 0.0256, -0.0840],
        [ 0.0349, -0.0166]]), hidden_states=None, attentions=None)

tensor([[ 0.0256, -0.0840],
        [ 0.0349, -0.0166]])


### Problem 1b: Understand BERT Inputs (Written, 10 Points)

Look at the tokenized inputs from two code cells above. The inputs are represented as a dict with three keys: `'input_ids'`, `'token_type_ids'`, and `'attention_mask'`. What do each of those three inputs represent? Please consult the [original BERT paper (Devlin et al., 2018)](https://arxiv.org/abs/1810.04805) for guidance.

### Problem 1c: Understand BERT Hyperparameters (Written, 10 Points)

For this assignment, you will perform hyperparameter tuning for the BERT$_\text{tiny}$ model using the same procedure as in the [original paper](https://arxiv.org/abs/1908.08962). Their hyperparameter tuning procedure is documented in the [official BERT GitHub repository](https://github.com/google-research/bert) under the heading "**\*\*\*\*\*New March 11th, 2020: Smaller BERT Models\*\*\*\*\***." Please read this documentation and describe how hyperparameter tuning was performed for the GLUE benchmark.

### Problem 1d: Prepare Dataset (Code, 10 Points)

As in lab, we will be using the IMDb dataset provided by 🤗 Datasets.

In [16]:
# Load IMDb dataset and create validation split
imdb = load_dataset("imdb")
split = imdb["train"].train_test_split(.2, seed=3463)
imdb["train"] = split["train"]
imdb["val"] = split["test"]
del imdb["unsupervised"]

The 🤗 Transformers fine-tuning API expects datasets to be pre-processed through the following steps.
- All input texts should be tokenized.
- BERT models have a maximum input length, and all inputs need to be truncated to this length.
- Inputs shorter than the maximum input length should be padded to this length.
- The pre-processed inputs do not need to be in the form of PyTorch tensors.

These steps are performed by the `preprocess_dataset` function in `run_experiment.py`, which you will implement for this problem.

In [17]:
imdb["train"] = preprocess_dataset(imdb["train"], tokenizer)
imdb["val"] = preprocess_dataset(imdb["val"], tokenizer)
imdb["test"] = preprocess_dataset(imdb["test"], tokenizer)

# Visualize the preprocessed dataset
for k, v in imdb["val"][:2].items():
    print("{}:\n{}\n{}\n".format(k, type(v),
                                 [item[:20] if isinstance(item, Iterable) else
                                 item for item in v[:5]]))

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

text:
<class 'list'>
['As so many others ha', 'When converting a bo']

label:
<class 'list'>
[1, 0]

input_ids:
<class 'list'>
[[101, 2004, 2061, 2116, 2500, 2031, 2517, 1010, 2023, 2003, 1037, 6919, 4516, 1012, 2182, 2003, 1037, 2862, 1997, 1996], [101, 2043, 16401, 1037, 2338, 2000, 2143, 1010, 2009, 2003, 3227, 1037, 2204, 2801, 2000, 2562, 2012, 2560, 2070, 1997]]

token_type_ids:
<class 'list'>
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

attention_mask:
<class 'list'>
[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]



Please base your implementation on the [Hugging Face fine-tuning tutorial](https://huggingface.co/docs/transformers/training), and please consult [Appendix A.2 of the BERT paper](https://arxiv.org/abs/1810.04805) to find out what the maximum input length should be.

## Problem 2: Implement Experiment (50 Points in Total)
### Problem 2a: Freeze Non-Bias Weights (Code, 10 Points)

At the end of this assignment, you will be comparing a BERT$_{\text{tiny}}$ model fine-tuned using BitFit to a BERT$_{\text{tiny}}$ model fine-tuned _without_ BitFit. To run that experiment, you will need to support freezing all non-bias parameters of the model. To do this, please implement the `init_model` function, illustrated below. This function should load a pre-trained BERT classifier model from the Hugging Face Model Hub and optionally freeze non-bias parameters.

In [18]:
# The first parameter is unused; we just pass None to it
model = init_model(None, "prajjwal1/bert-tiny", use_bitfit=True)

# Check if weight matrix is frozen
print(model.bert.encoder.layer[0].attention.self.query.weight.requires_grad)

# Check if bias term is frozen
print(model.bert.encoder.layer[0].attention.self.query.bias.requires_grad)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


False
True


**Hint:** Please consult the [documentation for the function `nn.Module.named_parameters`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.named_parameters).

### Problem 2b: Set Up Trainer and Tester (Code, 20 Points)

🤗 Transformers provides a [`Trainer` object](https://huggingface.co/docs/transformers/main_classes/trainer) that implements training and testing a neural network. For this problem, please implement the functions `init_trainer` in `train_model.py` and `init_tester` in `test_model.py`, which will set up the `Trainer`s used to train and test your model, respectively.

In [19]:
# Creates a Trainer from a Hugging Face Model Hub identifier
trainer = init_trainer("prajjwal1/bert-tiny", imdb["train"], imdb["val"])

# Train using the trainer
trainer.train()

# Change this to whichever checkpoint you want to evalaute


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mdb5144[0m ([33mdb5144-new-york-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3684,0.35852,0.8646
2,0.2311,0.407939,0.87
3,0.3663,0.525805,0.8592
4,0.2641,0.446904,0.8844


Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

TrainOutput(global_step=10000, training_loss=0.32413340103626254, metrics={'train_runtime': 494.3607, 'train_samples_per_second': 161.825, 'train_steps_per_second': 20.228, 'total_flos': 101638963200000.0, 'train_loss': 0.32413340103626254, 'epoch': 4.0})

In [None]:
eval_checkpoint_directory = "/content/checkpoints/1/checkpoint-5000"
#/content/checkpoints/run-1742097961/checkpoint-5000
# Creates a Trainer to test a Hugging Face saved model
tester = init_tester(eval_checkpoint_directory)



Your `init_trainer` function needs to support the following.
- The training configuration (total number of epochs, early stopping criteria if any) must match your answer for Problem 1c.
- Your `Trainer` needs to save the model obtained during each training run to a folder called `checkpoints`.
- You should leave the `model` keyword parameter blank and instead pass an argument to the `model_init` keyword parameter.
- It should evaluate models based on accuracy.

Your `init_tester` function needs to support the following.
- The `Trainer` should only support testing and not traiing.
- It should evaluate models based on accuracy.


Please use the [Hugging Face fine-tuning tutorial](https://huggingface.co/docs/transformers/training) as well as [this forum post](https://discuss.huggingface.co/t/using-trainer-at-inference-time/9378/3) for guidance. You may need to create new functions for this problem, and you may find it useful to learn about [lambda expressions](https://realpython.com/python-lambda/) if you don't know about them already.

### Problem 2c: Set Up Hyperparameter Tuning (Code, 20 Points)

Finally, to complete the experiment setup, you will implement hyperparameter tuning using the [Optuna](https://optuna.org/) framework. Optuna is integrated with 🤗 Transformers, and it can be invoked via the `Trainer.hyperparameter_search` method. Please implement the function `hyperparameter_search_settings` in `train_model.py` by returing the correct keyword arguments for `Trainer.hyperparameter_search`. (Observe that, at the end of `train_model.py`, these keyword arguments are passed to `Trainer.hyperparameter_search` via dictionary unpacking.)  

Your code should support the following requirements.
- Your hyperparameter tuning configuration must match your answer for Problem 1c.
- You must use Optuna for hyperparameter tuning.
- You must indicate to Optuna that the hyperparameter search should maximize accuracy.

Please use the following resources for guidance.
- [The Hugging Face tutorial on hyperparameter tuning](https://huggingface.co/docs/transformers/hpo_train)
- [The documentation for `Trainer.hyperparameter_search`](https://huggingface.co/docs/transformers/v4.26.1/en/main_classes/trainer#transformers.Trainer.hyperparameter_search)
- [The documentation for Optuna's `GridSampler`](https://optuna.readthedocs.io/en/v2.0.0/reference/generated/optuna.samplers.GridSampler.html)

## Problem 3: Run Experiment (20 Points in Total)

To complete the assignment, you will now run your code and report on the results. It is recommended that you run your code on [Google Colaboratory](https://colab.research.google.com/) using a free GPU.

### Problem 3a: Train Models (Code and Written, 10 Points)

Please now run the following experimental procedure by running `train_model.py` as a Python script:
- first, fine-tune a BERT$_{\text{tiny}}$ model on the IMDb dataset _with_ BitFit;
- then, fine-tune a BERT$_{\text{tiny}}$ model on the IMDb dataset _without_ BitFit.

The `train_model.py` script should create a Pickle object containing information about the best hyperparameters found during hyperparameter tuning. Please submit this object, using the filenames `train_results_with_bitfit.p` and `train_results_without_bitfit.p` for your two training runs, respectively. Please also report the highest validation accuracy attained in each of your two training runs, as well as the hyperparameters used in those trials. Please format these results as a table such as the following.

| | Validation Accuracy | Learning Rate | Batch Size |
|---|---|---|---|
| Without BitFit | | | |
| With BitFit | | | |

### Problem 3b: Test Models and Report Results (Code and Written, 10 Points)

For each of your two training runs, please test the model that attained the best validation accuracy across all hyperparameter tuning trials. You may do so by running the `test_model.py` script. Once testing is complete, please report your results in the form of a table such as the following.

| | # Trainable Parameters | Test Accuracy |
|---|---|---|
| Without BitFit | | |
| With BitFit | | |

The `test_model.py` script should create a Pickle object containing information about test results. Please submit this object, using the filenames `test_results_with_bitfit.p` and `test_results_without_bitfit.p` for your two tests.

Finally, please comment on your results. How do they compare to the results reported by Zaken et al. (2020)? What does this say about BitFit and its applicability to other pre-trained Transformers?

with bitfit

In [22]:
def apply_bitfit(model):
    for name, param in model.named_parameters():
        if "bias" not in name:  # Freeze all except bias terms
            param.requires_grad = False


In [28]:
model_with_bitfit = BertForSequenceClassification.from_pretrained("prajjwal1/bert-tiny", num_labels=2)
#model_with_bitfit = apply_bitfit(model_with_bitfit)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [29]:
#model_with_bitfit = init_model(None, model_with_bitfit, use_bitfit=True)

In [32]:
trainer_with_bitfit = init_trainer("prajjwal1/bert-tiny", imdb["train"], imdb["val"])

# Train using the trainer
trainer_with_bitfit.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3684,0.35852,0.8646
2,0.2311,0.407939,0.87
3,0.3663,0.525805,0.8592
4,0.2641,0.446904,0.8844


TrainOutput(global_step=10000, training_loss=0.32413340103626254, metrics={'train_runtime': 281.3513, 'train_samples_per_second': 284.342, 'train_steps_per_second': 35.543, 'total_flos': 101638963200000.0, 'train_loss': 0.32413340103626254, 'epoch': 4.0})

In [36]:
best_trial_with_bitfit = trainer_with_bitfit.hyperparameter_search()

    # Save best hyperparameters


[I 2025-03-19 02:46:38,386] A new study created in memory with name: no-name-001d4897-a076-44d1-bc20-2515c6349e54
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▃▄▁█▃▄▁█
eval/loss,▁▃█▅▁▃█▅
eval/runtime,▁█▃▃▃▄▃▁
eval/samples_per_second,█▁▅▅▅▄▆█
eval/steps_per_second,█▁▅▅▅▄▆█
train/epoch,▂▂▂▂▂▃▃▄▄▄▅▅▅▅▆▇▁▂▂▂▃▄▄▄▄▄▅▅▅▅▆▆▆▆▆▇████
train/global_step,▁▂▂▂▂▂▃▃▃▅▅▅▆▆▆▇▇██▁▁▂▂▂▂▃▃▃▃▄▅▅▆▆▆▇▇▇██
train/grad_norm,▂▂▃▂▂▂▃▄▁▁▄▁▃▂▁▆▁▁▁▂▅█▅▁▃▂▁▂▄▂▁▂▄▁█▁▁▁▁▁
train/learning_rate,▇▇▆▆▆▆▅▅▅▅▄▄▄▄▃▂▂▂▂▂▂▁██▆▅▅▄▄▄▃▃▃▃▃▂▁▁▁▁
train/loss,▇▅▅▆▅▄▄▆▃▃▂▅▄▃▃▁▄▆▁█▆▅▆▆▄▃▄▁▂▃▂▄▂▃▄▃▅▂▅▃

0,1
eval/accuracy,0.8844
eval/loss,0.4469
eval/runtime,7.5556
eval/samples_per_second,661.765
eval/steps_per_second,82.721
total_flos,101638963200000.0
train/epoch,4.0
train/global_step,10000.0
train/grad_norm,67.72562
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3153,0.595918,0.8272
2,0.5166,0.524092,0.8636
3,0.3907,0.5078,0.8738
4,0.3023,0.546919,0.8724
5,0.5136,0.547411,0.877


[I 2025-03-19 02:54:35,560] Trial 0 finished with value: 0.877 and parameters: {'learning_rate': 2.4602624296238622e-05, 'num_train_epochs': 5, 'seed': 40, 'per_device_train_batch_size': 4}. Best is trial 0 with value: 0.877.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁▆█▇█
eval/loss,█▂▁▄▄
eval/runtime,▁▃▁█▃
eval/samples_per_second,█▅█▁▆
eval/steps_per_second,█▅█▁▆
train/epoch,▁▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇▇███
train/global_step,▁▂▂▂▂▂▂▂▂▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇███
train/grad_norm,▁▁▁▂▂▅▂▁▂▁▂▂▃█▁▁▁▃▁▁▁▂▁▁▆▁▁▄▁▁█▁▁▁▁▁▁▁▁▁
train/learning_rate,███▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▅▅▅▅▅▄▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁▁
train/loss,▆▅▃▅▅▆█▆▅▄▅▄▇▅▅▄▇▅▅▇██▂▅▄▇▄▄▅▂▁▆▆▃▄▆▅▂▂▆

0,1
eval/accuracy,0.877
eval/loss,0.54741
eval/runtime,9.1649
eval/samples_per_second,545.562
eval/steps_per_second,68.195
total_flos,127048704000000.0
train/epoch,5.0
train/global_step,25000.0
train/grad_norm,44.63614
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.438,0.466117,0.8056
2,0.4193,0.513753,0.8422
3,0.6726,0.524051,0.8562
4,0.5621,0.529756,0.8606
5,0.1463,0.541498,0.858


[I 2025-03-19 03:02:28,189] Trial 1 finished with value: 0.858 and parameters: {'learning_rate': 1.0928526574210552e-05, 'num_train_epochs': 5, 'seed': 31, 'per_device_train_batch_size': 4}. Best is trial 1 with value: 0.858.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁▆▇██
eval/loss,▁▅▆▇█
eval/runtime,█▇▁▄█
eval/samples_per_second,▁▂█▅▁
eval/steps_per_second,▁▂█▅▁
train/epoch,▁▁▁▁▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▇▇▇▇▇▇▇█████
train/global_step,▁▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▃▃▃▄▄▄▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
train/grad_norm,▁▁▂▁▁▂▁▂▁▂▁▁▃▄▁▁▄▅▂▂▁▂▁▄██▄▁▇▁▁▁▁▅▃▁▁▁▁▁
train/learning_rate,█████▇▇▇▆▆▆▆▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▄▄▄▃▃▃▃▂▂▂▁▁▁
train/loss,▇▇▆▇▆▄▅▅▅▇▅▆▄▆▂▅▃▇▇▇▅█▄▅▅▇▁▄▅▁▃▆▄▅▇▆█▆█▇

0,1
eval/accuracy,0.858
eval/loss,0.5415
eval/runtime,9.4921
eval/samples_per_second,526.754
eval/steps_per_second,65.844
total_flos,127048704000000.0
train/epoch,5.0
train/global_step,25000.0
train/grad_norm,1.55564
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6608,0.658834,0.6558


[I 2025-03-19 03:03:41,919] Trial 2 finished with value: 0.6558 and parameters: {'learning_rate': 3.402256753695723e-06, 'num_train_epochs': 1, 'seed': 2, 'per_device_train_batch_size': 8}. Best is trial 2 with value: 0.6558.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▁▂▂▂▂▃▃▃▃▃▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇███
train/global_step,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▇▇▇▇▇▇▇▇██
train/grad_norm,▂▅▁▁▂▃▁▂▂▆▁▄▄▃▂▁▃▄▂▃▂▃▄▆▁▄▂▂▃█▃▃▂▅▂▄▃▆▆▃
train/learning_rate,████▇▇▇▇▇▇▇▆▆▆▆▆▆▆▅▅▅▅▅▅▅▄▄▄▃▃▃▃▃▃▃▂▂▂▁▁
train/loss,██▇▆▆▆▆▇▇▆▅█▄▅▇▄▅▄▅▅▃▅▆▄▃▃▄▄▄▁▃▄▄▃▅▂▅▅▄▃

0,1
eval/accuracy,0.6558
eval/loss,0.65883
eval/runtime,7.6344
eval/samples_per_second,654.931
eval/steps_per_second,81.866
total_flos,25409740800000.0
train/epoch,1.0
train/global_step,2500.0
train/grad_norm,5.83056
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.672,0.673281,0.6112
2,0.6629,0.663465,0.64


[I 2025-03-19 03:05:31,111] Trial 3 finished with value: 0.64 and parameters: {'learning_rate': 6.02436508931075e-06, 'num_train_epochs': 2, 'seed': 22, 'per_device_train_batch_size': 64}. Best is trial 3 with value: 0.64.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁█
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇▇▇███
train/global_step,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇██
train/grad_norm,▃▁▁▂▅▆▄▅▅▃▃▂█▁▂▅▆▄▃▂▃▂▂▃▄▆▄▂▂▄▇▂▃▅█▄▅▄▂▃
train/learning_rate,████▇▇▇▇▇▇▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▁▁▁
train/loss,█▇▇▇█▅▆▆▅▆▅▄▅▄▃▃▅▃▄▃▃▃▃▂▂▂▂▂▂▂▂▂▁▃▂▃▂▂▁▁

0,1
eval/accuracy,0.64
eval/loss,0.66346
eval/runtime,8.0903
eval/samples_per_second,618.026
eval/steps_per_second,77.253
total_flos,50819481600000.0
train/epoch,2.0
train/global_step,626.0
train/grad_norm,1.09434
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6945,0.694025,0.5038


[I 2025-03-19 03:06:30,479] Trial 4 finished with value: 0.5038 and parameters: {'learning_rate': 1.1800784767230618e-06, 'num_train_epochs': 1, 'seed': 9, 'per_device_train_batch_size': 32}. Best is trial 4 with value: 0.5038.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇█████
train/global_step,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇█████
train/grad_norm,▇▃▃▂▃▃▂▂▂▂▇▁▄▂▂▃▅▂▄▁▂▄▅▅▄▄▆▂▄▁▂█▄▅▁▄▆▂▃▃
train/learning_rate,█████▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▅▅▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁
train/loss,▆▇▅▄▇▇▆▇▃█▅▇▅▄▆▃▃▅▄▆▄▆▆▅▅▃▃▃▆▆▅▄▁▆▆▂▇█▅▄

0,1
eval/accuracy,0.5038
eval/loss,0.69403
eval/runtime,8.6075
eval/samples_per_second,580.889
eval/steps_per_second,72.611
total_flos,25409740800000.0
train/epoch,1.0
train/global_step,625.0
train/grad_norm,0.99052
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.4364,0.387697,0.8276


[I 2025-03-19 03:07:29,474] Trial 5 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇█
train/global_step,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇████
train/grad_norm,▁▁▁▂▁▂▃▁▁▃▁▅▂█▂▄▆▃▄▃▃▂▇▅▅▆▄▃▄▂▃▃▄▃▂▆▇▃█▄
train/learning_rate,████▇▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▂▁▁
train/loss,███▇▇▇▇▇▇▆▆▅▅▅▄▅▄▅▄▃▄▃▃▃▂▃▂▂▂▃▂▂▃▃▂▁▂▂▁▂

0,1
eval/accuracy,0.8276
eval/loss,0.3877
eval/runtime,9.3019
eval/samples_per_second,537.523
eval/steps_per_second,67.19
train/epoch,1.0
train/global_step,625.0
train/grad_norm,7.00396
train/learning_rate,0.0
train/loss,0.4364


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6095,0.629741,0.6758
2,0.5698,0.57818,0.7076
3,0.5189,0.552442,0.7338


[I 2025-03-19 03:12:05,248] Trial 6 finished with value: 0.7338 and parameters: {'learning_rate': 2.89442260679237e-06, 'num_train_epochs': 3, 'seed': 22, 'per_device_train_batch_size': 4}. Best is trial 4 with value: 0.5038.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁▅█
eval/loss,█▃▁
eval/runtime,▆█▁
eval/samples_per_second,▃▁█
eval/steps_per_second,▃▁█
train/epoch,▁▁▁▁▁▂▂▂▂▃▃▄▄▄▄▄▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▇▇▇▇████
train/global_step,▁▁▂▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇███
train/grad_norm,▁▁▁▁▁▂▁▂▁▁▁▂▂▁▂▃▃▃▃▂▃▃▂▆▂▁▆▁▂▃▃▃█▄▂▂▃▄▂▇
train/learning_rate,███▇▇▇▆▆▆▆▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▄▄▃▃▃▃▂▂▂▂▂▂▂▂▁
train/loss,█▇▇█▇▇▇▇▇▆█▆▆▅▅▅▃▅▄▇▄▄▃▅▄▄▁▃▄▁▄▃▃▄▅▃▁▇▆▃

0,1
eval/accuracy,0.7338
eval/loss,0.55244
eval/runtime,8.0001
eval/samples_per_second,624.989
eval/steps_per_second,78.124
total_flos,76432500326400.0
train/epoch,3.0
train/global_step,15000.0
train/grad_norm,15.17989
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6504,0.65537,0.6374


[I 2025-03-19 03:13:00,958] Trial 7 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇██
train/global_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████
train/grad_norm,▅▃▄▃▃▁▄▃▁▅▁▁▄▂▂▃█▃▄▄▂▃▃▄▄▆▃▃▃▂▇▃▃▂▅▁▂▃▃▅
train/learning_rate,█████▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▂▂▂▂▁▁▁
train/loss,▆█▇▆▄▆▅▆▅▅▅▅▅▄▄▃▅▃▃▃▃▃▃▅▃▃▃▂▃▂▁▂▂▂▁▂▃▂▁▂

0,1
eval/accuracy,0.6374
eval/loss,0.65537
eval/runtime,8.7268
eval/samples_per_second,572.95
eval/steps_per_second,71.619
train/epoch,1.0
train/global_step,625.0
train/grad_norm,3.24165
train/learning_rate,0.0
train/loss,0.6504


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6611,0.653532,0.6512


[I 2025-03-19 03:14:03,618] Trial 8 finished with value: 0.6512 and parameters: {'learning_rate': 6.842205185868796e-06, 'num_train_epochs': 1, 'seed': 14, 'per_device_train_batch_size': 16}. Best is trial 4 with value: 0.5038.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇████
train/global_step,▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▄▄▅▅▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇██
train/grad_norm,▄▂▃▁▃▃█▃▂▂▃▂▃▁▂▂▃▄▂▄▂▃▂▂▂▂▂▃▅▃▂▃▄▂▂▂▄▂▂▄
train/learning_rate,███▇▇▇▇▇▆▆▆▆▆▆▆▆▆▅▅▅▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁
train/loss,█▇▇▆▇▇▇▆▇▇▅▅▄▃▄▄▅▃▃▃▃▃▃▃▅▃▄▂▄▄▃▃▄▂▂▁▅▂▃▁

0,1
eval/accuracy,0.6512
eval/loss,0.65353
eval/runtime,9.4013
eval/samples_per_second,531.839
eval/steps_per_second,66.48
total_flos,25613018726400.0
train/epoch,1.0
train/global_step,1250.0
train/grad_norm,1.76337
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.4762,0.459946,0.8654


[I 2025-03-19 03:15:38,895] Trial 9 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▇▇▇▇▇▇█
train/global_step,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▆▆▇▇▇▇████
train/grad_norm,▁▁▁▁▁▂▂▁▁▂▂▃▂▁▁▁▁▅▁▅▁▁█▃▁█▃▁▄▁▁▁▂▅▅▁▁▃▂▂
train/learning_rate,████▇▇▇▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁
train/loss,▄▄▄▄▄▃▄▃▃▂▃▃▃▄▄▃▂▃▃▃▂▄▃▁▆█▃▃▃▁▂▅▃▄▅▂▁▅▁▂

0,1
eval/accuracy,0.8654
eval/loss,0.45995
eval/runtime,8.3798
eval/samples_per_second,596.669
eval/steps_per_second,74.584
train/epoch,1.0
train/global_step,5000.0
train/grad_norm,4.43957
train/learning_rate,2e-05
train/loss,0.4762


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6852,0.679391,0.583


[I 2025-03-19 03:16:34,798] Trial 10 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▁▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/global_step,▁▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▇▇▇▇███
train/grad_norm,▂▂▄▁▂▃▃▂▅▁▁▃▂▂▂▃█▂▁▁▁▅▇▁▆▃▂▂▄▂█▂▅▃▄▄▂▃▄▂
train/learning_rate,█████▇▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁
train/loss,▅▇▇▆▅█▃▅▇▆▆▄▅▅▆▄▅▃▇▆▅▅▁▄▅▂▄▅▂▄▃▃▄▆▅▃▇▁▄▄

0,1
eval/accuracy,0.583
eval/loss,0.67939
eval/runtime,7.5982
eval/samples_per_second,658.053
eval/steps_per_second,82.257
train/epoch,1.0
train/global_step,625.0
train/grad_norm,1.16852
train/learning_rate,0.0
train/loss,0.6852


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6834,0.690572,0.5126
2,0.6935,0.686798,0.5264


[I 2025-03-19 03:18:26,234] Trial 11 finished with value: 0.5264 and parameters: {'learning_rate': 1.2598148775426316e-06, 'num_train_epochs': 2, 'seed': 20, 'per_device_train_batch_size': 64}. Best is trial 4 with value: 0.5038.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁█
eval/loss,█▁
eval/runtime,█▁
eval/samples_per_second,▁█
eval/steps_per_second,▁█
train/epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇███
train/global_step,▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇█████
train/grad_norm,█▁▆▄▄▄▂▅▂▃▆▃▃▂▄▁▄▂▃▄▂▃▁▁▇▂▂▂▇▂▂▅▂▂▃▃▂▃▄▂
train/learning_rate,██▇▇▇▇▇▆▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁▁
train/loss,█▆█▅▅▃▄▄▆▄▅▅▃▄▂▄▄▃▄▄▃▂▂▃▃▂▄▃▂▁▃▂▃▃▂▂▂▂▂▃

0,1
eval/accuracy,0.5264
eval/loss,0.6868
eval/runtime,8.3909
eval/samples_per_second,595.882
eval/steps_per_second,74.485
total_flos,51022759526400.0
train/epoch,2.0
train/global_step,626.0
train/grad_norm,1.08352
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6914,0.691743,0.526
2,0.691,0.690225,0.5324


[I 2025-03-19 03:20:17,795] Trial 12 finished with value: 0.5324 and parameters: {'learning_rate': 1.0838487450652203e-06, 'num_train_epochs': 2, 'seed': 14, 'per_device_train_batch_size': 64}. Best is trial 4 with value: 0.5038.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁█
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/global_step,▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇███
train/grad_norm,▂▁▁█▂▂▂▂▃▁▂▂▂▂▄▃▂█▄▂▃▁▅▂▂▃▂▂▂▃▂▃▁▂▂▂▂▃▃▂
train/learning_rate,█████▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▁▁
train/loss,▇████▄▇█▅▇▇▆▄▇█▅▇▅▇▄▃▆▃▅▇▅▃▄▂▁▃▆▃▃▃▄▃▃▂▄

0,1
eval/accuracy,0.5324
eval/loss,0.69022
eval/runtime,10.6793
eval/samples_per_second,468.197
eval/steps_per_second,58.525
total_flos,50819481600000.0
train/epoch,2.0
train/global_step,626.0
train/grad_norm,0.75231
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6954,0.690486,0.502
2,0.6904,0.68823,0.5256


[I 2025-03-19 03:22:08,197] Trial 13 finished with value: 0.5256 and parameters: {'learning_rate': 1.8014495288925948e-06, 'num_train_epochs': 2, 'seed': 8, 'per_device_train_batch_size': 64}. Best is trial 4 with value: 0.5038.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁█
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁▁▁▁▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/global_step,▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▆▆▆▆▆▇▇▇▇▇█████
train/grad_norm,▄▃▅▂▃▁▇▇▇▂▆▅▂▂▁▅▂▄▅▁▅▂▃▃▂▁▃▃▄█▁▁▆▃▄▁▇▆▃▂
train/learning_rate,████▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▁▁▁▁
train/loss,▅▆█▆▅▅▆▄▄▄▅▃▅▂▂▂▅▅▃▃▂▃▃▄▂▁▄▄▂▂▁▅▂▅▂▁▃▄▂▂

0,1
eval/accuracy,0.5256
eval/loss,0.68823
eval/runtime,9.1039
eval/samples_per_second,549.212
eval/steps_per_second,68.652
total_flos,50819481600000.0
train/epoch,2.0
train/global_step,626.0
train/grad_norm,0.75153
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6776,0.670297,0.5996
2,0.641,0.649439,0.6472
3,0.661,0.642453,0.6552


[I 2025-03-19 03:25:39,140] Trial 14 finished with value: 0.6552 and parameters: {'learning_rate': 1.997548926750142e-06, 'num_train_epochs': 3, 'seed': 8, 'per_device_train_batch_size': 8}. Best is trial 4 with value: 0.5038.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁▇█
eval/loss,█▃▁
eval/runtime,▁▃█
eval/samples_per_second,█▆▁
eval/steps_per_second,█▆▁
train/epoch,▁▁▁▁▁▁▂▂▂▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▇▇▇████
train/global_step,▁▁▁▁▂▂▂▂▂▃▃▄▄▄▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇███
train/grad_norm,▃▃▂▁▁▂▃▁▆▄▃▆▅▃▂▃▄▂█▅▃▃▃▄▂▃▅▃▂▃▃▃▃▂▇▄▄▂▆█
train/learning_rate,██████▇▇▇▇▇▇▆▆▆▆▅▅▅▄▄▄▄▄▄▄▃▃▃▃▃▂▂▂▂▁▁▁▁▁
train/loss,███▇▇▆▆▆▆▇▅▄▂▅▄▃▄▃▅▃▄▄▃▃▄▃▃▇▁▂▄▃▆▃▃▃▄▁▅▃

0,1
eval/accuracy,0.6552
eval/loss,0.64245
eval/runtime,9.9062
eval/samples_per_second,504.736
eval/steps_per_second,63.092
total_flos,76229222400000.0
train/epoch,3.0
train/global_step,7500.0
train/grad_norm,3.8342
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6821,0.680521,0.6114


[I 2025-03-19 03:26:42,829] Trial 15 finished with value: 0.6114 and parameters: {'learning_rate': 1.9735573876923056e-06, 'num_train_epochs': 1, 'seed': 7, 'per_device_train_batch_size': 16}. Best is trial 4 with value: 0.5038.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇██
train/global_step,▁▁▁▁▁▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▇▇▇▇▇▇▇█████
train/grad_norm,▃▄▃▂▂▄▁▂▁▃▃▃▃▄▂█▄▃▂▂▂▂▂▆▃▃▁▂▄▃▂▃▄▂▃▃▅▂▁▅
train/learning_rate,███▇▇▇▇▇▆▆▆▆▆▆▆▅▅▅▅▄▄▄▄▄▄▄▄▃▃▃▂▂▂▂▂▂▂▂▁▁
train/loss,▆█▆▆▆▄▆▃▄▆▅▆▂▃▅▂▄▄▁▄▃▅▄▅▃▄▄▄▃▄▃▃▁▂▃▂▁▅▄▃

0,1
eval/accuracy,0.6114
eval/loss,0.68052
eval/runtime,9.1601
eval/samples_per_second,545.846
eval/steps_per_second,68.231
total_flos,25409740800000.0
train/epoch,1.0
train/global_step,1250.0
train/grad_norm,2.73805
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.5658,0.541185,0.7536


[I 2025-03-19 03:27:42,625] Trial 16 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇███
train/global_step,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇▇█
train/grad_norm,▃▁▁▂▁▂▂▂▂▂▁▃▂▂▁▂▂▄▂▃▇▂▅▂▄▃█▄▂▇▇▂▆▂▅▄▂▅▆▃
train/learning_rate,████▇▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▂▁▁▁
train/loss,██▇▇█▇▇▇▆▆▆▇▆▇▆▇▆▆▆▅▅▅▅▅▅▄▄▂▃▄▄▄▃▂▃▂▂▃▁▂

0,1
eval/accuracy,0.7536
eval/loss,0.54118
eval/runtime,8.8686
eval/samples_per_second,563.789
eval/steps_per_second,70.474
train/epoch,1.0
train/global_step,625.0
train/grad_norm,2.79124
train/learning_rate,1e-05
train/loss,0.5658


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6836,0.68385,0.5984


[I 2025-03-19 03:28:39,331] Trial 17 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇███
train/global_step,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇███
train/grad_norm,▄█▁▃▂▁▄▂▂▃▃▁▂▄▂▂▂▂▄▂▄▂▁▂▄▃▂▂▃▃▃
train/learning_rate,███▇▇▇▇▆▆▆▆▅▅▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▁▁▁
train/loss,█▄▅▅█▅▅▆▃▅▅▃▅▄▄▄▄▄▄▂▂▃▂▃▁▃▃▅▁▃▁

0,1
eval/accuracy,0.5984
eval/loss,0.68385
eval/runtime,7.8095
eval/samples_per_second,640.244
eval/steps_per_second,80.03
train/epoch,1.0
train/global_step,313.0
train/grad_norm,1.06387
train/learning_rate,0.0
train/loss,0.6836


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6716,0.655484,0.638


[I 2025-03-19 03:29:37,762] Trial 18 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▁▁▂▂▂▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇█
train/global_step,▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇███
train/grad_norm,▁▂▂▁▂▂▂▂▁▄▂▂▄▃▃▄▃▂▂▅▃▁▂▃▂▂▂▃▁▂▂▂▂▃█▂▄▃▇▆
train/learning_rate,████▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁
train/loss,▇█▇▆▆▇▆▆▅▆▆▅▆▄▅▄▅▃▆▅▅▄▅▄▄▅▄▃▂▅▃▃▃▃▃▃▃▁▁▃

0,1
eval/accuracy,0.638
eval/loss,0.65548
eval/runtime,7.6457
eval/samples_per_second,653.964
eval/steps_per_second,81.746
train/epoch,1.0
train/global_step,625.0
train/grad_norm,3.13713
train/learning_rate,0.0
train/loss,0.6716


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6842,0.682913,0.5862


[I 2025-03-19 03:30:36,594] Trial 19 pruned. 


NameError: name 'pickle' is not defined

without bitfit

In [37]:
import pickle
with open("train_results_with_bitfit.p", "wb") as f:
  pickle.dump(best_trial_with_bitfit, f)

In [None]:
model_without_bitfit = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)


In [None]:
#model = init_model(None, "prajjwal1/bert-tiny", use_bitfit=True)

In [38]:
trainer_without_bitfit = init_trainer("prajjwal1/bert-tiny", imdb["train"], imdb["val"])

# Train using the trainer
trainer_without_bitfit.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3684,0.35852,0.8646
2,0.2311,0.407939,0.87
3,0.3663,0.525805,0.8592
4,0.2641,0.446904,0.8844


TrainOutput(global_step=10000, training_loss=0.32413340103626254, metrics={'train_runtime': 280.0915, 'train_samples_per_second': 285.621, 'train_steps_per_second': 35.703, 'total_flos': 101638963200000.0, 'train_loss': 0.32413340103626254, 'epoch': 4.0})

In [40]:
best_trian_without_bitfit = trainer_with_bitfit.hyperparameter_search()

    # Save best hyperparameters
with open("train_results_without_bitfit.p", "wb") as f:
  pickle.dump(best_trian_without_bitfit, f)

[I 2025-03-19 03:49:38,519] A new study created in memory with name: no-name-18b846fa-11dc-44f8-a0ba-a9ff76faf33c
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁██▇█
eval/loss,█▁▂▅▃
eval/runtime,▇█▃▁▇
eval/samples_per_second,▂▁▆█▂
eval/steps_per_second,▂▁▆█▂
train/epoch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▃▃▄▄▄▄▄▄▄▅▆▆▆▆▇▇▇▇▇▇▇███
train/global_step,▁▁▁▁▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▅▆▆▆▇▇▇▇▇▇▇█████
train/grad_norm,▁▃▃▁▂▂▃▅▃▂▂▄▂▃▁▃▃▂▄▃▂▅▄█▆▄▅▁▁▄▁▂▁▅▃▃▃▁▃▁
train/learning_rate,████▇▇▇▇▇▇▇▆▆▆▆▆▅▅▅▅▄▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▁▁▁
train/loss,███▇▇▅▄▄▅▄▄▄▃▄▃▃▃▄▅▄▃▅▅▄▃▄▃▁▂▂▅▆▄▄▄▃▃▂▃▂

0,1
eval/accuracy,0.8844
eval/loss,0.4469
eval/runtime,10.4437
eval/samples_per_second,478.757
eval/steps_per_second,59.845
total_flos,101638963200000.0
train/epoch,4.0
train/global_step,10000.0
train/grad_norm,67.72562
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.4486,0.422185,0.8146
2,0.3793,0.377536,0.836


[I 2025-03-19 03:51:38,095] Trial 0 finished with value: 0.836 and parameters: {'learning_rate': 3.530055296399356e-05, 'num_train_epochs': 2, 'seed': 32, 'per_device_train_batch_size': 32}. Best is trial 0 with value: 0.836.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁█
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁▁▂▂▂▂▂▃▃▃▃▄▄▅▅▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇███
train/global_step,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████
train/grad_norm,▁▁▁▁▁▁▁▁▁▂▁▁▃▁▂▂▃▄▁▂▂▅█▄▃▄▃▃▃▃▄▂▁▄▂▁▂▂▄▂
train/learning_rate,█████▇▇▇▇▇▇▇▆▆▆▆▆▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▁▁▁▁
train/loss,████▇▆▆▆▅▅▅▅▄▄▄▃▃▄▄▂▁▁▂▂▂▁▂▁▃▂▁▂▂▂▂▁▂▂▂▁

0,1
eval/accuracy,0.836
eval/loss,0.37754
eval/runtime,9.7572
eval/samples_per_second,512.442
eval/steps_per_second,64.055
total_flos,51022759526400.0
train/epoch,2.0
train/global_step,1250.0
train/grad_norm,4.15216
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.5854,0.575637,0.7168
2,0.5131,0.471577,0.7858
3,0.4309,0.431598,0.8066
4,0.4435,0.41702,0.8162
5,0.3323,0.408423,0.8198


[I 2025-03-19 03:58:15,194] Trial 1 finished with value: 0.8198 and parameters: {'learning_rate': 5.614995667774333e-06, 'num_train_epochs': 5, 'seed': 39, 'per_device_train_batch_size': 8}. Best is trial 1 with value: 0.8198.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁▆▇██
eval/loss,█▄▂▁▁
eval/runtime,▁█▃▇▃
eval/samples_per_second,█▁▅▂▅
eval/steps_per_second,█▁▅▂▅
train/epoch,▁▁▁▁▁▂▂▂▂▂▂▂▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇████
train/global_step,▁▁▂▂▂▂▂▃▃▃▃▃▃▃▃▄▄▄▄▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▇▇███
train/grad_norm,▁▁▁▂▁▁▃▃▁▂▁▂▂▂▄▂▄▄▂▃▅▄▂▂▂▄▃▆▃▃▄▃▁▁▂▂▄▃█▃
train/learning_rate,██████▇▇▇▇▆▆▆▆▆▆▆▆▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▁▁
train/loss,██▇█▇▇▆▆▆▆▆▅▇▅▆▄▄▅▅▄▃▂▂▃▂▂▂▂▂▃▂▁▃▁▃▂▁▂▁▂

0,1
eval/accuracy,0.8198
eval/loss,0.40842
eval/runtime,9.1011
eval/samples_per_second,549.383
eval/steps_per_second,68.673
total_flos,127048704000000.0
train/epoch,5.0
train/global_step,12500.0
train/grad_norm,30.54307
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6709,0.676069,0.5994
2,0.6628,0.668588,0.6112
3,0.6623,0.664382,0.6188
4,0.6704,0.662629,0.622


[I 2025-03-19 04:02:25,466] Trial 2 finished with value: 0.622 and parameters: {'learning_rate': 1.3966118515989925e-06, 'num_train_epochs': 4, 'seed': 15, 'per_device_train_batch_size': 32}. Best is trial 2 with value: 0.622.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁▅▇█
eval/loss,█▄▂▁
eval/runtime,█▁▃▂
eval/samples_per_second,▁█▆▇
eval/steps_per_second,▁█▆▇
train/epoch,▁▁▁▁▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇█████
train/global_step,▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▃▃▄▄▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇███
train/grad_norm,▁▃▄▄▃▁▂▃▆▁▃▃▄▁▂▁▂▃▁█▂▄▅▁▂▃▂▃▂▆▆▄▄▅▃▆▅▄▃▂
train/learning_rate,█████▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▅▅▅▄▄▄▄▄▃▃▃▃▃▂▂▁▁▁▁▁
train/loss,▇▇▇▇█▆▇▆▇▆▆█▄▅▆▆▅▄▆▅▅▂▅▅▆▃▅▃▅▃▅▃▂▃▁▇▁▄▂▂

0,1
eval/accuracy,0.622
eval/loss,0.66263
eval/runtime,9.7377
eval/samples_per_second,513.468
eval/steps_per_second,64.184
total_flos,101638963200000.0
train/epoch,4.0
train/global_step,2500.0
train/grad_norm,1.72155
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6587,0.654072,0.645
2,0.6382,0.62786,0.6764
3,0.6327,0.619619,0.6844


[I 2025-03-19 04:05:33,413] Trial 3 finished with value: 0.6844 and parameters: {'learning_rate': 7.157198612075959e-06, 'num_train_epochs': 3, 'seed': 34, 'per_device_train_batch_size': 64}. Best is trial 2 with value: 0.622.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁▇█
eval/loss,█▃▁
eval/runtime,▁▅█
eval/samples_per_second,█▃▁
eval/steps_per_second,█▃▁
train/epoch,▁▁▁▁▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇███
train/global_step,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▅▅▅▅▅▆▆▆▆▇▇▇▇▇▇▇▇██
train/grad_norm,▂▄▁▂▂▂▂▂▁▁▂▃▂▂▁▂▂▁▁▃▂▂█▂▂▃▃▃▂▄▂▄▃▁▂▃▂▃▂▂
train/learning_rate,████▇▇▇▇▆▆▆▆▆▅▅▅▅▅▄▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁
train/loss,██▇▆▅▆▅▅▅▅▅▄▄▄▄▄▄▄▄▃▂▃▃▃▃▃▂▃▂▁▂▂▂▂▁▂▃▂▂▂

0,1
eval/accuracy,0.6844
eval/loss,0.61962
eval/runtime,11.7028
eval/samples_per_second,427.247
eval/steps_per_second,53.406
total_flos,76229222400000.0
train/epoch,3.0
train/global_step,939.0
train/grad_norm,1.7084
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6488,0.633483,0.664
2,0.6073,0.58034,0.7122
3,0.588,0.548931,0.737
4,0.5141,0.536024,0.7464


[I 2025-03-19 04:11:17,118] Trial 4 finished with value: 0.7464 and parameters: {'learning_rate': 3.4162569273116594e-06, 'num_train_epochs': 4, 'seed': 35, 'per_device_train_batch_size': 8}. Best is trial 2 with value: 0.622.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁▅▇█
eval/loss,█▄▂▁
eval/runtime,▅▁▁█
eval/samples_per_second,▄██▁
eval/steps_per_second,▄██▁
train/epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▃▃▄▄▅▅▅▅▅▅▆▆▆▆▆▆▆▆▇▇▇███
train/global_step,▁▁▁▁▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇████
train/grad_norm,▂▁▁▁▂▂▂▁▁▂▂▂▂▃▂▂▄▃▅▃▂▅▃▆▄▂▇▇▄█▅▂▂▂▄▅▄▆▃▄
train/learning_rate,████████▇▇▇▇▇▇▇▆▆▆▆▆▆▅▅▄▄▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁
train/loss,█▆▇▇▇▆▇▆▆▅▆▄▆▅▆▄▆▅▅▅▅▇▃▂▃▃▄▇▄▂▄▃▃▃▂▂▂▁▁▂

0,1
eval/accuracy,0.7464
eval/loss,0.53602
eval/runtime,11.7102
eval/samples_per_second,426.98
eval/steps_per_second,53.372
total_flos,101638963200000.0
train/epoch,4.0
train/global_step,10000.0
train/grad_norm,5.86405
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.5701,0.349687,0.86


[I 2025-03-19 04:12:46,026] Trial 5 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇█
train/global_step,▁▁▁▁▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇█████
train/grad_norm,▁▂▁▁▃▅▅▄▃▅▃▂▆▂▅▄▃▂▆▇▇▃▁▆▂█▅▆▂▅▅▄▇▄▂▃▂▁▃▄
train/learning_rate,█████▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▁▁▁
train/loss,█▇▇▇▇▆▆▆▆▅▅▄▂▃▄▅▄▄▃▂▂▄▄▄▃▂▃▄▄▂▂▃▃▃▂▁▃▂▃▆

0,1
eval/accuracy,0.86
eval/loss,0.34969
eval/runtime,9.772
eval/samples_per_second,511.666
eval/steps_per_second,63.958
train/epoch,1.0
train/global_step,2500.0
train/grad_norm,22.7983
train/learning_rate,0.0
train/loss,0.5701


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3542,0.481216,0.8362


[I 2025-03-19 04:14:34,834] Trial 6 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▁▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇▇████
train/global_step,▁▁▁▁▁▂▂▃▃▃▄▄▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇█
train/grad_norm,▁▁▁▃▂▂▂▂▂▁▂▂▂▃▃▅▃▅▂▃▂▄▅▁▂█▁▁▅█▁▁▇▆▇█▂▆▄▅
train/learning_rate,███▇▇▇▇▇▆▆▅▅▅▅▅▅▄▄▄▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▁
train/loss,▆▆▆▆▅▅▆▆▅▅▅▄▄▅▄▂▃▃▅▃▃▄▂▁▂▂▃▂▃▆█▄▃▁▂▃▄▆▇▄

0,1
eval/accuracy,0.8362
eval/loss,0.48122
eval/runtime,11.5037
eval/samples_per_second,434.641
eval/steps_per_second,54.33
train/epoch,1.0
train/global_step,5000.0
train/grad_norm,109.94166
train/learning_rate,1e-05
train/loss,0.3542


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3781,0.351145,0.8466


[I 2025-03-19 04:15:38,541] Trial 7 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇███
train/global_step,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇███
train/grad_norm,▁▁▁▁▁▁▁▂▂▂▁▂▁▄▃▁▂▁▃▂▂▃▂▂▃▂▄▂█▁▃
train/learning_rate,███▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▄▄▃▃▃▃▂▂▂▂▁▁▁
train/loss,███▇▇▆▆▆▅▅▄▄▄▄▄▃▃▃▂▂▃▂▃▁▁▁▁▁▁▁▁

0,1
eval/accuracy,0.8466
eval/loss,0.35114
eval/runtime,9.6683
eval/samples_per_second,517.155
eval/steps_per_second,64.644
train/epoch,1.0
train/global_step,313.0
train/grad_norm,6.48185
train/learning_rate,6e-05
train/loss,0.3781


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6752,0.672031,0.6082


[I 2025-03-19 04:16:46,036] Trial 8 finished with value: 0.6082 and parameters: {'learning_rate': 4.799151238394696e-06, 'num_train_epochs': 1, 'seed': 4, 'per_device_train_batch_size': 64}. Best is trial 8 with value: 0.6082.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇████
train/global_step,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇████
train/grad_norm,▂▃▂▄▃▁▃▂█▄▅▄▄▂▂▂▄▂▆▅▄▁▂▄▃▂▄▃▄▂▂
train/learning_rate,███▇▇▇▇▆▆▆▆▅▅▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▁▁▁
train/loss,█▇▇▇▄▇▆▅▄▄▄▄▃▄▇▃▄▂▃▅▃▂▄▃▂▃▃▅▁▃▂

0,1
eval/accuracy,0.6082
eval/loss,0.67203
eval/runtime,11.3395
eval/samples_per_second,440.936
eval/steps_per_second,55.117
total_flos,25613018726400.0
train/epoch,1.0
train/global_step,313.0
train/grad_norm,0.97163
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.383,0.361329,0.8452


[I 2025-03-19 04:17:54,810] Trial 9 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇█████
train/global_step,▁▁▁▁▂▂▂▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▇▇▇▇▇▇▇█████
train/grad_norm,▁▁▁▁▁▁▁▂▁▃▁▃▁▂▅▄▁▃▃▅▆▁▃▂▂▁▃▇▂█▇▄█▃▃█▂▄▃▅
train/learning_rate,█████▇▇▇▇▇▇▆▆▆▆▅▅▅▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▃▃▂▂▂▂▁
train/loss,███▇▇▆▆▅▆▆▄▅▄▄▅▃▃▂▃▄▂▃▂▃▁▃▂▄▂▃▃▂▁▁▂▁▁▂▂▂

0,1
eval/accuracy,0.8452
eval/loss,0.36133
eval/runtime,10.3478
eval/samples_per_second,483.195
eval/steps_per_second,60.399
train/epoch,1.0
train/global_step,1250.0
train/grad_norm,14.634
train/learning_rate,2e-05
train/loss,0.383


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6858,0.691404,0.5074


[I 2025-03-19 04:18:58,753] Trial 10 finished with value: 0.5074 and parameters: {'learning_rate': 1.0805063581156355e-06, 'num_train_epochs': 1, 'seed': 21, 'per_device_train_batch_size': 64}. Best is trial 10 with value: 0.5074.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇████
train/global_step,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇████
train/grad_norm,▂▃▁▁█▄▂▂▃▅▃▁▇█▇▃▃▂▄▃▄▂▃▃▂▄▅▃▃▆▂
train/learning_rate,███▇▇▇▇▆▆▆▆▅▅▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▁▁▁
train/loss,▆█▇▆▆▃▅▃▄▆▆▆▆▄▅▄▅▅▄▄▄▃▇▂▂▂▆▇▄▅▁

0,1
eval/accuracy,0.5074
eval/loss,0.6914
eval/runtime,10.1161
eval/samples_per_second,494.26
eval/steps_per_second,61.782
total_flos,25409740800000.0
train/epoch,1.0
train/global_step,313.0
train/grad_norm,0.89975
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6963,0.697122,0.4884


[I 2025-03-19 04:20:06,781] Trial 11 finished with value: 0.4884 and parameters: {'learning_rate': 1.1213541152130352e-06, 'num_train_epochs': 1, 'seed': 22, 'per_device_train_batch_size': 64}. Best is trial 11 with value: 0.4884.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇████
train/global_step,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇████
train/grad_norm,▃▇▂▂▂▃▄▂█▆▃▃▁▂▄▂▅▂▁▅▁▃▄▂▅▄▁▁▄▂▂
train/learning_rate,███▇▇▇▇▆▆▆▆▅▅▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▁▁▁
train/loss,▅▄▃▄█▆▁▃▅▅▂▄▂▅▅▃▃▃▃▁▂▃▂▄▂▃▁▄▃▂▃

0,1
eval/accuracy,0.4884
eval/loss,0.69712
eval/runtime,10.7175
eval/samples_per_second,466.526
eval/steps_per_second,58.316
total_flos,25409740800000.0
train/epoch,1.0
train/global_step,313.0
train/grad_norm,0.81235
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6872,0.694728,0.4948


[I 2025-03-19 04:21:12,997] Trial 12 finished with value: 0.4948 and parameters: {'learning_rate': 1.2685313732426165e-06, 'num_train_epochs': 1, 'seed': 23, 'per_device_train_batch_size': 64}. Best is trial 11 with value: 0.4884.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇████
train/global_step,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇████
train/grad_norm,▁▄▆▂▁▇▅▂▁▂▄▅▃▂▂▁▃▂▃▄▁▃█▂▂▄▁█▂▂▃
train/learning_rate,███▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▄▄▃▃▃▃▂▂▂▂▁▁▁
train/loss,▅▅▄▅▅█▆▄▂▇▅▅▂▄▅▄▅▆▃▅▃▅▄▄▄▁▃▃▅▂▁

0,1
eval/accuracy,0.4948
eval/loss,0.69473
eval/runtime,11.6855
eval/samples_per_second,427.88
eval/steps_per_second,53.485
total_flos,25409740800000.0
train/epoch,1.0
train/global_step,313.0
train/grad_norm,1.15703
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6854,0.69073,0.5166


[I 2025-03-19 04:22:15,310] Trial 13 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇███
train/global_step,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇███
train/grad_norm,▁▄▆▂▁▇▅▂▁▂▄▄▃▁▂▁▃▁▂▄▁▂█▁▂▅▁█▂▂▃
train/learning_rate,███▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▄▄▃▃▃▃▂▂▂▂▁▁▁
train/loss,▆▅▅▅▅█▆▄▂▆▄▅▂▄▅▄▄▆▃▄▃▄▄▄▄▁▃▂▄▂▁

0,1
eval/accuracy,0.5166
eval/loss,0.69073
eval/runtime,8.1624
eval/samples_per_second,612.563
eval/steps_per_second,76.57
train/epoch,1.0
train/global_step,313.0
train/grad_norm,1.05996
train/learning_rate,0.0
train/loss,0.6854


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6757,0.669821,0.6096
2,0.6647,0.660536,0.627


[I 2025-03-19 04:24:39,116] Trial 14 finished with value: 0.627 and parameters: {'learning_rate': 2.3193566740470043e-06, 'num_train_epochs': 2, 'seed': 26, 'per_device_train_batch_size': 16}. Best is trial 11 with value: 0.4884.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁█
eval/loss,█▁
eval/runtime,█▁
eval/samples_per_second,▁█
eval/steps_per_second,▁█
train/epoch,▁▁▁▁▂▂▂▂▂▂▂▃▃▃▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▆▆▇▇█████
train/global_step,▁▁▁▁▁▁▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▅▅▆▆▆▆▆▇▇████
train/grad_norm,▅▁▂▃▁▂▁▂▂▁▃▂▂▂▁▄▁▅▂▂▂▁▁▁▅▃▂▂█▃▂▂▃▃▂▃▂▄▁▄
train/learning_rate,█████▇▇▇▇▇▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▄▃▃▃▂▂▂▂▂▂▂▁▁▁▁
train/loss,███▇▆▅▃▆▆▅▆▅▄▅▅▅▅▄▅▃▅▆▆▂▂▃▄▃▂▃▃▁▂▂▃▂▁▁▄▂

0,1
eval/accuracy,0.627
eval/loss,0.66054
eval/runtime,9.89
eval/samples_per_second,505.562
eval/steps_per_second,63.195
total_flos,51022759526400.0
train/epoch,2.0
train/global_step,2500.0
train/grad_norm,5.88557
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.4701,0.439671,0.8106


[I 2025-03-19 04:26:39,954] Trial 15 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▂▂▂▂▃▃▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
train/global_step,▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▄▄▄▄▅▅▅▅▆▆▆▆▆▇▇▇▇▇███
train/grad_norm,▂▁▁▁▁▁▁▂▂▁▂▁▂▁▃▃▁▃▄▁▂▄▄▃▂▁▃▂▂▁▄▃▁▁▃█▄█▄▁
train/learning_rate,████▇▇▇▇▇▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▄▄▄▄▄▄▃▃▃▃▂▂▂▂▁▁
train/loss,███▇█▇█▇▇▇▇▇▇▇▇▇▇▆▆▇▇▆▅▃▃▄▃▃▃▄▁▄▂▅▄▃▄▄▂▁

0,1
eval/accuracy,0.8106
eval/loss,0.43967
eval/runtime,10.4012
eval/samples_per_second,480.716
eval/steps_per_second,60.089
train/epoch,1.0
train/global_step,5000.0
train/grad_norm,17.07289
train/learning_rate,1e-05
train/loss,0.4701


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6854,0.681251,0.589


[I 2025-03-19 04:27:47,073] Trial 16 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇███
train/global_step,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇███
train/grad_norm,▁▇▆▅▅▄▇▂▆▂▅▂▄▂▂▁▅▂▃▃▃▇▅▅▄▇█▆▃▃▂
train/learning_rate,███▇▇▇▇▆▆▆▆▅▅▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▁▁▁
train/loss,█▇█▄▆█▇▆▆▇█▁▆▅▅▇▆▅▅▇▅▅▃▇▄▆▄▄▅▅▆

0,1
eval/accuracy,0.589
eval/loss,0.68125
eval/runtime,11.6235
eval/samples_per_second,430.163
eval/steps_per_second,53.77
train/epoch,1.0
train/global_step,313.0
train/grad_norm,0.87701
train/learning_rate,0.0
train/loss,0.6854


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6906,0.696826,0.4912
2,0.6958,0.694774,0.497


[I 2025-03-19 04:29:55,220] Trial 17 finished with value: 0.497 and parameters: {'learning_rate': 1.0834343221890246e-06, 'num_train_epochs': 2, 'seed': 27, 'per_device_train_batch_size': 64}. Best is trial 11 with value: 0.4884.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁█
eval/loss,█▁
eval/runtime,█▁
eval/samples_per_second,▁█
eval/steps_per_second,▁█
train/epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇█████
train/global_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇███
train/grad_norm,▄▄▃▄▄▅▆▅▇▂▆▄▂▂▂▂▅█▆▆▂▃▄▂▂▁▃▁▁█▄▁▄▁▃▅▃▆▂▆
train/learning_rate,████▇▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▂▂▂▂▁▁
train/loss,▂█▅▄▃▄▃▆▆▄▅▃▃▅▆▅▃▄▄▄▂▃▂▁▃▃▃▅▂▃▃▁▄▄▂▃▃▃▃▃

0,1
eval/accuracy,0.497
eval/loss,0.69477
eval/runtime,10.6418
eval/samples_per_second,469.847
eval/steps_per_second,58.731
total_flos,51022759526400.0
train/epoch,2.0
train/global_step,626.0
train/grad_norm,1.36498
train/learning_rate,0.0


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6884,0.690241,0.5326


[I 2025-03-19 04:31:00,133] Trial 18 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇███
train/global_step,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇███
train/grad_norm,█▁▂▅▇▁▂▇▃▁▂▅▃▂▃▃▆▃▂▂▆▃▄▆▂▂▄▁▄▂▃
train/learning_rate,███▇▇▇▇▆▆▆▆▅▅▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▁▁▁
train/loss,█▆█▄▆▄▅▆▄▅▃▄▃▃▃▃▂▂▂▃▃▂▂▂▁▁▁▁▁▂▁

0,1
eval/accuracy,0.5326
eval/loss,0.69024
eval/runtime,10.571
eval/samples_per_second,472.99
eval/steps_per_second,59.124
train/epoch,1.0
train/global_step,313.0
train/grad_norm,1.75769
train/learning_rate,0.0
train/loss,0.6884


Epoch,Training Loss,Validation Loss,Accuracy
1,0.5244,0.484048,0.7746


[I 2025-03-19 04:32:50,045] Trial 19 pruned. 


In [41]:
with open("train_results_with_bitfit.p", "rb") as f:
  best_bitfit = pickle.load(f)

with open("train_results_without_bitfit.p", "rb") as f:
  best_no_bitfit = pickle.load(f)

    # Extract best hyperparameters and accuracy
results_table = f"""
    | Validation Accuracy | Learning Rate | Batch Size |
    |---------------------|--------------|------------|
    | {best_no_bitfit.objective:.4f}  | {best_no_bitfit.hyperparameters['learning_rate']} | {best_no_bitfit.hyperparameters['per_device_train_batch_size']} |
    | {best_bitfit.objective:.4f}  | {best_bitfit.hyperparameters['learning_rate']} | {best_bitfit.hyperparameters['per_device_train_batch_size']} |
"""

print(results_table)


    | Validation Accuracy | Learning Rate | Batch Size |
    |---------------------|--------------|------------|
    | 0.4884  | 1.1213541152130352e-06 | 64 |
    | 0.5038  | 1.1800784767230618e-06 | 32 |



In [None]:
!python "/content/drive/MyDrive/NLU/HW2/train_model.py"

2025-03-18 22:37:16.443801: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1742337436.463907    7248 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1742337436.469906    7248 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Map: 100% 20000/20000 [00:23<00:00, 864.16 examples/s]
Map: 100% 5000/5000 [00:04<00:00, 1032.92 examples/s]
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):


In [None]:
!python "/content/drive/MyDrive/NLU/HW2/test_model.py"