# Hyperparameter Search

- optimizing for accuracy since classes are balanced (12:15)

first search run:
- learning_rate between 5e-6 and 5e-5, log=True
- num_train_epochs between 2 and 5
- per_device_train_batch_size between 4 and 8
- per_device_eval_batch_size between 4 and 8
- Best hyperparameters: {'learning_rate': 1.752433329903465e-05, 'num_train_epochs': 5, 'per_device_train_batch_size': 4, 'per_device_eval_batch_size': 8}
- Best eval accuracy: 0.6153846153846154
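
The first-run ranges above could be written as an Optuna-style search space roughly as follows. This is a sketch, not the code that produced the results: it assumes the batch sizes were sampled from the two values 4 and 8 (the notes only give the bounds), and the function name `hp_space_first_run` is made up for illustration.

```python
def hp_space_first_run(trial):
    """Search space matching the first-run ranges (partly assumed, see the note above)."""
    return {
        # Log-uniform sampling between 5e-6 and 5e-5
        "learning_rate": trial.suggest_float("learning_rate", 5e-6, 5e-5, log=True),
        # Integer number of epochs, 2 to 5 inclusive
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
        # Assumption: only the values 4 and 8 were candidates
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [4, 8]
        ),
        "per_device_eval_batch_size": trial.suggest_categorical(
            "per_device_eval_batch_size", [4, 8]
        ),
    }
```

A function like this would be passed as `hp_space=hp_space_first_run` to `trainer.hyperparameter_search`, in the same way as the `hp_space` function in the Search section.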

second search run:
- batch sizes 8
- epochs 5
- learning_rate between 5e-6 and 2e-5
- Best hyperparameters: {'learning_rate': 1.2308237496976495e-05}
- Best eval accuracy: 0.6837606837606838

third run:
- batch sizes 4
- learning_rate between 5e-6 and 2e-5
- Best hyperparameters: {'learning_rate': 1.2665150015950181e-05}
- Best eval accuracy: 0.6239316239316239

fourth run:
- batch sizes 8
- learning_rate between 1e-5 and 3e-5
- highest accuracy 0.726496
- learning_rate: 2.8213598460702224e-05

fifth run:
- batch sizes 8
- learning_rate between 3e-5 and 4e-5
- highest accuracy 0.760684
- learning_rate: 3.035495167103403e-05

sixth run:
- batch sizes 8
- learning_rate between 2.5e-5 and 3.5e-5
- highest accuracy 0.803419 at learning_rate 3.20605942472665e-05 (epoch 3)
- also good: 0.77777 at 3.2759208826863756e-05 (epoch 3)
- 0.726496 at 2.9592151393562346e-05 (epoch 3)
- 0.752137 at 3.443498945690748e-05 (epoch 3)


seventh run:
- batch sizes 8
- learning_rate between 1e-5 and 1.5e-5
- highest accuracy 0.69 at learning_rate 1.25e-5 (epoch 5)


In [36]:
import torch
import sqlite3
import pandas as pd


print(torch.backends.mps.is_available())
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

conn = sqlite3.connect('../../data/giicg.db')
prompts = pd.read_sql("SELECT * FROM expanded_roberta_prompts", conn)
conn.close()
prompts

True


Unnamed: 0,index,message_id,conversation_id,role,message_text,conversational,code,other,gender,user_id,language,label
0,0,1,1,user,"parsing data from python iterator, how it coul...","parsing data from python iterator, how it coul...",,,Man (cisgender),6,en,0
1,1,730,32,user,Write python function to do operations with in...,Write python function to do operations with in...,,report_dt\tsource\tmetric_name\tmetric_num\tme...,Man (cisgender),6,en,0
2,2,1133,55,user,Write shortest tutorial on creating RAG on ema...,Write shortest tutorial on creating RAG on ema...,,,Man (cisgender),6,en,0
3,3,1135,55,user,what is FAISS,what is FAISS,,,Man (cisgender),6,en,0
4,4,1137,55,user,Transform given code to process large .mbox file,Transform given code to process large .mbox file,,Transform given code to process large .mbox file,Man (cisgender),6,en,0
...,...,...,...,...,...,...,...,...,...,...,...,...
562,391,1234,65,user,can we add peid for when pefile fails?,can we add peid for when pefile fails?,,,Woman (cisgender),73,en,1
563,429,1322,65,user,"param_grid = {\n 'min_samples': [5, 10, 20]...",provide more steps,"param_grid = {\n 'min_samples': [5, 10, 20]...",,Woman (cisgender),73,en,1
564,334,484,21,user,i think i onlz want to think about the imbalan...,i think i only want to think about the imbalan...,,,Woman (cisgender),73,en,1
565,444,1364,65,user,from sklearn.cluster import OPTICS\nfrom sklea...,this worked. but i do not have visualizations ...,from sklearn.cluster import OPTICS\nfrom sklea...,,Woman (cisgender),73,en,1


## Build dataset
- group-aware split: prompts from the same user never occur in both the training and the validation set
- build the datasets in Hugging Face `Dataset` format
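
The group-aware property can be verified after splitting by checking that the two sets share no user id. The helper below is a small sketch with a hypothetical name (`overlapping_groups`) and toy ids that are not taken from the dataset.

```python
def overlapping_groups(train_groups, val_groups):
    """Return the sorted group ids present in both splits (empty if the split is clean)."""
    return sorted(set(train_groups) & set(val_groups))


# Toy user ids for illustration only
train_users = [6, 6, 12, 40]
val_users = [73, 73, 15]

print(overlapping_groups(train_users, val_users))  # []
print(overlapping_groups(train_users, [40, 73]))   # [40]
```

With the variables from the next cell, `overlapping_groups(train_prompts['user_id'], val_prompts['user_id'])` should return an empty list.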

In [37]:
from sklearn.model_selection import GroupShuffleSplit
from datasets import Dataset

gss = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
groups = prompts['user_id']

train_idx, val_idx = next(gss.split(prompts, groups=groups))
train_prompts = prompts.iloc[train_idx]
val_prompts = prompts.iloc[val_idx]


train_dataset = Dataset.from_pandas(train_prompts[['conversational', 'label']])
val_dataset = Dataset.from_pandas(val_prompts[['conversational', 'label']])

train_dataset

Dataset({
    features: ['conversational', 'label', '__index_level_0__'],
    num_rows: 450
})

## Model, Tokenizer & Data Collator
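
The tokenizer call below uses `padding=False`, and `DataCollatorWithPadding` pads each batch when it is assembled. Conceptually, a batch is padded to its own longest sequence and not to a global maximum length. The following pure-Python sketch illustrates the idea only; it is not the library implementation, the pad id of 1 is assumed for `roberta-base`, and the token ids are made up.

```python
def pad_batch(sequences, pad_id=1):
    """Right-pad token-id lists to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[0, 100, 2], [0, 100, 200, 300, 2]])
print(batch["input_ids"])       # [[0, 100, 2, 1, 1], [0, 100, 200, 300, 2]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```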

In [38]:
import json
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding

with open("finetune/label2id.json", "r") as f:
    label2id = json.load(f)

model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
num_labels = len(label2id)

def model_init():
    # Needed for Trainer's hyperparameter search to re-initialize the model each trial
    return AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels
    )


data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


def tokenize_function(examples):
    return tokenizer(
        examples["conversational"],
        truncation=True,
        padding=False,  # padding is applied per batch by the data collator
    )


## Tokenize

In [39]:
train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
val_dataset

Map: 100%|██████████| 450/450 [00:00<00:00, 46415.42 examples/s]
Map: 100%|██████████| 117/117 [00:00<00:00, 27865.17 examples/s]


Dataset({
    features: ['conversational', 'label', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 117
})

## Trainer
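
The trainer below combines `num_train_epochs=8` with `EarlyStoppingCallback(early_stopping_patience=3)`. The following is a simplified sketch of patience-based stopping, not the callback's actual implementation: it assumes that only a strictly higher accuracy counts as an improvement, and `epochs_run` is a hypothetical helper. The accuracies are those of trial 0 in the search output further down.

```python
def epochs_run(accuracies, patience=3):
    """Return the number of epochs completed before patience-based stopping triggers.

    Simplified model: an epoch is an improvement only if its accuracy is
    strictly higher than the best value seen so far.
    """
    best = None
    bad_epochs = 0
    for epoch, acc in enumerate(accuracies, start=1):
        if best is None or acc > best:
            best = acc
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(accuracies)


# Per-epoch eval accuracy of trial 0, copied from the search output
trial_0 = [0.136752, 0.222222, 0.470085, 0.547009, 0.547009, 0.641026, 0.641026, 0.615385]

print(epochs_run(trial_0))              # 8
print(epochs_run(trial_0, patience=1))  # 5
```

Under this model a patience of 3 is never exhausted within 8 epochs for trial 0, which matches the logs showing all 8 epochs for that trial.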

In [40]:
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


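def compute_metrics(pred):
    """Return accuracy and support-weighted F1, precision and recall for one evaluation."""
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)  # logits -> predicted class ids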
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average='weighted')
    precision = precision_score(labels, preds, average='weighted')
    recall = recall_score(labels, preds, average='weighted')
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }


training_args = TrainingArguments(
    output_dir="finetune/hp_results/search_6",
    eval_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=8,
    learning_rate=3.2e-5,
    # weight_decay: not set, the default is used
    # warmup_steps=10,
    logging_dir="./logs",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    logging_steps=50,
    logging_strategy="steps",
)


trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    processing_class=tokenizer,  # recent transformers releases deprecate the tokenizer= argument
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    compute_metrics=compute_metrics,
)



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Search
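
In every trial logged below, the epoch-1 row reads accuracy 0.136752, F1 0.032903 and precision 0.018701. These values are consistent with a model that predicts one class for all 117 validation examples, where that class has 16 true members (16/117 ≈ 0.136752). The 16/101 split is inferred from the metrics and not read from the data, and which label is the smaller one is unknown. The sketch below recomputes the support-weighted metrics by hand for such a constant predictor; the helper name is illustrative.

```python
def weighted_metrics(labels, preds):
    """Accuracy plus support-weighted F1, precision and recall (undefined ratios count as 0)."""
    n = len(labels)
    accuracy = sum(y == yhat for y, yhat in zip(labels, preds)) / n
    precision = recall = f1 = 0.0
    for cls in set(labels):
        tp = sum(y == cls and yhat == cls for y, yhat in zip(labels, preds))
        predicted = sum(yhat == cls for yhat in preds)
        support = sum(y == cls for y in labels)
        p_c = tp / predicted if predicted else 0.0  # class never predicted -> 0
        r_c = tp / support
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support / n
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return accuracy, f1, precision, recall


labels = [1] * 16 + [0] * 101  # assumed validation label counts (inferred, label choice arbitrary)
preds = [1] * 117              # constant prediction

print([round(m, 6) for m in weighted_metrics(labels, preds)])
# [0.136752, 0.032903, 0.018701, 0.136752]
```

Precision is undefined for a class that is never predicted, which is presumably what the truncated `_warn_prf` warning lines in the output refer to.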

In [41]:
def hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 0.7e-5, 1e-5, log=True),
    }


best_run = trainer.hyperparameter_search(
    direction="maximize",
    hp_space=hp_space,
    n_trials=5,
    compute_objective=lambda metrics: metrics["eval_accuracy"]
)

print("Best hyperparameters:", best_run.hyperparameters)
print("Best eval accuracy:", best_run.objective)


[I 2025-09-17 11:39:11,132] A new study created in memory with name: no-name-786706e2-9c16-4104-b68e-6d2e265b115e


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6842,0.836221,0.136752,0.032903,0.018701,0.136752
2,0.6781,1.276703,0.222222,0.191118,0.883697,0.222222
3,0.625,1.001374,0.470085,0.527506,0.8913,0.470085
4,0.5415,0.885444,0.547009,0.60767,0.894959,0.547009
5,0.3935,1.041951,0.547009,0.60767,0.894959,0.547009
6,0.367,0.870348,0.641026,0.695781,0.900973,0.641026
7,0.2742,0.899982,0.641026,0.695781,0.900973,0.641026
8,0.2152,1.025193,0.615385,0.672652,0.899117,0.615385


  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
[I 2025-09-17 11:40:24,265] Trial 0 finished with value: 0.6153846153846154 and parameters: {'learning_rate': 7.689097890653208e-06}. Best is trial 0 with value: 0.6153846153846154.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6836,0.823856,0.136752,0.032903,0.018701,0.136752
2,0.6771,1.3611,0.25641,0.246956,0.884491,0.25641
3,0.5923,0.73561,0.598291,0.656882,0.897979,0.598291
4,0.4569,0.960073,0.598291,0.656882,0.897979,0.598291
5,0.3336,1.16648,0.564103,0.624411,0.895905,0.564103
6,0.294,0.934879,0.692308,0.740391,0.905325,0.692308
7,0.1925,1.017588,0.692308,0.740391,0.905325,0.692308
8,0.1367,1.11795,0.683761,0.733092,0.904532,0.683761


[I 2025-09-17 11:41:37,734] Trial 1 finished with value: 0.6837606837606838 and parameters: {'learning_rate': 9.415011660168081e-06}. Best is trial 1 with value: 0.6837606837606838.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6839,0.844963,0.136752,0.032903,0.018701,0.136752
2,0.6772,1.318384,0.213675,0.17655,0.883507,0.213675
3,0.6217,0.906587,0.478632,0.536841,0.891664,0.478632
4,0.5161,0.90756,0.547009,0.60767,0.894959,0.547009
5,0.3671,1.072717,0.547009,0.60767,0.894959,0.547009
6,0.3426,0.859864,0.649573,0.703361,0.901634,0.649573
7,0.2444,0.897392,0.666667,0.718339,0.90303,0.666667
8,0.189,1.041211,0.615385,0.672652,0.899117,0.615385


[I 2025-09-17 11:42:51,048] Trial 2 finished with value: 0.6153846153846154 and parameters: {'learning_rate': 8.202795126217176e-06}. Best is trial 1 with value: 0.6837606837606838.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6836,0.842604,0.136752,0.032903,0.018701,0.136752
2,0.677,1.245432,0.247863,0.233348,0.884287,0.247863
3,0.6166,0.902432,0.470085,0.527506,0.8913,0.470085
4,0.5211,0.969948,0.555556,0.616085,0.895425,0.555556
5,0.3777,1.066393,0.57265,0.632651,0.8964,0.57265
6,0.3628,0.877473,0.632479,0.688138,0.900333,0.632479
7,0.2575,0.916786,0.632479,0.688138,0.900333,0.632479
8,0.2033,1.062655,0.606838,0.664804,0.898539,0.606838


[I 2025-09-17 11:44:04,560] Trial 3 finished with value: 0.6068376068376068 and parameters: {'learning_rate': 8.375913444094995e-06}. Best is trial 1 with value: 0.6837606837606838.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6835,0.825394,0.136752,0.032903,0.018701,0.136752
2,0.6775,1.344445,0.247863,0.233348,0.884287,0.247863
3,0.5919,0.763657,0.547009,0.60767,0.894959,0.547009
4,0.446,1.031717,0.57265,0.632651,0.8964,0.57265
5,0.3194,1.227674,0.547009,0.60767,0.894959,0.547009
6,0.3008,0.923805,0.692308,0.740391,0.905325,0.692308
7,0.1898,1.01135,0.683761,0.733092,0.904532,0.683761
8,0.1345,1.105147,0.675214,0.725742,0.903767,0.675214


[I 2025-09-17 11:45:18,331] Trial 4 finished with value: 0.6752136752136753 and parameters: {'learning_rate': 9.453071093867275e-06}. Best is trial 1 with value: 0.6837606837606838.


Best hyperparameters: {'learning_rate': 9.415011660168081e-06}
Best eval accuracy: 0.6837606837606838
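
Comparing the reported trial values with the per-epoch tables above shows that each trial's value equals its epoch-8 accuracy and not its best epoch (trial 0 reports 0.6154 although epochs 6 and 7 reached 0.6410). The sketch below ranks the five trials both ways, using accuracies copied from the logs; the variable names are illustrative.

```python
# (final-epoch accuracy, best-epoch accuracy) per trial, copied from the logs above
trials = {
    0: (0.615385, 0.641026),
    1: (0.683761, 0.692308),
    2: (0.615385, 0.666667),
    3: (0.606838, 0.632479),
    4: (0.675214, 0.692308),
}

# sorted() is stable, so tied trials keep their original order
by_final = sorted(trials, key=lambda t: trials[t][0], reverse=True)
by_best = sorted(trials, key=lambda t: trials[t][1], reverse=True)

print(by_final)  # [1, 4, 0, 2, 3]
print(by_best)   # [1, 4, 2, 0, 3]
```

Trial 1 stays on top under either criterion, but trials 1 and 4 tie on best-epoch accuracy. With 117 validation examples, one example is worth about 0.0085 accuracy, so the gaps between these trials amount to only a few predictions.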
