# Fine-Tuning
- Model: RoBERTa
- No need to speed up inference with DistilBERT, since the dataset is very small
- Hyperparameter tuning with Hugging Face's hyperparameter search
- Group k-fold cross-validation for prediction

## Several conditions:
- (spell corrected and) expanded prompts
- raw conversational part


In [1]:
import torch
print(torch.backends.mps.is_available())
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

True


In [2]:
import sqlite3
import pandas as pd

conn  = sqlite3.connect('../../giicg.db')
prompts = pd.read_sql("Select * from expanded_prompts", conn)
conn.close()
prompts

Unnamed: 0,message_id,conversation_id,role,message_text,conversational,code,other,gender,user_id,language
0,1,1,user,"parsing data from python iterator, how it coul...","parsing data from python iterator, how it coul...",,,Man (cisgender),6,en
1,730,32,user,Write python function to do operations with in...,Write python function to do operations with in...,,report_dt\tsource\tmetric_name\tmetric_num\tme...,Man (cisgender),6,en
2,1133,55,user,Write shortest tutorial on creating RAG on ema...,Write shortest tutorial on creating RAG on ema...,,,Man (cisgender),6,en
3,1135,55,user,what is FAISS,what is FAISS,,,Man (cisgender),6,en
4,1137,55,user,Transform given code to process large .mbox file,Transform given code to process large .mbox file,,Transform given code to process large .mbox file,Man (cisgender),6,en
...,...,...,...,...,...,...,...,...,...,...
748,1131,54,user,import pandas as pd\nimport numpy as np\nfrom ...,"I want to tune optimal thresholds. Currently, ...",import pandas as pd\nimport numpy as np\nfrom ...,The narratives list looks like this:\nnarrativ...,Man (cisgender),92,en
749,1532,71,user,"from transformers import AutoTokenizer, AutoMo...",I want to use an LLM for listwise reranking in...,"from transformers import AutoTokenizer, AutoMo...",,Man (cisgender),92,en
750,1646,82,user,"def run_query(query, n_results):\n query_em...",this is my code. I want to: Get nodes and edge...,"def run_query(query, n_results):\n query_em...",,Man (cisgender),92,en
751,1849,2,user,\n I am working on the problem of reconstru...,\n I am working on the problem of reconstru...,,Classic CV - Drone navigation\nIf you ever tho...,Man (cisgender),8,en


## Filter and clean

In [3]:
from helpers.normalization import remove_newlines

prompts = prompts[prompts['gender'].isin(['Woman (cisgender)', 'Man (cisgender)'])].reset_index()
prompts['conversational']  = prompts['conversational'].apply(remove_newlines)
prompts



Unnamed: 0,index,message_id,conversation_id,role,message_text,conversational,code,other,gender,user_id,language
0,0,1,1,user,"parsing data from python iterator, how it coul...","parsing data from python iterator, how it coul...",,,Man (cisgender),6,en
1,1,730,32,user,Write python function to do operations with in...,Write python function to do operations with in...,,report_dt\tsource\tmetric_name\tmetric_num\tme...,Man (cisgender),6,en
2,2,1133,55,user,Write shortest tutorial on creating RAG on ema...,Write shortest tutorial on creating RAG on ema...,,,Man (cisgender),6,en
3,3,1135,55,user,what is FAISS,what is FAISS,,,Man (cisgender),6,en
4,4,1137,55,user,Transform given code to process large .mbox file,Transform given code to process large .mbox file,,Transform given code to process large .mbox file,Man (cisgender),6,en
...,...,...,...,...,...,...,...,...,...,...,...
741,748,1131,54,user,import pandas as pd\nimport numpy as np\nfrom ...,"I want to tune optimal thresholds. Currently, ...",import pandas as pd\nimport numpy as np\nfrom ...,The narratives list looks like this:\nnarrativ...,Man (cisgender),92,en
742,749,1532,71,user,"from transformers import AutoTokenizer, AutoMo...",I want to use an LLM for listwise reranking in...,"from transformers import AutoTokenizer, AutoMo...",,Man (cisgender),92,en
743,750,1646,82,user,"def run_query(query, n_results):\n query_em...",this is my code. I want to: Get nodes and edge...,"def run_query(query, n_results):\n query_em...",,Man (cisgender),92,en
744,751,1849,2,user,\n I am working on the problem of reconstru...,I am working on the problem of reconstruc...,,Classic CV - Drone navigation\nIf you ever tho...,Man (cisgender),8,en


## Data stats and subsampling of long conversations
- subsampled 50 prompts from user 73, who had over 200

In [4]:
users_per_gender = prompts.groupby('gender')['user_id'].nunique().reset_index(name='num_users')
users_per_gender

Unnamed: 0,gender,num_users
0,Another gender,1
1,Man (cisgender),15
2,Non-binary,1
3,Woman (cisgender),12


In [14]:
messages_per_user = prompts.groupby('user_id')['message_id'].nunique().reset_index(name='num_messages')
messages_per_user

Unnamed: 0,user_id,num_messages
0,6,9
1,8,2
2,11,11
3,15,3
4,16,25
5,25,4
6,28,22
7,31,5
8,34,66
9,46,5


In [4]:
# Cap user 73 at 50 prompts so that a single user does not dominate the data

# 1. Separate the prompts of user 73 from those of all other users
user_73 = prompts[prompts['user_id'] == 73]
other_users = prompts[prompts['user_id'] != 73]

# 2. Randomly sample 50 prompts from user 73 (fixed seed for reproducibility)
user_73_sampled = user_73.sample(n=50, random_state=42)

# 3. Recombine and renumber the rows
prompts = pd.concat([other_users, user_73_sampled], ignore_index=True)

subsampled_messages_per_user = prompts.groupby('user_id')['message_id'].nunique().reset_index(name='num_messages')
subsampled_messages_per_user


Unnamed: 0,user_id,num_messages
0,6,9
1,8,2
2,11,11
3,15,3
4,16,25
5,25,4
6,28,22
7,31,5
8,34,66
9,46,5
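
The same cap can be applied to every user at once instead of naming user 73 explicitly. A minimal sketch, assuming the `user_id` and `message_id` columns from above; the helper name `cap_prompts_per_user` and the toy frame are made up for illustration:

```python
import pandas as pd


def cap_prompts_per_user(df, max_per_user=50, seed=42):
    """Keep at most `max_per_user` randomly chosen rows for each user_id."""
    parts = []
    for _, user_rows in df.groupby("user_id"):
        # Users at or below the cap keep all rows (in shuffled order);
        # users above the cap are randomly subsampled.
        n = min(len(user_rows), max_per_user)
        parts.append(user_rows.sample(n=n, random_state=seed))
    return pd.concat(parts, ignore_index=True)


# Toy data (hypothetical): user 1 has 5 prompts, user 2 has 2.
toy = pd.DataFrame({"user_id": [1] * 5 + [2] * 2, "message_id": range(7)})
capped = cap_prompts_per_user(toy, max_per_user=3)
counts = capped.groupby("user_id")["message_id"].nunique().to_dict()
print(counts)
```

Whether a cap of 50 is right for other heavy users is a judgment call; the sketch only shows the mechanics.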


In [6]:
prompts

Unnamed: 0,index,message_id,conversation_id,role,message_text,conversational,code,other,gender,user_id,language
0,0,1,1,user,"parsing data from python iterator, how it coul...","parsing data from python iterator, how it coul...",,,Man (cisgender),6,en
1,1,730,32,user,Write python function to do operations with in...,Write python function to do operations with in...,,report_dt\tsource\tmetric_name\tmetric_num\tme...,Man (cisgender),6,en
2,2,1133,55,user,Write shortest tutorial on creating RAG on ema...,Write shortest tutorial on creating RAG on ema...,,,Man (cisgender),6,en
3,3,1135,55,user,what is FAISS,what is FAISS,,,Man (cisgender),6,en
4,4,1137,55,user,Transform given code to process large .mbox file,Transform given code to process large .mbox file,,Transform given code to process large .mbox file,Man (cisgender),6,en
...,...,...,...,...,...,...,...,...,...,...,...
562,391,1234,65,user,can we add peid for when pefile fails?,can we add peid for when pefile fails?,,,Woman (cisgender),73,en
563,429,1322,65,user,"param_grid = {\n 'min_samples': [5, 10, 20]...",provide more steps,"param_grid = {\n 'min_samples': [5, 10, 20]...",,Woman (cisgender),73,en
564,334,484,21,user,i think i onlz want to think about the imbalan...,i think i only want to think about the imbalan...,,,Woman (cisgender),73,en
565,444,1364,65,user,from sklearn.cluster import OPTICS\nfrom sklea...,this worked. but i do not have visualizations ...,from sklearn.cluster import OPTICS\nfrom sklea...,,Woman (cisgender),73,en


## Create label mapping

In [5]:
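import json

labels = prompts['gender'].astype('category')
prompts['label'] = labels.cat.codes

# cat.codes index into cat.categories, so enumerate() yields id -> label
id2label = dict(enumerate(labels.cat.categories))
# Invert it to get the label -> id mapping that the file name promises
label2id = {label: idx for idx, label in id2label.items()}

with open("finetune/label2id.json", "w") as f:
    json.dump(label2id, f)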
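
One detail to keep in mind when reading such a mapping back: JSON object keys are always strings, so a mapping keyed by integer ids does not survive a round trip unchanged. A small sketch with an illustrative mapping (the two entries are assumed here, not read from the file):

```python
import json

# Illustrative id -> label mapping with integer keys, as enumerate() produces.
id2label = {0: "Man (cisgender)", 1: "Woman (cisgender)"}

restored = json.loads(json.dumps(id2label))
# JSON object keys are always strings, so the ids come back as "0" and "1".
keys_after_round_trip = sorted(restored)

# Convert the keys back to int before looking up predicted class ids.
id2label_restored = {int(idx): label for idx, label in restored.items()}
label2id_restored = {label: idx for idx, label in id2label_restored.items()}
print(keys_after_round_trip, label2id_restored)
```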



## Build dataset
- group-aware split: no user's prompts occur in both the training and the validation set
- build the datasets in Hugging Face `datasets` format

In [6]:
from sklearn.model_selection import GroupShuffleSplit
from datasets import Dataset

gss = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
groups = prompts['user_id']

train_idx, val_idx = next(gss.split(prompts, groups=groups))
train_prompts = prompts.iloc[train_idx]
val_prompts = prompts.iloc[val_idx]


train_dataset = Dataset.from_pandas(train_prompts[['conversational', 'label']])
val_dataset = Dataset.from_pandas(val_prompts[['conversational', 'label']])

train_dataset

Dataset({
    features: ['conversational', 'label', '__index_level_0__'],
    num_rows: 450
})
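
The group-aware property is easy to check after splitting. A minimal sketch; the helper name `shared_groups` and the user ids below are hypothetical:

```python
def shared_groups(train_groups, val_groups):
    """Return the group ids that appear on both sides of a split."""
    return set(train_groups) & set(val_groups)


# Hypothetical user ids, one entry per prompt in each split.
train_user_ids = [6, 6, 11, 16, 16]
val_user_ids = [8, 34, 34]

leaked = shared_groups(train_user_ids, val_user_ids)
assert not leaked, f"users in both splits: {sorted(leaked)}"
```

With the variables from the cell above, the call would presumably be `shared_groups(train_prompts['user_id'], val_prompts['user_id'])`.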

## Model, Tokenizer & Data Collator

In [7]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding

model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
num_labels = len(label2id)

def model_init():
    # Needed for Trainer's hyperparameter search to re-initialize your model each trial
    return AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels
    )


data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


def tokenize_function(examples):
    return tokenizer(
        examples["conversational"],
        truncation=True,
        padding=False # padding is handled in the data collator
    )


## Check max sample size

In [8]:


# Token count per prompt, to check whether any sample exceeds the model's input limit
texts = prompts['conversational'].tolist()

# Count the tokens for each sample
token_counts = [len(tokenizer.encode(text, add_special_tokens=True)) for text in texts]

# Find the max, min, and average
max_tokens = max(token_counts)
min_tokens = min(token_counts)
avg_tokens = sum(token_counts) / len(token_counts)

print(f"Max tokens: {max_tokens}")
print(f"Min tokens: {min_tokens}")


Max tokens: 407
Min tokens: 4
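
`roberta-base` accepts at most 512 tokens per input, so a maximum of 407 means `truncation=True` should never cut a prompt in this data. A small sketch of a fuller summary; the function name and the example counts are made up and only reuse the two reported extremes:

```python
MODEL_MAX_TOKENS = 512  # input limit of roberta-base, including special tokens


def summarize_token_counts(token_counts, limit=MODEL_MAX_TOKENS):
    """Summarize sequence lengths and the share that would be truncated."""
    n_over = sum(count > limit for count in token_counts)
    return {
        "max": max(token_counts),
        "min": min(token_counts),
        "mean": sum(token_counts) / len(token_counts),
        "share_over_limit": n_over / len(token_counts),
    }


# Hypothetical counts; only 4 and 407 come from the output above.
example_counts = [4, 35, 120, 407]
summary = summarize_token_counts(example_counts)
print(summary)
```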


## Tokenize

In [8]:
train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
val_dataset

Map: 100%|██████████| 450/450 [00:00<00:00, 25488.68 examples/s]
Map: 100%|██████████| 117/117 [00:00<00:00, 23685.20 examples/s]


Dataset({
    features: ['conversational', 'label', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 117
})

## Trainer


In [20]:
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average='weighted')
    precision = precision_score(labels, preds, average='weighted')
    recall = recall_score(labels, preds, average='weighted')
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }


training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=8,  # hyperparameter to tune
    per_device_eval_batch_size=8,  # hyperparameter to tune
    num_train_epochs=5,  # hyperparameter to tune
    learning_rate=3.2e-5,  # hyperparameter to tune
    #weight_decay= #
    #warmup_steps = 10,
    logging_dir="./logs",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    logging_steps=50,         
    logging_strategy="steps",
)


trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    compute_metrics=compute_metrics,
)



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
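
`compute_metrics` uses `average='weighted'`, which weights each class's precision by that class's share of the true labels. With skewed labels, weighted precision can therefore stay high while accuracy is low. A hand-rolled illustration with made-up labels (not scikit-learn's implementation):

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_precision(y_true, y_pred):
    """Per-class precision, averaged with each class's share of y_true as weight."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        n_predicted = sum(p == cls for p in y_pred)
        n_correct = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        # A class that is never predicted contributes a precision of 0.
        precision = n_correct / n_predicted if n_predicted else 0.0
        total += (n_true / len(y_true)) * precision
    return total


# Made-up example: 9 samples of class 0, 1 of class 1; the model mostly predicts 1.
y_true = [0] * 9 + [1]
y_pred = [0] + [1] * 9

print(accuracy(y_true, y_pred))            # 2 of 10 correct
print(weighted_precision(y_true, y_pred))  # 0.9 * 1.0 + 0.1 * (1 / 9)
```

Reporting macro-averaged scores next to the weighted ones is one way to make such cases visible.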


## Hyperparameter Search
- optimizing for accuracy since classes are roughly balanced at the user level (12 women : 15 men); prompt-level balance within each split is not verified here
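
Because users contribute different numbers of prompts, a split can be skewed at the prompt level even when user counts are close. A quick check is the accuracy of always predicting the most frequent label; the helper name and the label list below are hypothetical:

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Hypothetical validation labels: 7 prompts of class 0, 3 of class 1.
val_labels = [0] * 7 + [1] * 3
baseline = majority_baseline(val_labels)
print(f"majority-class baseline: {baseline:.2f}")
```

Applied to the real split, for instance `majority_baseline(val_dataset["label"])`, this gives the number that any reported eval accuracy has to beat.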

first search run:
- learning_rate between  5e-6, 5e-5, log=True
- num_train_epochs between 2, 5
- per_device_train_batch_size between 4, 8
- per_device_eval_batch_size 4, 8
- Best hyperparameters: {'learning_rate': 1.752433329903465e-05, 'num_train_epochs': 5, 'per_device_train_batch_size': 4, 'per_device_eval_batch_size': 8}
- Best eval accuracy: 0.6153846153846154

second search run:
- batch sizes 8
- epochs 5
- learning_rate between 5e-6, 2e-5
- Best hyperparameters: {'learning_rate': 1.2308237496976495e-05}
- Best eval accuracy: 0.6837606837606838

third search run:
- batch sizes 4
- learning_rate between 5e-6, 2e-5
- Best hyperparameters: {'learning_rate': 1.2665150015950181e-05}
- Best eval accuracy: 0.6239316239316239

fourth search run:
- batch sizes 8
- learning_rate between 1e-5, 3e-5
- highest accuracy 0.726496
- learning_rate: 2.8213598460702224e-05

fifth search run:
- batch sizes 8
- learning_rate between 3e-5, 4e-5
- highest accuracy 0.760684
- learning_rate: 3.035495167103403e-05

sixth search run:
- batch sizes 8
- learning_rate between 2.5e-5, 3.5e-5
- highest accuracy 0.803419 at 3.20605942472665e-05, epoch 3
- also good: 0.77777 at 3.2759208826863756e-05, epoch 3
- 0.726496 at 2.9592151393562346e-05, epoch 3
- 0.752137 at 3.443498945690748e-05, epoch 3

seventh search run:
- batch sizes 8
- learning_rate between 3.1e-5, 3.4e-5
- highest accuracy 0.752137 at 3.246309190194653e-05, epoch 3
- and at 3.186004390546374e-05, epoch 2


In [10]:
def hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 3.1e-5, 3.4e-5, log=True),
    }


best_run = trainer.hyperparameter_search(
    direction="maximize",
    hp_space=hp_space,
    n_trials=10,
    compute_objective=lambda metrics: metrics["eval_accuracy"]
)

print("Best hyperparameters:", best_run.hyperparameters)
print("Best eval accuracy:", best_run.objective)


[I 2025-09-12 14:41:12,907] A new study created in memory with name: no-name-b36a91d4-6252-450b-9f58-63f195891cca


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6708,0.734681,0.42735,0.479024,0.88961,0.42735
2,0.5382,0.711565,0.649573,0.703361,0.901634,0.649573
3,0.295,0.915815,0.735043,0.775229,0.884441,0.735043
4,0.2566,2.235483,0.589744,0.65023,0.880638,0.589744
5,0.0702,2.138582,0.632479,0.68889,0.884848,0.632479


[I 2025-09-12 14:41:59,679] Trial 0 finished with value: 0.6324786324786325 and parameters: {'learning_rate': 3.2647335994050176e-05}. Best is trial 0 with value: 0.6324786324786325.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6729,0.723271,0.470085,0.527506,0.8913,0.470085
2,0.5594,0.506209,0.752137,0.79028,0.911871,0.752137
3,0.319,0.777743,0.74359,0.782122,0.885812,0.74359
4,0.2637,2.499271,0.529915,0.590564,0.894065,0.529915
5,0.1063,2.256372,0.581197,0.642281,0.879822,0.581197


[I 2025-09-12 14:42:43,389] Trial 1 finished with value: 0.5811965811965812 and parameters: {'learning_rate': 3.186004390546374e-05}. Best is trial 0 with value: 0.6324786324786325.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6718,0.65502,0.57265,0.632651,0.8964,0.57265
2,0.5321,0.627131,0.675214,0.725969,0.889415,0.675214
3,0.297,1.092616,0.649573,0.703894,0.886621,0.649573
4,0.2424,2.493971,0.564103,0.626148,0.878205,0.564103
5,0.0914,2.271428,0.615385,0.67442,0.868061,0.615385


[I 2025-09-12 14:43:27,214] Trial 2 finished with value: 0.6153846153846154 and parameters: {'learning_rate': 3.397683128600429e-05}. Best is trial 0 with value: 0.6324786324786325.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.672,0.730083,0.461538,0.518056,0.890944,0.461538
2,0.5344,0.524401,0.700855,0.747338,0.879318,0.700855
3,0.3043,0.587854,0.752137,0.78973,0.899287,0.752137
4,0.2602,1.982134,0.606838,0.665904,0.882291,0.606838
5,0.1104,2.342886,0.589744,0.65023,0.880638,0.589744


[I 2025-09-12 14:44:11,139] Trial 3 finished with value: 0.5897435897435898 and parameters: {'learning_rate': 3.220895070767254e-05}. Best is trial 0 with value: 0.6324786324786325.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6705,0.678438,0.512821,0.573071,0.893221,0.512821
2,0.5736,0.659745,0.666667,0.718339,0.90303,0.666667
3,0.3155,0.604706,0.752137,0.788039,0.87563,0.752137
4,0.2081,2.470672,0.598291,0.659128,0.86593,0.598291
5,0.0812,2.473979,0.615385,0.67442,0.868061,0.615385


[I 2025-09-12 14:44:55,152] Trial 4 finished with value: 0.6153846153846154 and parameters: {'learning_rate': 3.388205558327037e-05}. Best is trial 0 with value: 0.6324786324786325.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6718,0.649682,0.57265,0.632651,0.8964,0.57265
2,0.5526,0.707331,0.649573,0.703361,0.901634,0.649573


[I 2025-09-12 14:45:11,440] Trial 5 pruned. 


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6693,0.809146,0.205128,0.161724,0.883322,0.205128


[I 2025-09-12 14:45:19,335] Trial 6 pruned. 


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6711,0.712019,0.452991,0.508485,0.890598,0.452991


[I 2025-09-12 14:45:27,042] Trial 7 pruned. 


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6715,0.717691,0.487179,0.546061,0.892038,0.487179
2,0.5476,0.566068,0.692308,0.740271,0.878112,0.692308
3,0.3207,0.731383,0.752137,0.78973,0.899287,0.752137
4,0.2754,2.343637,0.555556,0.617957,0.877401,0.555556
5,0.0903,2.130008,0.606838,0.665904,0.882291,0.606838


[I 2025-09-12 14:46:10,459] Trial 8 finished with value: 0.6068376068376068 and parameters: {'learning_rate': 3.246309190194653e-05}. Best is trial 0 with value: 0.6324786324786325.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6691,0.820681,0.179487,0.115627,0.882784,0.179487


[I 2025-09-12 14:46:18,328] Trial 9 pruned. 


Best hyperparameters: {'learning_rate': 3.2647335994050176e-05}
Best eval accuracy: 0.6324786324786325
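
In the log above, each finished trial's reported value equals its epoch-5 accuracy (0.6325 for trial 0, for example), while most trials peak at epoch 2 or 3. The sketch below copies the per-epoch accuracies of the six trials that ran all five epochs and compares both views; the variable names are made up:

```python
# Per-epoch validation accuracy of the trials that ran all 5 epochs (copied from the log).
epoch_accuracy = {
    0: [0.42735, 0.649573, 0.735043, 0.589744, 0.632479],
    1: [0.470085, 0.752137, 0.74359, 0.529915, 0.581197],
    2: [0.57265, 0.675214, 0.649573, 0.564103, 0.615385],
    3: [0.461538, 0.700855, 0.752137, 0.606838, 0.589744],
    4: [0.512821, 0.666667, 0.752137, 0.598291, 0.615385],
    8: [0.487179, 0.692308, 0.752137, 0.555556, 0.606838],
}

for trial, accs in epoch_accuracy.items():
    # Epochs are 1-based in the log, list indices are 0-based.
    best_epoch = max(range(len(accs)), key=accs.__getitem__) + 1
    print(f"trial {trial}: final={accs[-1]:.3f}  peak={max(accs):.3f} (epoch {best_epoch})")

# Ranking by the last epoch versus ranking by the best epoch (ties go to the first trial).
best_by_final = max(epoch_accuracy, key=lambda t: epoch_accuracy[t][-1])
best_by_peak = max(epoch_accuracy, key=lambda t: max(epoch_accuracy[t]))
print("best by final epoch:", best_by_final, "| best by peak:", best_by_peak)
```

The two rankings disagree, and validation loss rises to about 2 or more from epoch 4 in every finished trial, so the "Best eval accuracy" line probably understates what the same learning rates reach with fewer epochs.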


## Cross Validation

- selected hyperparameters: learning rate 3.2e-5, batch size 8 (train and eval), 5 epochs

In [22]:
trainer.train()




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6725,0.726077,0.478632,0.536841,0.891664,0.478632
2,0.566,0.769974,0.598291,0.656882,0.897979,0.598291
3,0.3759,0.572577,0.709402,0.754701,0.893472,0.709402
4,0.2255,2.08056,0.623932,0.681295,0.883983,0.623932
5,0.1313,2.171423,0.623932,0.681295,0.883983,0.623932




TrainOutput(global_step=285, training_loss=0.35345707859909326, metrics={'train_runtime': 43.6824, 'train_samples_per_second': 51.508, 'train_steps_per_second': 6.524, 'total_flos': 71327762664000.0, 'train_loss': 0.35345707859909326, 'epoch': 5.0})

In [None]:
print(trainer.evaluate())
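
The cells above train and evaluate on a single group-aware split. The group k-fold cross-validation named in the plan could produce its folds along these lines. This is a dependency-free sketch, not the notebook's method: scikit-learn's `GroupKFold` provides the same kind of split, and both the function name `grouped_k_fold` and the user ids are hypothetical:

```python
from collections import Counter


def grouped_k_fold(groups, n_splits=5):
    """Yield (train_idx, val_idx) so that each group lands in exactly one validation fold."""
    sizes = Counter(groups)
    if n_splits > len(sizes):
        raise ValueError("n_splits cannot exceed the number of distinct groups")

    # Greedy balancing: place the largest groups first, each into the currently smallest fold.
    fold_sizes = [0] * n_splits
    fold_of_group = {}
    for group, size in sorted(sizes.items(), key=lambda item: (-item[1], item[0])):
        fold = fold_sizes.index(min(fold_sizes))
        fold_of_group[group] = fold
        fold_sizes[fold] += size

    for fold in range(n_splits):
        val_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train_idx, val_idx


# Hypothetical user ids, one entry per prompt.
user_ids = [6, 6, 6, 8, 11, 11, 16, 16, 16, 16, 25, 34]
folds = list(grouped_k_fold(user_ids, n_splits=3))
for train_idx, val_idx in folds:
    print(len(train_idx), "train /", len(val_idx), "val")
```

Each fold would then get its own datasets, for example from `prompts.iloc[train_idx]` and `prompts.iloc[val_idx]`, and a fresh `Trainer` built with `model_init`, so that no weights carry over between folds.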