# Curation and creation of data for LLM finetuning

## 0. Setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import pandas as pd
import numpy as np
from pathlib import Path
import sys
import json

# Add the project root to the Python path to import the modules
project_root = Path().absolute().parent
sys.path.append(str(project_root))

## 1. LLaMA Parameter Optimisation

*Question 1: Which model does the training save? Which performance is it based off of?*

`load_best_model_at_end=True`, combined with `eval_strategy="epoch"`, `save_strategy="epoch"`, and `save_total_limit=1`, means that:

- The model is evaluated and a checkpoint is saved at the end of each epoch.
- `save_total_limit=1` prunes older checkpoints to prevent clutter. With `load_best_model_at_end=True` the best checkpoint is always kept, so up to two checkpoints can remain on disk: the best one and the most recent one, if they differ.
- `Trainer` automatically reloads the best-performing checkpoint at the end of training. Unless `metric_for_best_model` is set, "best" means the lowest evaluation loss.

So **the model saved is the checkpoint from the epoch with the lowest validation loss**.
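The selection rule can be illustrated with a small sketch. The dictionary repeats the settings quoted above as keyword arguments for `TrainingArguments`; the last two keys are not part of the original configuration and only spell out the defaults. The helper `best_epoch` and the loss values are hypothetical and mimic the rule; they are not the `Trainer` implementation.

```python
# Settings quoted in the text, written as keyword arguments for
# transformers.TrainingArguments. The last two keys are NOT in the original
# configuration: they make the documented defaults explicit for illustration.
training_kwargs = {
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "save_total_limit": 1,
    "load_best_model_at_end": True,
    "metric_for_best_model": "loss",
    "greater_is_better": False,
}


def best_epoch(eval_losses):
    """Return the 1-based epoch with the lowest evaluation loss (first one on ties)."""
    return min(range(len(eval_losses)), key=eval_losses.__getitem__) + 1


# Made-up evaluation losses for a 4-epoch run, purely for illustration.
example_losses = [0.62, 0.48, 0.51, 0.55]
print(best_epoch(example_losses))  # 2
```

In this made-up run the loss rises again after epoch 2, so the checkpoint from epoch 2 would be the one reloaded, even though training continued for two more epochs.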


*Question 2: Which training/LoRA parameters can be explored to improve performance?*

There are two optimisation targets:
- LoRA configuration
- Training hyperparameters

(A) LoRA parameters (from `LoraConfig`) include:
- `r` (e.g. 4 to 32)
- `lora_alpha` (e.g. 8 to 64)
- `target_modules` (e.g. `q_proj`, `k_proj`, `v_proj`, `o_proj`, but also `gate_proj`, `down_proj`, `up_proj`)
- `lora_dropout` (0, or 0.05 or 0.1 if overfitting)

(B) Training hyperparameters (from `TrainingArguments`) include:
- `learning_rate` (e.g. 9e-5 to 2e-4)
- `per_device_train_batch_size` (e.g. 4 to 16)
- `num_train_epochs`
- `warmup_ratio`
- `lr_scheduler_type`
- `weight_decay`
- `gradient_accumulation_steps`
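A search over these parameters could be enumerated as in the sketch below. The grid values are picked from the example ranges listed above and are not the values of the search that was actually run. The helpers `iter_configs` and `run_name` are hypothetical, and the run-name scheme is an assumption inferred from the regular expression used in the log-parsing cell.

```python
import re
from itertools import product

# Illustrative grid only: values are taken from the example ranges in the text,
# not from the search that was actually run.
search_space = {
    "r": [4, 16, 32],
    "lora_alpha": [8, 32, 64],
    "lora_dropout": [0.0, 0.05, 0.1],
    "learning_rate": [9e-5, 2e-4],
    "per_device_train_batch_size": [4, 16],
}


def iter_configs(space):
    """Yield one dict per combination in the Cartesian product of the grid."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def run_name(index, cfg):
    """Build a run name in the (assumed) format the log parser expects."""
    return (
        f"run_{index}_r{cfg['r']}_alpha{cfg['lora_alpha']}"
        f"_drop{cfg['lora_dropout']}_lr{cfg['learning_rate']}"
        f"_bs{cfg['per_device_train_batch_size']}"
    )


configs = list(iter_configs(search_space))
names = [run_name(i, cfg) for i, cfg in enumerate(configs)]

# Same pattern as in the log-parsing cell.
pattern = re.compile(r"run_\d+_r(\d+)_alpha(\d+)_drop([0-9.]+)_lr([\deE.-]+)_bs(\d+)")
print(len(configs), names[0])
```

Even this small grid yields 108 runs, which is why a full grid over all seven training hyperparameters is rarely practical and a subset is usually fixed in advance.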

In [1]:
import re
import pandas as pd
from pathlib import Path

# Paths
log_path = "../llama_search_runs/search_progress_log.txt"
output_dir = Path("../results/latex_tables")
output_dir.mkdir(parents=True, exist_ok=True)
output_file = output_dir / "top7_lora_search.tex"

# Read log
with open(log_path, "r") as f:
    log_lines = f.readlines()

entries = []
for line in log_lines:
    run_match = re.search(r"run_\d+_r(\d+)_alpha(\d+)_drop([0-9.]+)_lr([\deE.-]+)_bs(\d+)", line)
    f1_match = re.search(r"macro_f1: ([0-9.]+)", line)
    if run_match and f1_match:
        r, alpha, drop, lr, bs = run_match.groups()
        entries.append({
            "LoRA rank ($r$)": int(r),
            "LoRA alpha": int(alpha),
            "LoRA dropout": float(drop),
            "Learning rate": lr,
            "Batch size": int(bs),
            "Macro-F1": float(f1_match.group(1))
        })

# DataFrame and top 7
df = pd.DataFrame(entries)
top7 = df.sort_values("Macro-F1", ascending=False).head(7).reset_index(drop=True)

# Round values
top7["LoRA dropout"] = top7["LoRA dropout"].round(2)
top7["Macro-F1"] = top7["Macro-F1"].round(2)

# Convert all values to string for LaTeX formatting
top7 = top7.astype(str)

# Bold only the best Macro-F1 value
best_idx = top7["Macro-F1"].astype(float).idxmax()
top7.at[best_idx, "Macro-F1"] = f"\\textbf{{{top7.at[best_idx, 'Macro-F1']}}}"

# Reorder columns and add vertical bar before Macro-F1
columns = ["LoRA rank ($r$)", "LoRA alpha", "LoRA dropout", "Learning rate", "Batch size", "Macro-F1"]
top7 = top7[columns]  # apply the column order defined above
column_format = "lllll|l"  # vertical bar before last column

# Generate LaTeX
latex_table = top7.to_latex(index=False, escape=False, column_format=column_format)

# Wrap in a centered resizebox (no caption or label is added here)
wrapped_latex = f"""
\\centering
\\resizebox{{\\linewidth}}{{!}}{{%
{latex_table}
}}
"""

# Save LaTeX file
with open(output_file, "w") as f:
    f.write(wrapped_latex)

print(f"✅ LaTeX table with top 7 and bolded best Macro-F1 saved to {output_file}")

✅ LaTeX table with top 7 and bolded best Macro-F1 saved to ../results/latex_tables/top7_lora_search.tex


## 2. RoBERTa Parameter Optimisation

This section describes how I fine-tuned a RoBERTa model to classify sentences as containing Social Determinants of Health (SDoH) or not.

### Loss Function

I use **Binary Cross-Entropy with Logits Loss** (`BCEWithLogitsLoss`) with a `pos_weight` parameter to address class imbalance:

\[
\mathcal{L}(z, y) = -\left[w \cdot y \cdot \log(\sigma(z)) + (1 - y) \cdot \log(1 - \sigma(z))\right]
\]

- \( z \): raw model output (logit)
- \( y \in \{0, 1\} \): binary label
- \( \sigma(z) \): sigmoid function
- \( w = \text{pos\_weight} \): weight applied to the positive-class term only, set to `#neg / #pos` in the training data

Because \( w > 1 \) when positive examples are the minority class, misclassifying a positive example is penalized more heavily than misclassifying a negative one.
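The effect of the weight can be checked numerically with a dependency-free sketch. In PyTorch's `BCEWithLogitsLoss`, `pos_weight` multiplies only the positive-label term, and the code below mirrors that behaviour for a single example. The function names `bce_with_logits` and `pos_weight_from_labels` are hypothetical helpers for illustration, not part of the training code.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bce_with_logits(z, y, pos_weight=1.0):
    """Weighted binary cross-entropy on a single logit z with label y in {0, 1}.

    Uses log(sigmoid(z)) = -softplus(-z) and log(1 - sigmoid(z)) = -softplus(z).
    The weight scales only the positive-label term.
    """
    return pos_weight * y * softplus(-z) + (1 - y) * softplus(z)


def pos_weight_from_labels(labels):
    """Return #neg / #pos for a list of 0/1 labels (assumes at least one positive)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos


w = pos_weight_from_labels([1, 0, 0, 0])  # 3 negatives per positive
print(w)                                   # 3.0
print(bce_with_logits(0.0, 1, w) / bce_with_logits(0.0, 0, w))  # 3.0
```

At the same logit, a missed positive costs `w` times as much as a missed negative, which is what compensates for the positives being `w` times rarer.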

### Tunable Parameters

I explored the impact of the following hyperparameters on model performance:

| Category       | Parameter                     | Description                                | Typical Values      |
|----------------|-------------------------------|--------------------------------------------|---------------------|
| **Model**      | `num_frozen_layers`           | Number of RoBERTa encoder layers to freeze | `0`, `6`, `10`      |
| **Training**   | `learning_rate`               | Optimizer learning rate                    | `1e-5` to `5e-5`    |
|                | `num_of_epochs`               | Number of training epochs                  | `3` to `10`         |
|                | `per_device_train_batch_size` | Batch size per GPU                         | `4`, `8`, `16`      |
| **Tokenizer**  | `max_length`                  | Maximum sequence length after tokenization | `64`, `128`         |
| **Model head** | Dropout rate                  | Dropout before classification layer        | `0.1`, `0.3`, `0.5` |

### Optimization Strategy

I conducted a manual grid search, recording macro F1 and validation loss for each combination. The best model is the one with the lowest validation loss.