# Curation and creation of data for LLM finetuning

## 0. Setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import pandas as pd
import numpy as np
from pathlib import Path
import sys
import json

# Add the project root to the Python path to import the modules
project_root = Path().absolute().parent
sys.path.append(str(project_root))

## 1. LLaMA Parameter Optimisation

*Question 1: Which model does the training save? Which performance is it based off of?*

`load_best_model_at_end=True`, combined with `eval_strategy="epoch"`, `save_strategy="epoch"` and `save_total_limit=1`, means that:

- The model is evaluated and a checkpoint is saved at the end of each epoch.
- `save_total_limit=1` deletes older checkpoints to prevent clutter, but the best checkpoint is always kept, so up to two checkpoints can remain on disk (the best one and the most recent one, if they differ).
- `Trainer` automatically reloads the best-performing checkpoint at the end of training; with `metric_for_best_model` left unset, "best" means lowest evaluation loss.

So **the model that is kept is the checkpoint from the epoch with the lowest validation loss**.
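
The selection rule can be sketched without running a training job. The snippet below is a minimal illustration, assuming the settings quoted above: `training_kwargs` is a plain dict of the keyword arguments one would pass to `TrainingArguments`, while the per-epoch losses and the helper `pick_best_epoch` are hypothetical and exist only to show how "best" is decided.

```python
# Keyword arguments as quoted in the answer above; in the notebook they would
# be passed on to transformers.TrainingArguments together with the other settings.
training_kwargs = {
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "save_total_limit": 1,
    "load_best_model_at_end": True,
    # Left unset in the original; written out here to make the default explicit.
    "metric_for_best_model": "loss",
    "greater_is_better": False,
}


def pick_best_epoch(eval_losses, greater_is_better=False):
    """Return the 1-based epoch whose metric is best (hypothetical helper)."""
    chooser = max if greater_is_better else min
    best_index = chooser(range(len(eval_losses)), key=lambda i: eval_losses[i])
    return best_index + 1


# Made-up validation losses, one per epoch, purely for illustration.
example_eval_losses = [0.92, 0.71, 0.68, 0.74]
best_epoch = pick_best_epoch(
    example_eval_losses, training_kwargs["greater_is_better"]
)
print(f"Checkpoint reloaded at the end: epoch {best_epoch}")
```

With these made-up numbers the loss rises again in the last epoch, so the reloaded checkpoint is not the final one. This is the situation `load_best_model_at_end` is meant to handle.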


*Question 2: Which training/LoRA parameters can you explore to improve performance?*

There are two optimisation targets:
- LoRA configuration
- Training hyperparameters

(A) LoRA parameters (from LoraConfig) include
- r (e.g. 4 to 32)
- lora_alpha (e.g. 8 to 64)
- target modules (e.g. q_proj, k_proj, v_proj, o_proj, but also gate_proj, down_proj, up_proj)
- lora_dropout (e.g. 0, or 0.05 to 0.1 if the model overfits)
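
To get a feel for what r and lora_alpha trade off, the sketch below counts the extra trainable weights LoRA adds to one linear layer and the scaling applied to the low-rank update. It assumes the standard LoRA formulation (two low-rank matrices, update scaled by `lora_alpha / r`). The 4096-wide layer is a hypothetical size chosen for illustration, not taken from the model used here, and both helper functions are illustrative names.

```python
def lora_extra_params(d_in, d_out, r):
    """Trainable weights added by LoRA to one d_in x d_out linear layer."""
    # Matrix A has shape r x d_in and matrix B has shape d_out x r.
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update in the standard formulation."""
    return lora_alpha / r


# Hypothetical square projection layer, for illustration only.
d_in = d_out = 4096

for r, lora_alpha in [(4, 8), (16, 32), (32, 64)]:
    added = lora_extra_params(d_in, d_out, r)
    full = d_in * d_out
    print(
        f"r={r:>2} alpha={lora_alpha:>2} "
        f"added={added:>7} ({added / full:.2%} of the frozen layer) "
        f"scaling={lora_scaling(lora_alpha, r)}"
    )
```

Keeping lora_alpha at twice r, as in these pairs, holds the scaling fixed while r changes the adapter's capacity. Each additional target module contributes its own count on top.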

(B) Training hyperparameters (from TrainingArguments)
- learning_rate (e.g. 9e-5 to 2e-4)
- per_device_train_batch_size (e.g. 4 to 16)
- num_train_epochs
- warmup_ratio
- lr_scheduler_type
- weight_decay
- gradient_accumulation_steps
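
One simple way to explore these settings is a small grid. The sketch below only enumerates candidate configurations and reports the effective batch size of each. The values for warmup_ratio and gradient_accumulation_steps, the single-device assumption, and the names `search_space`, `expand_grid` and `effective_batch_size` are illustrative choices. Each resulting dict would still need to be passed to `TrainingArguments` and trained.

```python
from itertools import product

# Candidate values: learning_rate and batch size follow the ranges listed
# above, the remaining values are illustrative assumptions.
search_space = {
    "learning_rate": [9e-5, 2e-4],
    "per_device_train_batch_size": [4, 16],
    "gradient_accumulation_steps": [1, 4],
    "warmup_ratio": [0.0, 0.1],
}


def expand_grid(space):
    """Return one dict per combination of the candidate values."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]


def effective_batch_size(config, num_devices=1):
    """Examples contributing to each optimizer step (single device assumed)."""
    return (
        config["per_device_train_batch_size"]
        * config["gradient_accumulation_steps"]
        * num_devices
    )


configs = expand_grid(search_space)
print(f"{len(configs)} configurations")
for config in configs[:3]:
    print(config, "-> effective batch size", effective_batch_size(config))
```

Because batch size and gradient accumulation multiply, it can help to compare runs at the same effective batch size, so that a change in learning rate is not confounded with a change in the number of examples per step.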

## 2. RoBERTa Parameter Optimisation