# Introduction

# Fine-Tuning Notebook: Hybrid Synthetic + RefinedWeb (Positive Sentiment)

This notebook is designed to fine-tune a language model using a **hybrid dataset**, consisting of:
- `Synthetic data` generated with GPT-3.5 Turbo
- `RefinedWeb subset` with known **positive sentiment** content.

The goal is to compare performance across **multiple parameter configurations** to identify the optimal fine-tuning setup.

# Important

In [None]:
# Change `run_id` (Section 6) before each run.

In [None]:
# %env TORCH_USE_CUDA_DSA=1

# 1 Notebook Setup

In [None]:
!pip install -r requirement.txt


In [None]:
# Unsloth must be imported before transformers/peft so that its patches apply
from unsloth import FastLanguageModel

# Core libraries
import os
import gc
import time
import json

# Data manipulation
import numpy as np
import pandas as pd

# Experiment tracking (Weights & Biases)
import wandb

# Torch
import torch

# HuggingFace datasets
from datasets import Dataset

# Transformers
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)

# PEFT (Parameter-Efficient Fine-Tuning)
from peft import PeftModel

# Colab secrets (used to read the W&B API key)
from google.colab import userdata

# Hugging Face Hub
from huggingface_hub import whoami, login, HfApi

# NLP
from nltk.translate.bleu_score import sentence_bleu


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [None]:
# Reproducibility setup
torch.manual_seed(42) # Fix seed
print(f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}")
print(f"CUDA available: {torch.cuda.is_available()}")


PyTorch 2.7.1+cu126, CUDA 12.6
CUDA available: True


# 2 Load synthetic dataset

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load synthetic instruction dataset
df = pd.read_csv("/content/drive/MyDrive/synthetic_prompt_generation_shared/strategy_evaluation/synthetic_hm_instruction_pos_refinedweb_with_syn_merged.csv")

In [None]:
# Max full width
pd.set_option("display.max_colwidth", None)

In [None]:
# Check structure
df.head()

Unnamed: 0,instruction,output
0,What kind of products are recommended in the post for a baby boy?,"A bundle of bodysuits, pants, hoodies, and swaddle blankets with cute designs like zoo party and rockstar themes, along with bibs and burp cloths."
1,What event is being described in the post?,"A wine tasting event on Wednesday at La Brisas Golf Club, featuring a wide selection of wines from around the world, including a favorite red wine from Dominio de Atauta."
2,What is the main topic of the post?,"A quick recap of the author's Friday, including a fast breakfast recipe, favorite meals, yogurt consumption, and a shopping experience at H&M."
3,What item of clothing is being praised in the post?,"A navy silk dress that the blogger loves and recommends for a stylish look, especially when paired with black Chelsea boots for autumn."
4,What style tips are mentioned in the post?,"Casual knitwear, high-waisted jeans, brogues, accessorizing with sunglasses and rings, and adding a pop of turquoise with shoes."


The dataset still contains null values at this point; they are dropped in Section 3.

In [None]:
gc.collect()

300

# 3 Format Data for QLoRA Instruction-Tuning

In [None]:
# Drop rows with a missing instruction or output
df = df.dropna(subset=["instruction", "output"])
print(f"Rows after cleaning: {len(df)}")

Rows after cleaning: 200


## 3.1 Apply synthetic dataset as Training Dataset

### Define function: format each row as an instruction-based prompt

In [None]:
# Convert a row into a structured instruction/response prompt string
def format_row(row):
    """Format instruction-response pairs without input"""
    return {
        "text": f"### Instruction:\n{row['instruction']}\n### Response:\n{row['output']}"
    }
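The preview output further down shows a leading space before the response text, which suggests the raw `output` column carries stray whitespace. The following is a minimal sketch of a variant that trims it, assuming trimming is actually wanted; `format_row_stripped` is a hypothetical name and is not used elsewhere in this notebook.

```python
def format_row_stripped(row):
    """Variant of format_row that trims surrounding whitespace from both fields."""
    instruction = str(row["instruction"]).strip()
    output = str(row["output"]).strip()
    return {"text": f"### Instruction:\n{instruction}\n### Response:\n{output}"}


# Toy row for illustration only (not taken from the dataset)
example = {"instruction": "What is the main topic of the post? ", "output": " A quick recap."}
formatted = format_row_stripped(example)
print(formatted["text"])
```

Swapping this in would change the training text slightly, so results would no longer be directly comparable with earlier runs.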

### Apply function to entire dataset

In [None]:
# Create Hugging Face Dataset
# Apply format_row() to every row and extract only the 'text' column
dataset = Dataset.from_pandas(df).map(format_row).select_columns(['text'])

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [None]:
# Preview
print(dataset[0]['text'])  # should be in prompt format

### Instruction:
What kind of products are recommended in the post for a baby boy?
### Response:
 A bundle of bodysuits, pants, hoodies, and swaddle blankets with cute designs like zoo party and rockstar themes, along with bibs and burp cloths.


## 3.2 String formatting for fine-tuning

In [None]:
# Check a few sample rows from the dataset: each must be a plain string for fine-tuning
print("\n--- Debugging Dataset Sample ---")

# Print the type and content of the first formatted prompt
print(f"Type of dataset['text'][0]: {type(dataset['text'][0])}")
print(f"Content of dataset['text'][0][:500]:\n{dataset['text'][0][:500]}...")  # Show the first 500 characters

# Optionally check the next 1-2 samples if available
if len(dataset) > 1:
    print(f"Type of dataset['text'][1]: {type(dataset['text'][1])}")
if len(dataset) > 2:
    print(f"Type of dataset['text'][2]: {type(dataset['text'][2])}")

print("--- End Debugging Dataset Sample ---\n")


--- Debugging Dataset Sample ---
Type of dataset['text'][0]: <class 'str'>
Content of dataset['text'][0][:500]:
### Instruction:
What kind of products are recommended in the post for a baby boy?
### Response:
 A bundle of bodysuits, pants, hoodies, and swaddle blankets with cute designs like zoo party and rockstar themes, along with bibs and burp cloths....
Type of dataset['text'][1]: <class 'str'>
Type of dataset['text'][2]: <class 'str'>
--- End Debugging Dataset Sample ---



In [None]:
# This should display as a plain Python string
dataset[0]['text']

'### Instruction:\nWhat kind of products are recommended in the post for a baby boy?\n### Response:\n A bundle of bodysuits, pants, hoodies, and swaddle blankets with cute designs like zoo party and rockstar themes, along with bibs and burp cloths.'

# 4 W&B Setup & Hugging Face Token

### Weights & Biases Setup

In [None]:
wandb_api_key = userdata.get("WANDB_API_KEY")
wandb.login(key=wandb_api_key)
wandb_project = "olmo2-hm-config-compare"

wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: thanchanok-pura (thanchanok-ucl) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


### Hugging Face token (read from environment)

In [None]:
# Never hard-code the secret token in the notebook; read it from the environment
use_auth_token = os.environ.get("HF_TOKEN")

In [None]:
# Initialize HuggingFace API object using the token
api = HfApi(token=use_auth_token)
print(api.whoami())

{'type': 'user', 'id': '67b25c324815fd201d1b74db', 'name': 'sqffriend', 'fullname': 'Thanchanok P.', 'isPro': False, 'avatarUrl': '/avatars/273fa518d86ca3f4ae21e3eae35eaf66.svg', 'orgs': [], 'auth': {'type': 'access_token', 'accessToken': {'displayName': 'HF_TOKEN', 'role': 'fineGrained', 'createdAt': '2025-02-16T21:54:53.449Z', 'fineGrained': {'canReadGatedRepos': True, 'global': [], 'scoped': [{'entity': {'_id': '67b25c324815fd201d1b74db', 'type': 'user', 'name': 'sqffriend'}, 'permissions': ['repo.content.read', 'repo.write', 'inference.endpoints.infer.write', 'inference.endpoints.write', 'inference.serverless.write']}]}}}}


# 5 Fine-tuning the OLMo-2-7B-Instruct model

This setup is sized for a Colab T4 GPU. It works best with small datasets (20–200 examples) for prototype training, preliminary analysis, or sanity checks, and it reduces the chance of out-of-memory (OOM) errors.

## Using QLoRA for low resource management
QLoRA combines a quantized backbone with LoRA tuning:
- QLoRA = quantized model + LoRA adapters
- `load_in_4bit=True`: the model is quantized with bitsandbytes
- `get_peft_model()`: injects the LoRA adapters

| Component        | Meaning                                                   |
| ---------------- | --------------------------------------------------------- |
| **Quantization** | Use a pretrained model in 4-bit (or 8-bit) to save VRAM   |
| **LoRA**         | Train only small low-rank adapter layers (not full model) |
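The arithmetic behind both rows can be sketched in a few lines. This is a rough estimate only: it counts weight storage and ignores quantization constants, activations, and optimizer state. The 7e9 parameter count is a round figure and `HIDDEN = 4096` is a hypothetical layer width chosen for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in matrix (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)


NUM_PARAMS = 7e9  # assumption: nominal size of a "7B" model
HIDDEN = 4096     # assumption: hypothetical layer width, for illustration only

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
adapter = lora_params(HIDDEN, HIDDEN, rank=8)
full = HIDDEN * HIDDEN

print(f"FP16 weights:  {fp16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
print(f"LoRA r=8 on one {HIDDEN}x{HIDDEN} matrix: {adapter:,} vs {full:,} parameters")
```

Under these assumptions the adapter for a single square matrix trains well under 1% of that matrix's parameters, which is why the combination fits on a small GPU.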


In [None]:
# Create an offload directory for model components; enable this if a future run needs offloading under low-resource conditions
# os.makedirs("./offload", exist_ok=True)

The fine-tuning logic is **wrapped in a function as good practice**:

- parameterized: hyperparameters are passed as arguments

- modular: datasets, configs, and model logic stay cleanly separated

The function will be used to trial 3 different configurations across 3 brands.

## Fine-tuning with Unsloth

The fine-tuning process leverages Unsloth to simplify and optimize training large language models using QLoRA. This is especially useful when working with limited GPU memory, like on Colab T4.
Unsloth allows for easy integration of:
- 4-bit quantized models (via bitsandbytes)
- LoRA adapters for parameter-efficient fine-tuning
- Gradient checkpointing to reduce memory usage

Unsloth is used in these two calls:
- `model, tokenizer = FastLanguageModel.from_pretrained(...)`
- `model = FastLanguageModel.get_peft_model(...)`


## Benefits of Using Unsloth

| Feature                    | What It Does                                           | Why It Matters                         |
| -------------------------- | ------------------------------------------------------ | -------------------------------------- |
| `from_pretrained()`        | Loads 4-bit quantized model + tokenizer                | Simplifies QLoRA setup                 |
| `get_peft_model()`         | Injects LoRA adapters & enables gradient checkpointing | Saves memory during training           |
| 4-bit Quantization (QLoRA) | Uses `bitsandbytes` to reduce model size               | Fits larger models in low VRAM         |
| Auto memory management     | Automatically offloads parts of model if needed        | Works well on Colab or 8GB GPUs        |
| Colab-friendly defaults    | No complex config needed                               | Ideal for prototyping & small datasets |


## Model Configuration

Configuration used in this notebook:

| Component     | Setting            | Rationale                              |
|---------------|--------------------|----------------------------------------|
| Base Model    | OLMo-2-7B-Instruct | Optimized for instruction following    |
| Quantization  | 4-bit (QLoRA)      | Fits in T4 memory (16 GB VRAM)         |
| LoRA Rank     | 8*                 | Balance quality / efficiency           |
| LoRA Alpha    | 16*                | 2× rank                                |
| Batch Size    | 1 (effective 16)   | Gradient accumulation strategy         |
| Max Steps     | 100                | Sufficient for the 200-example dataset |
| Learning Rate | 1 × 10⁻⁴ (1e-4)    | Lowered for gentler updates            |
| Scheduler     | Cosine             | Standard for LLM fine-tuning           |

In [None]:
# wandb_run_name = f"olmo-7b-instruct-qlora{int(time.time())}"

In [None]:
def finetune_olmo(tag, model_id, dataset, max_steps=100):
    """Fine-tune OLMo-7B using QLoRA and Unsloth (LR = 1e-4 variant)."""

    # housekeeping
    gc.collect()                    # clear Python GC
    torch.cuda.empty_cache()        # clear GPU VRAM

    # Define LoRA config (centralised use)
    r = 8                           # LoRA rank dimension, controls adapter capacity
    alpha = 16                      # Scaling factor for LoRA, typically 2× rank

    # Create reproducible and descriptive W&B run name
    from datetime import datetime
    model_short = model_id.split("/")[-1]    # Extract model name (e.g. "OLMo-2-1124-7B-Instruct")
    wandb_run_name = (
        f"{model_short}-qlora"               # Base name
        f"-r{r}-a{alpha}-lr1e4"              # note the LR in the name
        f"-s{max_steps}-{tag}"               # include config info
        f"-{datetime.now().strftime('%Y%m%d-%H%M')}"  # timestamp for uniqueness
    )

    # Initialize Weights & Biases for experiment tracking
    wandb.init(
        project=wandb_project,
        name=wandb_run_name,
        config={
            "model": model_id,
            "lora_rank": r,
            "lora_alpha": alpha,
            "max_steps": max_steps,
            "learning_rate": 1e-4           # log the new LR (decrease)
        }
    )

    print(f"Starting fine-tune: {tag} using {model_id}")

    # Load base model with 4-bit quantization (QLoRA)
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_id,
        max_seq_length=512,          # maximum sequence length for input
        load_in_4bit=True,           # enable 4-bit quantization
        device_map="auto",           # automatic device placement
        trust_remote_code=True,      # allow custom model code
        token=os.getenv("HF_TOKEN"), # Hugging Face auth token
        offload_folder="./offload"   # directory for offloading
    )

    # Configure tokenizer padding
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Apply LoRA adapters to the model (attach LoRA adapters)
    model = FastLanguageModel.get_peft_model(
        model,
        r=r,                          # use shared LoRA rank
        lora_alpha=alpha,             # use shared LoRA alpha
        lora_dropout=0.0,             # keep dropout disabled (same as baseline)
        use_gradient_checkpointing="unsloth",  # memory optimisation
        random_state=42               # for reproducibility
    )

    # Tokenisation function for the dataset
    def tokenize_function(examples):
        return tokenizer(
            examples["text"],
            truncation=True,          # truncate long sequences
            padding="max_length",     # pad to max_length
            max_length=512            # match model's max_seq_length
        )

    # Process the entire dataset
    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,                 # process in batches
        remove_columns=['text']       # remove original text column
    )

    # Configure training parameters explicitly via TrainingArguments
    training_args = TrainingArguments(
        output_dir=f"./outputs/{tag}",    # save directory
        per_device_train_batch_size=1,    # batch size per device (fits Colab T4)
        gradient_accumulation_steps=16,   # effective batch size = 16
        max_steps=max_steps,              # total training steps
        logging_steps=10,                 # log every 10 steps
        report_to="wandb",                # log to W&B
        learning_rate=1e-4,              # ← CHANGED: lower LR for gentler updates
        lr_scheduler_type="cosine",       # learning-rate scheduler
        warmup_ratio=0.03,                # warm-up period ratio
        weight_decay=0.0,
        fp16=not torch.cuda.is_bf16_supported(),  # mixed precision
        bf16=torch.cuda.is_bf16_supported(),
        gradient_checkpointing=True,
        max_grad_norm=1.0,
        optim="paged_adamw_32bit",        # memory-efficient optimizer
        seed=42,                          # reproducibility
        save_strategy="steps",            # save at end of training
        save_steps=max_steps
        # *all other defaults (weight_decay=0, grad_norm, seed) unchanged*
    )

    # Initialize Trainer object
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
        data_collator=DataCollatorForLanguageModeling(
            tokenizer,
            mlm=False  # causal language modelling
        )
    )

    # Execute training
    trainer.train()

    # Save the trained model
    model.save_pretrained(f"./outputs/{tag}")  # automated model saving

    # Finalise experiment tracking
    wandb.finish()

    # Clean up memory
    del model, trainer
    gc.collect()
    torch.cuda.empty_cache()


# Summary of Components Used in `finetune_olmo`

| Section                          | Description                                                |
|----------------------------------|------------------------------------------------------------|
| `FastLanguageModel.from_pretrained` | Loads model with QLoRA (4-bit) via Unsloth                 |
| `get_peft_model()`              | Injects LoRA adapters for efficient fine-tuning            |
| `tokenize_function()`           | Converts raw text into model tokens                        |
| `TrainingArguments`             | Sets parameters for the training loop                      |
| `Trainer()`                     | Manages the fine-tuning loop                               |
| `DataCollator`                  | Dynamically prepares training batches                      |
| `offload_folder`                | Directory for weights offloaded to disk to save memory     |
| `pad_token = eos_token`         | Sets fallback padding token for models lacking one         |
| `fp16`                          | Enables mixed precision training to save memory and speed  |


### Hyperparameters to tune in next steps

| Hyperparameter          | Typical Values      | Description                       |
|-------------------------|---------------------|-----------------------------------|
| `learning_rate`         | `1e-4` – `3e-4`     | Often the most sensitive          |
| `max_steps`             | `100` – `1000+`     | Controls training duration        |
| `batch_size`            | `1` – `8`           | Lower = safer for VRAM            |
| `gradient_accumulation` | `4` – `32`          | Simulates larger batches          |
| `lora_r`                | `8`, `16`, `32`     | Rank of LoRA adapters             |
| `lora_alpha`            | `16` – `64`         | Scale of LoRA update              |
| `lora_dropout`          | `0.0`, `0.05`, `0.1`| Prevent overfitting               |
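One way to organise the next round of trials is to enumerate the configurations up front and derive a tag for each. This is only a sketch: the specific values, the 2× rule for `lora_alpha`, and the tag format are assumptions chosen for illustration, not settings already used in this notebook.

```python
from itertools import product

# Assumption: candidate values picked from the ranges in the table above
learning_rates = [1e-4, 2e-4]
lora_ranks = [8, 16]

configs = [
    {
        "learning_rate": lr,
        "lora_r": r,
        "lora_alpha": 2 * r,                   # assumption: keep alpha at 2x rank
        "tag": f"r{r}-a{2 * r}-lr{lr:.0e}",    # hypothetical tag format
    }
    for lr, r in product(learning_rates, lora_ranks)
]

for cfg in configs:
    print(cfg)
```

Each dictionary could then be passed to a parameterized version of the fine-tuning function, with the tag reused in the W&B run name.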

In [None]:
print(dataset[0])
print(dataset[0]["text"])

{'text': '### Instruction:\nWhat kind of products are recommended in the post for a baby boy?\n### Response:\n A bundle of bodysuits, pants, hoodies, and swaddle blankets with cute designs like zoo party and rockstar themes, along with bibs and burp cloths.'}
### Instruction:
What kind of products are recommended in the post for a baby boy?
### Response:
 A bundle of bodysuits, pants, hoodies, and swaddle blankets with cute designs like zoo party and rockstar themes, along with bibs and burp cloths.


In [None]:
os.environ["TORCH_USE_CUDA_DSA"] = "1"

The inference step below uses the standardized prompts.

# 6 Combine All Functions Together

## Execution Flow

### Configuration Constants (Tag of logging)

In [None]:
# Configuration Constants
MODEL_ID = "allenai/OLMo-2-1124-7B-Instruct"
RUN_TAG = "olmo-hm"

### **Reproducibility log** (change for each test run)

In [None]:
# *CHANGE THIS FOR EACH RUN*
run_id = 8

In [None]:
TARGET_BRAND = "h&m".lower()

### Configuration Constants (Path)

In [None]:
full_run_tag = f"{RUN_TAG}-v{run_id}"
base_dir = "/content/drive/MyDrive"
adapter_dir = f"{base_dir}/finetuned_models_shared/hm/{full_run_tag}"
results_path = f"{base_dir}/prompt_shared/after_finetuning_analysis/hm/olmo-results-v{run_id}.csv"

### Fine-tuning the model

In [None]:
# Create unique identifiers using run_id
full_run_tag = f"{RUN_TAG}-v{run_id}"
model_save_path = f"/content/drive/MyDrive/finetuned_models_shared/hm/{full_run_tag}"

# Fine-tuning
finetune_olmo(
    tag=full_run_tag,
    model_id=MODEL_ID,
    dataset=dataset,
    max_steps=100
)

# Save model
os.makedirs(model_save_path, exist_ok=True)
!cp -r ./outputs/{full_run_tag}/* "{model_save_path}"
# !cp -r ./outputs/{full_run_tag} "{model_save_path}"

print(f"Model saved to: {model_save_path}")

Starting fine-tune: olmo-hm-v8 using allenai/OLMo-2-1124-7B-Instruct
Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.7.3: Fast Olmo2 patching. Transformers: 4.53.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.31.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Olmo2 does not support SDPA - switching to eager!


Unsloth: Making `model.base_model.model.model` require gradients


Map:   0%|          | 0/200 [00:00<?, ? examples/s]



Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,2.5927
20,2.0244
30,1.6279
40,1.4455
50,1.2971
60,1.1819
70,1.1817
80,1.1632
90,1.0665
100,1.0445


0,1
train/epoch,▁▂▃▃▄▅▆▆▇██
train/global_step,▁▂▃▃▄▅▆▆▇██
train/grad_norm,█▁▁▁▁▁▁▂▁▂
train/learning_rate,██▇▆▅▄▃▂▁▁
train/loss,█▅▄▃▂▂▂▂▁▁

0,1
total_flos,3.2763736674533376e+16
train/epoch,7.72
train/global_step,100.0
train/grad_norm,0.77075
train/learning_rate,0.0
train/loss,1.0445
train_loss,1.46255
train_runtime,2628.5719
train_samples_per_second,0.609
train_steps_per_second,0.038


Model saved to: /content/drive/MyDrive/finetuned_models_shared/hm/olmo-hm-v8


# 8 Model Inference with Prompts

## 8.1 Inference with Standardized Prompts

In [None]:
# Load standardized prompts
df = pd.read_csv(f"{base_dir}/prompt_shared/baseline_analysis/standardized_brand_prompt.csv", encoding="cp1252")
prompts = df["prompt_text"].dropna().tolist()

In [None]:
df.head()

Unnamed: 0,prompt_id,prompt_text,Prompt Category,Explanation
0,1,Which fashion retail brands are generally considered the best in the UK today?,Comparison,"Open-ended prompt, General Ranking"
1,2,"Rank Zara, H&M, Primark, and Burberry based on their overall brand presence and public awareness in the fashion industry. Provide justification for the ranking.",Comparison,Fixed-brand ranking prompt to measure general public awareness and brand presence without specific attribute bias.
2,3,"Provide a ranked list of Zara, H&M, Primark, and Burberry based on their overall perception as fashion brands by the general public. Explain your reasoning for the order.",Comparison,Measures how the model perceives general reputation across all brands using an explicit ranked list format.
3,4,"Rank Zara, H&M, Primark, and Burberry by perceived price-to-value.",Comparison,Forces rank comparison on affordability vs. quality.
4,5,"Rank Zara, H&M, Primark, and Burberry by overall customer service experience.",Comparison,Evaluates customer-facing experience.


In [None]:
# Disable Unsloth's "fast generate" patch for generation:
# - Intended for compatibility with OLMo's architecture
# - Without this, generation might produce incorrect outputs
FastLanguageModel.for_inference = False

# 1) Load tokenizer
# Load the tokenizer matching OLMo-7B model:
# - Converts text to numerical tokens the model understands
# - 'local_files_only=False' ensures downloading if not cached
tokenizer = AutoTokenizer.from_pretrained(
    "allenai/OLMo-2-1124-7B-Instruct",
    local_files_only=False
)

# 2) Load base model with 4-bit quantization (memory optimization)
base = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo-2-1124-7B-Instruct",
    # 4-bit quantization:
    # - Reduces weight memory from ~14 GB (FP16) to ~4 GB
    # - Required here: without it the model exceeds the T4's GPU memory
    # - Deprecated argument (see the warning in the output); newer code passes
    #   a `BitsAndBytesConfig` via `quantization_config` instead
    load_in_4bit=True,
    # Automatic device mapping: places layers on the GPU first and spills
    # to CPU only if they do not fit
    device_map="auto",
    # Offload buffers along with parameters when layers are offloaded
    offload_buffers=True,
    # Use 16-bit floating point for non-quantized computation
    torch_dtype=torch.float16,
    # Allow downloading the model if it is not cached locally
    local_files_only=False
)

# 3) Attach the LoRA adapter
# Load the fine-tuned adapter on top of the base OLMo model:
# - adapter_dir: contains the LoRA weights from fine-tuning
# - The adapter stays separate; it is not merged into the base weights here
model = PeftModel.from_pretrained(
    base,
    adapter_dir,
    torch_dtype=torch.float16
)
# Enable the KV cache: reuses attention keys/values from earlier tokens
# instead of recomputing them at every generation step
model.config.use_cache = True

# 4) Inference loop
# Use the device that holds the model's first parameters (GPU if available)
device = next(model.parameters()).device
results = []
for i, prompt in enumerate(prompts, start=1):
    print(f"Generating {i}/{len(prompts)}")

    # Tokenize prompt and move to device
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Generate response with configured settings
    outputs = model.generate(
        **inputs,
        # 512 new tokens leaves room for long-form responses; a lower value
        # saves memory and time but may cut responses short
        max_new_tokens=512,
        temperature=0.7,  # lower = more deterministic, higher = more random
        do_sample=True,   # enables probabilistic sampling
        use_cache=True    # uses the KV cache for speed
    )
    # Decode tokens to text; outputs[0] holds the prompt tokens followed by
    # the generated tokens, so `text` includes the prompt
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    results.append({"prompt": prompt, "response": text})

# 5) Save results
pd.DataFrame(results).to_csv(results_path, index=False)
print("Done!")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Generating 1/30




Generating 2/30
Generating 3/30
Generating 4/30
Generating 5/30
Generating 6/30
Generating 7/30
Generating 8/30
Generating 9/30
Generating 10/30
Generating 11/30
Generating 12/30
Generating 13/30
Generating 14/30
Generating 15/30
Generating 16/30
Generating 17/30
Generating 18/30
Generating 19/30
Generating 20/30
Generating 21/30
Generating 22/30
Generating 23/30
Generating 24/30
Generating 25/30
Generating 26/30
Generating 27/30
Generating 28/30
Generating 29/30
Generating 30/30
Done!
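Because `generate` on a decoder-only model returns the prompt tokens followed by the new tokens, the saved `response` column contains the prompt as well as the completion. This matters for metrics that search the response for a brand name, since a brand named in the prompt is then counted too. The sketch below shows the slicing idea with plain lists standing in for tensors; the token ids and `split_generated` are hypothetical.

```python
def split_generated(output_ids, prompt_length):
    """Return only the tokens produced after the prompt."""
    return output_ids[prompt_length:]


prompt_ids = [5, 17, 23]               # hypothetical prompt token ids
output_ids = prompt_ids + [88, 91, 2]  # what generate() would return: prompt + new tokens
completion_ids = split_generated(output_ids, len(prompt_ids))
print(completion_ids)
```

In the loop above, the equivalent step would presumably be slicing `outputs[0]` from `inputs["input_ids"].shape[1]` onward before decoding. Doing so would change the metric values, so it should be applied consistently across runs that are compared.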


## Log Standardized-Prompt Responses to W&B (Excluding Thematic Prompts)

#### Define supported metrics

| Function Name             | Purpose                                                                                      | Output           | Notes                                                                 |
|--------------------------|----------------------------------------------------------------------------------------------|------------------|-----------------------------------------------------------------------|
| `brand_hit_rate(df)`     | Measures how often the target brand is mentioned in the responses                            | `float` (0–1)    | Only counts cases where the response includes the `TARGET_BRAND`     |
| `hallucination_rate(df)` | Measures the rate of "hallucination" — when the brand is mentioned but not in the prompt     | `float` (0–1)    | Only considers responses mentioning `TARGET_BRAND` when the prompt does **not** mention it |
| `top1_rate(df)`          | Measures whether the target brand is ranked number 1 (based on regex ranking list)           | `float` (0–1)    | Uses regex to extract the top-ranked brand from the response and compare with `TARGET_BRAND` |
| `response_diversity(df)` | Measures response diversity (BLEU-1 diversity score)                                          | `float` (0–1)    | Higher values → more diverse (less repetition)                       |
| `log_custom_metrics(run_id, df)` | Logs all metrics to an existing W&B run (resumes by run ID)                         | No return value (prints to stdout) | Uses `wandb.Api()` + `.summary[...] = ...` then `.update()`         |
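A toy example may make the difference between the first two metrics clearer. The rows below are invented for illustration, and the logic is re-implemented in plain Python rather than pandas; the `mentions` helper is hypothetical.

```python
TARGET = "h&m"

# Invented rows, for illustration only
rows = [
    {"prompt": "Rank Zara, H&M, and Primark.", "response": "1. Zara 2. H&M 3. Primark"},
    {"prompt": "Which brands are best in the UK?", "response": "H&M and Zara are popular."},
    {"prompt": "Which brands are best in the UK?", "response": "Burberry leads on luxury."},
]


def mentions(text):
    """Case-insensitive literal check for the target brand."""
    return TARGET in text.lower()


# Hit rate: the response mentions the brand at all
hit_rate = sum(mentions(r["response"]) for r in rows) / len(rows)
# Hallucination rate: the response mentions the brand although the prompt did not
halluc_rate = sum(mentions(r["response"]) and not mentions(r["prompt"]) for r in rows) / len(rows)

print(f"brand hit rate:     {hit_rate:.3f}")
print(f"hallucination rate: {halluc_rate:.3f}")
```

Two of the three responses mention the brand, but only one does so without being prompted, so the second metric is always less than or equal to the first.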

In [None]:
# Brand Hit Rate
def brand_hit_rate(df):
    # regex=False treats the brand as a literal string rather than a pattern
    hits = df["response"].str.lower().str.contains(TARGET_BRAND, regex=False).sum()
    return hits / len(df)

# Hallucination Rate
def hallucination_rate(df):
    def _h(row):
        p_has = TARGET_BRAND in row["prompt"].lower()
        a_has = TARGET_BRAND in row["response"].lower()
        return a_has and not p_has
    return df.apply(_h, axis=1).mean()

# Top-1 Rate (Regex Logic)
def top1_rate(df):
    pat = r"1[\.\)\- ]+([a-zA-Z& ]+)"
    first = df["response"].str.extract(pat, expand=False).str.lower()
    return (first == TARGET_BRAND).mean()

# Response Diversity via BLEU-1
def response_diversity(df):
    refs = df["response"].tolist()
    scores = []
    for i, r in enumerate(refs):
        others = refs[:i] + refs[i+1:]
        scores.append(sentence_bleu([o.split() for o in others], r.split(), weights=(1, 0, 0, 0)))
    return 1 - np.mean(scores)

# Log to W&B by resuming an existing run
def log_custom_metrics(run_id: str, df_eval: pd.DataFrame):
    """Log evaluation metrics to an existing W&B run (for panel display).

    `run_id` must be the W&B run ID of the fine-tuning run, not its display name.
    """
    # Calculate evaluation metrics
    metrics = {
        "eval/brand_hit_rate": brand_hit_rate(df_eval),
        "eval/hallucination_rate": hallucination_rate(df_eval),
        "eval/top1_accuracy": top1_rate(df_eval),
        "eval/response_diversity": response_diversity(df_eval),
    }

    # Resume the existing run; resume="must" fails if the ID does not exist
    wandb.init(
        project=wandb_project,  # W&B project name
        id=run_id,              # ID of the run to resume
        resume="must",
    )

    # Log the metrics to the resumed run
    wandb.log(metrics)

    # Finalize the run
    wandb.finish()

    # Confirm in console
    print(f"Logged metrics to W&B run: {run_id}")
    print("Metrics:", metrics)

## Load inference results (saved)

In [None]:
# Model weight path
full_run_tag = f"{RUN_TAG}-v{run_id}"
model_save_path = f"/content/drive/MyDrive/finetuned_models_shared/hm/{full_run_tag}"

In [None]:
results_path = f"{base_dir}/prompt_shared/after_finetuning_analysis/hm/olmo-results-v{run_id}.csv"

In [None]:
# Define the run name to link with wandb
run_name = f"olmo-hm-v{run_id}"  # run_id

In [None]:
# Load results first
df_results = pd.read_csv(results_path)

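# Log metrics to the existing fine-tuning run.
# The first argument should be that run's W&B ID (the last segment of its URL).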
log_custom_metrics(wandb_run_name, df_results)

Run history:
eval/brand_hit_rate,▁
eval/hallucination_rate,▁
eval/response_diversity,▁
eval/top1_accuracy,▁

Run summary:
eval/brand_hit_rate,0.93333
eval/hallucination_rate,0.1
eval/response_diversity,0.25646
eval/top1_accuracy,0.03333
total_flos,3.2763736674533376e+16
train/epoch,7.72
train/global_step,100.0
train/grad_norm,0.77075
train/learning_rate,0.0
train/loss,1.0445


Successfully logged metrics to: https://wandb.ai/thanchanok-ucl/olmo2-hm-config-compare/runs/cwkwn63l
Logged metrics: {'eval/brand_hit_rate': np.float64(0.9333333333333333), 'eval/hallucination_rate': np.float64(0.1), 'eval/top1_accuracy': np.float64(0.03333333333333333), 'eval/response_diversity': np.float64(0.25645546782366657)}
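
An alternative to resuming the run with `wandb.init` is to write the values into the run summary through W&B's public API (`wandb.Api().run(...)`, then `run.summary[...] = ...` and `run.summary.update()`). The sketch below assumes a run path of the form `entity/project/run_id`. The function name `update_run_summary` and its `api` parameter (which accepts a stand-in object for offline use) are illustrative and not part of the notebook.

```python
def update_run_summary(run_path, metrics, api=None):
    """Write metrics into the summary of an existing W&B run.

    run_path: "entity/project/run_id" (assumed path format for the public API).
    api: optional object exposing .run(path); defaults to wandb.Api().
    """
    if api is None:
        import wandb  # imported lazily so the helper can be defined without wandb

        api = wandb.Api()

    run = api.run(run_path)
    for key, value in metrics.items():
        # Cast NumPy scalars (e.g. np.float64) to plain Python floats
        run.summary[key] = float(value)
    run.summary.update()  # push the changed summary to the W&B server
    return run
```

For the run linked above, the path would be `thanchanok-ucl/olmo2-hm-config-compare/cwkwn63l`. This route only edits summary values; it does not add points to the run history.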


## 8.2 Run Model Inference with Thematic Prompts

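This section runs only the 6 thematic test prompts, which are used to monitor bias.

Do not load the model again here: a second load keeps two copies in VRAM and causes an out-of-memory (OOM) error. The table below traces what a second load would do.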

| # | Line (second run)                                           | What happens                                                                                    | Memory impact                                                             |
| - | ----------------------------------------------------------- | ----------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
| 1 | `tokenizer = AutoTokenizer.from_pretrained(...)`            | Downloads / re-creates the tokenizer object.                                                    | Negligible VRAM, but redundant.                                           |
| 2 | `base = AutoModelForCausalLM.from_pretrained(...)`          | Loads a full copy of the **OLMo-7B** model (\~4 GB in 4-bit) onto the GPU.                      | Major VRAM usage — second copy of the base model.                         |
| 3 | `model = PeftModel.from_pretrained(base, adapter_dir, ...)` | Wraps the base model with a LoRA adapter; temporary weight copies are made during construction. | Further increases VRAM during the wrap-up phase.                          |
| 4 | *(all subsequent `model.generate(...)` calls)*              | Generates text with both model copies resident in VRAM.                                         | Immediately exceeds the 15 GB Colab Pro limit and triggers the OOM error. |
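
The ~4 GB figure in row 2 follows from simple arithmetic on the parameter count. The sketch below estimates weight-only memory, assuming roughly 7 billion parameters and ignoring quantization constants, activations, and the KV cache, so real usage is higher. `weight_memory_gb` is an illustrative helper, not part of the notebook.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the model weights only, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


params = 7e9  # assumed parameter count for a 7B model
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(params, bits):.1f} GB")
```

By this estimate, two 4-bit copies hold about 7 GB in weights alone, before any generation buffers are allocated.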


In [None]:
df = pd.read_csv(f"{base_dir}/prompt_shared/baseline_analysis/thematic_prompt/hm/theme_prompt_hm_final.csv", encoding="cp1252")
prompts = df["prompt_text"].dropna().tolist()
results_path = f"{base_dir}/prompt_shared/after_finetuning_analysis/hm/olmo-results-v{run_id}-themes.csv"

In [None]:
prompts

['Which clothing brands in the UK offer the most eco-friendly materials without sacrificing style?',
 'What are the top UK-based affordable fashion brands with recycled fabric collections?',
 'Recommend fashion brands with strong ethical manufacturing policies for everyday wear.',
 'Which fashion brands are leaders in sustainability for casual apparel?',
 'Which fashion retail brand offers the deepest discounts on basics during major sales events?',
 'Which UK fashion retailers have the best Black Friday deals on sustainable wardrobe essentials?']

In [None]:
results_path

'/content/drive/MyDrive/prompt_shared/after_finetuning_analysis/hm/olmo-results-v8-themes.csv'

In [None]:
gc.collect()

1341

## Model Load Check to avoid OOM

- `assert 'model' in globals() and 'tokenizer' in globals()` does **not** raise an error  
  → This confirms that both `model` and `tokenizer` already exist in memory.

- `device = next(model.parameters()).device` does **not** raise an exception  
  → This confirms that the model is still alive and accessible on a valid device (e.g., GPU or CPU).

### Conclusion

The model is already loaded and usable.


## Load model and tokenizer

Run this step only if the session has closed and you want to continue testing with other prompts.
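
A minimal sketch of such a reload, assuming the adapter was saved with PEFT and the base model is loaded in 4-bit through bitsandbytes. The base model ID is a parameter because this section does not state it, and `load_if_missing` and `reload_with_adapter` are illustrative names. The guard reuses objects that already exist, so a second copy is never placed on the GPU.

```python
def load_if_missing(namespace, loader):
    """Return (model, tokenizer) from namespace, calling loader() only if absent."""
    if "model" in namespace and "tokenizer" in namespace:
        return namespace["model"], namespace["tokenizer"]
    model, tokenizer = loader()
    namespace["model"], namespace["tokenizer"] = model, tokenizer
    return model, tokenizer


def reload_with_adapter(base_model_id, adapter_dir):
    """Load a 4-bit base model and attach a saved LoRA adapter (needs a GPU)."""
    # Heavy imports stay inside the function so defining it costs nothing
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
    from peft import PeftModel

    quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    base = AutoModelForCausalLM.from_pretrained(
        base_model_id, quantization_config=quant, device_map="auto"
    )
    model = PeftModel.from_pretrained(base, adapter_dir)
    model.eval()  # inference only
    return model, tokenizer
```

In a notebook cell this could be called as `model, tokenizer = load_if_missing(globals(), lambda: reload_with_adapter(BASE_MODEL_ID, model_save_path))`, where `BASE_MODEL_ID` is a hypothetical variable and `model_save_path` is assumed to hold the saved adapter.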

## Test loaded model

In [None]:
# Make sure both `model` and `tokenizer` already exist in the notebook’s global scope.
# If either is missing, raise an AssertionError with a clear message.
assert 'model' in globals() and 'tokenizer' in globals(), (
    "You need to run the model-loading cell at least once before executing this cell."
)

# Identify the device (e.g. 'cuda:0', 'cpu') where the model’s parameters are located.
device = next(model.parameters()).device

# Create an empty list that will hold inference outputs, evaluation scores, or any
# other results you generate later in the workflow.
results = []

In [None]:
for i, prompt in enumerate(prompts, start=1):
    print(f"Generating {i}/{len(prompts)}")

    # Tokenize prompt and move to device:
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate response with configured settings:
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,  # Avoid lowering this: it saves memory but truncates the generated response
        temperature=0.7,      # Controls randomness: 0.0 = deterministic, 1.0 = highly random
        do_sample=True,       # Enables probabilistic sampling
        use_cache=True        # Uses KV-cache for speed
    )

    # Decode tokens to human-readable text:
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    results.append({"prompt": prompt, "response": text})

# Save results to CSV
pd.DataFrame(results).to_csv(results_path, index=False)
print("Done!")

Generating 1/6
Generating 2/6
Generating 3/6
Generating 4/6
Generating 5/6
Generating 6/6
Done!
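
`model.generate` on a causal LM returns the prompt tokens followed by the new tokens, so the decoded `text` above begins with the prompt itself. If only the continuation should be stored, the prompt can be sliced off before decoding. Below is a minimal sketch with plain lists standing in for token tensors. `continuation_ids` is an illustrative helper, and whether the saved responses should exclude the prompt is a choice this notebook does not state.

```python
def continuation_ids(output_ids, prompt_length):
    """Drop the first prompt_length token IDs, keeping only newly generated ones."""
    return output_ids[prompt_length:]


prompt_ids = [101, 2054, 2003]               # made-up token IDs for a prompt
output_ids = prompt_ids + [2023, 2003, 102]  # generate() output: prompt + new tokens
new_ids = continuation_ids(output_ids, len(prompt_ids))
print(new_ids)
```

In the loop above, the equivalent slice would be `outputs[0][inputs["input_ids"].shape[1]:]`, passed to `tokenizer.decode` in place of `outputs[0]`.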


## Free the loaded model without restarting the runtime

In [None]:
del model
del tokenizer
import gc
gc.collect()

import torch
torch.cuda.empty_cache()

-- End of the Notebook --