# RAG5: Fine-Tuning `msmarco-distilbert-cos-v5` with Triplet loss
<img src="https://raw.githubusercontent.com/CNUClasses/CPSC471/master/content/lectures/week13/stage4.png" alt="standard" style="max-height:300px;  margin:10px 0; vertical-align:middle;">
This notebook demonstrates how to **fine-tune a Hugging Face SentenceTransformer embedding model**
for an **asymmetric search / RAG retrieval** task using the **SentenceTransformers Trainer API**.

We will:

1. Load a CSV file `rag_train_with_hard_negatives_triplets.csv` containing training triplets:
   - `question`: a user query
   - `chunk`: the correct / relevant text passage for that question
   - `hard_negative_chunk`: a passage that looks relevant to the question but is not the correct one
2. Wrap the CSV as a Hugging Face `datasets.Dataset` object.
3. Initialize the base model: `sentence-transformers/msmarco-distilbert-cos-v5`
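4. Define the training loss over the triplets (the code uses `MultipleNegativesRankingLoss`; a `TripletLoss` alternative is left commented out).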
5. Configure **SentenceTransformerTrainingArguments**.
6. Train the model using **SentenceTransformerTrainer**.
7. Save the fine-tuned model locally for later use in a RAG pipeline.

> Note: This notebook assumes `rag_train_with_hard_negatives_triplets.csv` is in the working directory
> with columns `question`, `chunk`, and `hard_negative_chunk`. Adjust the column names if your data differs.
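
Before training, it can help to confirm that the CSV header really contains the three expected columns. The sketch below is not part of the original notebook: the helper name `missing_columns` and the sample row are made up for illustration, and it uses only the standard library.

```python
import csv
import io

REQUIRED_COLUMNS = ("question", "chunk", "hard_negative_chunk")

def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [name for name in required if name not in header]

# Tiny in-memory stand-in for the real CSV (the row content is invented).
sample_csv = io.StringIO(
    "question,chunk,hard_negative_chunk\n"
    '"What is RAG?","RAG retrieves passages before generating.",'
    '"Retrieval models rank passages by similarity."\n'
)
print(missing_columns(sample_csv))  # prints []
```

For the real file, pass `open(CSV_PATH, newline="", encoding="utf-8")` instead of the in-memory sample.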


In [60]:
# ============================================================================
# (Optional) Install dependencies
# ============================================================================
# Uncomment and run this cell if you do NOT already have these libraries
# installed in your environment.
#
# - sentence-transformers >= 3.x for SentenceTransformerTrainer API
# - datasets for easy data handling
# - transformers / torch are pulled in as dependencies
# ----------------------------------------------------------------------------

# !pip install -U sentence-transformers datasets

In [61]:
import utils_gpu as ug
ug.get_free_gpu()

Using GPU(s): 1


'1'

In [62]:
# ============================================================================
# Imports and Basic Configuration
# ============================================================================
# This section imports all the Python packages we need and sets up some
# basic configuration such as the model name, CSV path, and random seeds.
# ----------------------------------------------------------------------------

import torch
# torch.cuda.device_count() 
# torch.cuda.is_available()
# torch.set_default_device('cuda:2')

# Force PyTorch to use only CUDA device 2
# CUDA_DEVICE = 2
# torch.cuda.set_device(CUDA_DEVICE)
# assert torch.cuda.current_device() == CUDA_DEVICE
# print(f"Using CUDA device {CUDA_DEVICE}:", torch.cuda.get_device_name(CUDA_DEVICE))
# # Force SentenceTransformers to use the specified CUDA device
# os.environ["CUDA_VISIBLE_DEVICES"] = f"{CUDA_DEVICE}"
import random
from typing import List, Tuple
import pandas as pd

from datasets import load_dataset

from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    SentenceTransformerModelCardData,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers
from sentence_transformers.evaluation import TripletEvaluator

# Set a random seed for reproducibility (affects sampling / shuffling)
RANDOM_SEED = 42
random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

# Path to the local CSV file containing training data
# The CSV should contain at least these columns:
#   - 'question'
#   - 'chunk'
#   - 'hard_negative_chunk'
CSV_PATH = "rag_train_with_hard_negatives_triplets.csv"

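# Local directory of the model we start from (loaded from disk, not from the Hub)
FINETUNED_MODEL_PATH  = "./models/sentence-transformers/msmarco-distilbert-cos-v5"
# Cleaner save location (defined here but not used below)
SAVE_PATH="models/sentence-transformers/msmarco-distilbert-cos-v5-triplet"
# Directory where checkpoints and the fine-tuned model are saved.
# NOTE: FINETUNED_MODEL_PATH already begins with "./models/", so this expands to
# "models/./models/sentence-transformers/msmarco-distilbert-cos-v5-triplet".
OUTPUT_DIR = f"models/{FINETUNED_MODEL_PATH}-triplet"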

# Basic training hyperparameters (you can adjust these)
NUM_EPOCHS = 4
TRAIN_BATCH_SIZE = 64
EVAL_BATCH_SIZE = 32
LEARNING_RATE = 2e-5
WARMUP_RATIO = 0.1

# Device configuration: use GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)



Using device: cuda


In [63]:
# ============================================================================
# Load the Training Data from CSV
# ============================================================================
# We now load the CSV file at CSV_PATH using Hugging Face Datasets.
# This makes it easy to work with the data in the SentenceTransformers Trainer.
#
# Expected columns:
#   - 'question'            : the user query text
#   - 'chunk'               : the corresponding relevant passage
#   - 'hard_negative_chunk' : a similar-looking but incorrect passage
# ----------------------------------------------------------------------------

# Load dataset from CSV into a DatasetDict with a 'train' split
dataset = load_dataset("csv", data_files=CSV_PATH)

# Inspect the dataset structure
# print(dataset)
# print("Column names:", dataset["train"].column_names)
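
The evaluator cell further down refers to an `eval_dataset` that this notebook never creates. One way to get a hold-out split is sketched below in plain Python; the helper `split_rows` and the toy rows are hypothetical. With Hugging Face Datasets, `dataset["train"].train_test_split(test_size=0.1, seed=RANDOM_SEED)` should give an equivalent split.

```python
import random

def split_rows(rows, eval_fraction=0.1, seed=42):
    """Shuffle a copy of `rows` and split it into (train_rows, eval_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Always keep at least one row for evaluation.
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

# Toy rows with the same three columns as the training CSV.
rows = [
    {"question": f"q{i}", "chunk": f"c{i}", "hard_negative_chunk": f"n{i}"}
    for i in range(20)
]
train_rows, eval_rows = split_rows(rows)
print(len(train_rows), len(eval_rows))  # prints: 18 2
```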



In [64]:
# ============================================================================
# Build a Simple TripletEvaluator for Validation
# ============================================================================
# For evaluation, we can use a TripletEvaluator from SentenceTransformers.
# It expects three parallel lists:
#   - anchors   (e.g., questions)
#   - positives (e.g., their matching chunks)
#   - negatives (e.g., the hard negative chunks)
#
# Here we build the triplets directly from the dataset rows:
#   - each (question, chunk) pair becomes (anchor, positive)
#   - the row's `hard_negative_chunk` becomes the negative
#
# This gives a rough "does anchor-positive score higher than anchor-negative"
# accuracy metric for monitoring training.
#
# NOTE: this cell is commented out because no `eval_dataset` split is created
# in this notebook; training below runs without an evaluator.
# ----------------------------------------------------------------------------

# def get_triplets(ds, max_samples: int = 500) -> Tuple[List[str], List[str], List[str]]:
#     """Create triplets (anchor, positive, negative) from a triplets dataset.

#     Args:
#         ds: A Hugging Face Dataset with columns ['question', 'chunk', 'hard_negative_chunk'].
#         max_samples: Maximum number of triplets to create (for efficiency).

#     Returns:
#         anchors: List of anchor strings (questions).
#         positives: List of positive strings (true chunks).
#         negatives: List of negative strings (hard negative chunks).
#     """
#     # Ensure we do not sample more than the dataset size
#     max_samples = min(max_samples, len(ds))

#     anchors: List[str] = []
#     positives: List[str] = []
#     negatives: List[str] = []

#     for i in range(max_samples):
#         anchor = ds[i]["question"]
#         positive = ds[i]["chunk"]
#         negative = ds[i]["hard_negative_chunk"]

#         anchors.append(anchor)
#         positives.append(positive)
#         negatives.append(negative)

#     return anchors, positives, negatives

# # Build triplets from the eval split for use in the TripletEvaluator
# eval_anchors, eval_positives, eval_negatives = get_triplets(
#     eval_dataset,
#     max_samples=500
# )

# # Create the TripletEvaluator
# dev_evaluator = TripletEvaluator(
#     anchors=eval_anchors,
#     positives=eval_positives,
#     negatives=eval_negatives,
#     name="rag2-mnrl-dev",
# )

# # Quick baseline evaluation before fine-tuning
# print("Baseline (not yet fine-tuned) model performance will be computed after model initialization.")
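
The metric described in this cell can be sketched without any model. The code below is a simplified illustration, not the `TripletEvaluator` implementation: it uses cosine similarity on made-up 2-D vectors, and the helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def triplet_accuracy(anchors, positives, negatives):
    """Fraction of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    hits = sum(
        cosine(a, p) > cosine(a, n)
        for a, p, n in zip(anchors, positives, negatives)
    )
    return hits / len(anchors)

# Toy 2-D "embeddings": the first triplet is ranked correctly, the second is not.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.7, 0.3]]
negatives = [[0.1, 0.9], [0.2, 0.8]]
print(triplet_accuracy(anchors, positives, negatives))  # prints 0.5
```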


In [65]:
# ============================================================================
# Load the Fine-Tuned Embedding Model
# ============================================================================
# We now load the fine-tuned SentenceTransformer model from the specified
# local directory. 
# ----------------------------------------------------------------------------
from sentence_transformers import losses  # SentenceTransformer and the Trainer are already imported above
# Load the model from the local path
model = SentenceTransformer(FINETUNED_MODEL_PATH)
model.to(device)

# Define the MultipleNegativesRankingLoss (MNRL).
# With (anchor, positive, negative) triplets, each anchor is scored against
# its own positive, its hard negative, and every other positive and negative
# in the batch, which all act as negatives.
# Alternative (pure triplet margin loss), left commented out:
# loss = losses.TripletLoss(model=model, distance_metric=losses.TripletDistanceMetric.COSINE, triplet_margin=0.5)
loss = losses.MultipleNegativesRankingLoss(model=model)


# Evaluate the base model before training using the TripletEvaluator
# print("Evaluating base model before fine-tuning...")
# base_score = dev_evaluator(model)
# print(base_score)
# print(f"Baseline accuracy (TripletEvaluator): {base_score:.4f}")
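
To make the difference between the two losses in this cell concrete, here is a small pure-Python sketch of both objectives on toy vectors. It is a simplified illustration, not the library code: the helper names are hypothetical, `scale=20.0` is assumed to match the MNRL default, and `margin=0.5` mirrors the commented-out `TripletLoss` line.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(d(a, p) - d(a, n) + margin, 0) with cosine distance d = 1 - cos."""
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)

def mnrl_loss(anchors, positives, negatives, scale=20.0):
    """Mean cross-entropy where anchor i must pick positive i among all
    positives and all hard negatives in the batch."""
    candidates = positives + negatives
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [scale * cosine(a, c) for c in candidates]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# Toy batch of two triplets (made-up 2-D vectors).
anchors = [[1.0, 0.0], [0.0, 1.0]]
good_pos = [[0.9, 0.1], [0.1, 0.9]]   # positives close to their anchors
bad_pos = [[0.1, 0.9], [0.9, 0.1]]    # positives swapped: far from their anchors
hard_neg = [[0.6, 0.4], [0.4, 0.6]]

print(triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))  # prints 0.0
print(mnrl_loss(anchors, good_pos, hard_neg) < mnrl_loss(anchors, bad_pos, hard_neg))  # prints True
```

The triplet loss only looks at one negative per anchor and stops contributing once the margin is satisfied, whereas the MNRL-style objective compares each anchor against every candidate in the batch.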


In [66]:
# ============================================================================
# Define SentenceTransformer Training Arguments
# ============================================================================
# SentenceTransformerTrainingArguments configures the training process:
#   - number of epochs
#   - batch sizes
#   - learning rate, warmup
#   - evaluation and saving strategy
#
# We also specify NO_DUPLICATES batch sampler, which is beneficial for
# MultipleNegativesRankingLoss (we want each text to appear at most once
# per batch so it's a valid negative for others).
# ----------------------------------------------------------------------------

args = SentenceTransformerTrainingArguments(
    # Required: where to save training outputs and checkpoints
    output_dir=OUTPUT_DIR,

    # Core training hyperparameters
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    warmup_ratio=WARMUP_RATIO,

    # Mixed precision settings (set fp16=False if your GPU does not support it)
    fp16=True,
    bf16=False,

    # Batch sampler: NO_DUPLICATES is recommended for MNRL
    batch_sampler=BatchSamplers.NO_DUPLICATES,

    # Evaluation and checkpointing
    # eval_strategy="steps",    # evaluate every `eval_steps`
    # eval_steps=20,
    save_strategy="steps",    # save checkpoint every `save_steps`
    save_steps=20,
    save_total_limit=2,       # keep only the 2 most recent checkpoints
    logging_steps=100,

    # Optional: a descriptive run name (helpful if using W&B or similar tools)
    run_name="msmarco-distilbert-rag2-triplet",
  
)


In [67]:
# ============================================================================
# Create the SentenceTransformerTrainer and Start Training
# ============================================================================
# The SentenceTransformerTrainer encapsulates the training loop.
# We pass:
#   - model: SentenceTransformer instance
#   - args:  SentenceTransformerTrainingArguments
#   - train_dataset: HF dataset with (question, chunk, hard_negative_chunk) triplets;
#                    the columns are consumed in order as (anchor, positive, negative)
#   - loss: MultipleNegativesRankingLoss
#   - eval_dataset / evaluator: left commented out (no eval split in this notebook)
# ----------------------------------------------------------------------------

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    # eval_dataset=eval_dataset,
    loss=loss,
    # evaluator=dev_evaluator,
)

# Start the training process
trainer.train()



Step,Training Loss
100,0.1416


TrainOutput(global_step=192, training_loss=0.07845571512977283, metrics={'train_runtime': 51.0896, 'train_samples_per_second': 237.622, 'train_steps_per_second': 3.758, 'total_flos': 0.0, 'train_loss': 0.07845571512977283, 'epoch': 4.0})
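
The numbers in this output can be cross-checked against the training arguments with some quick arithmetic. The sketch below makes several assumptions: a single GPU, no gradient accumulation, and a dataset of about 3,035 rows. The row count is not stated anywhere in the notebook; it is inferred from `train_samples_per_second * train_runtime / epochs`. Rounding the warmup step count up is also an assumption.

```python
import math

NUM_EPOCHS = 4
TRAIN_BATCH_SIZE = 64
WARMUP_RATIO = 0.1
SAVE_STEPS = 20
SAVE_TOTAL_LIMIT = 2
LOGGING_STEPS = 100
NUM_ROWS = 3035  # assumption: inferred from the reported throughput, not given in the notebook

steps_per_epoch = math.ceil(NUM_ROWS / TRAIN_BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)  # assumes rounding up

# Steps at which a checkpoint is written, and which ones survive save_total_limit.
checkpoint_steps = list(range(SAVE_STEPS, total_steps + 1, SAVE_STEPS))
kept_checkpoints = checkpoint_steps[-SAVE_TOTAL_LIMIT:]

# Steps at which the training loss is logged.
logged_steps = list(range(LOGGING_STEPS, total_steps + 1, LOGGING_STEPS))

print(steps_per_epoch, total_steps, warmup_steps)  # prints: 48 192 20
print(kept_checkpoints)                            # prints: [160, 180]
print(logged_steps)                                # prints: [100]
```

Under these assumptions the total of 192 steps matches `global_step=192`, and `logging_steps=100` explains why the loss table has a single row at step 100.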

In [68]:
# ============================================================================
# Save the Fine-Tuned Model Locally
# ============================================================================
# After training completes, we save the final model to disk.
# This folder can later be loaded with SentenceTransformer(...)
# or used as a Hugging Face model checkpoint path.
# ----------------------------------------------------------------------------
import os
# final_model_path = os.path.join(OUTPUT_DIR, "final")
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Save the fine-tuned model
model.save_pretrained(OUTPUT_DIR)

print(f"Fine-tuned model saved to: {OUTPUT_DIR}")


Fine-tuned model saved to: models/./models/sentence-transformers/msmarco-distilbert-cos-v5-triplet
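
To use the saved model in a RAG pipeline, the usual pattern is to embed the query and the chunks, then rank chunks by similarity. The sketch below is only an outline under assumptions: `rank_chunks` and `embed` are hypothetical helpers, the toy vectors stand in for real embeddings, and `embed` needs `sentence-transformers` installed and a valid model directory (it is defined but not called here).

```python
def rank_chunks(query_vec, chunk_vecs, top_k=3):
    """Return (index, score) pairs for the top_k chunks by dot product.

    Assumes all vectors are L2-normalized, so the dot product equals cosine similarity.
    """
    scores = [sum(q * c for q, c in zip(query_vec, vec)) for vec in chunk_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]

def embed(model_dir, texts):
    """Encode texts with a saved SentenceTransformer model (imported lazily)."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_dir)
    return model.encode(texts, normalize_embeddings=True).tolist()

# Toy normalized vectors standing in for real query / chunk embeddings.
print(rank_chunks([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]], top_k=2))
# prints [(1, 1.0), (2, 0.6)]
```

With the real model, something like `rank_chunks(embed(OUTPUT_DIR, [query])[0], embed(OUTPUT_DIR, chunks))` would return the best-matching chunk indices.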
