# RAG2: Fine-Tuning `msmarco-distilbert-cos-v5` with MultipleNegativesRankingLoss

<img src="https://raw.githubusercontent.com/CNUClasses/CPSC471/master/content/lectures/week13/stage1.png" alt="standard" style="max-height:400px;  margin:10px 0; vertical-align:middle;">

This notebook demonstrates how to **fine-tune a Hugging Face SentenceTransformer embedding model**
for an **asymmetric search / RAG retrieval** task using the **SentenceTransformers Trainer API**.

We will:

1. Load a CSV file `rag_train_dataset.csv` containing training pairs:
   - `question` — a user query
   - `chunk` — the correct / relevant text passage for that question
2. Wrap the CSV as a Hugging Face `datasets.Dataset` object.
3. Initialize the base model: `sentence-transformers/msmarco-distilbert-cos-v5`
4. Define a **MultipleNegativesRankingLoss (MNRL)** objective, which uses **in-batch negatives**.
5. Configure **SentenceTransformerTrainingArguments**.
6. Train the model using **SentenceTransformerTrainer**.
7. Save the fine-tuned model locally for later use in a RAG pipeline.

> Note: This notebook assumes you have `rag_train_dataset.csv` in the working directory
> with columns `question` and `chunk`. Adjust column names if your data differs.
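
Before training, it can help to confirm that the CSV really contains the two required columns. The following is a minimal sketch using only the standard library; `missing_columns` is a hypothetical helper introduced here for illustration, not part of any library used in this notebook.

```python
import csv
import io

REQUIRED_COLUMNS = ("question", "chunk")

def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # first row holds the column names
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]

# Example with an in-memory file; for the real data, pass
# open("rag_train_dataset.csv", newline="", encoding="utf-8") instead.
sample = io.StringIO("chunk_id,question,answer,chunk\n1,What is CNU?,A university,Some text\n")
print(missing_columns(sample))  # → []
```

An empty list means both columns are present; any names returned are the ones to rename or remap before loading.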


In [41]:
# ============================================================================
# (Optional) Install dependencies
# ============================================================================
# Uncomment and run this cell if you do NOT already have these libraries
# installed in your environment.
#
# - sentence-transformers >= 3.0 for the SentenceTransformerTrainer API
# - datasets for easy data handling
# - transformers / torch are pulled in as dependencies
# ----------------------------------------------------------------------------

# !pip install -U sentence-transformers datasets

In [42]:
# ============================================================================
# Select a free GPU (useful when several are available).
# `utils_gpu` is a local helper module, not a pip package.
# ============================================================================
import utils_gpu as ug
ug.get_free_gpu()

Using GPU(s): 2


'2'
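
`utils_gpu` is a local helper module whose source is not shown here. As a rough sketch of what such a helper might do (an assumption, not the actual implementation), one could ask `nvidia-smi` for per-GPU memory use, pick the least-loaded device, and expose only that device through `CUDA_VISIBLE_DEVICES`. All function names below are hypothetical.

```python
import os
import subprocess

def parse_gpu_memory(smi_output):
    """Parse 'index, memory.used' CSV lines into a list of (index, used_mib)."""
    usage = []
    for line in smi_output.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage.append((int(index), int(used)))
    return usage

def pick_free_gpu(usage):
    """Return the index (as a string) of the GPU with the least memory in use."""
    index, _ = min(usage, key=lambda pair: pair[1])
    return str(index)

def get_free_gpu():
    """Hypothetical stand-in for ug.get_free_gpu(); needs nvidia-smi on PATH."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    gpu = pick_free_gpu(parse_gpu_memory(out))
    # Must be set before CUDA is initialized for it to take effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu
    print(f"Using GPU(s): {gpu}")
    return gpu

# Parsing and selection can be checked without a GPU:
print(pick_free_gpu(parse_gpu_memory("0, 9000\n1, 4000\n2, 12\n")))  # → 2
```

If the real helper works this way, only one device is visible to PyTorch, which numbers it 0. That would be consistent with `torch.cuda.current_device()` returning `0` later in this notebook even though GPU 2 was selected.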

In [43]:
# ============================================================================
# Imports and Basic Configuration
# ============================================================================
# This section imports all the Python packages we need and sets up some
# basic configuration such as the model name, CSV path, and random seeds.
# ----------------------------------------------------------------------------
import utils as ut  # local helper module (provides get_triplet_evaluator)

import random

import torch
from datasets import load_dataset

from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    SentenceTransformerModelCardData,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers
from sentence_transformers.evaluation import TripletEvaluator

# Set a random seed for reproducibility (affects sampling / shuffling)
RANDOM_SEED = 42
random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

# Path to the local CSV file containing training data
# The CSV should contain at least these columns:
#   - 'question'
#   - 'chunk'
CSV_PATH = "rag_train_dataset.csv"

# Base embedding model for asymmetric search
BASE_MODEL_NAME = "sentence-transformers/msmarco-distilbert-cos-v5"

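# Directory where the fine-tuned model will be saved.
# The "/" in the model name yields a nested path:
#   ./models/sentence-transformers/msmarco-distilbert-cos-v5
OUTPUT_DIR = f"./models/{BASE_MODEL_NAME}"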

# Basic training hyperparameters (you can adjust these)
NUM_EPOCHS = 5
TRAIN_BATCH_SIZE = 64
EVAL_BATCH_SIZE = 32
LEARNING_RATE = 2e-5
WARMUP_RATIO = 0.1


In [44]:
# ============================================================================
# Load the Training Data from CSV
# ============================================================================
# We now load the CSV file `rag_train_dataset.csv` using Hugging Face Datasets.
# This makes it easy to work with the data in the SentenceTransformers Trainer.
#
# Expected columns:
#   - 'question' : the user query text
#   - 'chunk'    : the corresponding relevant passage
# ----------------------------------------------------------------------------

# Load dataset from CSV into a DatasetDict with a 'train' split
dataset = load_dataset(
    "csv",
    data_files={"train": CSV_PATH},
)

# Extract the 'train' split
pairs_dataset = dataset["train"]

# Inspect the dataset structure
print(pairs_dataset)
print("Column names:", pairs_dataset.column_names)

# Show a few examples to ensure the data is loaded correctly
for i in range(3):
    print(f"Example {i}:")
    print("  question:", pairs_dataset[i]["question"])
    print("  chunk   :", pairs_dataset[i]["chunk"])
    print("------")


Dataset({
    features: ['chunk_id', 'question', 'answer', 'source', 'metadata', 'chunk'],
    num_rows: 607
})
Column names: ['chunk_id', 'question', 'answer', 'source', 'metadata', 'chunk']
Example 0:
  question: What is Christopher Newport University known for?
  chunk   : 2
Christopher Newport University 2025-2026 
Welcome to Christopher Newport University 
At Christopher Newport, your college journey will be 
anything but ordinary. This is a place where you will be 
challenged, celebrated and supported from day one. With 
small classes, caring professors, and a stunning, modern, 
safe campus that feels like home, you can dream big and 
make friends for life. 
Be Part of Something Bold 
Our students come from across Virginia, the country, 
and the world, but they all share one thing: a drive to make a 
difference. CNU students form close bonds with professors, 
dive into grou

In [45]:
# ============================================================================
# Prepare Train / Eval Splits and Select Required Columns
# ============================================================================
# MultipleNegativesRankingLoss expects **pairs** of texts per example.
# Here we treat:
#   - 'question' as the anchor
#   - 'chunk'   as the positive
#
# We select only these two columns and create a train/validation split
# (e.g., 90% train, 10% validation).
# ----------------------------------------------------------------------------

# Keep only the 'question' and 'chunk' columns in that order.
# The SentenceTransformers Trainer will interpret this as (anchor, positive).
pairs_dataset = pairs_dataset.select_columns(["question", "chunk"])

# Create a train/validation split from the single 'train' split
split = pairs_dataset.train_test_split(
    test_size=0.1,
    seed=RANDOM_SEED
)
train_dataset = split["train"]
eval_dataset = split["test"]

print("Train size:", len(train_dataset))
print("Eval size :", len(eval_dataset))


Train size: 546
Eval size : 61
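
To make the in-batch negatives idea concrete, the sketch below computes an MNRL-style loss by hand in plain Python on toy 2-D vectors. For each anchor, every positive in the batch is a candidate; the loss is the cross-entropy over scaled cosine similarities, with the matching positive as the correct class. The scale of 20 mirrors the library's documented default. The function names and toy vectors are illustrative, and this is not the library implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mnrl_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # log-sum-exp, shifted by the max for numerical stability
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_norm - logits[i]
    return total / len(anchors)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive points the same way as its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive resembles the other anchor instead
print(mnrl_loss(anchors, aligned) < mnrl_loss(anchors, swapped))  # → True
```

Larger batches supply more negatives per anchor, which is one reason the batch size matters for this loss.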


In [46]:
# Index of the CUDA device PyTorch is currently using
torch.cuda.current_device()

0

In [47]:
# ============================================================================
# Initialize the SentenceTransformer Model and MNRL Loss
# ============================================================================
# We now load the base SentenceTransformer model and wrap it with optional
# model card metadata. Then we define the MultipleNegativesRankingLoss, which
# will be used by the trainer.
# ----------------------------------------------------------------------------
# Build the dev-set evaluator with the helper from the local `utils` module
dev_evaluator = ut.get_triplet_evaluator(eval_dataset)

# Initialize the SentenceTransformer model from the base checkpoint
model = SentenceTransformer(
    BASE_MODEL_NAME,
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="msmarco-distilbert-cos-v5 fine-tuned on RAG2 question-chunk pairs",
    ))

# Define the MultipleNegativesRankingLoss.
# This expects each training example to consist of two texts:
#   (anchor, positive)
# and uses other positives in the same batch as implicit negatives.
loss = MultipleNegativesRankingLoss(model)

# Evaluate the base model before training using the TripletEvaluator
print("Evaluating base model before fine-tuning...")
base_score = dev_evaluator(model)
print(base_score)
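# base_score is a dict of metrics, so index the metric before formatting:
# print(f"Baseline accuracy (TripletEvaluator): {base_score['rag2-mnrl-dev_cosine_accuracy']:.4f}")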


Evaluating base model before fine-tuning...
{'rag2-mnrl-dev_cosine_accuracy': 0.9508196711540222}
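
`ut.get_triplet_evaluator` comes from the local `utils` module and is not shown. Since the eval split only holds (question, chunk) pairs, one plausible approach (an assumption about the helper, not its actual code) is to pair each question with a randomly chosen chunk from a different row as the negative. `build_triplets` below is a hypothetical name.

```python
import random

def build_triplets(questions, chunks, seed=42):
    """Return (anchors, positives, negatives) with one random negative per row."""
    if len(questions) != len(chunks) or len(chunks) < 2:
        raise ValueError("need at least two aligned (question, chunk) rows")
    rng = random.Random(seed)
    anchors, positives, negatives = [], [], []
    for i, (question, chunk) in enumerate(zip(questions, chunks)):
        # Candidate negatives: chunks from other rows whose text differs
        candidates = [c for j, c in enumerate(chunks) if j != i and c != chunk]
        if not candidates:
            continue  # no usable negative for this row
        anchors.append(question)
        positives.append(chunk)
        negatives.append(rng.choice(candidates))
    return anchors, positives, negatives

anchors, positives, negatives = build_triplets(
    ["q1", "q2", "q3"], ["chunk A", "chunk B", "chunk C"]
)
print(all(n != p for n, p in zip(negatives, positives)))  # → True
```

The three lists could then be passed to `TripletEvaluator(anchors=..., positives=..., negatives=..., name="rag2-mnrl-dev")`; the name is inferred from the metric key printed above. The reported cosine accuracy is the fraction of triplets in which the anchor is more similar to the positive than to the negative.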


In [48]:
# ============================================================================
# Define SentenceTransformer Training Arguments
# ============================================================================
# SentenceTransformerTrainingArguments configures the training process:
#   - number of epochs
#   - batch sizes
#   - learning rate, warmup
#   - evaluation and saving strategy
#
# We also specify the NO_DUPLICATES batch sampler, which is recommended for
# MultipleNegativesRankingLoss: every other positive in a batch is treated as
# a negative, so a text that appeared twice would act as a false negative.
# ----------------------------------------------------------------------------

args = SentenceTransformerTrainingArguments(
    # Required: where to save training outputs and checkpoints
    output_dir=OUTPUT_DIR,

    # Core training hyperparameters
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    warmup_ratio=WARMUP_RATIO,

    # Mixed precision settings (set fp16=False if your GPU does not support it)
    fp16=True,
    bf16=False,

    # Batch sampler: NO_DUPLICATES is recommended for MNRL
    batch_sampler=BatchSamplers.NO_DUPLICATES,

    # Evaluation and checkpointing
    eval_strategy="steps",
    eval_steps=10,
    save_strategy="steps",
    save_steps=10,
    save_total_limit=2,  # keep only the 2 most recent checkpoints
    logging_steps=10,

    # Optional: a descriptive run name (helpful if using W&B or similar tools)
    run_name="msmarco-distilbert-rag2-mnrl",
)
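
The effect of a no-duplicates batch sampler can be illustrated with a small greedy sketch: a sample is deferred to a later batch whenever one of its texts already occurs in the current batch. This is a simplified illustration of the idea, not the library's sampler; `no_duplicate_batches` is a hypothetical name.

```python
def no_duplicate_batches(pairs, batch_size):
    """Greedily group (question, chunk) pairs so no text repeats within a batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    remaining = list(pairs)
    batches = []
    while remaining:
        batch, seen, deferred = [], set(), []
        for question, chunk in remaining:
            if len(batch) < batch_size and question not in seen and chunk not in seen:
                batch.append((question, chunk))
                seen.update((question, chunk))
            else:
                deferred.append((question, chunk))  # try again in a later batch
        batches.append(batch)
        remaining = deferred
    return batches

pairs = [("q1", "A"), ("q2", "A"), ("q3", "B"), ("q4", "C")]
for batch in no_duplicate_batches(pairs, batch_size=3):
    print(batch)
# → [('q1', 'A'), ('q3', 'B'), ('q4', 'C')]
# → [('q2', 'A')]
```

Here `q1` and `q2` share chunk `A`, so they end up in different batches; otherwise `A` would serve as both the positive and a negative for the same kind of question.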


In [49]:
# ============================================================================
# Create the SentenceTransformerTrainer and Start Training
# ============================================================================
# The SentenceTransformerTrainer encapsulates the training loop.
# We pass:
#   - model: SentenceTransformer instance
#   - args:  SentenceTransformerTrainingArguments
#   - train_dataset: HF Dataset with (question, chunk) pairs
#   - eval_dataset:  HF Dataset for evaluation
#   - loss: MultipleNegativesRankingLoss
#   - evaluator: TripletEvaluator for dev set performance
# ----------------------------------------------------------------------------

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=dev_evaluator,
)

# Start the training process
trainer.train()



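| Step | Training Loss | Validation Loss | Rag2-mnrl-dev Cosine Accuracy |
|------|---------------|-----------------|-------------------------------|
| 10   | 0.8094        | 0.328011        | 0.983607                      |
| 20   | 0.304         | 0.248493        | 0.983607                      |
| 30   | 0.1443        | 0.24384         | 0.983607                      |
| 40   | 0.1187        | 0.24128         | 0.983607                      |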


TrainOutput(global_step=45, training_loss=0.31487596962187026, metrics={'train_runtime': 11.1855, 'train_samples_per_second': 244.065, 'train_steps_per_second': 4.023, 'total_flos': 0.0, 'train_loss': 0.31487596962187026, 'epoch': 5.0})
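
The step counts in this output follow from the dataset and batch sizes. A short arithmetic check, assuming one optimizer step per batch (no gradient accumulation), a single device, a final partial batch that is kept, and warmup steps rounded up:

```python
import math

train_size = 546      # from the train/eval split
batch_size = 64       # TRAIN_BATCH_SIZE
epochs = 5            # NUM_EPOCHS
warmup_ratio = 0.1    # WARMUP_RATIO
eval_every = 10       # eval_steps

steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumes rounding up
eval_points = list(range(eval_every, total_steps + 1, eval_every))

print(steps_per_epoch, total_steps, warmup_steps)  # → 9 45 5
print(eval_points)                                 # → [10, 20, 30, 40]
```

The total of 45 matches `global_step=45`, and the four evaluation points match the four rows of the training log.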

In [50]:
# ============================================================================
# Save the Fine-Tuned Model Locally
# ============================================================================
# After training completes, we save the final model to disk.
# This folder can later be loaded with SentenceTransformer(...)
# or used as a Hugging Face model checkpoint path.
# ----------------------------------------------------------------------------
import os
# final_model_path = os.path.join(OUTPUT_DIR, "final")
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Save the fine-tuned model
model.save_pretrained(OUTPUT_DIR)

print(f"Fine-tuned model saved to: {OUTPUT_DIR}")


Fine-tuned model saved to: ./models/sentence-transformers/msmarco-distilbert-cos-v5
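
In a RAG pipeline, the saved folder would be reloaded with `SentenceTransformer(OUTPUT_DIR)` and used to embed both the query and the candidate chunks; retrieval then reduces to ranking chunks by cosine similarity. The sketch below shows only the ranking step on toy vectors in plain Python. `rank_chunks` is a hypothetical helper, and in practice the vectors would come from `model.encode(...)`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_chunks(query_vec, chunk_vecs, top_k=3):
    """Return (chunk_index, similarity) for the top_k most similar chunks."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Toy embeddings standing in for model.encode(...) output
query = [1.0, 0.0]
chunks = [[0.0, 1.0], [0.8, 0.6], [1.0, 0.1]]
print([i for i, _ in rank_chunks(query, chunks, top_k=2)])  # → [2, 1]
```

The returned indices point back into the chunk list, so the corresponding passages can be placed in the prompt for the generator.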
