# MLB Project - Fine-Tuning Embedding Models for RAG

## Project Overview

Welcome to the Embedding Fine-Tuning Project! In this project, you'll learn how to **fine-tune an embedding model** to improve retrieval performance in RAG (Retrieval-Augmented Generation) applications.

### What are Embedding Models?
Embedding models convert text into dense vector representations (embeddings) that capture semantic meaning. They're crucial for:
- **Semantic search**: Finding similar documents based on meaning, not just keywords
- **RAG systems**: Retrieving relevant context to answer questions
- **Clustering**: Grouping similar documents together
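
To make "similar by meaning" concrete, here is a minimal sketch of how embeddings are compared. The 4-dimensional vectors and document names are hypothetical stand-ins for real model output, which has hundreds of dimensions. Cosine similarity is the measure assumed here, since it is the one used later in this project.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any real model).
query = [0.9, 0.1, 0.0, 0.3]
docs = {
    "revenue_doc": [0.8, 0.2, 0.1, 0.4],
    "weather_doc": [0.0, 0.9, 0.7, 0.0],
}

# Rank documents by similarity to the query, most similar first.
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)
```

Semantic search and RAG retrieval both reduce to this ranking step, applied to a much larger corpus.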

### Why Fine-Tune Embeddings?
Pre-trained embedding models are trained on broad, general-purpose text, so they can miss the vocabulary and phrasing of a specific domain. Fine-tuning on your own data can substantially improve retrieval performance.

### What is Matryoshka Representation Learning?
Matryoshka embeddings can be truncated to smaller dimensions (768 → 512 → 256 → 128 → 64) without significant performance loss. This allows:
- **3x less storage** (e.g., truncating 768 → 256) while keeping ~99% performance
- **Faster search** with smaller dimensions
- **Flexibility** to choose speed vs. accuracy trade-offs
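
The truncation idea above can be sketched in a few lines. This is an illustration under assumptions, not the library's implementation: the embedding values are synthetic, the helper `truncate_and_normalize` is a hypothetical name, and storage is estimated assuming 4-byte float32 values.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

FULL_DIM = 768
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32

# Synthetic values standing in for a real 768-dimensional model output.
embedding = [math.sin(i + 1) for i in range(FULL_DIM)]

for dim in [768, 512, 256, 128, 64]:
    vec = truncate_and_normalize(embedding, dim)
    size_bytes = len(vec) * BYTES_PER_FLOAT32
    print(f"dim {dim:>3}: {size_bytes:>4} bytes per vector, "
          f"{FULL_DIM / dim:.1f}x smaller than the full embedding")
```

Re-normalizing after truncation keeps the shortened vectors at unit length, which matters if similarity is computed with a plain dot product rather than cosine similarity.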

### What You'll Learn
- How to prepare datasets for embedding fine-tuning
- How to evaluate embedding models using information retrieval metrics
- How to implement Matryoshka Representation Learning
- How to fine-tune embedding models with Sentence Transformers
- How to compare performance across different embedding dimensions

### Project Structure
1. Setup and imports
2. Load and prepare the dataset
3. Create baseline evaluation
4. Define loss function with Matryoshka representation
5. Fine-tune the embedding model
6. Evaluate and compare results

### Dataset: Financial Q&A Pairs
We'll use a dataset of financial questions and answers from NVIDIA's SEC filings to create a domain-specific embedding model.

---

## Step 1: Setup and Imports

First, let's install the required libraries and import them.

**Note**: This project requires a GPU for efficient training. Make sure you're running on a GPU-enabled environment (like Colab with GPU runtime).

In [5]:
print("Fixing package dependencies...")
import subprocess
import sys

try:
    # Fix fsspec version to match gcsfs requirement
    subprocess.check_call([
        sys.executable, "-m", "pip",
        "install", "--upgrade", "fsspec==2025.3.0", "-q"
    ])
    print("✓ Dependencies fixed!")
except Exception as e:
    print(f"Note: Could not auto-fix dependencies: {e}")
    print("This is fine - the notebook will still work!")

Fixing package dependencies...
✓ Dependencies fixed!


In [6]:
# Install required packages (run this cell first!)
!pip install "torch" tensorboard -q
!pip install --upgrade "sentence-transformers>=3" "datasets==2.19.1" "transformers==4.41.2" -q

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2025.3.0 requires fsspec==2025.3.0, but you have fsspec 2024.3.1 which is incompatible.

In [2]:
# Import all necessary libraries
import os
import torch
import warnings
from datasets import load_dataset, concatenate_datasets
from sentence_transformers import SentenceTransformer, SentenceTransformerModelCardData, SentenceTransformerTrainer
from sentence_transformers.evaluation import InformationRetrievalEvaluator, SequentialEvaluator
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
from sentence_transformers.util import cos_sim
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

# Suppress unnecessary warnings
warnings.filterwarnings("ignore")
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print("All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

RuntimeError: Failed to import transformers.trainer because of the following error (look up to see its traceback):
cannot import name 'EncoderDecoderCache' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)

## Step 2: Configuration

Let's set up our model configuration and Matryoshka dimensions.

In [None]:
# Model configuration
MODEL_ID = "BAAI/bge-base-en-v1.5"  # Strong base embedding model (109M params, 768 dim)
OUTPUT_DIR = "bge-base-financial-matryoshka"  # Output directory for fine-tuned model

# Matryoshka dimensions (from large to small - this order is important!)
MATRYOSHKA_DIMENSIONS = [768, 512, 256, 128, 64]

# Training configuration
NUM_EPOCHS = 4
BATCH_SIZE = 32
GRADIENT_ACCUMULATION_STEPS = 16  # Effective batch size = 32 * 16 = 512
LEARNING_RATE = 2e-5

print(f"Base Model: {MODEL_ID}")
print(f"Output Directory: {OUTPUT_DIR}")
print(f"Matryoshka Dimensions: {MATRYOSHKA_DIMENSIONS}")
print(f"Training Epochs: {NUM_EPOCHS}")
print(f"Batch Size: {BATCH_SIZE}")
print(f"Effective Batch Size: {BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")

## Step 3: Load and Prepare the Dataset

We'll use a financial Q&A dataset with question-context pairs from NVIDIA's SEC filings.

### 3.1: Load the Dataset

**TODO**: Load the financial embedding dataset and prepare it for training.

**Dataset format**: Each sample has a `question` and `context` field.

In [None]:
print("Loading dataset...")

# TODO: Load the dataset from Hugging Face
dataset = None  # Replace None with your code

print(f"Dataset loaded with {len(dataset)} samples")
print(f"\nDataset structure: {dataset}")
print(f"\nDataset features: {dataset.features}")

### 3.2: Explore Sample Data

Let's look at some examples to understand the question-context pairs.

In [None]:
# Display 3 sample question-context pairs
print("Sample question-context pairs:\n")
for i in range(3):
    sample = dataset[i]
    print(f"Example {i+1}:")
    print(f"Question: {sample['question']}")
    print(f"Context: {sample['context'][:200]}...")  # Show first 200 chars
    print("-" * 80)
    print()

### 3.3: Prepare Dataset for Sentence Transformers

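For pair-based losses such as `MultipleNegativesRankingLoss`, Sentence Transformers reads the input columns in order: first the query, then the relevant document. By convention these are named `anchor` (query) and `positive` (relevant document).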

**TODO**: Rename columns and add IDs to the dataset.

In [None]:
# TODO: Rename 'question' column to 'anchor'
dataset = None  # Replace None with your code

# TODO: Rename 'context' column to 'positive'
dataset = None  # Replace None with your code

# TODO: Add an 'id' column with sequential numbers
dataset = None  # Replace None with your code

print("Dataset prepared!")
print(f"\nUpdated columns: {dataset.column_names}")

### 3.4: Split the Dataset

**TODO**: Split the dataset into train (90%) and test (10%) sets.

In [None]:
# TODO: Split dataset into train (90%) and test (10%)
dataset = None  # Replace None with your code

print("Dataset split complete!")
print(f"\nDataset splits:")
print(f"Training: {len(dataset['train'])} samples")
print(f"Test: {len(dataset['test'])} samples")

# Save datasets to disk for later use
dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")
print("\nDatasets saved to disk!")

## Step 4: Create Baseline Evaluation

Before fine-tuning, let's evaluate the pre-trained model to establish a baseline.

### Understanding Evaluation Metrics

We'll use **NDCG@10** (Normalized Discounted Cumulative Gain at 10):
- Measures ranking quality (0 to 1, higher is better)
- Takes into account the position of relevant documents
- @10 means we look at the top 10 retrieved documents
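
As a worked example of the metric, the following sketch computes NDCG@k with binary relevance, which matches our setup of one relevant document per query. The function name `ndcg_at_k` and the document IDs are hypothetical; the evaluator used later computes this for you.

```python
import math

def ndcg_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """NDCG@k with binary relevance (a document is either relevant or not)."""
    relevant = set(relevant_doc_ids)
    # DCG: each relevant hit is discounted by log2(rank + 1), with rank starting at 1.
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1)
        if doc_id in relevant
    )
    # Ideal DCG: all relevant documents placed at the top of the ranking.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

top_hit = ndcg_at_k(["d7", "d2", "d9"], ["d7"])     # relevant doc at rank 1
second_hit = ndcg_at_k(["d2", "d7", "d9"], ["d7"])  # relevant doc at rank 2
miss = ndcg_at_k(["d2", "d9", "d4"], ["d7"])        # relevant doc not retrieved
print(top_hit, second_hit, miss)
```

With a single relevant document, the score is 1.0 when it is ranked first, about 0.63 when it is ranked second, and 0.0 when it does not appear in the top 10.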

### 4.1: Load the Pre-trained Model

**TODO**: Load the base embedding model for evaluation.

In [None]:
print("Loading pre-trained model...")

# TODO: Load the SentenceTransformer model
model = None  # Replace None with your code

print(f"Model loaded: {MODEL_ID}")
print(f"   Embedding dimension: {model.get_sentence_embedding_dimension()}")

### 4.2: Prepare Evaluation Data

For evaluation, we need:
- **Corpus**: All documents that could be retrieved (train + test)
- **Queries**: Questions from the test set
- **Relevant docs**: Mapping of which document is relevant for each query
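
The shape of these three structures can be sketched with plain Python dictionaries. The records below are hypothetical placeholders, not rows from the real dataset, and assume the `id`, `anchor`, and `positive` columns prepared in Step 3.

```python
# Hypothetical records shaped like the prepared dataset (id, anchor, positive).
train_records = [
    {"id": 0, "anchor": "Example question A?", "positive": "Passage answering A."},
    {"id": 1, "anchor": "Example question B?", "positive": "Passage answering B."},
]
test_records = [
    {"id": 2, "anchor": "Example question C?", "positive": "Passage answering C."},
]

# Corpus: every passage from train and test, keyed by id.
all_records = train_records + test_records
corpus = {r["id"]: r["positive"] for r in all_records}

# Queries: only the test-set questions, keyed by the same id.
queries = {r["id"]: r["anchor"] for r in test_records}

# Each query's relevant passage shares its id.
relevant_docs = {q_id: [q_id] for q_id in queries}

print(len(corpus), len(queries), relevant_docs)
```

Including the training passages in the corpus makes the retrieval task harder and more realistic, because each test query must find its passage among all documents rather than only the test ones.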

**TODO**: Load the datasets and create the evaluation dictionaries.

In [None]:
# TODO: Load test dataset from JSON file
test_dataset = None  # Replace None with your code

# TODO: Load train dataset from JSON file
train_dataset = None  # Replace None with your code

# TODO: Concatenate train and test to create full corpus
corpus_dataset = None  # Replace None with your code

# TODO: Create corpus dictionary (id -> document text)
corpus = None  # Replace None with your code

# TODO: Create queries dictionary (id -> question text)
queries = None  # Replace None with your code

# Create relevant docs mapping (each query's relevant doc has the same ID)
relevant_docs = {}
for q_id in queries:
    relevant_docs[q_id] = [q_id]  # The relevant doc ID matches the query ID

print("Evaluation data prepared!")
print(f"   Corpus size: {len(corpus)} documents")
print(f"   Queries: {len(queries)} questions")

### 4.3: Create Matryoshka Evaluators

We'll create an evaluator for each dimension to see how performance changes with embedding size.

**TODO**: Create evaluators for each Matryoshka dimension.

In [None]:
matryoshka_evaluators = []

# TODO: Create an InformationRetrievalEvaluator for each dimension
for dim in MATRYOSHKA_DIMENSIONS:
    # TODO: Create evaluator for this dimension
    ir_evaluator = None  # Replace None with your code
    matryoshka_evaluators.append(ir_evaluator)

# TODO: Create a sequential evaluator that runs all evaluators
evaluator = None  # Replace None with your code

print(f"Created evaluators for {len(MATRYOSHKA_DIMENSIONS)} dimensions!")

### 4.4: Run Baseline Evaluation

**TODO**: Evaluate the pre-trained model to establish our baseline.

This will take a few minutes as it computes embeddings for all queries and documents.

In [None]:
print("Evaluating baseline model...\n")
print("This may take a few minutes...\n")

# TODO: Run the evaluator on the model
results = None  # Replace None with your code

print("\nBaseline evaluation complete!\n")
print("Baseline NDCG@10 scores by dimension:\n")

# Print NDCG@10 scores for each dimension
baseline_scores = {}
for dim in MATRYOSHKA_DIMENSIONS:
    key = f"dim_{dim}_cosine_ndcg@10"
    score = results[key]
    baseline_scores[dim] = score
    print(f"   dim {dim}: {score:.4f}")

## Step 5: Define Loss Function with Matryoshka Representation

Now we'll set up our loss function for fine-tuning.

### Understanding the Loss Functions

1. **MultipleNegativesRankingLoss**: For positive pairs (question, context), treats other samples in the batch as negatives
2. **MatryoshkaLoss**: Wraps the loss to train multiple dimensions simultaneously
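
To show how the two losses fit together, here is a simplified pure-Python sketch. It is not the library's implementation. It assumes cosine similarity with a scale factor of 20 for the in-batch ranking loss and an unweighted sum over dimensions for the Matryoshka wrapper. The function names and the toy 4-dimensional vectors are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positive in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        log_sum_exp = math.log(sum(math.exp(l) for l in logits))
        total += log_sum_exp - logits[i]  # negative log-softmax of the true pair
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims):
    """Sum the inner loss over embeddings truncated to each dimension."""
    return sum(
        in_batch_ranking_loss([a[:d] for a in anchors], [p[:d] for p in positives])
        for d in dims
    )

# Toy batch of two (anchor, positive) pairs.
anchors = [[1.0, 0.0, 0.2, 0.0], [0.0, 1.0, 0.0, 0.2]]
aligned = [[0.9, 0.1, 0.2, 0.0], [0.1, 0.9, 0.0, 0.2]]  # correct pairing
swapped = aligned[::-1]                                  # wrong pairing

loss_aligned = matryoshka_loss(anchors, aligned, dims=[4, 2])
loss_swapped = matryoshka_loss(anchors, swapped, dims=[4, 2])
print(f"aligned: {loss_aligned:.4f}, swapped: {loss_swapped:.4f}")
```

The loss is low when each anchor is closest to its own positive at every truncated size, which is what pushes the most useful information into the leading dimensions.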

### 5.1: Reload Model for Training

**TODO**: Load the model with SDPA (Scaled Dot-Product Attention) for efficient training.

In [None]:
print("Loading model for fine-tuning...")

# TODO: Load model with SDPA attention and model card data
model = None  # Replace None with your code

print("Model loaded for training!")

### 5.2: Initialize Loss Function

**TODO**: Create the Matryoshka loss function.

In [None]:
# TODO: Create the inner loss function (MultipleNegativesRankingLoss)
inner_train_loss = None  # Replace None with your code

# TODO: Wrap with MatryoshkaLoss
train_loss = None  # Replace None with your code

print("Loss function configured!")
print(f"   Inner loss: {type(inner_train_loss).__name__}")
print(f"   Matryoshka dimensions: {MATRYOSHKA_DIMENSIONS}")

## Step 6: Set Up Training Arguments

**TODO**: Configure the training parameters.
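
Two of the pre-filled arguments below, `warmup_ratio=0.1` and `lr_scheduler_type="cosine"`, shape how the learning rate changes during training. The sketch below illustrates the general pattern (linear warmup, then cosine decay). The function `lr_at_step` and the total of 50 optimizer steps are hypothetical, and the trainer's exact step counts may differ.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay towards zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 50  # hypothetical number of optimizer steps

for step in [0, 2, 5, 25, 50]:
    print(f"step {step:>2}: lr = {lr_at_step(step, TOTAL_STEPS):.2e}")
```

With a large effective batch size, the total number of optimizer steps is small, so the warmup phase covers only a handful of steps.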

In [None]:
# TODO: Define training arguments
args = SentenceTransformerTrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=None,                    # TODO
    per_device_train_batch_size=None,         # TODO
    gradient_accumulation_steps=None,         # TODO
    per_device_eval_batch_size=16,
    warmup_ratio=0.1,
    learning_rate=None,                       # TODO
    lr_scheduler_type="cosine",
    optim="adamw_torch_fused",
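    tf32=True,                                # Requires an Ampere-or-newer NVIDIA GPU; set to False on older GPUs (e.g., T4)
    bf16=True,                                # Requires bf16 hardware support; consider fp16=True instead on older GPUs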
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # Important for MultipleNegativesRankingLoss
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=10,
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="eval_dim_128_cosine_ndcg@10",  # Optimize for 128 dimensions
    report_to="none",
)

print("Training arguments configured!")
print(f"\nTraining configuration:")
print(f"   Epochs: {args.num_train_epochs}")
print(f"   Batch size: {args.per_device_train_batch_size}")
print(f"   Gradient accumulation: {args.gradient_accumulation_steps}")
print(f"   Effective batch size: {args.per_device_train_batch_size * args.gradient_accumulation_steps}")
print(f"   Learning rate: {args.learning_rate}")

## Step 7: Create the Trainer

**TODO**: Create the SentenceTransformerTrainer.

In [None]:
# Reload training dataset
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")

# TODO: Create the trainer
trainer = None  # Replace None with your code

print("Trainer created!")

## Step 8: Train the Model

Now let's fine-tune! This will take several minutes depending on your GPU.

**TODO**: Start the training process.

In [None]:
print("Starting fine-tuning...\n")
print("=" * 80)
print(f"This will train for {NUM_EPOCHS} epochs and evaluate after each epoch.")
print("The model will be optimized for dimension 128.")
print("=" * 80 + "\n")

# TODO: Train the model
# Replace pass with your code
pass

print("\n" + "=" * 80)
print("Training complete!")
print("=" * 80)

## Step 9: Save the Fine-Tuned Model

**TODO**: Save the model for future use.

In [None]:
print("Saving fine-tuned model...")

# TODO: Save the model
# Replace pass with your code
pass

print(f"Model saved to {OUTPUT_DIR}!")

## Step 10: Evaluate Fine-Tuned Model

Let's evaluate our fine-tuned model and compare it to the baseline!

### 10.1: Load and Evaluate Fine-Tuned Model

**TODO**: Load the fine-tuned model and evaluate it.

In [None]:
print("Evaluating fine-tuned model...\n")
print("This may take a few minutes...\n")

# TODO: Load the fine-tuned model
fine_tuned_model = None  # Replace None with your code

# TODO: Evaluate the fine-tuned model
results = None  # Replace None with your code

print("\nFine-tuned model evaluation complete!\n")
print("Fine-tuned NDCG@10 scores by dimension:\n")

# Print NDCG@10 scores for each dimension
finetuned_scores = {}
for dim in MATRYOSHKA_DIMENSIONS:
    key = f"dim_{dim}_cosine_ndcg@10"
    score = results[key]
    finetuned_scores[dim] = score
    print(f"   dim {dim}: {score:.4f}")

### 10.2: Compare Results

Let's create a comparison table to see the improvement!

In [None]:
import pandas as pd

# Create comparison DataFrame
comparison_data = []
for dim in MATRYOSHKA_DIMENSIONS:
    baseline = baseline_scores[dim]
    finetuned = finetuned_scores[dim]
    improvement = ((finetuned - baseline) / baseline) * 100

    comparison_data.append({
        'Dimension': dim,
        'Baseline': f"{baseline:.4f}",
        'Fine-tuned': f"{finetuned:.4f}",
        'Improvement': f"{improvement:.2f}%"
    })

comparison_df = pd.DataFrame(comparison_data)

print("\n" + "=" * 80)
print("PERFORMANCE COMPARISON")
print("=" * 80)
print(comparison_df.to_string(index=False))
print("=" * 80)

### 10.3: Analyze Results

Let's analyze the key findings from our fine-tuning experiment.

In [None]:
print("\nKEY INSIGHTS:\n")

# Calculate storage reduction at 128 dimensions
storage_reduction = 768 / 128
performance_retention_128 = (finetuned_scores[128] / finetuned_scores[768]) * 100

# Best improvement dimension
best_improvement_dim = max(MATRYOSHKA_DIMENSIONS,
                           key=lambda d: (finetuned_scores[d] - baseline_scores[d]) / baseline_scores[d])
best_improvement = ((finetuned_scores[best_improvement_dim] - baseline_scores[best_improvement_dim]) /
                   baseline_scores[best_improvement_dim]) * 100

print(f"1. Average improvement across all dimensions: "
      f"{sum((finetuned_scores[d] - baseline_scores[d]) / baseline_scores[d] for d in MATRYOSHKA_DIMENSIONS) / len(MATRYOSHKA_DIMENSIONS) * 100:.2f}%")

print(f"\n2. Largest improvement: {best_improvement:.2f}% at dimension {best_improvement_dim}")

print(f"\n3. Dimension 128 achieves {performance_retention_128:.2f}% of full 768-dim performance")
print(f"   with {storage_reduction:.0f}x storage reduction!")

print(f"\n4. Fine-tuned 128-dim vs Baseline 768-dim:")
cross_comparison = ((finetuned_scores[128] - baseline_scores[768]) / baseline_scores[768]) * 100
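print(f"   Fine-tuned 128-dim is {cross_comparison:+.2f}% vs baseline 768-dim")
if cross_comparison > 0:
    # Only claim a win when the smaller fine-tuned model actually beats the baseline
    print("   (Smaller, faster, AND better!)")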

print(f"\n5. Fine-tuned 64-dim vs Baseline 768-dim:")
cross_comparison_64 = ((finetuned_scores[64] - baseline_scores[768]) / baseline_scores[768]) * 100
print(f"   Fine-tuned 64-dim is {cross_comparison_64:+.2f}% vs baseline 768-dim")
print(f"   (12x smaller!)")

## Step 11: Test with Sample Queries

Let's test our fine-tuned model with some sample financial questions!

In [None]:
# Sample queries to test
sample_queries = [
    "What was NVIDIA's total revenue?",
    "What are the main business segments?",
    "What are the key risk factors?"
]

print("Testing retrieval with sample queries:\n")
print("=" * 80)

# Encode all corpus documents once, outside the loop (re-encoding them per query is wasteful)
corpus_embeddings = fine_tuned_model.encode(list(corpus.values()), convert_to_tensor=True)
corpus_embeddings = corpus_embeddings[:, :128]  # Truncate to 128 dimensions

for query in sample_queries:
    print(f"\nQuery: {query}\n")

    # Encode query with fine-tuned model (using 128 dimensions)
    query_embedding = fine_tuned_model.encode(query, convert_to_tensor=True)
    query_embedding = query_embedding[:128]  # Truncate to 128 dimensions

    # Calculate similarity scores
    scores = cos_sim(query_embedding, corpus_embeddings)[0]

    # Get top result
    top_idx = scores.argmax().item()
    top_doc_id = list(corpus.keys())[top_idx]
    top_doc = corpus[top_doc_id]

    print(f"Top Retrieved Document (score: {scores[top_idx]:.4f}):")
    print(f"{top_doc[:300]}...")
    print("-" * 80)

## Congratulations!

You've successfully fine-tuned an embedding model with Matryoshka representation learning! Here's what you accomplished:

1. ✅ Loaded and prepared a domain-specific Q&A dataset
2. ✅ Established a baseline with pre-trained embeddings
3. ✅ Implemented Matryoshka Representation Learning
4. ✅ Fine-tuned an embedding model with MultipleNegativesRankingLoss
5. ✅ Evaluated performance across multiple embedding dimensions
6. ✅ Achieved significant improvements (typically 7-22% boost)
7. ✅ Demonstrated 6x storage reduction with maintained performance

### Key Takeaways

**Why This Matters:**
- Fine-tuning embedding models is crucial for domain-specific RAG applications
- Matryoshka embeddings provide flexibility in the speed/accuracy trade-off
- Even small datasets (6-7k samples) can yield significant improvements
- Smaller dimensions can outperform larger baseline dimensions after fine-tuning

### Next Steps

Want to go further? Try:
- Fine-tuning on your own domain-specific data
- Experimenting with different base models (e.g., bge-large, e5-base)
- Testing different Matryoshka dimension combinations
- Trying other loss functions (e.g., CosineSimilarityLoss or ContrastiveLoss, which require labeled pairs rather than positive pairs alone)
- Generating synthetic training data using LLMs
- Integrating the fine-tuned model into a full RAG pipeline
- Benchmarking inference speed across different dimensions

### Additional Resources

- [Sentence Transformers Documentation](https://sbert.net/)
- [Matryoshka Representation Learning Paper](https://arxiv.org/abs/2205.13147)
- [BGE Embedding Models](https://huggingface.co/BAAI/bge-base-en-v1.5)
- [Information Retrieval Evaluation](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html)

### Real-World Applications

This technique is used in production for:
- **Customer support**: Domain-specific FAQ retrieval
- **Legal tech**: Case law and document search
- **Healthcare**: Medical literature retrieval
- **E-commerce**: Product search and recommendations
- **Finance**: Regulatory document search (like we did!)

Great work!