<a href="https://colab.research.google.com/github/Texsic/Machine-Learning/blob/main/MLB_Project_4_Challenge_EmbeddingFineTuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MLB Project - Fine-Tuning Embedding Models for RAG

## Project Overview

Welcome to the Embedding Fine-Tuning Project! In this project, you'll learn how to **fine-tune an embedding model** to improve retrieval performance in RAG (Retrieval-Augmented Generation) applications.

### What are Embedding Models?
Embedding models convert text into dense vector representations (embeddings) that capture semantic meaning. They're crucial for:
- **Semantic search**: Finding similar documents based on meaning, not just keywords
- **RAG systems**: Retrieving relevant context to answer questions
- **Clustering**: Grouping similar documents together

### Why Fine-Tune Embeddings?
Pre-trained embedding models are trained on general-purpose text, which can limit their effectiveness on domain-specific language such as financial filings. Fine-tuning on your own data can significantly boost retrieval performance!

### What is Matryoshka Representation Learning?
Matryoshka embeddings can be truncated to various dimensions (768 → 512 → 256 → 128 → 64) without significant performance loss. This allows:
- **3x less storage** (for example, 768 → 256 dimensions) while keeping ~99% performance
- **Faster search** with smaller dimensions
- **Flexibility** to choose speed vs. accuracy trade-offs
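
To make the truncation idea concrete, here is a minimal sketch in plain Python. The 8-dimensional vectors are made-up toy numbers, not model output, and the helper names `truncate_and_normalize` and `cosine` are hypothetical. The sketch keeps the first `dim` values of each vector, rescales them to unit length, and checks that the relevant document still scores higher than the irrelevant one.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 8-dimensional "embeddings" (illustrative values only)
query = [0.9, 0.1, 0.3, 0.0, 0.05, -0.02, 0.01, 0.03]
doc_a = [0.8, 0.2, 0.4, 0.1, -0.03, 0.04, 0.02, -0.01]   # similar to the query
doc_b = [-0.1, 0.9, -0.2, 0.3, 0.02, 0.01, -0.04, 0.05]  # unrelated

for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    a = truncate_and_normalize(doc_a, dim)
    b = truncate_and_normalize(doc_b, dim)
    print(f"dim {dim}: sim(query, doc_a)={cosine(q, a):.3f}, sim(query, doc_b)={cosine(q, b):.3f}")
```

In a Matryoshka-trained model the most important information is pushed into the leading dimensions, which is why a ranking like this tends to survive truncation.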

### What You'll Learn
- How to prepare datasets for embedding fine-tuning
- How to evaluate embedding models using information retrieval metrics
- How to implement Matryoshka Representation Learning
- How to fine-tune embedding models with Sentence Transformers
- How to compare performance across different embedding dimensions

### Project Structure
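1. Setup and imports
2. Configuration
3. Load and prepare the dataset
4. Create baseline evaluation
5. Define loss function with Matryoshka representation
6. Set up training arguments
7. Create the trainer
8. Train the model
9. Save the fine-tuned model
10. Evaluate and compare results
11. Test with sample queries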

### Dataset: Financial Q&A Pairs
We'll use a dataset of financial questions and answers drawn from company 10-K filings submitted to the SEC (such as NVIDIA's) to create a domain-specific embedding model.

---

## Step 1: Setup and Imports

First, let's install the required libraries and import them.

**Note**: This project requires a GPU for efficient training. Make sure you're running on a GPU-enabled environment (like Colab with GPU runtime).

In [1]:
# Install required packages (run this cell first!)
!pip install "torch" tensorboard -q
!pip install --upgrade "sentence-transformers>=3" "datasets" "transformers" -q

In [2]:
# Import all necessary libraries
import os
import torch
import warnings
from datasets import load_dataset, concatenate_datasets
from sentence_transformers import SentenceTransformer, SentenceTransformerModelCardData, SentenceTransformerTrainer
from sentence_transformers.evaluation import InformationRetrievalEvaluator, SequentialEvaluator
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
from sentence_transformers.util import cos_sim
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

# Suppress unnecessary warnings
warnings.filterwarnings("ignore")
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print("All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

All libraries imported successfully!
PyTorch version: 2.9.0+cu126
CUDA available: True
GPU: Tesla T4


## Step 2: Configuration

Let's set up our model configuration and Matryoshka dimensions.

In [3]:
# Model configuration
MODEL_ID = "BAAI/bge-base-en-v1.5"  # Strong base embedding model (109M params, 768 dim)
OUTPUT_DIR = "bge-base-financial-matryoshka"  # Output directory for fine-tuned model

# Matryoshka dimensions (from large to small - this order is important!)
MATRYOSHKA_DIMENSIONS = [768, 512, 256, 128, 64]

# Training configuration
NUM_EPOCHS = 4
BATCH_SIZE = 32
GRADIENT_ACCUMULATION_STEPS = 16  # Effective batch size = 32 * 16 = 512
LEARNING_RATE = 2e-5

print(f"Base Model: {MODEL_ID}")
print(f"Output Directory: {OUTPUT_DIR}")
print(f"Matryoshka Dimensions: {MATRYOSHKA_DIMENSIONS}")
print(f"Training Epochs: {NUM_EPOCHS}")
print(f"Batch Size: {BATCH_SIZE}")
print(f"Effective Batch Size: {BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")

Base Model: BAAI/bge-base-en-v1.5
Output Directory: bge-base-financial-matryoshka
Matryoshka Dimensions: [768, 512, 256, 128, 64]
Training Epochs: 4
Batch Size: 32
Effective Batch Size: 512


## Step 3: Load and Prepare the Dataset

We'll use a financial Q&A dataset with question-context pairs from SEC 10-K filings; the `ticker` and `filing` columns identify the source company and document.

### 3.1: Load the Dataset

**TODO**: Load the financial embedding dataset and prepare it for training.

**Dataset format**: Each sample has `question`, `answer`, `context`, `ticker`, and `filing` fields. We only use `question` and `context` for training.

In [4]:
print("Loading dataset...")

# TODO: Load the dataset from Hugging Face
dataset = load_dataset("virattt/financial-qa-10K", split="train")

print(f"Dataset loaded with {len(dataset)} samples")
print(f"\nDataset structure: {dataset}")
print(f"\nDataset features: {dataset.features}")

Loading dataset...
Dataset loaded with 7000 samples

Dataset structure: Dataset({
    features: ['question', 'answer', 'context', 'ticker', 'filing'],
    num_rows: 7000
})

Dataset features: {'question': Value('string'), 'answer': Value('string'), 'context': Value('string'), 'ticker': Value('string'), 'filing': Value('string')}


### 3.2: Explore Sample Data

Let's look at some examples to understand the question-context pairs.

In [5]:
# Display 3 sample question-context pairs
print("Sample question-context pairs:\n")
for i in range(3):
    sample = dataset[i]
    print(f"Example {i+1}:")
    print(f"Question: {sample['question']}")
    print(f"Context: {sample['context'][:200]}...")  # Show first 200 chars
    print("-" * 80)
    print()

Sample question-context pairs:

Example 1:
Question: What area did NVIDIA initially focus on before expanding to other computationally intensive fields?
Context: Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields....
--------------------------------------------------------------------------------

Example 2:
Question: What are some of the recent applications of GPU-powered deep learning as mentioned by NVIDIA?
Context: Some of the most recent applications of GPU-powered deep learning include recommendation systems, which are AI algorithms trained to understand the preferences, previous decisions, and characteristics...
--------------------------------------------------------------------------------

Example 3:
Question: What significant invention did NVIDIA create in 1999?
Context: Our invention of the GPU in 1999 defined modern computer graphics and established NVIDIA as the leader in computer graphics....
--------------------------------------------------------------------------------

### 3.3: Prepare Dataset for Sentence Transformers

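Sentence Transformers losses read the input columns in order, so for `MultipleNegativesRankingLoss` the first column must be the query and the second the relevant document. By convention these are named `anchor` (query) and `positive` (relevant document).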

**TODO**: Rename columns and add IDs to the dataset.

In [6]:
# TODO: Rename 'question' column to 'anchor'
dataset = dataset.rename_column("question", "anchor")

# TODO: Rename 'context' column to 'positive'
dataset = dataset.rename_column("context", "positive")

# TODO: Add an 'id' column with sequential numbers
dataset = dataset.add_column("id", range(len(dataset)))

print("Dataset prepared!")
print(f"\nUpdated columns: {dataset.column_names}")

Dataset prepared!

Updated columns: ['anchor', 'answer', 'positive', 'ticker', 'filing', 'id']


### 3.4: Split the Dataset

**TODO**: Split the dataset into train (90%) and test (10%) sets.

In [7]:
# TODO: Split dataset into train (90%) and test (10%)
dataset = dataset.train_test_split(test_size=0.1, seed=42)

print("Dataset split complete!")
print(f"\nDataset splits:")
print(f"Training: {len(dataset['train'])} samples")
print(f"Test: {len(dataset['test'])} samples")

# Save datasets to disk for later use
dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")
print("\nDatasets saved to disk!")

Dataset split complete!

Dataset splits:
Training: 6300 samples
Test: 700 samples



Datasets saved to disk!


## Step 4: Create Baseline Evaluation

Before fine-tuning, let's evaluate the pre-trained model to establish a baseline.

### Understanding Evaluation Metrics

We'll use **NDCG@10** (Normalized Discounted Cumulative Gain at 10):
- Measures ranking quality (0 to 1, higher is better)
- Takes into account the position of relevant documents
- @10 means we look at the top 10 retrieved documents
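
As a rough illustration of how NDCG@10 behaves, the sketch below computes it for binary relevance in plain Python. The function name `ndcg_at_k` and the toy document IDs are assumptions for this example; the evaluator used later in this project computes the metric internally.

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance: a hit at rank r contributes 1 / log2(r + 1)."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    # Ideal ranking: all relevant documents placed at the top
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# One relevant document (ID 7), as in this project's setup
print(ndcg_at_k([7, 1, 2], [7]))          # relevant doc at rank 1 -> 1.0
print(ndcg_at_k([1, 2, 7], [7]))          # relevant doc at rank 3 -> 0.5
print(ndcg_at_k(list(range(20)), [15]))   # relevant doc outside the top 10 -> 0.0
```

With a single relevant document per query, the score depends only on the rank at which that document appears, and it drops to zero once the document falls outside the top 10.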

### 4.1: Load the Pre-trained Model

**TODO**: Load the base embedding model for evaluation.

In [8]:
print("Loading pre-trained model...")

# TODO: Load the SentenceTransformer model
# (bge-base uses a standard BERT architecture, so trust_remote_code is not needed)
model = SentenceTransformer(MODEL_ID)

print(f"Model loaded: {MODEL_ID}")
print(f"   Embedding dimension: {model.get_sentence_embedding_dimension()}")

Loading pre-trained model...
Model loaded: BAAI/bge-base-en-v1.5
   Embedding dimension: 768


### 4.2: Prepare Evaluation Data

For evaluation, we need:
- **Corpus**: All documents that could be retrieved (train + test)
- **Queries**: Questions from the test set
- **Relevant docs**: Mapping of which document is relevant for each query

**TODO**: Load the datasets and create the evaluation dictionaries.

In [9]:
# TODO: Load test dataset from JSON file
test_dataset = load_dataset("json", data_files="test_dataset.json", split="train")

# TODO: Load train dataset from JSON file
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")

# TODO: Concatenate train and test to create full corpus
corpus_dataset = concatenate_datasets([train_dataset, test_dataset])

# TODO: Create corpus dictionary (id -> document text)
corpus = dict(zip(corpus_dataset["id"], corpus_dataset["positive"]))

# TODO: Create queries dictionary (id -> question text)
queries = dict(zip(test_dataset["id"], test_dataset["anchor"]))

# Create relevant docs mapping (each query's relevant doc has the same ID)
relevant_docs = {}
for q_id in queries:
    relevant_docs[q_id] = [q_id]

print("Evaluation data prepared!")
print(f"   Corpus size: {len(corpus)} documents")
print(f"   Queries: {len(queries)} questions")

Evaluation data prepared!
   Corpus size: 7000 documents
   Queries: 700 questions


### 4.3: Create Matryoshka Evaluators

We'll create an evaluator for each dimension to see how performance changes with embedding size.

**TODO**: Create evaluators for each Matryoshka dimension.

In [10]:
matryoshka_evaluators = []

# TODO: Create an InformationRetrievalEvaluator for each dimension
for dim in MATRYOSHKA_DIMENSIONS:
    ir_evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name=f"dim_{dim}",
        truncate_dim=dim,
        score_functions={"cosine": cos_sim}
    )
    matryoshka_evaluators.append(ir_evaluator)

# TODO: Create a sequential evaluator
evaluator = SequentialEvaluator(matryoshka_evaluators)

print(f"Created evaluators for {len(MATRYOSHKA_DIMENSIONS)} dimensions!")

Created evaluators for 5 dimensions!


### 4.4: Run Baseline Evaluation

**TODO**: Evaluate the pre-trained model to establish our baseline.

This can take a while, because each of the five evaluators encodes all queries and documents again; the cell below estimates 15-20 minutes on a GPU.

In [11]:
# Run the evaluation on the baseline model
print("Running baseline evaluation... (This takes 15-20 mins on GPU)")
baseline_results = evaluator(model)

baseline_scores = {}

print("\nBaseline Results:")
for dim in MATRYOSHKA_DIMENSIONS:
    key = f"dim_{dim}_cosine_ndcg@10"
    baseline_scores[dim] = baseline_results[key]
    print(f"   dim {dim}: {baseline_scores[dim]:.4f}")

print("\nBaseline scores stored successfully!")

Running baseline evaluation... (This takes 15-20 mins on GPU)

Baseline Results:
   dim 768: 0.7443
   dim 512: 0.7375
   dim 256: 0.7301
   dim 128: 0.6966
   dim 64: 0.6351

Baseline scores stored successfully!


## Step 5: Define Loss Function with Matryoshka Representation

Now we'll set up our loss function for fine-tuning.

### Understanding the Loss Functions

1. **MultipleNegativesRankingLoss**: For positive pairs (question, context), treats other samples in the batch as negatives
2. **MatryoshkaLoss**: Wraps the loss to train multiple dimensions simultaneously
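
The following plain-Python sketch shows the idea behind both losses on toy vectors. It is an illustration, not the library's implementation: the function names are hypothetical, the similarity scale of 20 is assumed here to match the library default, and each Matryoshka dimension is given equal weight.

```python
import math

def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities.

    positives[i] is the target for anchors[i]; every other positive in the
    batch acts as a negative.
    """
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [scale * sum(x * y for x, y in zip(anchor, pos)) for pos in p]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(a)

def matryoshka_loss(anchors, positives, dims):
    """Sum the base loss over truncated prefixes of the embeddings."""
    return sum(
        in_batch_ranking_loss([v[:d] for v in anchors], [v[:d] for v in positives])
        for d in dims
    )

# Toy 4-dimensional embeddings for a batch of three (anchor, positive) pairs
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3], [0.5, 0.5, 1.0, 0.0]]
aligned = [[0.9, 0.1, 0.2, 0.0], [0.1, 0.8, 0.0, 0.3], [0.4, 0.6, 0.9, 0.1]]
shuffled = aligned[1:] + aligned[:1]  # every anchor paired with the wrong positive

print("aligned pairs: ", matryoshka_loss(anchors, aligned, dims=[4, 2]))
print("shuffled pairs:", matryoshka_loss(anchors, shuffled, dims=[4, 2]))
```

The loss is lower when each anchor is closest to its own positive at every truncation level, which is the behavior fine-tuning tries to produce.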

### 5.1: Reload Model for Training

**TODO**: Load the model with SDPA (Scaled Dot-Product Attention) for efficient training.

In [12]:
print("Loading model for fine-tuning...")

# TODO: Load model with SDPA attention and model card data
model = SentenceTransformer(
    MODEL_ID,
    model_kwargs={"attn_implementation": "sdpa"},  # request PyTorch scaled dot-product attention
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="BGE base Financial Matryoshka",
    ),
)

print("Model loaded for training!")

Loading model for fine-tuning...
Model loaded for training!


### 5.2: Initialize Loss Function

**TODO**: Create the Matryoshka loss function.

In [13]:
# TODO: Create the inner loss function (MultipleNegativesRankingLoss)
inner_train_loss = MultipleNegativesRankingLoss(model)

# TODO: Wrap with MatryoshkaLoss
train_loss = MatryoshkaLoss(
    model=model,
    loss=inner_train_loss,
    matryoshka_dims=MATRYOSHKA_DIMENSIONS
)

print("Loss function configured!")
print(f"   Inner loss: {type(inner_train_loss).__name__}")
print(f"   Matryoshka dimensions: {MATRYOSHKA_DIMENSIONS}")

Loss function configured!
   Inner loss: MultipleNegativesRankingLoss
   Matryoshka dimensions: [768, 512, 256, 128, 64]


## Step 6: Set Up Training Arguments

**TODO**: Configure the training parameters.

In [14]:
# TODO: Define training arguments
args = SentenceTransformerTrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    per_device_eval_batch_size=16,
    warmup_ratio=0.1,
    learning_rate=LEARNING_RATE,
    lr_scheduler_type="cosine",
    optim="adamw_torch_fused",
    tf32=False,
    bf16=False,
    fp16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=10,
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="eval_dim_128_cosine_ndcg@10",
    report_to="none",
)

print("Training arguments configured!")
print(f"\nTraining configuration:")
print(f"   Epochs: {args.num_train_epochs}")
print(f"   Batch size: {args.per_device_train_batch_size}")
print(f"   Gradient accumulation: {args.gradient_accumulation_steps}")
print(f"   Effective batch size: {args.per_device_train_batch_size * args.gradient_accumulation_steps}")
print(f"   Learning rate: {args.learning_rate}")

Training arguments configured!

Training configuration:
   Epochs: 4
   Batch size: 32
   Gradient accumulation: 16
   Effective batch size: 512
   Learning rate: 2e-05
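
A quick back-of-the-envelope calculation helps interpret these settings. The sketch below uses the 6,300-sample training split from Step 3.4 and simple ceiling division. The trainer's exact step count may differ slightly, depending on how it handles the last partial accumulation window and on the `NO_DUPLICATES` batch sampler.

```python
import math

train_samples = 6300   # size of the training split from Step 3.4
batch_size = 32        # BATCH_SIZE
grad_accum = 16        # GRADIENT_ACCUMULATION_STEPS
epochs = 4             # NUM_EPOCHS

batches_per_epoch = math.ceil(train_samples / batch_size)
updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
total_updates = updates_per_epoch * epochs

# MultipleNegativesRankingLoss is computed per forward batch, so each anchor
# is contrasted only against the other positives in its own batch of 32.
in_batch_negatives = batch_size - 1

print(f"Batches per epoch:          {batches_per_epoch}")
print(f"Optimizer updates per epoch: ~{updates_per_epoch}")
print(f"Optimizer updates in total:  ~{total_updates}")
print(f"In-batch negatives per anchor: {in_batch_negatives}")
```

Gradient accumulation enlarges the batch seen by each optimizer update, but it does not add in-batch negatives. With only a few dozen updates in total, `logging_steps=10` yields just a handful of logged loss values.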


## Step 7: Create the Trainer

**TODO**: Create the SentenceTransformerTrainer.

In [15]:
# Reload training dataset
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")

train_dataset = train_dataset.select_columns(["anchor", "positive"])

# TODO: Create the trainer
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator,
)

print("Trainer created!")

Trainer created!


## Step 8: Train the Model

Now let's fine-tune! This will take several minutes depending on your GPU.

**TODO**: Start the training process.

In [16]:
print("Starting fine-tuning...\n")
print("=" * 80)
print("This will train for 4 epochs and evaluate after each epoch.")
print("The model will be optimized for dimension 128.")
print("=" * 80 + "\n")

# TODO: Train the model
trainer.train()

print("\n" + "=" * 80)
print("Training complete!")
print("=" * 80)

Starting fine-tuning...

This will train for 4 epochs and evaluate after each epoch.
The model will be optimized for dimension 128.



| Metric | Epoch 1 | Epoch 2 | Epoch 3 | Epoch 4 |
|---|---|---|---|---|
| Training Loss | 1.6769 | 0.7481 | 0.4447 | 0.3761 |
| Validation Loss | No log | No log | No log | No log |
| Dim 768 Cosine Accuracy@1 | 0.662857 | 0.685714 | 0.698571 | 0.698571 |
| Dim 768 Cosine Accuracy@3 | 0.817143 | 0.82 | 0.828571 | 0.831429 |
| Dim 768 Cosine Accuracy@5 | 0.857143 | 0.868571 | 0.874286 | 0.874286 |
| Dim 768 Cosine Accuracy@10 | 0.905714 | 0.91 | 0.91 | 0.91 |
| Dim 768 Cosine Precision@1 | 0.662857 | 0.685714 | 0.698571 | 0.698571 |
| Dim 768 Cosine Precision@3 | 0.272381 | 0.273333 | 0.27619 | 0.277143 |
| Dim 768 Cosine Precision@5 | 0.171429 | 0.173714 | 0.174857 | 0.174857 |
| Dim 768 Cosine Precision@10 | 0.090571 | 0.091 | 0.091 | 0.091 |
| Dim 768 Cosine Recall@1 | 0.662857 | 0.685714 | 0.698571 | 0.698571 |
| Dim 768 Cosine Recall@3 | 0.817143 | 0.82 | 0.828571 | 0.831429 |
| Dim 768 Cosine Recall@5 | 0.857143 | 0.868571 | 0.874286 | 0.874286 |
| Dim 768 Cosine Recall@10 | 0.905714 | 0.91 | 0.91 | 0.91 |
| Dim 768 Cosine Ndcg@10 | 0.787183 | 0.799079 | 0.805671 | 0.806348 |
| Dim 768 Cosine Mrr@10 | 0.748914 | 0.763377 | 0.771987 | 0.772824 |
| Dim 768 Cosine Map@100 | 0.752372 | 0.766718 | 0.775459 | 0.776388 |
| Dim 512 Cosine Accuracy@1 | 0.671429 | 0.684286 | 0.695714 | 0.7 |
| Dim 512 Cosine Accuracy@3 | 0.807143 | 0.822857 | 0.828571 | 0.828571 |
| Dim 512 Cosine Accuracy@5 | 0.854286 | 0.858571 | 0.862857 | 0.864286 |
| Dim 512 Cosine Accuracy@10 | 0.901429 | 0.901429 | 0.902857 | 0.902857 |
| Dim 512 Cosine Precision@1 | 0.671429 | 0.684286 | 0.695714 | 0.7 |
| Dim 512 Cosine Precision@3 | 0.269048 | 0.274286 | 0.27619 | 0.27619 |
| Dim 512 Cosine Precision@5 | 0.170857 | 0.171714 | 0.172571 | 0.172857 |
| Dim 512 Cosine Precision@10 | 0.090143 | 0.090143 | 0.090286 | 0.090286 |
| Dim 512 Cosine Recall@1 | 0.671429 | 0.684286 | 0.695714 | 0.7 |
| Dim 512 Cosine Recall@3 | 0.807143 | 0.822857 | 0.828571 | 0.828571 |
| Dim 512 Cosine Recall@5 | 0.854286 | 0.858571 | 0.862857 | 0.864286 |
| Dim 512 Cosine Recall@10 | 0.901429 | 0.901429 | 0.902857 | 0.902857 |
| Dim 512 Cosine Ndcg@10 | 0.786537 | 0.794216 | 0.801319 | 0.803078 |
| Dim 512 Cosine Mrr@10 | 0.749734 | 0.759562 | 0.768398 | 0.770773 |
| Dim 512 Cosine Map@100 | 0.75335 | 0.763458 | 0.772368 | 0.774881 |
| Dim 256 Cosine Accuracy@1 | 0.662857 | 0.687143 | 0.692857 | 0.694286 |
| Dim 256 Cosine Accuracy@3 | 0.808571 | 0.818571 | 0.825714 | 0.825714 |
| Dim 256 Cosine Accuracy@5 | 0.852857 | 0.857143 | 0.864286 | 0.865714 |
| Dim 256 Cosine Accuracy@10 | 0.887143 | 0.895714 | 0.897143 | 0.898571 |
| Dim 256 Cosine Precision@1 | 0.662857 | 0.687143 | 0.692857 | 0.694286 |
| Dim 256 Cosine Precision@3 | 0.269524 | 0.272857 | 0.275238 | 0.275238 |
| Dim 256 Cosine Precision@5 | 0.170571 | 0.171429 | 0.172857 | 0.173143 |
| Dim 256 Cosine Precision@10 | 0.088714 | 0.089571 | 0.089714 | 0.089857 |
| Dim 256 Cosine Recall@1 | 0.662857 | 0.687143 | 0.692857 | 0.694286 |
| Dim 256 Cosine Recall@3 | 0.808571 | 0.818571 | 0.825714 | 0.825714 |
| Dim 256 Cosine Recall@5 | 0.852857 | 0.857143 | 0.864286 | 0.865714 |
| Dim 256 Cosine Recall@10 | 0.887143 | 0.895714 | 0.897143 | 0.898571 |
| Dim 256 Cosine Ndcg@10 | 0.778363 | 0.793251 | 0.797813 | 0.798859 |
| Dim 256 Cosine Mrr@10 | 0.74307 | 0.760131 | 0.765509 | 0.766509 |
| Dim 256 Cosine Map@100 | 0.747625 | 0.76436 | 0.769742 | 0.770715 |
| Dim 128 Cosine Accuracy@1 | 0.664286 | 0.668571 | 0.677143 | 0.678571 |
| Dim 128 Cosine Accuracy@3 | 0.798571 | 0.808571 | 0.817143 | 0.817143 |
| Dim 128 Cosine Accuracy@5 | 0.83 | 0.852857 | 0.854286 | 0.855714 |
| Dim 128 Cosine Accuracy@10 | 0.875714 | 0.887143 | 0.89 | 0.892857 |
| Dim 128 Cosine Precision@1 | 0.664286 | 0.668571 | 0.677143 | 0.678571 |
| Dim 128 Cosine Precision@3 | 0.26619 | 0.269524 | 0.272381 | 0.272381 |
| Dim 128 Cosine Precision@5 | 0.166 | 0.170571 | 0.170857 | 0.171143 |
| Dim 128 Cosine Precision@10 | 0.087571 | 0.088714 | 0.089 | 0.089286 |
| Dim 128 Cosine Recall@1 | 0.664286 | 0.668571 | 0.677143 | 0.678571 |
| Dim 128 Cosine Recall@3 | 0.798571 | 0.808571 | 0.817143 | 0.817143 |
| Dim 128 Cosine Recall@5 | 0.83 | 0.852857 | 0.854286 | 0.855714 |
| Dim 128 Cosine Recall@10 | 0.875714 | 0.887143 | 0.89 | 0.892857 |
| Dim 128 Cosine Ndcg@10 | 0.771761 | 0.781378 | 0.787511 | 0.788859 |
| Dim 128 Cosine Mrr@10 | 0.73833 | 0.747082 | 0.754207 | 0.755195 |
| Dim 128 Cosine Map@100 | 0.742975 | 0.751382 | 0.758707 | 0.759474 |
| Dim 64 Cosine Accuracy@1 | 0.618571 | 0.642857 | 0.648571 | 0.648571 |
| Dim 64 Cosine Accuracy@3 | 0.755714 | 0.77 | 0.782857 | 0.782857 |
| Dim 64 Cosine Accuracy@5 | 0.791429 | 0.81 | 0.822857 | 0.82 |
| Dim 64 Cosine Accuracy@10 | 0.841429 | 0.857143 | 0.867143 | 0.87 |
| Dim 64 Cosine Precision@1 | 0.618571 | 0.642857 | 0.648571 | 0.648571 |
| Dim 64 Cosine Precision@3 | 0.251905 | 0.256667 | 0.260952 | 0.260952 |
| Dim 64 Cosine Precision@5 | 0.158286 | 0.162 | 0.164571 | 0.164 |
| Dim 64 Cosine Precision@10 | 0.084143 | 0.085714 | 0.086714 | 0.087 |
| Dim 64 Cosine Recall@1 | 0.618571 | 0.642857 | 0.648571 | 0.648571 |
| Dim 64 Cosine Recall@3 | 0.755714 | 0.77 | 0.782857 | 0.782857 |
| Dim 64 Cosine Recall@5 | 0.791429 | 0.81 | 0.822857 | 0.82 |
| Dim 64 Cosine Recall@10 | 0.841429 | 0.857143 | 0.867143 | 0.87 |
| Dim 64 Cosine Ndcg@10 | 0.731671 | 0.750266 | 0.758254 | 0.759774 |
| Dim 64 Cosine Mrr@10 | 0.696466 | 0.715984 | 0.723376 | 0.724515 |
| Dim 64 Cosine Map@100 | 0.702084 | 0.72127 | 0.728236 | 0.729223 |
| Sequential Score | 0.731671 | 0.750266 | 0.758254 | 0.759774 |



Training complete!


## Step 9: Save the Fine-Tuned Model

**TODO**: Save the model for future use.

In [17]:
print("Saving fine-tuned model...")

# TODO: Save the model
trainer.save_model(OUTPUT_DIR)

print(f"Model saved to {OUTPUT_DIR}!")

Saving fine-tuned model...
Model saved to bge-base-financial-matryoshka!


## Step 10: Evaluate Fine-Tuned Model

Let's evaluate our fine-tuned model and compare it to the baseline!

### 10.1: Load and Evaluate Fine-Tuned Model

**TODO**: Load the fine-tuned model and evaluate it.

In [18]:
print("Evaluating fine-tuned model...\n")
print("This may take a few minutes...\n")

fine_tuned_model = SentenceTransformer(OUTPUT_DIR)

results = evaluator(fine_tuned_model)

print("\nFine-tuned model evaluation complete!\n")
print("Fine-tuned NDCG@10 scores by dimension:\n")

finetuned_scores = {}
for dim in MATRYOSHKA_DIMENSIONS:
    key = f"dim_{dim}_cosine_ndcg@10"
    score = results[key]
    finetuned_scores[dim] = score
    print(f"   dim {dim}: {score:.4f}")

Evaluating fine-tuned model...

This may take a few minutes...


Fine-tuned model evaluation complete!

Fine-tuned NDCG@10 scores by dimension:

   dim 768: 0.8066
   dim 512: 0.8028
   dim 256: 0.7988
   dim 128: 0.7892
   dim 64: 0.7595


### 10.2: Compare Results

Let's create a comparison table to see the improvement!

In [19]:
import pandas as pd

# Create comparison DataFrame
comparison_data = []
for dim in MATRYOSHKA_DIMENSIONS:
    baseline = baseline_scores[dim]
    finetuned = finetuned_scores[dim]
    improvement = ((finetuned - baseline) / baseline) * 100

    comparison_data.append({
        'Dimension': dim,
        'Baseline': f"{baseline:.4f}",
        'Fine-tuned': f"{finetuned:.4f}",
        'Improvement': f"{improvement:.2f}%"
    })

comparison_df = pd.DataFrame(comparison_data)

print("\n" + "=" * 80)
print("PERFORMANCE COMPARISON")
print("=" * 80)
print(comparison_df.to_string(index=False))
print("=" * 80)


PERFORMANCE COMPARISON
 Dimension Baseline Fine-tuned Improvement
       768   0.7443     0.8066       8.37%
       512   0.7375     0.8028       8.86%
       256   0.7301     0.7988       9.42%
       128   0.6966     0.7892      13.30%
        64   0.6351     0.7595      19.58%


### 10.3: Analyze Results

Let's analyze the key findings from our fine-tuning experiment.

In [20]:
print("\nKEY INSIGHTS:\n")

# Calculate storage reduction at 128 dimensions
storage_reduction = 768 / 128
performance_retention_128 = (finetuned_scores[128] / finetuned_scores[768]) * 100

# Best improvement dimension
best_improvement_dim = max(MATRYOSHKA_DIMENSIONS,
                           key=lambda d: (finetuned_scores[d] - baseline_scores[d]) / baseline_scores[d])
best_improvement = ((finetuned_scores[best_improvement_dim] - baseline_scores[best_improvement_dim]) /
                   baseline_scores[best_improvement_dim]) * 100

print(f"1. Average improvement across all dimensions: "
      f"{sum((finetuned_scores[d] - baseline_scores[d]) / baseline_scores[d] for d in MATRYOSHKA_DIMENSIONS) / len(MATRYOSHKA_DIMENSIONS) * 100:.2f}%")

print(f"\n2. Largest improvement: {best_improvement:.2f}% at dimension {best_improvement_dim}")

print(f"\n3. Dimension 128 achieves {performance_retention_128:.2f}% of full 768-dim performance")
print(f"   with {storage_reduction:.0f}x storage reduction!")

print(f"\n4. Fine-tuned 128-dim vs Baseline 768-dim:")
cross_comparison = ((finetuned_scores[128] - baseline_scores[768]) / baseline_scores[768]) * 100
print(f"   Fine-tuned 128-dim is {cross_comparison:+.2f}% vs baseline 768-dim")
print(f"   (Smaller, faster, AND better!)")

print(f"\n5. Fine-tuned 64-dim vs Baseline 768-dim:")
cross_comparison_64 = ((finetuned_scores[64] - baseline_scores[768]) / baseline_scores[768]) * 100
print(f"   Fine-tuned 64-dim is {cross_comparison_64:+.2f}% vs baseline 768-dim")
print(f"   (12x smaller!)")


KEY INSIGHTS:

1. Average improvement across all dimensions: 11.90%

2. Largest improvement: 19.58% at dimension 64

3. Dimension 128 achieves 97.85% of full 768-dim performance
   with 6x storage reduction!

4. Fine-tuned 128-dim vs Baseline 768-dim:
   Fine-tuned 128-dim is +6.04% vs baseline 768-dim
   (Smaller, faster, AND better!)

5. Fine-tuned 64-dim vs Baseline 768-dim:
   Fine-tuned 64-dim is +2.04% vs baseline 768-dim
   (12x smaller!)
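
To put the storage reduction into absolute terms, here is a small sketch that estimates the size of a flat vector index for this project's 7,000-document corpus. It assumes 4-byte float32 values and ignores any index overhead; `index_size_mib` is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4  # assumption: embeddings are stored as float32
corpus_size = 7000     # documents in this project's corpus
full_dim = 768

def index_size_mib(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of num_vectors embeddings of length dim, in MiB."""
    return num_vectors * dim * bytes_per_value / (1024 ** 2)

for dim in (768, 512, 256, 128, 64):
    size = index_size_mib(corpus_size, dim)
    print(f"dim {dim:>3}: {size:6.2f} MiB ({full_dim / dim:.1f}x smaller than {full_dim}-dim)")
```

The savings scale linearly with corpus size, so the same ratios apply to an index with millions of documents.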


## Step 11: Test with Sample Queries

Let's test our fine-tuned model with some sample financial questions!

In [21]:
# Sample queries to test
sample_queries = [
    "What was NVIDIA's total revenue?",
    "What are the main business segments?",
    "What are the key risk factors?"
]

TEST_DIM = 128  # Matryoshka dimension used for this retrieval test

# Encode the corpus once, outside the query loop, then truncate
corpus_ids = list(corpus.keys())
corpus_embeddings = fine_tuned_model.encode(list(corpus.values()), convert_to_tensor=True)
corpus_embeddings = corpus_embeddings[:, :TEST_DIM]

print("Testing retrieval with sample queries:\n")
print("=" * 80)

for query in sample_queries:
    print(f"\nQuery: {query}\n")

    # Encode the query and truncate it to the same dimension
    query_embedding = fine_tuned_model.encode(query, convert_to_tensor=True)[:TEST_DIM]

    # cos_sim normalizes its inputs, so the truncated vectors need no re-normalization
    scores = cos_sim(query_embedding, corpus_embeddings)[0]

    # Get top result
    top_idx = scores.argmax().item()
    top_doc = corpus[corpus_ids[top_idx]]

    print(f"Top Retrieved Document (score: {scores[top_idx]:.4f}):")
    print(f"{top_doc[:300]}...")
    print("-" * 80)

Testing retrieval with sample queries:


Query: What was NVIDIA's total revenue?

Top Retrieved Document (score: 0.6862):
Total revenue | | 211,915 | | | 198,270 | | | 168,088...
--------------------------------------------------------------------------------

Query: What are the main business segments?

Top Retrieved Document (score: 0.7292):
Segments The Company manages its business primarily on a geographic basis. The Company’s reportable segments consist of the Americas, Europe, Greater China, Japan and Rest of Asia Pacific....
--------------------------------------------------------------------------------

Query: What are the key risk factors?

Top Retrieved Document (score: 0.7266):
Item 1A. Risk Factors The Company’s business, reputation, results of operations, financial condition and stock price can be affected by a number of factors, whether currently known or unknown, including those described below. When any one or more of these risks materialize from time to time, the Com...
--------------------------------------------------------------------------------

## Congratulations!

You've successfully fine-tuned an embedding model with Matryoshka representation learning! Here's what you accomplished:

1. ✅ Loaded and prepared a domain-specific Q&A dataset
2. ✅ Established a baseline with pre-trained embeddings
3. ✅ Implemented Matryoshka Representation Learning
4. ✅ Fine-tuned an embedding model with MultipleNegativesRankingLoss
5. ✅ Evaluated performance across multiple embedding dimensions
6. ✅ Achieved significant improvements (8.4% to 19.6% relative NDCG@10 gain across dimensions in this run)
7. ✅ Demonstrated 6x storage reduction at 128 dimensions while retaining about 98% of full-dimension NDCG@10

### Key Takeaways

**Why This Matters:**
- Fine-tuning embedding models is crucial for domain-specific RAG applications
- Matryoshka embeddings provide flexibility in the speed/accuracy trade-off
- Even small datasets (6-7k samples) can yield significant improvements
- Smaller dimensions can outperform larger baseline dimensions after fine-tuning (here, even the fine-tuned 64-dim embeddings beat the 768-dim baseline on NDCG@10)

### Next Steps

Want to go further? Try:
- Fine-tuning on your own domain-specific data
- Experimenting with different base models (e.g., bge-large, e5-base)
- Testing different Matryoshka dimension combinations
- Trying other loss functions (e.g., CosineSimilarityLoss or ContrastiveLoss, which need labeled pairs rather than anchor-positive pairs)
- Generating synthetic training data using LLMs
- Integrating the fine-tuned model into a full RAG pipeline
- Benchmarking inference speed across different dimensions
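
For the speed benchmark, a rough starting point could look like the sketch below. It uses random vectors and pure-Python dot-product scoring rather than real model embeddings, so the absolute timings are not meaningful; only the trend across dimensions is. The helper `top_match` and the corpus size of 500 are assumptions for illustration.

```python
import random
import time

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_match(query, corpus_vectors, dim):
    """Index of the corpus vector with the highest dot product on the first `dim` values."""
    q = query[:dim]
    scores = [dot(q, doc[:dim]) for doc in corpus_vectors]
    return max(range(len(scores)), key=scores.__getitem__)

random.seed(0)
FULL_DIM = 768
corpus_vectors = [[random.gauss(0, 1) for _ in range(FULL_DIM)] for _ in range(500)]
query = [random.gauss(0, 1) for _ in range(FULL_DIM)]

REPEATS = 5
timings = {}
for dim in (768, 256, 64):
    start = time.perf_counter()
    for _ in range(REPEATS):
        top_match(query, corpus_vectors, dim)
    timings[dim] = (time.perf_counter() - start) / REPEATS
    print(f"dim {dim:>3}: {timings[dim] * 1e3:.2f} ms per query")
```

For a realistic benchmark, replace the random vectors with embeddings from the fine-tuned model, truncate the corpus once ahead of time, and use a vectorized or approximate nearest-neighbor search.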

### Additional Resources

- [Sentence Transformers Documentation](https://sbert.net/)
- [Matryoshka Representation Learning Paper](https://arxiv.org/abs/2205.13147)
- [BGE Embedding Models](https://huggingface.co/BAAI/bge-base-en-v1.5)
- [Information Retrieval Evaluation](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html)

### Real-World Applications

This technique is used in production for:
- **Customer support**: Domain-specific FAQ retrieval
- **Legal tech**: Case law and document search
- **Healthcare**: Medical literature retrieval
- **E-commerce**: Product search and recommendations
- **Finance**: Regulatory document search (like we did!)

Great work!