# ShopTalk – Embedding Fine-Tuning & Model Benchmarking

**Project:** ShopTalk – AI-Powered Shopping Assistant  
**Dataset:** [Amazon Berkeley Objects (ABO)](https://amazon-berkeley-objects.s3.amazonaws.com/index.html)  
**Author:** Balaji Gurusala  
**Notebook Scope:** Fine-tuning (requirements.md §5), Model Comparison, Vector Store Benchmarking  
**Prerequisite:** `03-rag-prototype.ipynb` artifacts (rag_products.pkl, embeddings, rag_config.json)  
**Environment:** Kaggle (CUDA – T4/P100 GPU required for training) or local with CUDA/MPS

---

### Purpose

Per `requirements.md` §5 (Research & Academic Requirements):

| Requirement | This Notebook |
|------------|---------------|
| **Triplet Loss** fine-tuning on ABO | Generate triplets by category, train with `TripletLoss` |
| **Full fine-tuning** | Full parameter update (MiniLM has only 22M params, so LoRA's overhead is not worthwhile at this scale) |
| **Base vs Fine-Tuned** comparison | Benchmark Precision@5 before/after fine-tuning |
| **ChromaDB vs FAISS** comparison | Benchmark retrieval latency and throughput |

> **Note on LoRA:** LoRA/QLoRA adapters are most beneficial for large models (>1B params) where full fine-tuning is memory-prohibitive. For `all-MiniLM-L6-v2` (22M params), full fine-tuning is simple, inexpensive, and runs easily on Kaggle T4 GPUs.
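
To put rough numbers on that trade-off, the sketch below counts the weights a LoRA adapter would train on this model. The rank (8) and the choice of adapting only the query and value projections are illustrative assumptions, not settings used anywhere in this notebook; the layer count (6) and hidden size (384) are those of `all-MiniLM-L6-v2`.

```python
# Back-of-envelope LoRA parameter count for all-MiniLM-L6-v2.
# Assumptions (illustrative only): rank r=8, adapters on the query and value
# projections only. The model has 6 transformer layers and hidden size 384.
HIDDEN = 384
LAYERS = 6
RANK = 8
ADAPTED_MATRICES_PER_LAYER = 2  # query and value projections

# Each adapted weight gets two low-rank factors: A (r x hidden) and B (hidden x r).
params_per_matrix = RANK * HIDDEN + HIDDEN * RANK
lora_params = LAYERS * ADAPTED_MATRICES_PER_LAYER * params_per_matrix

FULL_PARAMS = 22_000_000  # approximate model size quoted in the note above
print(f"LoRA trainable params: {lora_params:,}")
print(f"Share of full model:   {lora_params / FULL_PARAMS:.2%}")
```

Under these assumptions the adapter is well under 1% of the model, but since all 22M weights already fit comfortably on a T4, that saving buys little here.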

### Pipeline

```
NB03 artifacts → Generate Triplets → Fine-Tune (Triplet Loss, full params)
   → Evaluate Base vs Fine-Tuned (P@5) → Benchmark ChromaDB vs FAISS
   → Export Fine-Tuned Model
```

### Outputs

| Artifact | Description |
|----------|-------------|
| `models/finetuned-shoptalk-emb/` | Fine-tuned SentenceTransformer model |
| `finetuned_text_index.npy` | Re-embedded product index with fine-tuned model |
| `finetune_results.json` | Training metrics, P@5 comparison, benchmark results |
| `finetune_alpha_sweep.csv` | P@5 of the fine-tuned model at each text/image fusion weight (alpha sweep) |

---

## Step 0 – Environment Setup

In [None]:
# ============================================================
# Step 0: Environment Setup
# ============================================================
import sys, os, json, time, re, warnings, random
from pathlib import Path
from typing import Optional, List, Dict, Tuple
from datetime import datetime
from collections import defaultdict

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

print(f"Python {sys.version}")

import torch
import numpy as np
import pandas as pd

DEVICE = (
    "cuda" if torch.cuda.is_available() else
    "mps" if hasattr(torch.backends, "mps") and torch.backends.mps.is_available() else
    "cpu"
)
GPU_NAME = torch.cuda.get_device_name(0) if DEVICE == "cuda" else DEVICE
print(f"PyTorch {torch.__version__}")
print(f"Device: {DEVICE} ({GPU_NAME})")

ON_KAGGLE = Path("/kaggle/working").exists()
print(f"Platform: {'Kaggle' if ON_KAGGLE else 'Local'}")

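# Install deps (pip name -> import name, since the two can differ, e.g. faiss-cpu -> faiss)
def _install(pkgs: Dict[str, str]):
    import importlib, subprocess
    for pip_name, import_name in pkgs.items():
        try:
            importlib.import_module(import_name)
        except ImportError:
            subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pip_name])

_install({
    "sentence-transformers": "sentence_transformers",
    "faiss-cpu": "faiss",
    "chromadb": "chromadb",
})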

print("\n\u2713 Environment ready")

---

## Step 1 – Load Data & Artifacts

In [None]:
# ============================================================
# Step 1: Load NB03 Artifacts
# ============================================================

_candidates = [
    Path("/kaggle/input/shoptalk-rag-prototype"),
    Path("/kaggle/working"),
    Path("../data"),
    Path("."),
]
DATA_DIR = None
for d in _candidates:
    if (d / "rag_products.pkl").exists():
        DATA_DIR = d
        break
assert DATA_DIR is not None, "Cannot find NB03 artifacts. Run 03-rag-prototype.ipynb first."

EXPORT_DIR = Path("/kaggle/working") if ON_KAGGLE else Path("../data")
EXPORT_DIR.mkdir(parents=True, exist_ok=True)
MODEL_DIR = EXPORT_DIR / "models" / "finetuned-shoptalk-emb"
MODEL_DIR.mkdir(parents=True, exist_ok=True)

print(f"Data directory:   {DATA_DIR}")
print(f"Export directory:  {EXPORT_DIR}")
print(f"Model directory:  {MODEL_DIR}")

# Load data
df = pd.read_pickle(DATA_DIR / "rag_products.pkl")
TEXT_INDEX_BASE = np.load(DATA_DIR / "rag_text_index.npy")
IMAGE_INDEX = np.load(DATA_DIR / "rag_image_index.npy")

with open(DATA_DIR / "rag_config.json") as f:
    RAG_CONFIG = json.load(f)

print(f"\nProducts:    {len(df):,}")
print(f"Text index:  {TEXT_INDEX_BASE.shape}")
print(f"Image index: {IMAGE_INDEX.shape}")
print(f"Categories:  {df['product_type_flat'].nunique()}")

# Category distribution
cat_counts = df["product_type_flat"].value_counts()
print(f"\nTop 10 categories:")
for cat, cnt in cat_counts.head(10).items():
    print(f"  {cat:35s}  {cnt:5d}")

print(f"\n\u2713 Artifacts loaded")

---

## Step 2 – Generate Training Triplets

Per `requirements.md`: *"Use Triplet Loss to train the embedding model on the ABO dataset."*

**Triplet structure:**  
- **Anchor**: Product enriched text  
- **Positive**: Another product in the SAME category  
- **Negative**: A product from a DIFFERENT category  

This teaches the model that products in the same category should be closer  
in embedding space than products from different categories.
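
As a minimal sketch of the objective behind this structure, the snippet below evaluates the standard triplet hinge loss, `max(0, d(a, p) - d(a, n) + margin)`, with cosine distance on hypothetical 2-D vectors. The margin of 0.5 matches the value set in the training cell of Step 3; the vectors are made up for illustration.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss: zero once the negative is at least `margin` farther than the positive."""
    return max(0.0, cosine_distance(anchor, positive) - cosine_distance(anchor, negative) + margin)

# Hypothetical toy embeddings (not real model outputs)
anchor = [1.0, 0.0]
positive = [0.8, 0.6]       # cosine distance to anchor: 0.2
hard_negative = [0.6, 0.8]  # cosine distance to anchor: 0.4
easy_negative = [0.0, 1.0]  # cosine distance to anchor: 1.0

loss_hard = triplet_loss(anchor, positive, hard_negative)  # 0.2 - 0.4 + 0.5 = 0.3
loss_easy = triplet_loss(anchor, positive, easy_negative)  # 0.2 - 1.0 + 0.5 < 0, so 0.0
print(f"hard negative loss: {loss_hard:.2f}")
print(f"easy negative loss: {loss_easy:.2f}")
```

The easy negative already satisfies the margin and contributes no gradient, which is the motivation for mixing in negatives from similar categories.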

In [None]:
# ============================================================
# Step 2: Generate Training Triplets
# ============================================================

from sentence_transformers import InputExample

# Group products by category (skip rows with a missing category or very short text)
category_products = defaultdict(list)
for idx, row in df.iterrows():
    cat = row.get("product_type_flat")
    text = str(row.get("enriched_text", ""))
    if pd.notna(cat) and cat and len(text) > 20:
        category_products[cat].append((idx, text))

# Filter categories with at least 2 products (needed for positive pairs)
valid_categories = {k: v for k, v in category_products.items() if len(v) >= 2}
all_categories = list(valid_categories.keys())

print(f"Valid categories (>= 2 products): {len(valid_categories)}")
print(f"Total products in valid categories: {sum(len(v) for v in valid_categories.values()):,}")

# --- Generate triplets ---
random.seed(42)
np.random.seed(42)

NUM_TRIPLETS = 20000  # Target number of triplets
HARD_NEGATIVE_RATIO = 0.3  # Probability of attempting a hard negative (only anchors with a mapped similar category get one)

# Build hard negative map (categories that are semantically similar)
SIMILAR_CATEGORIES = {
    "SHOES": ["SANDAL", "BOOTS"],
    "SHIRT": ["T_SHIRT", "POLO"],
    "T_SHIRT": ["SHIRT"],
    "HOME": ["HOME_BED_AND_BATH", "FURNITURE"],
    "HOME_BED_AND_BATH": ["HOME"],
    "HARDWARE": ["HARDWARE_HANDLE"],
    "HARDWARE_HANDLE": ["HARDWARE"],
    "CHAIR": ["OTTOMAN", "FURNITURE"],
    "TABLE": ["FURNITURE"],
    "OTTOMAN": ["CHAIR", "FURNITURE"],
}

triplets = []
triplet_texts = []  # For InputExample format
n_hard = 0  # Triplets that actually received a hard negative

for _ in range(NUM_TRIPLETS):
    # Pick anchor category
    anchor_cat = random.choice(all_categories)
    anchor_products = valid_categories[anchor_cat]

    # Pick anchor and positive (same category, different product)
    anchor_idx, positive_idx = random.sample(range(len(anchor_products)), 2)
    anchor_text = anchor_products[anchor_idx][1]
    positive_text = anchor_products[positive_idx][1]

    # Pick negative (different category)
    neg_cat = None
    if random.random() < HARD_NEGATIVE_RATIO:
        # Hard negative: from a similar category, if one is mapped and present
        hard_cats = [c for c in SIMILAR_CATEGORIES.get(anchor_cat, []) if c in valid_categories]
        if hard_cats:
            neg_cat = random.choice(hard_cats)
            n_hard += 1
    if neg_cat is None:
        # Random negative: from any different category
        neg_cat = random.choice([c for c in all_categories if c != anchor_cat])

    neg_products = valid_categories[neg_cat]
    negative_text = random.choice(neg_products)[1]

    triplets.append((anchor_text, positive_text, negative_text))
    triplet_texts.append(InputExample(
        texts=[anchor_text, positive_text, negative_text]
    ))

print(f"\n\u2713 Generated {len(triplets):,} triplets")
# Report the realized share: the target ratio only applies to anchors with a mapped similar category
print(f"  Hard negatives: {n_hard:,} ({n_hard / len(triplets):.1%} of triplets; target ratio {HARD_NEGATIVE_RATIO:.0%})")
print(f"  Anchor text sample: {triplets[0][0][:100]}...")
print(f"  Positive sample:    {triplets[0][1][:100]}...")
print(f"  Negative sample:    {triplets[0][2][:100]}...")

# Train/val split
split_idx = int(len(triplet_texts) * 0.9)
train_examples = triplet_texts[:split_idx]
val_examples = triplet_texts[split_idx:]
print(f"\n  Train: {len(train_examples):,} | Val: {len(val_examples):,}")

---

## Step 3 – Fine-Tune Embedding Model

Fine-tune `all-MiniLM-L6-v2` with **TripletLoss** (full parameter update).  
MiniLM is compact (22M params) so full fine-tuning runs efficiently on Kaggle T4 GPUs.
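
For orientation, the arithmetic below works out the training schedule implied by the notebook's own settings on a CUDA device (20,000 triplets, 90% train split, batch size 64, 3 epochs, 10% warmup). It is a standalone sketch that mirrors the formulas in the training cell; it does not read any state from the notebook.

```python
import math

# Values copied from the notebook's configuration (CUDA batch size).
NUM_TRIPLETS = 20_000
TRAIN_FRACTION = 0.9
BATCH_SIZE = 64
NUM_EPOCHS = 3
WARMUP_RATIO = 0.1

n_train = int(NUM_TRIPLETS * TRAIN_FRACTION)         # training triplets after the split
batches_per_epoch = math.ceil(n_train / BATCH_SIZE)  # DataLoader keeps the last partial batch
total_steps = batches_per_epoch * NUM_EPOCHS
warmup_steps = int(total_steps * WARMUP_RATIO)       # linear warmup length
evaluation_steps = batches_per_epoch // 2            # validation runs about twice per epoch

print(f"train triplets:    {n_train:,}")
print(f"batches per epoch: {batches_per_epoch}")
print(f"total steps:       {total_steps}")
print(f"warmup steps:      {warmup_steps}")
print(f"evaluation every:  {evaluation_steps} steps")
```

On CPU or MPS the batch size drops to 16, so each of these step counts is roughly four times larger.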

In [None]:
# ============================================================
# Step 3: Fine-Tune with Triplet Loss
# ============================================================

from sentence_transformers import SentenceTransformer, losses, evaluation
from torch.utils.data import DataLoader

BASE_MODEL_ID = RAG_CONFIG.get("text_model_id", "all-MiniLM-L6-v2")

# --- Load base model ---
print(f"Loading base model: {BASE_MODEL_ID}")
ft_model = SentenceTransformer(BASE_MODEL_ID, device=DEVICE)
print(f"  Parameters: {sum(p.numel() for p in ft_model.parameters()):,}")

# --- Training config ---
BATCH_SIZE = 64 if DEVICE == "cuda" else 16
NUM_EPOCHS = 3
LEARNING_RATE = 2e-5
WARMUP_RATIO = 0.1
MARGIN = 0.5  # Triplet margin

# --- DataLoader ---
train_dataloader = DataLoader(
    train_examples,
    shuffle=True,
    batch_size=BATCH_SIZE,
)

# --- Loss function ---
train_loss = losses.TripletLoss(
    model=ft_model,
    distance_metric=losses.TripletDistanceMetric.COSINE,
    triplet_margin=MARGIN,
)

# --- Evaluator (using val set) ---
# Create evaluation anchors, positives, negatives
val_anchors = [ex.texts[0] for ex in val_examples[:500]]
val_positives = [ex.texts[1] for ex in val_examples[:500]]
val_negatives = [ex.texts[2] for ex in val_examples[:500]]

val_evaluator = evaluation.TripletEvaluator(
    anchors=val_anchors,
    positives=val_positives,
    negatives=val_negatives,
    name="shoptalk-val",
    main_distance_function=evaluation.SimilarityFunction.COSINE,  # SimilarityFunction lives in the evaluation module, not on TripletEvaluator
)

# --- Training ---
warmup_steps = int(len(train_dataloader) * NUM_EPOCHS * WARMUP_RATIO)

print(f"\nTraining Configuration:")
print(f"  Batch size:    {BATCH_SIZE}")
print(f"  Epochs:        {NUM_EPOCHS}")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Warmup steps:  {warmup_steps}")
print(f"  Triplet margin:{MARGIN}")
print(f"  Train batches: {len(train_dataloader)}")
print(f"  Device:        {DEVICE}")

print(f"\nStarting fine-tuning...")
t_start = time.time()

ft_model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=val_evaluator,
    epochs=NUM_EPOCHS,
    warmup_steps=warmup_steps,
    optimizer_params={"lr": LEARNING_RATE},
    output_path=str(MODEL_DIR),
    evaluation_steps=len(train_dataloader) // 2,
    save_best_model=True,
    show_progress_bar=True,
)

train_time = time.time() - t_start
print(f"\n\u2713 Fine-tuning complete in {train_time:.1f}s ({train_time/60:.1f} min)")
print(f"  Model saved to: {MODEL_DIR}")

---

## Step 4 – Re-Embed Products with Fine-Tuned Model

In [None]:
# ============================================================
# Step 4: Re-Embed with Fine-Tuned Model
# ============================================================

print("Loading fine-tuned model...")
ft_model = SentenceTransformer(str(MODEL_DIR), device=DEVICE)
print(f"  Parameters: {sum(p.numel() for p in ft_model.parameters()):,}")

# Also load base model for comparison
print(f"Loading base model: {BASE_MODEL_ID}")
base_model = SentenceTransformer(BASE_MODEL_ID, device=DEVICE)

# --- Re-embed all products ---
texts = df["enriched_text"].fillna("").tolist()

print(f"\nEmbedding {len(texts):,} products with fine-tuned model...")
t0 = time.time()
ft_embeddings = ft_model.encode(
    texts,
    batch_size=256 if DEVICE == "cuda" else 64,
    show_progress_bar=True,
    normalize_embeddings=True,
)
ft_embed_time = time.time() - t0
TEXT_INDEX_FT = ft_embeddings.astype(np.float32)

print(f"\n\u2713 Fine-tuned embeddings: {TEXT_INDEX_FT.shape} in {ft_embed_time:.1f}s")

# Save
ft_index_path = EXPORT_DIR / "finetuned_text_index.npy"
np.save(ft_index_path, TEXT_INDEX_FT)
print(f"\u2713 Saved to {ft_index_path.name} ({ft_index_path.stat().st_size / 1e6:.1f} MB)")

---

## Step 5 – Evaluate: Base vs Fine-Tuned (Precision@5)

Per `requirements.md`: *"Benchmark retrieval precision of Base Embedding Model vs. Fine-Tuned Model."*
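
The metric itself is simple to state. The sketch below computes Precision@5 for a single query on hypothetical candidates, ranking them with the same weighted text/image blend used in the evaluation cell (text weight `alpha = 0.6`). The categories and similarity values are invented for illustration.

```python
def precision_at_k(query_category, ranked_categories, k=5):
    """Fraction of the top-k retrieved items whose category matches the query's."""
    top_k = ranked_categories[:k]
    hits = sum(1 for cat in top_k if cat == query_category)
    return hits / k

def fuse_scores(text_sim, image_sim, alpha=0.6):
    """Weighted blend of text and image similarity; alpha is the text weight."""
    return alpha * text_sim + (1.0 - alpha) * image_sim

# Hypothetical candidates: (category, text similarity, image similarity)
candidates = [
    ("CHAIR", 0.90, 0.70),
    ("OTTOMAN", 0.85, 0.95),
    ("CHAIR", 0.80, 0.60),
    ("TABLE", 0.40, 0.50),
    ("CHAIR", 0.75, 0.80),
    ("SHOES", 0.10, 0.20),
]

# Rank by fused score, highest first, then score the top 5 against the query category
ranked = sorted(candidates, key=lambda c: fuse_scores(c[1], c[2]), reverse=True)
ranked_categories = [cat for cat, _, _ in ranked]
p_at_5 = precision_at_k("CHAIR", ranked_categories, k=5)
print(ranked_categories[:5], p_at_5)
```

Here three of the top five results are chairs, so P@5 is 0.6. Note that category match is only a proxy for relevance: a visually similar ottoman counts as a miss.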

In [None]:
# ============================================================
# Step 5: Precision@5 Evaluation — Base vs Fine-Tuned
# ============================================================

def evaluate_precision_at_k(
    text_index: np.ndarray,
    image_index: np.ndarray,
    df_data: pd.DataFrame,
    alpha: float = 0.6,
    k: int = 5,
    n_queries: int = 200,
    seed: int = 42,
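) -> Dict:
    """Evaluate P@K: fraction of top-K results sharing the query's category.

    Each sampled product acts as its own query. Text and image similarities are
    blended as alpha * text + (1 - alpha) * image. Both indexes are assumed to
    be L2-normalized so the dot product equals cosine similarity.
    """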
    rng = np.random.RandomState(seed)
    categories = df_data["product_type_flat"].values
    valid_mask = pd.notna(categories)
    valid_indices = np.where(valid_mask)[0]

    if len(valid_indices) < n_queries:
        n_queries = len(valid_indices)

    query_indices = rng.choice(valid_indices, size=n_queries, replace=False)

    precisions = []
    for qi in query_indices:
        q_text = text_index[qi:qi+1]
        q_image = image_index[qi:qi+1]

        text_sim = (text_index @ q_text.T).squeeze()
        image_sim = (image_index @ q_image.T).squeeze()
        scores = alpha * text_sim + (1.0 - alpha) * image_sim
        scores[qi] = -np.inf  # Exclude self

        top_k_idx = np.argsort(scores)[::-1][:k]
        query_cat = categories[qi]
        hits = sum(1 for idx in top_k_idx if categories[idx] == query_cat)
        precisions.append(hits / k)

    return {
        "mean_p_at_k": np.mean(precisions),
        "std_p_at_k": np.std(precisions),
        "median_p_at_k": np.median(precisions),
        "n_queries": n_queries,
        "k": k,
    }


# --- Evaluate both models ---
N_EVAL = 200

print(f"Evaluating P@5 with {N_EVAL} queries...\n")

# Base model
t0 = time.time()
base_results = evaluate_precision_at_k(
    TEXT_INDEX_BASE, IMAGE_INDEX, df, alpha=0.6, n_queries=N_EVAL
)
base_eval_time = time.time() - t0

# Fine-tuned model
t0 = time.time()
ft_results = evaluate_precision_at_k(
    TEXT_INDEX_FT, IMAGE_INDEX, df, alpha=0.6, n_queries=N_EVAL
)
ft_eval_time = time.time() - t0

# --- Alpha sweep for fine-tuned ---
print("Running alpha sweep for fine-tuned model...")
alpha_sweep_ft = []
for alpha in np.arange(0.0, 1.05, 0.1):
    r = evaluate_precision_at_k(TEXT_INDEX_FT, IMAGE_INDEX, df, alpha=alpha, n_queries=100)
    alpha_sweep_ft.append({"alpha": round(alpha, 1), "P@5": r["mean_p_at_k"]})

df_sweep_ft = pd.DataFrame(alpha_sweep_ft)
best_ft = df_sweep_ft.loc[df_sweep_ft["P@5"].idxmax()]

# --- Results ---
print(f"\n{'=' * 70}")
print("PRECISION@5 COMPARISON: Base vs Fine-Tuned")
print(f"{'=' * 70}")
print(f"\n  Base Model ({BASE_MODEL_ID}):")
print(f"    P@5 = {base_results['mean_p_at_k']:.4f} \u00b1 {base_results['std_p_at_k']:.4f}")
print(f"    Eval time: {base_eval_time:.2f}s")
print(f"\n  Fine-Tuned Model:")
print(f"    P@5 = {ft_results['mean_p_at_k']:.4f} \u00b1 {ft_results['std_p_at_k']:.4f}")
print(f"    Eval time: {ft_eval_time:.2f}s")
print(f"    Best alpha (sweep): {best_ft['alpha']:.1f} -> P@5 = {best_ft['P@5']:.4f}")

improvement = ft_results['mean_p_at_k'] - base_results['mean_p_at_k']
pct_improvement = improvement / base_results['mean_p_at_k'] * 100
print(f"\n  \u0394 P@5: {improvement:+.4f} ({pct_improvement:+.1f}%)")
_ft_verdict = "✅ Fine-tuning improved retrieval!" if improvement > 0 else "⚠ No improvement (try more epochs or larger dataset)"
print(f"  {_ft_verdict}")

---

## Step 6 – Vector Store Benchmark: ChromaDB vs FAISS

Per `requirements.md`: *"Compare latency/throughput of ChromaDB vs. Milvus (or FAISS)."*
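
Both quantities come from the same wall-clock measurement. The helper below is a minimal sketch of how latency and throughput relate when queries are issued one at a time; `measure_search` and the brute-force stand-in index are hypothetical names for illustration, not part of FAISS or ChromaDB.

```python
import time

def measure_search(search_fn, queries, warmup=5):
    """Time search_fn over queries; return mean latency (ms) and throughput (QPS)."""
    for q in queries[:warmup]:
        search_fn(q)  # Warm-up calls are excluded from the timing
    start = time.perf_counter()
    for q in queries:
        search_fn(q)
    elapsed = time.perf_counter() - start
    return {
        "ms_per_query": elapsed / len(queries) * 1000,
        "qps": len(queries) / elapsed,
    }

# Hypothetical stand-in for a vector store: brute-force inner product over a tiny index
index = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

def brute_force_top1(query):
    """Return the position of the index vector with the highest inner product."""
    scores = [sum(a * b for a, b in zip(vec, query)) for vec in index]
    return max(range(len(index)), key=scores.__getitem__)

stats = measure_search(brute_force_top1, [[0.7, 0.7]] * 1000)
print(f"{stats['ms_per_query']:.4f} ms/query, {stats['qps']:.0f} QPS")
```

With sequential single-query calls, QPS is just the reciprocal of mean latency. Batched or concurrent querying would report higher throughput at the same per-query latency.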

In [None]:
# ============================================================
# Step 6: ChromaDB vs FAISS Benchmark
# ============================================================

import faiss
import chromadb

# Use fine-tuned embeddings for the benchmark
BENCH_INDEX = TEXT_INDEX_FT
N_PRODUCTS = BENCH_INDEX.shape[0]
EMBED_DIM = BENCH_INDEX.shape[1]
N_QUERIES = 100
TOP_K = 5

# Sample existing product embeddings to act as queries (a proxy for real search queries)
np.random.seed(42)
query_indices = np.random.choice(N_PRODUCTS, size=N_QUERIES, replace=False)
query_embeddings = BENCH_INDEX[query_indices]

print(f"Benchmark setup: {N_PRODUCTS:,} products, {EMBED_DIM}d, {N_QUERIES} queries, top-{TOP_K}")

# ============================================================
# FAISS Benchmark
# ============================================================

# --- FAISS Flat (exact) ---
print("\n--- FAISS (Flat, exact search) ---")
t0 = time.time()
faiss_flat = faiss.IndexFlatIP(EMBED_DIM)  # Inner product (cosine for normalized vectors)
faiss_flat.add(BENCH_INDEX)
faiss_build_flat = time.time() - t0
print(f"  Index build: {faiss_build_flat:.3f}s")

t0 = time.time()
for qe in query_embeddings:
    faiss_flat.search(qe.reshape(1, -1), TOP_K)
faiss_query_flat = time.time() - t0
faiss_qps_flat = N_QUERIES / faiss_query_flat
print(f"  Query time:  {faiss_query_flat:.3f}s total | {faiss_query_flat/N_QUERIES*1000:.2f}ms/query")
print(f"  QPS:         {faiss_qps_flat:.0f}")

# --- FAISS IVF (approximate) ---
print("\n--- FAISS (IVF, approximate search) ---")
n_clusters = max(1, min(100, N_PRODUCTS // 10))  # At least one cluster, even for tiny datasets
quantizer = faiss.IndexFlatIP(EMBED_DIM)
faiss_ivf = faiss.IndexIVFFlat(quantizer, EMBED_DIM, n_clusters, faiss.METRIC_INNER_PRODUCT)

t0 = time.time()
faiss_ivf.train(BENCH_INDEX)
faiss_ivf.add(BENCH_INDEX)
faiss_build_ivf = time.time() - t0
faiss_ivf.nprobe = 10
print(f"  Index build: {faiss_build_ivf:.3f}s ({n_clusters} clusters)")

t0 = time.time()
for qe in query_embeddings:
    faiss_ivf.search(qe.reshape(1, -1), TOP_K)
faiss_query_ivf = time.time() - t0
faiss_qps_ivf = N_QUERIES / faiss_query_ivf
print(f"  Query time:  {faiss_query_ivf:.3f}s total | {faiss_query_ivf/N_QUERIES*1000:.2f}ms/query")
print(f"  QPS:         {faiss_qps_ivf:.0f}")

# ============================================================
# ChromaDB Benchmark
# ============================================================
print("\n--- ChromaDB (Persistent) ---")

chroma_bench_dir = EXPORT_DIR / "chroma_bench"
chroma_bench_dir.mkdir(exist_ok=True)

t0 = time.time()
chroma_client = chromadb.PersistentClient(path=str(chroma_bench_dir))

# Delete existing collection if present
try:
    chroma_client.delete_collection("bench_text")
except Exception:
    pass  # Collection did not exist yet

collection = chroma_client.create_collection(
    name="bench_text",
    metadata={"hnsw:space": "cosine"},
)

# Add embeddings in batches
CHROMA_BATCH = 1024
ids = [str(i) for i in range(N_PRODUCTS)]
for start in range(0, N_PRODUCTS, CHROMA_BATCH):
    end = min(start + CHROMA_BATCH, N_PRODUCTS)
    collection.add(
        ids=ids[start:end],
        embeddings=BENCH_INDEX[start:end].tolist(),
    )
chroma_build = time.time() - t0
print(f"  Index build: {chroma_build:.3f}s")

t0 = time.time()
for qe in query_embeddings:
    collection.query(query_embeddings=[qe.tolist()], n_results=TOP_K)
chroma_query = time.time() - t0
chroma_qps = N_QUERIES / chroma_query
print(f"  Query time:  {chroma_query:.3f}s total | {chroma_query/N_QUERIES*1000:.2f}ms/query")
print(f"  QPS:         {chroma_qps:.0f}")

# Cleanup
import shutil
shutil.rmtree(chroma_bench_dir, ignore_errors=True)

# ============================================================
# Summary
# ============================================================
print(f"\n{'=' * 70}")
print("VECTOR STORE BENCHMARK SUMMARY")
print(f"{'=' * 70}")
print(f"{'Backend':25s} {'Build (s)':>10s} {'Query (ms)':>12s} {'QPS':>8s}")
_LINE = "─" * 55
print(_LINE)
print(f"{'FAISS Flat (exact)':25s} {faiss_build_flat:>10.3f} {faiss_query_flat/N_QUERIES*1000:>12.2f} {faiss_qps_flat:>8.0f}")
print(f"{'FAISS IVF (approx)':25s} {faiss_build_ivf:>10.3f} {faiss_query_ivf/N_QUERIES*1000:>12.2f} {faiss_qps_ivf:>8.0f}")
print(f"{'ChromaDB (persistent)':25s} {chroma_build:>10.3f} {chroma_query/N_QUERIES*1000:>12.2f} {chroma_qps:>8.0f}")

benchmark_results = {
    "faiss_flat": {"build_s": faiss_build_flat, "query_ms": faiss_query_flat/N_QUERIES*1000, "qps": faiss_qps_flat},
    "faiss_ivf": {"build_s": faiss_build_ivf, "query_ms": faiss_query_ivf/N_QUERIES*1000, "qps": faiss_qps_ivf},
    "chromadb": {"build_s": chroma_build, "query_ms": chroma_query/N_QUERIES*1000, "qps": chroma_qps},
}

---

## Step 7 – Export Results & Fine-Tuned Model

In [None]:
# ============================================================
# Step 7: Export Everything
# ============================================================

# 1. Save comprehensive results
results = {
    "notebook": "05-fine-tuning",
    "timestamp": datetime.now().isoformat(),
    "base_model": BASE_MODEL_ID,
    "finetuned_model_path": str(MODEL_DIR),
    "training": {
        "n_triplets": len(triplets),
        "n_train": len(train_examples),
        "n_val": len(val_examples),
        "epochs": NUM_EPOCHS,
        "batch_size": BATCH_SIZE,
        "learning_rate": LEARNING_RATE,
        "triplet_margin": MARGIN,
        "hard_negative_ratio": HARD_NEGATIVE_RATIO,
        "train_time_s": round(train_time, 1),
    },
    "evaluation": {
        "n_queries": N_EVAL,
        "base_p_at_5": round(float(base_results["mean_p_at_k"]), 4),
        "base_p_at_5_std": round(float(base_results["std_p_at_k"]), 4),
        "finetuned_p_at_5": round(float(ft_results["mean_p_at_k"]), 4),
        "finetuned_p_at_5_std": round(float(ft_results["std_p_at_k"]), 4),
        "improvement": round(float(improvement), 4),
        "improvement_pct": round(float(pct_improvement), 1),
        "best_ft_alpha": round(float(best_ft["alpha"]), 1),
        "best_ft_p_at_5": round(float(best_ft["P@5"]), 4),
    },
    "vector_store_benchmark": benchmark_results,
    "device": str(DEVICE),
    "gpu_name": GPU_NAME,
}

results_path = EXPORT_DIR / "finetune_results.json"
with open(results_path, "w") as f:
    json.dump(results, f, indent=2, default=str)
print(f"\u2713 {results_path.name:35s}  {results_path.stat().st_size / 1e3:.1f} KB")

# 2. Alpha sweep results
sweep_path = EXPORT_DIR / "finetune_alpha_sweep.csv"
df_sweep_ft.to_csv(sweep_path, index=False)
print(f"\u2713 {sweep_path.name:35s}  {sweep_path.stat().st_size / 1e3:.1f} KB")

# 3. Fine-tuned embeddings (already saved above)
print(f"\u2713 {'finetuned_text_index.npy':35s}  {ft_index_path.stat().st_size / 1e6:.1f} MB")

# 4. Model directory listing
print(f"\nFine-tuned model files in {MODEL_DIR}/:")
for p in sorted(MODEL_DIR.rglob("*")):
    if p.is_file():
        print(f"  {p.relative_to(MODEL_DIR)}  {p.stat().st_size / 1e6:.1f} MB")

print(f"\n\u2713 All artifacts exported")

In [None]:
# ============================================================
# Final Summary
# ============================================================

print("=" * 60)
print("SHOPTALK FINE-TUNING \u2014 SUMMARY")
print("=" * 60)

print("\n--- Training ---")
print(f"  Base model:       {BASE_MODEL_ID}")
print(f"  Triplets:         {len(triplets):,}")
print(f"  Epochs:           {NUM_EPOCHS}")
print(f"  Training time:    {train_time:.1f}s")
print(f"  Device:           {DEVICE} ({GPU_NAME})")

print("\n--- Retrieval Quality (P@5) ---")
print(f"  Base model:       {base_results['mean_p_at_k']:.4f}")
print(f"  Fine-tuned:       {ft_results['mean_p_at_k']:.4f}")
print(f"  Improvement:      {improvement:+.4f} ({pct_improvement:+.1f}%)")

print("\n--- Vector Store Benchmark ---")
print(f"  FAISS Flat:       {faiss_query_flat/N_QUERIES*1000:.2f}ms/query  ({faiss_qps_flat:.0f} QPS)")
print(f"  FAISS IVF:        {faiss_query_ivf/N_QUERIES*1000:.2f}ms/query  ({faiss_qps_ivf:.0f} QPS)")
print(f"  ChromaDB:         {chroma_query/N_QUERIES*1000:.2f}ms/query  ({chroma_qps:.0f} QPS)")

print("\n--- Exports ---")
print(f"  Fine-tuned model: {MODEL_DIR}")
print(f"  Embeddings:       finetuned_text_index.npy")
print(f"  Results:          finetune_results.json")

print("\n" + "=" * 60)
print("Fine-tuning complete. Model and artifacts exported for downstream use.")
print("=" * 60)