# Mining Hard Negatives for Asymmetric Search with ChromaDB

This notebook demonstrates how to:

1. Load an initial **training dataset** of question–chunk pairs from `rag_train_dataset.csv`.
2. Use a **Hugging Face embedding model** (`sentence-transformers/msmarco-distilbert-cos-v5`) to:
   - embed all `chunk_preview` texts,
   - store them in a **ChromaDB** vector database.
3. For each question, embed the question and query ChromaDB to **mine hard negatives**
   (3–5 similar but incorrect chunks per question).
4. Build a **new dataset** in triplet form:
   - `question`
   - `chunk_preview` (the correct / positive chunk)
   - `hard_negative_chunk` (one hard negative per row)

This triplet dataset can be used later to fine-tune the embedding model with
contrastive or triplet-style loss functions.
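
To make the later fine-tuning goal concrete, here is a minimal sketch of a triplet margin loss evaluated on a single triplet. It is illustrative only: this notebook does not fix a loss function, and the cosine distance, the `margin=0.2` value, and the toy 2-D vectors are all assumptions chosen for the example.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is at least `margin` farther away than the positive."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)

# Toy embeddings (hypothetical): question, its positive chunk, and two negatives
question = [1.0, 0.0]
positive = [1.0, 0.0]
easy_negative = [0.0, 1.0]    # unrelated chunk, far from the question
hard_negative = [0.96, 0.28]  # similar but incorrect chunk, close to the question

loss_easy = triplet_margin_loss(question, positive, easy_negative)
loss_hard = triplet_margin_loss(question, positive, hard_negative)
print(loss_easy, loss_hard)
```

The easy negative already sits beyond the margin and contributes zero loss, while the hard negative yields a positive loss. That difference is why mined hard negatives tend to give a more useful training signal than random ones.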


In [None]:
# ============================================================================
# OPTIONAL: Install dependencies
# ============================================================================
# Uncomment and run this cell if you do NOT already have these libraries.
# In many environments (e.g., Colab) you will need to install them first.
# ----------------------------------------------------------------------------

# !pip install -U chromadb transformers sentencepiece torch pandas

In [1]:
# ============================================================================
# Imports and Configuration
# ============================================================================
# This section imports all the Python packages we will use and sets up
# some basic configuration such as the model name and random seeds.
# ----------------------------------------------------------------------------

import random
from typing import List

import pandas as pd
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

import chromadb
from chromadb.config import Settings

# Set a random seed for reproducibility of any sampling we might do
RANDOM_SEED = 42
random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

# Device configuration: use GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Path to the initial training data CSV.
# This CSV is expected to have at least:
#   - a 'question' column
#   - a 'chunk_preview' column
CSV_PATH = "rag_train_dataset.csv"

# Name of the Hugging Face embedding model we will use.
# It was trained on MS MARCO for asymmetric semantic search
# (short query -> longer passage) and uses one shared encoder for both sides.
MODEL_NAME = "sentence-transformers/msmarco-distilbert-cos-v5"

# Number of hard negatives we aim to mine per question.
# You can adjust this between 3 and 5 as desired.
NUM_HARD_NEGATIVES = 5

# How many nearest neighbors to retrieve when mining.
# We retrieve more than we need so we can filter out the true positive and
# still have enough candidates for hard negatives.
K_RETRIEVE = 20


Using device: cpu


In [None]:
# ============================================================================
# Load the Initial Training Data
# ============================================================================
# We now load `rag_train_dataset.csv` which should contain at least two columns:
#   - 'question'       : the user query / question text
#   - 'chunk_preview'  : the relevant passage / chunk text
# 
# If your actual column names differ, update the code below accordingly.
# ----------------------------------------------------------------------------

df = pd.read_csv(CSV_PATH)

# Basic sanity checks and preview
print("Loaded dataset shape:", df.shape)
print("Columns:", df.columns.tolist())

# Show the first few rows so we can verify the structure
df.head()


In [None]:
# ============================================================================
# Assign Unique IDs to Each Chunk
# ============================================================================
# ChromaDB requires each document / vector to have a unique string ID.
# Here we create a simple 'doc_id' for each row based on its index.
# We also make sure there are no missing values in the key columns.
# ----------------------------------------------------------------------------

# Drop any rows where question or chunk_preview is missing to avoid errors
df = df.dropna(subset=["question", "chunk_preview"]).reset_index(drop=True)

# Create a unique string ID per row (used as the Chroma document ID)
df["doc_id"] = df.index.map(lambda i: f"doc_{i}")

print("Dataset size after dropping NaNs:", df.shape)
df.head()
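
Because `doc_id` is derived from the row index, two rows that share the same `chunk_preview` text receive different IDs. A quick check such as the sketch below can reveal this before mining, since an ID-only comparison would not recognize a duplicate of the positive chunk. The helper name `find_duplicate_texts` and the toy data are hypothetical.

```python
from collections import defaultdict

def find_duplicate_texts(ids, texts):
    """Map each text that occurs more than once to the list of IDs that carry it."""
    groups = defaultdict(list)
    for doc_id, text in zip(ids, texts):
        groups[text].append(doc_id)
    return {text: doc_ids for text, doc_ids in groups.items() if len(doc_ids) > 1}

# Toy data: doc_0 and doc_2 hold the same chunk text
toy_ids = ["doc_0", "doc_1", "doc_2"]
toy_texts = ["alpha", "beta", "alpha"]

duplicates = find_duplicate_texts(toy_ids, toy_texts)
print(duplicates)
```

In the notebook this could be called as `find_duplicate_texts(df["doc_id"], df["chunk_preview"])`.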


In [None]:
# ============================================================================
# Load the Embedding Model and Define Encoding Helpers
# ============================================================================
# We load the Hugging Face model and tokenizer, then define a small helper
# to convert raw text into normalized embeddings suitable for cosine similarity.
# 
# The model 'sentence-transformers/msmarco-distilbert-cos-v5' is designed for
# asymmetric search (query vs passage), but we will use the same encoder for
# both questions and passages here.
# ----------------------------------------------------------------------------

# Load tokenizer and model from Hugging Face
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).to(device)
model.eval()  # put the model into evaluation mode (no dropout, etc.)

def mean_pooling(model_output, attention_mask):
    """Perform mean pooling on the token embeddings.

    This function takes the model output (last_hidden_state) and an attention mask,
    and computes a single vector per sequence by averaging the embeddings for
    the tokens that are not masked.

    Args:
        model_output: Output object from the transformer model.
        attention_mask: Tensor of shape (batch_size, seq_len) indicating which
                        tokens are real (1) vs padding (0).

    Returns:
        A tensor of shape (batch_size, hidden_dim) containing the pooled embeddings.
    """
    token_embeddings = model_output.last_hidden_state  # (batch_size, seq_len, hidden_dim)
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Multiply token embeddings by mask, sum over sequence length, and divide by number of valid tokens
    pooled = (token_embeddings * input_mask_expanded).sum(dim=1) / torch.clamp(
        input_mask_expanded.sum(dim=1), min=1e-9
    )
    return pooled

@torch.no_grad()
def encode_texts(texts: List[str], batch_size: int = 32, max_length: int = 256) -> torch.Tensor:
    """Encode a list of texts into L2-normalized embeddings.

    This helper will:
      1. Tokenize the texts in mini-batches.
      2. Run them through the transformer model.
      3. Apply mean pooling.
      4. L2-normalize the resulting embeddings so cosine similarity corresponds
         to dot product.

    Args:
        texts: A list of strings to encode.
        batch_size: Batch size for encoding.
        max_length: Maximum number of tokens per sequence (longer texts are truncated).

    Returns:
        A tensor of shape (len(texts), hidden_dim) containing normalized embeddings.
    """
    all_embeddings = []
    for start_idx in range(0, len(texts), batch_size):
        batch_texts = texts[start_idx:start_idx + batch_size]

        # Tokenize the batch of texts
        encoded = tokenizer(
            batch_texts,
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="pt"
        ).to(device)

        # Forward pass through the model
        model_output = model(**encoded)

        # Mean-pool the token embeddings
        pooled = mean_pooling(model_output, encoded["attention_mask"])

        # L2-normalize the pooled embeddings along the feature dimension
        pooled = nn.functional.normalize(pooled, p=2, dim=-1)

        # Move to CPU to free GPU memory and append
        all_embeddings.append(pooled.cpu())

    # Concatenate all batches into a single tensor
    return torch.cat(all_embeddings, dim=0)
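
The reason for L2-normalizing is easiest to see numerically. For unit-length vectors the dot product equals the cosine similarity, and the squared Euclidean distance equals `2 - 2 * cosine`, so ranking by L2 distance and ranking by cosine similarity produce the same order. This matters if the vector store ranks by L2 distance rather than cosine. The pure-Python sketch below checks the identity on two toy vectors; the helper names are illustrative and not part of the notebook.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])  # approximately [0.6, 0.8]
b = l2_normalize([4.0, 3.0])  # approximately [0.8, 0.6]

cosine = dot(a, b)  # for unit vectors, this is the cosine similarity
squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))

# The two quantities are tied together by: squared_l2 == 2 - 2 * cosine
print(round(cosine, 6), round(squared_l2, 6), round(2 - 2 * cosine, 6))
```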


In [None]:
# ============================================================================
# Initialize ChromaDB and Create a Collection
# ============================================================================
# Here we create an in-memory ChromaDB client and a collection that will store
# our chunk embeddings. We are NOT using an embedding function inside Chroma,
# because we want full control over how we embed with our Hugging Face model.
# Instead, we will compute embeddings ourselves and pass them to Chroma.
# ----------------------------------------------------------------------------

# Create an in-memory Chroma client (no persistent storage by default)
client = chromadb.Client(Settings(anonymized_telemetry=False))

# Name of the collection for our chunks
COLLECTION_NAME = "rag_chunks"

# If a collection with this name already exists from an earlier run, delete it
# so that repeated runs of this cell always start from an empty collection.
try:
    client.delete_collection(COLLECTION_NAME)
except Exception:
    # It's fine if the collection does not exist yet
    pass

# Create a fresh collection
collection = client.create_collection(name=COLLECTION_NAME)
print("Created ChromaDB collection:", collection.name)


In [None]:
# ============================================================================
# Embed All chunk_preview Texts and Add Them to ChromaDB
# ============================================================================
# We now:
#   1. Extract the 'chunk_preview' text for each row.
#   2. Encode them into embeddings using our Hugging Face model.
#   3. Add the embeddings and associated metadata (e.g., question) into ChromaDB.
# 
# This allows us to later query the collection with question embeddings to find
# similar chunks, which we will treat as candidate hard negatives.
# ----------------------------------------------------------------------------

# Extract the list of chunk texts and corresponding IDs
chunk_texts = df["chunk_preview"].tolist()
chunk_ids = df["doc_id"].tolist()

print("Number of chunks to embed:", len(chunk_texts))

# Encode all chunk texts into normalized embeddings
chunk_embeddings = encode_texts(chunk_texts, batch_size=32, max_length=256)
print("Chunk embeddings shape:", chunk_embeddings.shape)

# Convert embeddings to Python lists of floats for ChromaDB
chunk_embeddings_list = chunk_embeddings.tolist()

# Optionally store some metadata for each document, such as the original question
metadatas = [{"question": q} for q in df["question"].tolist()]

# Add all documents, embeddings, and metadata to the Chroma collection
collection.add(
    ids=chunk_ids,
    documents=chunk_texts,
    embeddings=chunk_embeddings_list,
    metadatas=metadatas,
)

print("Added documents to ChromaDB collection.")
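
A single `collection.add` call with every row is fine for small datasets, but some ChromaDB versions enforce a maximum batch size per call, so larger datasets may need to be added in slices. A minimal sketch of the slicing logic follows; `iter_batches` and `ADD_BATCH_SIZE = 1000` are hypothetical choices, not values taken from ChromaDB.

```python
from typing import Iterator, Tuple

def iter_batches(n_items: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs that cover range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)

ADD_BATCH_SIZE = 1000  # hypothetical value; tune to your ChromaDB version's limit

# Example: 2,500 items are split into three slices
spans = list(iter_batches(2500, ADD_BATCH_SIZE))
print(spans)

# In the notebook, the slices would be applied to the parallel lists, e.g.:
# for start, end in iter_batches(len(chunk_ids), ADD_BATCH_SIZE):
#     collection.add(
#         ids=chunk_ids[start:end],
#         documents=chunk_texts[start:end],
#         embeddings=chunk_embeddings_list[start:end],
#         metadatas=metadatas[start:end],
#     )
```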


In [None]:
# ============================================================================
# Mine Hard Negatives from ChromaDB
# ============================================================================
# For each question:
#   1. Encode the question into an embedding.
#   2. Use ChromaDB to retrieve the top-K most similar chunks.
#   3. Exclude the *true positive* chunk for that question (its own 'doc_id').
#   4. Take up to NUM_HARD_NEGATIVES of the remaining results as hard negatives.
# 
# These are "hard" because they are semantically close to the question but are
# NOT the correct chunk in our training data.
# ----------------------------------------------------------------------------

# We will store a list of hard negatives (as strings) per question
all_hard_negatives: List[List[str]] = []

questions = df["question"].tolist()
pos_ids = df["doc_id"].tolist()

print("Starting hard negative mining for", len(questions), "questions...")

for i, (question, pos_id) in enumerate(zip(questions, pos_ids)):
    # Encode the single question into an embedding (shape: (1, hidden_dim))
    q_emb = encode_texts([question], batch_size=1, max_length=256)
    q_emb_list = q_emb[0].tolist()  # Convert to a plain list for ChromaDB

    # Query ChromaDB to get top K_RETRIEVE nearest chunks
    # We pass the question embedding as the query embedding
    results = collection.query(
        query_embeddings=[q_emb_list],
        n_results=K_RETRIEVE,
        include=["documents"]  # ids are always returned; "ids" is not a valid include value
    )

    # Retrieve the lists of ids and documents from the query results
    retrieved_ids = results["ids"][0]         # list of doc_ids
    retrieved_docs = results["documents"][0]  # list of chunk texts

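    # Build a list of candidate hard negatives, skipping the true positive.
    # Rows that share the same chunk text have different doc_ids, so we also
    # compare texts; otherwise a copy of the positive could slip in as a negative.
    pos_text = df.at[i, "chunk_preview"]
    hn_docs = []
    for rid, rdoc in zip(retrieved_ids, retrieved_docs):
        if rid == pos_id or rdoc == pos_text or rdoc in hn_docs:
            # True positive, a duplicate of it, or an already-collected negative; skip it
            continue
        hn_docs.append(rdoc)
        # Stop once we have collected enough hard negatives
        if len(hn_docs) >= NUM_HARD_NEGATIVES:
            break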

    # If we did not find enough distinct hard negatives (e.g., small dataset),
    # we simply keep as many as we found (could be fewer than NUM_HARD_NEGATIVES).
    all_hard_negatives.append(hn_docs)

    # Occasionally print progress so the user can see that it's working
    if (i + 1) % 50 == 0 or (i + 1) == len(questions):
        print(f"Processed {i + 1} / {len(questions)} questions")

# Add the list of hard negatives as a new column in the original DataFrame.
# Note: this column will hold lists of strings.
df["hard_negatives"] = all_hard_negatives

print("Example row with hard negatives:")
df[["question", "chunk_preview", "hard_negatives"]].head()
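
`collection.query` returns a dictionary whose values are nested lists with one inner list per query embedding, which is why the loop above indexes with `[0]`. The toy dictionary below mimics that shape for two queries (the IDs and texts are made up) and shows how to pair IDs with documents for a chosen query. The same layout means several question embeddings could be passed in one call and unpacked by position.

```python
def pairs_for_query(results, query_index):
    """Return (id, document) pairs for one query in a query-style result dict."""
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    return list(zip(ids, docs))

# Toy stand-in for a query result covering two query embeddings
toy_results = {
    "ids": [["doc_7", "doc_2"], ["doc_4", "doc_7"]],
    "documents": [["chunk seven", "chunk two"], ["chunk four", "chunk seven"]],
}

first = pairs_for_query(toy_results, 0)
second = pairs_for_query(toy_results, 1)
print(first)
print(second)
```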


In [None]:
# ============================================================================
# Build Triplet Dataset: (question, positive_chunk, hard_negative_chunk)
# ============================================================================
# Many fine-tuning setups (triplet loss, InfoNCE with explicit negatives) prefer
# the data in "long" format, where each row is:
#   - question
#   - chunk_preview        (the correct / positive chunk)
#   - hard_negative_chunk  (one hard negative)
#
# For each original row and each of its mined hard negatives, we will create a
# separate row in a new DataFrame. If a question has N hard negatives, it will
# produce N rows.
# ----------------------------------------------------------------------------

triplet_rows = []

for _, row in df.iterrows():
    question_text = row["question"]
    positive_chunk = row["chunk_preview"]
    hn_list = row["hard_negatives"]  # list of hard negative strings

    # For each hard negative, create a new triplet row
    for hn_chunk in hn_list:
        triplet_rows.append(
            {
                "question": question_text,
                "chunk_preview": positive_chunk,  # same column name as in the input CSV
                "hard_negative_chunk": hn_chunk,
            }
        )

# Build the new DataFrame from the list of triplet dicts
triplet_df = pd.DataFrame(triplet_rows)

print("Triplet dataset shape:", triplet_df.shape)
triplet_df.head()
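
The same wide-to-long reshaping can also be expressed with pandas' `DataFrame.explode`, which emits one row per list element. The sketch below uses a toy frame with this notebook's column names. A question with an empty negative list produces a missing value after `explode`, so such rows are dropped explicitly.

```python
import pandas as pd

# Toy "wide" frame: one row per question, negatives stored as a list
wide_df = pd.DataFrame(
    {
        "question": ["q1", "q2", "q3"],
        "chunk_preview": ["p1", "p2", "p3"],
        "hard_negatives": [["n1", "n2"], ["n3"], []],
    }
)

long_df = (
    wide_df.explode("hard_negatives")        # one row per (question, negative)
    .dropna(subset=["hard_negatives"])       # drop questions that had no negatives
    .rename(columns={"hard_negatives": "hard_negative_chunk"})
    .reset_index(drop=True)
)

print(long_df)
```

The explicit loop in the cell above is easier to extend (for example, to attach per-negative metadata), while `explode` is more compact when the column already holds clean lists.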


In [None]:
# ============================================================================
# Save the Triplet Dataset to CSV
# ============================================================================
# Finally, we save the newly constructed triplet dataset to disk.
# This CSV can then be used in a separate training script / notebook to
# fine-tune your embedding model using a triplet loss or a contrastive loss
# that takes explicit hard negatives (for example, an InfoNCE-style objective).
# ----------------------------------------------------------------------------

OUTPUT_CSV = "rag_train_with_hard_negatives_triplets.csv"

triplet_df.to_csv(OUTPUT_CSV, index=False)
print(f"Saved triplet dataset to: {OUTPUT_CSV}")
