# RAG for Mantine Documentation

**Purpose:** Evaluate chunking and retrieval strategies for Mantine docs using a fixed set of queries and existing labels.  
**Note:** Embeddings and labels are already generated; ingestion and export steps are commented out.


## Notebook Flow
1. Vector search only: compare fixed vs custom chunking using existing labels.
2. Keyword search (BM25) evaluated on the winning chunker.
3. Hybrid search with RRF (vector + BM25) and evaluate.
4. Re-ranker evaluation on the same queries.
5. New ambiguous queries to test query parsing. 


In [1]:
%load_ext autoreload
%autoreload 2 

In [2]:
from pathlib import Path
from dotenv import load_dotenv
import os
import pandas as pd

load_dotenv()

True

In [3]:
# Create the session factory for the pgvector-backed database
from rag_service.db import DatabaseManager
from rag_service.models.embeddings import Document, Chunk

local_session = DatabaseManager.get_session_factory()

## Setup & Ingestion (already done)


In [4]:
from rag_service.pipeline.document_loader import load_corpus

# Create document node object
ROOT = Path("../documents").resolve()
SOURCE = "mantine_docs"

documents = load_corpus(source=SOURCE, root=ROOT)

In [5]:
mantine_documentation = documents[0]
mantine_documentation.doc_id

'mantine_docs::mantine-llms-full.txt'

In [7]:
from rag_service.providers.gemini import RateLimitedGeminiEmbedding


In [8]:
# Initialize Google Gemini Embedding model
from google.genai.types import EmbedContentConfig

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
EMBEDDING_DIM = os.getenv("EMBEDDING_DIM")
embedding_model = RateLimitedGeminiEmbedding(
    model_name="gemini-embedding-001", 
    api_key=GEMINI_API_KEY,
    embedding_config=EmbedContentConfig(output_dimensionality=int(EMBEDDING_DIM)),
    embed_batch_size=99,
    timeout=60,
    sleep_s=0.1
)

### Instantiate Chunkers

In [9]:
CHUNK_SIZE = 2000
CHUNK_OVERLAP = 300

In [None]:
from rag_service.pipeline.mantine_markdown_parser import MantineMarkdownChunker

mantine_parser = MantineMarkdownChunker(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
)



In [None]:
from llama_index.core.node_parser import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=500,
    chunk_overlap=80,
    separator="\n\n",
    backup_separators=["\n", " "]
)


### Ingestion Pipeline

In [None]:
# Initialize Ingestion Pipeline
from rag_service.pipeline.ingestion import IngestPipeline

In [None]:
ingest_pipeline_custom = IngestPipeline(
    chunker_transform=mantine_parser,
    embedding_model=embedding_model,
    session_factory=local_session
)


In [None]:
ingest_pipeline_fixed = IngestPipeline(
    chunker_transform=splitter,
    embedding_model=embedding_model,
    session_factory=local_session
)

In [None]:
# Ingestion already done; kept commented out to avoid re-embedding the corpus
# res_custom = await ingest_pipeline_custom.ingest_documents(
#     documents=[mantine_documentation],
#     source=SOURCE,
#     title="Mantine Documentation"
# )

In [None]:
# res_fixed = await ingest_pipeline_fixed.ingest_documents(
#     documents=[mantine_documentation],
#     source=SOURCE,
#     title="Mantine Documentation"
# )

In [None]:
# res_custom

In [None]:
# res_fixed

## Step 1 - Vector Search: Chunker Comparison


### Load Queries (existing)


In [10]:
import json
from rag_service.eval.labelling_utils import export_hits_to_csv, make_truth_label_df

from rag_service.models.evaluations import (
    QueryItem,
    RetrievalHit,
    KeywordSearchHit
)




In [11]:
query_fpath = Path("../evaluation/queries_merged.jsonl")
queries = [json.loads(line) for line in query_fpath.read_text().splitlines() if line.strip()]
query_items = [QueryItem.model_validate(q) for q in queries]


In [12]:
from rag_service.pipeline.retrieval import vectors_search, bm25_search, hybrid_search


### Run Vector Search (vector-only baseline)


In [None]:
ef_search_values = [50]
async with local_session() as session:
    retrieved_vector_hits = await vectors_search(
        queries=query_items,
        source=SOURCE,
        embedding_model=embedding_model,
        ef_search_values=ef_search_values,
        k=15,
        session=session
    )

### Load Existing Labels (custom + fixed)


In [None]:

labelled_path_custom = '../evaluation/retrieval_mantine_custom_merged_label.csv'
labelled_path_fixed = '../evaluation/retrieval_mantine_fixed_chunk_merged_label.csv'

df_custom = pd.read_csv(labelled_path_custom)
df_fixed = pd.read_csv(labelled_path_fixed)

### Metrics & Comparison


In [25]:
import numpy as np
from rag_service.eval.ndcg import calculate_ndcg
from rag_service.eval.precision import calculate_precision_scores

# Rank cutoff used by every @K metric in this notebook
K = 10
   


In [23]:
def summarize_metrics(df: pd.DataFrame, run_name: str, k: int) -> dict:
    precision = calculate_precision_scores(df, k)
    ndcg = calculate_ndcg(df, k)

    return {
        "run": run_name,
        f"P@{k}": np.mean([r["precision_at_k"] for r in precision]),
        f"MAP@{k}": np.mean([r["average_precision"] for r in precision]),
        f"nDCG@{k}": np.mean([r["ndcg_value"] for r in ndcg]),
        f"DCG@{k}": np.mean([r["dcg_value"] for r in ndcg]),
    }

In [None]:
summary_fixed = summarize_metrics(df_fixed, "Vector: fixed chunker", K)
summary_custom_vector = summarize_metrics(df_custom, "Vector: custom chunker", K)

chunker_summary = pd.DataFrame([summary_fixed, summary_custom_vector]).set_index("run")
chunker_delta = (
    chunker_summary.loc["Vector: custom chunker"]
    - chunker_summary.loc["Vector: fixed chunker"]
).to_frame().T
chunker_delta.index = ["Delta custom - fixed"]

chunker_table = pd.concat([chunker_summary, chunker_delta])
chunker_table.style.format("{:.4f}")


**Insight**

Fixed-size chunking is a strong baseline. The markdown-aware chunker improves early/top-K quality (P@10, DCG@10, nDCG@10), likely due to better topical boundaries, but slightly reduces MAP@10, suggesting a different distribution of relevant hits across ranks.

**Limitation**

Relevance labels were manually assigned using a rubric, but judgments still involve some subjectivity (e.g., what counts as “direct” vs “partial”), so scores may have minor margins of error.

Labeling rubric:

- 3 = directly answers the query’s main intent with the key action/object and minimal distracting fluff.
- 2 = correct but less direct (more general, partially mixed with unrelated content, or answers a valid sub-interpretation).
- 1 = relevant context but not sufficient to answer on its own.
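
The implementations of `calculate_precision_scores` and `calculate_ndcg` are not shown in this notebook. As a rough sketch of what the four reported metrics measure, the functions below assume the standard textbook definitions: a chunk counts as relevant when its rubric grade is at least 1, DCG uses linear gain with a log2 discount, and the ideal ordering for nDCG is built from the retrieved list itself. All function names and the threshold are illustrative assumptions, not the `rag_service` API.

```python
import math


def precision_at_k(relevances, k, threshold=1):
    """Fraction of the top-k positions that hold a relevant chunk."""
    return sum(1 for rel in relevances[:k] if rel >= threshold) / k


def average_precision_at_k(relevances, k, threshold=1):
    """Mean of the precision values taken at each rank where a relevant chunk appears."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel >= threshold:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the same grades sorted best-first (assumed ideal)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal else 0.0


# Hypothetical graded result lists for one query (grades follow the rubric above).
ranking_a = [0, 3, 3, 0]  # two relevant chunks, but none at rank 1
ranking_b = [1, 0, 0, 0]  # one weakly relevant chunk at rank 1

for name, grades in (("a", ranking_a), ("b", ranking_b)):
    print(
        name,
        precision_at_k(grades, 4),
        average_precision_at_k(grades, 4),
        dcg_at_k(grades, 4),
    )
```

In this toy case `ranking_a` has the higher P@4 and DCG@4 while `ranking_b` has the higher average precision, which is the kind of disagreement between P@K and MAP@K noted in the insight above.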

## Step 2 - Keyword Search (BM25)


Run BM25, then evaluate using the existing labels from the custom chunker.

In [None]:
async with local_session() as session:
    keyword_hits = await bm25_search(
        queries=query_items,
        k=15,
        session=session,
        source=SOURCE
    )

In [None]:
keyword_hits

In [None]:
# Labels already exported; skip new export for now
# outpath_bm25 = "../evaluation/bm25_search_labels.csv"
# export_hits_to_csv(keyword_hits, outpath_bm25)


In [None]:
labelled_path_bm25 = '../evaluation/bm25_search_labelled.csv'
keywords_df = pd.read_csv(labelled_path_bm25)
keywords_df

**BM25 Metrics**

In [None]:
summary_bm25 = summarize_metrics(keywords_df, "BM25", K)

bm25_table = pd.DataFrame([summary_bm25]).set_index("run")
bm25_table.style.format("{:.4f}")


## Step 3 - Hybrid Search with RRF (Vector + BM25)

Build RRF results, label them using a shared truth set, then evaluate.
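
The internals of `hybrid_search` are not shown here. As a minimal sketch of weighted Reciprocal Rank Fusion, assuming `a` and `b` are per-retriever weights and using the commonly cited smoothing constant of 60 (an assumption, not a value taken from `rag_service`), fusion could look like this. The function name `weighted_rrf` and the chunk ids are hypothetical.

```python
def weighted_rrf(vector_ranked, bm25_ranked, a=0.5, b=0.5, rrf_k=60, top_n=15):
    """Fuse two ranked lists of chunk ids with weighted Reciprocal Rank Fusion.

    Each list contributes weight / (rrf_k + rank) for every chunk it contains;
    a chunk missing from a list simply receives no contribution from it.
    """
    scores = {}
    for weight, ranked in ((a, vector_ranked), (b, bm25_ranked)):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (rrf_k + rank)
    # Highest fused score first; ties are broken by chunk id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]


# Hypothetical chunk ids: "c2" appears in both lists, so it is promoted to the top.
fused = weighted_rrf(["c1", "c2", "c3"], ["c2", "c4"])
print([chunk_id for chunk_id, _ in fused])
```

Because RRF only looks at ranks, the raw vector distances and BM25 scores never need to be put on a common scale.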

In [None]:

async with local_session() as session:
    hybrid_search_df = await hybrid_search(
        queries=query_items,
        embedding_model=embedding_model,
        source=SOURCE,
        ef_search_values=ef_search_values,
        a=0.5,
        b=0.5,
        k=15,
        session=session
    )


hybrid_search_df.head(20)

In [None]:
# label_path_hybrid = '../evaluation/hybrid_search_mantine_merged_label.csv'
# hybrid_search_df.to_csv(label_path_hybrid, index=False)

In [None]:
hybrid_search_labelled_path = '../evaluation/hybrid_search_mantine_merged_labelled.csv'
hybrid_search_labelled_df = pd.read_csv(hybrid_search_labelled_path)

In [None]:

truth_label_df = make_truth_label_df(hybrid_search_labelled_df)

In [None]:
truth_label_df

In [None]:
assert truth_label_df.duplicated(['query_id', 'chunk_id']).sum() == 0

In [15]:
# Export to csv
outpath_truth = '../evaluation/mantine_truth_labels.csv'
# truth_label_df.to_csv(outpath_truth, index=False)

In [16]:
truth_label_df = pd.read_csv(outpath_truth)

**RRF Metrics**

In [None]:
summary_rrf = summarize_metrics(hybrid_search_labelled_df, "RRF (a=0.5, b=0.5)", K)

method_summary = pd.DataFrame(
    [summary_custom_vector, summary_bm25, summary_rrf]
).set_index("run")

method_summary.style.format("{:.4f}")


**Insight**


BM25 provides no benefit over vector search in this setting. RRF did not yield measurable gains at K = 10 for the tested weight configuration (a = 0.5, b = 0.5), so vector-only remains the best-performing baseline.

These queries are phrased more like “how-to” or conceptual questions to simulate what an actual user might ask (semantic intent, paraphrases, and context), so exact keyword overlap is less important. In this setting, vector retrieval can match meaning even when the query words don’t appear verbatim in the best chunk, while BM25 relies heavily on lexical overlap and can miss semantically relevant chunks.
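
To make the lexical-overlap point concrete, here is a small self-contained Okapi BM25 scorer. It is only an illustration: the `bm25_search` used in this notebook runs against the database and its implementation is not shown, the whitespace tokenizer and the `k1`/`b` defaults are assumptions, and the two example "chunks" are made up rather than taken from the Mantine docs.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (no stemming, no stop-word removal)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue  # no lexical overlap means no contribution
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "sticky header stays fixed while content scrolls",  # semantically relevant
    "page transitions and moving animations",           # shares words only
]
scores = bm25_scores("stop the top bar moving with the page", docs)
print(scores)
```

The first chunk answers the question but shares no tokens with the query, so it scores zero, while the second chunk wins on the words "moving" and "page" alone.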

## Step 4 - Re-ranker



In [None]:
import voyageai
import pandas as pd

vo = voyageai.Client()

def voyage_rerank(df:pd.DataFrame, model:str='rerank-2.5-lite', top_k:int=10):
    all_reranked = []

    for query_id, group in df.groupby("query_id"):
        # Sort once so positions in `documents` match the indices returned by the reranker
        ordered = group.sort_values("rank").reset_index(drop=True)
        query = ordered.iloc[0]["query_text"]
        documents = ordered["chunk_text"].tolist()

        reranked_results = vo.rerank(
            query=query,
            documents=documents,
            model=model,
            top_k=top_k
        )

        picked = []
        for new_rank, r in enumerate(reranked_results.results, start=1):
            row = ordered.iloc[r.index].copy()
            row["rerank_score"] = float(r.relevance_score)
            row["rank_reranked"] = new_rank
            picked.append(row)
        
        all_reranked.append(pd.DataFrame(picked))
    reranked_df = pd.concat(all_reranked, ignore_index=True)
    return reranked_df

In [None]:
reranked_df = voyage_rerank(
    df=hybrid_search_labelled_df,
    model='rerank-2.5-lite',
    top_k=10
)

reranked_df

In [None]:
rerank_outpath = '../evaluation/reranked_mantine_label.csv'
reranked_df.to_csv(rerank_outpath, index=False)

In [None]:
reranked_summary = summarize_metrics(reranked_df, "reranked", K)

method_summary = pd.DataFrame(
    [summary_custom_vector, summary_bm25, summary_rrf, reranked_summary]
).set_index("run")

method_summary.style.format("{:.4f}")

**Insight**

Reranking noticeably improved performance over the RRF baseline.

## Step 5 - Query Parsing 

### Unparsed

In [17]:
query_parse_fpath = Path("../evaluation/queries_to_parse.jsonl")
queries_to_parse = [json.loads(line) for line in query_parse_fpath.read_text().splitlines() if line.strip()]
queries_to_parse = [QueryItem.model_validate(q) for q in queries_to_parse]


In [18]:
queries_to_parse

[QueryItem(id='parse_eval2_q001_01', category='theme', difficulty=1, text='How can I add my own styles to be used across the whole app?', tags=['theme', 'provider', 'global', 'messy']),
 QueryItem(id='parse_eval2_q007_01', category='hooks', difficulty=3, text='form validate 2 inputs same — error only on 2nd', tags=['useForm', 'validation', 'cross-field', 'messy']),
 QueryItem(id='parse_eval2_q005_01', category='components', difficulty=2, text="I don't want my top header or navbar to scroll with page.", tags=['layout', 'header', 'messy']),
 QueryItem(id='parse_eval2_q014_01', category='hooks', difficulty=2, text='When I make a change on one tab, I want the other tab to be updated too.', tags=['useLocalStorage', 'localStorage', 'sync', 'tabs', 'messy'])]

In [19]:
ef_search_values = [50]
async with local_session() as session:
    retrieved_vector_hits = await vectors_search(
        queries=queries_to_parse,
        source=SOURCE,
        embedding_model=embedding_model,
        ef_search_values=ef_search_values,
        k=15,
        session=session
    )
    
unparsed_query_df = pd.DataFrame([hit.model_dump() for hit in retrieved_vector_hits])
unparsed_query_df

Unnamed: 0,query_id,query_text,run_name,param_value,rank,dist,chunk_id,chunk_text
0,parse_eval2_q001_01,How can I add my own styles to be used across ...,hnsw_ef50_k15,50,1,0.29495,abe16395-6325-49d4-a307-132341bb45ab,Topic: GlobalStyles\nSection: Add global style...
1,parse_eval2_q001_01,How can I add my own styles to be used across ...,hnsw_ef50_k15,50,2,0.307682,e3b48d31-7906-43a5-80ca-00ca524932a1,Topic: Emotion\nSection: styles prop\n\n`style...
2,parse_eval2_q001_01,How can I add my own styles to be used across ...,hnsw_ef50_k15,50,3,0.309155,52324c0a-312a-4395-954e-a4ab7e7a8c8c,Topic: Emotion\nSection: styles in theme\n\nYo...
3,parse_eval2_q001_01,How can I add my own styles to be used across ...,hnsw_ef50_k15,50,4,0.309955,0c1b3056-616a-4197-be68-5bd91cea9d5c,Topic: General\nSection: I prefer a third-part...
4,parse_eval2_q001_01,How can I add my own styles to be used across ...,hnsw_ef50_k15,50,5,0.31376,0c3e5f2b-7fe3-4012-acad-cd2ce8a4e27f,Topic: MantineStyles\nSection: Mantine compone...
5,parse_eval2_q001_01,How can I add my own styles to be used across ...,hnsw_ef50_k15,50,6,0.316285,5b76a555-ef34-49dc-9e81-02c272d0c6dd,Topic: Input\nSection: Styles on theme\n\nSame...
6,parse_eval2_q001_01,How can I add my own styles to be used across ...,hnsw_ef50_k15,50,7,0.316862,5d6d8375-bc91-44cc-a34b-00414bd75517,Topic: SixToSeven\nSection: Global styles\n\n`...
7,parse_eval2_q001_01,How can I add my own styles to be used across ...,hnsw_ef50_k15,50,8,0.318789,f57d97ef-6fa4-4c3f-a9db-a878fcd563bb,Topic: General\nSection: How Mantine styles wo...
8,parse_eval2_q001_01,How can I add my own styles to be used across ...,hnsw_ef50_k15,50,9,0.318796,e7c2fab6-b9bf-4ad6-b28d-0b467ca41b8c,Topic: CSSModules\nSection: Styling Mantine co...
9,parse_eval2_q001_01,How can I add my own styles to be used across ...,hnsw_ef50_k15,50,10,0.32208,331c6399-0fc0-4ed6-b5d9-63baaa240432,Topic: CSSModules\nSection: Adding styles to M...


In [20]:
def to_gold_query_id(parsed_id: str) -> str:
    """
    parse_eval2_q001_01 -> eval2_q001
    eval2_q001          -> eval2_q001  (already gold)
    """
    s = str(parsed_id)
    if s.startswith("parse_"):
        s = s[len("parse_"):]
    parts = s.split("_")
    return "_".join(parts[:2]) 



In [21]:
unparsed_query_df["gold_query_id"] = unparsed_query_df["query_id"].apply(to_gold_query_id)
unparsed_query_df_labelled = unparsed_query_df.merge(
    truth_label_df,
    left_on=["gold_query_id", "chunk_id"],
    right_on=["query_id", "chunk_id"],
    how="left",
    suffixes=("", "_truth")
)

unparsed_query_df_labelled["relevance"] = unparsed_query_df_labelled["relevance"].fillna(0).astype(int)

In [26]:
unparsed_query_summary = summarize_metrics(unparsed_query_df_labelled, "unparsed queries", K)

unparsed_query_summary

{'run': 'unparsed queries',
 'P@10': np.float64(0.125),
 'MAP@10': np.float64(0.43749999999999994),
 'nDCG@10': np.float64(0.5225979458841208),
 'DCG@10': np.float64(4.751541953607022)}

### Parsed

In [27]:
import json
from typing import Optional, Dict, List
from pydantic import BaseModel, Field, ValidationError, field_validator

class ParsedQuery(BaseModel):
    refined_query: str
    filters: Dict[str, str] = Field(default_factory=dict)
    must_not: List[str] = Field(default_factory=list)
    confidence: Optional[float] = None


def validate_llm_json(text: str) -> ParsedQuery:
    # Layer 1: JSON parse
    data = json.loads(text)  # raises json.JSONDecodeError if invalid
    # Layer 2: Schema validate
    return ParsedQuery.model_validate(data)  # raises ValidationError if wrong shape





In [28]:
from rag_service.providers.gemini import GeminiTextLLM

query_parser_llm = GeminiTextLLM(
    model=os.getenv("GEMINI_PARSER_Model", "gemini-2.5-flash-lite"),
    api_key=os.getenv("GEMINI_API_KEY"),
    min_interval_s=2
)

In [50]:
import json

async def parse_query_with_llm(llm, user_query: str, system_prompt: str, query_parsing_prompt: str) -> ParsedQuery:
    prompt = query_parsing_prompt.format(user_query=user_query)

    # attempt 1
    raw = await llm.generate_text(
        prompt=prompt,
        system_prompt=system_prompt,
        max_tokens=500,
        temperature=0.0,
        top_p=1.0,
        n=1,
        stop=None,
        response_mime_type="application/json",
    )
    try:
        return validate_llm_json(raw)
    except (json.JSONDecodeError, ValidationError):
        # attempt 2 (repair)
        repair_prompt = "Return ONLY valid JSON. No prose. No markdown.\n\n" + prompt
        raw2 = await llm.generate_text(
            prompt=repair_prompt,
            system_prompt=system_prompt,
            max_tokens=500,
            temperature=0.0,
            top_p=1.0,
            n=1,
            stop=None,
            response_mime_type="application/json",
        )
        try:
            return validate_llm_json(raw2)
        except (json.JSONDecodeError, ValidationError) as e:
            print("PARSER VALIDATION FAILED:", type(e).__name__, e)
            # Show the output of the failed repair attempt, not the first attempt
            print("RAW LLM OUTPUT:", raw2[:500])
            return ParsedQuery(refined_query=user_query)


In [30]:
parsed_queries: list[QueryItem] = []

SYSTEM_PROMPT = (
    "You are an expert query parser for Mantine documentation search. "
    "Return ONLY valid JSON matching the requested schema. No prose, no markdown. "
    "filters values MUST be strings (single value), not arrays. If multiple, pick the most relevant single one."
)

QUERY_PARSING_PROMPT = (
    """
    Return ONLY valid JSON with fields:
        - refined_query (string)
        - filters (object)
        - must_not (array of strings)
        - confidence (0-1)
    
    User Query: '''{user_query}'''
    
    """
)

for qi in queries_to_parse:
    parsed = await parse_query_with_llm(
        llm=query_parser_llm,
        user_query=qi.text,
        system_prompt=SYSTEM_PROMPT,
        query_parsing_prompt=QUERY_PARSING_PROMPT,
    )

    parsed_queries.append(
        QueryItem(
            id=qi.id,
            category=qi.category,
            difficulty=qi.difficulty,
            text=parsed.refined_query,
            tags=qi.tags,
        )
    )


In [31]:
parsed_queries

[QueryItem(id='parse_eval2_q001_01', category='theme', difficulty=1, text='global styles', tags=['theme', 'provider', 'global', 'messy']),
 QueryItem(id='parse_eval2_q007_01', category='hooks', difficulty=3, text='form validation two inputs same error second input', tags=['useForm', 'validation', 'cross-field', 'messy']),
 QueryItem(id='parse_eval2_q005_01', category='components', difficulty=2, text='fixed header or navbar', tags=['layout', 'header', 'messy']),
 QueryItem(id='parse_eval2_q014_01', category='hooks', difficulty=2, text='synchronize tab updates', tags=['useLocalStorage', 'localStorage', 'sync', 'tabs', 'messy'])]

In [32]:
ef_search_values = [50]

async with local_session() as session:
    retrieved_vector_hits_parsed = await vectors_search(
        queries=parsed_queries,
        source=SOURCE,
        embedding_model=embedding_model,
        ef_search_values=ef_search_values,
        k=15,
        session=session
    )
    
parsed_queries_df = pd.DataFrame([hit.model_dump() for hit in retrieved_vector_hits_parsed])
parsed_queries_df

Unnamed: 0,query_id,query_text,run_name,param_value,rank,dist,chunk_id,chunk_text
0,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,1,0.250112,cfa908cf-8382-442f-a0c8-17903de6fca0,Topic: GlobalStyles\nSection: Overview\n\n# Gl...
1,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,2,0.258681,5d6d8375-bc91-44cc-a34b-00414bd75517,Topic: SixToSeven\nSection: Global styles\n\n`...
2,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,3,0.265468,abe16395-6325-49d4-a307-132341bb45ab,Topic: GlobalStyles\nSection: Add global style...
3,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,4,0.266246,d695b57b-b87a-4ca4-9c82-53b7a48e6ae0,Topic: GlobalStyles\nSection: Body and :root e...
4,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,5,0.266315,10efcc17-9248-40a0-9d2b-1ed2d15fd736,Topic: CSSFilesList\nSection: Global styles\n\...
5,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,6,0.272084,75b8add9-8a29-4ee4-b96c-594d78329a41,Topic: SevenToEight\nSection: Global styles im...
6,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,7,0.280832,b36531fe-5dfa-44de-95c9-21cb21a0304a,Topic: SixToSeven\nSection: createStyles and G...
7,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,8,0.288701,6d2d42a7-b7b3-434e-acf1-2f1a2926817f,Topic: GlobalStyles\nSection: CSS reset\n\n`@m...
8,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,9,0.297437,0262d898-9595-4dd3-a4c9-9e415747d92e,Topic: CSSModules\nSection: Referencing global...
9,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,10,0.298924,0c521dc0-c7cc-4d8e-8262-14fb73f8bbe3,Topic: GlobalStyles\nSection: Static classes\n...


In [33]:
parsed_queries_df["gold_query_id"] = parsed_queries_df["query_id"].apply(to_gold_query_id)
parsed_query_df_labelled = parsed_queries_df.merge(
    truth_label_df,
    left_on=["gold_query_id", "chunk_id"],
    right_on=["query_id", "chunk_id"],
    how="left",
    suffixes=("", "_truth")
)

parsed_query_df_labelled["relevance"] = parsed_query_df_labelled["relevance"].fillna(0).astype(int)

In [34]:
parsed_query_summary = summarize_metrics(parsed_query_df_labelled, "parsed queries", K)

parsed_query_summary

{'run': 'parsed queries',
 'P@10': np.float64(0.125),
 'MAP@10': np.float64(0.5634920634920635),
 'nDCG@10': np.float64(0.6273750696567437),
 'DCG@10': np.float64(5.131372036660792)}

### Prompt Engineering

In [35]:
IMPROVED_SYSTEM_PROMPT = (
    "You are a query parser for Mantine documentation search. "
    "Rewrite messy queries into technical search queries that match documentation wording. "
    "Return ONLY valid JSON. No prose, no markdown."
)

IMPROVED_QUERY_PARSING_PROMPT = """
Return ONLY valid JSON with fields:
- refined_query (string)
- filters (object)
- must_not (array of strings)
- confidence (0-1)

Rules:
- Avoid overly generic phrases.
- Negation: if user says don't/not/avoid/without, rewrite to the positive goal and put the negated concept in must_not.
- Try your best to map generic user terms to actual Mantine terms, if you are confident of the mapping.
- Do not reduce queries into keyword lists, keep it with context as natural language.
- Only add filters if confident; values must be strings.

User Query: '''{user_query}'''
"""


### Parsed (Improved Prompts)


In [36]:
parsed_queries_improved: list[QueryItem] = []

for qi in queries_to_parse:
    parsed = await parse_query_with_llm(
        llm=query_parser_llm,
        user_query=qi.text,
        system_prompt=IMPROVED_SYSTEM_PROMPT,
        query_parsing_prompt=IMPROVED_QUERY_PARSING_PROMPT,
    )

    parsed_queries_improved.append(
        QueryItem(
            id=qi.id,
            category=qi.category,
            difficulty=qi.difficulty,
            text=parsed.refined_query,
            tags=qi.tags,
        )
    )


In [37]:
parsed_queries_improved

[QueryItem(id='parse_eval2_q001_01', category='theme', difficulty=1, text='global styles', tags=['theme', 'provider', 'global', 'messy']),
 QueryItem(id='parse_eval2_q007_01', category='hooks', difficulty=3, text='form validation for two inputs with same value, showing error only on the second input', tags=['useForm', 'validation', 'cross-field', 'messy']),
 QueryItem(id='parse_eval2_q005_01', category='components', difficulty=2, text='fixed header or navbar', tags=['layout', 'header', 'messy']),
 QueryItem(id='parse_eval2_q014_01', category='hooks', difficulty=2, text='synchronize tab state between tabs', tags=['useLocalStorage', 'localStorage', 'sync', 'tabs', 'messy'])]

In [38]:
ef_search_values = [50]

async with local_session() as session:
    retrieved_vector_hits_parsed_improved = await vectors_search(
        queries=parsed_queries_improved,
        source=SOURCE,
        embedding_model=embedding_model,
        ef_search_values=ef_search_values,
        k=15,
        session=session
    )
    
parsed_queries_improved_df = pd.DataFrame([hit.model_dump() for hit in retrieved_vector_hits_parsed_improved])
parsed_queries_improved_df


Unnamed: 0,query_id,query_text,run_name,param_value,rank,dist,chunk_id,chunk_text
0,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,1,0.250112,cfa908cf-8382-442f-a0c8-17903de6fca0,Topic: GlobalStyles\nSection: Overview\n\n# Gl...
1,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,2,0.258681,5d6d8375-bc91-44cc-a34b-00414bd75517,Topic: SixToSeven\nSection: Global styles\n\n`...
2,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,3,0.265468,abe16395-6325-49d4-a307-132341bb45ab,Topic: GlobalStyles\nSection: Add global style...
3,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,4,0.266246,d695b57b-b87a-4ca4-9c82-53b7a48e6ae0,Topic: GlobalStyles\nSection: Body and :root e...
4,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,5,0.266315,10efcc17-9248-40a0-9d2b-1ed2d15fd736,Topic: CSSFilesList\nSection: Global styles\n\...
5,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,6,0.272084,75b8add9-8a29-4ee4-b96c-594d78329a41,Topic: SevenToEight\nSection: Global styles im...
6,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,7,0.280832,b36531fe-5dfa-44de-95c9-21cb21a0304a,Topic: SixToSeven\nSection: createStyles and G...
7,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,8,0.288701,6d2d42a7-b7b3-434e-acf1-2f1a2926817f,Topic: GlobalStyles\nSection: CSS reset\n\n`@m...
8,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,9,0.297437,0262d898-9595-4dd3-a4c9-9e415747d92e,Topic: CSSModules\nSection: Referencing global...
9,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,10,0.298924,0c521dc0-c7cc-4d8e-8262-14fb73f8bbe3,Topic: GlobalStyles\nSection: Static classes\n...


In [39]:
parsed_queries_improved_df["gold_query_id"] = parsed_queries_improved_df["query_id"].apply(to_gold_query_id)
parsed_query_improved_df_labelled = parsed_queries_improved_df.merge(
    truth_label_df,
    left_on=["gold_query_id", "chunk_id"],
    right_on=["query_id", "chunk_id"],
    how="left",
    suffixes=("", "_truth")
)

parsed_query_improved_df_labelled["relevance"] = parsed_query_improved_df_labelled["relevance"].fillna(0).astype(int)


In [40]:
parsed_query_improved_summary = summarize_metrics(parsed_query_improved_df_labelled, "improved parsed queries", K)

parsed_query_improved_summary


{'run': 'improved parsed queries',
 'P@10': np.float64(0.125),
 'MAP@10': np.float64(0.6527777777777778),
 'nDCG@10': np.float64(0.6637013687201386),
 'DCG@10': np.float64(5.579899565498037)}

In [41]:
query_parse_summary = pd.DataFrame(
    [unparsed_query_summary, parsed_query_summary, parsed_query_improved_summary]
).set_index("run")

query_parse_summary.style.format("{:.4f}")


run,P@10,MAP@10,nDCG@10,DCG@10
unparsed queries,0.125,0.4375,0.5226,4.7515
parsed queries,0.125,0.5635,0.6274,5.1314
improved parsed queries,0.125,0.6528,0.6637,5.5799


### Augment Prompt with retrieved chunks


### Parsed (Augmented With Retrieved Chunks)


In [42]:
def build_retrieved_chunk_context(df: pd.DataFrame, top_k: int = 3) -> dict[str, str]:
    ctx = {}
    ranked = df.sort_values("rank")
    for query_id, group in ranked.groupby("query_id"):
        top_chunks = group.head(top_k)["chunk_text"].tolist()
        ctx[query_id] = "\n\n".join([
            f"Chunk {i + 1}:\n{chunk}" for i, chunk in enumerate(top_chunks)
        ])
    return ctx

retrieved_chunk_context = build_retrieved_chunk_context(unparsed_query_df, top_k=3)


In [52]:
AUGMENTED_SYSTEM_PROMPT = (
    "You are a query parser for Mantine documentation search. "
    "Rewrite messy queries into concise technical search queries using Mantine’s terminology. "
    "You are also given top retrieved chunks from the original query; use them only to infer Mantine's terminology. "
    "Return ONLY valid JSON. No prose, no markdown."
)

AUGMENTED_QUERY_PARSING_PROMPT = """
Return JSON with exactly these fields:
- refined_query: string (concise; include the most important Mantine API/component/hook names if implied)
- filters: object (optional; values must be strings, not arrays)
- must_not: array of strings (optional; put negated concepts here instead of refined_query)
- confidence: number between 0 and 1

Guidelines:
- Prefer library/API terms over vague words (e.g., component/hook/provider names when implied).
- filters can include keys like: package, component, hook, framework (only if confidently inferred).
- If unsure, leave filters empty.

Retrieved chunks:
{retrieved_chunks}

User Query: '''{user_query}'''
"""


In [53]:
parsed_queries_augmented: list[QueryItem] = []

for qi in queries_to_parse:
    retrieved_chunks = retrieved_chunk_context.get(qi.id, "") or "(none)"
    safe_chunks = retrieved_chunks.replace("{", "{{").replace("}", "}}")
    prompt_with_chunks = AUGMENTED_QUERY_PARSING_PROMPT.replace("{retrieved_chunks}", safe_chunks)

    parsed = await parse_query_with_llm(
        llm=query_parser_llm,
        user_query=qi.text,
        system_prompt=AUGMENTED_SYSTEM_PROMPT,
        query_parsing_prompt=prompt_with_chunks,
    )

    parsed_queries_augmented.append(
        QueryItem(
            id=qi.id,
            category=qi.category,
            difficulty=qi.difficulty,
            text=parsed.refined_query,
            tags=qi.tags,
        )
    )


In [54]:
parsed_queries_augmented


[QueryItem(id='parse_eval2_q001_01', category='theme', difficulty=1, text='global styles', tags=['theme', 'provider', 'global', 'messy']),
 QueryItem(id='parse_eval2_q007_01', category='hooks', difficulty=3, text='form validation matchesField', tags=['useForm', 'validation', 'cross-field', 'messy']),
 QueryItem(id='parse_eval2_q005_01', category='components', difficulty=2, text='fixed header or navbar', tags=['layout', 'header', 'messy']),
 QueryItem(id='parse_eval2_q014_01', category='hooks', difficulty=2, text='Tabs synchronization between browser tabs', tags=['useLocalStorage', 'localStorage', 'sync', 'tabs', 'messy'])]

In [55]:
ef_search_values = [50]

async with local_session() as session:
    retrieved_vector_hits_parsed_augmented = await vectors_search(
        queries=parsed_queries_augmented,
        source=SOURCE,
        embedding_model=embedding_model,
        ef_search_values=ef_search_values,
        k=15,
        session=session
    )
    
parsed_queries_augmented_df = pd.DataFrame([hit.model_dump() for hit in retrieved_vector_hits_parsed_augmented])
parsed_queries_augmented_df


Unnamed: 0,query_id,query_text,run_name,param_value,rank,dist,chunk_id,chunk_text
0,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,1,0.250112,cfa908cf-8382-442f-a0c8-17903de6fca0,Topic: GlobalStyles\nSection: Overview\n\n# Gl...
1,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,2,0.258681,5d6d8375-bc91-44cc-a34b-00414bd75517,Topic: SixToSeven\nSection: Global styles\n\n`...
2,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,3,0.265468,abe16395-6325-49d4-a307-132341bb45ab,Topic: GlobalStyles\nSection: Add global style...
3,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,4,0.266246,d695b57b-b87a-4ca4-9c82-53b7a48e6ae0,Topic: GlobalStyles\nSection: Body and :root e...
4,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,5,0.266315,10efcc17-9248-40a0-9d2b-1ed2d15fd736,Topic: CSSFilesList\nSection: Global styles\n\...
5,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,6,0.272084,75b8add9-8a29-4ee4-b96c-594d78329a41,Topic: SevenToEight\nSection: Global styles im...
6,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,7,0.280832,b36531fe-5dfa-44de-95c9-21cb21a0304a,Topic: SixToSeven\nSection: createStyles and G...
7,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,8,0.288701,6d2d42a7-b7b3-434e-acf1-2f1a2926817f,Topic: GlobalStyles\nSection: CSS reset\n\n`@m...
8,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,9,0.297437,0262d898-9595-4dd3-a4c9-9e415747d92e,Topic: CSSModules\nSection: Referencing global...
9,parse_eval2_q001_01,global styles,hnsw_ef50_k15,50,10,0.298924,0c521dc0-c7cc-4d8e-8262-14fb73f8bbe3,Topic: GlobalStyles\nSection: Static classes\n...


In [56]:
parsed_queries_augmented_df["gold_query_id"] = parsed_queries_augmented_df["query_id"].apply(to_gold_query_id)
parsed_query_augmented_df_labelled = parsed_queries_augmented_df.merge(
    truth_label_df,
    left_on=["gold_query_id", "chunk_id"],
    right_on=["query_id", "chunk_id"],
    how="left",
    suffixes=("", "_truth")
)

parsed_query_augmented_df_labelled["relevance"] = parsed_query_augmented_df_labelled["relevance"].fillna(0).astype(int)


In [57]:
parsed_query_augmented_summary = summarize_metrics(parsed_query_augmented_df_labelled, "augmented parsed queries", K)

parsed_query_augmented_summary


{'run': 'augmented parsed queries',
 'P@10': np.float64(0.125),
 'MAP@10': np.float64(0.6111111111111112),
 'nDCG@10': np.float64(0.6518935163725987),
 'DCG@10': np.float64(5.43410583008132)}

In [58]:
query_parse_summary_augmented = pd.DataFrame(
    [unparsed_query_summary, parsed_query_summary, parsed_query_improved_summary, parsed_query_augmented_summary]
).set_index("run")

query_parse_summary_augmented.style.format("{:.4f}")


run,P@10,MAP@10,nDCG@10,DCG@10
unparsed queries,0.125,0.4375,0.5226,4.7515
parsed queries,0.125,0.5635,0.6274,5.1314
improved parsed queries,0.125,0.6528,0.6637,5.5799
augmented parsed queries,0.125,0.6111,0.6519,5.4341


## Generation