# HW4: Recipe RAG — BM25, Synthetic Queries, Phoenix


<center>
    <p style="text-align:left">
        <img alt="phoenix logo" src="https://repository-images.githubusercontent.com/564072810/f3666cdf-cb3e-4056-8a25-27cb3e6b5848" width="600"/>
        <br>
        <a href="https://arize.com/docs/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>

## 🎯 Assignment Overview
In this assignment, you’ll build and evaluate a BM25-based RAG system for recipes, generate synthetic user queries with llm_generate, and trace everything in Phoenix for visibility and analysis.

#### Workflow
- 🧰 Setup — Install deps, set env vars, and import libs.

- 📥 Load & structure data — Read RAW_recipes.csv, normalize fields (title, ingredients, steps, raw text).

- 🔤 Tokenize & index — Use a number-preserving tokenizer (keeps 375 F, 9x13) and build/load a BM25 index.

- 🧪 Generate synthetic queries — Use llm_generate to create realistic, technical cooking questions (+ validation & dedupe).

- ☁️ Upload dataset to Phoenix — Map to input (query) and output (salient_fact) and upload as a Phoenix Dataset.

- 🔎 Run retriever with tracing — For each query, call retrieve_bm25() (decorated) to emit rag.query → retrieval.bm25 spans.

- 🧪 Experiments — Run a Phoenix Experiment (Recall@1/3/5, MRR) and compare runs.

- 💾 Save results — Persist per-query outputs & summary to JSON for offline review.

- 📊 Inspect in Phoenix — View spans, top-k documents, evaluation chips, and experiment summaries.

#### Core Task: “Can the retriever surface the correct recipe?”
You’ll evaluate whether the ground-truth recipe (the one that the synthetic question depends on) appears in the top-k results—especially top-1/3/5.

## Environment & Imports
_Install deps (if needed), configure environment variables, and import libraries. Keep this section idempotent so re-running the notebook is safe._

In [None]:
%pip install -qqqq 'arize-phoenix==11.21.0' rank-bm25 tqdm litellm python-dotenv kagglehub openinference-instrumentation-litellm

In [None]:
import ast
import functools
import json
import math
import os
import pickle
import random
import re
import string
import time
from contextlib import contextmanager, nullcontext
from getpass import getpass
from pathlib import Path
from time import perf_counter
from typing import Any, Dict, List, Optional, Union

import kagglehub
import nest_asyncio
import pandas as pd
from kagglehub import KaggleDatasetAdapter
from openinference.semconv.trace import OpenInferenceSpanKindValues, SpanAttributes
from rank_bm25 import BM25Okapi
from tqdm import tqdm

import phoenix as px
from phoenix.client import AsyncClient
from phoenix.client.experiments import run_experiment
from phoenix.evals import OpenAIModel, RelevanceEvaluator, llm_generate, run_evals
from phoenix.otel import register
from phoenix.session.evaluation import get_retrieved_documents
from phoenix.trace import DocumentEvaluations

file_path = "hw4/RAW_recipes.csv"
KAGGLE_DATASET = "shuyangli94/food-com-recipes-and-user-interactions"
KAGGLE_FILE = "RAW_recipes.csv"
LOCAL_RAW_CSV = "hw4/data/RAW_recipes.csv"
OUTPUT_JSON = "hw4/data/processed_recipes.json"
DEFAULT_INDEX_PATH = "hw4/data/bm25_index.pkl"
PHOENIX_API_KEY = os.getenv("PHOENIX_API_KEY")
PHOENIX_ENDPOINT = os.getenv("PHOENIX_COLLECTOR_ENDPOINT")
tracer = None
TEXT_FIELD = "raw_text"
TITLE_FIELD = "title"
ID_FIELD = "id"
TOP_N = 200

QUERY_MODEL = "gpt-3.5-turbo-0125"
NUM_QUERIES = 100
BATCH_SIZE = 3
MAX_WORKERS = 5
SEED = 42
TIMEOUT_S = 60
SYN_QUERY_PATH = "hw4/data/synthetic_queries.json"
EVAL_RESULTS_PATH = "hw4/results/retrieval_evaluation.json"

random.seed(SEED)

In [None]:
_JSON_RE = re.compile(r"\{.*\}", re.S)
_WORD_RE = re.compile(r"[a-zA-Z]+")
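_TECH = re.compile(
    # Units accept an optional plural "s": with the trailing \b, bare "minute" or "hour"
    # would reject facts such as "25 minutes" or "4 hours".
    r"\b\d[\d/\.\s]*(?:secs?|seconds?|mins?|minutes?|hours?|hrs?|°[CF]|deg(?:rees)?\s*[CF]"
    r"|tsps?|tbsps?|cups?|g|grams?|ml|°)\b",
    re.I,
)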
_TOKEN_RE = re.compile(
    r"\d+\s*[x×]\s*\d+"
    r"|(?:\d+/?\d+)"
    r"|(?:\d+(?:\.\d+)?)"
    r"|(?:°[fc])"
    r"|[a-z]+"
)


def tokenize(text: str) -> list[str]:
    s = (text or "").lower()
    s = s.replace("degrees f", "°f").replace("degree f", "°f")
    s = s.replace("degrees c", "°c").replace("degree c", "°c")
    s = s.replace("mins", "min").replace("minutes", "min")
    s = s.replace("hours", "hr").replace("hrs", "hr")
    return _TOKEN_RE.findall(s)


def _to_text(x) -> str:
    if isinstance(x, str):
        return x
    if x is None:
        return ""
    if isinstance(x, float) and math.isnan(x):
        return ""
    return str(x)


def _jsonish(v):
    if isinstance(v, str):
        t = v.strip()
        if t and (t[0] + t[-1] in ("{}", "[]")):
            try:
                return json.loads(t)
            except Exception:
                pass
    return v


def _norm_id(x):
    return "" if x is None else str(x)


nest_asyncio.apply()
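To see what the number-preserving tokenizer keeps, here is a small standalone sketch that mirrors the pattern and a subset of the normalizations from the cell above. The names `TOKEN_RE` and `demo_tokenize` are illustrative copies, not part of the assignment code.

```python
import re

# Same alternation order as the notebook pattern: pan dimensions first,
# then fractions / multi-digit numbers, decimals, degree units, plain words.
TOKEN_RE = re.compile(
    r"\d+\s*[x×]\s*\d+"
    r"|(?:\d+/?\d+)"
    r"|(?:\d+(?:\.\d+)?)"
    r"|(?:°[fc])"
    r"|[a-z]+"
)


def demo_tokenize(text: str) -> list[str]:
    """Lowercase, normalize a few unit spellings, then extract tokens."""
    s = (text or "").lower()
    s = s.replace("degrees f", "°f").replace("degree f", "°f")
    s = s.replace("mins", "min").replace("minutes", "min")
    return TOKEN_RE.findall(s)


tokens = demo_tokenize("Bake at 375 degrees F in a 9x13 pan for 25 minutes")
print(tokens)
# ['bake', 'at', '375', '°f', 'in', 'a', '9x13', 'pan', 'for', '25', 'min']
```

Because temperatures, pan sizes, and durations survive as single tokens, a query that mentions "375" or "9x13" can match the recipe that contains the same detail.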

## Step 0: Setup Phoenix

First, let's set up Phoenix on our local machine. You should run these commands within your terminal in your chosen environment.

(If you have already done this in a previous HW assignment, you are good to go.)

**Boot up Phoenix on localhost**

```phoenix serve```

In [None]:
def register_phoenix():
    global tracer
    if tracer:
        return tracer
    tracer_provider = register(project_name="hw4-rag")
    tracer = tracer_provider.get_tracer(__name__)
    return tracer

In [None]:
tracer = register_phoenix()

#### Set OpenAI API Key

In [None]:
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter your OpenAI API key: ")

# Part 1: Create Your Retrieval Evaluation Dataset

## Step 1: Data Loading & Structuring
_Load the recipe dataset, perform minimal cleaning, and structure the fields we care about (title, ingredients, steps, text). Avoid heavy processing here so experimentation is quick._
- Load and clean the provided RAW_recipes.csv (~5,000 recipes)
- Structure recipe data (ingredients, steps, tags, nutrition)
- Select the ~200 longest recipes by text content for richer evaluation
- Save as hw4/data/processed_recipes.json

In [None]:
def load_raw_recipes() -> pd.DataFrame:
    """
    Try local CSV first. If missing, pull via kagglehub and also cache locally.
    """
    if os.path.exists(LOCAL_RAW_CSV):
        return pd.read_csv(LOCAL_RAW_CSV)
    try:
        df = kagglehub.load_dataset(
            KaggleDatasetAdapter.PANDAS,
            KAGGLE_DATASET,
            KAGGLE_FILE,
        )
        Path(LOCAL_RAW_CSV).parent.mkdir(parents=True, exist_ok=True)
        df.to_csv(LOCAL_RAW_CSV, index=False)
        print(f"✅ Saved raw CSV to {LOCAL_RAW_CSV}")
        return df
    except ModuleNotFoundError as e:
        raise RuntimeError(
            "kagglehub not installed and local CSV missing. "
            "Run `pip install kagglehub[pandas-datasets]` or place the CSV locally."
        ) from e

In [None]:
def clean_text(text):
    if pd.isna(text):
        return ""
    return re.sub(r"\s+", " ", str(text).strip())


def parse_list_column(val):
    """
    Columns like ingredients/steps/tags are stored as Python lists (string form).
    """
    if pd.isna(val):
        return []
    try:
        parsed = ast.literal_eval(val)
        return [clean_text(x) for x in parsed if isinstance(x, str)]
    except Exception:
        return []


def parse_nutrition(val):
    """
    nutrition is a list of floats in this dataset:
    [calories, total fat %, sugar %, sodium %, protein %, sat fat %, carbs %]
    """
    if pd.isna(val):
        return {}
    try:
        parsed = ast.literal_eval(val)
        if not isinstance(parsed, (list, tuple)):
            return {}
        keys = [
            "calories",
            "total_fat",
            "sugar",
            "sodium",
            "protein",
            "saturated_fat",
            "carbohydrates",
        ]
        return {k: float(parsed[i]) if i < len(parsed) else None for i, k in enumerate(keys)}
    except Exception:
        return {}

In [None]:
def structure_recipe(row):
    ingredients = parse_list_column(row.get("ingredients"))
    steps = parse_list_column(row.get("steps") or row.get("directions"))
    tags = parse_list_column(row.get("tags"))
    nutrition = parse_nutrition(row.get("nutrition"))
    description = clean_text(row.get("description"))
    minutes = row.get("minutes")
    title = clean_text(row.get("name") or row.get("title") or "")

    raw_text = (
        f"{title}"
        + "\nIngredients:\n"
        + "\n".join(ingredients)
        + "\nSteps:\n"
        + "\n".join(steps)
        + "\nTags:\n"
        + "\n".join(tags)
    ).strip()

    return {
        "id": str(row.get("id") or row.name),
        "title": title,
        "description": description,
        "minutes": minutes,
        "ingredients": ingredients,
        "steps": steps,
        "tags": tags,
        "nutrition": nutrition,
        "raw_text": raw_text,
    }

In [None]:
print("Loading raw recipes...")
df = load_raw_recipes()

print("Structuring & cleaning...")
structured = [structure_recipe(r) for _, r in df.iterrows()]

print("Selecting top recipes by text length...")
for rec in structured:
    rec["text_length"] = len(rec["raw_text"])
top_recipes = sorted(structured, key=lambda x: x["text_length"], reverse=True)[:TOP_N]

print(f"Saving {len(top_recipes)} recipes to {OUTPUT_JSON}")
Path(OUTPUT_JSON).parent.mkdir(parents=True, exist_ok=True)
with open(OUTPUT_JSON, "w", encoding="utf-8") as f:
    json.dump(top_recipes, f, indent=2, ensure_ascii=False)

## Step 2: Build BM25 Retrieval Engine

- Implement BM25-based recipe search using rank_bm25
- Support index saving/loading for efficiency
- Provide retrieve_bm25(query, corpus, bm25, tokenized_corpus, top_n=5) interface
- Handle recipe ranking and scoring
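
For intuition about what the index computes, the sketch below scores documents with a textbook BM25 formula in plain Python. It is an illustration under stated assumptions, not the rank_bm25 implementation: the function name `bm25_scores` is hypothetical, and it uses a non-negative IDF variant, so absolute scores may differ from `BM25Okapi` even though the ranking idea is the same.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Return one BM25 score per document (illustrative, non-negative IDF)."""
    n_docs = len(tokenized_docs)
    avgdl = sum(len(d) for d in tokenized_docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more hits to score well.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["chocolate", "cake", "pops", "holiday"],
    ["roast", "chicken", "lemon", "thyme"],
    ["vanilla", "cake", "frosting"],
]
scores = bm25_scores(["holiday", "cake", "pops"], docs)
ranking = sorted(range(len(docs)), key=scores.__getitem__, reverse=True)
print(ranking)  # [0, 2, 1]
```

The document matching all three query terms ranks first, the one sharing only "cake" second, and the unrelated one scores zero.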

In [None]:
def build_bm25(tokenized_corpus: List[List[str]]) -> BM25Okapi:
    return BM25Okapi(tokenized_corpus)

In [None]:
def build_or_load_bm25(corpus, index_path, text_field=TEXT_FIELD, force_rebuild=False):
    if not force_rebuild and os.path.exists(index_path):
        with open(index_path, "rb") as f:
            data = pickle.load(f)
        if data.get("corpus_size") == len(corpus):
            return data["bm25"], data["tok_corpus"]

    tok_corpus = [tokenize(doc.get(text_field, "")) for doc in corpus]
    bm25 = BM25Okapi(tok_corpus)
    with open(index_path, "wb") as f:
        pickle.dump(
            {"bm25": bm25, "tok_corpus": tok_corpus, "corpus_size": len(corpus)},
            f,
        )
    return bm25, tok_corpus
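
The cache check above only compares corpus size, so a reordered or edited corpus of the same length would reuse a stale index. One possible stricter check, shown here as a sketch that is not part of the assignment code, is to store a content fingerprint next to the index. The helper name `corpus_fingerprint` is hypothetical.

```python
import hashlib


def corpus_fingerprint(corpus, text_field="raw_text"):
    """Hash document texts in order; any edit or reordering changes the digest."""
    h = hashlib.sha256()
    for doc in corpus:
        h.update(str(doc.get(text_field, "")).encode("utf-8"))
        h.update(b"\x00")  # separator so adjacent texts cannot merge ambiguously
    return h.hexdigest()


fp_a = corpus_fingerprint([{"raw_text": "cake"}, {"raw_text": "pie"}])
fp_b = corpus_fingerprint([{"raw_text": "pie"}, {"raw_text": "cake"}])
fp_c = corpus_fingerprint([{"raw_text": "cake"}, {"raw_text": "pie"}])
print(fp_a == fp_c, fp_a == fp_b)  # True False
```

Storing such a digest in the pickle alongside `corpus_size` and comparing it on load would catch order changes as well as size changes.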

In [None]:
print("Building BM25 index...")
corpus = top_recipes
bm25, tok = build_or_load_bm25(corpus, index_path=DEFAULT_INDEX_PATH)

#### Span helpers: set kind + conditionally create spans
Utility helpers for cleaner tracing:

_set_kind(span, kind) safely sets the OpenInference span kind (defaults to CHAIN, falls back to CHAIN if an unknown kind is passed).

span_if(enabled, name, *, attrs=None, kind="CHAIN") is a context manager that only creates a span when enabled=True, sets its kind, applies optional attributes, and yields the span. Use it to toggle fine-grained spans without cluttering the code.

In [None]:
def _set_kind(span, kind: str = "CHAIN"):
    span.set_attribute(
        SpanAttributes.OPENINFERENCE_SPAN_KIND,
        getattr(OpenInferenceSpanKindValues, kind.upper(), OpenInferenceSpanKindValues.CHAIN).value,
    )


@contextmanager
def span_if(enabled: bool, name: str, *, attrs: Dict[str, Any] | None = None, kind: str = "CHAIN"):
    if not enabled:
        yield None
        return
    with tracer.start_as_current_span(name) as s:
        _set_kind(s, kind)
        if attrs:
            for k, v in attrs.items():
                s.set_attribute(k, v)
        yield s
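
The conditional-span pattern can be tried without a tracing backend. In this sketch, `fake_span` and `recorded` are hypothetical stand-ins for the tracer, used only to show that nothing is recorded when the flag is off.

```python
from contextlib import contextmanager

recorded = []  # stand-in for a tracing backend (hypothetical)


@contextmanager
def fake_span(name):
    """Pretend span: remembers its name and yields a dict as the span object."""
    recorded.append(name)
    yield {"name": name}


@contextmanager
def span_if_sketch(enabled, name):
    """Yield a span only when enabled; otherwise yield None and do nothing."""
    if not enabled:
        yield None
        return
    with fake_span(name) as s:
        yield s


with span_if_sketch(False, "query.tokenize") as s_off:
    pass
with span_if_sketch(True, "bm25.score") as s_on:
    pass

print(s_off, s_on, recorded)
```

Callers keep a single `with` block per step and decide at call time whether that step shows up in the trace.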

#### BM25 retriever with tracing
@retriever("retrieval.bm25") emits a RETRIEVER span and logs each top-k doc (id, score, short content, metadata) for Phoenix.

retrieve_bm25(..., emit_detail_spans=False) runs BM25 and, if enabled, adds tiny child spans (query.normalize → query.tokenize → bm25.score/bm25.rank → results.build).

Returns ranked hits with useful metadata (matched terms, doc length, etc.).
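
The retriever span stores each hit under flattened, index-based attribute names. The sketch below builds the same kind of key/value mapping as a plain dict so the layout is easy to inspect. `flatten_hits` is a hypothetical helper for illustration, and the sample hit is made up.

```python
import json


def flatten_hits(hits, max_chars=1200):
    """Map ranked hits to flat 'retrieval.documents.<i>.document.*' attributes."""
    attrs = {}
    for i, hit in enumerate(hits):
        base = f"retrieval.documents.{i}.document"
        attrs[f"{base}.id"] = str(hit.get("id"))
        if hit.get("score") is not None:
            attrs[f"{base}.score"] = float(hit["score"])
        # Truncate and single-line the text so span payloads stay small.
        attrs[f"{base}.content"] = (hit.get("text") or "")[:max_chars].replace("\n", " ")
        if hit.get("metadata"):
            attrs[f"{base}.metadata"] = json.dumps(hit["metadata"], separators=(",", ":"))
    return attrs


example = flatten_hits(
    [{"id": 42, "score": 7.5, "text": "Cake pops\nSteps", "metadata": {"rank": 1}}]
)
for key, value in example.items():
    print(key, "=", value)
```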

In [None]:
def retriever(name: str = "retrieval.bm25"):
    def outer(fn):
        @functools.wraps(fn)
        def inner(*a, **kw):
            q = kw.get("query", a[0] if a else None)
            with tracer.start_as_current_span(name) as s:
                s.set_attribute(
                    SpanAttributes.OPENINFERENCE_SPAN_KIND,
                    OpenInferenceSpanKindValues.RETRIEVER.value,
                )
                if q is not None:
                    s.set_attribute(SpanAttributes.INPUT_VALUE, str(q))
                res = fn(*a, **kw)
                for i, r in enumerate(res):
                    base = f"retrieval.documents.{i}.document"
                    s.set_attribute(f"{base}.id", str(r.get("id")))
                    sc = r.get("score")
                    if sc is not None:
                        s.set_attribute(f"{base}.score", float(sc))
                    s.set_attribute(
                        f"{base}.content", (r.get("text", "") or "")[:1200].replace("\n", " ")
                    )
                    meta = r.get("metadata")
                    if meta:
                        s.set_attribute(
                            f"{base}.metadata",
                            json.dumps(meta, ensure_ascii=False, separators=(",", ":")),
                        )
                return res

        return inner

    return outer


@retriever("retrieval.bm25")
def retrieve_bm25(
    query: str,
    corpus: List[Dict[str, Any]],
    bm25: BM25Okapi,
    tokenized_corpus: List[List[str]],
    top_n: int = 5,
    text_field: str = TEXT_FIELD,
    title_field: str = TITLE_FIELD,
    id_field: str = ID_FIELD,
    *,
    emit_detail_spans: bool = False,
) -> List[Dict[str, Any]]:
    q_raw = _to_text(query)
    with span_if(emit_detail_spans, "query.normalize", attrs={"query.len": len(q_raw)}):
        q_norm = q_raw.strip()

    with span_if(emit_detail_spans, "query.tokenize"):
        qtok = tokenize(q_norm)

    with span_if(
        emit_detail_spans,
        "bm25.score",
        attrs={"tokens": len(qtok), "N_docs": len(tokenized_corpus)},
    ):
        scores = bm25.get_scores(qtok)

    with span_if(emit_detail_spans, "bm25.rank", attrs={"k": int(top_n)}):
        top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_n]

    out: List[Dict[str, Any]] = []
    with span_if(emit_detail_spans, "results.build", attrs={"k": int(top_n)}):
        for rank, idx in enumerate(top, 1):
            d = corpus[idx]
            dt = tokenized_corpus[idx]
            out.append(
                {
                    "rank": rank,
                    "score": float(scores[idx]),
                    "id": d.get(id_field),
                    "title": d.get(title_field, ""),
                    "text": d.get(text_field, ""),
                    "index": idx,
                    "metadata": {
                        **{k: _jsonish(v) for k, v in d.items() if k != text_field},
                        "rank": rank,
                        "bm25_score": float(scores[idx]),
                        "matched_terms": sorted(set(qtok) & set(dt)),
                        "doc_len_terms": len(dt),
                    },
                }
            )
    return out

In [None]:
query = "holiday cake pops"
hits = retrieve_bm25(query, corpus, bm25, tok, top_n=5)
print(hits)

## Step 3: Generate Synthetic Queries

- Use LLM to generate realistic cooking queries
- Focus on complex scenarios requiring specific recipe knowledge
- Use llm_generate with its concurrency setting for parallel LLM calls
- Generate 100+ queries with salient facts
- Save as hw4/data/synthetic_queries.json

In [None]:
SYSTEM_PROMPT = """You are an advanced user of a recipe search engine.
Given a recipe, write ONE realistic, conversational cooking question that depends on a precise, technical detail
that is clearly contained in THIS recipe. Focus on:
1) Specific methods (e.g., marinate 4 hours, bake at 375°F for 25 minutes)
2) Appliance settings (e.g., air fryer 400°F for 12 minutes, pressure cook 8 minutes)
3) Ingredient prep details (e.g., slice onions paper-thin, whip cream to soft peaks)
4) Timing specifics (e.g., rest dough 30 minutes, simmer 45 minutes)
5) Temperature precision (e.g., internal 165°F, oil 350°F)

Return EXACTLY a single JSON object:
{"query":"...?","salient_fact":"<exact quote or tight paraphrase>"}"""

_TEMPLATE = "Recipe ID: {id}\nTitle: {title}\nKey ingredients: {ingredients}\nFirst instruction: {first_step}\n"

#### Parse & normalize LLM output
_norm_q(q) — lowercase, strip punctuation, collapse spaces → dedupe-friendly key.

_parse_json(text) — parse JSON; if there’s extra text, grab the first {...} block and load it.

_parse_and_validate(text) — require query + salient_fact, append “?” if missing, enforce a minimum length, and ensure the fact contains a concrete technical detail (time/temp/ratio/etc.). Returns the cleaned pair or None.

In [None]:
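# Translation table used by _norm_q; defined here so this cell runs on its own.
_PUNCT = str.maketrans("", "", string.punctuation)


def _norm_q(q: str) -> str:
    s = _to_text(q).lower().translate(_PUNCT).strip()
    return re.sub(r"\s+", " ", s)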


def _parse_json(text: str) -> Optional[dict]:
    s = _to_text(text).strip()
    if not s:
        return None
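    try:
        return json.loads(s)
    except Exception:
        # Fall back to the {...} span found by the regex; a malformed block yields None
        # instead of raising inside the output parser.
        m = _JSON_RE.search(s)
        if not m:
            return None
        try:
            return json.loads(m.group(0))
        except Exception:
            return None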


def _parse_and_validate(text: str) -> Optional[Dict[str, str]]:
    obj = _parse_json(text)
    if not isinstance(obj, dict):
        return None
    q = _to_text(obj.get("query")).strip()
    f = _to_text(obj.get("salient_fact")).strip()
    if not q or not f:
        return None
    if not q.endswith("?"):
        q += "?"
    if len(q) < 25:
        return None
    if not _TECH.search(f):
        return None
    return {"query": q, "salient_fact": f}
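
LLM replies often wrap the JSON in extra chatter, which is why parsing falls back to a regex. The standalone sketch below shows that fallback on made-up replies. `extract_json` and `JSON_BLOCK` are illustrative names, not the notebook helpers.

```python
import json
import re

# Greedy match from the first "{" to the last "}" in the reply.
JSON_BLOCK = re.compile(r"\{.*\}", re.S)


def extract_json(reply: str):
    """Parse the reply as JSON, falling back to an embedded {...} span."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        m = JSON_BLOCK.search(reply)
        if not m:
            return None
        try:
            return json.loads(m.group(0))
        except json.JSONDecodeError:
            return None


chatty = (
    "Sure! Here you go:\n"
    '{"query": "How long do I bake it at 375°F?", '
    '"salient_fact": "bake at 375°F for 25 minutes"}'
)
parsed = extract_json(chatty)
print(parsed["salient_fact"])
print(extract_json("no json here"), extract_json("{broken"))
```

Replies that contain no recoverable object come back as `None`, which the validation step treats as a rejected row.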

#### Build a batch DataFrame
_make_batch_df(recipes, start, batch_rows) takes a slice of recipes starting at start and grabs batch_rows items (wrapping around to the beginning if needed).

For each recipe, it collects:

- id and title

- ingredients: the first 6 ingredients, joined as a comma-separated string

- first_step: the first instruction, trimmed to 200 characters

- _recipe: the full original recipe dict (kept for later)

Returns a tidy pandas DataFrame with one row per recipe.
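
The wrap-around is plain modular indexing. A minimal sketch, using a hypothetical helper `wrapped_indices`, shows which positions a batch touches:

```python
def wrapped_indices(n_items: int, start: int, batch_rows: int) -> list[int]:
    """Indices visited by a batch that wraps past the end of the list."""
    return [(start + i) % n_items for i in range(batch_rows)]


idx = wrapped_indices(n_items=5, start=3, batch_rows=4)
print(idx)  # [3, 4, 0, 1]

# A batch larger than the list revisits items, so duplicates are expected.
print(wrapped_indices(n_items=3, start=0, batch_rows=5))  # [0, 1, 2, 0, 1]
```

When `batch_rows` exceeds the number of recipes, the same recipe appears more than once in a batch, which is one reason the generated queries are deduplicated afterwards.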

In [None]:
def _make_batch_df(recipes: List[Dict[str, Any]], start: int, batch_rows: int) -> pd.DataFrame:
    rows, n = [], len(recipes)
    for i in range(batch_rows):
        r = recipes[(start + i) % n]
        ings = r.get("ingredients") or []
        step0 = (r.get("steps") or [""])[0]
        rows.append(
            {
                "id": _to_text(r.get("id")),
                "title": _to_text(r.get("title")),
                "ingredients": ", ".join(_to_text(x) for x in ings[:6]),
                "first_step": _to_text(step0)[:200],
                "_recipe": r,
            }
        )
    return pd.DataFrame(rows)


_PUNCT = str.maketrans("", "", string.punctuation)

#### Generate synthetic queries
Uses llm_generate to create n unique, validated cooking questions from recipe batches.

Dedupes via normalized text and raises the sampling temperature by 0.1 (capped at 0.8) whenever a batch adds no new queries.

Saves results (with recipe metadata) to out_path and returns the file path.
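
The dedupe step compares a normalized key rather than the raw question. The sketch below reproduces that idea in isolation. `norm_key` and `dedupe` are illustrative names and the sample questions are made up.

```python
import re
import string

PUNCT = str.maketrans("", "", string.punctuation)


def norm_key(q: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace."""
    return re.sub(r"\s+", " ", q.lower().translate(PUNCT).strip())


def dedupe(queries):
    """Keep the first occurrence of each normalized question."""
    seen, kept = set(), []
    for q in queries:
        key = norm_key(q)
        if key in seen:
            continue
        seen.add(key)
        kept.append(q)
    return kept


kept = dedupe(
    [
        "How long should I bake it at 375°F?",
        "how long should I bake it at 375°F",
        "How long do I rest the dough?",
    ]
)
print(kept)
```

The first two questions differ only in case and the trailing question mark, so they share a key and only the first is kept.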

In [None]:
def generate_synthetic_queries(
    recipes: List[Dict[str, Any]],
    n: int = NUM_QUERIES,
    out_path: str = SYN_QUERY_PATH,
    max_workers: int = MAX_WORKERS,
):
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
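    # Shuffle a copy: the caller passes the same list object that backs `corpus`, and an
    # in-place shuffle would misalign it with the BM25 index built from the original order.
    recipes = random.sample(recipes, len(recipes))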

    model = OpenAIModel(
        model=QUERY_MODEL,
        temperature=0.25,
        max_tokens=100,
        request_timeout=TIMEOUT_S,
    )

    unique: Dict[str, Dict[str, Any]] = {}
    seen_norm: set[str] = set()
    start_idx = 0

    with tqdm(total=n, desc="synthetic", mininterval=0.2) as bar:
        while len(unique) < n:
            need = n - len(unique)
            batch_rows = max(256, min(need * 3, 1024))
            df = _make_batch_df(recipes, start_idx, batch_rows)
            start_idx = (start_idx + batch_rows) % len(recipes)

            out = llm_generate(
                dataframe=df,
                template=_TEMPLATE,
                model=model,
                system_instruction=SYSTEM_PROMPT,
                output_parser=lambda txt, row_idx: (_parse_and_validate(txt) or {}),
                run_sync=True,
                concurrency=min(max_workers * 3, 48),
            )

            added = 0
            for i, row in out.iterrows():
                q = _to_text(row.get("query"))
                f = _to_text(row.get("salient_fact"))
                if not q or not f:
                    continue
                key = _norm_q(q)
                if key in seen_norm:
                    continue

                base = df.iloc[i]["_recipe"]
                unique[q] = {
                    "query": q,
                    "salient_fact": f,
                    "source_recipe_id": base.get("id"),
                    "recipe_name": _to_text(base.get("title")),
                    "ingredients": base.get("ingredients", []),
                    "tags": base.get("tags", []),
                }
                seen_norm.add(key)
                added += 1
                if len(unique) >= n:
                    break

            acc = (added / len(out)) if len(out) else 0.0
            bar.set_postfix({"batch": len(out), "added": added, "acc%": f"{acc * 100:.0f}"})
            if added:
                bar.update(added)
            else:
                model.temperature = min(model.temperature + 0.1, 0.8)

    with open(out_path, "w", encoding="utf-8") as fp:
        json.dump(list(unique.values()), fp, indent=2, ensure_ascii=False)
    print(f"✅ Saved {len(unique)} queries → {out_path}")
    return out_path

In [None]:
print("Generating synthetic queries...")
generate_synthetic_queries(top_recipes)

#### Upload synthetic queries to Phoenix
Loads your JSON of queries, maps to Phoenix-friendly columns: input (query) and output (salient_fact).

Flattens optional metadata (recipe_name, source_recipe_id, ingredients, tags) for tracking.

Auto-generates a dataset name (or uses dataset_name) and uploads via AsyncClient().datasets.create_dataset(...).
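
The column mapping can be sketched on a single record with plain dictionaries. `to_dataset_row` is a hypothetical helper, and the sample item is made up. The notebook itself does this with a pandas DataFrame.

```python
import json


def to_dataset_row(item: dict) -> dict:
    """Map one synthetic-query record to input/output plus flattened metadata."""
    row = {
        "input": str(item.get("query") or ""),
        "output": str(item.get("salient_fact") or ""),
    }
    for key in ("recipe_name", "source_recipe_id"):
        if key in item:
            row[key] = "" if item[key] is None else str(item[key])
    for key in ("ingredients", "tags"):
        if key in item:
            value = item[key]
            # Lists and dicts become JSON strings so every metadata value is a string.
            if isinstance(value, (list, dict)):
                row[key] = json.dumps(value, ensure_ascii=False)
            else:
                row[key] = "" if value is None else str(value)
    return row


row = to_dataset_row(
    {
        "query": "How long should the sauce simmer?",
        "salient_fact": "simmer 45 minutes",
        "source_recipe_id": 123,
        "ingredients": ["onion", "garlic"],
    }
)
print(row)
```

Keeping `source_recipe_id` as a string in the metadata matters later, because the evaluators compare it against retrieved IDs as strings.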


In [None]:
async def upload_synthetic_queries_to_phoenix(
    json_path: str = SYN_QUERY_PATH,
    dataset_name: str | None = None,
):
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    df = pd.DataFrame(data)

    if "query" not in df or "salient_fact" not in df:
        raise ValueError("Expected keys 'query' and 'salient_fact' in the JSON items.")

    df = df.copy()
    df["input"] = df["query"].fillna("").astype(str)
    df["output"] = df["salient_fact"].fillna("").astype(str)

    meta_cols = []
    for col in ("recipe_name", "source_recipe_id"):
        if col in df:
            df[col] = df[col].fillna("").astype(str)
            meta_cols.append(col)
    for col in ("ingredients", "tags"):
        if col in df:
            df[col] = df[col].apply(
                lambda x: json.dumps(x, ensure_ascii=False)
                if isinstance(x, (list, dict))
                else ("" if pd.isna(x) else str(x))
            )
            meta_cols.append(col)

    upload_df = df[["input", "output"] + meta_cols]
    ds_name = (
        dataset_name or f"recipes-synth-queries-{len(upload_df)}-{time.strftime('%Y%m%d-%H%M%S')}"
    )
    px_client = AsyncClient()
    ds = await px_client.datasets.create_dataset(
        dataframe=upload_df,
        name=ds_name,
        input_keys=["input"],
        output_keys=["output"],
        metadata_keys=meta_cols if meta_cols else [],
    )
    print(f"✅ Uploaded dataset '{getattr(ds, 'name', ds)}' with {len(upload_df)} rows")
    return ds

In [None]:
print("Uploading synthetic queries as a dataset in Phoenix...")
await upload_synthetic_queries_to_phoenix(SYN_QUERY_PATH, dataset_name="synthetic_queries")

# Part 2: Evaluate the BM25 Retriever
## Step 4: Implement Evaluation

- Load synthetic queries and retrieval engine
- For each query, run retrieve_bm25() and record results
- Calculate standard IR metrics:
    - **Recall@1**: target recipe at rank 1
    - **Recall@3**: target recipe in the top 3
    - **Recall@5**: target recipe in the top 5
    - **MRR**: Mean Reciprocal Rank
- Save detailed results to hw4/results/retrieval_evaluation.json
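
As a worked example of the metrics listed above, the sketch below scores three made-up queries, each with exactly one target document. The function names and the toy rankings are illustrative only.

```python
def recall_at_k(ranked_ids, target_id, k):
    """1.0 if the single target appears in the top k, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids, target_id):
    """1/rank of the target, or 0.0 when it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, 1):
        if doc_id == target_id:
            return 1.0 / rank
    return 0.0


runs = [
    (["a", "b", "c"], "a"),  # target at rank 1
    (["x", "y", "z"], "z"),  # target at rank 3
    (["p", "q", "r"], "m"),  # target not retrieved
]

recall1 = sum(recall_at_k(ids, t, 1) for ids, t in runs) / len(runs)
recall3 = sum(recall_at_k(ids, t, 3) for ids, t in runs) / len(runs)
mrr = sum(reciprocal_rank(ids, t) for ids, t in runs) / len(runs)
print(round(recall1, 3), round(recall3, 3), round(mrr, 3))  # 0.333 0.667 0.444
```

With one relevant recipe per query, Recall@k is simply a hit rate, and MRR rewards placing that recipe nearer the top: (1 + 1/3 + 0) / 3 ≈ 0.444.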

Let's track our traces in our Phoenix project!

In [None]:
def run_queries_to_phoenix(
    items: Union[str, List[Union[str, Dict[str, Any]]]],
    corpus,
    bm25,
    tokenized_corpus,
    *,
    top_k: int = 5,
    tracer=None,
):
    """
    For each query, emit a 'rag.query' span; your @retriever-decorated retrieve_bm25()
    will emit the child 'retrieval.bm25' span with Phoenix-friendly doc attrs.
    """
    tr = tracer or globals().get("tracer") or register_phoenix()
    span_cm = (lambda name: tr.start_as_current_span(name)) if tr else (lambda name: nullcontext())

    if isinstance(items, str) and (items.endswith(".json") or os.path.exists(items)):
        items = json.loads(Path(items).read_text(encoding="utf-8"))

    def _txt(x):
        return "" if x is None else str(x)

    def _norm(x):
        return "" if x is None else str(x)

    for it in items:
        if isinstance(it, str):
            q = it.strip()
            salient = None
            target_id = None
        else:
            q = _txt(it.get("query") or it.get("input")).strip()
            salient = _txt(it.get("salient_fact") or it.get("output")).strip() or None
            target_id = _norm(it.get("source_recipe_id")) or None

        with span_cm("rag.query") as root:
            if tr:
                root.set_attribute(
                    SpanAttributes.OPENINFERENCE_SPAN_KIND, OpenInferenceSpanKindValues.CHAIN.value
                )
                root.set_attribute(SpanAttributes.INPUT_VALUE, q)
                root.set_attribute("eval.synthetic", True)
                if salient:
                    root.set_attribute("eval.salient_fact", salient)
                if target_id:
                    root.set_attribute("eval.target_id", target_id)
                root.set_attribute("retriever.top_k", int(top_k))

            t0 = perf_counter()
            hits = retrieve_bm25(q, corpus, bm25, tokenized_corpus, top_n=top_k)
            if tr:
                root.set_attribute("retrieval.latency_ms", int((perf_counter() - t0) * 1000))
                root.set_attribute(
                    "retrieval.top_ids", json.dumps([_norm(h.get("id")) for h in hits])
                )

In [None]:
print("Running queries to Phoenix...")
run_queries_to_phoenix(SYN_QUERY_PATH, corpus, bm25, tok, top_k=5, tracer=tracer)

We can also run our retriever on the synthetic queries dataset we have uploaded.

In [None]:
def make_bm25_task(
    corpus,
    bm25,
    tokenized_corpus,
    k: int,
    collector: list,
    show_top_k: int = 5,
    tracer=None,
):
    tr = tracer or globals().get("tracer") or register_phoenix()

    def _span(name):
        return tr.start_as_current_span(name) if tr else nullcontext()

    def bm25_task(example):
        q = (example["input"].get("input") or example["input"].get("query") or "").strip()
        salient = (
            example["output"].get("output") or example["output"].get("salient_fact") or ""
        ).strip()
        md = example["metadata"] or {}
        inp = example["input"] or {}
        target_id = _norm_id(md.get("source_recipe_id") or inp.get("source_recipe_id"))

        with _span("rag.query") as root:
            if tr:
                root.set_attribute(
                    SpanAttributes.OPENINFERENCE_SPAN_KIND, OpenInferenceSpanKindValues.CHAIN.value
                )
                root.set_attribute(SpanAttributes.INPUT_VALUE, q)
                if salient:
                    root.set_attribute("eval.salient_fact", salient)
                root.set_attribute("retriever.top_k", int(k))

            t0 = perf_counter()
            # Retrieve k candidates (previously hard-coded to 5, which ignored `k`).
            hits = retrieve_bm25(q, corpus, bm25, tokenized_corpus, top_n=k)

            keep = min(show_top_k, len(hits))
            top = hits[:keep]
            top_ids = [_norm_id(h.get("id")) for h in top]
            top_titles = [str(h.get("title") or "") for h in top]

            if tr:
                root.set_attribute("retrieval.latency_ms", int((perf_counter() - t0) * 1000))
                root.set_attribute("retrieval.top_ids", json.dumps(top_ids))

        hit_rank = 0
        if target_id:
            for r, h in enumerate(hits, 1):
                if _norm_id(h.get("id")) == target_id:
                    hit_rank = r
                    break
        r1 = 1.0 if hit_rank == 1 else 0.0
        r3 = 1.0 if (hit_rank and hit_rank <= 3) else 0.0
        r5 = 1.0 if (hit_rank and hit_rank <= 5) else 0.0
        rr = 1.0 / hit_rank if hit_rank else 0.0

        collector.append(
            {
                "query": q,
                "salient_fact": salient,
                "target_id": target_id or None,
                "top_k": k,
                "hit_rank": hit_rank or None,
                "recall@1": r1,
                "recall@3": r3,
                "recall@5": r5,
                "rr": rr,
                "top5": [{"id": i, "title": t} for i, t in zip(top_ids, top_titles)],
            }
        )

        return {"top_ids": top_ids, "top_titles": top_titles}

    return bm25_task

## 🕰️ Evaluation Time

#### IR metric evaluators
_recall_at_k_generic(k, ...): Checks if the ground-truth recipe ID appears in the model’s top_ids. Returns 1.0 if it’s within the top-k, else 0.0.

RecallAt1 / RecallAt3 / RecallAt5: Thin wrappers around the generic function for k = 1, 3, 5.

MRR: Mean Reciprocal Rank for a single example. If the ground-truth ID is at rank r in top_ids, returns 1/r; returns 0.0 if it’s not found.

<img alt="Document Retrieval Evaluation Image" src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/phoenix-docs-images/ir_metrics_for_rag.png" width=1000/>

In [None]:
def _recall_at_k_generic(
    k: int, input=None, output=None, expected=None, metadata=None, reference=None
) -> float:
    gt = _norm_id((metadata or {}).get("source_recipe_id") or (input or {}).get("source_recipe_id"))
    if not gt:
        return 0.0
    top_ids = (output or {}).get("top_ids", []) if isinstance(output, dict) else []
    for rank, did in enumerate(top_ids, 1):
        if _norm_id(did) == gt:
            return 1.0 if rank <= k else 0.0
    return 0.0


def RecallAt1(input=None, output=None, expected=None, metadata=None, reference=None) -> float:
    return _recall_at_k_generic(
        1, input=input, output=output, expected=expected, metadata=metadata, reference=reference
    )


def RecallAt3(input=None, output=None, expected=None, metadata=None, reference=None) -> float:
    return _recall_at_k_generic(
        3, input=input, output=output, expected=expected, metadata=metadata, reference=reference
    )


def RecallAt5(input=None, output=None, expected=None, metadata=None, reference=None) -> float:
    return _recall_at_k_generic(
        5, input=input, output=output, expected=expected, metadata=metadata, reference=reference
    )


def MRR(input=None, output=None, expected=None, metadata=None, reference=None) -> float:
    gt = _norm_id((metadata or {}).get("source_recipe_id") or (input or {}).get("source_recipe_id"))
    if not gt:
        return 0.0
    top_ids = (output or {}).get("top_ids", []) if isinstance(output, dict) else []
    for rank, did in enumerate(top_ids, 1):
        if _norm_id(did) == gt:
            return 1.0 / rank
    return 0.0

#### Run BM25 experiment
Runs BM25 over a Phoenix dataset by name, logs spans, and scores Recall@1/3/5 + MRR.

Collects per-query results, computes a summary, and saves everything to out_path as JSON.

Returns the Phoenix experiment and the results file path.
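
For offline review, the saved payload has a `summary` block and a `results` list with one entry per query. A minimal sketch for pulling out the misses, assuming that shape, could look like the following. `list_misses` and the sample payload are hypothetical.

```python
def list_misses(payload: dict, k: int = 5) -> list[str]:
    """Queries whose target was not retrieved, or was ranked below k."""
    return [
        r["query"]
        for r in payload["results"]
        if not r.get("hit_rank") or r["hit_rank"] > k
    ]


payload = {
    "summary": {"examples": 3},
    "results": [
        {"query": "q1", "hit_rank": 1},
        {"query": "q2", "hit_rank": None},  # target never retrieved
        {"query": "q3", "hit_rank": 7},  # retrieved, but outside the top 5
    ],
}
misses = list_misses(payload)
print(misses)  # ['q2', 'q3']
```

Reading the missed queries next to their `top5` titles is usually the quickest way to spot where the tokenizer or BM25 falls short.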

In [None]:
async def run_bm25_experiment_on_ds(
    corpus,
    bm25,
    tokenized_corpus,
    dataset_name: str,
    k: int = 50,
    show_top_k: int = 5,
    out_path: str = EVAL_RESULTS_PATH,
):
    px_client = AsyncClient()
    ds = await px_client.datasets.get_dataset(dataset=dataset_name)

    collector = []
    task = make_bm25_task(
        corpus, bm25, tokenized_corpus, k=k, collector=collector, show_top_k=show_top_k
    )

    evaluators = [RecallAt1, RecallAt3, RecallAt5, MRR]
    experiment = run_experiment(dataset=ds, task=task, evaluators=evaluators)
    print("✅ Experiment finished.")

    n = len(collector) or 1
    sum_r1 = sum(r["recall@1"] for r in collector)
    sum_r3 = sum(r["recall@3"] for r in collector)
    sum_r5 = sum(r["recall@5"] for r in collector)
    sum_rr = sum(r["rr"] for r in collector)

    summary = {
        "examples": n,
        "k": k,
        "show_top_k": show_top_k,
        "recall@1": round(sum_r1 / n, 4),
        "recall@3": round(sum_r3 / n, 4),
        "recall@5": round(sum_r5 / n, 4),
        "mrr": round(sum_rr / n, 4),
        "hit_rate@show_top_k": round(
            sum(1.0 for r in collector if r["hit_rank"] and r["hit_rank"] <= show_top_k) / n, 4
        ),
    }

    payload = {"summary": summary, "results": collector}
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    Path(out_path).write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
    print(f"💾 Saved detailed results → {out_path}")

    return experiment, out_path

In [None]:
print("Running BM25 experiment on synthetic queries...")
experiment, path = await run_bm25_experiment_on_ds(
    corpus, bm25, tok, dataset_name="synthetic_queries", k=TOP_N
)

## Optional: Document Retrieval Relevance Evaluator

<img alt="Document Retrieval Evaluation Image" src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/phoenix-docs-images/documentRelevanceDiagram.png" width="1000"/>

In [None]:
retrieved_documents_df = get_retrieved_documents(
    px.Client(), project_name=experiment["project_name"], timeout=None
)
retrieved_documents_df

In [None]:
eval_model = OpenAIModel(model="gpt-4")
relevance_evaluator = RelevanceEvaluator(eval_model)

#### RAG Eval Template
> You are comparing a reference text to a question and trying to determine if the reference text
>
> contains information relevant to answering the question. Here is the data:
>
>>    [BEGIN DATA]
>>
>>    [Question] : query
>>
>>    [Reference text] : reference
>>
>>    [END DATA]
>
> Compare the Question above to the Reference text. You must determine whether the Reference text
>
> contains information that can answer the Question. Please focus on whether the very specific
>
> question can be answered by the information in the Reference text.
>
> Your response must be single word, either "relevant" or "unrelated",
>
> and should not contain any text or characters aside from that word.
>
> "unrelated" means that the reference text does not contain an answer to the Question.
>
> "relevant" means the reference text contains an answer to the Question.


In [None]:
retrieved_documents_relevance_df = run_evals(
    evaluators=[relevance_evaluator],
    dataframe=retrieved_documents_df,
    provide_explanation=True,
    concurrency=20,
)[0]
retrieved_documents_relevance_df

In [None]:
px.Client().log_evaluations(
    DocumentEvaluations(eval_name="Retrieval Relevance", dataframe=retrieved_documents_relevance_df)
)