<a href="https://colab.research.google.com/github/EsmaeilNarimissa/Dialectal-Retrieval-Bias/blob/main/rag_bias_poc_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Phase 2: RAG Bias Detection PoC (Retrieval-only)**

A proof-of-concept notebook evaluating dialectal bias in information retrieval. This analysis compares sparse (BM25) and dense (OpenAI) retrieval systems using paired SAE vs. AAVE queries **generated in a separate Phase 1 pipeline**. Performance is measured against the SQuAD corpus using a suite of metrics—including **Recall@k** across multiple cutoffs, **rank-of-gold**, and **hybrid recall**—with findings validated by paired significance tests (McNemar, Wilcoxon) to ensure statistical rigor.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## **2.1 Setup: Configure, Initialize, and Load Paired Queries**

- Define experiment settings: dataset path, top_k, enabled retrievers (BM25, Dense), embedding model, dry-run size, random seed, and light-cleanup toggle.
- Install and import dependencies; seed RNGs; load OpenAI API key from Colab secrets; initialize OpenAI client; print Python and SDK versions.
- Provide minimal text utilities: normalize quotes/whitespace; optionally convert “in’/in'” → “ing” at word endings.
- Load Phase 1 paired SAE/AAVE dataset; assert presence of squad_idx; select essential columns (id, squad_idx, sae_query, aave_query); apply optional light cleanup; support dry-run subset; preview and report loaded pair count.
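
The two text utilities are small enough to sketch in isolation. The version below is a standalone illustration, not the notebook's exact code: the names `normalize_quotes_ws` and `restore_ing` are hypothetical, and the negative lookahead `(?!\w)` is one assumed way to end the match, chosen because a trailing `\b` cannot match between an apostrophe and a following space.

```python
import re

def normalize_quotes_ws(s):
    """Map curly quotes to ASCII quotes and collapse runs of whitespace."""
    if s is None:
        return ""
    s = s.replace("\u2019", "'").replace("\u2018", "'")
    s = s.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", s).strip()

def restore_ing(q):
    """Rewrite a word-final in-apostrophe ending as 'ing' (hypothetical helper)."""
    q = normalize_quotes_ws(q)
    # (?!\w) requires that no word character follows the apostrophe,
    # so contractions such as "ain't" are left untouched.
    return re.sub(r"\b([A-Za-z]+)in'(?!\w)", r"\1ing", q)

print(restore_ing("What  they doin\u2019 in the mornin'?"))  # What they doing in the morning?
print(restore_ing("Why ain't it workin'?"))                  # Why ain't it working?
```

Checking a contraction alongside a stylized ending is a cheap way to confirm the pattern only touches word-final forms.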

In [None]:
# Config (PoC: do not add extra knobs)
aave_dataset_path = "data/aave_poc_dataset_20250927-193921.json"  # paired JSON with id, squad_idx, sae_query, aave_query
top_k = 10
use_bm25 = True
use_dense = True
embed_model = "text-embedding-3-small"  # OpenAI embeddings
dry_run_n = None  # e.g., 50 for smoke test; None for full
random_seed = 42

# Optional light cleanup
apply_light_cleanup = True  # only normalize quotes and convert "in'" -> "ing" at word-end

In [None]:
# Minimal installs for PoC
!pip -q install rank_bm25 chromadb datasets statsmodels "openai>=1.0.0" > /dev/null

import os, json, time, re, random, sys # Import sys for version check
import numpy as np
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset
from rank_bm25 import BM25Okapi
import chromadb
from chromadb.config import Settings
from openai import OpenAI, __version__ as openai_version # Import openai_version
from statsmodels.stats.contingency_tables import mcnemar
from google.colab import userdata

# Seeds
random.seed(random_seed)
np.random.seed(random_seed)

# Set the OpenAI API key from Colab secrets
OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")

if not OPENAI_API_KEY:
    print("ERROR: OPENAI_API_KEY not found in Colab secrets!")
    print("Please set your OpenAI API key in the Colab Secrets tab (icon on the left).")
else:
    print("OpenAI API key loaded successfully from Colab secrets")
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY


# OpenAI client (requires OPENAI_API_KEY in env)
openai_client = OpenAI()
print("OpenAI Client initialized.")

# Verify OpenAI SDK and Python versions
print("\nPython:", sys.version) # Use sys for Python version
try:
    print("OpenAI SDK version:", openai_version) # Use imported openai_version
except Exception as e:
    print("Could not read OpenAI SDK version:", repr(e))


def normalize_text(s: str) -> str:
    if s is None:
        return ""
    # normalize curly quotes and whitespace only (do NOT drop possessives)
    s = s.replace("’","'").replace("‘","'").replace("“",'"').replace("”",'"')
    s = re.sub(r"\s+", " ", s).strip()
    return s

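def light_cleanup_query(q: str) -> str:
    q = normalize_text(q)
    # Convert stylized word-final endings only: "in’" or "in'" -> "ing".
    # (?!\w) ends the match only when no word character follows the apostrophe.
    # A trailing \b would do the opposite: it fires on "ain't" and skips "goin' home".
    q = re.sub(r"\b([A-Za-z]+)in[’'](?!\w)", r"\1ing", q)
    return q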

OpenAI API key loaded successfully from Colab secrets
OpenAI Client initialized.

Python: 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]
OpenAI SDK version: 1.108.0


In [None]:
with open(aave_dataset_path, "r", encoding="utf-8") as f:
    pairs = json.load(f)

# Load all columns first, then assert presence of squad_idx
df_pairs = pd.DataFrame(pairs).copy()
cols = df_pairs.columns.tolist()
print("Columns in dataset:", cols)

assert "squad_idx" in df_pairs.columns, (
    f"Dataset missing 'squad_idx'. Check aave_dataset_path:\n{aave_dataset_path}\n"
    "You may be loading an older JSON without squad_idx."
)

# Keep only the needed columns (including squad_idx) and sort by id
df_pairs = df_pairs[["id", "squad_idx", "sae_query", "aave_query"]].sort_values("id").reset_index(drop=True)

# Optional cleanup
if apply_light_cleanup:
    df_pairs["sae_query"] = df_pairs["sae_query"].map(light_cleanup_query)
    df_pairs["aave_query"] = df_pairs["aave_query"].map(light_cleanup_query)

# Optional dry run
if dry_run_n is not None:
    df_pairs = df_pairs.head(dry_run_n).copy()

print("Loaded pairs:", len(df_pairs))
df_pairs.head(3)

Columns in dataset: ['id', 'squad_idx', 'sae_query', 'aave_query', 'source', 'conversion_method']
Loaded pairs: 200


|   | id | squad_idx | sae_query | aave_query |
|---|----|-----------|-----------|------------|
| 0 | 0 | 85143 | Who overshadowed House Speaker Dennis Hastert? | Who overshadow House Speaker Dennis Hastert? |
| 1 | 1 | 14741 | Who released Mosaic? | Who released Mosaic? |
| 2 | 2 | 3320 | How deep was the focus of the earthquake? | How deep the focus of the earthquake? |


## **2.2 Corpus Prep: Build SQuAD Index and Map Gold Documents**

- Load SQuAD train split; normalize all contexts and create a DataFrame corpus.
- Deduplicate contexts; assign stable integer doc_id; build context→doc_id lookup.
- Validate df_pairs has squad_idx; define mapper from SQuAD index to gold_doc_id via normalized context.
- Add gold_doc_id to df_pairs; drop unmapped rows; cast to int; report corpus size and retained pair count.


- **Gold**: the ground-truth document (context) from which the SQuAD question was written.
- **Map Gold Documents**: link each query pair to the `doc_id` of its ground-truth context in the deduplicated SQuAD corpus.
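
The deduplicate-then-map step can be illustrated on a toy corpus. This is a minimal sketch with made-up data and hypothetical names (`toy_squad`, `gold_doc_id`); the notebook itself does the same thing with pandas over the real SQuAD train split.

```python
# Toy stand-in for SQuAD: several questions can share one context paragraph.
toy_squad = [
    {"question": "Q0", "context": "Paris is the capital of France."},
    {"question": "Q1", "context": "Paris is the capital of France."},
    {"question": "Q2", "context": "The Nile flows north."},
]

# Deduplicate contexts in first-seen order and assign each a stable doc_id.
context_to_docid = {}
for ex in toy_squad:
    context_to_docid.setdefault(ex["context"], len(context_to_docid))

def gold_doc_id(squad_idx):
    """Return the doc_id of the context behind a row, or None if out of range."""
    if squad_idx is None or not 0 <= squad_idx < len(toy_squad):
        return None
    return context_to_docid.get(toy_squad[squad_idx]["context"])

print([gold_doc_id(i) for i in range(3)])  # [0, 0, 1]
print(gold_doc_id(99))                     # None
```

Rows 0 and 1 share a context, so they map to the same gold document; this is why a few `gold_doc_id` values can appear more than once among the query pairs.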




In [None]:
squad = load_dataset("squad", split="train")

contexts = [ex["context"] for ex in squad]
contexts_norm = [normalize_text(c) for c in contexts]
corpus_df = pd.DataFrame({"raw_doc_id": list(range(len(contexts_norm))), "context_text": contexts_norm})

# Deduplicate identical contexts and assign doc_id
corpus_df = corpus_df.drop_duplicates(subset=["context_text"]).reset_index(drop=True)
corpus_df["doc_id"] = corpus_df.index

# Map normalized context -> doc_id
context_to_docid = dict(zip(corpus_df["context_text"], corpus_df["doc_id"]))

# Map each pair's squad_idx -> gold_doc_id
# (assumes squad_idx indexes the same SQuAD train split used during Phase 1 generation)
assert "squad_idx" in df_pairs.columns, "Dataset missing 'squad_idx'"

def map_idx_to_docid(idx):
    # Guard against missing (None/NaN) or out-of-range indices
    if pd.isna(idx) or idx < 0 or idx >= len(squad):
        return None
    ctx = normalize_text(squad[int(idx)]["context"])
    return context_to_docid.get(ctx, None)

df_pairs["gold_doc_id"] = df_pairs["squad_idx"].map(map_idx_to_docid)

before = len(df_pairs)
df_pairs = df_pairs.dropna(subset=["gold_doc_id"]).copy()
df_pairs["gold_doc_id"] = df_pairs["gold_doc_id"].astype(int)
after = len(df_pairs)

print(f"Corpus size (unique contexts): {len(corpus_df)}")
print(f"Pairs available: {after} (dropped: {before - after})")

Corpus size (unique contexts): 18891
Pairs available: 200 (dropped: 0)


## 2.3 Retrievers: Initialize BM25 and Dense (Chroma + OpenAI) and Define Query Functions

- BM25 setup: tokenize corpus with a simple alphanumeric tokenizer; build BM25Okapi index; implement retrieve_bm25(query, k) returning top-k (doc_id, score) pairs or [] if disabled/empty.
- Dense setup: initialize in-memory Chroma collection; batch-embed corpus with OpenAI embeddings; add documents in safe chunks; skip add if collection already populated.
- Dense query: embed the query, run vector search in Chroma, convert distances to scores (negative distance fallback), and return top-k (doc_id, score) pairs; no-op if disabled/blank query.

***BM25*** is a classic **sparse**, term-based (keyword) retriever that relies on **exact word overlap**. It scores documents using **term frequency**, **inverse document frequency (IDF)**, and **length normalization**, which makes it fast, interpretable, and strong when queries share keywords with documents. Conceptually, it operates over a very **sparse term–document matrix** (never explicitly materialized) in which each document has nonzero weights only for the terms it contains; search engines typically implement this with an **inverted index** of posting lists. In this framing, documents and queries are high-dimensional, mostly-zero **vectors**, and scoring reduces to sparse vector operations over the terms they share.
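
To make the exact-overlap behavior concrete, here is a minimal textbook BM25 scorer in pure Python. It is a sketch under stated assumptions, not the `rank_bm25` implementation: the function names, the toy documents, the parameters `k1=1.5` and `b=0.75`, and the non-negative IDF variant are all choices made for illustration, and the library's IDF handling may differ.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase alphanumeric tokenizer, similar to the notebook's simple scheme."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        norm = k1 * (1 - b + b * len(toks) / avg_len)  # length normalization
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no exact overlap, no contribution
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

# Made-up documents for illustration only.
docs = [
    "A rival overshadowed the house speaker during the debate.",
    "The earthquake had a shallow focus.",
    "Mosaic was released in 1993.",
]
sae = bm25_scores("Who overshadowed the house speaker?", docs)
aave = bm25_scores("Who overshadow the house speaker?", docs)
print([round(s, 3) for s in sae])
print([round(s, 3) for s in aave])
```

The token `overshadow` does not match `overshadowed`, so the second query loses that term's contribution to the first document. This is the mechanism by which a morphological shift can lower a sparse retriever's score even when the gold document still ranks first.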

***Dense retrieval*** uses **vector embeddings** to place queries and documents in a continuous **semantic space**, retrieving by **cosine similarity** or a related distance metric. It can match **paraphrases** and **synonyms** without exact term overlap, which can improve recall under varied wording or dialectal shifts. Here, each query or document is a fixed-length **dense vector** (a numeric **tensor**) produced by a neural model, with far fewer dimensions than a vocabulary-sized sparse vector; building and querying require embedding computation and a **vector index** (e.g., FAISS/Chroma). This yields stronger **semantic matching** at the cost of more compute, dependence on embedding quality, and less transparency than BM25.
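
The relationship between a distance-based score and cosine similarity can be checked on toy vectors. The sketch below assumes unit-length embeddings and squared Euclidean distance; which distance a given vector store reports depends on how the collection is configured, so treat these as assumptions. The 3-dimensional vectors and all names are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sq_l2(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Hypothetical 3-d "embeddings"; real models emit far more dimensions.
doc_vecs = {
    0: unit([0.9, 0.1, 0.0]),
    1: unit([0.1, 0.9, 0.2]),
    2: unit([0.0, 0.2, 0.9]),
}
query_vec = unit([0.8, 0.3, 0.1])

# Rank by cosine similarity, and by negative squared distance as a score.
by_cosine = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
by_neg_l2 = sorted(doc_vecs, key=lambda d: -sq_l2(query_vec, doc_vecs[d]), reverse=True)
print(by_cosine, by_neg_l2)
```

For unit vectors, squared distance equals `2 - 2 * cosine`, so using negative distance as the score produces the same ordering as cosine similarity under these assumptions.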

In [None]:
# Sets up the BM25 retrieval model.
import re
def _tok(s): return re.findall(r"[a-z0-9]+", s.lower())

if use_bm25:
    # Tokenize the corpus for BM25
    tokenized_corpus = [_tok(doc) for doc in corpus_df["context_text"].tolist()]
    # Initialize BM25Okapi
    bm25 = BM25Okapi(tokenized_corpus)

    # Define BM25 retrieval function
    def retrieve_bm25(query: str, k: int = top_k):
        if not query.strip(): return []
        # Tokenize the query
        tokens = _tok(query)
        # Get scores for all documents
        scores = bm25.get_scores(tokens)
        # Get indices of top-k documents
        top_idx = np.argsort(scores)[::-1][:k]
        # Return list of (doc_id, score) for top-k
        return [(int(corpus_df.iloc[idx]["doc_id"]), float(scores[idx])) for idx in top_idx]
else:
    # Placeholder function if BM25 is disabled
    def retrieve_bm25(query: str, k: int = top_k): return []

In [None]:
# Dense retrieval model using Chroma and OpenAI embeddings.
if use_dense:
    # In-memory Chroma client
    import chromadb
    from chromadb.config import Settings

    chroma_client = chromadb.Client(Settings())  # pure in-memory
    # Use get_or_create=True to handle existing collection
    collection = chroma_client.create_collection(name="poc_corpus", get_or_create=True)

    doc_texts = corpus_df["context_text"].tolist()
    # Convert doc_ids to string for Chroma
    doc_ids_str = corpus_df["doc_id"].astype(str).tolist()

    # Function to embed texts in batches using OpenAI
    def embed_texts(texts, batch=64):
        out = []
        for i in tqdm(range(0, len(texts), batch), desc="Embedding corpus"):
            chunk = texts[i:i+batch]
            # Create embeddings using the specified model
            resp = openai_client.embeddings.create(model=embed_model, input=chunk)
            out.extend([d.embedding for d in resp.data])
        return out

    # Check if collection is empty before adding documents to avoid re-embedding
    if collection.count() == 0:
        embeddings = embed_texts(doc_texts)
        # Basic sanity checks
        assert len(embeddings) > 0 and len(embeddings[0]) > 0, "Empty embedding vector"
        assert len(embeddings) == len(doc_texts) == len(doc_ids_str), "Mismatch in lengths before add"

        # Add documents to Chroma in safe batches to avoid InternalError (max batch size limit)
        B_ADD = 2000  # conservative chunk size
        for i in tqdm(range(0, len(doc_texts), B_ADD), desc="Adding to Chroma"):
            chunk_docs = doc_texts[i:i+B_ADD]
            chunk_embs = embeddings[i:i+B_ADD]
            chunk_ids  = doc_ids_str[i:i+B_ADD]
            collection.add(documents=chunk_docs, embeddings=chunk_embs, ids=chunk_ids)
    else:
        print("Collection 'poc_corpus' already contains data. Skipping adding documents.")

    # Define dense retrieval function
    def retrieve_dense(query: str, k: int = top_k):
        if not query or not query.strip():
            return []
        # Embed the query
        resp = openai_client.embeddings.create(model=embed_model, input=[query])
        qvec = resp.data[0].embedding
        # Query the Chroma collection for top-k results
        res = collection.query(query_embeddings=[qvec], n_results=k)
        ids = res["ids"][0]
        dists = res.get("distances", [[None]])[0] # Get distances if available
        out = []
        for j, sid in enumerate(ids):
            # Calculate score (negative distance as a simple score)
            score = -float(dists[j]) if dists and dists[j] is not None else float(k - j) # Fallback score if distance is None
            out.append((int(sid), score)) # Return doc_id and score
        return out
else:
    # Placeholder function if dense retrieval is disabled
    def retrieve_dense(query: str, k: int = top_k):
        return []

Embedding corpus: 100%|██████████| 296/296 [02:38<00:00,  1.87it/s]
Adding to Chroma: 100%|██████████| 10/10 [00:43<00:00,  4.40s/it]


### 2.3.1 Sanity Check: Validate Mappings, Ensure Retrievers, and Peek Results

- Validate inputs: assert required columns exist and gold_doc_id values fall within corpus bounds; print corpus/pair stats and gold_doc_id distribution.
- Ensure retrievers: lazily initialize BM25 if missing; define lightweight retrieve_bm25 and retrieve_dense helpers with safe fallbacks.
- Quick inspection: define peek(row_idx, k) to show SAE/AAVE queries, top-k doc IDs for BM25/Dense, and whether the gold doc is hit.
- Run peeks on a few rows to confirm correct gold mapping and retriever behavior before full evaluation.

In [None]:
# Performs sanity checks, defines lightweight retrieval helpers, and peeks at results.

# 1) Basic column and bounds checks
required_cols = ["id", "squad_idx", "sae_query", "aave_query", "gold_doc_id"]
missing = [c for c in required_cols if c not in df_pairs.columns]
assert not missing, f"Missing columns in df_pairs: {missing}"

assert df_pairs["gold_doc_id"].between(0, len(corpus_df)-1).all(), \
    "gold_doc_id contains out-of-range values"

# 2) Summary stats on gold mapping
print("Corpus size (unique contexts):", len(corpus_df))
print("Pairs loaded:", len(df_pairs))
print("gold_doc_id describe:\n", df_pairs["gold_doc_id"].describe())
print("Top duplicate gold_doc_id counts:\n", df_pairs["gold_doc_id"].value_counts().head(10))

# 3) Lightweight BM25 setup (skip if already initialized)
# re, np, and BM25Okapi are already imported in the setup cell.
def _tok(s):
    return re.findall(r"[a-z0-9]+", s.lower())

if use_bm25 and 'bm25' not in globals():
    tokenized_corpus = [_tok(doc) for doc in corpus_df["context_text"].tolist()]
    bm25 = BM25Okapi(tokenized_corpus)

# Lightweight BM25 retrieve function for peek
# (redefines the Section 2.3 version with a fixed default k and a guard for empty queries)
def retrieve_bm25(query: str, k: int = 10):
    if not use_bm25 or not query or not query.strip():
        return []
    tokens = _tok(query)
    scores = bm25.get_scores(tokens)
    top_idx = np.argsort(scores)[::-1][:k]
    return [(int(corpus_df.iloc[idx]["doc_id"]), float(scores[idx])) for idx in top_idx]

# 4) Lightweight dense retrieve function for peek (assumes the Chroma collection is already built)
def retrieve_dense(query: str, k: int = 10):
    if not use_dense or not query or not query.strip():
        return []
    resp = openai_client.embeddings.create(model=embed_model, input=[query])
    qvec = resp.data[0].embedding
    res = collection.query(query_embeddings=[qvec], n_results=k)
    ids = res["ids"][0]
    dists = res.get("distances", [[None]])[0]
    out = []
    for j, sid in enumerate(ids):
        score = -float(dists[j]) if dists and dists[j] is not None else float(k - j)
        out.append((int(sid), score))
    return out

# 5) Peek function for SAE and AAVE, checks gold presence in top-k
def peek(row_idx=0, k=10, show_snip=False):
    r = df_pairs.iloc[row_idx]
    gold = int(r["gold_doc_id"])
    sae = r["sae_query"]; aave = r["aave_query"]
    print(f"\nRow {row_idx} | id={r['id']} squad_idx={r['squad_idx']} gold_doc_id={gold}")
    print("SAE:", sae)
    print("AAVE:", aave)
    if show_snip:
        gold_text = corpus_df.loc[corpus_df["doc_id"]==gold, "context_text"].iloc[0][:200]
        print("Gold context snippet:", gold_text, "...")

    if use_bm25:
        bm_sae = [d for d,_ in retrieve_bm25(sae, k)]
        bm_aav = [d for d,_ in retrieve_bm25(aave, k)]
        print("BM25 SAE top ids:", bm_sae, "Hit:", gold in bm_sae)
        print("BM25 AAVE top ids:", bm_aav, "Hit:", gold in bm_aav)

    if use_dense:
        de_sae = [d for d,_ in retrieve_dense(sae, k)]
        de_aav = [d for d,_ in retrieve_dense(aave, k)]
        print("Dense SAE top ids:", de_sae, "Hit:", gold in de_sae)
        print("Dense AAVE top ids:", de_aav, "Hit:", gold in de_aav)

# 6) Run a few peeks
peek(0, k=20, show_snip=False)
peek(1, k=20, show_snip=False)
peek(2, k=20, show_snip=False)

print("\nSanity check complete. Proceed to Evaluate Retrieval Hits.")

Corpus size (unique contexts): 18891
Pairs loaded: 200
gold_doc_id describe:
 count      200.000000
mean      9455.145000
std       5596.148617
min         15.000000
25%       4726.750000
50%       8959.500000
75%      14729.000000
max      18828.000000
Name: gold_doc_id, dtype: float64
Top duplicate gold_doc_id counts:
 gold_doc_id
18705    2
14729    2
4510     2
3249     1
510      1
7176     1
6547     1
2931     1
4134     1
2351     1
Name: count, dtype: int64

Row 0 | id=0 squad_idx=85143 gold_doc_id=18405
SAE: Who overshadowed House Speaker Dennis Hastert?
AAVE: Who overshadow House Speaker Dennis Hastert?
BM25 SAE top ids: [18405, 18401, 18404, 18412, 18406, 18402, 13546, 18577, 5808, 18416, 18415, 13544, 17528, 18413, 18403, 13529, 18578, 18407, 15726, 18419] Hit: True
BM25 AAVE top ids: [18401, 18405, 18404, 18412, 18406, 18402, 13546, 18577, 5808, 18416, 18415, 13544, 17528, 18413, 18403, 13529, 18578, 18407, 15726, 18419] Hit: True
Dense SAE top ids: [18404, 18405, 18415, 

## 2.4 Evaluation: Compute Hit@k per Dialect and Run Paired Significance Tests

- Iterate over paired SAE/AAVE queries; retrieve top-k with BM25 and Dense; record binary hits against gold_doc_id for each dialect.
- Build eval_df with per-pair results; report total evaluated rows and preview for sanity checking.
- Define McNemar helper to form 2×2 contingency tables and compute chi-square with continuity correction.
- Summarize recall@k for SAE and AAVE per retriever, delta in percentage points, and McNemar p-values; print a compact, comparable summary.



### **Mathematical Explanation of Retrieval Metrics**

**Recall@k** (or Hit@k) measures the fraction of queries for which the correct document appears among the top *k* retrieved results. It is a direct measure of retrieval effectiveness. For a set of *N* queries, where $g_i$ is the gold document for query $i$ and $R(q_i, k)$ is the set of top-*k* documents retrieved for that query, the formula is:

$$\text{Recall@k} = \frac{1}{N} \sum_{i = 1}^{N} \mathbb{I} \left( g_{i} \in R \left( q_{i} , k \right) \right)$$

where $\mathbb{I}(\cdot)$ is the indicator function, returning 1 if the condition is true and 0 otherwise.

The **Delta (pp)** is the simple arithmetic difference between the Recall@k scores for SAE and AAVE queries, expressed in percentage points (pp). This value quantifies the magnitude of the performance gap, with a positive value indicating SAE performed better and a negative value indicating AAVE performed better.

$$
\Delta_{p p} = \left( \text{Recall@k}_{\text{SAE}} - \text{Recall@k}_{\text{AAVE}} \right) \times 100
$$
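
As a worked instance of the two formulas above, the sketch below computes Recall@k per dialect and the delta in percentage points. The gold ids and ranked lists are made up for illustration, and `recall_at_k` is a hypothetical helper rather than a function from the notebook.

```python
def recall_at_k(gold_ids, retrieved_lists, k):
    """Fraction of queries whose gold document appears in the top-k results."""
    hits = [gold in ranked[:k] for gold, ranked in zip(gold_ids, retrieved_lists)]
    return sum(hits) / len(hits)

# Hypothetical gold ids and ranked results for four paired queries.
gold = [7, 3, 5, 9]
sae_ranked = [[7, 1, 2], [4, 3, 8], [5, 6, 0], [1, 2, 9]]
aave_ranked = [[7, 1, 2], [4, 8, 3], [6, 5, 0], [1, 2, 4]]

k = 2
sae_recall = recall_at_k(gold, sae_ranked, k)
aave_recall = recall_at_k(gold, aave_ranked, k)
delta_pp = (sae_recall - aave_recall) * 100  # percentage points

print(sae_recall, aave_recall, delta_pp)  # 0.75 0.5 25.0
```

Note the factor of 100: a recall difference of 0.25 corresponds to 25 percentage points.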

**McNemar's Test** is a statistical test for paired binary data (hit vs. miss) that determines if there is a significant difference between the two conditions (SAE vs. AAVE). It specifically evaluates the discordant pairs: *b* (SAE hits, AAVE misses) and *c* (AAVE hits, SAE misses). The test's null hypothesis is that both dialects have the same hit rate.

$$
\chi^{2} = \frac{\left( | b - c | - 1 \right)^{2}}{b + c}
$$

The **p-value** from McNemar's test is the probability of observing a performance difference at least as extreme as the one measured, assuming the null hypothesis is true. A small p-value (typically < 0.05) means the observed difference in hit rates would be unlikely under the null hypothesis, so we reject it. A large p-value means only that the data provide no evidence of a difference; it does not demonstrate that the hit rates are equal.
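
The statistic and its p-value can be computed by hand from the discordant counts alone. The sketch below uses only the standard library; the function names are hypothetical, and the exact binomial variant is included as a common alternative when discordant pairs are few, not as something the notebook itself runs.

```python
import math

def mcnemar_chi2(b, c):
    """Continuity-corrected McNemar statistic and chi-square (df=1) p-value."""
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: nothing to test
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

def mcnemar_exact(b, c):
    """Two-sided exact binomial p-value on the discordant pairs."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

chi2, p = mcnemar_chi2(2, 0)
print(chi2, round(p, 4))    # 0.5 0.4795
print(mcnemar_exact(2, 0))  # 0.5
print(mcnemar_chi2(0, 1))   # (0.0, 1.0)
```

With `b=2, c=0` the corrected statistic is 0.5 and the p-value is about 0.48; with a single discordant pair the corrected statistic is 0 and the p-value is 1. With so few discordant pairs, either version of the test has very little power.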


In [None]:
# Iterate over paired queries and record retrieval hits for each retriever.
records = []

for _, row in tqdm(df_pairs.iterrows(), total=len(df_pairs), desc="Evaluating"):
    pid = int(row["id"])
    gold = int(row["gold_doc_id"])
    sae_q = row["sae_query"]
    aave_q = row["aave_query"]

    rec = {"id": pid, "gold_doc_id": gold}

    # Evaluate BM25 hits
    if use_bm25:
        sae_bm25 = retrieve_bm25(sae_q, top_k)
        aave_bm25 = retrieve_bm25(aave_q, top_k)
        rec["bm25_hit_sae"] = any(doc_id == gold for doc_id, _ in sae_bm25)
        rec["bm25_hit_aave"] = any(doc_id == gold for doc_id, _ in aave_bm25)
    else:
        rec["bm25_hit_sae"] = None
        rec["bm25_hit_aave"] = None

    # Evaluate Dense hits
    if use_dense:
        sae_dense = retrieve_dense(sae_q, top_k)
        aave_dense = retrieve_dense(aave_q, top_k)
        rec["dense_hit_sae"] = any(doc_id == gold for doc_id, _ in sae_dense)
        rec["dense_hit_aave"] = any(doc_id == gold for doc_id, _ in aave_dense)
    else:
        rec["dense_hit_sae"] = None
        rec["dense_hit_aave"] = None

    records.append(rec)

eval_df = pd.DataFrame(records)
print("Eval rows:", len(eval_df))
eval_df.head(3)

Evaluating: 100%|██████████| 200/200 [02:12<00:00,  1.51it/s]

Eval rows: 200





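|   | id | gold_doc_id | bm25_hit_sae | bm25_hit_aave | dense_hit_sae | dense_hit_aave |
|---|----|-------------|--------------|---------------|---------------|----------------|
| 0 | 0 | 18405 | True | True | True | True |
| 1 | 1 | 3249 | True | True | True | True |
| 2 | 2 | 510 | True | True | True | True |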


In [None]:
# Defines McNemar helper and summarizes retrieval results.
from statsmodels.stats.contingency_tables import mcnemar

# Helper function to perform McNemar's test from hit flags
def mcnemar_from_flags(x_hits, y_hits):
    x = pd.Series(x_hits).dropna().astype(int)
    y = pd.Series(y_hits).dropna().astype(int)
    # Build 2x2 contingency table: [[both hit, x hit & y miss], [x miss & y hit, both miss]]
    a = int(((x==1)&(y==1)).sum())
    b = int(((x==1)&(y==0)).sum())
    c = int(((x==0)&(y==1)).sum())
    d = int(((x==0)&(y==0)).sum())
    table = [[a,b],[c,d]]
    # Perform McNemar's test with continuity correction
    res = mcnemar(table, exact=False, correction=True)
    return table, float(res.statistic), float(res.pvalue)

# Helper function to calculate hit rate (recall)
def hit_rate(flags):
    s = pd.Series(flags).dropna().astype(int)
    return float(s.mean()) if len(s) else np.nan

summary = {}

# Summarize BM25 results if enabled
if use_bm25:
    x, y = eval_df["bm25_hit_sae"], eval_df["bm25_hit_aave"]
    table, chi2, p = mcnemar_from_flags(x, y)
    summary["bm25"] = {
        f"sae_recall@{top_k}": hit_rate(x),
        f"aave_recall@{top_k}": hit_rate(y),
        # Recall gap in percentage points (difference of fractions x 100)
        "delta_pp": (hit_rate(x) - hit_rate(y)) * 100,
        "mcnemar_table": {"table": table, "chi2": chi2, "p_value": p, "n": int(len(eval_df))}
    }

# Summarize Dense results if enabled
if use_dense:
    x, y = eval_df["dense_hit_sae"], eval_df["dense_hit_aave"]
    table, chi2, p = mcnemar_from_flags(x, y)
    summary["dense"] = {
        f"sae_recall@{top_k}": hit_rate(x),
        f"aave_recall@{top_k}": hit_rate(y),
        # Recall gap in percentage points (difference of fractions x 100)
        "delta_pp": (hit_rate(x) - hit_rate(y)) * 100,
        "mcnemar_table": {"table": table, "chi2": chi2, "p_value": p, "n": int(len(eval_df))}
    }

# Print the summary results
print("=== RAG Bias PoC — Retrieval-only Summary ===")
for name, s in summary.items():
    sae_key = [k for k in s.keys() if k.startswith("sae_recall@")][0]
    aave_key = [k for k in s.keys() if k.startswith("aave_recall@")][0]
    # Build the metric label from the key suffix, e.g. "recall@10" -> "Recall@10"
    label = sae_key.split("_", 1)[1].capitalize()
    print(f"\n[{name.upper()}]")
    print(f"SAE {label}:  {s[sae_key]:.3f}")
    print(f"AAVE {label}: {s[aave_key]:.3f}")
    print(f"Delta (pp):          {s['delta_pp']:+.3f}")
    print(f"McNemar p-value:     {s['mcnemar_table']['p_value']:.4g}")
    print(f"N:                   {s['mcnemar_table']['n']}  Table: {s['mcnemar_table']['table']}")

=== RAG Bias PoC — Retrieval-only Summary ===

[BM25]
SAE Recall@10:  0.900
AAVE Recall@10: 0.890
Delta (pp):          +1.000
McNemar p-value:     0.4795
N:                   200  Table: [[178, 2], [0, 20]]

[DENSE]
SAE Recall@10:  0.890
AAVE Recall@10: 0.895
Delta (pp):          -0.500
McNemar p-value:     1
N:                   200  Table: [[178, 0], [1, 21]]


### **2.4.1 Discussion for k=10**

*   **Primary Finding: No Statistically Significant Bias Detected**
    *   The central conclusion from this 200-pair PoC is that neither BM25 nor the dense retriever shows a statistically significant retrieval gap at `k=10`. Both systems achieved high recall, with nearly identical results across SAE and AAVE query variants.

*   **BM25 (Sparse Retrieval) Performance**
    *   Recall was nearly identical at 90.0% for SAE and 89.0% for AAVE, a negligible +1.0 percentage point difference.
    *   McNemar's test gives a p-value of 0.48, well above the 0.05 threshold, so the null hypothesis of equal hit rates cannot be rejected.
    *   The contingency table `[[178, 2], [0, 20]]` shows only two discordant pairs (both cases where SAE hit and AAVE missed). With so few discordant pairs, the test has little power to detect a difference.

*   **Dense (Embedding-based) Retrieval Performance**
    *   Performance was almost perfectly balanced, with 89.0% recall for SAE versus 89.5% for AAVE (a -0.5 pp delta).
    *   The McNemar p-value of 1.0 means the data provide no evidence of a difference between the dialects. It is not proof of equality: with a single discordant pair, the test cannot distinguish the two conditions.
    *   The contingency table `[[178, 0], [1, 21]]` shows only a single instance where the retrieval outcome differed (AAVE hit, SAE miss), suggesting robustness to the dialectal shifts present in this dataset.

*   **Overall Conclusion and Context**
    *   For this specific dataset and synthetic query style, both sparse and dense retrieval methods appear robust to dialectal variation. The high baseline recall (~90%) may suggest the retrieval task was relatively easy, potentially masking subtle biases that could appear in more challenging scenarios.

## 2.5 Qualitative Error Analysis and Saving Results

*   **Error Analysis:**
    *   Defines and calls a function `show_examples` to perform qualitative error analysis.
    *   This function filters for and displays specific failure cases where an SAE query resulted in a hit, but its corresponding AAVE query did not, for both BM25 and Dense retrievers.

*   **Save Outputs:**
    *   Creates a timestamped directory to store the experiment outputs for reproducibility.
    *   Saves the detailed, per-pair hit/miss data (`eval_df`) to a CSV file.
    *   Saves the high-level summary statistics (Recall@k, delta, McNemar results) along with the experiment configuration to a JSON file.

In [None]:
# Displays examples of retrieval misses (SAE hit, AAVE miss).
def show_examples(sae_col, aave_col, title):
    # Filter for rows where SAE hit and AAVE missed
    misses = eval_df[(eval_df[sae_col] == True) & (eval_df[aave_col] == False)]
    print(f"\nExamples — {title}: SAE hit, AAVE miss (up to 5)")
    # Iterate and print details for up to 5 examples
    for _, r in misses.head(5).iterrows():
        pid = int(r["id"])
        # Find the original pair data using the id
        rowp = df_pairs[df_pairs["id"] == pid].iloc[0]
        print(f"- id={pid}")
        print(f"  SAE:  {rowp['sae_query']}")
        print(f"  AAVE: {rowp['aave_query']}")

# Show examples for BM25 if enabled
if use_bm25:
    show_examples("bm25_hit_sae", "bm25_hit_aave", "BM25")
# Show examples for Dense if enabled
if use_dense:
    show_examples("dense_hit_sae", "dense_hit_aave", "Dense")


Examples — BM25: SAE hit, AAVE miss (up to 5)
- id=46
  SAE:  What did the British form in preparation to leaving India?
  AAVE: What did the British form to get ready to leave India?
- id=165
  SAE:  When did Libya become an independent nation?
  AAVE: When did Libya get independent?

Examples — Dense: SAE hit, AAVE miss (up to 5)


In [None]:
# Saves the evaluation results and summary to files with a timestamp.
timestamp = time.strftime("%Y%m%d-%H%M%S")
os.makedirs("./results_rag_bias_poc", exist_ok=True)

# Define the path for the evaluation results CSV file
eval_path = f"./results_rag_bias_poc/hits_{timestamp}.csv"
eval_df.to_csv(eval_path, index=False)

# Define the path for the summary JSON file
summary_path = f"./results_rag_bias_poc/summary_{timestamp}.json"
# Write the summary dictionary to the JSON file
with open(summary_path, "w", encoding="utf-8") as f:
    json.dump({
        "config": {
            "top_k": top_k, "use_bm25": use_bm25, "use_dense": use_dense,
            "embed_model": embed_model, "dry_run_n": dry_run_n
        },
        "counts": {"n_pairs": int(len(eval_df))},
        "summary": summary
    }, f, indent=2, ensure_ascii=False)

print(f"Saved: {eval_path}")
print(f"Saved: {summary_path}")

Saved: ./results_rag_bias_poc/hits_20250927-103449.csv
Saved: ./results_rag_bias_poc/summary_20250927-103449.json


## 2.6 Key Findings & Next Steps from Retrieval Bias PoC (`k=10`)

This Proof-of-Concept found **no statistically significant evidence of dialectal bias** for either BM25 or dense retrieval at `k=10`. Both systems showed high and nearly identical recall for SAE and AAVE queries, and McNemar's p-values (0.48 for BM25, 1.0 for Dense) are well above the 0.05 threshold. Because only three pairs were discordant across both retrievers, this result reflects an absence of evidence for a gap rather than proof that none exists. The sanity checks in Section 2.3.1 indicate that the gold mappings and both retrieval pipelines behave as expected.

#### Detailed Findings

*   **System Performance:**
    *   **BM25:** SAE recall was 90.0% vs. AAVE 89.0% (+1.0 pp delta). The contingency table `[[178, 2], [0, 20]]` shows only two discordant pairs, which is insufficient to suggest a systematic bias.
    *   **Dense:** SAE recall was 89.0% vs. AAVE 89.5% (-0.5 pp delta). The table `[[178, 0], [1, 21]]` shows only one discordant pair, indicating the model is highly robust to the dialectal shifts in this dataset.

*   **Qualitative Insight:**
    *   The few discordant cases (e.g., "in preparation to leaving" vs. "to get ready to leave") highlight minor lexical sensitivities in BM25 but do not translate to a meaningful aggregate performance drop.

#### Recommended Next Steps

1.  **Vary `k` and Re-evaluate (Sensitivity Analysis)**
    *   **Motivation:** Bias may only emerge at stricter cutoffs (e.g., k=2, k=5) where the top-ranked result is critical.
    *   **Action:** Re-run the evaluation loop with `top_k` set to 2, 5, and 20 to analyze performance at different retrieval depths.

2.  **Add Rank-Based Metrics (Deeper Diagnostics)**
    *   **Motivation:** Even if hit rates are similar, the *distribution of ranks* for the gold document could differ systematically.
    *   **Action:** In the evaluation loop, record the rank of the gold document for each query. Compare the rank distributions (e.g., median, mean) and apply a paired Wilcoxon signed-rank test to check for significant differences.

3.  **Evaluate Hybrid Retrieval (Ceiling Analysis)**
    *   **Motivation:** Determine the maximum achievable recall by combining both retrievers and check if one dialect benefits more.
    *   **Action:** For each query, create a union of the top-k doc IDs from BM25 and dense. Calculate a "hybrid hit rate" and run a McNemar test on these results to see if a performance gap emerges at this higher ceiling.
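
For the `k` sweep in step 1, one option is to retrieve once at the largest cutoff and derive the hit flag for every smaller `k` from the rank of the gold document. The sketch below assumes the retriever returns a deterministic ranking, so that its top-k list is a prefix of its top-K list for k ≤ K; the function name `hits_at_cutoffs` and the toy document IDs are hypothetical.

```python
def hits_at_cutoffs(ranked_doc_ids, gold, cutoffs):
    """Return {k: bool} telling whether `gold` appears in the first k results."""
    try:
        rank = ranked_doc_ids.index(gold) + 1  # 1-based rank
    except ValueError:
        rank = None  # gold not retrieved within the list
    return {k: rank is not None and rank <= k for k in cutoffs}

cutoffs = (2, 5, 20)
# Toy ranked list (illustrative doc ids, not from the experiment), assumed to be
# retrieved once at depth max(cutoffs).
ranked = [42, 7, 13, 99, 5, 18]
print(hits_at_cutoffs(ranked, gold=99, cutoffs=cutoffs))
```

This avoids re-embedding every query for each cutoff; re-running the full loop per `k` gives the same flags under the prefix assumption.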

## **2.7 Varying k and Re-evaluating (Sensitivity Analysis)**

### **2.7.1   k=5**

In [None]:
top_k = 5

In [None]:
records = []

for _, row in tqdm(df_pairs.iterrows(), total=len(df_pairs), desc="Evaluating"):
    pid = int(row["id"])
    gold = int(row["gold_doc_id"])
    sae_q = row["sae_query"]
    aave_q = row["aave_query"]

    rec = {"id": pid, "gold_doc_id": gold}

    if use_bm25:
        sae_bm25 = retrieve_bm25(sae_q, top_k)
        aave_bm25 = retrieve_bm25(aave_q, top_k)
        rec["bm25_hit_sae"] = any(doc_id == gold for doc_id, _ in sae_bm25)
        rec["bm25_hit_aave"] = any(doc_id == gold for doc_id, _ in aave_bm25)
    else:
        rec["bm25_hit_sae"] = None
        rec["bm25_hit_aave"] = None

    if use_dense:
        sae_dense = retrieve_dense(sae_q, top_k)
        aave_dense = retrieve_dense(aave_q, top_k)
        rec["dense_hit_sae"] = any(doc_id == gold for doc_id, _ in sae_dense)
        rec["dense_hit_aave"] = any(doc_id == gold for doc_id, _ in aave_dense)
    else:
        rec["dense_hit_sae"] = None
        rec["dense_hit_aave"] = None

    records.append(rec)

eval_df = pd.DataFrame(records)
print("Eval rows:", len(eval_df))
eval_df.head(3)

Evaluating: 100%|██████████| 200/200 [02:11<00:00,  1.52it/s]

Eval rows: 200





Unnamed: 0,id,gold_doc_id,bm25_hit_sae,bm25_hit_aave,dense_hit_sae,dense_hit_aave
0,0,18405,True,True,True,True
1,1,3249,True,True,True,True
2,2,510,True,True,True,True


In [None]:
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_from_flags(x_hits, y_hits):
    x = pd.Series(x_hits).dropna().astype(int)
    y = pd.Series(y_hits).dropna().astype(int)
    a = int(((x==1)&(y==1)).sum())
    b = int(((x==1)&(y==0)).sum())
    c = int(((x==0)&(y==1)).sum())
    d = int(((x==0)&(y==0)).sum())
    table = [[a,b],[c,d]]
    res = mcnemar(table, exact=False, correction=True)
    return table, float(res.statistic), float(res.pvalue)

def hit_rate(flags):
    s = pd.Series(flags).dropna().astype(int)
    return float(s.mean()) if len(s) else np.nan

summary = {}

if use_bm25:
    x, y = eval_df["bm25_hit_sae"], eval_df["bm25_hit_aave"]
    table, chi2, p = mcnemar_from_flags(x, y)
    summary["bm25"] = {
        f"sae_recall@{top_k}": hit_rate(x),
        f"aave_recall@{top_k}": hit_rate(y),
        "delta_pp": hit_rate(x) - hit_rate(y),
        "mcnemar_table": {"table": table, "chi2": chi2, "p_value": p, "n": int(len(eval_df))}
    }

if use_dense:
    x, y = eval_df["dense_hit_sae"], eval_df["dense_hit_aave"]
    table, chi2, p = mcnemar_from_flags(x, y)
    summary["dense"] = {
        f"sae_recall@{top_k}": hit_rate(x),
        f"aave_recall@{top_k}": hit_rate(y),
        "delta_pp": hit_rate(x) - hit_rate(y),
        "mcnemar_table": {"table": table, "chi2": chi2, "p_value": p, "n": int(len(eval_df))}
    }

print("=== RAG Bias PoC — Retrieval-only Summary ===")
for name, s in summary.items():
    sae_key = [k for k in s.keys() if k.startswith("sae_recall@")][0]
    aave_key = [k for k in s.keys() if k.startswith("aave_recall@")][0]
    print(f"\n[{name.upper()}]")
    # Label with the metric part of the key ("recall@k"), not the dialect prefix.
    print(f"SAE {sae_key.split('_', 1)[1].capitalize()}:  {s[sae_key]:.3f}")
    print(f"AAVE {aave_key.split('_', 1)[1].capitalize()}: {s[aave_key]:.3f}")
    # delta_pp is stored as a fraction; scale by 100 to show percentage points.
    print(f"Delta (pp):          {100 * s['delta_pp']:+.1f}")
    print(f"McNemar p-value:     {s['mcnemar_table']['p_value']:.4g}")
    print(f"N:                   {s['mcnemar_table']['n']}  Table: {s['mcnemar_table']['table']}")

=== RAG Bias PoC — Retrieval-only Summary ===

[BM25]
SAE Recall@5:  0.845
AAVE Recall@5: 0.835
Delta (pp):          +1.0
McNemar p-value:     0.6831
N:                   200  Table: [[165, 4], [2, 29]]

[DENSE]
SAE Recall@5:  0.870
AAVE Recall@5: 0.870
Delta (pp):          +0.0
McNemar p-value:     0.4795
N:                   200  Table: [[173, 1], [1, 25]]


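- BM25 @5: SAE 0.845 vs AAVE 0.835 (∆ = +1.0 pp). McNemar p=0.68 → not significant. Table [[165, 4], [2, 29]] has 6 discordant pairs: 4 where only the SAE query hit and 2 where only the AAVE query hit.
- Dense @5: SAE 0.870 vs AAVE 0.870 (∆ = 0). McNemar p=0.48 → not significant. Table [[173, 1], [1, 25]] has one discordant pair in each direction. The p-value is 0.48 rather than 1.0 only because the continuity correction yields a statistic of (|1 − 1| − 1)² / 2 = 0.5.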

Conclusion: Even at stricter k=5, no statistically significant retrieval bias between SAE and AAVE for either BM25 or dense.
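
With so few discordant pairs, the chi-square approximation behind the corrected McNemar test is rough, and an exact binomial version is a common alternative. The sketch below is a standard-library illustration of that exact test, not the notebook's method (the notebook calls `mcnemar(..., exact=False, correction=True)`); the function name `mcnemar_exact` is hypothetical. Under the null hypothesis each discordant pair is equally likely to favor either dialect, so the smaller discordant count follows a Binomial(b + c, 0.5) distribution.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: nothing to test
    # P(X <= min(b, c)) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(mcnemar_exact(4, 2))  # BM25 @5 discordant counts
print(mcnemar_exact(1, 1))  # Dense @5 discordant counts
```

Both exact p-values stay far from 0.05, so the conclusion for `k=5` does not depend on which variant of the test is used.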

Recommended next steps (minimal):
- Try k=20 to see if curves converge fully and to report a small “k sweep” (5/10/20).
- Add rank-of-gold logging and compare median ranks with a paired Wilcoxon (sometimes rank gaps exist even when hit@k is similar).
- Compute hybrid recall (union of BM25 and dense top-k) to show an upper bound and check if any dialect gap appears there.

### **2.7.2   k=20**

In [None]:
top_k = 20

In [None]:
records = []

for _, row in tqdm(df_pairs.iterrows(), total=len(df_pairs), desc="Evaluating"):
    pid = int(row["id"])
    gold = int(row["gold_doc_id"])
    sae_q = row["sae_query"]
    aave_q = row["aave_query"]

    rec = {"id": pid, "gold_doc_id": gold}

    if use_bm25:
        sae_bm25 = retrieve_bm25(sae_q, top_k)
        aave_bm25 = retrieve_bm25(aave_q, top_k)
        rec["bm25_hit_sae"] = any(doc_id == gold for doc_id, _ in sae_bm25)
        rec["bm25_hit_aave"] = any(doc_id == gold for doc_id, _ in aave_bm25)
    else:
        rec["bm25_hit_sae"] = None
        rec["bm25_hit_aave"] = None

    if use_dense:
        sae_dense = retrieve_dense(sae_q, top_k)
        aave_dense = retrieve_dense(aave_q, top_k)
        rec["dense_hit_sae"] = any(doc_id == gold for doc_id, _ in sae_dense)
        rec["dense_hit_aave"] = any(doc_id == gold for doc_id, _ in aave_dense)
    else:
        rec["dense_hit_sae"] = None
        rec["dense_hit_aave"] = None

    records.append(rec)

eval_df = pd.DataFrame(records)
print("Eval rows:", len(eval_df))
eval_df.head(3)

Evaluating: 100%|██████████| 200/200 [02:12<00:00,  1.51it/s]

Eval rows: 200





Unnamed: 0,id,gold_doc_id,bm25_hit_sae,bm25_hit_aave,dense_hit_sae,dense_hit_aave
0,0,18405,True,True,True,True
1,1,3249,True,True,True,True
2,2,510,True,True,True,True


In [None]:
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_from_flags(x_hits, y_hits):
    x = pd.Series(x_hits).dropna().astype(int)
    y = pd.Series(y_hits).dropna().astype(int)
    a = int(((x==1)&(y==1)).sum())
    b = int(((x==1)&(y==0)).sum())
    c = int(((x==0)&(y==1)).sum())
    d = int(((x==0)&(y==0)).sum())
    table = [[a,b],[c,d]]
    res = mcnemar(table, exact=False, correction=True)
    return table, float(res.statistic), float(res.pvalue)

def hit_rate(flags):
    s = pd.Series(flags).dropna().astype(int)
    return float(s.mean()) if len(s) else np.nan

summary = {}

if use_bm25:
    x, y = eval_df["bm25_hit_sae"], eval_df["bm25_hit_aave"]
    table, chi2, p = mcnemar_from_flags(x, y)
    summary["bm25"] = {
        f"sae_recall@{top_k}": hit_rate(x),
        f"aave_recall@{top_k}": hit_rate(y),
        "delta_pp": hit_rate(x) - hit_rate(y),
        "mcnemar_table": {"table": table, "chi2": chi2, "p_value": p, "n": int(len(eval_df))}
    }

if use_dense:
    x, y = eval_df["dense_hit_sae"], eval_df["dense_hit_aave"]
    table, chi2, p = mcnemar_from_flags(x, y)
    summary["dense"] = {
        f"sae_recall@{top_k}": hit_rate(x),
        f"aave_recall@{top_k}": hit_rate(y),
        "delta_pp": hit_rate(x) - hit_rate(y),
        "mcnemar_table": {"table": table, "chi2": chi2, "p_value": p, "n": int(len(eval_df))}
    }

print("=== RAG Bias PoC — Retrieval-only Summary ===")
for name, s in summary.items():
    sae_key = [k for k in s.keys() if k.startswith("sae_recall@")][0]
    aave_key = [k for k in s.keys() if k.startswith("aave_recall@")][0]
    print(f"\n[{name.upper()}]")
    # Label with the metric part of the key ("recall@k"), not the dialect prefix.
    print(f"SAE {sae_key.split('_', 1)[1].capitalize()}:  {s[sae_key]:.3f}")
    print(f"AAVE {aave_key.split('_', 1)[1].capitalize()}: {s[aave_key]:.3f}")
    # delta_pp is stored as a fraction; scale by 100 to show percentage points.
    print(f"Delta (pp):          {100 * s['delta_pp']:+.1f}")
    print(f"McNemar p-value:     {s['mcnemar_table']['p_value']:.4g}")
    print(f"N:                   {s['mcnemar_table']['n']}  Table: {s['mcnemar_table']['table']}")

=== RAG Bias PoC — Retrieval-only Summary ===

[BM25]
SAE Recall@20:  0.925
AAVE Recall@20: 0.930
Delta (pp):          -0.5
McNemar p-value:     1
N:                   200  Table: [[185, 0], [1, 14]]

[DENSE]
SAE Recall@20:  0.920
AAVE Recall@20: 0.915
Delta (pp):          +0.5
McNemar p-value:     1
N:                   200  Table: [[183, 1], [0, 16]]


### **2.7.3 Summary of Sensitivity Analysis by Varying `top_k`**

The sensitivity analysis across retrieval depths of `k=5`, `k=10`, and `k=20` strongly reinforces the initial conclusion: **there is no statistically significant evidence of retrieval bias for either the BM25 or the dense system in this PoC.** The performance parity between SAE and AAVE queries remains remarkably consistent regardless of the retrieval cutoff.

**Key Observations**

*   **Consistent Lack of Significance:** Across all six conditions (two retrievers at three `k` values), no McNemar p-value came close to the 0.05 significance threshold; the smallest was 0.48. The observed differences in hit rates are therefore statistically indistinguishable from random noise.

*   **Stable and Negligible Deltas:** The performance gap (delta in pp) between SAE and AAVE queries was consistently minimal, fluctuating between -0.5 and +1.0 percentage points. This demonstrates that neither dialect held a meaningful advantage at any tested retrieval depth.

*   **Predictable Recall Scaling:** As expected, recall for both systems and both dialects increased steadily as `k` grew, confirming the evaluation harness is functioning correctly. The dense retriever showed a slight edge at the strictest cutoff (`k=5`), while BM25 was competitive or slightly ahead at `k=20`, but these minor differences are secondary to the core finding of dialectal fairness.

**Consolidated Results Table**

| Metric | k=5 | k=10 | k=20 |
| :--- | :---: | :---: | :---: |
| **BM25 Recall SAE** | 84.5% | 90.0% | 92.5% |
| **BM25 Recall AAVE** | 83.5% | 89.0% | 93.0% |
| *McNemar p-value* | *0.683* | *0.480* | *1.0* |
| **Dense Recall SAE** | 87.0% | 89.0% | 92.0% |
| **Dense Recall AAVE** | 87.0% | 89.5% | 91.5% |
| *McNemar p-value* | *0.480* | *1.0* | *1.0* |



The table illustrates that `top_k` functions as both a **measure of retrieval difficulty** and a **tool for sensitivity analysis**.

1.  **As a measure of difficulty**, the table shows the expected trend: as `k` increases from 5 to 20, the task becomes easier, and recall scores predictably rise for both systems and dialects. This confirms the experiment is behaving logically.

2.  **As a tool for sensitivity analysis**, the table reveals that the *parity* between SAE and AAVE performance is not dependent on a single cutoff. The lack of a significant performance gap holds true whether the criterion is strict (`k=5`) or lenient (`k=20`), which strengthens the conclusion that the retrieval systems are robustly equitable for this dataset.
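
As a cross-check, the recall rows of the consolidated table can be recomputed from the six contingency tables reported in the McNemar outputs. This is a small standard-library sketch; the helper name `recall_row` is illustrative, and the tables are copied from the results above, with `a` = both hit, `b` = SAE only, `c` = AAVE only, `d` = both miss.

```python
# Contingency tables [[a, b], [c, d]] copied from the McNemar outputs
# for each retriever and cutoff.
tables = {
    ("bm25", 5): [[165, 4], [2, 29]],
    ("bm25", 10): [[178, 2], [0, 20]],
    ("bm25", 20): [[185, 0], [1, 14]],
    ("dense", 5): [[173, 1], [1, 25]],
    ("dense", 10): [[178, 0], [1, 21]],
    ("dense", 20): [[183, 1], [0, 16]],
}

def recall_row(table):
    """Return (SAE recall %, AAVE recall %, delta in percentage points)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    sae, aave = 100 * (a + b) / n, 100 * (a + c) / n
    return sae, aave, sae - aave

rows = {key: recall_row(t) for key, t in tables.items()}
for (name, k), (sae, aave, delta) in rows.items():
    print(f"{name:5s} k={k:<2d} SAE={sae:.1f}% AAVE={aave:.1f}% delta={delta:+.1f} pp")
```

The deltas computed this way span -0.5 to +1.0 percentage points, matching the range quoted in the key observations.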



In summary, varying the retrieval depth did not uncover any latent bias. The results robustly show that, for this dataset and configuration, both sparse and dense retrieval systems perform equitably on SAE and AAVE queries.

## **2.8 Deeper Diagnostics: Rank-Based Analysis and Hybrid Retrieval Ceiling**

*   **Calculate Gold Document Ranks:** Iterate through all query pairs, retrieve the top-k results (`top_k` is still 20 from Section 2.7.2), and record the 1-based rank of the `gold_doc_id` for each dialect and retriever. A document that is not retrieved gets rank infinity.
*   **Compute Hybrid Hits:** For each query, take the union of the document IDs retrieved by BM25 and the dense model, and record a "hybrid hit" if the `gold_doc_id` is in this combined set.
*   **Analyze Rank Distribution:** Compare the median rank of the gold document for SAE vs. AAVE queries, then apply a paired Wilcoxon signed-rank test to check whether the rank distributions differ significantly.
*   **Evaluate Hybrid Recall Ceiling:** Calculate hybrid recall for both dialects and run a McNemar test to check for a significant disparity at this combined upper bound.



This section goes beyond a simple hit-or-miss evaluation. First, for every query, it records the exact **rank** (the position, from 1 to k) of the gold document in the BM25 and dense results, or infinity when the document is not retrieved. This shows whether one dialect consistently ranks the gold document higher even when both find it. Second, it computes a **hybrid recall**, which counts a hit when the gold document appears in *either* the BM25 *or* the dense top-k list. This gives an upper bound on the recall that any combination of the two top-k lists could reach. Finally, it runs paired statistical tests (Wilcoxon for ranks, McNemar for hybrid hits) to check whether any observed SAE/AAVE difference is statistically significant.
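
Because the gold document is in the union of two lists exactly when it is in at least one of them, a hybrid hit is the logical OR of the per-retriever hit flags at the same `k`. The sketch below illustrates this on toy flags (not experiment data); the helper names `hybrid_hits` and `recall` are hypothetical. In principle the same OR could be applied to hit-flag columns such as `bm25_hit_sae` and `dense_hit_sae` without retrieving again, assuming they were computed at the same cutoff.

```python
def hybrid_hits(bm25_hits, dense_hits):
    """Hybrid hit per query: gold is in the BM25 top-k OR the dense top-k."""
    return [bool(b) or bool(d) for b, d in zip(bm25_hits, dense_hits)]

def recall(hits):
    """Fraction of queries with a hit."""
    return sum(hits) / len(hits)

# Toy flags for five queries (illustrative only, not experiment data).
bm25 = [True, True, False, False, True]
dense = [True, False, True, False, True]
hybrid = hybrid_hits(bm25, dense)
print(recall(bm25), recall(dense), recall(hybrid))
```

Hybrid recall can never fall below the better of the two individual recalls, which is why it serves as a ceiling.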

In [None]:
# Calculate gold document ranks and hybrid hits for each query pair.
import numpy as np

# Helper function to find the rank of the gold document in retrieval results (1-based).
def rank_of_gold(results, gold):
    for j, (doc_id, _) in enumerate(results):
        if doc_id == gold:
            return j + 1  # 1-based rank
    return np.inf # Return infinity if gold document is not found in top-k

records_rank = []
for _, row in tqdm(df_pairs.iterrows(), total=len(df_pairs), desc="Ranks/Hybrid"):
    pid = int(row["id"])
    gold = int(row["gold_doc_id"])
    sae_q, aave_q = row["sae_query"], row["aave_query"]

    # Retrieve results for both dialects and retrievers
    sae_bm25 = retrieve_bm25(sae_q, top_k) if use_bm25 else []
    aave_bm25 = retrieve_bm25(aave_q, top_k) if use_bm25 else []
    sae_dense = retrieve_dense(sae_q, top_k) if use_dense else []
    aave_dense = retrieve_dense(aave_q, top_k) if use_dense else []

    rec = {"id": pid, "gold_doc_id": gold}

    # Record ranks (1..k) or inf if not found for each retriever and dialect
    if use_bm25:
        rec["bm25_rank_sae"]  = rank_of_gold(sae_bm25, gold)
        rec["bm25_rank_aave"] = rank_of_gold(aave_bm25, gold)
    else:
        rec["bm25_rank_sae"] = rec["bm25_rank_aave"] = None

    if use_dense:
        rec["dense_rank_sae"]  = rank_of_gold(sae_dense, gold)
        rec["dense_rank_aave"] = rank_of_gold(aave_dense, gold)
    else:
        rec["dense_rank_sae"] = rec["dense_rank_aave"] = None

    # Calculate and record hybrid hit (union of doc ids from BM25 and dense)
    if use_bm25 or use_dense:
        sae_union = set([d for d,_ in sae_bm25]) | set([d for d,_ in sae_dense])
        aave_union = set([d for d,_ in aave_bm25]) | set([d for d,_ in aave_dense])
        rec["hybrid_hit_sae"]  = gold in sae_union
        rec["hybrid_hit_aave"] = gold in aave_union
    else:
        rec["hybrid_hit_sae"] = rec["hybrid_hit_aave"] = None

    records_rank.append(rec)

ranks_df = pd.DataFrame(records_rank)
print("Ranks/Hybrid rows:", len(ranks_df))
ranks_df.head(3)

Ranks/Hybrid: 100%|██████████| 200/200 [02:16<00:00,  1.47it/s]

Ranks/Hybrid rows: 200





Unnamed: 0,id,gold_doc_id,bm25_rank_sae,bm25_rank_aave,dense_rank_sae,dense_rank_aave,hybrid_hit_sae,hybrid_hit_aave
0,0,18405,1.0,2.0,2.0,2.0,True,True
1,1,3249,1.0,1.0,3.0,3.0,True,True
2,2,510,2.0,3.0,2.0,2.0,True,True


In [None]:
# Summarize rank skew and hybrid recall.

from scipy.stats import wilcoxon
import numpy as np

# Helper to filter for finite values in a series.
def finite(series):
    s = pd.to_numeric(series, errors="coerce")
    return s.replace([np.inf, -np.inf], np.nan).dropna()

# Perform paired Wilcoxon signed-rank test.
def paired_wilcoxon(x, y):
    xf, yf = finite(x), finite(y)
    # Align on same indices
    common = xf.index.intersection(yf.index)
    if len(common) < 5:
        return {"n": int(len(common)), "stat": np.nan, "p": np.nan}
    # The test-method argument is left at its default ("auto"), which avoids depending on its keyword name.
    stat, p = wilcoxon(xf.loc[common], yf.loc[common], zero_method="wilcox", alternative="two-sided")
    return {"n": int(len(common)), "stat": float(stat), "p": float(p)}

# Calculate hit rate.
def rate(s):
    s = pd.Series(s).dropna().astype(bool)
    return float(s.mean()) if len(s) else np.nan

print("=== Rank-of-gold summary (lower is better) ===")
if use_bm25:
    bm_w = paired_wilcoxon(ranks_df["bm25_rank_sae"], ranks_df["bm25_rank_aave"])
    print(f"BM25 median rank — SAE: {finite(ranks_df['bm25_rank_sae']).median():.2f}, "
          f"AAVE: {finite(ranks_df['bm25_rank_aave']).median():.2f}, "
          f"Wilcoxon p={bm_w['p']:.4g} (n={bm_w['n']})")

if use_dense:
    de_w = paired_wilcoxon(ranks_df["dense_rank_sae"], ranks_df["dense_rank_aave"])
    print(f"Dense median rank — SAE: {finite(ranks_df['dense_rank_sae']).median():.2f}, "
          f"AAVE: {finite(ranks_df['dense_rank_aave']).median():.2f}, "
          f"Wilcoxon p={de_w['p']:.4g} (n={de_w['n']})")

print("\n=== Hybrid recall (BM25 ∪ Dense) ===")
print(f"Hybrid Recall@{top_k} — SAE: {rate(ranks_df['hybrid_hit_sae']):.3f}, "
      f"AAVE: {rate(ranks_df['hybrid_hit_aave']):.3f}")

# Optional: McNemar on hybrid
from statsmodels.stats.contingency_tables import mcnemar
x = pd.Series(ranks_df["hybrid_hit_sae"]).dropna().astype(int)
y = pd.Series(ranks_df["hybrid_hit_aave"]).dropna().astype(int)
a = int(((x==1)&(y==1)).sum()); b = int(((x==1)&(y==0)).sum())
c = int(((x==0)&(y==1)).sum()); d = int(((x==0)&(y==0)).sum())
if b + c == 0:
    # No discordant pairs: the chi-square statistic would divide by zero
    # (yielding a spurious p=0), so report p=1.0, i.e. no evidence of a gap.
    p_hybrid = 1.0
else:
    p_hybrid = float(mcnemar([[a,b],[c,d]], exact=False, correction=True).pvalue)
print(f"Hybrid McNemar p={p_hybrid:.4g}  Table=[[{a},{b}],[{c},{d}]]  N={len(x)}")

=== Rank-of-gold summary (lower is better) ===
BM25 median rank — SAE: 1.00, AAVE: 1.00, Wilcoxon p=0.3936 (n=185)
Dense median rank — SAE: 1.00, AAVE: 1.00, Wilcoxon p=0.09568 (n=183)

=== Hybrid recall (BM25 ∪ Dense) ===
Hybrid Recall@20 — SAE: 0.990, AAVE: 0.990
Hybrid McNemar p=1  Table=[[198,0],[0,2]]  N=200


### **2.8.1 Interpretation of Deeper Diagnostic Results**

This deeper analysis adds further evidence of **no detectable dialectal bias in this PoC**. The retrieval systems perform equitably not only in finding the correct document but also in how highly they rank it.

*   **Rank Analysis Shows No Disparity:**
    *   The median rank for the correct document was **1.00 for both SAE and AAVE** across both BM25 and dense systems, indicating excellent and identical top-ranking performance.
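    *   The paired Wilcoxon signed-rank test found **no statistically significant difference** in the rank distributions at the 0.05 level (p = 0.39 for BM25, p = 0.096 for Dense). Two caveats apply: the Dense p-value is not far above the threshold, and the test covers only the pairs in which both dialects retrieved the gold document within the top 20 (n = 185 and n = 183), because missing ranks are dropped before testing.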

*   **Hybrid Recall Ceiling is Identical:**
    *   Pooling the BM25 and dense top-20 lists gives a hybrid recall of **99.0% for both SAE and AAVE**.
    *   The McNemar table `[[198,0],[0,2]]` has **zero discordant cases**: the hybrid system never succeeded for one dialect while failing for the other. With no discordant pairs the McNemar statistic is undefined (its denominator, b + c, is zero), so the table itself, not a p-value, is the evidence of parity here.
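
The rank comparison in the notebook drops every pair in which either dialect missed the gold document, because `finite()` removes infinite ranks. One way to keep those pairs is to censor a miss at rank `k + 1` and compare the paired ranks with a sign test. The sketch below is an illustration under those assumptions: it uses an exact sign test rather than the Wilcoxon test used in the notebook, the censoring rule is a modeling choice, and the function names and toy ranks are hypothetical.

```python
import math
from math import comb

def censor(rank, k):
    """Map a missing gold document (rank = inf) to k + 1 so the pair is kept."""
    return k + 1 if math.isinf(rank) else int(rank)

def paired_sign_test(x_ranks, y_ranks, k):
    """Exact two-sided sign test on paired ranks after censoring misses at k + 1.

    Returns (n_x_better, n_y_better, p_value); tied pairs are dropped.
    """
    diffs = [censor(x, k) - censor(y, k) for x, y in zip(x_ranks, y_ranks)]
    x_better = sum(d < 0 for d in diffs)  # lower rank is better
    y_better = sum(d > 0 for d in diffs)
    n = x_better + y_better
    if n == 0:
        return 0, 0, 1.0
    tail = sum(comb(n, i) for i in range(min(x_better, y_better) + 1)) / 2 ** n
    return x_better, y_better, min(1.0, 2 * tail)

# Toy ranks (illustrative only); inf marks a miss within the top k.
sae = [1, 2, 1, math.inf, 3, 1]
aave = [2, 2, 1, 5, math.inf, 4]
print(paired_sign_test(sae, aave, k=20))
```

The sign test ignores the size of each rank gap, so it is less powerful than Wilcoxon, but it treats hit-versus-miss pairs as informative instead of discarding them.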

## **2.9 Final Conclusions**

This Proof-of-Concept investigation into dialectal bias yielded a clear and consistent primary finding: across a multi-faceted evaluation, we found **no statistically significant evidence of retrieval bias** against African American Vernacular English (AAVE) for either the BM25 (sparse) or the `text-embedding-3-small` (dense) retrieval system. The performance parity between Standard American English (SAE) and AAVE queries was robust, holding true across multiple metrics and evaluation criteria.

The evidence for this conclusion is threefold:

1.  **Insensitivity to Retrieval Depth:** A sensitivity analysis varying the retrieval cutoff (`k=5, 10, 20`) demonstrated that the hit rates for SAE and AAVE were statistically indistinguishable at every level. All McNemar's tests yielded high p-values, and the performance deltas were consistently negligible, confirming that the lack of bias was not an artifact of a specific `k` value.

2.  **Absence of Rank Skew:** A deeper analysis of the gold document's rank revealed no disparity. For both retrievers, the median rank was 1.00 for both dialects, indicating that the correct document was frequently the top result regardless of dialect. Paired Wilcoxon signed-rank tests, run on the pairs where both dialects retrieved the gold document, found no significant difference in the rank distributions (p = 0.39 for BM25, p = 0.096 for Dense).

3.  **Identical Hybrid Performance Ceiling:** When the top-20 outputs of BM25 and dense retrieval were pooled, recall reached 99.0% for both SAE and AAVE. The hybrid contingency table contained zero discordant cases, meaning the pooled system never succeeded for one dialect while failing for the other. Within this dataset, the combined retrievers therefore show no dialect gap at all.

However, these findings must be interpreted within the specific context and limitations of this PoC. The 200 query pairs were **synthetically generated** by a large language model instructed to paraphrase SAE questions into AAVE. While this approach ensures controlled, parallel data, the resulting AAVE may not fully capture the lexical richness, syntactic diversity, or sociolinguistic nuance of naturally occurring dialectal speech. It is plausible that modern, highly capable embedding models are robust to these specific, structured transformations, leading to the observed null result.

Therefore, while this study provides a valuable and encouraging baseline, it does not definitively prove the absence of dialectal bias in all scenarios. The primary implication is that for well-formed questions against a standard corpus like SQuAD, modern retrieval systems can exhibit high robustness to certain forms of dialectal variation.

Future work should focus on validating and generalizing these initial findings by:
*   **Expanding the Dataset:** Scaling the evaluation to a larger and more diverse set of queries, ideally incorporating naturally occurring (non-synthetic) AAVE data.
*   **Deepening Error Analysis:** Stratifying results by question type, context length, or linguistic features to determine if bias emerges in specific, more challenging subsets.
*   **Testing System Robustness:** Evaluating the stability of these findings by testing with different embedding models or more advanced sparse retrieval configurations.
*   **Evaluating the Full RAG Pipeline:** Extending the analysis beyond retrieval to assess the downstream ***Generation*** component, where biases may manifest differently.