# Notebook 04 — Hybrid Retrieval (BM25 + FAISS) + Rerank
Goal: BM25 keyword recall + dense E5 vectors → fuse → rerank with bge-reranker → soft filters by competition/topic/stage.

## Load artifacts & models

Purpose: load meta.parquet, FAISS index, E5 for queries, and optional reranker.

In [17]:
# --- 0) Load index + models (single source of truth) ---
from pathlib import Path
import pandas as pd
import numpy as np
import torch, faiss

NOTEBOOK_DIR = Path.cwd()
ROOT = NOTEBOOK_DIR.parent
INDEX_DIR  = ROOT / "data" / "index"
META_PATH  = INDEX_DIR / "meta.parquet"
FAISS_PATH = INDEX_DIR / "faiss_e5.index"

assert META_PATH.exists() and FAISS_PATH.exists(), "Run Notebook 03 first."

meta = pd.read_parquet(META_PATH).reset_index(drop=True)
index = faiss.read_index(str(FAISS_PATH))
assert index.ntotal == len(meta), f"FAISS size ({index.ntotal}) != meta rows ({len(meta)})"

# Models
from sentence_transformers import SentenceTransformer
DEVICE  = "mps" if torch.backends.mps.is_available() else "cpu"
E5_ID   = "intfloat/multilingual-e5-base"
e5 = SentenceTransformer(E5_ID, device=DEVICE)

# Optional reranker (safe fallback)
try:
    from FlagEmbedding import FlagReranker
    # Force fp32; avoid MPS fp16
    reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=False)
    HAVE_RERANKER = True
except Exception as e:
    print("Reranker unavailable → will skip rerank:", e)
    reranker = None
    HAVE_RERANKER = False

print("Loaded:", len(meta), "chunks | FAISS ntotal:", index.ntotal)
print("Device:", DEVICE, "| Reranker:", HAVE_RERANKER)
print("Meta columns:", list(meta.columns)[:10], "…")

Loaded: 137 chunks | FAISS ntotal: 137
Device: mps | Reranker: True
Meta columns: ['rid', 'qa_id', 'competition', 'topic', 'stage', 'page_start', 'page_end', 'chunk_type', 'content'] …


## Build BM25 corpus (tokenize TR/EN)

Purpose: keyword recall over all chunks.

In [18]:
# --- 1) BM25 corpus ---
from rank_bm25 import BM25Okapi
import re

def norm_tokenize(text: str):
    # keep Turkish diacritics; split to word tokens
    return re.findall(r"[A-Za-z0-9ÇĞİÖŞÜçğıöşü]+", text.lower())

bm25_corpus = [norm_tokenize(s) for s in meta["content"].astype(str).tolist()]
bm25 = BM25Okapi(bm25_corpus)
print("BM25 built over", len(bm25_corpus), "chunks")

BM25 built over 137 chunks


## Router (synonyms + hint detection)

Purpose: detect likely competition/topic/stage from the query text.

In [19]:
# --- 2) Router (soft hints for competition/topic/stage) ---
import re

SYN = {
    "competition": {
        "HSS":      [r"hss", r"hava savunma"],
        "E-TICARET":[r"e-?ticaret", r"e ticaret", r"hackathon"],
        "ADRES":    [r"adres(?![a-z])", r"adres çöz"],
    },
    "topic": {
        "timeline":   [r"son başvuru|başvuru tarihi|son tarih|takvim"],
        "penalties":  [r"ceza|diskalifiye|ihlal|yasak|kural d[ıi]şı|dost (hedef|ateş[iı])|yanl[ıi]ş hedef"],
        "scoring":    [r"puan|puanlama|bsp|kriter|bonus|kesinti|değerlendir"],
        "team":       [r"tak[ıi]m|üye|danışman|ekip"],
        "eligibility":[r"uygun|başvuru koşul|kimler|gereklilik|şart"],
        "stages":     [r"aşama|görev|süreç|sunum"],
        "logistics":  [r"konaklama|ulaşım|destek|mekan|sponsor"],
    },
    "stage": {
        1: [r"\b1(\.|\s|$)|\bi\b"],
        2: [r"\b2(\.|\s|$)|\bii\b"],
        3: [r"\b3(\.|\s|$)|\biii\b"],
    }
}

def route_query(q: str):
    ql = q.lower()
    comp = next((c for c,pats in SYN["competition"].items() if any(re.search(p, ql) for p in pats)), None)
    topic = next((t for t,pats in SYN["topic"].items()       if any(re.search(p, ql) for p in pats)), None)
    stage = next((s for s,pats in SYN["stage"].items()       if any(re.search(p, ql) for p in pats)), None)
    return {"competition": comp, "topic": topic, "stage": stage}

## Soft filters (apply only if they don’t zero-out results)

Purpose: take candidate indices and keep only those matching competition/topic/stage if those values exist and filtering doesn’t wipe everything. Otherwise, fall back to the originals.

In [20]:
# --- 3) Soft filters ---
def supported_values(col, val):
    if val is None:
        return False
    try:
        return val in set(meta[col].dropna().unique())
    except KeyError:
        return False

def apply_soft_filter(idxs, comp=None, topic=None, stage=None, want=None):
    idxs = list(idxs)
    if not idxs:
        return idxs

    orig = idxs[:]

    # competition
    if comp and supported_values("competition", comp):
        kept = [i for i in idxs if meta.loc[i, "competition"] == comp]
        idxs = kept or orig

    # topic (with penalties fallback)
    if topic:
        if supported_values("topic", topic):
            kept = [i for i in idxs if meta.loc[i, "topic"] == topic]
            if not kept and topic == "penalties":
                # label exists globally but none in this candidate set → content fallback
                import re
                pat = re.compile(r"ceza|diskalifiye|yanl[ıi]ş hedef|dost hedef", re.IGNORECASE)
                kept = [i for i in idxs if pat.search(str(meta.loc[i, "content"]))]

            idxs = kept or idxs
        else:
            # label missing globally → content fallback only for penalties
            if topic == "penalties":
                import re
                pat = re.compile(r"ceza|diskalifiye|yanl[ıi]ş hedef|dost hedef", re.IGNORECASE)
                kept = [i for i in idxs if pat.search(str(meta.loc[i, "content"]))]
                if kept:
                    idxs = kept

    # stage
    if stage and supported_values("stage", stage):
        kept = [i for i in idxs if meta.loc[i, "stage"] == stage]
        idxs = kept or idxs

    if want:
        idxs = idxs[:want]
    return idxs

## Search primitives (FAISS dense + BM25 keyword)
Purpose: helpers to get top-K candidates from each retriever, safely handling edge cases.

In [21]:
# --- 4) Search primitives ---
def faiss_search(q: str, top_k=60):
    """Dense search with E5; returns (scores, indices)."""
    qvec = e5.encode(["query: " + q], normalize_embeddings=True).astype("float32")
    D, I = index.search(qvec, min(top_k, index.ntotal))
    return D[0], I[0]

def bm25_search(q: str, top_k=60):
    """Keyword search with BM25; returns (scores, indices)."""
    toks = norm_tokenize(q)
    scores = bm25.get_scores(toks)  # numpy array, len = corpus size
    n = len(scores)
    k = min(top_k, n)
    if k == n:
        top_idx = np.argsort(scores)[::-1]
    else:
        part = np.argpartition(scores, -k)[-k:]
        top_idx = part[np.argsort(scores[part])[::-1]]
    return scores[top_idx], top_idx

## Hybrid fusion + (optional) rerank

Purpose: union dense + bm25 (after soft filters), normalize to [0,1], fuse with weights, then rerank with bge-reranker-v2-m3 if available.

In [22]:
# --- 5) Hybrid search (soft filters + fusion + (optional) rerank) ---
def hybrid_search(query, bm25_k=60, dense_k=60, rerank_k=10, weights=(0.75, 0.25)):
    route = route_query(query)
    comp, topic, stage = route["competition"], route["topic"], route["stage"]

    # Dense search → zip → soft-filter (keep score/index alignment)
    d_scores, d_idx = faiss_search(query, top_k=dense_k)
    d_pairs_all = list(zip(d_idx.tolist(), d_scores.tolist()))
    d_keep = apply_soft_filter([i for i,_ in d_pairs_all], comp, topic, stage, want=dense_k)
    d_set = set(d_keep)
    d_pairs = [(i, s) for (i, s) in d_pairs_all if i in d_set]
    d_map = {i: float(s) for (i, s) in d_pairs}

    # BM25 search → zip → soft-filter (keep score/index alignment)
    b_scores, b_idx = bm25_search(query, top_k=bm25_k)
    b_pairs_all = list(zip(b_idx.tolist(), b_scores.tolist()))
    b_keep = apply_soft_filter([i for i,_ in b_pairs_all], comp, topic, stage, want=bm25_k)
    b_set = set(b_keep)
    b_pairs = [(i, s) for (i, s) in b_pairs_all if i in b_set]
    b_map = {i: float(s) for (i, s) in b_pairs}

    # Normalize each modality to [0,1]
    def _norm_map(m):
        if not m: return {}
        vals = np.array(list(m.values()), dtype=float)
        lo, hi = float(vals.min()), float(vals.max())
        if hi <= lo:  # constant
            return {k: 0.0 for k in m}
        return {k: (v - lo) / (hi - lo) for k,v in m.items()}

    d_norm = _norm_map(d_map)
    b_norm = _norm_map(b_map)

    # Fuse (weighted sum)
    fused = {}
    for i, s in d_norm.items():
        fused[i] = fused.get(i, 0.0) + weights[1] * s
    for i, s in b_norm.items():
        fused[i] = fused.get(i, 0.0) + weights[0] * s
    if not fused:
        return [], route

    # Tiny feature bonus (penalties / dates / team size)
    import re
    MONTHS = "ocak|şubat|subat|mart|nisan|mayıs|mayis|haziran|temmuz|ağustos|agustos|eylül|eylul|ekim|kasım|kasim|aralık|aralik"
    DATE_RE = re.compile(rf"\b\d{{1,2}}[./]\d{{1,2}}[./]\d{{4}}\b|({MONTHS})", re.IGNORECASE)

    def _bonus(i: int, ql: str) -> float:
        text = str(meta.loc[i, "content"]).lower()
        ctype = meta.loc[i, "chunk_type"]
        b = 0.0
        if re.search(r"ceza|penalt|diskalifiye|yanl[ıi]ş hedef|dost hedef", ql):
            if re.search(r"ceza|diskalifiye|puan", text): b += 0.15
        if re.search(r"son başvuru|başvuru tarihi|deadline|tarih|takvim", ql):
            if DATE_RE.search(text): b += 0.15
            if ctype == "limits":   b += 0.05
        if re.search(r"kaç kişi|kaç kis|kaç üye|takım sayısı|ekip sayısı", ql):
            if re.search(r"\b(kişi|üye)\b", text) and re.search(r"\b\d+\b", text): b += 0.12
            if ctype == "limits": b += 0.05
        return b

    ql = query.lower()
    for i in list(fused.keys()):
        fused[i] += _bonus(i, ql)

    # Prelim list for rerank
    prelim = sorted(fused.items(), key=lambda x: x[1], reverse=True)[:max(3*rerank_k, rerank_k)]
    prelim_ids = [i for i,_ in prelim]

    # Rerank (if available)
    if HAVE_RERANKER:
        pairs = [(query, meta.loc[i, "content"]) for i in prelim_ids]
        if not pairs:
            return [], route
        scores = reranker.compute_score(pairs, normalize=True, max_length=512)
        pairs_scored = list(zip(prelim_ids, scores))

        if route.get("topic") == "penalties":
            import re
            pen_re   = re.compile(r"ceza|diskalifiye|yanl[ıi]ş hedef|dost hedef", re.IGNORECASE)
            strong_re= re.compile(r"-\\s*30|yanl[ıi]ş hedef", re.IGNORECASE)  # stronger hints
            bsp_re   = re.compile(r"\\bbsp\\b", re.IGNORECASE)

            bumped = []
            for i, s in pairs_scored:
                text  = str(meta.loc[i, "content"]).lower()
                bonus = 0.0
                if pen_re.search(text):     bonus += 0.04   # was 0.02
                if strong_re.search(text):  bonus += 0.03   # extra for “-30” / “yanlış hedef”
                if bsp_re.search(text):     bonus -= 0.03   # tiny down-bump for BSP in a penalties query
                bumped.append((i, float(s + bonus)))
        else:
            bumped = [(i, float(s)) for i, s in pairs_scored]

        # de-dup by qa_id; prefer 'qa' over 'limits'/'formula'
        def _ctype_pref(i: int) -> int:
            return {"qa": 2, "limits": 1, "formula": 0}.get(meta.loc[i, "chunk_type"], 0)

        best_by_qid = {}
        for i, s in bumped:
            qid = meta.loc[i, "qa_id"]
            prev = best_by_qid.get(qid)
            if prev is None or s > prev[1] + 1e-9 or (abs(s - prev[1]) < 1e-9 and _ctype_pref(i) > _ctype_pref(prev[0])):
                best_by_qid[qid] = (i, s)

        ranked = sorted(best_by_qid.values(), key=lambda x: x[1], reverse=True)[:rerank_k]
        return ranked, route

    # Fallback: fused order
    return [(i, s) for i, s in prelim[:rerank_k]], route

## Pretty printer (compact, RAG-friendly)
Purpose: print top hits with useful source tags.

In [23]:
# --- 6) Pretty print results ---
def show_results(query, results, max_chars=200):
    print("\nQ:", query)
    if not results:
        print("  (no candidates)")
        return
    for rank, (i, sc) in enumerate(results, 1):
        row = meta.loc[i]
        snippet = row["content"].replace("\n", " ")
        print(f"{rank:>2}. [{row['qa_id']} | {row['competition']} | {row['topic']} "
              f"| p{int(row['page_start'])}-{int(row['page_end'])} | {row['chunk_type']}]")
        print("    score:", f"{float(sc):.3f}")
        print("    ", (snippet[:max_chars] + ("…" if len(snippet) > max_chars else "")))

## Basic sanity queries (TR + EN)
Purpose: ensure end-to-end retrieval works and results look sensible.

In [24]:
# --- 7) Quick sanity tests (TR only) ---
tests = [
    "HSS dost hedef vurma cezası nedir?",
    "E-Ticaret için son başvuru tarihi",
    "HSS BSP formülü nedir?",
    "Adres yarışmasında takım kaç kişi olmalı?",
]

for q in tests:
    ranked, route = hybrid_search(q, bm25_k=60, dense_k=60, rerank_k=8)
    print("route:", route)
    show_results(q, ranked, max_chars=180)

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


route: {'competition': 'HSS', 'topic': 'penalties', 'stage': None}

Q: HSS dost hedef vurma cezası nedir?
 1. [HSS-Q003 | HSS | eligibility | p9-14 | qa]
    score: 1.055
     Q: Hava Savunma Sistemleri Yarışması'nda yer alan "Bonus Süre Puanı" (BSP) A: mekanizmasının amacı nedir ve hangi koşullarda takımlara avantaj sağlamaktadır? Bu  puanın toplam sıra…
 2. [HSS-Q028 | HSS | team | p5-5 | qa]
    score: 1.022
     Q: 2. Aşamada dost b+r hedef+ vurmanın cezası ned+r A: Tek b+r dost hedef+ vurmak -30 puanlık b+r ceza +le sonuçlanır. Eğer b+r takım +k+ veya daha  fazla dost hedef vurursa görev+…
route: {'competition': 'E-TICARET', 'topic': 'timeline', 'stage': None}

Q: E-Ticaret için son başvuru tarihi
 1. [E-TICARET-QG002 | E-TICARET | team | p16-17 | qa]
    score: 0.090
     Q: sürecine ne gibi bir şe`aﬂık ve derinlik katmaktadır? A: Projelerin, sunumların başlayacağı 14 Eylül Pazar saat 14:00'e kadar Github'a yüklenmiş  olması zorunluluğu60, değerlend…
 2. [E-TICARET-QG001 | E-TICA