# CV RAG Query Notebook (Deep Dive - Hybrid Search)

In this notebook we break querying down into its basic building blocks.
We use special methods in the code that reveal how Hybrid Search arrived at its ranking.

What we will see:
1.  What the BM25 part found (Top K).
2.  What the vector part found (Top K).
3.  **SIMULATION:** We compute the RRF scores by hand, directly in the notebook.
4.  **METRICS:** We measure the overlap between the two methods.

In [1]:
import logging
from pathlib import Path
import sys
import pandas as pd

sys.path.insert(0, str(Path.cwd().parent))

logging.basicConfig(level=logging.ERROR)  # Fewer logs, more of our own output

from src.config import AppConfig
from src.embeddings import EmbeddingsManager
from src.vector_store import VectorStoreManager
from src.parent_retriever import CVParentRetriever

In [5]:
# Setup
config = AppConfig()
config.rag.use_hybrid_search = True
embeddings_mgr = EmbeddingsManager(config.azure)
vs_manager = VectorStoreManager(config.rag, embeddings_mgr.get_embeddings())
vectorstore = vs_manager.load_vectorstore()
retriever = CVParentRetriever(config.rag, vectorstore, config.azure)
retriever.load_from_existing_store()

print("✅ Retriever ready.")




## PART 1: Raw Data (What Do BM25 and the Vectors Return?)
First, we pull out the raw results.

In [4]:
query = "python developer"

print(f"🔍 Query: '{query}'")

# Pull the data straight from the internal retrievers
hybrid_core = retriever._hybrid_retriever

# 1. BM25
bm25_results = hybrid_core.bm25_retriever.invoke(query)
# Trim to the Top 5 for readability
bm25_results = bm25_results[:5]

# 2. Vectors
embedding_results = hybrid_core.embedding_retriever.invoke(query)
# Trim to the Top 5 for readability
embedding_results = embedding_results[:5]

print(f"Retrieved {len(bm25_results)} BM25 results and {len(embedding_results)} vector results.")

🔍 Query: 'python developer'
Retrieved 5 BM25 results and 5 vector results.


## PART 2: RRF SIMULATION (Reciprocal Rank Fusion)
Here you can see EXACTLY the logic that runs inside the retriever. We merge the results based on their position (rank) in each list.

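The score of document `d`, summed over the retrievers `r` that returned it:
$$ Score(d) = \sum_{r} \frac{w_r}{k + rank_r(d)} $$

Here `k` is a constant (60 in this notebook), `w_r` is the weight of retriever `r` (1.0 for both), and `rank_r(d)` is the 1-based position of `d` in that retriever's list (1, 2, 3, ...). Python's `enumerate` counts from 0, so the code adds 1 to the index.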
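
As a worked instance of the formula, take a hypothetical document that BM25 places 5th and the vector retriever places 3rd. Counting ranks from 1 (as the fusion code does via `rank + 1`), with `k = 60` and both weights equal to 1:

$$ Score(d) = \frac{1}{60 + 5} + \frac{1}{60 + 3} \approx 0.01538 + 0.01587 \approx 0.03126 $$

A document that appears in only one list, even in first place, scores at most $\frac{1}{60 + 1} \approx 0.01639$. Being found by both methods at modest ranks therefore outweighs a single first place.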

In [10]:
for rank, doc in enumerate(bm25_results):
    print(rank)

0
1
2
3
4


In [7]:
from collections import defaultdict

# Constants
K = 60
BM25_WEIGHT = 1.0
VECTOR_WEIGHT = 1.0

rrf_scores = defaultdict(float)
# Map for storing the document objects so that we do not copy them
doc_map = {}

print("--- FUSION START ---")

# 1. Process BM25
print("\n1. Computing scores from BM25:")
for rank, doc in enumerate(bm25_results):
    # Build a unique key (name + a snippet of text), because the ID may differ
    doc_key = f"{doc.metadata.get('candidate_name')}_{doc.page_content[:20]}"

    # enumerate() is 0-based, RRF ranks are 1-based, hence rank + 1
    score = BM25_WEIGHT * (1 / (K + rank + 1))
    rrf_scores[doc_key] += score
    doc_map[doc_key] = doc

    print(f"   Rank {rank+1}: {doc_key:<30} -> +{score:.5f}")

# 2. Process the vectors
print("\n2. Adding scores from vectors:")
for rank, doc in enumerate(embedding_results):
    doc_key = f"{doc.metadata.get('candidate_name')}_{doc.page_content[:20]}"

    score = VECTOR_WEIGHT * (1 / (K + rank + 1))

    # Here we see whether the key already exists (overlap)
    if doc_key in rrf_scores:
        print(f"   Rank {rank+1}: {doc_key:<30} -> +{score:.5f} (MATCH! Boosting score)")
    else:
        print(f"   Rank {rank+1}: {doc_key:<30} -> +{score:.5f} (New)")

    rrf_scores[doc_key] += score
    # If this is a new document, store it
    if doc_key not in doc_map:
        doc_map[doc_key] = doc

# 3. Sort by fused score, highest first
sorted_docs = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

print("\n🏆 FINAL RANKING (Top 5):")
for i, (key, score) in enumerate(sorted_docs[:5], 1):
    print(f"   {i}. Score {score:.5f} | {key}")

--- FUSION START ---

1. Computing scores from BM25:
   Rank 1: Látal Michael_www.dolphinconsultin -> +0.01639
   Rank 2: Huňa Tomáš_www.dolphinconsultin -> +0.01613
   Rank 3: Bobůrka Vojtěch_www.dolphinconsultin -> +0.01587
   Rank 4: Konvalinka Michal_www.dolphinconsultin -> +0.01562
   Rank 5: Hušek Michal_www.dolphinconsultin -> +0.01538

2. Adding scores from vectors:
   Rank 1: Bukovský Petr_Javascript (scriptin -> +0.01639 (New)
   Rank 2: Baláček Daniel_PowerDesigner

Appli -> +0.01613 (New)
   Rank 3: Hušek Michal_www.dolphinconsultin -> +0.01587 (MATCH! Boosting score)
   Rank 4: Hlinková Zuzana_www.dolphinconsultin -> +0.01562 (New)
   Rank 5: Huňa Tomáš_Cloud environment

A -> +0.01538 (New)

🏆 FINAL RANKING (Top 5):
   1. Score 0.03126 | Hušek Michal_www.dolphinconsultin
   2. Score 0.01639 | Látal Michael_www.dolphinconsultin
   3. Score 0.01639 | Bukovský Petr_Javascript (scriptin
   4. Score 0.01613 | Huňa Tomáš_www.dolph
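
The two fusion loops above can be folded into one reusable function. The following is a minimal sketch, not the project's actual implementation: the name `rrf_fuse`, its signature, and the single-letter keys are hypothetical stand-ins for the notebook's `doc_key` strings. It assumes each ranking is already reduced to a list of hashable keys, best match first.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def rrf_fuse(
    rankings: Sequence[Sequence[Hashable]],
    weights: Sequence[float],
    k: int = 60,
) -> List[Tuple[Hashable, float]]:
    """Fuse several ranked lists of keys with weighted Reciprocal Rank Fusion."""
    if len(rankings) != len(weights):
        raise ValueError("rankings and weights must have the same length")

    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        # start=1 gives the 1-based rank, equivalent to `rank + 1` above
        for rank, key in enumerate(ranking, start=1):
            scores[key] += weight / (k + rank)

    # Highest fused score first; sorted() is stable, so ties keep insertion order
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical keys: "E" is 5th in the first list and 3rd in the second
bm25_keys = ["A", "B", "C", "D", "E"]
vector_keys = ["F", "G", "E", "H", "I"]

fused = rrf_fuse([bm25_keys, vector_keys], weights=[1.0, 1.0])
for position, (key, score) in enumerate(fused[:5], start=1):
    print(f"{position}. {key} -> {score:.5f}")
```

With these toy lists, the only key present in both rankings ("E") ends up first, mirroring what happens to the one overlapping chunk in the run above.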

## PART 3: METRICS (Overlap)
How much did the results overlap? A small overlap means the two methods found largely different documents, which is actually a good sign for Hybrid Search: the methods complement each other.

We use the **Jaccard similarity**:
$$ J(A,B) = \frac{|A \cap B|}{|A \cup B|} $$
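
To make the formula concrete, here is a minimal sketch on toy data. The helper name `jaccard` and the two sets are illustrative and not part of the project code. Two top-5 lists that share a single item contain 9 distinct items in total, so the similarity is 1/9.

```python
from typing import Hashable, Set


def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Return |A ∩ B| / |A ∪ B|, defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Illustrative stand-ins for two top-5 result lists sharing one item ("e")
list_a = {"a", "b", "c", "d", "e"}
list_b = {"e", "f", "g", "h", "i"}

print(f"{jaccard(list_a, list_b):.2%}")  # 1 shared item out of 9 -> prints 11.11%
```

The value depends only on set membership, not on rank, so it complements the rank-based RRF score rather than replacing it.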

In [11]:
# Build the sets of keys
bm25_keys = set()
for doc in bm25_results:
    key = f"{doc.metadata.get('candidate_name')}_{doc.page_content[:20]}"
    bm25_keys.add(key)

vector_keys = set()
for doc in embedding_results:
    # Be careful with the vectors: they return child chunks.
    # For a fair comparison we should rather compare candidate NAMES,
    # because the text will always differ (a paragraph vs. the whole CV).
    key = f"{doc.metadata.get('candidate_name')}_{doc.page_content[:20]}"
    vector_keys.add(key)

# Intersection and union
intersection = bm25_keys.intersection(vector_keys)
union = bm25_keys.union(vector_keys)

jaccard_index = len(intersection) / len(union) if len(union) > 0 else 0

print("\n📊 OVERLAP METRICS:")
print(f"   BM25 count: {len(bm25_keys)}")
print(f"   Vector count: {len(vector_keys)}")
print(f"   Intersection (identical chunk texts): {len(intersection)}")
print(f"   Jaccard similarity: {jaccard_index:.2%}")

if jaccard_index == 0:
    print("\n💡 INTERPRETATION: A Jaccard of 0% is EXPECTED here.")
    print("   BM25 returns 'Parent Chunks' (long texts).")
    print("   The Vector Store returns 'Child Chunks' (short excerpts).")
    print("   Their beginnings (page_content[:20]) may or may not match.")


📊 OVERLAP METRICS:
   BM25 count: 5
   Vector count: 5
   Intersection (identical chunk texts): 1
   Jaccard similarity: 11.11%


In [13]:
print(bm25_keys)
print(vector_keys)
print(intersection)
print(union)

{'Hušek Michal_www.dolphinconsultin', 'Látal Michael_www.dolphinconsultin', 'Huňa Tomáš_www.dolphinconsultin', 'Konvalinka Michal_www.dolphinconsultin', 'Bobůrka Vojtěch_www.dolphinconsultin'}
{'Baláček Daniel_PowerDesigner\n\nAppli', 'Hušek Michal_www.dolphinconsultin', 'Huňa Tomáš_Cloud environment\n\nA', 'Hlinková Zuzana_www.dolphinconsultin', 'Bukovský Petr_Javascript (scriptin'}
{'Hušek Michal_www.dolphinconsultin'}
{'Hušek Michal_www.dolphinconsultin', 'Huňa Tomáš_Cloud environment\n\nA', 'Hlinková Zuzana_www.dolphinconsultin', 'Bukovský Petr_Javascript (scriptin', 'Látal Michael_www.dolphinconsultin', 'Huňa Tomáš_www.dolphinconsultin', 'Konvalinka Michal_www.dolphinconsultin', 'Bobůrka Vojtěch_www.dolphinconsultin', 'Baláček Daniel_PowerDesigner\n\nAppli'}


### Alternative Metric: Overlap at the Candidate Level
Because the RAG pipeline looks up the parent document in its last step (Parent Retriever), it makes more sense to measure whether both methods found the **same people** (candidates), even if they got there through different pieces of text.

In [14]:
bm25_names = {doc.metadata.get('candidate_name') for doc in bm25_results}
vector_names = {doc.metadata.get('candidate_name') for doc in embedding_results}

intersection_names = bm25_names.intersection(vector_names)
union_names = bm25_names.union(vector_names)

jaccard_names = len(intersection_names) / len(union_names) if len(union_names) > 0 else 0

print("\n📊 CANDIDATE OVERLAP (Names):")
print(f"   BM25 found: {bm25_names}")
print(f"   Vector found: {vector_names}")
print(f"   Shared candidates: {intersection_names}")
print(f"   Jaccard similarity (candidates): {jaccard_names:.2%}")

if jaccard_names > 0:
    print("\n✅ See! Even though the texts (chunks) differed, both methods often find the same relevant people.")
else:
    print("\n❌ Each method found completely different people. This happens with specific queries.")


📊 CANDIDATE OVERLAP (Names):
   BM25 found: {'Konvalinka Michal', 'Bobůrka Vojtěch', 'Látal Michael', 'Huňa Tomáš', 'Hušek Michal'}
   Vector found: {'Bukovský Petr', 'Baláček Daniel', 'Hlinková Zuzana', 'Huňa Tomáš', 'Hušek Michal'}
   Shared candidates: {'Huňa Tomáš', 'Hušek Michal'}
   Jaccard similarity (candidates): 25.00%

✅ See! Even though the texts (chunks) differed, both methods often find the same relevant people.
