# CV RAG Training Notebook (Deep Dive - Hybrid Search)

This notebook goes "under the hood". Instead of blindly calling methods, we will look inside the objects.
The goal is to understand:
1.  What exactly is in the BM25 index.
2.  What exactly is in the vector index.
3.  How these two worlds meet.

**IMPORTANT:** Here we will touch "private" attributes (those starting with an underscore). This is not done in production code, but it is necessary for teaching purposes.

In [1]:
import sys
import logging
from pathlib import Path

sys.path.insert(0, str(Path.cwd().parent))

# Set logging to INFO so we can see what is happening
logging.basicConfig(
    level=logging.INFO,
    format='%(name)s - %(message)s'
)

from src.config import AppConfig
from src.document_loader import CVDocumentLoader
from src.embeddings import EmbeddingsManager
from src.vector_store import VectorStoreManager
from src.parent_retriever import CVParentRetriever

## 1. Setup (Quick Walkthrough)
Same as last time: we prepare the data and a retriever with Hybrid Search enabled.

In [2]:
config = AppConfig()
config.rag.use_hybrid_search = True  # FORCE HYBRID SEARCH

# Load the data
loader = CVDocumentLoader(config.rag.data_directory_ntb)
candidates = loader.load_all_cvs()
documents = loader.convert_to_langchain_documents(candidates)

# Setup Store
embeddings_mgr = EmbeddingsManager(config.azure)
vs_manager = VectorStoreManager(config.rag, embeddings_mgr.get_embeddings())
vs_manager.clear_vectorstore()
vectorstore = vs_manager.create_or_load_vectorstore()

# Initialize the retriever
retriever = CVParentRetriever(
    config=config.rag,
    vectorstore=vectorstore,
    azure_config=config.azure
)
retriever.initialize_retriever(documents)

print("\n✅ Retriever initialized.")

src.document_loader - Found 27 DOCX files in ..\data\OneDrive_2025-12-16
src.document_loader - Loaded CV for Baláček Daniel (3020 characters)
src.document_loader - Loaded CV for Bobůrka Vojtěch (2458 characters)
src.document_loader - Loaded CV for Bronec Ondřej (3757 characters)
src.document_loader - Loaded CV for Bukovský Petr (2628 characters)
src.document_loader - Loaded CV for Bímová Kamila (2042 characters)
src.document_loader - Loaded CV for Dlugošová Lenka (2383 characters)
src.document_loader - Loaded CV for Duleba Peter (2873 characters)
src.document_loader - Loaded CV for Fejfarová Julia (1445 characters)
src.document_loader - Loaded CV for Fejfar Ondřej (1289 characters)
src.document_loader - Loaded CV for Gleb Tcypin (2543 characters)
src.document_loader - Loaded CV for Hlavatá Michaela (5445 characters)
src.document_loader - Loaded CV for Hlinková Zuzana (2615 characters)
src.document_loader - Loaded CV for Holman Martin (1559 characters)
src.document_loader 


✅ Retriever initialized.


## 2. DEEP DIVE: What Is Inside?

Now we look directly into the guts of the `retriever` object. We care about two main components:
1.  `bm25_retriever` - keyword search
2.  `embedding_retriever` - vector search

In [3]:
# Get the HybridRetriever instance (hidden in _hybrid_retriever)
hybrid_core = retriever._hybrid_retriever

print(f"Hybrid Core Object: {hybrid_core}")
print(f"BM25 Retriever: {hybrid_core.bm25_retriever}")
print(f"Embedding Retriever: {hybrid_core.embedding_retriever}")

Hybrid Core Object: <src.hybrid_retriever.HybridRetriever object at 0x000002F920CCE660>
BM25 Retriever: vectorizer=<rank_bm25.BM25Okapi object at 0x000002F920CCEBA0> k=10
Embedding Retriever: tags=['Chroma', 'AzureOpenAIEmbeddings'] vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x000002F91E2474D0> search_kwargs={'k': 10}


### A. BM25 Analysis (Keywords)
What does BM25 index? Whole documents (Parent Chunks) or small pieces (Child Chunks)?
We will search for a single word and look at the length of the result.
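Before calling the real retriever, it may help to see what a BM25 index holds in principle. The sketch below is a toy, pure-Python Okapi-style scorer, not the `rank_bm25.BM25Okapi` implementation used by the notebook. The class name `TinyBM25`, the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the particular IDF variant are all assumptions made for illustration; the library's exact formula may differ.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


class TinyBM25:
    """Toy BM25 index: per-document term counts, lengths, document frequencies."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        tokens = [tokenize(d) for d in docs]
        self.n_docs = len(docs)
        self.doc_len = [len(t) for t in tokens]
        self.avgdl = sum(self.doc_len) / self.n_docs
        self.term_freqs = [Counter(t) for t in tokens]
        # Document frequency: in how many documents each term appears.
        self.df = Counter()
        for t in tokens:
            self.df.update(set(t))

    def idf(self, term):
        n = self.df.get(term, 0)
        # One common smoothed IDF variant; the "+ 1.0" keeps it non-negative.
        return math.log((self.n_docs - n + 0.5) / (n + 0.5) + 1.0)

    def score(self, query, i):
        total = 0.0
        for term in tokenize(query):
            tf = self.term_freqs[i].get(term, 0)
            if tf == 0:
                continue
            # Length normalization: long documents are penalized relative to avgdl.
            norm = tf + self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avgdl)
            total += self.idf(term) * tf * (self.k1 + 1) / norm
        return total

    def scores(self, query):
        return [self.score(query, i) for i in range(self.n_docs)]


# Hypothetical mini-corpus standing in for the CV texts.
docs = [
    "python developer with sql experience",
    "power bi instructor and excel consultant",
    "python python data engineer",
]
index = TinyBM25(docs)
scores = index.scores("python")
print(scores)
```

The point to take away is that the index stores only token statistics for whatever text units it was given. Whether those units are Parent Chunks or Child Chunks is decided by the code that builds the index, which is what the next cell checks.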

In [5]:
query = "python"

# Call the BM25 part DIRECTLY (in isolation)
bm25_results = hybrid_core.bm25_retriever.invoke(query)

print(f"🔍 BM25 search for the word '{query}':")
print(f"   Number of results: {len(bm25_results)}")

if bm25_results:
    doc = bm25_results[0]
    print(f"\n   First result:")
    print(f"   Author: {doc.metadata.get('candidate_name')}")
    print(f"   Text length: {len(doc.page_content)} characters (long text = Parent Chunk)")
    # [:] prints the whole text, so the full chunk is visible
    print(f"   Sample: {doc.page_content[:]}...")
else:
    print("   No results for BM25.")

🔍 BM25 search for the word 'python':
   Number of results: 10

   First result:
   Author: Konvalinka Michal
   Text length: 1997 characters (long text = Parent Chunk)
   Sample: www.dolphinconsulting.cz	

Michal Konvalinka

Instructor and Consultant of BI and MS Office Applications 

 Prague



Key qualifications

Power BI instructor and consultant

Data cleansing, transformation, analysis, and visualization

Microsoft products instructor (Excel, Microsoft 365)

Learning content designer and developer

Secondary and tertiary education educator since 2014



Skills & knowledge

Business intelligence

Microsoft Power BI

MS Excel, Power Query, Power Pivot

Database applications

SQL, data modeling, MS Access

Microsoft office suite 

Expert knowledge and skills in MS Word, PowerPoint, Outlook, and Microsoft 365 – experience in training at all levels since 2009

Learning design

Design and implementation of educational content for various public and private schools (Business Academy

### B. Vector Analysis (Semantics)
What does vector search return? The same document? Or a smaller part?
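As a rough mental model of what a vector index does, the sketch below ranks chunks by cosine similarity to a query vector. Everything in it is an assumption for illustration: the 3-dimensional vectors are made up (real embeddings have hundreds or thousands of dimensions), the chunk names are hypothetical, and the actual vector store may be configured with a different distance metric.

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" of two Child Chunks.
chunk_vectors = {
    "child-chunk-python": [0.9, 0.1, 0.0],
    "child-chunk-power-bi": [0.1, 0.9, 0.2],
}

# Hypothetical embedding of the query "python".
query_vector = [1.0, 0.0, 0.0]

# Rank chunks from most to least similar to the query.
ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
    reverse=True,
)
print(ranked)
```

Unlike BM25, nothing here depends on the query word literally appearing in the chunk; only the direction of the vectors matters.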

In [None]:
# Call the vector part DIRECTLY (in isolation)
vec_results = hybrid_core.embedding_retriever.invoke(query)

print(f"🔍 Vector search for the word '{query}':")
print(f"   Number of results: {len(vec_results)}")

if vec_results:
    doc = vec_results[0]
    print(f"\n   First result:")
    print(f"   Author: {doc.metadata.get('candidate_name')}")
    print(f"   Text length: {len(doc.page_content)} characters (short text = Child Chunk)")
    print(f"   Sample: {doc.page_content[:150]}...")
else:
    print("   No results for vectors.")

### C. Conclusions from the Dissection
See the difference?
- **BM25** returns **Parent Documents** (whole CVs).
- **Vector Store** returns **Child Chunks** (paragraphs).

This is important to know! When Hybrid Search performs "fusion", it tries to merge these two lists. If the algorithm relied on the texts being identical, the fusion would fail. Here, RRF (Reciprocal Rank Fusion) works on a key built from `candidate_name` + the beginning of `page_content`.

Because the texts differ (long vs. short), a parent and its child usually start with different text, so their keys rarely match. As a result there is probably no "overlap" (score boosting) but rather a "union" of the results. In `manual_query.ipynb` we will check whether that is really the case.
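The overlap-versus-union effect can be sketched with a minimal RRF implementation. This is not the code of `HybridRetriever`; it only follows the key described above (`candidate_name` plus the start of `page_content`). The plain-dict documents, the `prefix_len=50` cut-off, the constant `k=60`, and the helper names are assumptions made for illustration.

```python
from collections import defaultdict


def fusion_key(doc, prefix_len=50):
    # Key as described in the text: candidate name + beginning of the content.
    # prefix_len is a hypothetical value.
    return (doc["metadata"].get("candidate_name"), doc["page_content"][:prefix_len])


def reciprocal_rank_fusion(result_lists, k=60):
    # Each list contributes 1 / (k + rank) to the score of every key it contains.
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[fusion_key(doc)] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical documents: a parent (whole CV) and a child (one paragraph)
# that belong to the same candidate but start with different text.
parent = {
    "metadata": {"candidate_name": "Candidate A"},
    "page_content": "Full CV text of candidate A, starting with the header ...",
}
child = {
    "metadata": {"candidate_name": "Candidate A"},
    "page_content": "Python, SQL, data modeling",
}

# Different keys: the two lists are merged as a union, no score is boosted.
union_case = reciprocal_rank_fusion([[parent], [child]])

# Same key in both lists: the scores add up (overlap).
overlap_case = reciprocal_rank_fusion([[parent], [parent]])

print(len(union_case), len(overlap_case))
```

In the union case the same candidate appears twice with two equal, unboosted scores; in the overlap case there is a single entry whose score is the sum of both contributions.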