# CV RAG Training Notebook

This notebook lets you manually run and test the training (indexing) process of a RAG system for searching CVs.

## What happens during training:

1. **Load CVs** - All DOCX files are loaded from the data directory
2. **Setup embeddings** - Azure OpenAI embeddings are prepared
3. **Setup vector store** - An empty ChromaDB vector store is created
4. **Retriever initialization** - ParentDocumentRetriever:
   - Splits each CV into parent chunks (~2000 characters)
   - Splits the parent chunks into child chunks (~400 characters)
   - Child chunks are indexed in the vector store (for search)
   - Parent chunks are stored in the docstore (for context)
5. **Test retrieval** - Several test queries are run

## Library imports and configuration

In [1]:
import sys
import logging
from pathlib import Path


# Add the parent directory to sys.path so the src imports resolve
sys.path.insert(0, str(Path.cwd().parent))

# Configure logging so the process is visible
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Import the application modules
from src.config import AppConfig
from src.document_loader import CVDocumentLoader
from src.embeddings import EmbeddingsManager
from src.vector_store import VectorStoreManager
from src.parent_retriever import CVParentRetriever

print("✓ Libraries loaded")

✓ Libraries loaded


## Loading the configuration

In [2]:
# Load the configuration from the .env file
config = AppConfig()

print("\n📋 Configuration:")
print(f"  Data directory: {config.rag.data_directory}")
print(f"  Vector store: {config.rag.persist_directory}")
print(f"  Parent chunk size: {config.rag.parent_chunk_size}")
print(f"  Child chunk size: {config.rag.child_chunk_size}")
print(f"  Batch size: {config.azure.batch_size}")
print(f"  Batch delay: {config.azure.batch_delay}s")


📋 Configuration:
  Data directory: ./data/OneDrive_2025-12-16
  Vector store: ./chroma_db
  Parent chunk size: 2000
  Child chunk size: 400
  Batch size: 5
  Batch delay: 2.0s


## STEP 1: Loading CV documents

Loads all `.docx` files from the data directory and converts them into `Candidate` objects.
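
To make the loading step concrete, here is a minimal sketch of reading DOCX text with only the standard library. It is not the project's `CVDocumentLoader`: the `Candidate` dataclass, the helper names, and the `Surname_Name_CV_EN.docx` naming scheme are assumptions for illustration, and real CVs with tables or text boxes may need a dedicated DOCX library.

```python
import zipfile
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from pathlib import Path
from typing import List

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


@dataclass
class Candidate:
    """Hypothetical stand-in for the project's Candidate class."""
    name: str
    full_cv_text: str


def read_docx_text(path: Path) -> str:
    """Return the paragraph text of a DOCX file, one paragraph per line."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{W_NS}p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        paragraphs.append("".join(t.text or "" for t in paragraph.iter(f"{W_NS}t")))
    return "\n".join(paragraphs)


def load_all_cvs(data_directory: Path) -> List[Candidate]:
    """Load every .docx file in the directory, sorted by file name."""
    candidates = []
    for path in sorted(Path(data_directory).glob("*.docx")):
        # Assumed naming scheme: "Surname_Name_CV_EN.docx" -> "Surname Name"
        name = " ".join(path.stem.split("_")[:2])
        candidates.append(Candidate(name=name, full_cv_text=read_docx_text(path)))
    return candidates
```

A DOCX file is a ZIP archive, so the sketch only has to parse `word/document.xml` and join the text runs of each paragraph.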

In [3]:
print("\n" + "="*80)
print("STEP 1: Loading CV documents")
print("="*80)

# Create the loader
loader = CVDocumentLoader(config.rag.data_directory_ntb)

# Load all CVs
candidates = loader.load_all_cvs()

print(f"\n✓ Loaded {len(candidates)} CVs")
print("\nFirst 5 candidates:")
for i, candidate in enumerate(candidates[:5], 1):
    print(f"  {i}. {candidate.name} ({len(candidate.full_cv_text)} characters)")

if len(candidates) > 5:
    print(f"  ... and {len(candidates) - 5} more")

2025-12-17 16:40:32,709 - src.document_loader - INFO - Found 27 DOCX files in ..\data\OneDrive_2025-12-16
2025-12-17 16:40:32,736 - src.document_loader - INFO - Loaded CV for Baláček Daniel (3020 characters)
2025-12-17 16:40:32,760 - src.document_loader - INFO - Loaded CV for Bobůrka Vojtěch (2458 characters)
2025-12-17 16:40:32,791 - src.document_loader - INFO - Loaded CV for Bronec Ondřej (3757 characters)
2025-12-17 16:40:32,815 - src.document_loader - INFO - Loaded CV for Bukovský Petr (2628 characters)
2025-12-17 16:40:32,830 - src.document_loader - INFO - Loaded CV for Bímová Kamila (2042 characters)
2025-12-17 16:40:32,850 - src.document_loader - INFO - Loaded CV for Dlugošová Lenka (2383 characters)
2025-12-17 16:40:32,868 - src.document_loader - INFO - Loaded CV for Duleba Peter (2873 characters)



STEP 1: Loading CV documents


2025-12-17 16:40:32,883 - src.document_loader - INFO - Loaded CV for Fejfarová Julia (1445 characters)
2025-12-17 16:40:32,896 - src.document_loader - INFO - Loaded CV for Fejfar Ondřej (1289 characters)
2025-12-17 16:40:32,919 - src.document_loader - INFO - Loaded CV for Gleb Tcypin (2543 characters)
2025-12-17 16:40:32,949 - src.document_loader - INFO - Loaded CV for Hlavatá Michaela (5445 characters)
2025-12-17 16:40:32,959 - src.document_loader - INFO - Loaded CV for Hlinková Zuzana (2615 characters)
2025-12-17 16:40:32,983 - src.document_loader - INFO - Loaded CV for Holman Martin (1559 characters)
2025-12-17 16:40:33,013 - src.document_loader - INFO - Loaded CV for Hrdý Daniel (2399 characters)
2025-12-17 16:40:33,048 - src.document_loader - INFO - Loaded CV for Huňa Tomáš (3930 characters)
2025-12-17 16:40:33,067 - src.document_loader - INFO - Loaded CV for Hušek Michal (2354 characters)
2025-12-17 16:40:33,089 - src.document_loader - INFO - Loaded CV for Karlovský Luk


✓ Loaded 27 CVs

First 5 candidates:
  1. Baláček Daniel (3020 characters)
  2. Bobůrka Vojtěch (2458 characters)
  3. Bronec Ondřej (3757 characters)
  4. Bukovský Petr (2628 characters)
  5. Bímová Kamila (2042 characters)
  ... and 22 more


### Inspect a single CV

In [4]:
# Show the first 500 characters of the first CV
if candidates:
    first_candidate = candidates[0]
    print(f"\n📄 Example CV: {first_candidate.name}")
    print(f"Length: {len(first_candidate.full_cv_text)} characters")
    print("\nFirst 500 characters:")
    print("-" * 80)
    print(first_candidate.full_cv_text[:500])
    print("-" * 80)


📄 Example CV: Baláček Daniel
Length: 3020 characters

First 500 characters:
--------------------------------------------------------------------------------
www.dolphinconsulting.cz	

Daniel Baláček

BI consultant 

 Praha



Key qualifications

Dashboard and report development and design

User requirements analysis and documentation

Business and data analysis



Skills & knowledge

Business intelligence

General Skills

Dashboard and report development and design 

Datawarehouse architecture principles and principles of BI

Data modeling

Business intelligence

Applications

MS Excel - advanced

Qlik Sense – advanced

Qlik NPrinting – advanced

Q
--------------------------------------------------------------------------------


### Conversion to LangChain Documents

In [5]:
# Convert the candidates to LangChain Documents
documents = loader.convert_to_langchain_documents(candidates)

print(f"\n✓ Created {len(documents)} LangChain Documents")
print("\nExample metadata of the first document:")
if documents:
    print(documents[0].metadata)

2025-12-17 16:40:33,501 - src.document_loader - INFO - Converted 27 candidates to LangChain documents



✓ Created 27 LangChain Documents

Example metadata of the first document:
{'candidate_name': 'Baláček Daniel', 'source': '..\\data\\OneDrive_2025-12-16\\Baláček_Daniel_CV_EN.docx', 'type': 'cv_parent', 'filename': 'Baláček_Daniel_CV_EN.docx', 'file_size': 496940, 'text_length': 3020}


In [6]:
print(len(documents))

27


## STEP 2: Setup Azure OpenAI Embeddings

Prepares the embeddings model used to create vector representations of text.
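
Retrieval later compares these vectors by distance, so it may help to see how two common distances behave. The sketch below uses toy three-dimensional vectors with made-up values; real embeddings come from the Azure OpenAI deployment. The notebook does not state which distance metric the vector store is configured with, so treat this only as background for reading the scores.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity: 0 for the same direction, 2 for opposite directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def squared_l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy unit-length "embeddings" (hypothetical values, for illustration only)
python_dev = [0.8, 0.6, 0.0]
data_analyst = [0.6, 0.8, 0.0]
chef = [0.0, 0.0, 1.0]

for label, other in [("data_analyst", data_analyst), ("chef", chef)]:
    print(
        f"python_dev vs {label}: "
        f"cosine={cosine_distance(python_dev, other):.2f}, "
        f"squared L2={squared_l2_distance(python_dev, other):.2f}"
    )
```

For unit-length vectors the squared L2 distance equals twice the cosine distance, so both rank results identically and both treat lower values as more similar.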

In [7]:
print("\n" + "="*80)
print("STEP 2: Setup Embeddings")
print("="*80)

embeddings_mgr = EmbeddingsManager(config.azure)

print("\n✓ Embeddings manager created")
print(f"  Model: {config.azure.embedding_deployment}")
print(f"  Endpoint: {config.azure.endpoint}")


STEP 2: Setup Embeddings

✓ Embeddings manager created
  Model: text-embedding-ada-002-dolphin-1
  Endpoint: https://oai-dolphin-1.openai.azure.com


### Test the connection to Azure OpenAI

In [8]:
# Test the connection
if embeddings_mgr.test_connection():
    print("\n✓ Connection to Azure OpenAI succeeded")
else:
    print("\n✗ Connection to Azure OpenAI failed")

2025-12-17 16:40:33,553 - src.embeddings - INFO - Initializing Azure OpenAI Embeddings with deployment: text-embedding-ada-002-dolphin-1
2025-12-17 16:40:37,515 - src.embeddings - INFO - Embeddings initialized successfully
2025-12-17 16:40:45,654 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/text-embedding-ada-002-dolphin-1/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-12-17 16:40:45,670 - src.embeddings - INFO - Embeddings connection test successful



✓ Connection to Azure OpenAI succeeded


## STEP 3: Setup Vector Store

Creates an empty ChromaDB vector store (or loads an existing one).
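
The cell below clears the store before creating it. The following sketch illustrates why, using a tiny JSON-backed store in place of ChromaDB. `TinyVectorStore` and `clear_store` are hypothetical names, not part of the project or of any library. With a create-or-load pattern, re-running the indexing without clearing would append to whatever is already persisted.

```python
import json
import shutil
import tempfile
from pathlib import Path


class TinyVectorStore:
    """Hypothetical JSON-backed store standing in for a persistent vector store."""

    def __init__(self, persist_directory):
        self.path = Path(persist_directory) / "vectors.json"
        self.records = []  # each record: {"id": ..., "vector": [...], "text": ...}

    @classmethod
    def create_or_load(cls, persist_directory):
        """Load persisted records if they exist, otherwise create an empty store."""
        store = cls(persist_directory)
        if store.path.exists():
            store.records = json.loads(store.path.read_text(encoding="utf-8"))
        else:
            store.path.parent.mkdir(parents=True, exist_ok=True)
            store.save()
        return store

    def add(self, record_id, vector, text):
        self.records.append({"id": record_id, "vector": list(vector), "text": text})
        self.save()

    def save(self):
        self.path.write_text(json.dumps(self.records), encoding="utf-8")


def clear_store(persist_directory):
    """Delete the persisted store so the next run starts from scratch."""
    shutil.rmtree(persist_directory, ignore_errors=True)


with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp) / "store"
    TinyVectorStore.create_or_load(directory).add("chunk-1", [0.1, 0.2], "BI consultant")
    print(len(TinyVectorStore.create_or_load(directory).records))  # loaded from disk
    clear_store(directory)
    print(len(TinyVectorStore.create_or_load(directory).records))  # fresh, empty store
```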

In [9]:
print("\n" + "="*80)
print("STEP 3: Setup Vector Store")
print("="*80)

vs_manager = VectorStoreManager(
    config.rag,
    embeddings_mgr.get_embeddings()
)

# Delete the existing vector store (for a clean training run)
print("\nClearing the existing vector store...")
vs_manager.clear_vectorstore()

# Create an empty vector store
print("Creating an empty vector store...")
vectorstore = vs_manager.create_or_load_vectorstore()

print("\n✓ Vector store ready")

2025-12-17 16:40:45,707 - src.vector_store - INFO - Clearing vector store at chroma_db



STEP 3: Setup Vector Store

Clearing the existing vector store...


2025-12-17 16:40:45,898 - src.vector_store - INFO - Vector store cleared
2025-12-17 16:40:45,901 - src.vector_store - INFO - Creating new empty vector store at chroma_db


Creating an empty vector store...


  self._vectorstore = Chroma(
2025-12-17 16:40:54,215 - chromadb.telemetry.product.posthog - INFO - Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.
2025-12-17 16:40:55,415 - src.vector_store - INFO - Empty vector store created



✓ Vector store ready


## STEP 4: Parent Document Retriever initialization

**This is the key step!**

ParentDocumentRetriever:
1. Splits each CV into **parent chunks** (large pieces, ~2000 characters)
2. Splits each parent chunk into **child chunks** (small pieces, ~400 characters)
3. **Child chunks** are indexed in the vector store (used for search)
4. **Parent chunks** are stored in the docstore (returned as context)
5. Remembers the mapping: which child chunk belongs to which parent chunk
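
The five points above can be sketched in a few lines. This is a simplified illustration with a naive fixed-width splitter and hypothetical names (`split_text`, `build_parent_child_index`), not the splitter the retriever actually uses.

```python
from typing import Dict, List, Tuple


def split_text(text: str, chunk_size: int) -> List[str]:
    """Naive fixed-width splitter; real splitters prefer paragraph or sentence boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def build_parent_child_index(
    cv_text: str, parent_size: int = 2000, child_size: int = 400
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Return (docstore, child_index) for one CV."""
    docstore: Dict[str, str] = {}            # parent_id -> parent chunk (returned as context)
    child_index: List[Tuple[str, str]] = []  # (child chunk, parent_id); children get embedded
    for number, parent in enumerate(split_text(cv_text, parent_size)):
        parent_id = f"parent-{number}"
        docstore[parent_id] = parent
        for child in split_text(parent, child_size):
            child_index.append((child, parent_id))  # the mapping from step 5
    return docstore, child_index


docstore, child_index = build_parent_child_index("x" * 3020)
print(len(docstore), len(child_index))  # prints: 2 8
```

The real run further down reports 10 child chunks for the 3020-character CV rather than 8, presumably because the actual splitter breaks on separators and may overlap chunks.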

**Batch processing:**
- Counts the actual number of child chunks (not just the number of CVs)
- Pauses between batches (rate-limit protection); the batch size and delay come from `config.azure.batch_size` and `config.azure.batch_delay`
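
A minimal sketch of batching with a pause between requests follows. The function names and the fake embedder are assumptions for illustration; the project's own batching in `CVParentRetriever` may group chunks differently, for example per document.

```python
import math
import time
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    chunks: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 5,
    batch_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[float]]:
    """Embed chunks batch by batch, pausing between batches to respect rate limits."""
    vectors: List[List[float]] = []
    total_batches = math.ceil(len(chunks) / batch_size)
    for number, batch in enumerate(batched(chunks, batch_size), start=1):
        vectors.extend(embed_batch(batch))
        if number < total_batches:
            sleep(batch_delay)  # no pause is needed after the final batch
    return vectors


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Stand-in for the embeddings API call: one dummy vector per chunk."""
    return [[float(len(text))] for text in batch]


pauses: List[float] = []
vectors = embed_in_batches(
    [f"chunk {i}" for i in range(273)], fake_embed, sleep=pauses.append
)
print(len(vectors), len(pauses))  # prints: 273 54
```

With 273 chunks and a batch size of 5 this gives 55 batches, the same count the log below announces. Passing `sleep` as a parameter keeps the sketch testable without real delays.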

In [10]:
print("\n" + "="*80)
print("STEP 4: Parent Document Retriever initialization")
print("="*80)

retriever = CVParentRetriever(
    config=config.rag,
    vectorstore=vectorstore,
    azure_config=config.azure
)

print(f"\nProcessing {len(documents)} documents...")
print(f"Batch size: {config.azure.batch_size} chunks")
print(f"Batch delay: {config.azure.batch_delay}s\n")

# NOTE: This can take several minutes!
# It creates embeddings for all child chunks
retriever.initialize_retriever(documents)

print("\n✓ Retriever initialized")

# Check the statistics
stats = retriever.get_stats()
print("\nStatistics:")
print(f"  - Parent chunks: {stats['parent_chunks']}")
print(f"  - Child chunks: {stats['child_chunks']}")

2025-12-17 16:40:55,448 - src.parent_retriever - INFO - Initializing Parent Retriever - Parent chunks: 2000, Child chunks: 400
2025-12-17 16:40:55,450 - src.parent_retriever - INFO - Docstore path: chroma_db\docstore
2025-12-17 16:40:55,453 - src.parent_retriever - INFO - Initializing retriever with 27 CV documents
2025-12-17 16:40:55,456 - src.parent_retriever - INFO - Using batch processing for retriever initialization: batch_size=5
2025-12-17 16:40:55,457 - src.parent_retriever - INFO - Pre-splitting 27 documents into child chunks...
2025-12-17 16:40:55,471 - src.parent_retriever - INFO - Total child chunks: 273
2025-12-17 16:40:55,472 - src.parent_retriever - INFO - Processing in 55 batches of ~5 chunks each
2025-12-17 16:40:55,472 - src.parent_retriever - INFO - Document 'Baláček Daniel': 10 child chunks



STEP 4: Parent Document Retriever initialization

Processing 27 documents...
Batch size: 5 chunks
Batch delay: 2.0s



2025-12-17 16:41:03,420 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/text-embedding-ada-002-dolphin-1/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-12-17 16:41:03,946 - src.parent_retriever - INFO - Processed 10/273 chunks (1 batches)
2025-12-17 16:41:03,948 - src.parent_retriever - INFO - Waiting 2.0s before next batch...
2025-12-17 16:41:05,951 - src.parent_retriever - INFO - Document 'Bobůrka Vojtěch': 7 child chunks
2025-12-17 16:41:06,049 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/text-embedding-ada-002-dolphin-1/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-12-17 16:41:06,152 - src.parent_retriever - INFO - Processed 17/273 chunks (2 batches)
2025-12-17 16:41:06,154 - src.parent_retriever - INFO - Waiting 2.0s before next batch...
2025-12-17 16:41:08,158 - src.parent_retriever - INFO - Document 'Bronec Ondřej': 11 child chunks
2025-12-17 16:41:08,288 - http


✓ Retriever initialized

Statistics:
  - Parent chunks: 61
  - Child chunks: 298


In [None]:
from pathlib import Path

print("\n" + "="*80)
print("CHECK: Docstore on disk")
print("="*80)

docstore_path = Path(config.rag.persist_directory) / "docstore"

print(f"\nDocstore path: {docstore_path}")
print(f"Exists: {docstore_path.exists()}")

if docstore_path.exists():
    files = list(docstore_path.iterdir())
    print(f"Number of files: {len(files)}")
    
    if files:
        print("\nFirst 10 files:")
        for i, file in enumerate(files[:10], 1):
            size_kb = file.stat().st_size / 1024
            print(f"  {i}. {file.name} ({size_kb:.2f} KB)")
        
        # Preview the contents of the first file
        first_file = files[0]
        print(f"\nContents of the first file ({first_file.name}):")
        content = first_file.read_text(encoding='utf-8')
        print(content[:500] + "...")
    else:
        print("\n⚠️ The docstore folder exists but is EMPTY!")
        print("   Training may have failed or has not run yet.")
else:
    print("\n❌ The docstore folder DOES NOT EXIST!")
    print("   Run initialize_retriever() to create it.")

# Absolute path for the file explorer
abs_path = docstore_path.absolute()
print("\nAbsolute path for Windows Explorer:")
print(f"{abs_path}")

In [11]:
stats

{'status': 'initialized',
 'parent_chunks': 61,
 'child_chunks': 298,
 'parent_chunk_size': 2000,
 'child_chunk_size': 400}

### Statistics after training

In [12]:
# Show the statistics
stats = retriever.get_stats()

print("\n📊 Statistics:")
print(f"  Parent chunks: {stats['parent_chunks']}")
print(f"  Child chunks: {stats['child_chunks']}")
print(f"  Parent chunk size: {stats['parent_chunk_size']} characters")
print(f"  Child chunk size: {stats['child_chunk_size']} characters")
print(f"\n  Average child chunks per CV: {stats['child_chunks'] / len(documents):.1f}")
print(f"  Average parent chunks per CV: {stats['parent_chunks'] / len(documents):.1f}")


📊 Statistics:
  Parent chunks: 61
  Child chunks: 298
  Parent chunk size: 2000 characters
  Child chunk size: 400 characters

  Average child chunks per CV: 11.0
  Average parent chunks per CV: 2.3


## STEP 5: Test Retrieval

You can now test search over the trained vector store.
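
In the run below, queries with `top_k=3` sometimes return only two documents. One plausible explanation is sketched here, assuming `top_k` limits the child-chunk hits and that hits sharing a parent collapse into one result, which is how a parent-document retriever typically resolves matches. The identifiers and distances are hypothetical.

```python
from typing import Dict, List, Tuple


def parents_for_hits(
    child_hits: List[Tuple[str, float]],
    child_to_parent: Dict[str, str],
    docstore: Dict[str, str],
) -> List[str]:
    """Resolve child hits (sorted best-first) to unique parent chunks, keeping rank order."""
    seen = set()
    parents: List[str] = []
    for child_id, _distance in child_hits:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:  # several children can point to the same parent
            seen.add(parent_id)
            parents.append(docstore[parent_id])
    return parents


# Hypothetical index: two children of one parent and one child of another
child_to_parent = {"c1": "parent-a", "c2": "parent-a", "c3": "parent-b"}
docstore = {"parent-a": "Parent chunk A", "parent-b": "Parent chunk B"}
hits = [("c1", 0.31), ("c3", 0.33), ("c2", 0.35)]  # top 3 child hits

print(parents_for_hits(hits, child_to_parent, docstore))
# prints: ['Parent chunk A', 'Parent chunk B']
```

Three child hits yield two parent chunks here, so the number of returned documents can be smaller than `top_k`.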

In [13]:
print("\n" + "="*80)
print("STEP 5: Test Retrieval")
print("="*80)

# Test queries
test_queries = [
    "candidates with Python skills",
    "who has AWS experience",
    "Java developers"
]

for query in test_queries:
    print(f"\n🔍 Query: '{query}'")
    print("-" * 80)
    
    results = retriever.retrieve(query, top_k=3)
    
    print(f"Found {len(results)} results:\n")
    
    for i, doc in enumerate(results, 1):
        candidate_name = doc.metadata.get("candidate_name", "Unknown")
        content_preview = doc.page_content[:200].replace("\n", " ")
        
        print(f"{i}. {candidate_name}")
        print(f"   Length: {len(doc.page_content)} characters")
        print(f"   Preview: {content_preview}...")
        print()

2025-12-17 16:42:38,365 - src.parent_retriever - INFO - Retrieving documents for query: 'candidates with Python skills' (top 3)
2025-12-17 16:42:38,436 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/text-embedding-ada-002-dolphin-1/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-12-17 16:42:38,499 - src.parent_retriever - INFO - Retrieved 2 parent documents
2025-12-17 16:42:38,500 - src.parent_retriever - INFO - Retrieving documents for query: 'who has AWS experience' (top 3)



STEP 5: Test Retrieval

🔍 Query: 'candidates with Python skills'
--------------------------------------------------------------------------------
Found 2 results:

1. Bronec Ondřej
   Length: 1953 characters
   Preview: www.dolphinconsulting.cz	                Ing. Ondřej Bronec  BI consultant    Prague    Key qualifications  Data mining and data science – data cleaning, analysis, and modeling using Python and R or s...

2. Hrdý Daniel
   Length: 1937 characters
   Preview: www.dolphinconsulting.cz	                Daniel Hrdý  BI consultant   Prague, Czech Republic    Key qualifications  SQL procedures and queries, data modeling.  Data integration and transformation.  Us...


🔍 Query: 'who has AWS experience'
--------------------------------------------------------------------------------


2025-12-17 16:42:38,645 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/text-embedding-ada-002-dolphin-1/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-12-17 16:42:38,662 - src.parent_retriever - INFO - Retrieved 2 parent documents
2025-12-17 16:42:38,664 - src.parent_retriever - INFO - Retrieving documents for query: 'Java developers' (top 3)
2025-12-17 16:42:38,746 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/text-embedding-ada-002-dolphin-1/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-12-17 16:42:38,766 - src.parent_retriever - INFO - Retrieved 3 parent documents


Found 2 results:

1. Huňa Tomáš
   Length: 1997 characters
   Preview: www.dolphinconsulting.cz	                Tomáš Huňa  DWH & Business Intelligence Consultant  Prague  Main Qualifications  Data modeling  Data integration  Cloud Solutions Design and Management  BI App...

2. Bronec Ondřej
   Length: 1953 characters
   Preview: www.dolphinconsulting.cz	                Ing. Ondřej Bronec  BI consultant    Prague    Key qualifications  Data mining and data science – data cleaning, analysis, and modeling using Python and R or s...


🔍 Query: 'Java developers'
--------------------------------------------------------------------------------
Found 3 results:

1. Hušek Michal
   Length: 1996 characters
   Preview: www.dolphinconsulting.cz	    Michal Hušek  BI Consultant & developer          Key qualifications  Full stack BI solutions development  .NET development      Skills & knowledge  Programming languages  ...

2. Bukovský Petr
   Length: 1927 characters
   Preview: www.dolp

## Custom tests

You can now enter your own queries and test retrieval.

In [14]:
# Custom query
custom_query = "frontend developer with React"  # Change as needed

print(f"🔍 Custom query: '{custom_query}'")
print("=" * 80)

results = retriever.retrieve(custom_query, top_k=5)

for i, doc in enumerate(results, 1):
    candidate_name = doc.metadata.get("candidate_name", "Unknown")
    print(f"\n{i}. {candidate_name}")
    print("-" * 80)
    print(doc.page_content[:5000])  # Up to 5000 characters, i.e. the whole ~2000-character parent chunk
    print("...")

2025-12-17 16:42:38,788 - src.parent_retriever - INFO - Retrieving documents for query: 'frontend developer with React' (top 5)
2025-12-17 16:42:38,870 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/text-embedding-ada-002-dolphin-1/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"


🔍 Custom query: 'frontend developer with React'


2025-12-17 16:42:38,898 - src.parent_retriever - INFO - Retrieved 3 parent documents



1. Bukovský Petr
--------------------------------------------------------------------------------
www.dolphinconsulting.cz	

Petr Bukovský

DWH/BI Consultant 

 Brno



Key qualifications

Development ETL, data engineer

By education and previous experiences mechanical engineer specialized in FEA analysis and piping stress analyst in Oil and Gas industry



Skills & knowledge

Business intelligence

General

Data modeling

Business intelligence

Applications

Tabidoo (including automation using JS)

Microsoft Power BI and Power Query (basic knowledge)

MS Excel (including VBA)

ETL

Data Factory, Python, Databricks

Databases

Microsoft SQL, MS Access

Others

Basic knowledge of HTML, CSS and network setup

Python with packages for data analysis (Pandas, Numpy, Matplotlib, Seaborn)

Javascript (scripting only)

SEO optimization

Git

Languages

Czech (native),

English (B1/B2)

German (A2)






Selected project experience

Period

Customer

Activities

2025

2024

Skoda auto

Da

## Tests with retrieval relevance measurement
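
The tests in this section label each result by its distance score and then filter by a threshold. The sketch below isolates those two steps. The bands mirror the ones used in the test cell; the helper names and the strict `<` comparison in the filter are assumptions, since the notebook does not show how `retrieve_relevant()` is implemented. The sample scores are the ones reported further down for the React/GraphQL query.

```python
from typing import List, Tuple

# Upper bounds of the relevance bands (distance scores: lower = more similar)
RELEVANCE_BANDS = [(0.5, "HIGH"), (1.0, "MEDIUM"), (1.5, "LOW")]


def label_relevance(score: float) -> str:
    """Map a distance score to a relevance label."""
    for upper_bound, label in RELEVANCE_BANDS:
        if score < upper_bound:
            return label
    return "IRRELEVANT"


def filter_relevant(
    results: List[Tuple[str, float]], threshold: float
) -> List[Tuple[str, float]]:
    """Keep results whose distance is below the threshold (strict '<' is an assumption)."""
    return [(name, score) for name, score in results if score < threshold]


# Scores reported in this notebook for the React/GraphQL query
react_results = [
    ("Bukovský Petr", 0.4267),
    ("Petr Jan", 0.4382),
    ("Němeček Tomáš", 0.4433),
    ("Petr Jan", 0.4438),
    ("Petr Jan", 0.4451),
]

print(len(filter_relevant(react_results, threshold=1.5)))  # prints: 5
print(len(filter_relevant(react_results, threshold=0.4)))  # prints: 0
```

With the threshold of 1.5 used in this notebook, nothing is filtered out for the query that is assumed to be irrelevant. The stricter value of 0.4 is hypothetical: it happens to separate these scores from the 0.30 to 0.34 range seen for the Python/Django query, but a single pair of queries is not enough to calibrate a threshold.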

In [15]:
print("\n" + "="*80)
print("TEST 1: Relevance detection - Relevant query (Python developer)")
print("="*80)

relevant_query = "Python developer with Django experience"
print(f"\nQuery: \"{relevant_query}\"\n")

# Test with scores
print("1a. Retrieve with scores (all results):")
print("-" * 60)
results_with_scores = retriever.retrieve_with_scores(relevant_query, top_k=5)

print(f"\n{'#':<3} | {'Candidate':<30} | {'Score':<10} | Relevance")
print("-" * 80)

for i, result in enumerate(results_with_scores, 1):
    score = result.score
    name = result.candidate_name
    
    # Determine relevance from the score (distance metric: lower = better)
    if score < 0.5:
        relevance = "HIGH ✓✓✓"
    elif score < 1.0:
        relevance = "MEDIUM ✓✓"
    elif score < 1.5:
        relevance = "LOW ✓"
    else:
        relevance = "IRRELEVANT ✗"
    
    print(f"{i:<3} | {name:<30} | {score:<10.4f} | {relevance}")

# Test with filtering
print("\n" + "-" * 80)
print("1b. Retrieve relevant (with filter threshold=1.5):")
print("-" * 60)

relevant_docs = retriever.retrieve_relevant(relevant_query, top_k=5)
print(f"\nFound {len(relevant_docs)} relevant documents:")
for i, doc in enumerate(relevant_docs, 1):
    print(f"  {i}. {doc.metadata.get('candidate_name', 'Unknown')}")


print("\n\n" + "="*80)
print("TEST 2: Relevance detection - Irrelevant query (React frontend)")
print("="*80)

irrelevant_query = "React frontend developer with GraphQL and TypeScript"
print(f"\nQuery: \"{irrelevant_query}\"")
print("(We assume there is no React developer among the candidates)\n")

# Test with scores
print("2a. Retrieve with scores (all results):")
print("-" * 60)
results_irrel = retriever.retrieve_with_scores(irrelevant_query, top_k=5)

print(f"\n{'#':<3} | {'Candidate':<30} | {'Score':<10} | Relevance")
print("-" * 80)

for i, result in enumerate(results_irrel, 1):
    score = result.score
    name = result.candidate_name
    
    if score < 0.5:
        relevance = "HIGH ✓✓✓"
    elif score < 1.0:
        relevance = "MEDIUM ✓✓"
    elif score < 1.5:
        relevance = "LOW ✓"
    else:
        relevance = "IRRELEVANT ✗"
    
    print(f"{i:<3} | {name:<30} | {score:<10.4f} | {relevance}")

# Test with filtering
print("\n" + "-" * 60)
print("2b. Retrieve relevant (with filter threshold=1.5):")
print("-" * 60)

relevant_docs_irrel = retriever.retrieve_relevant(irrelevant_query, top_k=5)
print(f"\nFound {len(relevant_docs_irrel)} relevant documents")

if len(relevant_docs_irrel) == 0:
    print("\n⚠️ NO relevant results!")
    print("   → Correct behavior: do not return random candidates")
else:
    for i, doc in enumerate(relevant_docs_irrel, 1):
        print(f"  {i}. {doc.metadata.get('candidate_name', 'Unknown')}")


print("\n\n" + "="*80)
print("TEST 3: Czech vs. English query comparison")
print("="*80)

queries = [
    ("Python developer", "English"),
    ("Python vývojář", "Czech"),
]

for query, language in queries:
    print(f"\n{'='*60}")
    print(f"Query ({language}): \"{query}\"")
    print(f"{'='*60}")
    
    results = retriever.retrieve_with_scores(query, top_k=3)
    
    print("\nTop 3 results:")
    for i, result in enumerate(results, 1):
        print(f"  {i}. {result.candidate_name:<30} (score: {result.score:.4f})")

print("\n💡 TIP: English queries usually give better scores")
print("   (lower distance = higher similarity)")


print("\n\n" + "="*80)
print("TEST 4: RAG chain with relevance filter")
print("="*80)

# Import the RAG chain
from src.rag_chain import CVRAGChain

rag_chain = CVRAGChain(config.azure, retriever)

# Test 1: Relevant query
print("\n4a. Relevant query with filter:")
print("-" * 60)
response1 = rag_chain.invoke("Python developer", use_relevance_filter=True)
print(f"\nQuestion: {response1.query}")
print(f"Number of contexts: {response1.metadata['num_contexts']}")
print(f"Answer:\n{response1.answer}")

# Test 2: Irrelevant query
print("\n\n4b. Irrelevant query with filter:")
print("-" * 60)
response2 = rag_chain.invoke("React frontend developer", use_relevance_filter=True)
print(f"\nQuestion: {response2.query}")
print(f"Number of contexts: {response2.metadata['num_contexts']}")
print(f"No relevant results: {response2.metadata.get('no_relevant_results', False)}")
print(f"Answer:\n{response2.answer}")

# Test 3: Irrelevant query WITHOUT the filter (for comparison)
print("\n\n4c. Irrelevant query WITHOUT filter (for comparison):")
print("-" * 60)
response3 = rag_chain.invoke("React frontend developer", use_relevance_filter=False)
print(f"\nQuestion: {response3.query}")
print(f"Number of contexts: {response3.metadata['num_contexts']}")
print(f"Answer:\n{response3.answer}")


print("\n\n" + "="*80)
print("SUMMARY")
print("="*80)
print("""
1. ✓ Config: similarity_threshold = 1.5 (configurable)
2. ✓ New method: retriever.retrieve_relevant() 
   - Filters by threshold
   - Returns only relevant results (may be empty!)
3. ✓ Improved RAG chain: 
   - use_relevance_filter=True (default)
   - Detects empty results
   - Does not send irrelevant context to the LLM
4. ✓ Better UX: "No candidates found" instead of hallucinations

TIP: Set the threshold in config.py as needed:
  - < 1.0  = very strict (only very similar)
  - 1.0-1.5 = standard (good results)
  - > 1.5  = loose (more results, lower quality)
""")


TEST 1: Relevance detection - Relevant query (Python developer)

Query: "Python developer with Django experience"

1a. Retrieve with scores (all results):
------------------------------------------------------------


2025-12-17 16:42:39,058 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/text-embedding-ada-002-dolphin-1/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-12-17 16:42:39,073 - src.parent_retriever - INFO - Retrieved 5 results with scores
2025-12-17 16:42:39,080 - src.parent_retriever - INFO - Retrieving relevant documents for query: 'Python developer with Django experience' (top 5, threshold 1.5)



#   | Candidate                      | Score      | Relevance
--------------------------------------------------------------------------------
1   | Baláček Daniel                 | 0.2978     | HIGH ✓✓✓
2   | Huňa Tomáš                     | 0.3186     | HIGH ✓✓✓
3   | Bukovský Petr                  | 0.3240     | HIGH ✓✓✓
4   | Petr Jan                       | 0.3396     | HIGH ✓✓✓
5   | Bronec Ondřej                  | 0.3410     | HIGH ✓✓✓

--------------------------------------------------------------------------------
1b. Retrieve relevant (with filter threshold=1.5):
------------------------------------------------------------


2025-12-17 16:42:39,147 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/text-embedding-ada-002-dolphin-1/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-12-17 16:42:39,185 - src.parent_retriever - INFO - Retrieved 5 relevant parent documents (filtered out 0 irrelevant results)
2025-12-17 16:42:39,250 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/text-embedding-ada-002-dolphin-1/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-12-17 16:42:39,266 - src.parent_retriever - INFO - Retrieved 5 results with scores
2025-12-17 16:42:39,274 - src.parent_retriever - INFO - Retrieving relevant documents for query: 'React frontend developer with GraphQL and TypeScript' (top 5, threshold 1.5)



Found 5 relevant documents:
  1. Baláček Daniel
  2. Huňa Tomáš
  3. Bukovský Petr
  4. Petr Jan
  5. Bronec Ondřej


TEST 2: Relevance detection - Irrelevant query (React frontend)

Query: "React frontend developer with GraphQL and TypeScript"
(We assume there is no React developer among the candidates)

2a. Retrieve with scores (all results):
------------------------------------------------------------

#   | Candidate                      | Score      | Relevance
--------------------------------------------------------------------------------
1   | Bukovský Petr                  | 0.4267     | HIGH ✓✓✓
2   | Petr Jan                       | 0.4382     | HIGH ✓✓✓
3   | Němeček Tomáš                  | 0.4433     | HIGH ✓✓✓
4   | Petr Jan                       | 0.4438     | HIGH ✓✓✓
5   | Petr Jan                       | 0.4451     | HIGH ✓✓✓

------------------------------------------------------------
2b. Retri

2025-12-17 16:42:39,371 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/text-embedding-ada-002-dolphin-1/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-12-17 16:42:39,413 - src.parent_retriever - INFO - Retrieved 4 relevant parent documents (filtered out 0 irrelevant results)
2025-12-17 16:42:39,501 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/text-embedding-ada-002-dolphin-1/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-12-17 16:42:39,507 - src.parent_retriever - INFO - Retrieved 3 results with scores
2025-12-17 16:42:39,580 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/text-embedding-ada-002-dolphin-1/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-12-17 16:42:39,598 - src.parent_retriever - INFO - Retrieved 3 results with scores
2025-12-17 16:42:39,609 - src.rag_chain - INFO - Initializing RAG Chain
2025-12-17 16


Found 4 relevant documents
  1. Bukovský Petr
  2. Petr Jan
  3. Němeček Tomáš
  4. Petr Jan


TEST 3: Czech vs. English query comparison

Query (English): "Python developer"

Top 3 results:
  1. Baláček Daniel                 (score: 0.3234)
  2. Huňa Tomáš                     (score: 0.3582)
  3. Bronec Ondřej                  (score: 0.3600)

Query (Czech): "Python vývojář"

Top 3 results:
  1. Baláček Daniel                 (score: 0.3216)
  2. Huňa Tomáš                     (score: 0.3399)
  3. Bronec Ondřej                  (score: 0.3755)

💡 TIP: English queries usually give better scores
   (lower distance = higher similarity)


TEST 4: RAG chain with relevance filter


2025-12-17 16:42:41,682 - src.rag_chain - INFO - RAG Chain created successfully
2025-12-17 16:42:41,684 - src.rag_chain - INFO - Processing query: 'Python developer' (relevance_filter=True)
2025-12-17 16:42:41,684 - src.parent_retriever - INFO - Retrieving relevant documents for query: 'Python developer' (top 5, threshold 1.5)
2025-12-17 16:42:41,763 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/text-embedding-ada-002-dolphin-1/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-12-17 16:42:41,778 - src.parent_retriever - INFO - Retrieved 5 relevant parent documents (filtered out 0 irrelevant results)
2025-12-17 16:42:41,780 - src.rag_chain - INFO - Retrieved 5 contexts
2025-12-17 16:42:41,782 - src.parent_retriever - INFO - Retrieving documents for query: 'Python developer' (top 5)
2025-12-17 16:42:41,880 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/text-embedding-ada-002-dolphin-1/em


4a. Relevant query with filter:
------------------------------------------------------------


2025-12-17 16:42:41,905 - src.parent_retriever - INFO - Retrieved 4 parent documents
2025-12-17 16:42:52,087 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-12-17 16:42:52,120 - src.rag_chain - INFO - Answer generated successfully
2025-12-17 16:42:52,123 - src.rag_chain - INFO - Processing query: 'React frontend developer' (relevance_filter=True)
2025-12-17 16:42:52,125 - src.parent_retriever - INFO - Retrieving relevant documents for query: 'React frontend developer' (top 5, threshold 1.5)
2025-12-17 16:42:52,321 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/text-embedding-ada-002-dolphin-1/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"



Question: Python developer
Number of contexts: 5
Answer:
Based on the provided CV excerpts, the candidates with Python development experience are:

1. **Baláček Daniel** - Intermediate level in SQL and basic knowledge in Python.
2. **Huňa Tomáš** - Basic knowledge in Python.
3. **Bronec Ondřej** - Extensive experience in Python, including packages such as Pandas, NumPy, Matplotlib, Seaborn, TensorFlow, scikit-learn, SciPy, scikit-image, and PySpark. Also experienced in developing AI apps using Streamlit and Bot Framework SDK.
4. **Bukovský Petr** - Experience with Python for data analysis using packages like Pandas, Numpy, Matplotlib, and Seaborn. Also involved in ETL processes and developing Python applications for specific engineering calculations.

Among these, **Bronec Ondřej** has the most extensive and advanced experience in Python development.


4b. Irrelevant query with filter:
------------------------------------------------------------


2025-12-17 16:42:52,355 - src.parent_retriever - INFO - Retrieved 4 relevant parent documents (filtered out 0 irrelevant results)
2025-12-17 16:42:52,358 - src.rag_chain - INFO - Retrieved 4 contexts
2025-12-17 16:42:52,366 - src.parent_retriever - INFO - Retrieving documents for query: 'React frontend developer' (top 5)
2025-12-17 16:42:52,474 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/text-embedding-ada-002-dolphin-1/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-12-17 16:42:52,506 - src.parent_retriever - INFO - Retrieved 3 parent documents
2025-12-17 16:42:53,120 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-12-17 16:42:53,125 - src.rag_chain - INFO - Answer generated successfully
2025-12-17 16:42:53,129 - src.rag_chain - INFO - Processing query: 'React frontend developer' (relevance_filter=False)
2025-12-


Question: React frontend developer
Number of contexts: 4
No relevant results: False
Answer:
I don't have enough information to answer this question. None of the provided CV excerpts mention experience or skills specifically related to React frontend development.


4c. Irrelevant query WITHOUT filter (for comparison):
------------------------------------------------------------


2025-12-17 16:42:53,409 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/text-embedding-ada-002-dolphin-1/embeddings?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-12-17 16:42:53,443 - src.parent_retriever - INFO - Retrieved 3 parent documents
2025-12-17 16:42:53,807 - httpx - INFO - HTTP Request: POST https://oai-dolphin-1.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2023-05-15 "HTTP/1.1 200 OK"
2025-12-17 16:42:53,810 - src.rag_chain - INFO - Answer generated successfully



Question: React frontend developer
Number of contexts: 3
Answer:
I don't have enough information to answer this question.


SUMMARY

1. ✓ Config: similarity_threshold = 1.5 (configurable)
2. ✓ New method: retriever.retrieve_relevant()
   - Filters by threshold
   - Returns only relevant results (may be empty!)
3. ✓ Improved RAG chain:
   - use_relevance_filter=True (default)
   - Detects empty results
   - Does not send irrelevant context to the LLM
4. ✓ Better UX: "No candidates found" instead of hallucinations

TIP: Set the threshold in config.py as needed:
  - < 1.0  = very strict (only very similar matches)
  - 1.0-1.5 = standard (good results)
  - > 1.5  = loose (more results, lower quality)
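
The threshold behaviour summarised above can be sketched in a few lines. This is an illustration under stated assumptions, not the code behind `retriever.retrieve_relevant()`: the names `filter_by_threshold`, `answer`, and `fake_llm` are hypothetical, and treating the threshold as inclusive (`<=`) is an assumption. The distances are taken from the "Python developer" run in TEST 3.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for retriever output: (candidate_name, distance) pairs.
ScoredResult = Tuple[str, float]


def filter_by_threshold(results: List[ScoredResult], threshold: float) -> List[ScoredResult]:
    """Keep only results whose distance is at or below the threshold (assumed inclusive)."""
    return [(name, dist) for name, dist in results if dist <= threshold]


def answer(query: str, results: List[ScoredResult], threshold: float,
           llm: Callable[[str, List[str]], str]) -> dict:
    """Skip the LLM call entirely when nothing passes the relevance filter."""
    relevant = filter_by_threshold(results, threshold)
    if not relevant:
        return {"answer": "No candidates found", "no_relevant_results": True, "contexts": 0}
    names = [name for name, _ in relevant]
    return {"answer": llm(query, names), "no_relevant_results": False, "contexts": len(names)}


def fake_llm(query: str, names: List[str]) -> str:
    """Placeholder for the chat model; makes no API call."""
    return f"{len(names)} candidate(s) for '{query}'"


# Distances from the "Python developer" query in TEST 3.
scored = [("Baláček Daniel", 0.3234), ("Huňa Tomáš", 0.3582), ("Bronec Ondřej", 0.3600)]

loose = answer("Python developer", scored, threshold=1.5, llm=fake_llm)
strict = answer("Python developer", scored, threshold=0.35, llm=fake_llm)
empty = answer("Python developer", scored, threshold=0.1, llm=fake_llm)

for label, result in (("1.5", loose), ("0.35", strict), ("0.1", empty)):
    print(f"threshold {label}: contexts={result['contexts']} answer={result['answer']}")
```

With distances in the 0.3 to 0.4 range, a threshold of 1.5 lets every result through, which is consistent with the logs above reporting `filtered out 0 irrelevant results`. For data like this, the filter would only start to reject results at thresholds well below 1.0.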



## Verifying persistence

Check that the data has been saved to disk.

In [23]:
print("\n📁 Files on disk:")
print("=" * 80)

persist_dir = Path(config.rag.persist_directory)
docstore_dir = persist_dir / "docstore"

print(f"\nVector store directory: {persist_dir}")
if persist_dir.exists():
    print("  ✓ Exists")
    chroma_files = list(persist_dir.glob("*.sqlite3"))
    print(f"  Files: {len(chroma_files)} SQLite database(s)")
else:
    print("  ✗ Does not exist")

print(f"\nDocstore directory: {docstore_dir}")
if docstore_dir.exists():
    print("  ✓ Exists")
    docstore_files = list(docstore_dir.glob("*"))
    print(f"  Files: {len(docstore_files)} parent chunks")
else:
    print("  ✗ Does not exist")

# Report success only when both directories are actually present
if persist_dir.exists() and docstore_dir.exists():
    print("\n✓ All data is saved and can be loaded again after a restart")
else:
    print("\n✗ Some data is missing; rerun the training cells above")


📁 Files on disk:

Vector store directory: chroma_db
  ✓ Exists
  Files: 1 SQLite database(s)

Docstore directory: chroma_db\docstore
  ✓ Exists
  Files: 61 parent chunks

✓ All data is saved and can be loaded again after a restart


## Summary

✅ Training complete!

**What was created:**
- ChromaDB vectorstore with child chunks (for search)
- LocalFileStore docstore with parent chunks (for context)
- Mapping between child and parent chunks

**Everything is saved on disk:**
- `./chroma_db/` - ChromaDB database
- `./chroma_db/docstore/` - Parent chunks

**Next steps:**
- Open `query.ipynb` to test queries
- Or use `train.py` for automated training
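
The child-to-parent mapping listed above is what turns a ranked list of child hits into parent chunks for the LLM context. The sketch below shows the idea with plain dictionaries; all IDs, the `docstore` contents, and the helper `parents_for_hits` are hypothetical, and the notebook's own `CVParentRetriever` may resolve parents differently.

```python
from typing import Dict, List, Tuple

# Hypothetical ranked child hits, as (child_id, parent_id) pairs.
child_hits: List[Tuple[str, str]] = [
    ("child-a", "parent-0-1"),
    ("child-b", "parent-3-2"),
    ("child-c", "parent-0-1"),  # shares a parent with child-a
    ("child-d", "parent-5-1"),
    ("child-e", "parent-3-1"),  # same CV as child-b, different parent chunk
]

# Hypothetical docstore: parent_id -> parent chunk text.
docstore: Dict[str, str] = {
    "parent-0-1": "parent chunk text 0-1",
    "parent-3-1": "parent chunk text 3-1",
    "parent-3-2": "parent chunk text 3-2",
    "parent-5-1": "parent chunk text 5-1",
}


def parents_for_hits(hits: List[Tuple[str, str]], store: Dict[str, str]) -> List[Tuple[str, str]]:
    """Resolve ranked child hits to unique parent chunks, keeping rank order."""
    seen = set()
    parents = []
    for _child_id, parent_id in hits:
        if parent_id in seen or parent_id not in store:
            continue  # skip duplicates and dangling references
        seen.add(parent_id)
        parents.append((parent_id, store[parent_id]))
    return parents


parents = parents_for_hits(child_hits, docstore)
print(f"{len(child_hits)} child hits -> {len(parents)} parent chunks")
for parent_id, _text in parents:
    print(f"  {parent_id}")
```

Deduplication of this kind is one plausible reason the logs can report fewer parent documents than the requested top-k (for example 4 parents for `top 5`), and why the same candidate can appear twice when two different parent chunks of one CV are matched.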

## Just experimenting:

In [28]:
print("\n" + "="*80)
print("PREVIEW: Child chunks that will go into the vectorstore")
print("="*80)

# Create a splitter for the preview
from langchain_text_splitters import RecursiveCharacterTextSplitter

child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=config.rag.child_chunk_size,
    chunk_overlap=config.rag.child_chunk_overlap,
    separators=["\n\n", "\n", ". ", " ", ""]
)

# Preview the first few CVs; one constant drives both the slice and the message
preview_count = 2
print(f"\nSample child chunks for the first {preview_count} documents:\n")

for doc in documents[:preview_count]:
    candidate_name = doc.metadata.get('candidate_name', 'Unknown')

    print(f"\n{'='*60}")
    print(f"📄 CV: {candidate_name}")
    print(f"{'='*60}")

    # Split into child chunks.
    # Note: this splits the full CV directly; the parent/child cell below
    # splits parents first, so the counts can differ.
    child_chunks = child_splitter.split_documents([doc])

    print(f"Number of child chunks: {len(child_chunks)}")
    print("\nFirst 3 child chunks:\n")

    for j, chunk in enumerate(child_chunks[:3]):
        print(f"\n--- Child chunk {j+1}/{len(child_chunks)} ---")
        print(f"Size: {len(chunk.page_content)} characters")
        print(f"Metadata: {chunk.metadata}")
        print("\nContent (first 300 characters):")
        print(chunk.page_content[:300] + "...")
        print()

# Overall summary
print("\n" + "="*80)
print("OVERALL SUMMARY - all documents")
print("="*80 + "\n")

total_child_chunks = 0
for doc in documents:
    candidate_name = doc.metadata.get('candidate_name', 'Unknown')
    child_chunks = child_splitter.split_documents([doc])
    num_chunks = len(child_chunks)
    total_child_chunks += num_chunks
    print(f"📄 {candidate_name:30s} → {num_chunks:3d} child chunks")

print(f"\n{'='*60}")
print(f"TOTAL: {total_child_chunks} child chunks will go into the vectorstore")
print(f"{'='*60}")


PREVIEW: Child chunks that will go into the vectorstore

Sample child chunks for the first 2 documents:


📄 CV: Baláček Daniel
Number of child chunks: 10

First 3 child chunks:


--- Child chunk 1/10 ---
Size: 388 characters
Metadata: {'candidate_name': 'Baláček Daniel', 'source': '..\\data\\OneDrive_2025-12-16\\Baláček_Daniel_CV_EN.docx', 'type': 'cv_parent', 'filename': 'Baláček_Daniel_CV_EN.docx', 'file_size': 496940, 'text_length': 3020}

Content (first 300 characters):
www.dolphinconsulting.cz	

Daniel Baláček

BI consultant 

 Praha



Key qualifications

Dashboard and report development and design

User requirements analysis and documentation

Business and data analysis



Skills & knowledge

Business intelligence

General Skills

Dashboard and report developmen...


--- Child chunk 2/10 ---
Size: 386 characters
Metadata: {'candidate_name': 'Baláček Daniel', 'source': '..\\data\\OneDrive_2025-12-16\\Baláček_Daniel_CV_EN.docx', 'type': 'cv_parent', 'filename': 'Baláče

In [29]:
print("\n" + "="*80)
print("PREVIEW: Parent + child document relationships")
print("="*80)

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create both splitters
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=config.rag.parent_chunk_size,
    chunk_overlap=config.rag.parent_chunk_overlap,
    separators=["\n\n", "\n", ". ", " ", ""]
)

child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=config.rag.child_chunk_size,
    chunk_overlap=config.rag.child_chunk_overlap,
    separators=["\n\n", "\n", ". ", " ", ""]
)

# Preview the first 2 CVs
print("\nSample parent → child relationships for the first 2 documents:\n")

for i, doc in enumerate(documents[:2]):
    candidate_name = doc.metadata.get('candidate_name', 'Unknown')

    print(f"\n{'='*70}")
    print(f"📄 CV: {candidate_name}")
    print(f"   Original size: {len(doc.page_content)} characters")
    print(f"{'='*70}")

    # 1. Split into parent chunks
    parent_chunks = parent_splitter.split_documents([doc])
    num_parents = len(parent_chunks)

    print(f"\n→ Split into {num_parents} parent chunks:")

    # 2. Show the child chunks of each parent chunk
    doc_child_count = 0
    for p_idx, parent in enumerate(parent_chunks, 1):
        parent_id = f"parent-{i}-{p_idx}"  # Simulated ID, for display only

        # Split the parent into child chunks
        child_chunks = child_splitter.split_documents([parent])
        num_children = len(child_chunks)
        doc_child_count += num_children

        print(f"\n  📦 Parent chunk {p_idx}/{num_parents}")
        print(f"     Size: {len(parent.page_content)} characters")
        print(f"     ID: {parent_id}")
        print(f"     → Contains {num_children} child chunks:")

        for c_idx, child in enumerate(child_chunks, 1):
            print(f"        └─ Child {c_idx}: {len(child.page_content)} characters (parent_id: {parent_id})")

        # Show the content of the first parent chunk only
        if p_idx == 1:
            print("\n     Parent chunk content (first 200 characters):")
            print(f"     {parent.page_content[:200]}...")

            print("\n     First child chunk from this parent:")
            print(f"     {child_chunks[0].page_content[:150]}...")

    print(f"\n  TOTAL for {candidate_name}:")
    print(f"    - {num_parents} parent chunks")
    print(f"    - {doc_child_count} child chunks")

# Overall summary for all documents
print("\n" + "="*80)
print("OVERALL SUMMARY - all documents")
print("="*80 + "\n")

print(f"{'CV name':<30} | {'Parents':<10} | {'Children':<10} | Children/Parent")
print("-" * 80)

grand_total_parents = 0
grand_total_children = 0

for doc in documents:
    candidate_name = doc.metadata.get('candidate_name', 'Unknown')

    # Split into parents
    parent_chunks = parent_splitter.split_documents([doc])
    num_parents = len(parent_chunks)

    # Count the children of every parent
    num_children = 0
    for parent in parent_chunks:
        child_chunks = child_splitter.split_documents([parent])
        num_children += len(child_chunks)

    grand_total_parents += num_parents
    grand_total_children += num_children

    ratio = num_children / num_parents if num_parents > 0 else 0

    print(f"{candidate_name:<30} | {num_parents:<10} | {num_children:<10} | {ratio:.1f}")

# Guard against division by zero when no documents were loaded
overall_ratio = grand_total_children / grand_total_parents if grand_total_parents > 0 else 0

print("-" * 80)
print(f"{'TOTAL':<30} | {grand_total_parents:<10} | {grand_total_children:<10} | {overall_ratio:.1f}")

print(f"\n{'='*80}")
print("WHAT GETS STORED:")
print(f"  - Vectorstore (ChromaDB): {grand_total_children} child chunks (with embeddings)")
print(f"  - Docstore (disk):        {grand_total_parents} parent chunks (as JSON)")
print(f"{'='*80}")


PREVIEW: Parent + child document relationships

Sample parent → child relationships for the first 2 documents:


📄 CV: Baláček Daniel
   Original size: 3020 characters

→ Split into 2 parent chunks:

  📦 Parent chunk 1/2
     Size: 1995 characters
     ID: parent-0-1
     → Contains 7 child chunks:
        └─ Child 1: 388 characters (parent_id: parent-0-1)
        └─ Child 2: 386 characters (parent_id: parent-0-1)
        └─ Child 3: 224 characters (parent_id: parent-0-1)
        └─ Child 4: 260 characters (parent_id: parent-0-1)
        └─ Child 5: 241 characters (parent_id: parent-0-1)
        └─ Child 6: 391 characters (parent_id: parent-0-1)
        └─ Child 7: 241 characters (parent_id: parent-0-1)

     Parent chunk content (first 200 characters):
     www.dolphinconsulting.cz	

Daniel Baláček

BI consultant 

 Praha



Key qualifications

Dashboard and report development and design

User requirements analysis and documentation

Business and data an...

     First child chu