# 🔍 Tutorial 3: RAG - Query Building Codes with AI

## 🎯 What You'll Learn
- What RAG (Retrieval-Augmented Generation) is
- How to query the Spanish building code (CTE)
- How to inspect the retrieved chunks and their scores
- How answers compare with and without RAG

---


## 🤔 What is RAG?

**Problem**: LLMs don't know about:
- Your company's documents
- Recent information
- Specific building codes

**Solution**: RAG = Retrieval + Generation

```
User Question
     ↓
🔍 Search Documents (Retrieval)
     ↓
📄 Find Relevant Chunks
     ↓
🤖 LLM + Context → Answer (Generation)
```
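
The flow in the diagram can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's implementation: the `DOCS` corpus and the `retrieve` and `build_prompt` helpers are hypothetical names, and retrieval is a naive word-overlap ranking standing in for real search.

```python
import re

# Hypothetical toy corpus; the notebook retrieves from the CTE files instead.
DOCS = [
    "Section 3.1.1: evacuation doors need a minimum clear width of 0.80 m.",
    "Section 3.2.1: the maximum evacuation distance depends on the building use.",
    "Section 5: requirements for fire brigade access roads.",
]

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Retrieval step: rank docs by how many question words they contain."""
    q_words = set(re.findall(r"\w+", question.lower()))

    def overlap(doc: str) -> int:
        return len(q_words & set(re.findall(r"\w+", doc.lower())))

    return sorted(docs, key=overlap, reverse=True)[:k]

def build_prompt(question: str, chunks: list[str]) -> str:
    """Generation input: the retrieved chunks become the LLM's context."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

question = "What is the minimum width of evacuation doors?"
top_chunks = retrieve(question, DOCS)
prompt = build_prompt(question, top_chunks)
print(prompt)  # this string is what would be sent to the LLM
```

The LLM call itself is left out on purpose: the point is that the model only ever sees the question plus whatever the retrieval step returned.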

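This tutorial uses two documents from the Spanish building code (CTE, Código Técnico de la Edificación):
- **CTE DB-SI**: Seguridad en caso de incendio (fire safety)
- **CTE DB-SUA**: Seguridad de utilización y accesibilidad (safety of use and accessibility)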

Total: ~200 pages of regulations


## 🧭 Tutorial Structure

- **Baseline**: LLM without RAG (shows generic responses)
- **Part 1**: Keyword-only search on TXT (`data/normativa/cte_db_si_ejemplo.txt`)
- **Part 2**: Hybrid retrieval + reranker using embedded PDF (`data/normativa/DBSI.pdf`)
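
The hybrid score used in Part 2 blends a semantic score with a keyword score. The sketch below shows that arithmetic in isolation, assuming the vectorstore returns distances (lower is better) and using the 70/30 weighting from the Part 2 cell. The helper names and the sample numbers are hypothetical.

```python
def to_similarity(distances):
    """Min-max normalize distances (lower is better) into [0, 1] scores (higher is better)."""
    lo, hi = min(distances), max(distances)
    span = max(hi - lo, 1e-9)  # avoid division by zero when all distances match
    return [1.0 - (d - lo) / span for d in distances]

def hybrid_scores(distances, keyword_scores, w_sem=0.7, w_kw=0.3):
    """Blend semantic and keyword evidence into one score per candidate."""
    sem = to_similarity(distances)
    max_kw = max(keyword_scores) or 1.0  # if nothing matched, keyword terms stay at zero
    return [w_sem * s + w_kw * (k / max_kw) for s, k in zip(sem, keyword_scores)]

# Hypothetical values for three retrieved chunks
distances = [0.20, 0.50, 0.80]    # vector distances from the store
keyword_scores = [1.0, 4.0, 0.0]  # raw keyword-overlap scores
scores = hybrid_scores(distances, keyword_scores)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Normalizing both signals to the same 0-1 range before weighting keeps either one from dominating just because of its scale.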


## Setup

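This notebook rebuilds the vectorstore for Part 2 on every run: the setup cell deletes any existing `vectorstore/normativa_db` and recreates it from the PDFs in `data/normativa/` (including `DBSI.pdf`). If that step fails, Part 2 is skipped and Part 1 still works.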

**If you get Chroma instance conflicts, restart the kernel and run all cells again.**


In [1]:
import sys
from pathlib import Path
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Ensure project root is on sys.path so `src` is importable, even if kernel starts in notebooks/
ROOT = Path.cwd()
try:
    # Walk up until we find a folder containing `src`
    while ROOT != ROOT.parent and not (ROOT / 'src').exists():
        ROOT = ROOT.parent
finally:
    if (ROOT / 'src').exists() and str(ROOT) not in sys.path:
        sys.path.insert(0, str(ROOT))

from IPython.display import display, HTML
import textwrap

print("✅ Basic setup complete!")
print(f"   Project root: {ROOT}")
print(f"   Working directory: {Path.cwd()}")

# Check API key
if os.getenv("OPENAI_API_KEY"):
    print("   OpenAI API key: ✅ Found")
else:
    print("   OpenAI API key: ❌ Not found in .env")

# Try to load vectorstore for Part 2 (optional)
rag = None
try:
    from src.rag.vectorstore_manager import VectorstoreManager
    
    # Clear any existing vectorstore to avoid conflicts
    import shutil
    vectorstore_path = ROOT / "vectorstore/normativa_db"
    if vectorstore_path.exists():
        shutil.rmtree(vectorstore_path)
        print("   Cleared existing vectorstore")
    
    # Create new vectorstore
    print("   Creating vectorstore from PDFs...")
    rag = VectorstoreManager(vectorstore_path)
    rag.create_from_pdfs(ROOT / "data/normativa")
    print("   Loading vectorstore...")
    rag.load_existing()
    
    # Test it works
    print("   Testing vectorstore...")
    test_docs = rag.vectorstore.similarity_search("test", k=1)
    print(f"✅ Part 2 ready! Found {len(test_docs)} docs in vectorstore")

except Exception as e:
    print(f"⚠️ Part 2 (vectorstore) not available: {e}")
    print(f"   Error type: {type(e).__name__}")
    if "tenants" in str(e) or "Database error" in str(e):
        print("   💡 This is a ChromaDB schema issue. The vectorstore directory has been cleared.")
        print("   💡 Try restarting the kernel and running all cells again.")
    print("   Part 1 (TXT search) will still work")
    rag = None

print("\n🎯 Ready to start!")
print("   Baseline: LLM without RAG")
print("   Part 1: TXT keyword search (always works)")
print("   Part 2: PDF hybrid retrieval (if vectorstore loaded)")


✅ Basic setup complete!
   Project root: /Users/rauladell/Work/Servitec/aec-compliance-agent
   Working directory: /Users/rauladell/Work/Servitec/aec-compliance-agent/notebooks
   OpenAI API key: ✅ Found
⚠️ Part 2 (vectorstore) not available: No module named 'langchain_core.memory'
   Error type: ModuleNotFoundError
   Part 1 (TXT search) will still work

🎯 Ready to start!
   Baseline: LLM without RAG
   Part 1: TXT keyword search (always works)
   Part 2: PDF hybrid retrieval (if vectorstore loaded)


In [5]:
# Baseline: Test LLM WITHOUT RAG (shows generic responses)
print("🤖 Baseline: LLM WITHOUT RAG")
print("=" * 70)

if os.getenv("OPENAI_API_KEY"):
    try:
        from openai import OpenAI
        client = OpenAI()
        
        questions = [
            "¿Ancho mínimo de puerta de evacuación?",
            "¿Distancia máxima de evacuación en edificios?"
        ]

        for q in questions:
            print(f"\n❓ {q}")
            print("-" * 50)
            
            response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "Eres un experto en normativa de construcción española. Responde basándote en tu conocimiento general."},
                    {"role": "user", "content": q}
                ],
                temperature=0.1
            )
            
            answer = response.choices[0].message.content
            print(f"📝 Respuesta: {answer}")

    except Exception as e:
        print(f"⚠️ Error en LLM: {e}")
else:
    print("ℹ️ Set OPENAI_API_KEY to see LLM responses without RAG")

print("\n" + "=" * 70)
print("🔍 Now let's see how RAG improves these answers...")
print("=" * 70)


🤖 Baseline: LLM WITHOUT RAG

❓ ¿Ancho mínimo de puerta de evacuación?
--------------------------------------------------
📝 Respuesta: Según la normativa española, el ancho mínimo de una puerta de evacuación en edificios destinados a uso residencial es de 0,80 metros. Este ancho garantiza que una persona pueda salir con facilidad en caso de emergencia. Es importante cumplir con esta medida para garantizar la seguridad de los ocupantes del edificio.

❓ ¿Distancia máxima de evacuación en edificios?
--------------------------------------------------
📝 Respuesta: La distancia máxima de evacuación en edificios en España suele estar regulada por la normativa de prevención de incendios, que puede variar según la normativa autonómica correspondiente. En general, se establece que la distancia máxima de evacuación en un edificio no debe superar los 50 metros desde cualquier punto del edificio hasta la salida de evacuación más cercana. Es importante tener en cuenta

In [9]:
# Part 1: Keyword-only Search on TXT
from pathlib import Path
import re

print("\n" + "#" * 70)
print("🔎 Part 1: Keyword-only Search (TXT)")
print("#" * 70)

# Use absolute path from project root
txt_path = ROOT / "data/normativa/cte_db_si_ejemplo.txt"
if not txt_path.exists():
    print(f"⚠️ TXT not found at {txt_path}. Please ensure it exists.")
else:
    raw = txt_path.read_text(encoding="utf-8", errors="ignore")

    # Very simple sectioning: split before headings like 'Sección X' or 'Capítulo X'.
    # The group is non-capturing so re.split returns only the sections, each one
    # starting with its own heading (zero-width splits need Python 3.7+).
    sections = re.split(r"(?i)(?=\b(?:sección|capítulo)\s+\w+)", raw)
    chunks = [s.strip() for s in sections if s.strip()]

    def keyword_rank(query: str, texts):
        q_words = [w for w in re.findall(r"\w+", query.lower()) if len(w) > 2]
        scored = []
        for t in texts:
            words = set(re.findall(r"\w+", t.lower()))
            overlap = sum(1 for w in q_words if w in words)
            density = overlap / max(len(set(q_words)), 1)
            score = overlap + 0.5 * density
            scored.append((t, score))
        return sorted(scored, key=lambda x: x[1], reverse=True)

    def render_txt_chunks(results):
        cards_html = []
        for i, (text, score) in enumerate(results, 1):
            # Best-effort parse of pseudo section heading
            m = re.search(r"(?i)\b(sección|capítulo)\s+([\w.-]+)", text)
            section = m.group(0) if m else None
            preview = textwrap.shorten(text.replace('\n', ' '), width=380, placeholder='...')
            card = f"""
            <div class='card'>
                <div class='card-title'>📄 Rank {i} - Score {score:.3f}</div>
                <div class='card-meta'>Source: cte_db_si_ejemplo.txt{f' - Section: {section}' if section else ''}</div>
                <div class='card-body'>{preview}</div>
            </div>
            """
            cards_html.append(card)
        html = f"""
        <style>
          .cards {{ display: grid; grid-template-columns: repeat(auto-fit, minmax(320px, 1fr)); gap: 12px; }}
          .card {{ border: 1px solid #e5e7eb; border-radius: 8px; padding: 12px; background: #fff; }}
          .card-title {{ font-weight: 600; margin-bottom: 6px; }}
          .card-meta {{ color: #6b7280; font-size: 12px; margin-bottom: 8px; }}
          .card-body {{ font-size: 14px; line-height: 1.4; }}
        </style>
        <div class='cards'>
          {''.join(cards_html)}
        </div>
        """
        display(HTML(html))

    keyword_questions = [
        "¿Ancho mínimo de puerta de evacuación?",
        "¿Distancia máxima de evacuación?"
    ]

    for q in keyword_questions:
        print(f"\n{'='*70}")
        print(f"❓ {q}")
        print('='*70)

        # Step 1: Show retrieval results
        print("🔍 STEP 1: Retrieval Results")
        ranked = keyword_rank(q, chunks)
        top = ranked[:3]
        render_txt_chunks(top)
        
        # Step 2: Show LLM response (if available)
        if os.getenv("OPENAI_API_KEY"):
            print("\n🤖 STEP 2: LLM Response")
            try:
                # Create a simple context from top results
                context = "\n\n".join([text for text, score in top])
                
                # Simple LLM call without RAG chain
                from openai import OpenAI
                client = OpenAI()
                
                response = client.chat.completions.create(
                    model="gpt-5-mini",
                    messages=[
                        {"role": "system", "content": "Eres un asistente que responde ÚNICAMENTE basándote en el contexto proporcionado. NO uses conocimiento previo. Si la información no está en el contexto, di 'No se encuentra información específica en el contexto proporcionado'. Cita siempre la fuente exacta."},
                        {"role": "user", "content": f"Contexto:\n{context}\n\nPregunta: {q}\n\nResponde basándote ÚNICAMENTE en el contexto de arriba. Si no hay información suficiente, dilo claramente."}
                    ],
                )
                
                answer = response.choices[0].message.content
                print(f"📝 Respuesta: {answer}")

            except Exception as e:
                print(f"⚠️ Error en LLM: {e}")
        else:
            print("\n🤖 STEP 2: LLM Response")
            print("ℹ️ Set OPENAI_API_KEY to see LLM response")



######################################################################
🔎 Part 1: Keyword-only Search (TXT)
######################################################################

❓ ¿Ancho mínimo de puerta de evacuación?
🔍 STEP 1: Retrieval Results

🤖 STEP 2: LLM Response
📝 Respuesta: Las puertas de evacuación tendrán un ancho libre mínimo de 0,80 m, excepto las puertas de evacuación de locales con ocupación inferior a 50 personas, que podrán tener un ancho libre mínimo de 0,60 m.

Fuente: "SECCIÓN 3: EVACUACIÓN DE OCUPANTES - 3.1.1 Puertas de evacuación: Las puertas de evacuación tendrán un ancho libre mínimo de 0,80 m, excepto las puertas de evacuación de locales con ocupación inferior a 50 personas, que podrán tener un ancho libre mínimo de 0,60 m."

❓ ¿Distancia máxima de evacuación?
🔍 STEP 1: Retrieval Results

🤖 STEP 2: LLM Response
📝 Respuesta: La distancia máxima de evacuación desde cualquier punto de un local hasta la salida más próxima es:
- 25 m en locales de uso residencial
- 20 m en locales de uso comercial
- 15 m en locales de uso industrial

Fuente: SECCIÓN 3: EVACUACIÓN DE OCUPANTES - 3.2.1 Distancia máxima de evacuación (contexto proporcionado).


In [None]:
# Part 2: Hybrid Retrieval + Simple Reranker (semantic + keywords)
# Reload vectorstore if needed (fix for dependency issues)
if rag is None or not hasattr(rag, 'vectorstore') or rag.vectorstore is None:
    print("🔄 Reloading vectorstore...")
    try:
        # Try the standard approach first
        from src.rag.vectorstore_manager import VectorstoreManager
        vectorstore_path = ROOT / "vectorstore/normativa_db"
        rag = VectorstoreManager(vectorstore_path)
        rag.load_existing()
        print("✅ Vectorstore reloaded successfully")
    except Exception as e:
        print(f"⚠️ Standard approach failed: {e}")
        print("🔄 Trying minimal ChromaDB approach...")
        try:
            # Fallback to minimal ChromaDB approach
            import chromadb
            from sentence_transformers import SentenceTransformer
            
            client = chromadb.PersistentClient(path=str(ROOT / "vectorstore/normativa_db"))
            collection = client.get_collection("langchain")
            model = SentenceTransformer('all-MiniLM-L6-v2')
            
            # Create a minimal rag object
            class MinimalRAG:
                def __init__(self, client, collection, model):
                    self.client = client
                    self.collection = collection
                    self.model = model
                    self.vectorstore = None  # For compatibility
                
                def similarity_search(self, query, k=3):
                    results = self.collection.query(
                        query_texts=[query],
                        n_results=k
                    )
                    # Convert to Document-like objects
                    from langchain_core.documents import Document
                    docs = []
                    for i, content in enumerate(results['documents'][0]):
                        metadata = results['metadatas'][0][i] if results['metadatas'][0] else {}
                        docs.append(Document(page_content=content, metadata=metadata))
                    return docs
                
                def similarity_search_with_score(self, query, k=3):
                    results = self.collection.query(
                        query_texts=[query],
                        n_results=k
                    )
                    # Convert to Document-like objects with scores
                    from langchain_core.documents import Document
                    docs_with_scores = []
                    for i, content in enumerate(results['documents'][0]):
                        metadata = results['metadatas'][0][i] if results['metadatas'][0] else {}
                        score = results['distances'][0][i] if results['distances'][0] else 0.0
                        doc = Document(page_content=content, metadata=metadata)
                        docs_with_scores.append((doc, score))
                    return docs_with_scores
            
            rag = MinimalRAG(client, collection, model)
            print("✅ Minimal ChromaDB approach loaded successfully")

        except Exception as e2:
            print(f"❌ Minimal approach also failed: {e2}")
            rag = None

if rag and (hasattr(rag, 'vectorstore') or hasattr(rag, 'collection')):
    import re
    from typing import List, Tuple

    print("\n" + "#" * 70)
    print("🚀 Part 2: Hybrid Retrieval + Reranking")
    print("#" * 70)

    hybrid_questions = [
        "¿Ancho mínimo de puerta de evacuación?",
        "¿Distancia máxima de evacuación en edificios?"
    ]

    def keyword_score(text: str, query: str) -> float:
        q_words = [w for w in re.findall(r"\w+", query.lower()) if len(w) > 2]
        if not q_words:
            return 0.0
        words = set(re.findall(r"\w+", text.lower()))
        overlap = sum(1 for w in q_words if w in words)
        density = overlap / max(len(set(q_words)), 1)
        return overlap + 0.5 * density

    def render_hybrid(results: List[Tuple[object, float]]):
        cards_html = []
        for i, (doc, score) in enumerate(results, 1):
            source = doc.metadata.get('source', 'Unknown')
            page = doc.metadata.get('page', 'N/A')
            section = doc.metadata.get('section')
            section_str = f" - Section: {section}" if section else ""
            preview = textwrap.shorten(doc.page_content.replace('\n', ' '), width=380, placeholder='...')
            card = f"""
            <div class='card'>
                <div class='card-title'>📄 Rank {i} - Score {score:.3f}</div>
                <div class='card-meta'>Source: {source} - Page: {page}{section_str}</div>
                <div class='card-body'>{preview}</div>
            </div>
            """
            cards_html.append(card)
        html = f"""
        <style>
          .cards {{ display: grid; grid-template-columns: repeat(auto-fit, minmax(320px, 1fr)); gap: 12px; }}
          .card {{ border: 1px solid #e5e7eb; border-radius: 8px; padding: 12px; background: #fff; }}
          .card-title {{ font-weight: 600; margin-bottom: 6px; }}
          .card-meta {{ color: #6b7280; font-size: 12px; margin-bottom: 8px; }}
          .card-body {{ font-size: 14px; line-height: 1.4; }}
        </style>
        <div class='cards'>
          {''.join(cards_html)}
        </div>
        """
        display(HTML(html))

    for q in hybrid_questions:
        print(f"\n{'='*70}")
        print(f"❓ {q}")
        print('='*70)

        # Step 1: Show retrieval results
        print("🔍 STEP 1: Hybrid Retrieval Results")
        
        # 1) Semantic candidates with scores (fallback to inverse-rank if unavailable)
        # Search the LangChain vectorstore when the manager exposes one; otherwise
        # use the object itself (the MinimalRAG fallback implements both methods).
        searcher = getattr(rag, 'vectorstore', None) or rag
        semantic_results = []
        try:
            sem_with_scores = searcher.similarity_search_with_score(q, k=8)
            semantic_results = [(doc, float(score)) for doc, score in sem_with_scores]

            # Convert distance (lower is better) to similarity-like score (higher is better)
            if semantic_results:
                max_s = max(s for _, s in semantic_results)
                min_s = min(s for _, s in semantic_results)
                denom = max(max_s - min_s, 1e-9)
                semantic_results = [(d, 1.0 - ((s - min_s) / denom)) for d, s in semantic_results]
        except Exception as e:
            print(f"⚠️ Error with vectorstore scores: {e}")
            # Fallback: inverse rank scoring (already higher-is-better, so it is not inverted)
            try:
                sem_docs = searcher.similarity_search(q, k=8)
                k = len(sem_docs) or 1
                semantic_results = [(d, (k - i) / k) for i, d in enumerate(sem_docs)]
            except Exception as e2:
                print(f"⚠️ Error with similarity search: {e2}")
                print("Skipping this question...")
                continue

        # 2) Keyword scoring for the same docs
        kw_scores = {id(doc): keyword_score(doc.page_content, q) for doc, _ in semantic_results}
        max_kw = max(kw_scores.values()) if kw_scores else 1.0

        # 3) Combine (70% semantic, 30% keyword)
        combined: List[Tuple[object, float]] = []
        for doc, sem_s in semantic_results:
            kw_norm = kw_scores.get(id(doc), 0.0) / max_kw if max_kw > 0 else 0.0
            final = 0.7 * sem_s + 0.3 * kw_norm
            combined.append((doc, final))

        # 4) Sort and show top results
        combined.sort(key=lambda x: x[1], reverse=True)
        top = combined[:3]
        print(f"Top {len(top)} results (hybrid + rerank):")
        render_hybrid(top)
        
        # Step 2: Show LLM response (if available)
        if os.getenv("OPENAI_API_KEY"):
            print("\n🤖 STEP 2: LLM Response")
            try:
                # Create context from top results
                context = "\n\n".join([doc.page_content for doc, score in top])
                
                # Simple LLM call
                from openai import OpenAI
                client = OpenAI()
                
                response = client.chat.completions.create(
                    model="gpt-5-mini",
                    messages=[
                        {"role": "system", "content": "Eres un asistente que responde ÚNICAMENTE basándote en el contexto proporcionado. NO uses conocimiento previo. Si la información no está en el contexto, di 'No se encuentra información específica en el contexto proporcionado'. Cita siempre la fuente exacta (documento, página, sección)."},
                        {"role": "user", "content": f"Contexto:\n{context}\n\nPregunta: {q}\n\nResponde basándote ÚNICAMENTE en el contexto de arriba. Si no hay información suficiente, dilo claramente."}
                    ],
                )
                
                answer = response.choices[0].message.content
                print(f"📝 Respuesta: {answer}")

            except Exception as e:
                print(f"⚠️ Error en LLM: {e}")
        else:
            print("\n🤖 STEP 2: LLM Response")
            print("ℹ️ Set OPENAI_API_KEY to see LLM response")
else:
    print("⚠️ Part 2 skipped - vectorstore not available")
    if rag is None:
        print("   Reason: rag object is None")
    elif not hasattr(rag, 'vectorstore') or rag.vectorstore is None:
        print("   Reason: vectorstore is not initialized")
    else:
        print("   Reason: unknown")



######################################################################
🚀 Part 2: Hybrid Retrieval + Reranking
######################################################################

❓ ¿Ancho mínimo de puerta de evacuación?
🔍 STEP 1: Hybrid Retrieval Results
Top 3 results (hybrid + rerank):

🤖 STEP 2: LLM Response
📝 Respuesta: No se encuentra información específica en el contexto proporcionado sobre el ancho mínimo de las puertas de evacuación. (Fuente: contexto proporcionado - párrafo “- El recorrido de evacuación … 25 m y las puertas de salida deben abrir en el sentido de la evacuación.”)

Observaciones relacionadas en el contexto:
- Se indica que las puertas deben abrir en el sentido de la evacuación y que el recorrido desde cualquier punto del escenario no debe exceder 25 m. (Fuente: contexto proporcionado - mismo párrafo citado).
- El contexto sí especifica anchuras mínimas para otros elementos: “Las pasarelas y escaleras del escenario deben tener una anchura de 0,80 m, como mínimo.” (Fuente: contexto proporcionado - párrafo “- Las pasarelas y escaleras del escenario…”).
- Tabla 4.2 aporta capacidades de evacuación en función de la anchura de las escaleras, pero no establece un ancho mínimo de puerta. (Fuente: contexto prop


🤖 STEP 2: LLM Response


## 🎯 Summary

In this tutorial, you learned:

1. ✅ **LLM without RAG**: Gives generic answers with no grounding in the specific building code
2. ✅ **RAG combines retrieval + LLM generation**: Retrieval finds relevant chunks, and the LLM answers from them
3. ✅ **Keyword search**: Simple word-overlap matching with scores
4. ✅ **Hybrid retrieval**: Combines semantic similarity (70%) with keyword matching (30%), then reranks
5. ✅ **Visual results**: See exactly which chunks were retrieved and how they scored
6. ✅ **Source citations**: The prompt asks the model to cite the document, page, and section it used

**Key Insight**: RAG grounds answers in your documents instead of the LLM's general knowledge, but only as far as retrieval succeeds. In Part 1 the retrieved section contained the door-width rule and the model quoted it; in Part 2 the top chunks did not, and the model reported that the information was missing rather than guessing.

**Next**: Tutorial 4 - Autonomous Agent


## 📦 Part 3: Build Vectorstore from `data/normativa/` for Agents

This section builds/updates the vectorstore from all PDFs in `data/normativa/` (e.g., `DBSI.pdf`, `DccSUA.pdf`) and prepares a hybrid retriever suitable for agent usage.

- Source folder: `data/normativa/`
- Persist path: `vectorstore/normativa_db`
- Output: a retriever object (`AGENT_RETRIEVER`) that agents can use
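
One way an agent might consume the retriever is through a small wrapper that returns citable context text. This is a sketch under assumptions: `retrieve_context`, `FakeDoc`, and `FakeRetriever` are hypothetical names, the page number is made up, and the stub exists only so the example runs without a vectorstore. In the notebook you would pass `AGENT_RETRIEVER` instead.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

class FakeRetriever:
    """Hypothetical stub so the sketch runs without a vectorstore."""
    def invoke(self, query: str):
        return [FakeDoc("Example chunk text", {"source": "DBSI.pdf", "page": 12})]

def retrieve_context(retriever, query: str, max_chars: int = 500) -> str:
    """Query the retriever and format the chunks as citable context for an agent."""
    if hasattr(retriever, "invoke"):
        docs = retriever.invoke(query)
    else:  # retrievers that only expose the older method
        docs = retriever.get_relevant_documents(query)
    parts = []
    for d in docs:
        src = d.metadata.get("source", "Unknown")
        page = d.metadata.get("page", "?")
        parts.append(f"[{src}, page {page}] {d.page_content[:max_chars]}")
    return "\n\n".join(parts)

context = retrieve_context(FakeRetriever(), "ancho mínimo puerta evacuación")
print(context)
```

Prefixing each chunk with its source and page gives the LLM what it needs to cite, which the raw `page_content` alone does not.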


In [None]:
from pathlib import Path
import os

print("\n" + "#" * 70)
print("🧱 Part 3: Build/Update Vectorstore from normativa/")
print("#" * 70)

norm_dir = ROOT / "data/normativa"
vectorstore_path = ROOT / "vectorstore/normativa_db"

print(f"📂 Source dir: {norm_dir}")
print(f"💾 Persist dir: {vectorstore_path}")

# Ensure directories exist
norm_dir.mkdir(parents=True, exist_ok=True)
vectorstore_path.mkdir(parents=True, exist_ok=True)

# Attempt standard pipeline first
hybrid_retriever = None
try:
    from src.rag.vectorstore_manager import VectorstoreManager

    manager = VectorstoreManager(vectorstore_path)
    print("🔁 Creating/refreshing vectorstore...")
    manager.create_from_pdfs(norm_dir)
    print("🔁 Loading vectorstore...")
    manager.load_existing()

    # Prepare retriever: MMR with k=6 gives diverse candidates
    hybrid_retriever = manager.get_retriever(k=6, search_type="mmr")
    print("✅ Standard retriever ready (MMR, k=6)")

except Exception as e:
    print(f"⚠️ Standard pipeline failed: {e}")
    print("🔄 Falling back to minimal ChromaDB retriever")
    try:
        import chromadb
        from sentence_transformers import SentenceTransformer
        from langchain_core.documents import Document

        client = chromadb.PersistentClient(path=str(vectorstore_path))
        # The default collection name used by LangChain Chroma is "langchain"
        collection = client.get_or_create_collection("langchain")
        model = SentenceTransformer("all-MiniLM-L6-v2")

        class MinimalRetriever:
            def __init__(self, collection):
                self.collection = collection
            def invoke(self, query: str):
                res = self.collection.query(query_texts=[query], n_results=6)
                docs = []
                for i, content in enumerate(res["documents"][0]):
                    meta = res["metadatas"][0][i] if res["metadatas"][0] else {}
                    docs.append(Document(page_content=content, metadata=meta))
                return docs

        hybrid_retriever = MinimalRetriever(collection)
        print("✅ Minimal retriever ready (top-6)")
    except Exception as e2:
        print(f"❌ Minimal retriever failed: {e2}")
        hybrid_retriever = None

# Quick smoke test
if hybrid_retriever is not None:
    test_q = "ancho mínimo puerta evacuación"
    print(f"\n🧪 Smoke test query: {test_q}")
    try:
        # Works for both retriever types (`invoke`) or .get_relevant_documents
        results = None
        if hasattr(hybrid_retriever, "invoke"):
            results = hybrid_retriever.invoke(test_q)
        else:
            results = hybrid_retriever.get_relevant_documents(test_q)
        n = len(results) if results else 0
        print(f"✅ Retriever returned {n} docs")
        if results:
            print("📄 Preview:")
            for d in results[:2]:
                src = d.metadata.get("source", "Unknown")
                page = d.metadata.get("page", "?")
                print(f" - {src} (page {page}): {d.page_content[:120]}...")
    except Exception as e:
        print(f"⚠️ Smoke test failed: {e}")
else:
    print("❌ No retriever available")

# Expose the retriever for agents (None if both pipelines failed)
AGENT_RETRIEVER = hybrid_retriever
if AGENT_RETRIEVER is not None:
    print("\n📦 AGENT_RETRIEVER is ready for import/use in agents.")
