# 🔗 Week 9: LLM Orchestration & Streaming

**Learning Objectives:**
1. Connect the retrieval pipeline to an LLM
2. Implement Server-Sent Events (SSE) streaming
3. Build structured output parsing
4. Create prompt templates and chains

---

In [None]:
import json
import time
from typing import Generator, Dict, Any

---
# Section 1: Theory
---

## RAG Pipeline
```
Query → Embed → Retrieve → [Re-rank] → Context + Query → LLM → Response
```
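
The stages in the diagram can be sketched end to end with toy components. This is a minimal illustration, not a production pipeline: `embed` is a bag-of-words stand-in for a real embedding model, `DOCS`, `cosine`, `retrieve`, and `build_prompt` are hypothetical names, and the optional re-rank step is omitted. The final prompt would then be passed to the LLM.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

# Hypothetical in-memory corpus standing in for a document store.
DOCS = [
    "Machine learning is a subset of AI.",
    "Deep learning uses neural networks.",
    "Python is popular for ML.",
]

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (assumption: replaces a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, top_k: int = 2) -> List[Tuple[int, float, str]]:
    """Return the top_k (doc_id, score, text) hits, best first."""
    q = embed(query)
    scored = [(i, cosine(q, embed(d)), d) for i, d in enumerate(DOCS)]
    return sorted(scored, key=lambda hit: hit[1], reverse=True)[:top_k]

def build_prompt(query: str, hits: List[Tuple[int, float, str]]) -> str:
    """Combine retrieved text and the query into a single prompt string."""
    context = "\n".join(text for _, _, text in hits)
    return f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"

question = "What is deep learning?"
hits = retrieve(question)
prompt = build_prompt(question, hits)
print(prompt)
```

The `(doc_id, score, text)` tuple shape is chosen to match what the mock retriever later in this notebook returns.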

## Why Streaming?
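- Better UX: users see the first tokens as soon as they are generated instead of waiting for the full response
- Lower perceived latency: time to first token matters more to the user than total generation time
- Early stopping: the client can stop consuming the stream once the answer is sufficient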
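
The points above can be made concrete with a small timing sketch. It assumes a hypothetical `fake_stream` generator that stands in for an LLM, and a made-up stop condition (the token containing `42.`). It records the time to the first token and stops reading as soon as the answer has arrived.

```python
import time
from typing import Generator, List, Optional

def fake_stream(text: str, delay: float = 0.02) -> Generator[str, None, None]:
    """Yield one word at a time with a fixed delay (stand-in for an LLM)."""
    for word in text.split():
        time.sleep(delay)
        yield word + " "

text = "The answer is 42. Everything after this is extra detail."

start = time.perf_counter()
first_token_at: Optional[float] = None
received: List[str] = []

for token in fake_stream(text):
    if first_token_at is None:
        first_token_at = time.perf_counter() - start  # time to first token
    received.append(token)
    if "42." in token:  # hypothetical stop condition: the answer has arrived
        break           # the remaining tokens are never generated

elapsed = time.perf_counter() - start
answer = "".join(received).strip()
print(f"first token after {first_token_at:.3f}s, stopped after {elapsed:.3f}s")
print(answer)
```

Because generators are lazy, breaking out of the loop means the remaining words are never produced. With a real LLM API, the client would usually also need to close the connection to stop generation on the server side.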

---
# Section 2: Hands-On Implementation
---

In [None]:
class PromptTemplate:
    """Simple prompt template."""
    
    def __init__(self, template: str):
        self.template = template
    
    def format(self, **kwargs) -> str:
        return self.template.format(**kwargs)

# RAG prompt
RAG_PROMPT = PromptTemplate("""
You are a helpful assistant. Answer based only on the context provided.

Context:
{context}

Question: {question}

Answer:""")

In [None]:
class MockLLM:
    """Simulated LLM for demo."""
    
    def generate(self, prompt: str) -> str:
        """Generate response."""
        # The prompt is ignored here; a real client would send it to a model.
        return "Based on the context, the answer is: [simulated response]"
    
    def stream(self, prompt: str) -> Generator[str, None, None]:
        """Stream response token by token."""
        response = self.generate(prompt)
        for word in response.split():
            time.sleep(0.1)  # Simulate token generation delay
            yield word + " "

In [None]:
class RAGOrchestrator:
    """Orchestrates retrieval and LLM."""
    
    def __init__(self, retriever, llm, prompt_template):
        self.retriever = retriever
        self.llm = llm
        self.prompt = prompt_template
    
    def query(self, question: str, top_k: int = 3) -> Dict[str, Any]:
        """Run full RAG pipeline."""
        # Retrieve: each hit is a (doc_id, score, text) tuple
        docs = self.retriever.search(question, top_k=top_k)
        context = "\n".join(d[2] for d in docs)
        
        # Generate
        prompt = self.prompt.format(context=context, question=question)
        response = self.llm.generate(prompt)
        
        return {
            "answer": response,
            "sources": [d[2] for d in docs],
            "prompt": prompt
        }
    
    def stream_query(self, question: str, top_k: int = 3) -> Generator[str, None, None]:
        """Stream RAG response."""
        docs = self.retriever.search(question, top_k=top_k)
        context = "\n".join([d[2] for d in docs])
        prompt = self.prompt.format(context=context, question=question)
        
        for token in self.llm.stream(prompt):
            yield token

In [None]:
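# Mock retriever: returns fixed (doc_id, score, text) hits regardless of the query
class MockRetriever:
    def search(self, query, top_k=3):
        hits = [
            (0, 0.9, "Machine learning is a subset of AI."),
            (1, 0.8, "Deep learning uses neural networks."),
            (2, 0.7, "Python is popular for ML.")
        ]
        return hits[:top_k]  # honor top_k instead of always returning all hits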

# Test streaming
rag = RAGOrchestrator(MockRetriever(), MockLLM(), RAG_PROMPT)

print("Streaming response:")
for token in rag.stream_query("What is ML?"):
    print(token, end="", flush=True)
print()

## 2.2 SSE for Web Streaming

In [None]:
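sse_example = """
# FastAPI SSE Endpoint

import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

# `rag` is assumed to be the RAGOrchestrator instance created earlier.

@app.post("/chat/stream")
async def stream_chat(query: str):  # `query` is read from the query string
    # A plain (sync) generator: StreamingResponse iterates it in a threadpool,
    # so the blocking rag.stream_query() loop does not stall the event loop.
    def generate():
        for token in rag.stream_query(query):
            yield f"data: {json.dumps({'token': token})}\\n\\n"
        # End-of-stream sentinel (a convention, not part of the SSE format)
        yield "data: [DONE]\\n\\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream"
    )
"""
print(sse_example)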
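
On the consuming side, a client has to turn the `data:` lines back into tokens. The following is a minimal sketch of that step, assuming the event shape used by the endpoint above (a JSON object with a `token` key, then a `[DONE]` sentinel). `parse_sse_tokens` is a hypothetical helper, and it deliberately ignores other SSE features such as multi-line `data` fields, `event:`, and `id:`.

```python
import json
from typing import Iterable, Iterator

def parse_sse_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield tokens from SSE lines of the form 'data: {"token": ...}'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("data:"):
            continue  # skip blank event separators and any other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # sentinel used by the endpoint, by convention
            break
        yield json.loads(payload)["token"]

# Simulated response body; a real client would read lines from the HTTP stream.
raw = 'data: {"token": "Hello "}\n\ndata: {"token": "world"}\n\ndata: [DONE]\n\n'
tokens = list(parse_sse_tokens(raw.splitlines()))
print("".join(tokens))
```

In a browser, the built-in `EventSource` API only issues GET requests, so a POST endpoint like this one is typically consumed with a streaming `fetch` and a parser along these lines.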

---
# Section 3: Unit Tests
---

In [None]:
def run_tests():
    print("Running Unit Tests...\n")
    
    # Test prompt template
    pt = PromptTemplate("Hello {name}!")
    assert pt.format(name="World") == "Hello World!"
    print("✓ Prompt template test passed")
    
    # Test RAG pipeline
    rag = RAGOrchestrator(MockRetriever(), MockLLM(), RAG_PROMPT)
    result = rag.query("test")
    assert "answer" in result
    assert "sources" in result
    print("✓ RAG pipeline test passed")
    
    print("\n🎉 All tests passed!")

run_tests()

---
# Section 4: Interview Prep
---

### Q1: How do you handle hallucinations in RAG?
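**Answer:** Ground the answer in the retrieved documents (instruct the model to answer only from the provided context), verify that every citation maps to a source that was actually retrieved, add guardrails such as refusing when the context does not contain the answer, and collect user feedback to catch what slips through.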
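
Citation verification can be combined with structured output parsing. The sketch below assumes the prompt asks the model to reply with JSON containing an `answer` string and a `citations` list of source indices. That schema and the `parse_cited_answer` helper are hypothetical, chosen only for illustration. Malformed output or citations pointing outside the retrieved sources mark the result as invalid.

```python
import json
from typing import Any, Dict

def parse_cited_answer(raw: str, num_sources: int) -> Dict[str, Any]:
    """Parse a JSON answer and check that citations refer to real sources."""
    invalid = {"answer": None, "citations": [], "valid": False}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return invalid  # the model did not return JSON at all
    if not isinstance(data, dict):
        return invalid
    answer = data.get("answer")
    cited = data.get("citations", [])
    if not isinstance(answer, str) or not isinstance(cited, list):
        return invalid
    # Keep only integer indices that fall inside the retrieved source list.
    good = [
        i for i in cited
        if isinstance(i, int) and not isinstance(i, bool) and 0 <= i < num_sources
    ]
    # Valid only if there is at least one citation and none were dropped.
    valid = bool(good) and len(good) == len(cited)
    return {"answer": answer, "citations": good, "valid": valid}

ok = parse_cited_answer('{"answer": "ML is a subset of AI.", "citations": [0]}', 3)
bad = parse_cited_answer('{"answer": "ML is magic.", "citations": [7]}', 3)
garbage = parse_cited_answer("Sure! Here is the answer...", 3)
print(ok["valid"], bad["valid"], garbage["valid"])
```

This only checks that cited indices exist. Checking that the cited text actually supports the answer would need an additional step, such as an entailment check or a second model call.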

### Q2: Streaming vs batch for chat UX?
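**Answer:** Use streaming for interactive chat, where time to first token drives perceived latency. Use batch (non-streaming) responses for offline or bulk jobs where only the complete output matters, or when the output must be validated as a whole before it is used.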

---
# Section 5: Deliverable
---

**Created:** `orchestrator.py` with RAG pipeline and streaming

**Next Week:** Guardrails & Cost Metrics