# 💰 Week 10: Guardrails & Cost Metrics

**Learning Objectives:**
1. Implement evaluation frameworks (RAGAS-style)
2. Track token usage and costs
3. Measure faithfulness and relevance
4. Build monitoring dashboards

---

In [None]:
import numpy as np
from typing import Dict, List
from dataclasses import dataclass

---
# Section 1: Theory
---

## RAG Evaluation Metrics

| Metric | Measures | Range |
|--------|----------|-------|
| Faithfulness | Is answer grounded in context? | 0-1 |
| Relevance | Does answer address question? | 0-1 |
| Context Precision | Are retrieved docs relevant? | 0-1 |
| Answer Correctness | Is answer factually correct? | 0-1 |
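
Context Precision appears in the table but is not implemented in this notebook. As a minimal sketch, assuming each retrieved document already carries a binary relevance label (from a human reviewer or an LLM judge), it can be approximated as the fraction of retrieved documents that are relevant. The name `context_precision` and the example labels are hypothetical, and rank-aware variants of this metric also exist.

```python
from typing import List


def context_precision(relevance_labels: List[bool]) -> float:
    """Fraction of retrieved documents labeled relevant (assumes binary labels)."""
    if not relevance_labels:
        return 0.0
    return sum(relevance_labels) / len(relevance_labels)


# Hypothetical labels for four retrieved documents, in retrieval order.
labels = [True, False, True, True]
print(context_precision(labels))  # 3 of 4 documents are relevant
```

Like the other metrics, the score stays in the 0-1 range, so it can be reported alongside them.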

---
# Section 2: Hands-On Implementation
---

In [None]:
@dataclass
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int
    
    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens
    
    def cost(self, prompt_price: float = 0.01, completion_price: float = 0.03) -> float:
        """Return the total cost in dollars; prices are given per 1K tokens."""
        return (self.prompt_tokens * prompt_price +
                self.completion_tokens * completion_price) / 1000

class CostTracker:
    """Track API costs."""
    
    def __init__(self):
        self.usages: List[TokenUsage] = []
    
    def add(self, usage: TokenUsage):
        self.usages.append(usage)
    
    def total_cost(self) -> float:
        return sum(u.cost() for u in self.usages)
    
    def summary(self) -> Dict:
        return {
            "total_requests": len(self.usages),
            "total_tokens": sum(u.total_tokens for u in self.usages),
            "total_cost": f"${self.total_cost():.4f}"
        }
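
The cost formula above treats prices as dollars per 1,000 tokens. One way to turn cost tracking into a guardrail is to compare running spend against a budget. The sketch below is self-contained and hypothetical: `request_cost`, `BudgetGuard`, and the 0.02 budget are illustrative names and values that are not part of the notebook, and the default prices simply mirror the defaults used above.

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 prompt_price: float = 0.01,
                 completion_price: float = 0.03) -> float:
    """Dollar cost of one request; prices are per 1K tokens (assumed defaults)."""
    return (prompt_tokens * prompt_price
            + completion_tokens * completion_price) / 1000


class BudgetGuard:
    """Hypothetical guardrail: accumulate spend and flag when a budget is exceeded."""

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def record(self, prompt_tokens: int, completion_tokens: int) -> bool:
        """Add one request's cost; return True while spend stays within budget."""
        self.spent_usd += request_cost(prompt_tokens, completion_tokens)
        return self.spent_usd <= self.budget_usd


guard = BudgetGuard(budget_usd=0.02)
print(guard.record(500, 100))  # (500*0.01 + 100*0.03) / 1000 = 0.008, within budget
print(guard.record(800, 200))  # adds 0.014 for a total of 0.022, over budget
```

A caller could stop issuing requests, or switch to a cheaper model, as soon as `record` returns `False`.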

In [None]:
import re


class RAGEvaluator:
    """Evaluate RAG responses with simple word-overlap heuristics."""

    @staticmethod
    def _tokens(text: str) -> set:
        """Lowercase the text and return its set of words, ignoring punctuation."""
        return set(re.findall(r"\w+", text.lower()))

    def faithfulness(self, answer: str, context: str) -> float:
        """Fraction of answer words that also appear in the context."""
        answer_words = self._tokens(answer)
        context_words = self._tokens(context)
        if not answer_words:
            return 0.0
        return len(answer_words & context_words) / len(answer_words)

    def relevance(self, answer: str, question: str) -> float:
        """Fraction of question words that also appear in the answer."""
        q_words = self._tokens(question)
        a_words = self._tokens(answer)
        if not q_words:
            return 0.0
        return len(q_words & a_words) / len(q_words)
    
    def evaluate(self, question: str, answer: str, context: str) -> Dict:
        return {
            "faithfulness": self.faithfulness(answer, context),
            "relevance": self.relevance(answer, question)
        }

In [None]:
# Test cost tracking
tracker = CostTracker()
tracker.add(TokenUsage(500, 100))
tracker.add(TokenUsage(800, 200))

print("Cost Summary:", tracker.summary())

# Test evaluation
evaluator = RAGEvaluator()
metrics = evaluator.evaluate(
    question="What is machine learning?",
    answer="Machine learning is a subset of AI that learns from data.",
    context="Machine learning is a subset of artificial intelligence."
)
print(f"\nEvaluation: {metrics}")
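
A monitoring dashboard usually needs per-metric aggregates rather than single scores. As a rough sketch, assuming each evaluation yields a dictionary shaped like the one returned by `evaluate`, the scores can be averaged and low-faithfulness responses counted. `summarize_metrics`, the sample scores, and the 0.5 threshold are hypothetical choices for illustration.

```python
from statistics import mean
from typing import Dict, List


def summarize_metrics(results: List[Dict[str, float]],
                      faithfulness_threshold: float = 0.5) -> Dict[str, float]:
    """Average each metric and count responses below the faithfulness threshold."""
    if not results:
        return {"count": 0}
    return {
        "count": len(results),
        "mean_faithfulness": mean(r["faithfulness"] for r in results),
        "mean_relevance": mean(r["relevance"] for r in results),
        "low_faithfulness": sum(
            r["faithfulness"] < faithfulness_threshold for r in results
        ),
    }


# Hypothetical scores, as if collected from three evaluate() calls.
history = [
    {"faithfulness": 0.9, "relevance": 0.75},
    {"faithfulness": 0.4, "relevance": 0.5},
    {"faithfulness": 0.8, "relevance": 1.0},
]
print(summarize_metrics(history))
```

The resulting dictionary is flat, so it can be logged or plotted over time without further processing.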

---
# Section 3: Unit Tests
---

In [None]:
def run_tests():
    print("Running Unit Tests...\n")
    
    # Test cost tracking
    usage = TokenUsage(1000, 500)
    assert usage.total_tokens == 1500
    print("✓ Token usage test passed")

    # Test faithfulness: every answer word appears in the context
    ev = RAGEvaluator()
    assert ev.faithfulness("hello world", "hello world again") > 0.5
    print("✓ Faithfulness test passed")

    print("\n🎉 All tests passed!")

run_tests()

---
# Section 4: Interview Prep
---

### Q1: How do you reduce LLM costs?
**Answer:** Cache repeated prompts, route simple requests to smaller (cheaper) models, truncate context to reduce prompt tokens, and batch requests.
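
To make the caching idea concrete, here is a minimal sketch of an in-memory response cache keyed by a hash of the prompt. `fake_llm` is a stand-in for a real, billable API call and `CachedLLM` is a hypothetical wrapper. This caches whole responses on the client side, which is a simpler technique than provider-side prompt caching.

```python
import hashlib
from typing import Callable, Dict


class CachedLLM:
    """Hypothetical wrapper that reuses responses for identical prompts."""

    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm
        self.cache: Dict[str, str] = {}
        self.api_calls = 0

    def ask(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.api_calls += 1  # only cache misses reach the (paid) model
            self.cache[key] = self.llm(prompt)
        return self.cache[key]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; assumed for illustration only."""
    return f"echo: {prompt}"


client = CachedLLM(fake_llm)
client.ask("What is RAG?")
client.ask("What is RAG?")  # served from the cache, no second model call
print(client.api_calls)
```

Counting `api_calls` separately from total requests gives a cache hit rate that can be reported next to token costs.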

### Q2: What is faithfulness in RAG?
**Answer:** The degree to which the answer's claims are supported by the retrieved context rather than hallucinated.
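
One way to illustrate the difference between grounded and hallucinated content is to score the answer sentence by sentence. The sketch below is a crude word-overlap stand-in, assuming a sentence counts as supported when at least 60% of its words appear in the context; `sentence_faithfulness` and the `min_overlap` value are hypothetical, and real evaluators often use an LLM judge for this step instead.

```python
import re
from typing import List


def sentence_faithfulness(answer: str, context: str, min_overlap: float = 0.6) -> float:
    """Fraction of answer sentences whose words mostly appear in the context."""
    context_words = set(re.findall(r"\w+", context.lower()))
    # Split on whitespace that follows sentence-ending punctuation.
    sentences: List[str] = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = set(re.findall(r"\w+", sentence.lower()))
        if words and len(words & context_words) / len(words) >= min_overlap:
            supported += 1
    return supported / len(sentences)


context = "Machine learning is a subset of artificial intelligence."
answer = ("Machine learning is a subset of artificial intelligence. "
          "It was invented in 1850.")
# The first sentence is fully supported; the second shares no words with the context.
print(sentence_faithfulness(answer, context))
```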

---
# Section 5: Deliverable
---

**Created:** `orchestrator_v2.py` with cost tracking and evaluation

**Next Week:** Flutter Chat UI