#### **Complete RAG Evaluation Guide: From Basics to Production**


##### **What is RAG Evaluation?**

RAG (Retrieval-Augmented Generation) Evaluation is the systematic process of measuring and assessing the quality, accuracy, and effectiveness of RAG systems. It involves testing both the retrieval component (how well the system finds relevant information) and the generation component (how well the language model uses that information to create responses).

---

##### **Core Components of RAG Evaluation**

**1. Retrieval Quality Assessment**
- Measures how well the system retrieves relevant documents  
- Evaluates ranking and relevance of retrieved content  
- Assesses coverage of important information  

**2. Generation Quality Assessment**
- Measures how well the LLM uses retrieved information  
- Evaluates accuracy, coherence, and relevance of generated responses  
- Checks for hallucinations and factual errors  

**3. End-to-End System Performance**
- Measures overall user satisfaction  
- Evaluates response time and system reliability  
- Assesses real-world effectiveness  

---

##### **Why RAG Evaluation is Critical**

**1. Quality Assurance**  
RAG systems combine complex retrieval and generation processes. Without proper evaluation, you cannot ensure the system produces accurate, relevant, and reliable responses.

**2. System Optimization**  
Evaluation identifies bottlenecks and areas for improvement. By some estimates, the retrieval component accounts for roughly 90% of RAG system quality, so it is usually the first place to look when answers are poor.

**3. Production Readiness**  
Real-world deployment requires confidence in system performance across diverse queries and edge cases.

**4. User Trust and Safety**  
Especially critical in domains like healthcare, legal, or financial services where incorrect information can have serious consequences.

**5. Regulatory Compliance**  
Many industries require documented evaluation processes for AI systems.

**6. Cost Optimization**  
Proper evaluation helps optimize computational resources and reduce operational costs.

---

##### **Consequences of Not Evaluating RAG**

**Immediate Consequences**

**1. Silent Failures**  
```python
# Example: dangerous misinformation goes undetected
query = "What is the recommended dosage for aspirin?"
bad_response = "Take 10 tablets daily"  # Potentially fatal
# Without evaluation, this goes unnoticed until harm occurs
```
**2. Poor User Experience**

- Irrelevant responses to queries  
- Inconsistent quality across topics  
- Slow response times  

**3. Resource Waste**

- Retrieving irrelevant documents  
- Higher token usage due to poor ranking  
- Inefficient system architecture  

---

**Long-term Consequences**

**1. System Degradation**

```python
# Illustrative quality decline over time (hypothetical numbers)
accuracy_by_month = {
    1: 0.85,
    6: 0.65,   # Unnoticed degradation
    12: 0.45,  # System becomes unreliable
}
```
**2. Business Impact**

- Loss of user trust and engagement  
- Increased support costs  
- Legal liability in regulated industries  
- Competitive disadvantage  

**3. Technical Debt**

- Difficult root cause identification  
- Expensive fixes when discovered  
- Increasingly fragile architecture  


#### **RAG Evaluation Types**
#### **1. Retrieval Component Evaluation**

##### **1.1 Context Precision**

**What it is:** Measures whether relevant information appears early in retrieved results.

**When to use:** When users need relevant information quickly (search engines, FAQs).

**Formula:**
Context Precision = (1/|GT|) × Σ(Precision@k × relevance@k)
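
To make the formula concrete, here is the arithmetic for the ranking used in the example implementation: `doc1` to `doc5` are retrieved in that order, and `doc2`, `doc3`, and `doc5` are relevant. Only ranks 2, 3, and 5 hold a relevant document, so only Precision@2, Precision@3, and Precision@5 contribute:

$$
\text{Context Precision} = \frac{1}{3}\left(\frac{1}{2} + \frac{2}{3} + \frac{3}{5}\right) = \frac{1}{3} \cdot \frac{53}{30} \approx 0.589
$$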

##### **Example Implementation:**


In [None]:
def calculate_context_precision(retrieved_docs, ground_truth_docs):
    """Average of Precision@k over the ranks k that hold a relevant document."""
    if not ground_truth_docs:
        return 0.0

    precisions = []
    relevant_count = 0

    for rank, doc in enumerate(retrieved_docs, 1):
        if doc in ground_truth_docs:
            relevant_count += 1
            # Precision@k, recorded only at ranks where a relevant doc appears
            precisions.append(relevant_count / rank)

    # Normalize by |GT| so relevant documents that were never retrieved lower the score
    return sum(precisions) / len(ground_truth_docs)

# Example usage
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
ground_truth = {"doc2", "doc3", "doc5"}
precision = calculate_context_precision(retrieved, ground_truth)
print(f"Context Precision: {precision:.3f}")  # Output: 0.589

**Use Cases:**
* E-commerce product search
* Document retrieval systems
* Knowledge base systems


##### **1.2 Context Recall**

**What it is:** Measures what proportion of relevant information is captured in retrieved context.

**When to use:** When completeness is critical (medical diagnosis, legal research).

**Formula:**
Context Recall = Found GT claims / Total GT claims

##### **Example Implementation:**


In [None]:
import re

# Minimal stop-word list for the keyword heuristic below
STOP_WORDS = {"a", "an", "the", "is", "are", "has", "have",
              "at", "for", "of", "and", "in", "on", "it"}

def extract_key_terms(text):
    # Lowercase word tokens (keeping the degree sign) minus stop words
    tokens = re.findall(r"[\w°]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def calculate_context_recall(retrieved_context, ground_truth_claims):
    found_claims = 0
    context_text = " ".join(retrieved_context)

    for claim in ground_truth_claims:
        if can_infer_claim(claim, context_text):
            found_claims += 1

    return found_claims / len(ground_truth_claims) if ground_truth_claims else 0

def can_infer_claim(claim, context):
    # Simple keyword-based approach (use NLI models in production)
    claim_keywords = extract_key_terms(claim)
    if not claim_keywords:
        return False
    context_lower = context.lower()

    found_keywords = sum(1 for keyword in claim_keywords
                        if keyword in context_lower)

    # A claim counts as covered when at least 70% of its key terms appear
    return (found_keywords / len(claim_keywords)) >= 0.7

# Example usage
retrieved_context = [
    "Water has the chemical formula H2O and is essential for life.",
    "It covers approximately 71% of Earth's surface."
]

ground_truth_claims = [
    "Water has chemical formula H2O",
    "Water boils at 100°C at sea level",
    "Water is essential for life"
]

recall = calculate_context_recall(retrieved_context, ground_truth_claims)
print(f"Context Recall: {recall:.3f}")  # Output: 0.667

**Use Cases:**
* Medical diagnosis systems
* Legal research platforms
* Scientific literature review


##### **1.3 Mean Reciprocal Rank (MRR)**
**What it is:** Measures average reciprocal rank of first relevant document.

**When to use:** When finding at least one highly relevant result quickly matters.

**Formula:**
MRR = (1/|Q|) × Σ(1/rank_i)

##### **Example Implementation:**


In [None]:
def calculate_mrr(query_results):
    reciprocal_ranks = []

    for query, retrieved, relevant in query_results:
        first_relevant_rank = None

        for i, doc in enumerate(retrieved, 1):
            if doc in relevant:
                first_relevant_rank = i
                break

        if first_relevant_rank:
            reciprocal_ranks.append(1.0 / first_relevant_rank)
        else:
            reciprocal_ranks.append(0.0)

    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Example usage
query_results = [
    ("python tutorial", ["doc1", "doc2", "doc3"], {"doc2"}),  # RR = 1/2
    ("machine learning", ["doc5", "doc6", "doc7"], {"doc5"}), # RR = 1/1
]

mrr_score = calculate_mrr(query_results)
print(f"MRR: {mrr_score:.3f}")  # Output: 0.750

**Use Cases:**
* Web search engines
* Question answering systems
* Product recommendations


#### **2. Generation Component Evaluation**



##### **2.1 Faithfulness**

**What it is:** Measures how well generated answers stay true to the provided context.

**When to use:** Critical for accuracy-dependent applications (medical, financial).

**Formula:**
Faithfulness = Verifiable claims / Total claims in answer

##### **Example Implementation:**

In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import re

class FaithfulnessEvaluator:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.threshold = 0.7

    def extract_claims(self, text):
        sentences = re.split(r'[.!?]+', text)
        claims = [s.strip() for s in sentences
                 if len(s.strip()) > 10 and not s.endswith('?')]
        return claims

    def can_infer_claim(self, claim, context):
        claim_embedding = self.model.encode([claim])
        context_sentences = re.split(r'[.!?]+', context)
        context_sentences = [s.strip() for s in context_sentences if len(s.strip()) > 10]

        if not context_sentences:
            return False

        context_embeddings = self.model.encode(context_sentences)
        similarities = cosine_similarity(claim_embedding, context_embeddings)[0]

        return max(similarities) >= self.threshold

    def calculate_faithfulness(self, answer, context):
        claims = self.extract_claims(answer)
        if not claims:
            return 1.0

        faithful_claims = sum(1 for claim in claims
                            if self.can_infer_claim(claim, context))

        return faithful_claims / len(claims)

# Example usage
evaluator = FaithfulnessEvaluator()

context = "Napoleon was defeated at Waterloo on June 18, 1815."
answer = "Napoleon lost the Battle of Waterloo on June 18, 1815."

faithfulness = evaluator.calculate_faithfulness(answer, context)
print(f"Faithfulness: {faithfulness:.3f}")  # Output: 1.000

**Use Cases:**
* Medical information systems
* Financial advisory platforms
* Legal research tools

##### **2.2 Answer Relevance**

**What it is:** Measures how well answers address original questions.

**When to use:** To ensure responses stay on-topic and directly address queries.

**Formula:**
Answer Relevance = Average cosine similarity between original and generated questions

##### **Example Implementation:**

In [None]:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class AnswerRelevanceEvaluator:
    def __init__(self):
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')

    def calculate_answer_relevance(self, original_question, answer):
        # Generate questions from answer (simplified version)
        generated_questions = self.generate_questions_from_answer(answer)

        if not generated_questions:
            return 0.0

        original_embedding = self.encoder.encode([original_question])
        generated_embeddings = self.encoder.encode(generated_questions)

        # One similarity per generated question, then average
        similarities = cosine_similarity(original_embedding, generated_embeddings)[0]
        return float(np.mean(similarities))

    def generate_questions_from_answer(self, answer):
        # Simplified question generation
        # In production, use LLM-based generation
        return [f"What is {answer[:50]}?"]

# Example usage
evaluator = AnswerRelevanceEvaluator()
relevance = evaluator.calculate_answer_relevance(
    "What is water's formula?",
    "Water has the chemical formula H2O"
)
print(f"Answer Relevance: {relevance:.3f}")

**Use Cases:**
* Customer service chatbots
* Educational Q&A systems
* Content recommendation

##### **2.3 Answer Semantic Similarity**

**What it is:** Measures semantic similarity between generated and reference answers.

**When to use:** When you have gold standard answers for comparison.

**Formula:**
Semantic Similarity = cosine_similarity(generated_embedding, reference_embedding)

##### **Example Implementation:**


In [None]:
class SemanticSimilarityEvaluator:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')

    def calculate_similarity(self, generated_answer, reference_answer):
        if not generated_answer.strip() or not reference_answer.strip():
            return 0.0

        generated_embedding = self.model.encode([generated_answer])
        reference_embedding = self.model.encode([reference_answer])

        similarity = cosine_similarity(generated_embedding, reference_embedding)[0][0]
        return float(similarity)

# Example usage
evaluator = SemanticSimilarityEvaluator()
similarity = evaluator.calculate_similarity(
    "H2O is water's chemical formula",
    "Water has the chemical formula H2O"
)
print(f"Semantic Similarity: {similarity:.3f}")  # Output: ~0.85

**Use Cases:**
* Educational assessment
* Content quality control
* Translation evaluation

#### **3. End-to-End Evaluation**

##### **3.1 User Satisfaction**

**What it is:** Direct measurement of user satisfaction with responses.

**When to use:** For production systems where user experience is paramount.

**Example Implementation:**


In [None]:
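from datetime import datetime
import numpy as np

class UserSatisfactionTracker:
    def __init__(self):
        self.feedback_db = []  # In-memory store; use a database in production

    def collect_feedback(self, query, answer, rating, feedback_text=None):
        if not 1 <= rating <= 5:
            raise ValueError("rating must be between 1 and 5")
        feedback = {
            'query': query,
            'answer': answer,
            'rating': rating,  # 1-5 scale
            'feedback_text': feedback_text,
            'timestamp': datetime.now()
        }
        self.feedback_db.append(feedback)

    def calculate_metrics(self):
        ratings = [f['rating'] for f in self.feedback_db]
        if not ratings:
            # Avoid dividing by zero before any feedback has arrived
            return {'average_rating': None, 'satisfaction_rate': None, 'total_responses': 0}
        return {
            'average_rating': float(np.mean(ratings)),
            'satisfaction_rate': sum(1 for r in ratings if r >= 4) / len(ratings),
            'total_responses': len(ratings)
        }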

##### **3.2 Response Time**

**What it is:** Time measurement for retrieval and generation.

**When to use:** When response speed affects user experience.

**Example Implementation:**


In [None]:
import time

class ResponseTimeEvaluator:
    def measure_response_time(self, rag_system, query):
        # perf_counter is a monotonic, high-resolution clock suited to timing code
        start_time = time.perf_counter()

        # rag_system is assumed to expose retrieve() and generate()
        context = rag_system.retrieve(query)
        retrieval_done = time.perf_counter()

        answer = rag_system.generate(query, context)
        generation_done = time.perf_counter()

        return {
            'total_time': generation_done - start_time,
            'retrieval_time': retrieval_done - start_time,
            'generation_time': generation_done - retrieval_done
        }
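
A single timing is noisy, so latency is usually reported as percentiles over many queries. The sketch below is one way to aggregate measurements such as the `total_time` values returned above; the function name `summarize_latencies` and the nearest-rank percentile method are choices made for this illustration, not part of any library.

```python
import math

def summarize_latencies(samples_s):
    """Return p50, p95, and max (in seconds) for a list of latency samples."""
    if not samples_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_s)

    def percentile(p):
        # Nearest-rank method: smallest sample with at least p% of samples at or below it
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    return {"p50": percentile(50), "p95": percentile(95), "max": ordered[-1]}

# Hypothetical total_time values collected from ten queries
samples = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
summary = summarize_latencies(samples)
print(summary)
```

Tracking p95 alongside the median helps surface the slow tail that an average would hide.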

#### **Hallucination in RAG Systems**

---

##### **What is Hallucination?**  
Hallucination occurs when RAG systems generate information that is not supported by the retrieved context or factual knowledge.

---

##### **Why Check for Hallucinations?**  
- **Safety**: Prevents harmful misinformation  
- **Trust**: Maintains user confidence  
- **Compliance**: Meets regulatory requirements  
- **Quality**: Ensures system reliability  

---

##### **Types of Hallucinations**

---

**1. Factual Hallucinations**  
Stating false facts that the context does not support.  

```python
# Example
context = "Paris is the capital of France"
query = "What is the population of Paris?"
hallucinated_answer = "Paris has 15 million people"  # Not in context
```
**2. Contextual Hallucinations**  
Adding information, even if true, that is absent from the retrieved context.

```python
# Example
context = "Water boils at 100°C"
query = "What is water's freezing point?"
hallucinated_answer = "Water freezes at 0°C"  # Not in provided context
```
**3. Logical Hallucinations**  
Contradicting the context or reasoning illogically from it.

```python
# Example
context = "The meeting is scheduled for Monday"
query = "When is the meeting?"
hallucinated_answer = "The meeting is on Tuesday"  # Contradicts context
```


##### **How to Check for Hallucinations**
**1. Claim-Level Verification**


In [None]:
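def detect_hallucinations(answer, context, checker=None):
    """Return the fraction of claims in `answer` that `context` does not support."""
    # Reuses FaithfulnessEvaluator (section 2.1) to split claims and test support
    checker = checker or FaithfulnessEvaluator()
    claims = checker.extract_claims(answer)
    hallucinations = []

    for claim in claims:
        if not checker.can_infer_claim(claim, context):
            hallucinations.append(claim)

    return len(hallucinations) / len(claims) if claims else 0

# Example usage
# Note: embedding similarity can miss purely numeric contradictions like this one;
# an NLI model is a more reliable support check in production
context = "Water boils at 100°C at sea level"
answer = "Water boils at 90°C and freezes at -5°C"
hallucination_rate = detect_hallucinations(answer, context)
print(f"Hallucination Rate: {hallucination_rate:.3f}")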

**2. Consistency Checking**

In [None]:
from itertools import combinations

def check_consistency(rag_system, query, are_contradictory, n_runs=3):
    # rag_system.generate_answer and are_contradictory (for example, an NLI-based
    # comparison of two answers) are placeholders you must supply
    answers = [rag_system.generate_answer(query) for _ in range(n_runs)]

    # Compare every unordered pair of answers
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0  # Fewer than two answers: nothing to contradict

    contradictions = sum(1 for a, b in pairs if are_contradictory(a, b))
    consistency_score = 1 - (contradictions / len(pairs))
    return consistency_score

**3. External Knowledge Validation**

In [None]:
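def validate_against_knowledge_base(answer, knowledge_base,
                                    extract_claims, is_supported_by_kb):
    # extract_claims(text) -> list of claims and
    # is_supported_by_kb(claim, knowledge_base) -> bool are placeholders you must
    # supply, e.g. FaithfulnessEvaluator().extract_claims from section 2.1 and a
    # lookup against your own knowledge base
    claims = extract_claims(answer)
    validated_claims = 0

    for claim in claims:
        if is_supported_by_kb(claim, knowledge_base):
            validated_claims += 1

    return validated_claims / len(claims) if claims else 1.0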

**Hallucination Detection Implementation**

In [None]:
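import numpy as np

class HallucinationDetector:
    """Skeleton detector: each check returns a score in [0, 1] (1 = fully supported)
    or None while it is still unimplemented."""

    def __init__(self, fact_checker=None, consistency_checker=None):
        # Optional components that this guide does not define; pass your own
        self.fact_checker = fact_checker
        self.consistency_checker = consistency_checker
        self.grounding = FaithfulnessEvaluator()  # Defined in section 2.1

    def detect_hallucinations(self, query, answer, context):
        results = {
            'factual_accuracy': self.check_factual_accuracy(answer, context),
            'contextual_grounding': self.check_contextual_grounding(answer, context),
            'logical_consistency': self.check_logical_consistency(answer),
            'external_validation': self.validate_externally(answer)
        }

        # Combine scores, skipping checks that are not implemented yet
        scores = [s for s in results.values() if s is not None]
        overall_score = float(np.mean(scores)) if scores else 1.0
        results['overall_hallucination_score'] = 1 - overall_score

        return results

    def check_factual_accuracy(self, answer, context):
        # Placeholder: delegate to self.fact_checker once one is available
        return None

    def check_contextual_grounding(self, answer, context):
        # Share of answer claims supported by the retrieved context
        return self.grounding.calculate_faithfulness(answer, context)

    def check_logical_consistency(self, answer):
        # Placeholder: delegate to self.consistency_checker once one is available
        return None

    def validate_externally(self, answer):
        # Placeholder: cross-check claims against an external source
        return None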

#### **Common Production Failures in RAG Systems**

---

##### **1. Context Length Limitations**  
**Problem:** Retrieved context exceeds model token limits, causing truncation or processing failures.  
**Real-world Example:** A legal document search system retrieves 10 relevant case studies totaling 50,000 tokens, but the model only accepts 32,000 tokens, resulting in important context being cut off mid-sentence.  
**Solutions:**
- **Intelligent Context Ranking:** Prioritize retrieved chunks based on semantic similarity to the query  
- **Hierarchical Summarization:** Create multi-level summaries of lengthy documents  
- **Context Windowing:** Use overlapping context windows to maintain coherence  
- **Dynamic Truncation:** Cut at natural break points (sentence/paragraph boundaries) rather than arbitrary token limits  
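
The ranking and truncation ideas above can be combined into a simple budgeted selection. The following is a minimal sketch, assuming each retrieved chunk already carries a relevance score. The function name `pack_context` is hypothetical, and the whitespace word count is only a rough stand-in for the model's real tokenizer.

```python
def pack_context(chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Greedily keep the highest-scoring chunks that fit within max_tokens.

    chunks: list of (score, text) pairs, where a higher score means more relevant.
    count_tokens: whitespace word count by default (an approximation only).
    """
    selected, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > max_tokens:
            continue  # Drop the whole chunk instead of cutting it mid-sentence
        selected.append(text)
        used += cost
    return selected, used

# Hypothetical scored chunks from a legal search
chunks = [
    (0.91, "Case A held that the clause was enforceable."),
    (0.40, "Unrelated procedural history of an earlier filing."),
    (0.85, "Case B narrowed the rule to commercial leases."),
]
context, used = pack_context(chunks, max_tokens=16)
print(used, context)
```

Dropping whole chunks keeps every included passage intact, at the cost of sometimes leaving part of the budget unused.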

---

##### **2. Retrieval Quality Degradation**  
**Problem:** Gradual decline in retrieval accuracy over time due to index corruption, embedding drift, or changing data patterns.  
**Real-world Example:** A customer support RAG system starts returning increasingly irrelevant articles over months, causing customer satisfaction scores to drop from 85% to 60%.  
**Solutions:**
- **Automated Quality Benchmarks:** Run daily tests against golden question-answer pairs  
- **Retrieval Metrics Monitoring:** Track precision@k, recall@k, and NDCG scores  
- **A/B Testing Framework:** Compare current performance against baseline models  
- **Periodic Reindexing:** Schedule regular complete index rebuilds  
- **Embedding Model Updates:** Monitor for newer, better-performing embedding models  
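
The retrieval metrics named above can be computed directly from ranked results and a set of known-relevant documents. This sketch uses binary relevance (a document is either relevant or not). The function names are illustrative and not taken from a particular library.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of this ranking divided by the DCG of an ideal one."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], 1) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = {"doc2", "doc3", "doc5"}
print(f"precision@3: {precision_at_k(retrieved, relevant, 3):.3f}")  # 0.667
print(f"recall@3:    {recall_at_k(retrieved, relevant, 3):.3f}")     # 0.667
print(f"NDCG@3:      {ndcg_at_k(retrieved, relevant, 3):.3f}")       # 0.531
```

Logging these values on a fixed golden query set each day makes a gradual decline visible long before users report it.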

---

##### **3. Hallucination in Production**  
**Problem:** Model generates plausible but false information not supported by retrieved context.  
**Real-world Example:** A medical information RAG system confidently states "Drug X is approved for treating condition Y" when the retrieved context only mentions ongoing clinical trials.  
**Solutions:**
- **Multi-Generation Validation:** Generate multiple answers and check for consistency  
- **Context Grounding Checks:** Verify each claim can be traced back to source material  
- **Confidence Scoring:** Return uncertainty indicators with answers  
- **Fallback Responses:** Use "I don't know" templates when confidence is low  
- **External Fact Verification:** Cross-reference critical claims with authoritative sources  
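
Confidence scoring and fallback responses can be wired together with a simple threshold. The sketch below assumes a grounding score in [0, 1] is already available (for example, the faithfulness score from section 2.1). The function name, the fallback wording, and the 0.8 threshold are illustrative assumptions to tune for your domain.

```python
FALLBACK = "I don't know based on the available sources."

def answer_with_fallback(answer, grounding_score, threshold=0.8):
    """Return the answer only when its grounding score clears the threshold."""
    if grounding_score < threshold:
        # Low confidence: replace the answer but keep the score for logging
        return {"answer": FALLBACK, "confidence": grounding_score, "fallback": True}
    return {"answer": answer, "confidence": grounding_score, "fallback": False}

# Hypothetical scores for an unsupported and a well-grounded answer
risky = answer_with_fallback("Drug X is approved for treating condition Y.", 0.35)
safe = answer_with_fallback("Drug X is in ongoing clinical trials.", 0.93)
print(risky["answer"])
print(safe["answer"])
```

Returning the score alongside the text lets the interface show an uncertainty indicator even when the answer passes.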

---

##### **4. Performance Bottlenecks**  
**Problem:** Slow response times affecting user experience, especially during peak usage.  
**Real-world Example:** An e-commerce product recommendation system takes 8–12 seconds to respond during Black Friday traffic, causing 40% of users to abandon their searches.  
**Solutions:**
- **Multi-Layer Caching:** Cache frequent queries, embeddings, and intermediate results  
- **Async Processing:** Use background workers for non-critical operations  
- **Vector Database Optimization:** Implement approximate nearest neighbor search  
- **Load Balancing:** Distribute queries across multiple model instances  
- **Pre-computation:** Generate answers for common questions in advance  
- **Response Streaming:** Send partial results while processing continues  
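
As one example of the caching layer, query embeddings can be memoized so that repeated or trivially different queries skip the embedding model. This is a minimal in-process sketch using `functools.lru_cache`. The `embed_query` function here is a stand-in for a real embedding call, and a shared cache such as Redis would be needed across multiple servers.

```python
from functools import lru_cache

call_count = {"embed": 0}

def embed_query(text):
    # Stand-in for an expensive embedding call (hypothetical, for illustration)
    call_count["embed"] += 1
    return tuple(float(ord(ch)) for ch in text[:4])

def normalize(query):
    # Collapse case and whitespace so near-identical queries share a cache entry
    return " ".join(query.lower().split())

@lru_cache(maxsize=1024)
def _cached_embedding(normalized_query):
    return embed_query(normalized_query)

def get_embedding(query):
    return _cached_embedding(normalize(query))

first = get_embedding("What is RAG?")
second = get_embedding("  what is   RAG? ")
print(call_count["embed"])  # 1: the second call was served from the cache
```

The cached value is a tuple because `lru_cache` hands back the same object to every caller, so it should be immutable.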

---

##### **5. Data Drift and Knowledge Staleness**  
**Problem:** Knowledge base becomes outdated, leading to incorrect or obsolete information.  
**Real-world Example:** A financial advisory RAG system continues recommending investment strategies based on pre-pandemic market conditions, missing crucial economic changes.  
**Solutions:**
- **Content Freshness Tracking:** Monitor document ages and update frequencies  
- **Automated Source Monitoring:** Check original sources for changes  
- **Version Control:** Maintain document versioning and change logs  
- **Incremental Updates:** Add new information without full reindexing  
- **Deprecation Workflows:** Mark and phase out outdated content  
- **Real-time Data Integration:** Connect to live data feeds for dynamic information  
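
Content freshness tracking can start as a scheduled scan over document metadata. The sketch below assumes each document record has an `id` and a timezone-aware `last_updated` timestamp. Both field names and the one-year cutoff are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def find_stale_documents(documents, max_age_days, now=None):
    """Return ids of documents last updated more than max_age_days ago."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [doc["id"] for doc in documents if doc["last_updated"] < cutoff]

# Hypothetical knowledge-base metadata
documents = [
    {"id": "rates-2019", "last_updated": datetime(2019, 6, 1, tzinfo=timezone.utc)},
    {"id": "rates-2023", "last_updated": datetime(2023, 11, 15, tzinfo=timezone.utc)},
]
# A fixed "now" keeps the example reproducible
stale = find_stale_documents(
    documents, max_age_days=365, now=datetime(2024, 1, 1, tzinfo=timezone.utc)
)
print(stale)  # ['rates-2019']
```

Stale ids can then feed the deprecation workflow or trigger a check of the original source.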

---

##### **6. Security and Privacy Violations**  
**Problem:** Sensitive information leakage through retrieved context or generated responses.  
**Real-world Example:** An HR chatbot accidentally reveals salary information from one employee's query to another employee asking about company policies.  
**Solutions:**
- **Access Control Integration:** Implement user-based document filtering  
- **PII Detection and Masking:** Automatically redact sensitive information  
- **Audit Logging:** Track all queries and responses for compliance  
- **Data Classification:** Tag documents with sensitivity levels  
- **Response Sanitization:** Remove potentially sensitive information from outputs  
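
Two of these controls are easy to sketch: filtering retrieved documents by the user's groups before generation, and masking obvious PII in the output. The `allowed_groups` field and the two regular expressions below are simplified assumptions. Production systems typically need a dedicated PII detection service and far broader patterns.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # US-style SSN, illustration only

def filter_by_access(documents, user_groups):
    """Keep only documents whose allowed_groups overlap the user's groups."""
    groups = set(user_groups)
    return [doc for doc in documents if groups & set(doc["allowed_groups"])]

def mask_pii(text):
    """Replace email addresses and SSN-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)

# Hypothetical HR knowledge base
documents = [
    {"id": "leave-policy", "allowed_groups": ["all-staff"]},
    {"id": "salary-bands", "allowed_groups": ["hr"]},
]
visible = filter_by_access(documents, user_groups=["all-staff"])
masked = mask_pii("Contact jane.doe@example.com, SSN 123-45-6789.")
print([doc["id"] for doc in visible])  # ['leave-policy']
print(masked)                          # Contact [EMAIL], SSN [SSN].
```

Filtering before generation matters most: a document the model never sees cannot leak into the answer.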

---

##### **7. Inconsistent Answer Quality**  
**Problem:** Wide variation in answer quality across different topics or query types.  
**Real-world Example:** A technical documentation RAG system excels at API questions but performs poorly on conceptual architecture queries, creating user confusion about system reliability.  
**Solutions:**
- **Domain-Specific Models:** Use specialized models for different content types  
- **Quality Scoring Frameworks:** Implement consistent evaluation metrics  
- **Feedback Loop Integration:** Learn from user ratings and corrections  
- **Content Gap Analysis:** Identify and fill knowledge base weaknesses  
- **Answer Template Systems:** Provide structured response formats for consistency  
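
Content gap analysis often starts by breaking evaluation scores down by query category, since an overall average hides weak areas. The sketch below assumes each evaluation record has a `category` label and a `score` in [0, 1]. The field names and the 0.7 threshold are illustrative.

```python
from collections import defaultdict
from statistics import mean

def scores_by_category(records):
    """Average evaluation score per query category."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["category"]].append(record["score"])
    return {category: mean(scores) for category, scores in grouped.items()}

def weak_categories(records, min_score=0.7):
    """Categories whose average score falls below min_score, sorted by name."""
    averages = scores_by_category(records)
    return sorted(cat for cat, avg in averages.items() if avg < min_score)

# Hypothetical evaluation results for a documentation assistant
records = [
    {"category": "api", "score": 0.9},
    {"category": "api", "score": 0.8},
    {"category": "architecture", "score": 0.5},
    {"category": "architecture", "score": 0.6},
]
weak = weak_categories(records)
print(weak)  # ['architecture']
```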


#### **Evaluation Tools & Frameworks**

##### **1. RAGAS (RAG Assessment)**

**Purpose:** Comprehensive RAG evaluation framework

**Implementation:**

In [None]:
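from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# your_test_dataset is a placeholder for your own evaluation set, typically a
# Hugging Face Dataset holding questions, generated answers, retrieved contexts,
# and reference answers. Column names and the shape of the result object differ
# between ragas versions, so check the docs for the version you have installed.
# These metrics call an LLM, so API credentials must be configured first.

# Evaluate your RAG system
result = evaluate(
    dataset=your_test_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision]
)

print(f"Faithfulness: {result['faithfulness']:.3f}")
print(f"Answer Relevancy: {result['answer_relevancy']:.3f}")
print(f"Context Precision: {result['context_precision']:.3f}")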

**Features:**
* Multiple evaluation metrics
* Automated assessment
* Integration with popular RAG frameworks


##### **2. DeepEval**

**Purpose:** Modular evaluation framework for LLM applications

**Implementation:**


In [None]:
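from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# One evaluation example; in practice the answer and context come from your RAG pipeline
test_case = LLMTestCase(
    input="What is water's formula?",
    actual_output="Water has the chemical formula H2O",
    retrieval_context=["Water has the chemical formula H2O and is essential for life."]
)

faithfulness_metric = FaithfulnessMetric(threshold=0.8)
relevancy_metric = AnswerRelevancyMetric(threshold=0.7)

evaluate(
    test_cases=[test_case],
    metrics=[faithfulness_metric, relevancy_metric]
)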

**Features:**
* Customizable metrics
* Real-time evaluation
* Detailed reporting

##### **3. TruLens**

**Purpose:** Real-time evaluation and monitoring for LLM applications

**Implementation:**


In [None]:
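from trulens_eval import TruLlama

# TruLlama wraps a LlamaIndex app; your_rag_app is a placeholder for your own
# query engine. Newer TruLens releases are published under a different package
# name, so the import path may differ for your installed version.

# Wrap your RAG app
tru_rag = TruLlama(your_rag_app)

# Automatic logging and evaluation
with tru_rag as recording:
    response = your_rag_app.query("Your question")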

**Features:**
* Production monitoring
* Interactive dashboards
* Feedback collection