# RAG Workshop - Naive RAG Challenges

This notebook demonstrates the key limitations of naive RAG systems using our extended Wikipedia dataset. We'll focus on scenarios that clearly show where naive RAG fails and why advanced techniques are necessary.

## Dataset Overview:

- **61 articles**: Wikipedia entries, long technical blog posts by Lilian Weng, and arXiv papers
- **1,210 pre-chunked pieces**: 300-character chunks with a 50-character overlap
- **Pre-embedded** using OpenAI text-embedding-3-small
- **Cloud-hosted** on Qdrant for reliable access
- **Includes cross-domain articles** to demonstrate naive RAG limitations
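
To make the chunking parameters concrete, the sketch below splits text into 300-character windows with a 50-character overlap. It assumes plain fixed-size character windows; the workshop's ingestion scripts are not shown in this notebook and may split differently (for example, on sentence boundaries). The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=300, overlap=50):
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; not the workshop's ingestion code.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # with the defaults, windows start 250 characters apart
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "x" * 700
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

Short chunks like these keep embeddings focused, but they also mean a single chunk rarely carries enough surrounding context to disambiguate a term on its own.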

## 1. Connect to Pre-Populated Qdrant Cloud Collection

Instead of fetching and processing data, we'll connect directly to a pre-populated Qdrant Cloud collection containing the extended Wikipedia dataset.

**Note**: The ingestion process has already been completed using our automated scripts!

### Please use the API key provided by the instructor to access the pre-uploaded collection

In [1]:
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

True
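
Before initializing the clients, it can help to confirm that the `.env` file actually supplied the expected variables. The sketch below assumes three names: `QDRANT_URL` and `QDRANT_API_KEY` are read explicitly in the next cell, and `OPENAI_API_KEY` is the variable the OpenAI client reads by default when no key is passed. The helper `missing_env_vars` is hypothetical.

```python
import os

# Assumed variable names; adjust if your .env file uses different ones.
REQUIRED_VARS = ["OPENAI_API_KEY", "QDRANT_URL", "QDRANT_API_KEY"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```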

### 1.1. Initialize Clients

In [2]:
import os
from openai import OpenAI
from qdrant_client import QdrantClient

# Initialize OpenAI client
openai_client = OpenAI()

# Initialize Qdrant Cloud client
qdrant_client = QdrantClient(
    url=os.getenv("QDRANT_URL"),
    api_key=os.getenv("QDRANT_API_KEY")
)

# Collection configuration
collection_name = "workshop_wikipedia_extended"
embedding_model = "text-embedding-3-small"

print(f"✅ Connected to Qdrant Cloud")
print(f"📚 Collection: {collection_name}")
print(f"🤖 Embedding model: {embedding_model}")

✅ Connected to Qdrant Cloud
📚 Collection: workshop_wikipedia_extended
🤖 Embedding model: text-embedding-3-small


### 1.2. Verify Collection and Dataset

In [3]:
# Get collection information
collection_info = qdrant_client.get_collection(collection_name)
point_count = collection_info.points_count

print(f"📊 Collection Statistics:")
print(f"   Total chunks: {point_count:,}")
print(f"   Vector dimension: {collection_info.config.params.vectors.size}")
print(f"   Distance metric: {collection_info.config.params.vectors.distance}")

# Sample a few points to see the data structure
sample_points = qdrant_client.scroll(
    collection_name=collection_name,
    limit=3,
    with_payload=True,
    with_vectors=False
)[0]

print(f"\n📝 Sample data structure:")
for i, point in enumerate(sample_points):
    payload = point.payload
    print(f"\nChunk {i+1}:")
    print(f"   Title: {payload.get('title', 'Unknown')}")
    print(f"   Text preview: {payload.get('text', '')[:100]}...")
    print(f"   Chunk {payload.get('chunk_index', 0)+1} of {payload.get('total_chunks', 0)}")

📊 Collection Statistics:
   Total chunks: 1,210
   Vector dimension: 1536
   Distance metric: Cosine

📝 Sample data structure:

Chunk 1:
   Title: BERT (language model)
   Text preview: Bidirectional encoder representations from transformers (BERT) is a language model introduced in Oct...
   Chunk 1 of 10

Chunk 2:
   Title: BERT (language model)
   Text preview: Euclidean space. Encoder: a stack of Transformer blocks with self-attention, but without causal mask...
   Chunk 2 of 10

Chunk 3:
   Title: BERT (language model)
   Text preview: consists of a sinusoidal function that takes the position in the sequence as input. Segment type: Us...
   Chunk 3 of 10


## 2. Build the Q/A Chatbot

Now we can focus on the core RAG functionality without worrying about data preparation!

![../imgs/naive-rag.png](../imgs/naive-rag.png)

### 2.1. Retrieval - Search the cloud database for relevant embeddings

In [4]:
def vector_search(query, top_k=2):
    """Search the Qdrant Cloud collection for relevant chunks."""
    # Create embedding of the query
    response = openai_client.embeddings.create(
        input=query,
        model=embedding_model
    )
    query_embeddings = response.data[0].embedding
    
    # Similarity search using the embedding
    search_result = qdrant_client.query_points(
        collection_name=collection_name,
        query=query_embeddings,
        with_payload=True,
        limit=top_k,
    ).points
    
    return [result.payload for result in search_result]
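
Conceptually, the similarity search ranks stored vectors by how close they are to the query vector; this collection uses the cosine metric. The toy sketch below reproduces that ranking in plain Python over made-up 3-dimensional vectors. It assumes a brute-force comparison, whereas Qdrant typically relies on an approximate index, and the names `cosine_similarity` and `top_k_by_cosine` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query, vectors, k=2):
    """Return the keys of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Made-up toy vectors, not real embeddings
toy_index = {
    "ml_optimization": [0.9, 0.1, 0.0],
    "math_optimization": [0.7, 0.7, 0.0],
    "cooking": [0.0, 0.1, 0.9],
}
print(top_k_by_cosine([1.0, 0.2, 0.0], toy_index, k=2))
```

Note that the ranking depends only on vector geometry: two chunks about different senses of "optimization" can both score highly, which is the behavior the later confusion tests probe.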

### 2.2. Generation - Use retrieved chunks to generate answers

In [None]:
import json

def model_generate(prompt, model="gpt-4o"):
    """Generate response using OpenAI's chat completion."""
    messages = [{"role": "user", "content": prompt}]
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,  # Deterministic output
    )
    return response.choices[0].message.content

def prompt_template(question, context):
    """Create a prompt template for RAG."""
    return f"""You are an AI Assistant that provides answers to questions based on the following context. 
Make sure to only use the context to answer the question. Keep the wording very close to the context.

Context:
```
{json.dumps(context)}
```

User question: {question}

Answer in markdown:"""

def generate_answer(question):
    """Complete RAG pipeline: retrieve and generate."""
    # Retrieval: search the knowledge base
    search_result = vector_search(question)
    if not search_result:
        return "No relevant information found."

    # Generation: create prompt and generate answer
    prompt = prompt_template(question, search_result)
    return model_generate(prompt)

### 2.3. Test Basic RAG Functionality

In [None]:
# Test with a clear, unambiguous question first
question = "What does the word 'deep' in 'deep learning' refer to?"
search_result = vector_search(question, top_k=3)

print(f"🔍 Question: {question}")
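print(f"\n📚 Retrieved Sources:")
for i, result in enumerate(search_result):
    print(f"{i+1}. {result.get('title', 'Unknown')}: {result.get('text', '')[:100]}...")

# Generate answer (generate_answer runs its own retrieval with the default top_k=2)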
answer = generate_answer(question)
print(f"\n🤖 Generated Answer:")
print(answer)

🔍 Question: What is the tokenizer of BERT, which is a sub-word strategy like byte-pair encoding?

📚 Retrieved Sources:

🤖 Generated Answer:
```markdown
The tokenizer of BERT is WordPiece, which is a sub-word strategy like byte-pair encoding.
```


## 3. Demonstrating Naive RAG Limitations

Now let's test scenarios where naive RAG fails due to **terminology overlap** and **cross-domain confusion**. We'll focus only on cases that clearly demonstrate confusion.

### 🧪 Confusion Test 1: Optimization Context Confusion

This question should demonstrate confusion between mathematical optimization and machine learning optimization.

In [9]:
confusion_question_1 = "What is optimization in machine learning?"

print(f"🧪 CONFUSION TEST 1: {confusion_question_1}")
print("=" * 60)

# Get detailed search results
results = vector_search(confusion_question_1, top_k=5)

print(f"📚 Retrieved Sources:")
optimization_contexts = set()
for i, result in enumerate(results):
    title = result.get('title', 'Unknown')
    text = result.get('text', '')
    text_preview = text[:120]
    print(f"\n{i+1}. {title}")
    print(f"   Preview: {text_preview}...")
    
    # Categorize optimization contexts
    if any(term in title.lower() for term in ['mathematical optimization', 'optimization (mathematics)']):
        optimization_contexts.add('Mathematical Optimization')
    elif any(term in text.lower() for term in ['machine learning', 'neural network', 'gradient']):
        optimization_contexts.add('ML Optimization')
    elif any(term in text.lower() for term in ['evolutionary', 'genetic algorithm']):
        optimization_contexts.add('Evolutionary Optimization')
    elif 'optimization' in title.lower():
        optimization_contexts.add('General Optimization')

print(f"\n🔍 Analysis:")
print(f"   Optimization contexts found: {', '.join(optimization_contexts) if optimization_contexts else 'Mixed/Unclear'}")

if len(optimization_contexts) > 1:
    print(f"   ⚠️  CONFUSION DETECTED: Multiple optimization contexts mixed")
else:
    print(f"   ✅ Results focused on single optimization context")

# Generate answer
answer = generate_answer(confusion_question_1)
print(f"\n🤖 Generated Answer:")
print(answer)

🧪 CONFUSION TEST 1: What is optimization in machine learning?
📚 Retrieved Sources:

1. Mathematical optimization
   Preview: Mathematical optimization (alternatively spelled optimisation) or mathematical programming is the selection of a best el...

2. Optimization (mathematics)
   Preview: Mathematical optimization (alternatively spelled optimisation) or mathematical programming is the selection of a best el...

3. Artificial neural network
   Preview: If after learning, the error rate is too high, the network typically must be redesigned. Practically this is done by def...

4. Mathematical optimization
   Preview: recent and growing subset of this field is multidisciplinary design optimization, which, while useful in many problems, ...

5. Optimization (mathematics)
   Preview: recent and growing subset of this field is multidisciplinary design optimization, which, while useful in many problems, ...

🔍 Analysis:
   Optimization contexts found: Mathematical Optimization, ML Optimizati

## 4. RAG Evaluation with RAGAS

Now let's evaluate our naive RAG system using **RAGAS** to establish baseline performance metrics and quantify the confusion we've observed.

### Context-Focused Metrics:

1. **Context Precision**: How well are relevant chunks ranked at the top?
2. **Context Recall**: How much of the necessary information was retrieved?
3. **Context Relevancy**: How relevant is the retrieved context to the question?

We use **RAGAS** because it is purpose-built for RAG evaluation and gives detailed insight into context quality, the most critical component of RAG performance.
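
To give a feel for what these scores measure, the sketch below computes simplified versions of context precision and context recall from hand-labeled judgments. This is an illustration under assumptions: RAGAS derives the relevance and support judgments with an LLM rather than from manual labels, and the function names here are hypothetical.

```python
def context_precision(relevance):
    """Average of precision@k over the ranks k that hold a relevant chunk.

    `relevance` is a ranked list of 0/1 labels, best-ranked chunk first.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision at this rank
    return total / hits if hits else 0.0


def context_recall(supported):
    """Fraction of reference-answer statements supported by the retrieved context."""
    return sum(supported) / len(supported) if supported else 0.0


# Relevant chunks at ranks 1 and 3, irrelevant chunk at rank 2
print(round(context_precision([1, 0, 1]), 3))
# Three of four reference statements are covered by the retrieved chunks
print(context_recall([True, True, False, True]))
```

A relevant chunk ranked lower pulls context precision down even when recall is unchanged, which is why the two scores are reported separately.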

In [10]:
# Import the RAGAS evaluation utility
from rag_evaluator_v2 import evaluate_naive_rag_v2

# Run evaluation on the current RAG system using RAGAS
print("🔍 Evaluating your Naive RAG system with RAGAS...")
print("This will evaluate context quality metrics on 14 questions...\n")

baseline_results = evaluate_naive_rag_v2(
    vector_search_func=vector_search,
    generate_answer_func=generate_answer
)

🔍 Evaluating your Naive RAG system with RAGAS...
This will evaluate context quality metrics on 14 questions...

✅ Loaded 14 questions from evaluation dataset

Evaluating 14 questions...

Question 1/14: Who introduced the ReLU (rectified linear unit) ac...
Question 2/14: What was the first working deep learning algorithm...
Question 3/14: Which CNN achieved superhuman performance in a vis...
Question 4/14: When was BERT introduced and by which organization...
Question 5/14: What are the two model sizes BERT was originally i...
Question 6/14: What percentage of tokens are randomly selected fo...
Question 7/14: Who introduced the term 'deep learning' to the mac...
Question 8/14: Which three researchers were awarded the 2018 Turi...
Question 9/14: When was the first GPT introduced and by which org...
Question 10/14: What were the three parameter sizes of the first v...
Question 11/14: What is the 'one in ten rule' in regression analys...
Question 12/14: What is the essence of overfitting a


RAGAS EVALUATION RESULTS

CONTEXT RECALL METRIC (0.0 - 1.0 scale):
  🟡 Context Recall: 0.786


## 5. Confusion Analysis Summary

Let's analyze the patterns of confusion we've observed and quantify the impact on RAG performance.

In [11]:
print("📊 NAIVE RAG CONFUSION ANALYSIS")
print("=" * 50)

# Test our key confusion scenarios
confusion_tests = [
    ("What are agents in AI systems?", "Should focus on AI/ML agents"),
    ("What is the architecture of transformer models?", "Should focus on transformer architecture"),
    ("What is active learning in machine learning?", "Should focus on active learning"),
    ("What is optimization in machine learning?", "Should focus on ML optimization")
]

confusion_detected = 0
total_tests = len(confusion_tests)

for question, expected_focus in confusion_tests:
    results = vector_search(question, top_k=3)
    titles = [r.get('title', '') for r in results]
    
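    # Heuristic domain check: bucket each retrieved title by keyword.
    # A Wikipedia title containing e.g. "transformer" lands in its own bucket,
    # so "mixed domains" here means mixed keyword buckets, not mixed sources.
    domains_found = set()
    for title in titles:
        title_lower = title.lower()
        # Long-form sources (blog posts, papers) are identified by markers in the title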
        if 'lilianweng' in title_lower:
            domains_found.add('Lilian Weng Blog')
        elif 'arxiv' in title_lower:
            domains_found.add('arXiv Papers')
        elif any(term in title_lower for term in ['gpt', 'llama', 'bert']):
            domains_found.add('Model-Specific Articles')
        elif any(term in title_lower for term in ['transformer', 'attention']):
            domains_found.add('Transformer Articles')
        elif any(term in title_lower for term in ['optimization', 'mathematical']):
            domains_found.add('Optimization Articles')
        else:
            domains_found.add('Wikipedia Articles')
    
    print(f"\n❓ Question: {question}")
    print(f"   Expected: {expected_focus}")
    print(f"   Retrieved from: {', '.join(titles)}")
    print(f"   Domains found: {', '.join(domains_found)}")
    
    if len(domains_found) > 1:
        confusion_detected += 1
        print(f"   ⚠️  CONFUSION: Mixed domains detected")
    else:
        print(f"   ✅ Focused results")

print(f"\n📈 CONFUSION SUMMARY:")
print(f"   Confused queries: {confusion_detected}/{total_tests} ({confusion_detected/total_tests*100:.1f}%)")

if confusion_detected >= total_tests * 0.5:
    print(f"\n🎯 SIGNIFICANT CONFUSION DETECTED!")
    print(f"   The extended dataset successfully demonstrates naive RAG limitations.")
else:
    print(f"\n✅ Limited confusion observed.")
    print(f"   Results suggest the current dataset may need refinement.")

📊 NAIVE RAG CONFUSION ANALYSIS

❓ Question: What are agents in AI systems?
   Expected: Should focus on AI/ML agents
   Retrieved from: Artificial intelligence, History of artificial intelligence, Artificial intelligence
   Domains found: Wikipedia Articles
   ✅ Focused results

❓ Question: What is the architecture of transformer models?
   Expected: Should focus on transformer architecture
   Retrieved from: Transformer (machine learning model), Transformer (machine learning model), Artificial neural network
   Domains found: Wikipedia Articles, Transformer Articles
   ⚠️  CONFUSION: Mixed domains detected

❓ Question: What is active learning in machine learning?
   Expected: Should focus on active learning
   Retrieved from: Artificial intelligence, Deep learning, Ensemble learning
   Domains found: Wikipedia Articles
   ✅ Focused results

❓ Question: What is optimization in machine learning?
   Expected: Should focus on ML optimization
   Retrieved from: Optimization (mathematics), 

## 6. Summary: Why We Need Advanced RAG Techniques

Through our focused evaluation with the extended dataset, we've identified key limitations of naive RAG:

### 📋 Key Findings:

1. **Terminology Overlap**: Simple vector similarity struggles with terms that have different meanings across domains ("agent", "model", "learning", "optimization")

2. **Context Disambiguation**: The system cannot distinguish between semantically similar but contextually different concepts

3. **Domain Boundary Issues**: Cross-domain vocabulary creates confusion when the same terms appear in different technical contexts

4. **Ranking Limitations**: The most relevant chunk is not always ranked first, because ranking relies on embedding similarity alone

5. **Scale Challenges**: Retrieval quality tends to degrade as the knowledge base grows larger and more diverse, since more chunks share overlapping terminology

### 🎯 Next Steps:

These limitations motivate the need for **Advanced RAG techniques**:

- **Reranking**: Improve relevance of retrieved chunks using cross-encoders that understand context better
- **Hybrid Search**: Combine semantic and keyword search for better coverage and precision
- **Query Enhancement**: Improve query understanding and expansion to disambiguate intent
- **Context Optimization**: Better chunk strategies and context assembly techniques
- **Multi-step Reasoning**: Chain multiple retrieval steps for complex questions
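
As a preview of hybrid search, one widely used way to merge a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below is a minimal illustration, not necessarily the method used in the next notebook; the document IDs are made up, and `k=60` is a commonly used default constant rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores the sum of 1 / (k + rank) over the lists it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Made-up rankings from two hypothetical retrievers
semantic_ranking = ["a", "b", "c"]
keyword_ranking = ["b", "c", "a"]
print(reciprocal_rank_fusion([semantic_ranking, keyword_ranking]))
```

Because RRF uses only rank positions, it sidesteps the problem that cosine scores and keyword scores live on different scales.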

### 💡 Benefits of This Focused Approach:

- ✅ **Clear Demonstrations**: Focused on scenarios that actually show confusion
- ✅ **Quantifiable Issues**: RAGAS metrics provide concrete evidence of limitations
- ✅ **Educational Value**: Students see exactly where and why naive RAG fails
- ✅ **Motivation for Advanced Techniques**: Clear justification for the complexity of advanced RAG
- ✅ **Realistic Scenarios**: Demonstrates actual challenges with larger knowledge bases

**Ready for the next notebook**: Advanced RAG techniques that address these specific limitations!