# RAG Workshop Notebook - Naive RAG with Cohere Reranking


# 0. Setup Environment



In [1]:
%pip install --upgrade pip
%pip install wikipedia mwparserfromhell beautifulsoup4 openai qdrant-client tqdm python-dotenv cohere


Note: you may need to restart the kernel to use updated packages.
Collecting cohere
  Downloading cohere-5.16.1-py3-none-any.whl.metadata (3.4 kB)
Collecting fastavro<2.0.0,>=1.9.4 (from cohere)
  Using cached fastavro-1.11.1-cp311-cp311-macosx_10_9_universal2.whl.metadata (5.5 kB)
Collecting httpx-sse==0.4.0 (from cohere)
  Using cached httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting tokenizers<1,>=0.15 (from cohere)
  Using cached tokenizers-0.21.2-cp39-abi3-macosx_11_0_arm64.whl.metadata (6.8 kB)
Collecting types-requests<3.0.0,>=2.0.0 (from cohere)
  Using cached types_requests-2.32.4.20250611-py3-none-any.whl.metadata (2.1 kB)
Downloading cohere-5.16.1-py3-none-any.whl (291 kB)
Using cached httpx_sse-0.4.0-py3-none-any.whl (7.8 kB)
Using cached fastavro-1.11.1-cp311-cp311-macosx_10_9_universal2.whl (957 kB)
Using cached tokenizers-0.21.2-cp39-abi3-macosx_11_0_arm64.whl (2.7 MB)
Using cached types_requests-2.32.4.20250611-py3-none-any.whl (20 kB)
Installing collected 

In [2]:
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

True

### 1.1. Initialize Clients

In [3]:

import os
from openai import OpenAI
from qdrant_client import QdrantClient

# Initialize OpenAI client
openai_client = OpenAI()

# Initialize Qdrant Cloud client
qdrant_client = QdrantClient(
    url=os.getenv("QDRANT_URL"),
    api_key=os.getenv("QDRANT_API_KEY")
)

# Collection configuration
collection_name = "workshop_wikipedia_extended"
embedding_model = "text-embedding-3-small"

print(f"✅ Connected to Qdrant Cloud")
print(f"📚 Collection: {collection_name}")
print(f"🤖 Embedding model: {embedding_model}")

✅ Connected to Qdrant Cloud
📚 Collection: workshop_wikipedia_extended
🤖 Embedding model: text-embedding-3-small


### 1.2. Verify Collection and Dataset

In [4]:

# Get collection information
collection_info = qdrant_client.get_collection(collection_name)
point_count = collection_info.points_count

print(f"📊 Collection Statistics:")
print(f"   Total chunks: {point_count:,}")
print(f"   Vector dimension: {collection_info.config.params.vectors.size}")
print(f"   Distance metric: {collection_info.config.params.vectors.distance}")

# Sample a few points to see the data structure
sample_points = qdrant_client.scroll(
    collection_name=collection_name,
    limit=3,
    with_payload=True,
    with_vectors=False
)[0]

print(f"\n📝 Sample data structure:")
for i, point in enumerate(sample_points):
    payload = point.payload
    print(f"\nChunk {i+1}:")
    print(f"   Title: {payload.get('title', 'Unknown')}")
    print(f"   Text preview: {payload.get('text', '')[:100]}...")
    print(f"   Chunk {payload.get('chunk_index', 0)+1} of {payload.get('total_chunks', 0)}")

📊 Collection Statistics:
   Total chunks: 1,210
   Vector dimension: 1536
   Distance metric: Cosine

📝 Sample data structure:

Chunk 1:
   Title: BERT (language model)
   Text preview: Bidirectional encoder representations from transformers (BERT) is a language model introduced in Oct...
   Chunk 1 of 10

Chunk 2:
   Title: BERT (language model)
   Text preview: Euclidean space. Encoder: a stack of Transformer blocks with self-attention, but without causal mask...
   Chunk 2 of 10

Chunk 3:
   Title: BERT (language model)
   Text preview: consists of a sinusoidal function that takes the position in the sequence as input. Segment type: Us...
   Chunk 3 of 10


## 2. Build the Q/A Chatbot

![../imgs/naive-rag.png](../imgs/naive-rag.png)


### 2.1. Retrieval - Search the database for the most relevant embeddings.

In [5]:
# Function to search the database
def vector_search(query, top_k=1):
    # create embedding of the query
    response = openai_client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    )
    query_embeddings = response.data[0].embedding
    # similarity search using the embedding, give top n results which are close to the query embeddings
    search_result = qdrant_client.query_points(
        collection_name=collection_name,
        query=query_embeddings,
        with_payload=True,
        limit=top_k,
    ).points
    return [result.payload for result in search_result]


search_result = vector_search("What does the word 'deep' in 'deep learning' refer")

from pprint import pprint

pprint(search_result[0])

{'chunk_index': 0,
 'text': 'In machine learning, deep learning focuses on utilizing multilayered '
         'neural networks to perform tasks such as classification, regression, '
         'and representation learning. The field takes inspiration from '
         'biological neuroscience and is centered around stacking artificial '
         'neurons into layers and "training" them to process data. The '
         'adjective "deep" refers to the use of multiple layers (ranging from '
         'three to several hundred or thousands) in the network. Methods used '
         'can be supervised, semi-supervised or unsupervised. Some common deep '
         'learning network architectures include fully connected networks, '
         'deep belief networks, recurrent neural networks, convolutional '
         'neural networks, generative adversarial networks, transformers, and '
         'neural radiance fields. These architectures have been applied to '
         'fields including computer vision,

### 2.2. Generation - Use the retrieved embeddings to generate the answer.

In [None]:
def model_generate(prompt, model="gpt-4.1-nano"):
    messages = [{"role": "user", "content": prompt}]
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,  # this is the degree of randomness of the model's output
    )
    return response.choices[0].message.content

In [7]:
import json


def prompt_template(question, context):
    return """You are a AI Assistant that provides answer to the question at the end, over the following
  pieces of context. Make sure to only use the context to answer the question. Keep the wording very close to the context.
  Explicitly mention you DONT KNOW if the answer is not in the context. Answering questions when the answers are not in the context is NOT allowed.
  context:
  ```
  """ + json.dumps(context) + """
  ```
  User question: """ + question + """
  Answer in markdown:"""


In [8]:
def generate_answer(question):
    #Retrieval: search a knowledge base.
    search_result = vector_search(question)

    prompt = prompt_template(question, search_result)
    # Generation: LLMs' ability to generate the answer
    return model_generate(prompt)

In [9]:
question = f"Who introduced the time delay neural network (TDNN)? and when ?"
answer = generate_answer(question)
print("Answer:", answer)

Answer: ```markdown
The time delay neural network (TDNN) was introduced in 1987 by Alex Waibel.
```


## 3. RAG Evaluation with RAGAS

Before proceeding with improvements, let's establish baseline scores using **RAGAS** (Retrieval Augmented Generation Assessment Suite) - a specialized framework designed specifically for evaluating RAG systems.

### Context-Focused Metrics:

1. **Context Precision**: How well are relevant chunks ranked at the top?
2. **Context Recall**: How much of the necessary information was retrieved?
3. **Context Relevancy**: How relevant is the retrieved context to the question?

We're using **RAGAS** because it's purpose-built for RAG evaluation and provides deep insights into context quality - the most critical component of RAG performance. The evaluation is simple to use - just call one function!


In [10]:
# Import the RAGAS evaluation utility
# Import the RAGAS evaluation utility
import sys
import os
sys.path.append('../naive-rag')
from rag_evaluator_v2 import evaluate_naive_rag_v2

# Run evaluation on the current RAG system using RAGAS
print("🔍 Evaluating your Naive RAG system with RAGAS...")
print("This will evaluate context quality metrics on 15 questions...\n")

baseline_results = evaluate_naive_rag_v2(
    vector_search_func=vector_search,
    generate_answer_func=generate_answer
)


  from .autonotebook import tqdm as notebook_tqdm


🔍 Evaluating your Naive RAG system with RAGAS...
This will evaluate context quality metrics on 15 questions...

✅ Loaded 14 questions from evaluation dataset

Evaluating 14 questions...

Question 1/14: Who introduced the ReLU (rectified linear unit) ac...
Question 2/14: What was the first working deep learning algorithm...
Question 3/14: Which CNN achieved superhuman performance in a vis...
Question 4/14: When was BERT introduced and by which organization...
Question 5/14: What are the two model sizes BERT was originally i...
Question 6/14: What percentage of tokens are randomly selected fo...
Question 7/14: Who introduced the term 'deep learning' to the mac...
Question 8/14: Which three researchers were awarded the 2018 Turi...
Question 9/14: When was the first GPT introduced and by which org...
Question 10/14: What were the three parameter sizes of the first v...
Question 11/14: What is the 'one in ten rule' in regression analys...
Question 12/14: What is the essence of overfitting a

Evaluating: 100%|██████████| 14/14 [00:03<00:00,  4.21it/s]



RAGAS EVALUATION RESULTS

CONTEXT RECALL METRIC (0.0 - 1.0 scale):
  🟡 Context Recall: 0.643

🟡 GOOD: Your context retrieval is working well.


In [11]:
# Store baseline scores for comparison later
baseline_scores = baseline_results.get('aggregate_scores', {})

print("📊 BASELINE CONTEXT METRICS SUMMARY:")
if 'context_recall' in baseline_scores:
    print(f"Context Recall: {baseline_scores['context_recall']:.3f}")

print("\n💡 What these RAGAS metrics mean:")
print("• Context Recall: How much of the necessary information was retrieved")

print("\n🎯 Score Interpretation:")
print("• 0.8+ = Excellent")
print("• 0.6-0.8 = Good") 
print("• 0.4-0.6 = Needs Improvement")
print("• <0.4 = Poor")


📊 BASELINE CONTEXT METRICS SUMMARY:
Context Recall: 0.643

💡 What these RAGAS metrics mean:
• Context Recall: How much of the necessary information was retrieved

🎯 Score Interpretation:
• 0.8+ = Excellent
• 0.6-0.8 = Good
• 0.4-0.6 = Needs Improvement
• <0.4 = Poor


### 📋 Why We Need These Baseline Scores

These **RAGAS-powered** baseline scores are crucial because:

1. **Context Quality Focus**: RAGAS specifically measures how well your retrieval system finds and ranks relevant information
2. **Purpose-Built for RAG**: Unlike general evaluation tools, RAGAS is designed specifically for RAG systems
3. **Objective Measurement**: Quantitative metrics that measure actual retrieval performance
4. **Debugging Aid**: Low context scores immediately tell you where your RAG is failing
5. **Optimization Guide**: Use these metrics to systematically improve your retrieval strategy

🔬 **What makes RAGAS special**: 
- **Context Precision** helps ensure the most relevant information appears first
- **Context Recall** ensures you're not missing important information
- **Context Relevancy** validates that retrieved chunks actually help answer the question

**Next Steps**: Now that we have our baseline context metrics, let's improve our RAG system with Cohere's reranking!


## 4. Improving RAG with Cohere Reranking

Now let's see how adding a reranking step can improve our context selection and overall RAG performance.

### Why Reranking?

While vector similarity search is good at finding semantically related content, it has limitations:
- **Bi-encoder limitation**: Vector embeddings compress all information into a fixed-size representation
- **Lost nuances**: Subtle relevance signals can be lost in the embedding process
- **No query-document interaction**: Embeddings are created independently

Reranking solves these issues by:
- **Cross-encoder architecture**: Processes query and document together
- **Fine-grained relevance**: Captures subtle semantic relationships
- **Better precision**: Filters out less relevant results even if they have high vector similarity

### 4.1. Initialize Cohere Client

In [12]:
import cohere
import os

# Initialize Cohere client
cohere_client = cohere.Client(os.getenv("COHERE_API_KEY"))

print("✅ Cohere client initialized successfully!")

✅ Cohere client initialized successfully!


### 4.2. Create Reranking Function

In [13]:
def rerank_results(query, search_results, top_k=3, max_retries=5, initial_backoff=10):
    """
    Rerank search results using Cohere's rerank model with rate limit handling.
    
    Args:
        query: The user's question
        search_results: List of search results from vector search
        top_k: Number of top results to return after reranking
        max_retries: Maximum number of retry attempts for rate-limited requests
        initial_backoff: Initial backoff time in seconds (will increase exponentially)
    
    Returns:
        List of reranked results
    """
    import time
    
    # Extract texts from search results
    documents = [result.get('text', '') for result in search_results]
    
    # Implement retry with exponential backoff for rate limit handling
    retry_count = 0
    backoff_time = initial_backoff
    
    while retry_count <= max_retries:
        try:
            # Call Cohere rerank
            rerank_response = cohere_client.rerank(
                query=query,
                documents=documents,
                model='rerank-english-v3.0',  # Latest rerank model
                top_n=top_k
            )
            
            # Return reranked results maintaining original structure
            reranked_results = []
            for result in rerank_response.results:
                original_result = search_results[result.index].copy()
                original_result['rerank_score'] = result.relevance_score
                reranked_results.append(original_result)
            
            return reranked_results
            
        except Exception as e:
            # Check if it's a rate limit error (429)
            if hasattr(e, 'status_code') and e.status_code == 429:
                if retry_count < max_retries:
                    print(f"⚠️ Rate limit reached. Waiting for {backoff_time} seconds before retrying...")
                    time.sleep(backoff_time)
                    # Exponential backoff
                    backoff_time *= 2
                    retry_count += 1
                    continue
                else:
                    print(f"❌ Maximum retries ({max_retries}) reached. Falling back to vector search results.")
            else:
                print(f"❌ Error during reranking: {str(e)}")
            
            # Fallback: return the top_k results from the original search
            print("⚠️ Using original vector search results without reranking.")
            return search_results[:top_k]

### 4.3. Create Enhanced Answer Generation Function

In [14]:
def generate_answer_with_rerank(question, initial_top_k=10, final_top_k=1):
    """
    Generate answer using reranked results.
    
    Args:
        question: The user's question
        initial_top_k: Number of candidates to retrieve from vector search
        final_top_k: Number of results to keep after reranking
    
    Returns:
        Generated answer
    """
    # Step 1: Retrieval - get more candidates
    search_results = vector_search(question, top_k=initial_top_k)
    
    # Step 2: Reranking - select best results
    reranked_results = rerank_results(question, search_results, top_k=final_top_k)
    
    # Step 3: Generation - use reranked results
    prompt = prompt_template(question, reranked_results)
    return model_generate(prompt)

### 4.4. Compare Results: Naive RAG vs. Reranked RAG

In [15]:
# Let's compare results on a challenging question
# test_question = "What is the vanishing gradient problem and how does it affect deep neural networks?"
# test_question = "When was BERT introduced and by which organization?"
test_question = "In which year and paper was the modern version of the transformer proposed?"

print("=" * 80)
print("COMPARISON: Naive RAG vs. Reranked RAG")
print("=" * 80)
print(f"\nQuestion: {test_question}\n")

print("=== Without Reranking (Top 3 by vector similarity) ===")
search_results = vector_search(test_question, top_k=3)
for i, result in enumerate(search_results):
    print(f"\n{i+1}. {result['text'][:200]}...")

answer_naive = generate_answer(test_question)
print(f"\n📝 Answer (Naive RAG):\n{answer_naive}")

print("\n" + "=" * 80 + "\n")

print("=== With Cohere Reranking (Top 3 from 10 candidates) ===")
search_results_extended = vector_search(test_question, top_k=10)
reranked_results = rerank_results(test_question, search_results_extended, top_k=3)

for i, result in enumerate(reranked_results):
    print(f"\n{i+1}. [Rerank Score: {result['rerank_score']:.3f}] {result['text'][:200]}...")

answer_reranked = generate_answer_with_rerank(test_question)
print(f"\n📝 Answer (Reranked RAG):\n{answer_reranked}")

COMPARISON: Naive RAG vs. Reranked RAG

Question: In which year and paper was the modern version of the transformer proposed?

=== Without Reranking (Top 3 by vector similarity) ===

1. In deep learning, transformer is an architecture based on the multi-head attention mechanism, in which text is converted to numerical representations called tokens, and each token is converted into a ...

2. In 1884, Sir Charles Parsons invented the steam turbine allowing for more efficient electric power generation. Alternating current, with its ability to transmit power more efficiently over long distan...

3. keep its text processing performance. This led to the introduction of a multi-head attention model that was easier to parallelize due to the use of independent heads and the lack of recurrence. Its pa...

📝 Answer (Naive RAG):
```markdown
The modern version of the transformer was proposed in the 2017 paper "Attention Is All You Need" by researchers at Google.
```


=== With Cohere Reranking (Top

### 4.5. Evaluate Improvement with RAGAS

In [16]:
# Run evaluation with reranking
print("🔍 Evaluating RAG system with Cohere Reranking...")
print("This will evaluate the improved system on the same 15 questions...\n")

reranked_results = evaluate_naive_rag_v2(
    vector_search_func=lambda q: vector_search(q, top_k=15),
    generate_answer_func=generate_answer_with_rerank
)

# Store reranked scores
reranked_scores = reranked_results.get('aggregate_scores', {})

🔍 Evaluating RAG system with Cohere Reranking...
This will evaluate the improved system on the same 15 questions...

✅ Loaded 14 questions from evaluation dataset

Evaluating 14 questions...

Question 1/14: Who introduced the ReLU (rectified linear unit) ac...
Question 2/14: What was the first working deep learning algorithm...
Question 3/14: Which CNN achieved superhuman performance in a vis...
Question 4/14: When was BERT introduced and by which organization...
Question 5/14: What are the two model sizes BERT was originally i...
Question 6/14: What percentage of tokens are randomly selected fo...
Question 7/14: Who introduced the term 'deep learning' to the mac...
Question 8/14: Which three researchers were awarded the 2018 Turi...
Question 9/14: When was the first GPT introduced and by which org...
⚠️ Rate limit reached. Waiting for 10 seconds before retrying...
⚠️ Rate limit reached. Waiting for 20 seconds before retrying...
⚠️ Rate limit reached. Waiting for 40 seconds before retr

Evaluating: 100%|██████████| 14/14 [00:39<00:00,  2.82s/it]



RAGAS EVALUATION RESULTS

CONTEXT RECALL METRIC (0.0 - 1.0 scale):
  🟢 Context Recall: 1.000

🟢 EXCELLENT: Your context retrieval is highly effective!


In [17]:
# Compare improvements
print("\n" + "=" * 60)
print("📊 IMPROVEMENT WITH RERANKING")
print("=" * 60)

for metric in ['context_recall']:
    if metric in baseline_scores and metric in reranked_scores:
        baseline = baseline_scores[metric]
        reranked = reranked_scores[metric]
        improvement = reranked - baseline
        improvement_pct = (improvement / baseline) * 100 if baseline > 0 else 0
        
        print(f"\n{metric.replace('_', ' ').title()}:")
        print(f"  Baseline: {baseline:.3f}")
        print(f"  With Reranking: {reranked:.3f}")
        print(f"  Improvement: {improvement:+.3f} ({improvement_pct:+.1f}%)")

print("\n" + "=" * 60)
print("\n🎉 Key Insights:")
print("• Reranking typically improves context precision significantly")
print("• Better context selection leads to more accurate answers")
print("• The cross-encoder architecture of rerankers captures nuanced relevance")
print("• This is especially valuable for complex or ambiguous queries")


📊 IMPROVEMENT WITH RERANKING

Context Recall:
  Baseline: 0.643
  With Reranking: 1.000
  Improvement: +0.357 (+55.6%)


🎉 Key Insights:
• Reranking typically improves context precision significantly
• Better context selection leads to more accurate answers
• The cross-encoder architecture of rerankers captures nuanced relevance
• This is especially valuable for complex or ambiguous queries


## 5. Summary and Next Steps

### What We've Learned

1. **Naive RAG Limitations**: While vector search is effective, it can miss nuanced relevance
2. **Reranking Benefits**: Cross-encoder models like Cohere's reranker significantly improve context selection
3. **Measurable Improvements**: RAGAS metrics clearly show the performance gains

### Architecture Comparison

**Naive RAG:**
```
Query → Embedding → Vector Search (Top 3) → Generate Answer
```

**Reranked RAG:**
```
Query → Embedding → Vector Search (Top 10) → Rerank (Top 3) → Generate Answer
```

### When to Use Reranking

✅ **Use reranking when:**
- Answer quality is critical
- You have complex, nuanced queries
- Your corpus contains similar but subtly different content
- You can afford the additional API call latency

❌ **Skip reranking when:**
- Speed is more important than accuracy
- Queries are simple and unambiguous
- Your corpus has clearly distinct topics

### Further Improvements

1. **Hybrid Search**: Combine vector search with keyword search
2. **Query Expansion**: Generate multiple query variations
3. **Document Expansion**: Add metadata and summaries to chunks
4. **Fine-tuning**: Train custom rerankers on your domain
5. **Caching**: Store reranked results for common queries

### Try It Yourself!

Experiment with:
- Different `initial_top_k` values (try 20, 50)
- Different `final_top_k` values (try 5, 7)
- Different reranking models
- Your own questions and see the improvement!