# Retrieval Augmented Generation (RAG)

In this notebook, we'll explore Retrieval Augmented Generation (RAG), a powerful technique that enhances Large Language Models (LLMs) by enabling them to access and use external knowledge beyond their training data.

We'll cover the following topics:
1. Introduction to Embeddings and Vector Representations
2. Using Google's Embeddings API
3. Building a Simple RAG System
4. Evaluating and Fine-tuning RAG Performance
5. Advanced RAG Techniques

Let's start by importing the necessary libraries and exploring the concept of embeddings, which are the foundation of effective RAG systems.

In [None]:
# Import required libraries
import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from dotenv import load_dotenv
from google import genai
import scipy.spatial.distance as distance

# Load environment variables (such as GOOGLE_API_KEY) from a .env file, if one exists
load_dotenv()

# Create the Google Gen AI client with your API key
client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))

## Helper Functions

Let's create some utility functions to make it easier to interact with the language model and embeddings throughout the notebook.

In [None]:
def get_gemini_response(prompt, model="gemini-2.0-flash"):
    """
    Get a response from Google's Gemini model
    
    Args:
        prompt (str): The prompt to send to the model
        model (str): The model to use (default: gemini-2.0-flash)
        
    Returns:
        str: The model's response
    """
    try:
        # Configure the model
        generation_config = {
            "temperature": 0.2,
            "top_p": 0.95,
            "top_k": 40,
            "max_output_tokens": 1024,
        }
        
        # Generate the response
        response = client.models.generate_content(
            contents=prompt,
            model=model,
            config=generation_config,
        )
        
        # Return the response text
        return response.text
    except Exception as e:
        return f"Error: {str(e)}"

def get_embedding(text, model="text-embedding-004", task_type="RETRIEVAL_DOCUMENT", title=None):
    """
    Get embedding vector for text using Google's Embedding API
    
    Args:
        text (str): The text to embed
        model (str): The model to use
        task_type (str): The task type for the embedding
        title (str, optional): Optional title for the document
        
    Returns:
        list: The embedding vector
    """
    try:
        from google.genai import types
        config = types.EmbedContentConfig(
            task_type=task_type,
            title=title
        )
        
        result = client.models.embed_content(
            model=model,
            contents=text,
            config=config
        )
        
        return result.embeddings[0].values if result.embeddings else None
    except Exception as e:
        print(f"Error getting embedding: {str(e)}")
        return None

def cosine_similarity(v1, v2):
    """
    Calculate cosine similarity between two vectors
    
    Args:
        v1 (list): First vector
        v2 (list): Second vector
        
    Returns:
        float: Cosine similarity (-1 to 1)
    """
    return 1 - distance.cosine(v1, v2)

def display_prompt_response(prompt, response, technique=""):
    """
    Display the prompt and response in a formatted way
    
    Args:
        prompt (str): The prompt sent to the model
        response (str): The model's response
        technique (str): The technique used
    """
    print(f"{'='*80}")
    print(f"TECHNIQUE: {technique}")
    print(f"{'='*80}")
    print("\n📝 PROMPT:")
    print(f"{'-'*80}")
    print(prompt)
    print(f"{'-'*80}")
    print("\n🤖 RESPONSE:")
    print(f"{'-'*80}")
    print(response)
    print(f"{'-'*80}\n")

## 1. Introduction to Embeddings and Vector Representations

Embeddings are numerical representations of data (like text, images, or audio) in a high-dimensional vector space. They capture semantic meaning so that similar concepts sit closer together in this space.

### How Embeddings Work

When we convert words, sentences, or documents into embeddings, we're essentially mapping them to points in a multi-dimensional space. The position of these points captures the semantic relationships between the words or concepts they represent.
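
To make "closer in the space" concrete before calling the API, here is a minimal sketch of cosine similarity on hand-picked toy vectors. The three-dimensional vectors and the name `cosine_sim` are assumptions made for illustration; real embeddings have hundreds of dimensions and their values come from the model, not from us.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional "embeddings", hand-picked for illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "baseball": [0.1, 0.2, 0.95],
}

sim_king_queen = cosine_sim(toy_vectors["king"], toy_vectors["queen"])
sim_king_baseball = cosine_sim(toy_vectors["king"], toy_vectors["baseball"])
print(f"king vs queen:    {sim_king_queen:.3f}")
print(f"king vs baseball: {sim_king_baseball:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, while vectors pointing in unrelated directions score much lower. The same comparison is what a retrieval step performs on real embeddings.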

Let's visualize this with a simple example comparing words like "king", "queen", and "baseball" to see how embeddings capture semantic relationships.

In [None]:
# Let's compare word embeddings to see how they capture semantic relationships
words = ["king", "queen", "man", "woman", "baseball", "sports", "computer", "technology"]

# Get embeddings for these words
word_embeddings = {}
for word in words:
    embedding = get_embedding(word, task_type="SEMANTIC_SIMILARITY")
    word_embeddings[word] = embedding

# Calculate similarities between pairs of words
print("Word Similarity Comparisons (Cosine Similarity):")
print("-" * 50)

comparisons = [
    ("king", "queen"),
    ("king", "baseball"),
    ("man", "woman"),
    ("queen", "woman"),
    ("baseball", "sports"),
    ("computer", "technology"),
    ("king", "man"),
    ("queen", "technology")
]

for word1, word2 in comparisons:
    similarity = cosine_similarity(word_embeddings[word1], word_embeddings[word2])
    print(f"{word1} vs {word2}: {similarity:.4f}")

# Let's visualize the relationships on a 2D plot
# We'll use PCA to reduce dimensions for visualization
from sklearn.decomposition import PCA

# Extract embeddings and corresponding words
word_list = list(word_embeddings.keys())
embedding_list = [word_embeddings[word] for word in word_list]

# Reduce dimensions for visualization
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embedding_list)

# Plot the words in 2D space
plt.figure(figsize=(10, 8))
plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], alpha=0.7)

# Label each point with its word
for i, word in enumerate(word_list):
    plt.annotate(word, (reduced_embeddings[i, 0], reduced_embeddings[i, 1]), 
                 fontsize=12, alpha=0.8)

plt.title("Word Embeddings Visualized in 2D Space", fontsize=14)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.grid(alpha=0.3)
plt.show()

## 2. Using Google's Embeddings API

Google provides embedding models through its Generative AI API. The `text-embedding-004` model used in this notebook generates embeddings for a range of use cases, including:

- Semantic search
- Document retrieval
- Question answering
- Recommendation systems
- Text classification

The API allows us to create embeddings optimized for different tasks using the `task_type` parameter:
- `RETRIEVAL_QUERY`: For search queries
- `RETRIEVAL_DOCUMENT`: For documents to be searched
- `SEMANTIC_SIMILARITY`: For semantic textual similarity tasks
- `CLASSIFICATION`: For text classification tasks
- `CLUSTERING`: For clustering tasks

Let's explore how to use these embeddings for different purposes:

In [None]:
# Example 1: Creating embeddings for different texts
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast auburn canine leaps above the idle hound.",
    "Machine learning models require large amounts of training data.",
    "Artificial intelligence systems need extensive datasets for training."
]

# Get embeddings for these texts
embeddings = []
for text in texts:
    embedding = get_embedding(text, task_type="SEMANTIC_SIMILARITY")
    embeddings.append(embedding)

# Calculate similarity matrix
similarity_matrix = np.zeros((len(texts), len(texts)))
for i in range(len(texts)):
    for j in range(len(texts)):
        similarity_matrix[i, j] = cosine_similarity(embeddings[i], embeddings[j])

# Display the similarity matrix
plt.figure(figsize=(10, 8))
plt.imshow(similarity_matrix, cmap='viridis')
plt.colorbar(label='Cosine Similarity')
plt.title("Semantic Text Similarity Matrix")
plt.xticks(np.arange(len(texts)), [f"Text {i+1}" for i in range(len(texts))], rotation=45)
plt.yticks(np.arange(len(texts)), [f"Text {i+1}" for i in range(len(texts))])

# Add text annotations
for i in range(len(texts)):
    for j in range(len(texts)):
        plt.text(j, i, f"{similarity_matrix[i, j]:.2f}", 
                 ha="center", va="center", color="white" if similarity_matrix[i, j] < 0.7 else "black")

plt.tight_layout()
plt.show()

# Print the actual texts for reference
for i, text in enumerate(texts):
    print(f"Text {i+1}: {text}")

# Example 2: Task-specific embeddings
query = "How does artificial intelligence work?"
document1 = "Artificial intelligence works by training machine learning models on large datasets to recognize patterns and make predictions."
document2 = "The history of artificial intelligence dates back to the 1950s when early computer scientists began exploring the concept."

# Get task-specific embeddings
query_embedding = get_embedding(query, task_type="RETRIEVAL_QUERY")
doc1_embedding = get_embedding(document1, task_type="RETRIEVAL_DOCUMENT", title="AI Functionality")
doc2_embedding = get_embedding(document2, task_type="RETRIEVAL_DOCUMENT", title="AI History")

# Calculate relevance scores
relevance1 = cosine_similarity(query_embedding, doc1_embedding)
relevance2 = cosine_similarity(query_embedding, doc2_embedding)

print("\nQuery relevance scores:")
print(f"Document 1 (about AI functionality): {relevance1:.4f}")
print(f"Document 2 (about AI history): {relevance2:.4f}")

## 3. Building a Simple RAG System

Retrieval Augmented Generation (RAG) is a technique that combines the power of large language models with the ability to access external knowledge. RAG systems typically follow these steps:

1. **Document Chunking**: Break documents into manageable chunks
2. **Embedding Generation**: Convert document chunks into embeddings
3. **Storage**: Store the embeddings and their corresponding text chunks
4. **Retrieval**: When a query arrives, convert it to an embedding and find relevant documents
5. **Augmented Generation**: Feed the retrieved information to the LLM along with the query to generate a response
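
The five steps above can be sketched end to end without any API calls. This is only an illustration under stated assumptions: `toy_embed` is a word-count stand-in for a real embedding model, the chunks are single sentences, and every `toy_*` name is hypothetical. The final LLM call is left out, so the sketch stops at the augmented prompt.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: a sparse word-count vector (not a learned embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def toy_cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Steps 1-3: chunk (here, one sentence per chunk), embed, and store.
toy_chunks = [
    "RAG retrieves relevant documents before generating an answer.",
    "Embeddings map text to vectors in a high-dimensional space.",
    "Bananas are rich in potassium.",
]
toy_store = [(chunk, toy_embed(chunk)) for chunk in toy_chunks]

def toy_retrieve(query, store, top_k=1):
    """Step 4: embed the query and rank stored chunks by similarity."""
    q = toy_embed(query)
    ranked = sorted(store, key=lambda pair: toy_cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

def toy_build_prompt(query, retrieved):
    """Step 5: place the retrieved text ahead of the question for the LLM."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

toy_question = "How does RAG use retrieved documents?"
toy_hits = toy_retrieve(toy_question, toy_store)
toy_prompt = toy_build_prompt(toy_question, toy_hits)
print(toy_prompt)
```

Word counts only match exact tokens, so this stand-in would miss paraphrases that a learned embedding model can catch. The structure of the pipeline is the same either way.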

Let's build a simple RAG system from scratch using Google's Gemini API:

In [None]:
# Step 1: Create a document collection
documents = [
    {
        "title": "What is RAG?",
        "content": """
        Retrieval Augmented Generation (RAG) is a technique that enhances large language models (LLMs) 
        by providing them with relevant information retrieved from an external knowledge base. 
        This allows the LLM to generate more accurate, up-to-date, and contextually relevant responses.
        
        RAG systems combine the broad knowledge of LLMs with specialized, current, or proprietary information, 
        making them powerful tools for knowledge-intensive tasks. They are particularly useful when dealing with 
        domain-specific questions, recent events, or proprietary data that wasn't part of the LLM's training.
        """
    },
    {
        "title": "How RAG Works",
        "content": """
        RAG systems typically work in two phases: retrieval and generation.
        
        In the retrieval phase, when a user asks a question, the system:
        1. Converts the query into an embedding (a numerical representation)
        2. Searches a knowledge base for relevant documents by comparing embeddings
        3. Retrieves the most semantically similar documents
        
        In the generation phase, the system:
        1. Combines the original query with the retrieved information
        2. Sends this augmented context to the LLM
        3. The LLM generates a response based on both its pre-trained knowledge and the retrieved information
        """
    },
    {
        "title": "Benefits of RAG",
        "content": """
        RAG offers several advantages over using vanilla LLMs:
        
        1. Reduced hallucinations: By grounding responses in retrieved facts, RAG reduces the tendency of LLMs to generate plausible-sounding but incorrect information.
        
        2. Access to current information: RAG can incorporate recent information that wasn't available during the LLM's training.
        
        3. Domain specialization: Organizations can add specific domain knowledge without fine-tuning the entire model.
        
        4. Transparency and attribution: RAG systems can cite the sources of information used to generate responses.
        
        5. Privacy and data control: Sensitive information can be kept in private knowledge bases rather than being included in model training.
        """
    },
    {
        "title": "Challenges in RAG Implementation",
        "content": """
        Implementing effective RAG systems comes with several challenges:
        
        1. Information retrieval quality: The system's effectiveness depends greatly on retrieving the right information. Poor retrieval leads to irrelevant context and potentially incorrect answers.
        
        2. Context window limitations: LLMs have finite context windows, limiting how much retrieved information can be included.
        
        3. Balancing relevance: Determining how much weight to give to retrieved information versus the model's pre-trained knowledge can be tricky.
        
        4. Knowledge base management: Creating, updating, and maintaining the knowledge base requires ongoing effort.
        
        5. Computational overhead: RAG systems involve additional processing steps compared to simple LLM inference.
        """
    },
    {
        "title": "Advanced RAG Techniques",
        "content": """
        Recent advancements in RAG include:
        
        1. Hybrid search: Combining keyword-based and semantic search for better retrieval.
        
        2. Re-ranking: Using a separate model to re-rank initially retrieved documents for better relevance.
        
        3. Query rewriting: Transforming the user's query to better match relevant documents.
        
        4. Multi-vector retrieval: Representing documents with multiple embeddings to capture different aspects.
        
        5. Self-RAG: Systems that can decide when to retrieve information and perform retrieval iteratively.
        
        6. Adaptive retrieval: Dynamically adjusting how much and what kind of information to retrieve based on the query.
        """
    }
]

# Step 2: Define a function to chunk documents
def chunk_document(doc, chunk_size=250, overlap=50):
    """
    Split a document into smaller chunks with overlap
    
    Args:
        doc (dict): Document with title and content
        chunk_size (int): Approximate size of each chunk in characters
        overlap (int): Overlap between chunks in characters
        
    Returns:
        list: List of document chunks with title and content
    """
    content = doc["content"]
    title = doc["title"]
    
    # Simple chunking by characters (in a real application, you'd chunk by sentences or paragraphs)
    chunks = []
    start = 0
    
    while start < len(content):
        end = start + chunk_size
        
        # Adjust end to not cut words
        if end < len(content):
            # Try to find a space to break at
            space_pos = content.rfind(' ', start, end)
            if space_pos != -1:
                end = space_pos
        
        # Create the chunk
        chunk_content = content[start:end].strip()
        if chunk_content:
            chunks.append({
                "title": title,
                "content": chunk_content
            })
        
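        # Stop once the end of the content is reached; otherwise the overlap
        # would create an extra tail chunk that repeats text already emitted
        if end >= len(content):
            break
        
        # Move start position for next chunk, including overlap
        start = end - overlap if end - overlap > start else end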
    
    return chunks

# Step 3: Chunk all documents
all_chunks = []
for doc in documents:
    chunks = chunk_document(doc)
    all_chunks.extend(chunks)

# Preview the chunks
print(f"Created {len(all_chunks)} chunks from {len(documents)} documents")
print("\nFirst chunk example:")
print(f"Title: {all_chunks[0]['title']}")
print(f"Content: {all_chunks[0]['content'][:150]}...")

In [None]:
# Step 4: Create embeddings for the chunks
chunk_embeddings = []

for i, chunk in enumerate(all_chunks):
    print(f"Processing chunk {i+1}/{len(all_chunks)}...", end="\r")
    
    # Create a combined text with title for better context
    text_to_embed = f"{chunk['title']}: {chunk['content']}"
    
    # Get the embedding
    embedding = get_embedding(
        text_to_embed, 
        task_type="RETRIEVAL_DOCUMENT", 
        title=chunk['title']
    )
    
    # Store the chunk and its embedding
    if embedding:
        chunk_embeddings.append({
            "chunk": chunk,
            "embedding": embedding
        })

print(f"\nCreated {len(chunk_embeddings)} embeddings successfully")

# Step 5: Create a simple vector store class
class SimpleVectorStore:
    def __init__(self, items=None):
        """Initialize the vector store"""
        self.items = items or []
    
    def add_item(self, item):
        """Add a single item to the store"""
        self.items.append(item)
    
    def add_items(self, items):
        """Add multiple items to the store"""
        self.items.extend(items)
    
    def search(self, query_embedding, top_k=3):
        """Search for similar items based on embedding similarity"""
        if not self.items:
            return []
        
        # Calculate similarities
        similarities = [
            (item, cosine_similarity(query_embedding, item["embedding"]))
            for item in self.items
        ]
        
        # Sort by similarity (highest first)
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        # Return top k results
        return similarities[:top_k]

# Create our vector store
vector_store = SimpleVectorStore()
vector_store.add_items(chunk_embeddings)

print(f"Vector store created with {len(vector_store.items)} items")

In [None]:
# Step 6: Create a RAG query function
def rag_query(user_query, vector_store, top_k=3):
    """
    Perform a RAG query
    
    Args:
        user_query (str): User's question
        vector_store: Vector store containing document chunks and embeddings
        top_k (int): Number of documents to retrieve
        
    Returns:
        str: Generated response
    """
    # 1. Convert the query to an embedding
    query_embedding = get_embedding(user_query, task_type="RETRIEVAL_QUERY")
    
    if not query_embedding:
        return "Error: Could not generate embedding for query."
    
    # 2. Retrieve relevant documents
    search_results = vector_store.search(query_embedding, top_k=top_k)
    
    # 3. Format the retrieved context
    context_chunks = []
    for item, score in search_results:
        context_chunks.append(f"Title: {item['chunk']['title']}\nContent: {item['chunk']['content']}\nRelevance: {score:.4f}")
    
    context_text = "\n\n".join(context_chunks)
    
    # 4. Create the augmented prompt
    prompt = f"""
    Answer the following question based on the provided context. If the context doesn't contain relevant information, 
    say that you don't have enough information to answer accurately.
    
    Context:
    {context_text}
    
    Question: {user_query}
    
    Answer:
    """
    
    # 5. Generate response using the LLM
    response = get_gemini_response(prompt)
    
    return response

# Let's test our RAG system with some queries
test_queries = [
    "What is Retrieval Augmented Generation?",
    "What are the main benefits of using RAG?",
    "How can I improve the retrieval quality in a RAG system?",
    "What is quantum computing?" # A query our knowledge base doesn't cover
]

for i, query in enumerate(test_queries):
    print(f"\nQuery {i+1}: {query}")
    print("-" * 80)
    
    response = rag_query(query, vector_store, top_k=2)
    print(response)
    print("=" * 80)

## 4. Evaluating and Fine-tuning RAG Performance

For RAG systems to be effective, both the retrieval and generation components need to work well together. Let's look at some ways to evaluate and improve RAG performance:

### Key Metrics for Evaluation

1. **Retrieval Performance**:
   - Precision: Are the retrieved documents relevant?
   - Recall: Did we retrieve all the relevant documents?
   - Mean Reciprocal Rank (MRR): How high was the first relevant document ranked, averaged over all queries?

2. **Generation Quality**:
   - Factual accuracy: Does the answer contain correct information?
   - Relevance: Does the answer address the question?
   - Coherence: Is the answer well-structured and understandable?
   - Citation accuracy: Does the model properly use the provided context?
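
As a small worked example of the retrieval metrics, the sketch below computes precision, recall, and reciprocal rank for two made-up queries, then averages the reciprocal ranks to get MRR. The document titles, the relevance judgments, and the function names are all hypothetical, and titles are assumed to be unique within each result list.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall over unique document titles."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result (ranks start at 1), or 0 if none."""
    for rank, title in enumerate(retrieved, start=1):
        if title in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical ranked results for two queries, with their relevant titles.
toy_runs = [
    (["A", "B", "C"], {"A", "D"}),  # first relevant result at rank 1
    (["X", "Y", "Z"], {"Y"}),       # first relevant result at rank 2
]

for retrieved, relevant in toy_runs:
    p, r = precision_recall(retrieved, relevant)
    rr = reciprocal_rank(retrieved, relevant)
    print(f"precision={p:.2f} recall={r:.2f} reciprocal_rank={rr:.2f}")

# MRR is the mean of the per-query reciprocal ranks.
toy_mrr = sum(reciprocal_rank(ret, rel) for ret, rel in toy_runs) / len(toy_runs)
print(f"MRR over {len(toy_runs)} queries: {toy_mrr:.2f}")
```

For the first query, one of three retrieved titles is relevant (precision 1/3) and one of two relevant titles was found (recall 1/2). The reciprocal ranks are 1 and 1/2, so MRR is 0.75.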

Let's implement some simple evaluation techniques for our RAG system:

In [None]:
# Let's create a simple evaluation framework for our RAG system

# Define some test cases with ground truth
evaluation_cases = [
    {
        "query": "What is RAG and how does it work?",
        "relevant_docs": ["What is RAG?", "How RAG Works"],  # Titles of relevant documents
        "key_points": [
            "combines retrieval and generation",
            "enhances LLMs with external knowledge",
            "two phases: retrieval and generation"
        ]
    },
    {
        "query": "What are the main benefits of using RAG systems?",
        "relevant_docs": ["Benefits of RAG"],
        "key_points": [
            "reduced hallucinations",
            "access to current information",
            "domain specialization",
            "transparency",
            "privacy"
        ]
    },
    {
        "query": "What challenges might I face when implementing RAG?",
        "relevant_docs": ["Challenges in RAG Implementation"],
        "key_points": [
            "information retrieval quality",
            "context window limitations",
            "knowledge base management",
            "computational overhead"
        ]
    }
]

# Function to evaluate retrieval performance
def evaluate_retrieval(query, relevant_doc_titles, vector_store, top_k=3):
    """
    Evaluate the retrieval component of RAG
    
    Args:
        query (str): The query
        relevant_doc_titles (list): Titles of relevant documents
        vector_store: The vector store
        top_k (int): Number of documents to retrieve
        
    Returns:
        dict: Evaluation metrics
    """
    # Convert the query to an embedding
    query_embedding = get_embedding(query, task_type="RETRIEVAL_QUERY")
    
    if not query_embedding:
        return {"error": "Could not generate embedding"}
    
    # Retrieve documents
    results = vector_store.search(query_embedding, top_k=top_k)
    retrieved_titles = [item["chunk"]["title"] for item, _ in results]
    
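    # Calculate metrics
    # Precision is chunk-level: the share of retrieved chunks that come from a relevant document
    relevant_retrieved = [title for title in retrieved_titles if title in relevant_doc_titles]
    precision = len(relevant_retrieved) / len(retrieved_titles) if retrieved_titles else 0
    
    # Recall is document-level: count each relevant document once, because several
    # retrieved chunks can share a title and would otherwise push recall above 1
    relevant_found = set(relevant_retrieved)
    recall = len(relevant_found) / len(relevant_doc_titles) if relevant_doc_titles else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0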
    
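    # Calculate the reciprocal rank of the first relevant result for this query
    # (averaging this value over many queries gives the Mean Reciprocal Rank)
    mrr = 0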
    for i, title in enumerate(retrieved_titles):
        if title in relevant_doc_titles:
            mrr = 1 / (i + 1)  # Rank is 1-indexed
            break
    
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mrr": mrr,
        "retrieved_titles": retrieved_titles
    }

# Function to evaluate the generated answer
def evaluate_generation(query, key_points, response):
    """
    Evaluate the generation component of RAG
    
    Args:
        query (str): The query
        key_points (list): Key points that should be in the answer
        response (str): The generated response
        
    Returns:
        dict: Evaluation metrics
    """
    # Count how many key points are mentioned in the response
    mentioned_points = 0
    for point in key_points:
        if point.lower() in response.lower():
            mentioned_points += 1
    
    coverage = mentioned_points / len(key_points) if key_points else 0
    
    return {
        "key_point_coverage": coverage,
        "key_points_mentioned": mentioned_points,
        "total_key_points": len(key_points)
    }

# Run evaluation on our test cases
print("Evaluating RAG performance...")
for case in evaluation_cases:
    print(f"\nQuery: {case['query']}")
    print("-" * 60)
    
    # Get RAG response
    response = rag_query(case["query"], vector_store, top_k=3)
    
    # Evaluate retrieval
    retrieval_metrics = evaluate_retrieval(
        case["query"], case["relevant_docs"], vector_store, top_k=3
    )
    
    # Evaluate generation
    generation_metrics = evaluate_generation(
        case["query"], case["key_points"], response
    )
    
    # Print results
    print("Retrieval Metrics:")
    print(f"  Precision: {retrieval_metrics['precision']:.2f}")
    print(f"  Recall: {retrieval_metrics['recall']:.2f}")
    print(f"  F1 Score: {retrieval_metrics['f1']:.2f}")
    print(f"  MRR: {retrieval_metrics['mrr']:.2f}")
    print(f"  Retrieved: {retrieval_metrics['retrieved_titles']}")
    
    print("\nGeneration Metrics:")
    print(f"  Key Point Coverage: {generation_metrics['key_point_coverage']:.2f}")
    print(f"  Key Points Mentioned: {generation_metrics['key_points_mentioned']}/{generation_metrics['total_key_points']}")
    
    print("\nResponse:")
    print(response[:300] + "..." if len(response) > 300 else response)
    print("=" * 80)

### Strategies for Improving RAG Performance

Based on evaluation results, we can improve our RAG system in several ways:

1. **Retrieval Improvements**:
   - **Better chunking strategies**: Use semantic chunking instead of fixed-size chunking
   - **Query reformulation**: Expand queries to better match relevant documents
   - **Hybrid search**: Combine semantic search with keyword-based search
   - **Re-ranking**: Use a second model to re-rank initial retrieval results

2. **Generation Improvements**:
   - **Prompt engineering**: Refine the prompt template for better use of retrieved context
   - **Citation instruction**: Explicitly instruct the model to cite sources from the context
   - **Filtering hallucinations**: Detect and remove generated content not supported by the context
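
One common way to realize hybrid search is to run a semantic retriever and a keyword retriever separately and merge their ranked lists. The sketch below uses reciprocal rank fusion (RRF) for the merge. It is not part of the RAG system in this notebook: the document IDs and both input rankings are hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """
    Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Larger k flattens the gap between ranks.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings of the same corpus from two retrievers.
semantic_ranking = ["doc_b", "doc_a", "doc_c"]  # e.g. by embedding similarity
keyword_ranking = ["doc_a", "doc_d", "doc_b"]   # e.g. by term overlap

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while documents found by only one retriever still appear further down. RRF works on ranks alone, so the two retrievers' scores do not need to be on the same scale.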
   
Let's implement some of these improvements to our basic RAG system:

In [None]:
# Let's implement some improvements to our RAG system

# 1. Improvement: Query Reformulation
def reformulate_query(original_query):
    """
    Use the LLM to reformulate the query for better retrieval
    
    Args:
        original_query (str): The original user query
        
    Returns:
        str: The reformulated query
    """
    prompt = f"""
    Your task is to reformulate the given query to improve information retrieval.
    Make the query more specific and include key terms that might appear in relevant documents.
    Keep the reformulation focused and concise.
    
    Original Query: {original_query}
    
    Reformulated Query:
    """
    
    response = get_gemini_response(prompt)
    
    # Clean up response to get just the reformulated query
    # (In a production system, you'd want more robust parsing)
    return response.strip()

# 2. Improvement: Enhanced RAG with Query Reformulation and Better Prompting
def enhanced_rag_query(user_query, vector_store, top_k=3, use_query_reformulation=True):
    """
    Enhanced RAG query with improvements
    
    Args:
        user_query (str): User's question
        vector_store: Vector store containing document chunks and embeddings
        top_k (int): Number of documents to retrieve
        use_query_reformulation (bool): Whether to use query reformulation
        
    Returns:
        dict: Results including response and metadata
    """
    # 1. Reformulate the query if enabled
    if use_query_reformulation:
        retrieval_query = reformulate_query(user_query)
        print(f"Original query: {user_query}")
        print(f"Reformulated query: {retrieval_query}")
    else:
        retrieval_query = user_query
    
    # 2. Convert the query to an embedding
    query_embedding = get_embedding(retrieval_query, task_type="RETRIEVAL_QUERY")
    
    if not query_embedding:
        return {"response": "Error: Could not generate embedding for query."}
    
    # 3. Retrieve relevant documents
    search_results = vector_store.search(query_embedding, top_k=top_k)
    
    # 4. Format the retrieved context with clear source markers
    context_chunks = []
    for i, (item, score) in enumerate(search_results):
        source_id = f"[Source {i+1}]"
        context_chunks.append(f"{source_id} Title: {item['chunk']['title']}\nContent: {item['chunk']['content']}")
    
    context_text = "\n\n".join(context_chunks)
    
    # 5. Create an improved prompt with clear instructions
    improved_prompt = f"""
    Answer the user's question based ONLY on the provided context. 
    
    If the context doesn't contain enough information to provide a complete answer, acknowledge what you can answer 
    and clearly state what information is missing.
    
    IMPORTANT: 
    - Cite your sources using the [Source X] notation.
    - Don't include information that isn't supported by the provided context.
    - Be concise and focus on directly addressing the question.
    
    Context:
    {context_text}
    
    User question: {user_query}
    
    Answer:
    """
    
    # 6. Generate response using the LLM
    response = get_gemini_response(improved_prompt)
    
    # Return both the response and metadata for analysis
    return {
        "response": response,
        "metadata": {
            "original_query": user_query,
            "retrieval_query": retrieval_query,
            "retrieved_chunks": [item["chunk"]["title"] for item, _ in search_results],
            "similarity_scores": [score for _, score in search_results]
        }
    }

# Let's test our enhanced RAG system
test_queries = [
    "How do RAG systems work?",
    "What problems can RAG help solve?",
    "What are the most recent developments in RAG technology?"
]

for query in test_queries:
    print(f"\nTesting enhanced RAG with query: {query}")
    print("-" * 80)
    
    # Get response using the enhanced RAG
    result = enhanced_rag_query(query, vector_store, top_k=2)
    
    # Print metadata
    print("\nRetrieval Metadata:")
    for chunk_title, score in zip(result["metadata"]["retrieved_chunks"], result["metadata"]["similarity_scores"]):
        print(f"  - {chunk_title} (Score: {score:.4f})")
    
    # Print response
    print("\nEnhanced RAG Response:")
    print(result["response"])
    print("=" * 80)

## 5. Advanced RAG Techniques

Beyond the basic RAG implementation we've explored, there are several advanced techniques that can significantly improve performance:

### Multi-step RAG

Multi-step RAG breaks the retrieval process into multiple stages:
1. **Initial Query**: Start with the user's original query
2. **Retrieval & Reading**: Retrieve documents and have the LLM read them
3. **Query Refinement**: Generate a better query based on the initial findings
4. **Final Retrieval**: Retrieve more accurate documents with the refined query
5. **Response Generation**: Generate a comprehensive answer

### Hypothetical Document Embeddings (HyDE)

HyDE improves retrieval in two steps:
1. Have the LLM generate a hypothetical document that would answer the query
2. Embed this hypothetical document and use that embedding for retrieval instead of the original query's

A generated answer tends to resemble the stored documents more closely than a short question does, so this often leads to better semantic matching with relevant documents.
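
A minimal sketch of the HyDE flow follows. To keep it runnable without API access, the LLM call, the embedding call, and the vector-store lookup are passed in as functions. The name `hyde_search`, its parameters, and the prompt wording are assumptions for illustration, not an established API.

```python
def hyde_search(query, generate_fn, embed_fn, search_fn, top_k=3):
    """
    HyDE-style retrieval with injected dependencies.

    generate_fn(prompt) -> str        : LLM call
    embed_fn(text) -> vector          : embedding call
    search_fn(vector, top_k) -> list  : vector-store lookup
    """
    prompt = (
        "Write a short passage that plausibly answers the question below.\n\n"
        f"Question: {query}\n\nPassage:"
    )
    hypothetical_doc = generate_fn(prompt)
    # Fall back to the raw query if generation returned nothing usable.
    text_to_embed = hypothetical_doc.strip() or query
    return search_fn(embed_fn(text_to_embed), top_k)

# Stub dependencies so the sketch runs on its own.
def demo_generate(prompt):
    return "RAG retrieves documents and passes them to the LLM as context."

def demo_embed(text):
    return [float(len(text.split()))]

def demo_search(vector, top_k):
    return [f"result for vector {vector}"][:top_k]

print(hyde_search("How does RAG work?", demo_generate, demo_embed, demo_search))
```

With the helpers defined earlier in this notebook, the three functions could plausibly be `get_gemini_response`, a wrapper around `get_embedding`, and a wrapper around `vector_store.search`. Which `task_type` suits the hypothetical document is a judgment call worth testing. Note also that `get_gemini_response` returns an error string on failure, which this sketch would embed as if it were a real passage.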

### Multi-Vector Retrieval

Instead of representing a document with a single embedding vector:
1. Split documents into multiple chunks
2. Create an embedding for each chunk
3. Retrieve based on chunk-level similarity

This helps with long documents where different sections cover different topics.
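
One way to retrieve at chunk level while still returning whole documents is to score each document by its best-matching chunk. The sketch below shows that aggregation on hand-made two-dimensional vectors. The function name, the vectors, and the document IDs are hypothetical, and all vectors are assumed to be non-zero.

```python
import numpy as np

def best_chunk_per_document(query_vec, chunk_vecs, chunk_doc_ids, top_k=2):
    """
    Rank documents by their best-matching chunk.

    chunk_vecs    : array-like of shape (n_chunks, dim)
    chunk_doc_ids : list of length n_chunks giving each chunk's parent document
    Returns [(doc_id, best_similarity), ...] from most to least similar.
    """
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(chunk_vecs, dtype=float)
    # Cosine similarity between the query and every chunk in one step.
    sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    best = {}
    for doc_id, sim in zip(chunk_doc_ids, sims):
        best[doc_id] = max(best.get(doc_id, -1.0), float(sim))
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Hypothetical 2-D chunk vectors: "long_doc" covers two unrelated topics.
toy_chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_chunk_docs = ["long_doc", "long_doc", "short_doc"]

toy_doc_ranking = best_chunk_per_document([0.0, 1.0], toy_chunk_vecs, toy_chunk_docs)
print(toy_doc_ranking)
```

Here `long_doc` ranks first because one of its chunks matches the query exactly, even though its other chunk is unrelated. A single averaged vector for the whole document would have blurred the two topics together.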

Let's implement a simple version of multi-step RAG:

In [None]:
# Let's implement Multi-step RAG as an example of an advanced technique

def multi_step_rag_query(user_query, vector_store, top_k_initial=3, top_k_final=2):
    """
    Multi-step RAG query with query refinement
    
    Args:
        user_query (str): User's question
        vector_store: Vector store containing document chunks and embeddings
        top_k_initial (int): Number of documents for initial retrieval
        top_k_final (int): Number of documents for final retrieval
        
    Returns:
        dict: Results including response and metadata
    """
    print(f"Original query: {user_query}")
    
    # Step 1: Initial retrieval with original query
    query_embedding = get_embedding(user_query, task_type="RETRIEVAL_QUERY")
    initial_results = vector_store.search(query_embedding, top_k=top_k_initial)
    
    # Step 2: Format the initial context
    initial_context = []
    for item, score in initial_results:
        initial_context.append(f"Title: {item['chunk']['title']}\nContent: {item['chunk']['content']}")
    
    initial_context_text = "\n\n".join(initial_context)
    
    # Step 3: Have the LLM analyze the initial results and refine the query
    refine_prompt = f"""
    Based on the user's question and the initial search results, create a better search query.
    The improved query should help retrieve more relevant information by including key terms found in the initial results.
    
    User's question: {user_query}
    
    Initial search results:
    {initial_context_text}
    
    Provide ONLY the improved search query without any explanation or additional text:
    """
    
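    # Strip whitespace and stray quotes the model may add; fall back to the
    # original query if the refinement comes back empty
    improved_query = (get_gemini_response(refine_prompt) or "").strip().strip("\"'") or user_query
    print(f"Improved query: {improved_query}")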
    
    # Step 4: Second retrieval with improved query
    improved_embedding = get_embedding(improved_query, task_type="RETRIEVAL_QUERY")
    final_results = vector_store.search(improved_embedding, top_k=top_k_final)
    
    # Step 5: Format the final context
    final_context = []
    for i, (item, score) in enumerate(final_results):
        source_id = f"[Source {i+1}]"
        final_context.append(f"{source_id} Title: {item['chunk']['title']}\nContent: {item['chunk']['content']}")
    
    final_context_text = "\n\n".join(final_context)
    
    # Step 6: Generate the final answer
    final_prompt = f"""
    Answer the user's question based ONLY on the provided context. 
    
    If the context doesn't contain enough information to provide a complete answer, acknowledge what you can answer 
    and clearly state what information is missing.
    
    IMPORTANT: 
    - Cite your sources using the [Source X] notation.
    - Don't include information that isn't supported by the provided context.
    - Be concise and focus on directly addressing the question.
    
    Context:
    {final_context_text}
    
    User question: {user_query}
    
    Answer:
    """
    
    final_response = get_gemini_response(final_prompt)
    
    # Return results and metadata
    return {
        "response": final_response,
        "metadata": {
            "original_query": user_query,
            "improved_query": improved_query,
            "initial_results": [item["chunk"]["title"] for item, _ in initial_results],
            "final_results": [item["chunk"]["title"] for item, _ in final_results]
        }
    }

# Test the multi-step RAG
complex_query = "What techniques can help overcome the challenges of implementing effective RAG systems?"

print("\nTesting Multi-step RAG:")
print("-" * 80)

multi_step_result = multi_step_rag_query(complex_query, vector_store)

# Print metadata
print("\nMulti-step RAG process:")
print(f"1. Initial retrieval with: '{multi_step_result['metadata']['original_query']}'")
print(f"   Retrieved: {', '.join(multi_step_result['metadata']['initial_results'])}")
print(f"2. Query refinement: '{multi_step_result['metadata']['improved_query']}'")
print(f"3. Final retrieval: {', '.join(multi_step_result['metadata']['final_results'])}")

# Print response
print("\nFinal Response:")
print(multi_step_result["response"])
print("=" * 80)

## Conclusion

In this notebook, we've explored Retrieval Augmented Generation (RAG) from the ground up:

1. **Embeddings Fundamentals**: We learned how vector embeddings capture semantic relationships between words and documents, making them powerful tools for information retrieval.

2. **Google's Embeddings API**: We used Google's text-embedding-004 model to create embeddings optimized for different task types, such as `RETRIEVAL_QUERY`.

3. **Basic RAG System**: We built a simple but functional RAG system that can retrieve relevant information and use it to generate accurate responses.

4. **Evaluation and Fine-tuning**: We explored methods to evaluate RAG performance and implemented improvements like query reformulation and better prompting.

5. **Advanced Techniques**: We implemented a multi-step RAG approach to handle more complex queries and improve retrieval accuracy.

RAG represents one of the most practical ways to enhance LLMs for real-world applications. By grounding LLM outputs in retrieved facts, we can reduce hallucinations, provide access to specialized knowledge, and create more trustworthy AI systems.

As you continue to work with RAG, remember that both the retrieval and generation components require ongoing refinement. The best RAG systems are those that are continuously evaluated and improved based on real-world usage and feedback.

## Additional Resources and References

Here are some valuable resources for further exploration of RAG:

### Research Papers
- Lewis, P., et al. (2020). "[Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401)"
- Guu, K., et al. (2020). "[REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909)"
- Borgeaud, S., et al. (2022). "[Improving language models by retrieving from trillions of tokens](https://arxiv.org/abs/2112.04426)"

### Guides and Tutorials
- [Google's Embedding API Documentation](https://ai.google.dev/api/embeddings)
- [LangChain RAG Documentation](https://python.langchain.com/docs/tutorials/rag/)
- [LlamaIndex RAG Guide](https://docs.llamaindex.ai/en/stable/use_cases/query_engine/)

### Community Resources
- [MTEB Embeddings Leaderboard on Hugging Face](https://huggingface.co/spaces/mteb/leaderboard)
- [Awesome RAG GitHub Repository](https://github.com/Danielskry/Awesome-RAG)

As RAG continues to evolve, staying up-to-date with the latest techniques will help you build increasingly powerful and accurate information retrieval and generation systems.