🔧 **Setup Required**: Before running this notebook, please follow the [setup instructions](../README.md#setup-instructions) to configure your environment and API keys.

# Building a Semantic Search RAG Pipeline (Naive RAG)

In this notebook, we'll build a semantic search pipeline using the Haystack framework. This implementation represents a "naive" RAG (Retrieval-Augmented Generation) approach, which follows these key steps:

1. **Query Processing**: Convert user questions into vector embeddings
2. **Document Retrieval**: Find relevant documents using semantic similarity
3. **Context Building**: Create a prompt combining the question and retrieved documents
4. **Answer Generation**: Use an LLM to generate answers based on the context
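
The four steps above can be sketched without any framework, which may help before the Haystack components are introduced. This is a minimal illustration, not the notebook's implementation: the function names `embed`, `retrieve`, `build_prompt`, and `generate` are hypothetical, the bag-of-words "embedding" is a toy stand-in for a real embedding model, and `generate` is a placeholder that makes no LLM call.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Step 1 (toy): represent text as word counts instead of a learned embedding."""
    cleaned = text.lower().replace("?", "").replace(".", "")
    return Counter(cleaned.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Step 2: rank documents by similarity to the query and keep the best top_k."""
    query_vec = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, docs: list[str]) -> str:
    """Step 3: combine the retrieved documents and the question into one prompt."""
    context = "\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def generate(prompt: str) -> str:
    """Step 4 (placeholder): a real pipeline would send the prompt to an LLM here."""
    return f"[LLM answer for a prompt of {len(prompt)} characters]"


# Toy documents, invented for this sketch.
docs = [
    "Haystack is a Python framework for building LLM applications.",
    "Bananas are rich in potassium.",
    "A retriever finds documents similar to a query.",
]
question = "What is Haystack?"
top_docs = retrieve(question, docs, top_k=1)
prompt = build_prompt(question, top_docs)
answer = generate(prompt)
```

The rest of the notebook replaces each toy function with a Haystack component and lets a `Pipeline` pass the data between them.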

## What You'll Learn

- How to build a basic RAG pipeline using Haystack components
- Understanding semantic search with vector embeddings
- Techniques for connecting pipeline components
- Working with document stores and retrievers
- Using LLMs for answer generation

This represents the simplest form of RAG, providing a foundation for more advanced implementations.

# 1. Component Imports and Setup

We'll import several key components from the Haystack framework:

1. **`SentenceTransformersTextEmbedder`**
   - Converts text into vector embeddings
   - Built on the sentence-transformers library
   - Embeds the user's query so it can be compared with document embeddings

2. **`InMemoryEmbeddingRetriever`**
   - Finds relevant documents using vector similarity
   - Works with our document store
   - `top_k` sets how many documents are returned, trading precision against recall

3. **`PromptBuilder`**
   - Creates structured prompts for the LLM
   - Uses Jinja2 templates for flexible formatting
   - Combines context with user questions

4. **`OpenAIGenerator`**
   - Calls OpenAI's language models
   - Generates natural language responses
   - Takes the API key as a `Secret` rather than a hard-coded string

We'll also import `Pipeline` from Haystack, which connects these components into a single runnable graph.

In [1]:
# Continue from the previous script, assuming 'document_store' is populated.
from scripts.indexing import document_store  # Adjust the import as necessary

# Import necessary components for the query pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret
from haystack import Pipeline




Running unified indexing pipeline for web, local files, and CSV...


Error processing document 1384ec36dd6d99f90ab589732d5219b7371dac846d0f0bd89c6385189c4079c0. Keeping it, but skipping cleaning. Error: Error tokenizing data. C error: Expected 5 fields in line 5, saw 7

Error processing document 1384ec36dd6d99f90ab589732d5219b7371dac846d0f0bd89c6385189c4079c0. Keeping it, but skipping splitting. Error: Error tokenizing data. C error: Expected 5 fields in line 5, saw 7

Error processing document 1384ec36dd6d99f90ab589732d5219b7371dac846d0f0bd89c6385189c4079c0. Keeping it, but skipping splitting. Error: Error tokenizing data. C error: Expected 5 fields in line 5, saw 7

Batches: 100%|██████████| 5/5 [00:01<00:00,  2.96it/s]



# 2. Building the Naive RAG Pipeline

In this section, we'll initialize and connect the core components of our RAG pipeline. Each component plays a crucial role:

**Text Embedder Configuration**
- Uses the all-MiniLM-L6-v2 model
- Optimized for semantic similarity tasks
- Produces 384-dimensional embeddings

**Retriever Setup**
- Connected to our document store
- Returns top 3 most similar documents
- Ranks documents by embedding similarity; the similarity function (cosine or dot product) is configured on the document store
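
To make the retrieval step concrete, here is a small framework-free sketch of top-k ranking under two similarity functions. The vectors and the helper names `dot`, `cosine`, and `top_k` are invented for illustration. As far as I know, Haystack's `InMemoryDocumentStore` selects the function through its `embedding_similarity_function` argument, but treat that as something to confirm against the version you have installed. For unit-length embeddings the two functions produce the same ranking; they can diverge when vector lengths differ, as below.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product: sensitive to both direction and vector length."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: the dot product of the length-normalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def top_k(query, doc_vectors, score, k=3):
    """Return the names of the k highest-scoring documents, best first."""
    ranked = sorted(doc_vectors, key=lambda name: score(query, doc_vectors[name]), reverse=True)
    return ranked[:k]


# Made-up 2-dimensional "embeddings"; real ones here would have 384 dimensions.
query = [1.0, 0.0]
doc_vectors = {
    "aligned_short": [1.0, 0.0],   # same direction as the query, length 1
    "off_axis_long": [3.0, 4.0],   # different direction, length 5
    "orthogonal": [0.0, 2.0],      # unrelated direction
}

by_cosine = top_k(query, doc_vectors, cosine, k=2)  # ['aligned_short', 'off_axis_long']
by_dot = top_k(query, doc_vectors, dot, k=2)        # ['off_axis_long', 'aligned_short']
```

Cosine puts the vector that points the same way as the query first, while the raw dot product rewards the longer vector.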

**Prompt Engineering**
- Template tells the model to answer from the provided information
- Iterates over every retrieved document with a Jinja2 `for` loop
- Instructs the model to say so when the documents do not contain the answer

**LLM Integration**
- Uses `gpt-4o-mini` through `OpenAIGenerator`
- Reads the API key from the `OPENAI_API_KEY` environment variable via `Secret`
- Generates the answer from the prompt, which carries the retrieved context

The code below shows how these components are initialized and assembled into a working pipeline.

In [6]:

# --- 1. Initialize Query Pipeline Components ---

# Text Embedder: Embeds the user's query. It should use the same model as the document embedder
# so that query and document vectors are comparable.
text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

# Retriever: Fetches documents from the DocumentStore based on vector similarity.
retriever = InMemoryEmbeddingRetriever(document_store=document_store, top_k=3)

# PromptBuilder: Creates a prompt using the retrieved documents and the query.
# The Jinja2 template iterates through the documents and adds their content to the prompt.
prompt_template_for_pipeline = """
Given the following information, answer the user's question.
If the information is not available in the provided documents, say that you don't have enough information to answer.

Context:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}

Question: {{question}}
Answer:
"""
prompt_builder_inst = PromptBuilder(template=prompt_template_for_pipeline,
                                    required_variables="*")
# Generator: Sends the rendered prompt to the OpenAI API; the key is read from the environment.
llm_generator_inst = OpenAIGenerator(api_key=Secret.from_env_var("OPENAI_API_KEY"), model="gpt-4o-mini")


# --- 2. Build the Naive RAG Pipeline ---

naive_rag_pipeline = Pipeline()

# Add components to the pipeline
naive_rag_pipeline.add_component("text_embedder", text_embedder)
naive_rag_pipeline.add_component("retriever", retriever)
naive_rag_pipeline.add_component("prompt_builder", prompt_builder_inst)
naive_rag_pipeline.add_component("llm", llm_generator_inst)

# --- 3. Connect the Components ---

# The query embedding is sent to the retriever
naive_rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
# The retriever's documents are sent to the prompt builder
naive_rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
# The final prompt is sent to the LLM
naive_rag_pipeline.connect("prompt_builder.prompt", "llm.prompt")


<haystack.core.pipeline.pipeline.Pipeline object at 0x309f520c0>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - llm: OpenAIGenerator
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (list[float])
  - retriever.documents -> prompt_builder.documents (list[Document])
  - prompt_builder.prompt -> llm.prompt (str)

# 3. Visualizing the Pipeline

Understanding the flow of information through the pipeline is crucial. The visualization below shows:

**Data Flow Path:**
1. User question → Text Embedder
2. Embeddings → Retriever
3. Retrieved Documents → Prompt Builder
4. Final Prompt → LLM Generator

**Key Connections:**
- `text_embedder.embedding` → `retriever.query_embedding`
- `retriever.documents` → `prompt_builder.documents`
- `prompt_builder.prompt` → `llm.prompt`

Each arrow represents a data transformation step, showing how the question flows through the system to generate an answer.
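
One way to picture what the connections do is a tiny runner that copies each named output into the named input it is wired to. The sketch below is only an illustration of that routing idea: the four functions are stubs, `run` is a hypothetical helper, and it relies on the components being listed in data-flow order, whereas Haystack works out the execution order from the graph itself. It also returns every component's outputs for inspection, which is a choice made for this sketch.

```python
# Stub components: each takes keyword inputs and returns a dict of named outputs.
def text_embedder(text):
    return {"embedding": [float(len(text))]}  # stand-in for a real embedding


def retriever(query_embedding):
    return {"documents": ["doc A", "doc B"]}  # stand-in for a similarity search


def prompt_builder(documents, question):
    context = "\n".join(documents)
    return {"prompt": f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"}


def llm(prompt):
    return {"replies": [f"(reply to a {len(prompt)}-character prompt)"]}


components = {
    "text_embedder": text_embedder,
    "retriever": retriever,
    "prompt_builder": prompt_builder,
    "llm": llm,
}

# The same three connections as in the notebook, written as (source, destination).
connections = [
    ("text_embedder.embedding", "retriever.query_embedding"),
    ("retriever.documents", "prompt_builder.documents"),
    ("prompt_builder.prompt", "llm.prompt"),
]


def run(inputs):
    """Run components in listed order, forwarding outputs along the connections."""
    pending = {name: dict(inputs.get(name, {})) for name in components}
    outputs = {}
    for name, func in components.items():
        outputs[name] = func(**pending[name])
        for source, destination in connections:
            src_comp, src_socket = source.split(".")
            dst_comp, dst_socket = destination.split(".")
            if src_comp == name:
                pending[dst_comp][dst_socket] = outputs[name][src_socket]
    return outputs


question = "What is Haystack 2.0?"
result = run({
    "text_embedder": {"text": question},
    "prompt_builder": {"question": question},
})
```

Only `text_embedder.text` and `prompt_builder.question` have no incoming connection, which is why they are the only values the caller has to supply.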

In [2]:
# --- 4. Visualize the Pipeline ---
naive_rag_pipeline.draw(path="./images/naive_rag_pipeline.png")

![](./images/naive_rag_pipeline.png)

# 4. Running the Pipeline

Now we'll test our pipeline with real questions. When we run the pipeline:

1. The question is converted to embeddings
2. Similar documents are retrieved from our store
3. A prompt is constructed with the context
4. The LLM generates a natural language answer

**Input Requirements:**
- `text_embedder`: Needs the raw question text to embed
- `prompt_builder`: Needs the question to fill the `{{question}}` slot in the template

**Expected Output:**
- Natural language answer based on retrieved documents
- "No information" response if context is insufficient

Try modifying the question to explore different types of queries!

In [3]:
# --- 5. Run the Pipeline ---

question = "Which company released the Claude 3 model family?"

# The run method requires inputs for the components that don't have an incoming connection.
# In this case, 'text_embedder' needs the 'text' (the query) and 'prompt_builder' needs the 'question'.
result = naive_rag_pipeline.run({
    "text_embedder": {"text": question},
    "prompt_builder": {"question": question}
})

print(f"\nQuestion: {question}")
print(f"Answer: {result['llm']['replies']}")

Batches: 100%|██████████| 1/1 [00:00<00:00,  5.32it/s]



Question: Which company released the Claude 3 model family?
Answer: ['The Claude 3 model family was released by Anthropic.']
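
The answer above is printed with brackets because `replies` is a list. A small pair of helpers can build the inputs and unwrap the first reply; the sketch below is one possible shape for them. `build_inputs` and `first_reply` are hypothetical names, and `example_result` copies only the part of the result that the output above shows.

```python
def build_inputs(question: str) -> dict:
    """Inputs for the two components that have no incoming connection."""
    return {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
    }


def first_reply(result: dict, default: str = "") -> str:
    """Return the first LLM reply as a plain string, or a default if there is none."""
    replies = result.get("llm", {}).get("replies", [])
    return replies[0] if replies else default


# Shape taken from the printed output above, not from a live pipeline run.
example_result = {
    "llm": {"replies": ["The Claude 3 model family was released by Anthropic."]}
}
answer = first_reply(example_result)
```

With the pipeline from this notebook, the equivalent call would be `first_reply(naive_rag_pipeline.run(build_inputs(question)))`.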


# 5. Exploring Different Data Sources

Our pipeline can handle various types of questions across different data sources. Let's explore:

**Web Content Queries**
- Questions about Haystack framework
- Technical documentation queries
- Current technology trends

The example below demonstrates how the pipeline handles queries about technical documentation.

In [4]:
# Another example question using the web data
question_2 = "What is Haystack 2.0?"
result_2 = naive_rag_pipeline.run({
    "text_embedder": {"text": question_2},
    "prompt_builder": {"question": question_2}
})
print(f"\nQuestion: {question_2}")
print(f"Answer: {result_2['llm']['replies']}")

Batches: 100%|██████████| 1/1 [00:00<00:00, 69.67it/s]



Question: What is Haystack 2.0?
Answer: ['Haystack 2.0 is an open-source Python framework for building production-ready LLM (Large Language Model) applications. It allows for the implementation of composable AI systems that are easy to use, customize, extend, optimize, evaluate, and deploy to production. This version is a major rework of the previous version, with the goal of providing flexibility and a common component interface for seamless interaction between different components. Haystack 2.0 integrates with almost all major model providers and databases and enables users to create custom components and foster an open ecosystem around its framework.']


# 6. Querying Structured Data

The pipeline is equally capable of handling structured data from CSV files. This demonstrates:

**Advantages:**
- Unified interface for different data types
- Semantic understanding of tabular data
- Natural language queries for structured information

**Example Use Cases:**
- Release dates of AI models
- Technical specifications
- Historical data queries

The example below shows how to query specific information from our CSV dataset.

In [5]:
# Another example question using the CSV data
# Use new variable names so the earlier web-data result is not overwritten.
question_3 = "When was Gemini released?"
result_3 = naive_rag_pipeline.run({
    "text_embedder": {"text": question_3},
    "prompt_builder": {"question": question_3}
})
print(f"\nQuestion: {question_3}")
print(f"Answer: {result_3['llm']['replies']}")

Batches: 100%|██████████| 1/1 [00:00<00:00,  7.62it/s]



Question: When was Gemini released?
Answer: ['Gemini was released in 2023.']
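
This notebook does not show how the CSV file was indexed. One common approach, sketched here as an assumption rather than as the method used by the indexing script, is to turn each row into a short line of text before embedding it, so that a question such as the one above can match the row semantically. The column names and the two rows are illustrative and are not the notebook's actual CSV; `rows_to_sentences` is a hypothetical helper.

```python
import csv
import io

# Illustrative CSV content; the real file's columns are not shown in this notebook.
csv_text = """model,company,release_year
Claude 3,Anthropic,2024
Gemini,Google,2023
"""


def rows_to_sentences(raw: str) -> list[str]:
    """Render each CSV row as 'column: value' pairs that an embedder can consume."""
    reader = csv.DictReader(io.StringIO(raw))
    return [
        "; ".join(f"{column}: {value}" for column, value in row.items())
        for row in reader
    ]


sentences = rows_to_sentences(csv_text)
# sentences[1] == "model: Gemini; company: Google; release_year: 2023"
```

Keeping the column names in the text gives the embedding model some context for otherwise bare values such as a year.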
