# Building a Document Indexing Pipeline with Haystack

Welcome to this tutorial notebook on building a document indexing pipeline with Haystack! In this notebook, you'll learn how to create a pipeline that processes and indexes documents so they can later be used for question answering and information retrieval.

## Learning Objectives
By the end of this notebook, you will:
- Understand how to set up a Haystack document processing pipeline
- Learn to work with different document formats (TXT, CSV, PDF)
- Create and configure an in-memory document store
- Implement document preprocessing and indexing
- Test your indexed documents with basic queries

## Prerequisites
- Basic Python knowledge
- Understanding of basic NLP concepts
- Familiarity with Jupyter notebooks

Let's get started!

In [12]:
import pandas as pd
from pathlib import Path
from dotenv import load_dotenv

load_dotenv(".env")

True

## Setup and Package Installation

In this section, we'll import the necessary packages for our indexing pipeline. The imports follow the Haystack 2.x package layout. Here's what each one provides:

- `haystack`: The core `Pipeline` and `Document` classes
- `haystack.document_stores.in_memory`: The `InMemoryDocumentStore` backend for storing our processed documents
- `haystack.components.fetchers`, `haystack.components.routers`, `haystack.components.converters`: Components that fetch web content, route files by type, and convert them into `Document` objects
- `haystack.components.preprocessors`: Cleaners and splitters for text and CSV data
- `haystack.components.embedders`: Components that compute vector embeddings for documents
- `haystack.components.joiners` and `haystack.components.writers`: Components that merge document lists and write them to the document store

In [13]:
# Import core Haystack classes
from haystack import Pipeline, Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from haystack.components.joiners import DocumentJoiner

# Import components for data fetching and conversion
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import (
    PyPDFToDocument,
    TextFileToDocument,
    HTMLToDocument,
)
from haystack.components.routers import FileTypeRouter

# Import components for preprocessing
from haystack.components.preprocessors import (
    DocumentCleaner,
    DocumentSplitter,
    CSVDocumentCleaner,
    CSVDocumentSplitter
)

# Import components for embedding
from haystack.components.embedders import SentenceTransformersDocumentEmbedder


## Document Loading and Preprocessing

In this section, we'll set up our document loading pipeline. We'll work with multiple document formats:

1. Text files (.txt)
2. CSV files
3. PDF documents
4. Web pages (HTML)

The data we'll be working with includes:
- A text file introducing Haystack
- A CSV file containing information about LLM models
- A sample PDF document
- A Haystack blog post fetched from the web

We'll create a preprocessing pipeline that will:
1. Load these documents from different sources
2. Convert them into a unified format
3. Clean and prepare them for indexing
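
To make the loading step more concrete, here is a minimal sketch of grouping file paths by MIME type using only the standard library. In the pipeline built later in this notebook, Haystack's `FileTypeRouter` plays this role. The function name `route_by_mime_type` and the `"unclassified"` bucket are assumptions made for illustration, and the sketch does not reproduce the component's actual implementation.

```python
import mimetypes
from collections import defaultdict
from pathlib import Path


def route_by_mime_type(paths, accepted_types):
    """Group paths by guessed MIME type; everything else goes to 'unclassified'."""
    routed = defaultdict(list)
    for path in paths:
        # guess_type looks only at the file extension, not at the file contents
        mime_type, _ = mimetypes.guess_type(str(path))
        key = mime_type if mime_type in accepted_types else "unclassified"
        routed[key].append(Path(path))
    return dict(routed)


# Hypothetical file names mirroring the sample data used in this notebook
routed = route_by_mime_type(
    ["haystack_intro.txt", "sample.pdf", "llm_models.csv"],
    accepted_types=["text/plain", "application/pdf"],
)
for mime_type, group in routed.items():
    print(mime_type, [p.name for p in group])
```

Because CSV is not among the accepted types here, the CSV file ends up unclassified, which is one reason to process tabular data on a separate path.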


In [14]:
# --- 1. Create Sample Data Files ---
# Create a directory to hold our source files
data_dir = Path("data_for_indexing")
data_dir.mkdir(exist_ok=True)

# Create a sample text file
text_file_path = data_dir / "haystack_intro.txt"
text_file_path.write_text(
    "Haystack is an open-source framework by deepset for building production-ready LLM applications. "
    "It enables developers to create retrieval-augmented generative pipelines and state-of-the-art search systems."
)

205

In [15]:
# Create a sample CSV file with some empty rows/columns for cleaning
csv_content = """Company,Model,Release Year,,Notes
OpenAI,GPT-4,2023,,Generative Pre-trained Transformer 4
,,,
Google,Gemini,2023,,A family of multimodal models
Anthropic,Claude 3,2024,,"Includes Opus, Sonnet, and Haiku models"
"""
csv_file_path = data_dir / "llm_models.csv"
csv_file_path.write_text(csv_content)

# Define a sample URL to fetch
web_url = "https://haystack.deepset.ai/blog/haystack-2-release"
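
The sample CSV deliberately contains an empty row and an empty column. As a rough illustration of what cleaning such data means, the sketch below drops fully blank rows and columns with the standard `csv` module. The helper `drop_empty_rows_and_columns` is hypothetical, and its rules (for example, treating an unnamed header cell as blank) are assumptions that may differ from how `CSVDocumentCleaner` behaves.

```python
import csv
import io


def drop_empty_rows_and_columns(csv_text):
    """Return the CSV as a list of rows with blank rows and blank columns removed."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    width = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells
    padded = [row + [""] * (width - len(row)) for row in rows]
    # Drop rows in which every cell is blank
    kept_rows = [row for row in padded if any(cell.strip() for cell in row)]
    # Keep only the columns that have at least one non-blank cell
    kept_cols = [i for i in range(width) if any(row[i].strip() for row in kept_rows)]
    return [[row[i] for i in kept_cols] for row in kept_rows]


raw = """Company,Model,Release Year,,Notes
OpenAI,GPT-4,2023,,Generative Pre-trained Transformer 4
,,,
Google,Gemini,2023,,A family of multimodal models
"""

cleaned = drop_empty_rows_and_columns(raw)
for row in cleaned:
    print(row)
```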

In [16]:
# For this example, we'll skip the actual PDF creation and assume one exists.
# You can place any PDF file in the 'data_for_indexing' directory and name it 'sample.pdf'.
# For a runnable example, we will simulate its path.
pdf_file_path = data_dir / "sample.pdf"
# In a real scenario, you would have this file. For this script to run, we'll check for it.
if not pdf_file_path.exists():
    print(f"Warning: PDF file not found at {pdf_file_path}. The PDF processing branch will not run.")
    # Create a dummy file to avoid path errors, but it won't be processed as PDF
    pdf_file_path.touch()

In [17]:
# DocumentStore: For this example, we use an in-memory store.
# For production, you would use a persistent vector database like Qdrant, Pinecone, or Weaviate.
document_store = InMemoryDocumentStore()

# FileTypeRouter: Directs files to the correct converter based on their MIME type. 
file_type_router = FileTypeRouter(mime_types=["text/plain", "application/pdf", "text/html"])

# Converters: One for each file type we want to handle.
text_file_converter = TextFileToDocument()
pdf_converter = PyPDFToDocument()
html_converter = HTMLToDocument()

# LinkContentFetcher: Fetches content from URLs and returns it as ByteStream objects. 
link_fetcher = LinkContentFetcher()

# DocumentJoiner: Merges lists of Documents from different paths into one. 
document_joiner = DocumentJoiner()

# Preprocessors for Text Data:
# DocumentCleaner: Removes extra whitespace, etc. 
cleaner = DocumentCleaner()
# DocumentSplitter: Chunks documents into smaller pieces. 
text_splitter = DocumentSplitter(split_by="word", split_length=150, split_overlap=20)

# Preprocessors for Tabular Data (CSV):
# CSVDocumentCleaner: Removes empty rows and columns from CSV data.
csv_cleaner = CSVDocumentCleaner()
# CSVDocumentSplitter: Splits a large CSV into smaller tables or row-wise documents. 
# Here, we split each row into a separate Document.
csv_splitter = CSVDocumentSplitter(split_mode="row-wise")

# Embedder: Creates vector representations of the documents.
# Use the same embedding model at query time so that document and query vectors are comparable.
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

# DocumentWriter: Writes the final documents to the DocumentStore.
writer = DocumentWriter(document_store)
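
The comment about matching the embedding model at query time can be illustrated with a small, library-free sketch. Retrieval typically compares a query vector to document vectors with a similarity measure such as cosine similarity, and that only works when both vectors come from the same embedding space. The toy vectors below are invented for illustration; the real model in this notebook produces 384-dimensional vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"Dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, not real model output)
doc_vector = [0.2, 0.1, 0.7]
query_same_model = [0.25, 0.05, 0.65]
query_other_model = [0.3, 0.9]  # a different model may use a different dimension

print(round(cosine_similarity(doc_vector, query_same_model), 3))
try:
    cosine_similarity(doc_vector, query_other_model)
except ValueError as err:
    print(err)
```

A dimension mismatch is only the most visible failure. Two different models that happen to share a dimension still place texts in unrelated vector spaces, so their similarity scores would not be meaningful.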


## Building the Indexing Pipeline

Now we'll construct our document processing pipeline. This pipeline will consist of several key components:

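1. **Fetcher, Router, and Converters**
   - LinkContentFetcher: Downloads web pages as byte streams
   - FileTypeRouter: Sends each local file to the matching converter based on its MIME type
   - TextFileToDocument: Processes plain text files
   - PyPDFToDocument: Extracts text from PDF documents
   - HTMLToDocument: Extracts text from the fetched web pages

2. **Preprocessors and Embedder**
   - DocumentJoiner: Merges the converter outputs into a single list of documents
   - DocumentCleaner: Cleans and normalizes text
   - DocumentSplitter: Splits documents into smaller chunks
   - SentenceTransformersDocumentEmbedder: Computes an embedding for each chunk

3. **DocumentWriter and DocumentStore**
   - DocumentWriter: Writes the processed documents to the store
   - InMemoryDocumentStore: In-memory backend that enables searching and retrieval
   - Maintains document metadata alongside content and embeddings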
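
To show what splitting with overlap means in practice, here is a simplified sketch of word-based chunking. The parameter names mirror the `split_length` and `split_overlap` arguments used when the splitter was created above, but the function `split_words` is a hypothetical stand-in, and its handling of the final chunk is an assumption rather than a description of `DocumentSplitter` internals.

```python
def split_words(text, split_length, split_overlap):
    """Split text into chunks of split_length words that share split_overlap words."""
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current window already reaches the end of the text
        if start + split_length >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in split_words(sample, split_length=4, split_overlap=1):
    print(chunk)
```

With ten words, a length of 4, and an overlap of 1, each chunk starts with the last word of the previous one. The overlap keeps some shared context across chunk boundaries at the cost of storing some words twice.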

Let's see how these components work together:

In [18]:
# --- 3. Build the Indexing Pipeline ---

indexing_pipeline = Pipeline()

# Add all components to the pipeline with unique names
indexing_pipeline.add_component("link_fetcher", link_fetcher)
indexing_pipeline.add_component("html_converter", html_converter)
indexing_pipeline.add_component("file_type_router", file_type_router)
indexing_pipeline.add_component("text_file_converter", text_file_converter)
indexing_pipeline.add_component("pdf_converter", pdf_converter)
indexing_pipeline.add_component("document_joiner", document_joiner)
indexing_pipeline.add_component("cleaner", cleaner)
indexing_pipeline.add_component("text_splitter", text_splitter)
indexing_pipeline.add_component("doc_embedder", doc_embedder)
indexing_pipeline.add_component("writer", writer)

In [19]:
# Note on CSV files: the CSV cleaner and splitter are not part of this pipeline.
# For clarity, we process the CSV separately and add the result to the store.
# In a single large pipeline, you would route CSV files the same way
# as the text and PDF files below.

# --- 4. Connect the Pipeline Components ---

# Web data branch
indexing_pipeline.connect("link_fetcher.streams", "html_converter.sources")
indexing_pipeline.connect("html_converter.documents", "document_joiner.documents")

# Local file data branch
indexing_pipeline.connect("file_type_router.text/plain", "text_file_converter.sources")
indexing_pipeline.connect("file_type_router.application/pdf", "pdf_converter.sources")
indexing_pipeline.connect("text_file_converter.documents", "document_joiner.documents")
indexing_pipeline.connect("pdf_converter.documents", "document_joiner.documents")

# Main processing path after joining
indexing_pipeline.connect("document_joiner", "cleaner")
indexing_pipeline.connect("cleaner", "text_splitter")
indexing_pipeline.connect("text_splitter", "doc_embedder")
indexing_pipeline.connect("doc_embedder", "writer")

<haystack.core.pipeline.pipeline.Pipeline object at 0x32350fe30>
🚅 Components
  - link_fetcher: LinkContentFetcher
  - html_converter: HTMLToDocument
  - file_type_router: FileTypeRouter
  - text_file_converter: TextFileToDocument
  - pdf_converter: PyPDFToDocument
  - document_joiner: DocumentJoiner
  - cleaner: DocumentCleaner
  - text_splitter: DocumentSplitter
  - doc_embedder: SentenceTransformersDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - link_fetcher.streams -> html_converter.sources (list[ByteStream])
  - html_converter.documents -> document_joiner.documents (list[Document])
  - file_type_router.text/plain -> text_file_converter.sources (list[Union[str, Path, ByteStream]])
  - file_type_router.application/pdf -> pdf_converter.sources (list[Union[str, Path, ByteStream]])
  - text_file_converter.documents -> document_joiner.documents (list[Document])
  - pdf_converter.documents -> document_joiner.documents (list[Document])
  - document_joiner.documents -> clean

Let's visualize the pipeline:

In [20]:
Path("images").mkdir(exist_ok=True)  # make sure the output directory exists
indexing_pipeline.draw(path="./images/indexing_pipeline.png")

![](./images/indexing_pipeline.png)

In [21]:
# --- 5. Run the Pipeline ---

print("Running indexing pipeline for web and local files...")
# Note: The PDF is skipped if it is missing or empty (e.g. the placeholder created earlier).
file_paths_to_process = [text_file_path]
if pdf_file_path.exists() and pdf_file_path.stat().st_size > 0:
    file_paths_to_process.append(pdf_file_path)
else:
    print(f"Skipping PDF file: {pdf_file_path}")

indexing_pipeline.run({
    "link_fetcher": {"urls": [web_url]},
    "file_type_router": {"sources": file_paths_to_process}
})

Running indexing pipeline for web and local files...
Skipping PDF file: data_for_indexing/sample.pdf


Batches: 100%|██████████| 1/1 [00:00<00:00,  5.96it/s]


{'writer': {'documents_written': 14}}

## Understanding the Results

Let's examine what our pipeline has processed. We'll look at:
1. The number of documents processed
2. How the documents were split
3. The metadata that was extracted

This will help us verify that our pipeline is working as expected and give us insights into our indexed documents.
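
Before running the real query, here is a rough sketch of the kind of check this section is after: counting how many chunks each source produced. It uses plain dictionaries as stand-ins for `Document.meta`. The `url` and `split_id` keys match the output shown in this notebook, while `file_path` is an assumed key for local files, and `source_of` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical stand-ins for the meta dictionaries of indexed chunks
chunk_metas = [
    {"url": "https://haystack.deepset.ai/blog/haystack-2-release", "split_id": 0},
    {"url": "https://haystack.deepset.ai/blog/haystack-2-release", "split_id": 1},
    {"file_path": "data_for_indexing/haystack_intro.txt", "split_id": 0},
]


def source_of(meta):
    """Pick a human-readable source label from a chunk's metadata."""
    return meta.get("url") or meta.get("file_path") or "unknown"


chunks_per_source = Counter(source_of(meta) for meta in chunk_metas)
for source, count in chunks_per_source.items():
    print(f"{count} chunk(s) from {source}")
```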

First, let's query our document store to see what we've indexed:

In [22]:
# --- 7. Verify the DocumentStore ---
doc_count = document_store.count_documents()
print(f"\nTotal documents in DocumentStore: {doc_count}")
print("Sample document from the store:")
print(document_store.filter_documents())


Total documents in DocumentStore: 14
Sample document from the store:
[Document(id=be2fb4afe8f3e531ae2e97314778b92789c44d794c981d67bdea8658cf3fe51e, content: 'Haystack 2.0: The Composable Open-Source LLM Framework
Meet Haystack 2.0, a more flexible, customiza...', meta: {'content_type': 'text/html', 'url': 'https://haystack.deepset.ai/blog/haystack-2-release', 'source_id': '0b188f4690ab3496d2270baf378be9fde19707e9b1ece003123c0442af918bb7', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0, '_split_overlap': [{'doc_id': 'ff9dcb8208955755eb458bdf693e12ba8c2dd7f27de0bcb51e47e474255ad47b', 'range': (0, 124)}]}, embedding: vector of size 384), Document(id=ff9dcb8208955755eb458bdf693e12ba8c2dd7f27de0bcb51e47e474255ad47b, content: 'user before or not. You can get started by installing haystack-ai
, our new package for Haystack 2.0...', meta: {'content_type': 'text/html', 'url': 'https://haystack.deepset.ai/blog/haystack-2-release', 'source_id': '0b188f4690ab3496d2270baf378be9fde19707e9b1e