# Building a Document Indexing Pipeline with Haystack

This notebook shows how to build a document indexing pipeline with Haystack. The pipeline processes and indexes several document types so they can be used in retrieval-augmented generation (RAG) applications.

## What you'll learn
After completing this notebook, you will be able to:
- Build a multi-format document processing pipeline using Haystack
- Handle different document types (TXT, CSV, PDF, HTML) in a unified way
- Implement proper document cleaning and splitting strategies
- Create and configure an efficient document store
- Generate and store document embeddings for semantic search
- Test and validate your indexed documents

## Key Concepts
Before we begin, let's understand some core concepts:
- **Document Store**: A database that holds our processed documents and their embeddings
- **Document Processing**: Converting raw files into a standardized format
- **Embeddings**: Dense vector representations of text for semantic search
- **Pipeline Components**: Modular pieces that handle specific processing tasks

## Prerequisites
- Basic Python knowledge
- Understanding of basic NLP concepts
- Familiarity with Jupyter notebooks
- Haystack library installed

Let's dive in and build our indexing pipeline step by step!

## 1. Setting Up Our Environment

Before we build our pipeline, we need to import the necessary components. Let's understand what each package brings to our indexing system:

In [1]:
import pandas as pd
from pathlib import Path
from dotenv import load_dotenv

load_dotenv(".env")

True



### Core Components
- **`Pipeline`**: The main orchestrator that connects all components
- **`Document`**: The fundamental data structure for text in Haystack
- **`InMemoryDocumentStore`**: Our database for storing processed documents

### Document Processing Components
- **`DocumentWriter`**: Handles saving documents to our store
- **`DocumentJoiner`**: Combines documents from different sources
- **`FileTypeRouter`**: Routes files to appropriate processors based on type

### Format-Specific Components
- **`LinkContentFetcher`**: Retrieves content from web URLs
- **`PyPDFToDocument`**: Processes PDF files
- **`TextFileToDocument`**: Handles plain text files
- **`HTMLToDocument`**: Processes HTML content

### Text Processing Components
- **`DocumentCleaner`**: Removes unwanted content and normalizes text
- **`DocumentSplitter`**: Breaks documents into manageable chunks
- **`CSVDocumentCleaner` / `CSVDocumentSplitter`**: Specialized handlers for tabular data

### Embedding Component
- **`SentenceTransformersDocumentEmbedder`**: Converts text to vector representations

In the next cell, we'll import these components and set up our environment.

In [2]:
# Import core Haystack classes
from haystack import Pipeline, Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from haystack.components.joiners import DocumentJoiner

# Import components for data fetching and conversion
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import (
    PyPDFToDocument,
    TextFileToDocument,
    HTMLToDocument,
)
from haystack.components.routers import FileTypeRouter

# Import components for preprocessing
from haystack.components.preprocessors import (
    DocumentCleaner,
    DocumentSplitter,
    CSVDocumentCleaner,
    CSVDocumentSplitter
)

# Import components for embedding
from haystack.components.embedders import SentenceTransformersDocumentEmbedder




## 2. Document Loading and Preprocessing

In this section, we'll prepare the source data for our pipeline. We'll work with multiple document formats:

1. Text files (.txt)
2. CSV files
3. PDF documents
4. Web pages (HTML)

The data we'll be working with includes:
- A text file introducing Haystack
- A CSV file containing information about LLM models
- A sample PDF document (optional; it is skipped if missing or empty)
- The Haystack 2.0 release blog post, fetched from the web

We'll create a preprocessing pipeline that will:
1. Load these documents from different sources
2. Convert them into a unified format
3. Clean and prepare them for indexing


In [3]:
# --- 1. Create Sample Data Files ---
# Create a directory to hold our source files
data_dir = Path("data_for_indexing")
data_dir.mkdir(exist_ok=True)

# Create a sample text file
text_file_path = data_dir / "haystack_intro.txt"
text_file_path.write_text(
    "Haystack is an open-source framework by deepset for building production-ready LLM applications. "
    "It enables developers to create retrieval-augmented generative pipelines and state-of-the-art search systems."
)

205

Next, create a mock CSV file with an empty row and an empty column, so the CSV cleaner has something to remove.

In [22]:
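# Create a sample CSV file with an empty row and an empty column for cleaning.
# Note: the Notes field in the last row contains unquoted commas, which causes the
# "Expected 5 fields in line 5, saw 7" parse error in the pipeline run output below.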
csv_content = """Company,Model,Release Year,,Notes
OpenAI,GPT-4,2023,,Generative Pre-trained Transformer 4
,,,
Google,Gemini,2023,,A family of multimodal models
Anthropic,Claude 3,2024,,Includes Opus, Sonnet, and Haiku models
"""
csv_file_path = data_dir / "llm_models.csv"
csv_file_path.write_text(csv_content)


209

Define the web page that the pipeline will fetch.

In [None]:
# Define a sample URL to fetch
web_url = "https://haystack.deepset.ai/blog/haystack-2-release"

Set up the PDF path and handle the case where the file does not exist.

In [5]:
# For this example, we'll skip the actual PDF creation and assume one exists.
# You can place any PDF file in the 'data_for_indexing' directory and name it 'sample.pdf'.
# For a runnable example, we will simulate its path.
pdf_file_path = data_dir / "sample.pdf"
# In a real scenario, you would have this file. For this script to run, we'll check for it.
if not pdf_file_path.exists():
    print(f"Warning: PDF file not found at {pdf_file_path}. The PDF processing branch will not run.")
    # Create a dummy file to avoid path errors, but it won't be processed as PDF
    pdf_file_path.touch()
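
An empty placeholder created with `touch()` is not a valid PDF. One way to tell real PDFs from placeholders, sketched below, is to check for the `%PDF-` signature at the start of the file. The helper name `looks_like_pdf` is hypothetical and not part of Haystack; treat this as an optional extra check rather than something the pipeline requires.

```python
from pathlib import Path

PDF_SIGNATURE = b"%PDF-"  # PDF files begin with this marker


def looks_like_pdf(path: Path) -> bool:
    """Return True if the file exists and starts with the PDF signature."""
    if not path.is_file():
        return False
    with path.open("rb") as handle:
        return handle.read(len(PDF_SIGNATURE)) == PDF_SIGNATURE


# An empty placeholder or a missing file is reported as "not a PDF".
print(looks_like_pdf(Path("data_for_indexing") / "sample.pdf"))
```

A check like this could replace a size-only test when deciding whether to pass the file to the pipeline.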

## 3. Initializing Pipeline Components

Now we'll create instances of each component needed for our pipeline. Let's understand each component's role and configuration:

### Storage Components
- **`InMemoryDocumentStore`**: In-memory database for documents and embeddings
  - Efficient for testing and development
  - Supports vector similarity search
  - Non-persistent (clears on restart)

### Routing Components
- **`FileTypeRouter`**: Smart file dispatcher
  - Routes files based on MIME types
  - Configured here for: `text/plain`, `application/pdf`, `text/html`, `text/csv`
  - Ensures proper format-specific processing
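
To make MIME-based routing concrete, here is a minimal sketch that groups file paths by their guessed MIME type using Python's standard `mimetypes` module. It only illustrates the idea: the function `route_by_mime_type` and the file name `notes.xyz123` are hypothetical, and `FileTypeRouter` may detect types differently.

```python
import mimetypes
from collections import defaultdict
from pathlib import Path


def route_by_mime_type(paths, accepted):
    """Group paths by guessed MIME type; everything else goes to 'unclassified'."""
    routed = defaultdict(list)
    for path in paths:
        # guess_type() infers the type from the file extension only.
        mime_type, _ = mimetypes.guess_type(Path(path).name)
        key = mime_type if mime_type in accepted else "unclassified"
        routed[key].append(Path(path))
    return dict(routed)


routes = route_by_mime_type(
    ["haystack_intro.txt", "sample.pdf", "notes.xyz123"],
    accepted={"text/plain", "application/pdf", "text/html", "text/csv"},
)
print(routes)
```

Because the guess is based on the extension, a misnamed file would end up in the wrong group, which is one reason to keep file names and contents consistent.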

### Document Conversion
- **`TextFileToDocument`**: Converts plain text
- **`PyPDFToDocument`**: Extracts text from PDFs
- **`HTMLToDocument`**: Processes web content
- **`TextFileToDocument`** (second instance): Reads CSV files as raw text for the CSV-specific preprocessors

### Document Processing
- **`DocumentCleaner`**: Text normalization
  - Removes extra whitespace
  - Removes empty lines
  - Used here with its default settings

- **`DocumentSplitter`**: Chunking strategy
  - Splits by words (150 words per chunk)
  - 20-word overlap for context
  - The overlap keeps some surrounding context at each chunk boundary
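
The chunking strategy above can be sketched in plain Python. This is an illustration of word-window splitting with overlap under the same settings (150 words, 20-word overlap), not the `DocumentSplitter` implementation; the function name `split_words` is hypothetical.

```python
def split_words(text, split_length=150, split_overlap=20):
    """Split text into word windows where consecutive windows share some words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample_text = " ".join(f"w{i}" for i in range(400))
chunks = split_words(sample_text)
print(len(chunks))  # 3 windows: words 0-149, 130-279, 260-399
```

With these settings each window advances by 130 words, so the last 20 words of one chunk reappear as the first 20 words of the next.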

### Embedding Generation
- **`SentenceTransformersDocumentEmbedder`**:
  - Model: all-MiniLM-L6-v2
  - Creates 384-dimensional embeddings
  - Optimized for semantic search
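
To show how embeddings support semantic search, the sketch below ranks two chunks against a query by cosine similarity. The 3-dimensional vectors are made-up stand-ins for the 384-dimensional embeddings, and cosine similarity is only one common choice of similarity function; the document store may be configured to use a different one.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, for illustration only).
query = [1.0, 0.0, 1.0]
chunk_vectors = {"chunk_a": [0.9, 0.1, 0.8], "chunk_b": [0.0, 1.0, 0.0]}

# Rank chunks from most to least similar to the query.
ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query, chunk_vectors[name]),
    reverse=True,
)
print(ranked)
```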

The following cell initializes all these components with our chosen configurations.

In [None]:
# DocumentStore:
document_store = InMemoryDocumentStore()

# FileTypeRouter: Directs files to the correct converter based on their MIME type.
file_type_router = FileTypeRouter(mime_types=["text/plain", "application/pdf", "text/html", "text/csv"])

# Converters: One for each file type we want to handle.
text_converter = TextFileToDocument()
pdf_converter = PyPDFToDocument()
html_converter = HTMLToDocument()
csv_converter = TextFileToDocument()

# LinkContentFetcher: Fetches content from URLs.
link_fetcher = LinkContentFetcher()

# DocumentJoiners:
unstructured_doc_joiner = DocumentJoiner()
# This joiner will gather documents from *all* processing branches
# (the split text docs and the split csv docs) before embedding.
final_doc_joiner = DocumentJoiner()

# Preprocessors for Text Data:
text_cleaner = DocumentCleaner()
text_splitter = DocumentSplitter(split_by="word", split_length=150, split_overlap=20)

# Preprocessors for Tabular Data (CSV):
# These will now be part of the main pipeline.
csv_cleaner = CSVDocumentCleaner()
csv_splitter = CSVDocumentSplitter(split_mode="row-wise")

# Embedder:
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

# DocumentWriter:
writer = DocumentWriter(document_store)


## 4. Constructing the Pipeline

We'll now assemble our components into a complete indexing pipeline. Our pipeline architecture follows these principles:

### Pipeline Architecture
1. **Parallel Processing Branches**
   - Unstructured data branch (Web, TXT, PDF)
   - Structured data branch (CSV)
   - Each branch optimized for its data type

2. **Document Flow**
   - Input → Conversion → Cleaning → Splitting → Embedding → Storage
   - Specialized handling for each format
   - Joins documents at strategic points

3. **Component Naming**
   - Each component gets a unique identifier
   - Names reflect component functionality
   - Helps in debugging and monitoring

### Key Design Decisions
- **Split Processing**: Separate paths for structured/unstructured data
- **Document Joining**: Strategic combination points for efficiency
- **Embedding Strategy**: Single embedder for consistency
- **Error Handling**: Files whose MIME type is not configured go to the router's `unclassified` output instead of a converter

The following cell implements this architecture:

In [21]:

indexing_pipeline = Pipeline()

# Add all components to the pipeline with unique names
indexing_pipeline.add_component("link_fetcher", link_fetcher)
indexing_pipeline.add_component("html_converter", html_converter)
indexing_pipeline.add_component("file_type_router", file_type_router)
indexing_pipeline.add_component("text_converter", text_converter)
indexing_pipeline.add_component("pdf_converter", pdf_converter)
indexing_pipeline.add_component("unstructured_doc_joiner", unstructured_doc_joiner)
indexing_pipeline.add_component("text_cleaner", text_cleaner)
indexing_pipeline.add_component("text_splitter", text_splitter)
indexing_pipeline.add_component("doc_embedder", doc_embedder)
indexing_pipeline.add_component("writer", writer)

# Add the CSV components and the final joiner to the pipeline
indexing_pipeline.add_component("csv_converter", csv_converter)
indexing_pipeline.add_component("csv_cleaner", csv_cleaner)
indexing_pipeline.add_component("csv_splitter", csv_splitter)
indexing_pipeline.add_component("final_doc_joiner", final_doc_joiner)

## 5. Connecting Pipeline Components

This is where we define the flow of data through our pipeline. The connections create two main processing branches that eventually merge:

### Unstructured Data Branch
1. **Web Content Processing**
   - URL → `LinkContentFetcher` → `HTMLToDocument`
   - Fetches the page and extracts its text
   - Keeps the source URL and content type in the document metadata

2. **Local File Processing**
   - Files → FileTypeRouter → Appropriate Converter
   - Automatic format detection
   - Specialized handling per type

3. **Text Processing**
   - Joining → Cleaning → Splitting
   - Web, TXT, and PDF documents are joined first so they share one cleaner and one splitter
   - Produces 150-word chunks with a 20-word overlap

### Structured Data Branch (CSV)
1. **CSV Processing**
   - CSV files → `TextFileToDocument` → `CSVDocumentCleaner` → `CSVDocumentSplitter`
   - Row-wise splitting
   - Metadata preservation
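
The CSV branch can be pictured with the standard `csv` module: drop rows and columns that are entirely empty, then emit one chunk per remaining row. This is a rough sketch of the idea and not how `CSVDocumentCleaner` or `CSVDocumentSplitter` are implemented; the helper names `clean_rows` and `split_row_wise` are hypothetical.

```python
import csv
import io


def clean_rows(rows):
    """Drop rows and columns whose cells are all empty."""
    rows = [row for row in rows if any(cell.strip() for cell in row)]
    if not rows:
        return []
    width = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    rows = [row + [""] * (width - len(row)) for row in rows]
    keep = [i for i in range(width) if any(row[i].strip() for row in rows)]
    return [[row[i] for i in keep] for row in rows]


def split_row_wise(rows):
    """Serialize each row back to CSV text as its own chunk."""
    chunks = []
    for row in rows:
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerow(row)
        chunks.append(buffer.getvalue())
    return chunks


raw_csv = (
    "Company,Model,Release Year,,Notes\n"
    "OpenAI,GPT-4,2023,,Generative Pre-trained Transformer 4\n"
    ",,,\n"
    "Google,Gemini,2023,,A family of multimodal models\n"
)
cleaned = clean_rows(list(csv.reader(io.StringIO(raw_csv))))
row_chunks = split_row_wise(cleaned)
print(row_chunks)
```

In this sketch the header line is treated as just another row, so it becomes its own chunk.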

### Final Processing
1. **Document Joining**
   - Combines all processed documents
   - Maintains source information
   - Prepares for embedding

2. **Embedding and Storage**
   - Vector generation
   - Document store writing
   - Index updating
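
The data flow described above forms a directed acyclic graph. As a sanity check, the sketch below writes the connections down as a plain dictionary (component names taken from this notebook) and uses the standard library's `graphlib` to compute one valid execution order. It is independent of Haystack and only approximates how a pipeline schedules its components.

```python
from graphlib import TopologicalSorter

# Maps each component to the components that feed it.
predecessors = {
    "html_converter": {"link_fetcher"},
    "text_converter": {"file_type_router"},
    "pdf_converter": {"file_type_router"},
    "csv_converter": {"file_type_router"},
    "unstructured_doc_joiner": {"html_converter", "text_converter", "pdf_converter"},
    "text_cleaner": {"unstructured_doc_joiner"},
    "text_splitter": {"text_cleaner"},
    "csv_cleaner": {"csv_converter"},
    "csv_splitter": {"csv_cleaner"},
    "final_doc_joiner": {"text_splitter", "csv_splitter"},
    "doc_embedder": {"final_doc_joiner"},
    "writer": {"doc_embedder"},
}

# static_order() raises CycleError if the graph contains a cycle.
order = list(TopologicalSorter(predecessors).static_order())
print(order)
```

Every component feeds into the writer directly or indirectly, so the writer always comes last in the computed order.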

The following cell implements these connections:

In [15]:

# --- Unstructured Data Branch (Web, TXT, PDF) ---
# Web data
indexing_pipeline.connect("link_fetcher.streams", "html_converter.sources")
indexing_pipeline.connect("html_converter.documents", "unstructured_doc_joiner.documents")

# Local file data (TXT, PDF)
indexing_pipeline.connect("file_type_router.text/plain", "text_converter.sources")
indexing_pipeline.connect("file_type_router.application/pdf", "pdf_converter.sources")
indexing_pipeline.connect("text_converter.documents", "unstructured_doc_joiner.documents")
indexing_pipeline.connect("pdf_converter.documents", "unstructured_doc_joiner.documents")

# Processing for unstructured data
indexing_pipeline.connect("unstructured_doc_joiner", "text_cleaner")
indexing_pipeline.connect("text_cleaner", "text_splitter")
# Connect the split *text* docs to the *final* joiner
indexing_pipeline.connect("text_splitter.documents", "final_doc_joiner.documents")


# --- Structured Data Branch (CSV) ---
# Route CSV files to the csv_converter
indexing_pipeline.connect("file_type_router.text/csv", "csv_converter.sources")
# Process the CSV documents
indexing_pipeline.connect("csv_converter.documents", "csv_cleaner.documents")
indexing_pipeline.connect("csv_cleaner.documents", "csv_splitter.documents")
# Connect the split *csv* docs to the *final* joiner
indexing_pipeline.connect("csv_splitter.documents", "final_doc_joiner.documents")


# --- Main Processing Path (Embedding and Writing) ---
# The final_doc_joiner now receives documents from *both* branches
indexing_pipeline.connect("final_doc_joiner", "doc_embedder")
indexing_pipeline.connect("doc_embedder", "writer")

<haystack.core.pipeline.pipeline.Pipeline object at 0x328391c10>
🚅 Components
  - link_fetcher: LinkContentFetcher
  - html_converter: HTMLToDocument
  - file_type_router: FileTypeRouter
  - text_converter: TextFileToDocument
  - pdf_converter: PyPDFToDocument
  - unstructured_doc_joiner: DocumentJoiner
  - text_cleaner: DocumentCleaner
  - text_splitter: DocumentSplitter
  - doc_embedder: SentenceTransformersDocumentEmbedder
  - writer: DocumentWriter
  - csv_converter: TextFileToDocument
  - csv_cleaner: CSVDocumentCleaner
  - csv_splitter: CSVDocumentSplitter
  - final_doc_joiner: DocumentJoiner
🛤️ Connections
  - link_fetcher.streams -> html_converter.sources (list[ByteStream])
  - html_converter.documents -> unstructured_doc_joiner.documents (list[Document])
  - file_type_router.text/plain -> text_converter.sources (list[Union[str, Path, ByteStream]])
  - file_type_router.application/pdf -> pdf_converter.sources (list[Union[str, Path, ByteStream]])
  - file_type_router.text/csv ->

Let's visualize the pipeline:

In [16]:
Path("images").mkdir(exist_ok=True)  # make sure the output directory exists
indexing_pipeline.draw(path=Path("images/indexing_pipeline.png"))

![](./images/indexing_pipeline.png)

In [17]:
# --- Run the Pipeline ---

print("Running unified indexing pipeline for web, local files, and CSV...")
# Note: The PDF is skipped if the file is missing or empty (for example, the placeholder created earlier).
file_paths_to_process = [text_file_path]

if pdf_file_path.exists() and pdf_file_path.stat().st_size > 0:
    file_paths_to_process.append(pdf_file_path)
else:
    print(f"Skipping PDF file: {pdf_file_path}")

file_paths_to_process.append(csv_file_path)


indexing_pipeline.run({
    "link_fetcher": {"urls": [web_url]},
    "file_type_router": {"sources": file_paths_to_process}
})

Error processing document 1384ec36dd6d99f90ab589732d5219b7371dac846d0f0bd89c6385189c4079c0. Keeping it, but skipping cleaning. Error: Error tokenizing data. C error: Expected 5 fields in line 5, saw 7

Error processing document 1384ec36dd6d99f90ab589732d5219b7371dac846d0f0bd89c6385189c4079c0. Keeping it, but skipping splitting. Error: Error tokenizing data. C error: Expected 5 fields in line 5, saw 7



Running unified indexing pipeline for web, local files, and CSV...
Skipping PDF file: data_for_indexing/sample.pdf


Batches: 100%|██████████| 1/1 [00:00<00:00,  1.14it/s]


{'writer': {'documents_written': 15}}
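
The two "Error tokenizing data" messages in the output above come from the last row of the sample CSV: its Notes text contains commas that are not quoted, so the row parses as 7 fields instead of 5. The CSV document is therefore kept whole, without cleaning or row-wise splitting. The sketch below uses the standard `csv` module (not the Haystack components) to show how quoting the field restores the expected field count.

```python
import csv
import io

header = "Company,Model,Release Year,,Notes\n"
unquoted = header + "Anthropic,Claude 3,2024,,Includes Opus, Sonnet, and Haiku models\n"
quoted = header + 'Anthropic,Claude 3,2024,,"Includes Opus, Sonnet, and Haiku models"\n'


def field_counts(text):
    """Return the number of fields found in each CSV line."""
    return [len(row) for row in csv.reader(io.StringIO(text))]


print(field_counts(unquoted))  # [5, 7]
print(field_counts(quoted))    # [5, 5]
```

Wrapping the Notes value in double quotes in `csv_content` should let the CSV cleaner and splitter process the file.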

In [18]:
# --- Verify the DocumentStore ---
doc_count = document_store.count_documents()
print(f"\nTotal documents in DocumentStore: {doc_count}")
print("Sample document from the store:")
print(document_store.filter_documents())


Total documents in DocumentStore: 15
Sample document from the store:
[Document(id=1384ec36dd6d99f90ab589732d5219b7371dac846d0f0bd89c6385189c4079c0, content: 'Company,Model,Release Year,,Notes
OpenAI,GPT-4,2023,,Generative Pre-trained Transformer 4
,,,
Google...', meta: {'file_path': 'llm_models.csv'}, embedding: vector of size 384), Document(id=be2fb4afe8f3e531ae2e97314778b92789c44d794c981d67bdea8658cf3fe51e, content: 'Haystack 2.0: The Composable Open-Source LLM Framework
Meet Haystack 2.0, a more flexible, customiza...', meta: {'content_type': 'text/html', 'url': 'https://haystack.deepset.ai/blog/haystack-2-release', 'source_id': '0b188f4690ab3496d2270baf378be9fde19707e9b1ece003123c0442af918bb7', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0, '_split_overlap': [{'doc_id': 'ff9dcb8208955755eb458bdf693e12ba8c2dd7f27de0bcb51e47e474255ad47b', 'range': (0, 124)}]}, embedding: vector of size 384), Document(id=ff9dcb8208955755eb458bdf693e12ba8c2dd7f27de0bcb51e47e474255ad47b, conten