🔧 **Setup Required**: Before running this notebook, please follow the [setup instructions](../README.md#setup-instructions) to configure your environment and API keys.

# Building Custom Components with Warm-Up Methods

This notebook demonstrates how to create custom Haystack components that require initialization or pre-loading of resources. We'll focus on the `warm_up()` method, which is essential for components that need to load machine learning models, establish database connections, or perform other setup operations before processing begins.

## Learning Objectives

By the end of this notebook, you will understand:

1. **The purpose and importance of the `warm_up()` method** in Haystack components
2. **How to implement proper resource initialization** for machine learning models
3. **Pipeline integration** with components that require warm-up

## Why Warm-Up Methods Matter

### The Problem
Many machine learning models and external services require initialization before they can process data:
- Loading pre-trained models into memory
- Establishing database connections
- Downloading remote resources
- Performing one-time computations

### The Solution
The `warm_up()` method provides a standardized way to:
- **Separate initialization from processing logic**
- **Ensure components are ready before pipeline execution**
- **Avoid repeated loading of the same resources**
- **Handle initialization errors gracefully**
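
The separation described above can be sketched without any heavy dependencies. This is a minimal illustration rather than Haystack code: `FakeModel` and `LazyUppercaser` are hypothetical stand-ins, and the load counter exists only to make the "load once" behaviour visible.

```python
from typing import List


class FakeModel:
    """Hypothetical stand-in for an expensive resource such as an ML model."""

    load_count = 0  # counts how many times the "model" was constructed

    def __init__(self) -> None:
        FakeModel.load_count += 1

    def transform(self, text: str) -> str:
        return text.upper()


class LazyUppercaser:
    """Keeps configuration in __init__ and defers loading to warm_up()."""

    def __init__(self) -> None:
        self.model = None  # nothing expensive happens here

    def warm_up(self) -> None:
        # Idempotent: only the first call constructs the model.
        if self.model is None:
            self.model = FakeModel()

    def run(self, texts: List[str]) -> dict:
        if self.model is None:
            raise RuntimeError("Call warm_up() before run().")
        return {"texts": [self.model.transform(t) for t in texts]}


component = LazyUppercaser()
loads_after_init = FakeModel.load_count  # construction alone loads nothing
component.warm_up()
component.warm_up()  # second call is a no-op
output = component.run(["hello"])
print(loads_after_init, FakeModel.load_count, output)
```

Creating the component costs nothing, and calling `warm_up()` twice still constructs the resource only once.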

### Performance Benefits
- **Faster pipeline execution** after initial warm-up
- **Predictable memory usage** patterns
- **Reduced latency** for subsequent operations
- **Better error handling** during setup phase

## Component Architecture: The `LocalEmbedderText` Example

We'll start by building a custom component that embeds text using SentenceTransformers. This example illustrates the key principles of the warm-up pattern:

### Key Implementation Details

1. **Constructor (`__init__`)**: Sets up configuration but doesn't load heavy resources
2. **Warm-up method (`warm_up`)**: Loads the actual model when needed
3. **Processing method (`run`)**: Performs the main work using pre-loaded resources
4. **State management**: Tracks whether initialization has occurred

### The SentenceTransformers Use Case

SentenceTransformers models are perfect for demonstrating warm-up because:
- They require downloading and loading large model files
- Loading can take several seconds
- Once loaded, inference is fast
- The same model instance can be reused for multiple texts

In [1]:

from haystack import component, Document
from typing import List, Optional
from sentence_transformers import SentenceTransformer

@component
class LocalEmbedderText:
    def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
        self.model_name = model_name
        self.model: Optional[SentenceTransformer] = None

    def warm_up(self):
        """
        Loads the SentenceTransformer model. Safe to call repeatedly:
        the model is loaded only on the first call.
        """
        if self.model is None:
            self.model = SentenceTransformer(self.model_name)

    @component.output_types(embeddings=List[List[float]])
    def run(self, texts: List[str]):
        """
        Embeds a list of texts using the pre-loaded model.
        """
        if self.model is None:
            raise RuntimeError("The model has not been loaded. Please call warm_up() before running.")
        
        embeddings = self.model.encode(texts).tolist()
        return {"embeddings": embeddings}




**Critical Design Decisions:**

1. **Lazy Loading**: The model is only loaded when `warm_up()` is called, not in `__init__()`
2. **State Checking**: The `run()` method validates that initialization has occurred
3. **Idempotency**: Multiple calls to `warm_up()` don't reload the model
4. **Error Handling**: Clear error messages guide users to proper usage
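
One way to exercise these decisions without downloading a model is to make the loader injectable. The sketch below assumes a hypothetical `model_loader` constructor argument and a `StubModel` test double; neither is part of `LocalEmbedderText` above, and the stub returns plain lists, so no `.tolist()` call is needed.

```python
from typing import Callable, List, Optional


class StubModel:
    """Hypothetical test double that mimics the one method run() relies on."""

    def encode(self, texts: List[str]) -> List[List[float]]:
        # Returns a dummy 2-value vector per text instead of real embeddings.
        return [[float(len(t)), 0.0] for t in texts]


class InjectableEmbedder:
    """Same warm-up shape as LocalEmbedderText, but the loader is injectable."""

    def __init__(self, model_loader: Callable[[], StubModel]):
        self.model_loader = model_loader
        self.model: Optional[StubModel] = None

    def warm_up(self) -> None:
        if self.model is None:
            self.model = self.model_loader()

    def run(self, texts: List[str]) -> dict:
        if self.model is None:
            raise RuntimeError("The model has not been loaded. Please call warm_up() before running.")
        return {"embeddings": self.model.encode(texts)}


calls: List[str] = []


def loader() -> StubModel:
    calls.append("load")  # record each load so idempotency can be checked
    return StubModel()


embedder = InjectableEmbedder(model_loader=loader)

# State checking: run() before warm_up() fails with a clear message.
try:
    embedder.run(["hi"])
    guard_triggered = False
except RuntimeError:
    guard_triggered = True

# Idempotency: two warm-ups, one load.
embedder.warm_up()
embedder.warm_up()
result = embedder.run(["hi", "hello"])
```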

In [2]:
local_embedder = LocalEmbedderText()
local_embedder.warm_up()

## Testing the Basic Component

Let's test our text embedder component to understand the warm-up workflow:

### Step 1: Component Initialization

First, we create an instance of our component. Notice that this is fast because no heavy model loading occurs yet:

In [11]:
txt_embeddings = local_embedder.run(texts=["This is a test sentence.", "Another sentence to embed."])

In [12]:
txt_embeddings['embeddings'][0][0:5]  # Display first 5 values of the first embedding

[0.08429640531539917,
 0.05795368552207947,
 0.004493284970521927,
 0.1058211699128151,
 0.007083478849381208]

### Step 2: Component Processing

Now we can use our warmed-up component to embed text. The first call may be slightly slower due to model initialization, but subsequent calls will be fast:

## Adapting for Document Processing

The text-based embedder works well for simple strings, but in real-world Haystack pipelines, we typically work with `Document` objects. Let's create an adapted version that processes Document objects instead of raw text strings.

### Why Document Objects Matter

Haystack's `Document` class provides:
- **Structured content storage** with metadata
- **Pipeline compatibility** with other Haystack components  
- **Standardized interfaces** across different component types
- **Rich context preservation** through metadata fields

### Implementation Considerations

When adapting our component for Documents:
- Extract text content from Document objects
- Maintain the same warm-up pattern
- Preserve component state management
- Ensure compatibility with downstream components
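
On the last point: Haystack's built-in document embedders store each vector on the Document's `embedding` field and return the documents themselves, which is what lets a writer follow them in a pipeline. The sketch below shows that shape under simplifying assumptions. `Doc` is a hypothetical dataclass standing in for `haystack.Document` so the example runs without dependencies, and `attach_embeddings` is an illustrative helper name.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for haystack.Document (content + embedding only)."""

    content: Optional[str] = None
    embedding: Optional[List[float]] = None


def attach_embeddings(documents: List[Doc], embeddings: List[List[float]]) -> List[Doc]:
    """Pair each document with its vector, failing loudly on a length mismatch."""
    if len(documents) != len(embeddings):
        raise ValueError("Expected one embedding per document.")
    for doc, vector in zip(documents, embeddings):
        doc.embedding = vector
    return documents


docs = [Doc(content="first chunk"), Doc(content="second chunk")]
fake_vectors = [[0.1, 0.2], [0.3, 0.4]]  # placeholder values, not model output
docs = attach_embeddings(docs, fake_vectors)
```

A component following this approach would return `{"documents": documents}` from `run()` and declare `documents=List[Document]` as its output type. The component built in this notebook returns raw embeddings instead.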

In [4]:
@component
class LocalEmbedderDocs:
    def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
        self.model_name = model_name
        self.model: Optional[SentenceTransformer] = None

    def warm_up(self):
        """
        Loads the SentenceTransformer model. Safe to call repeatedly:
        the model is loaded only on the first call.
        """
        if self.model is None:
            self.model = SentenceTransformer(self.model_name)

    @component.output_types(embeddings=List[List[float]])
    def run(self, documents: List[Document]):
        """
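        Embeds the content of a list of Documents using the pre-loaded model.
        """
        if self.model is None:
            raise RuntimeError("The model has not been loaded. Please call warm_up() before running.")

        # Document.content may be None; fall back to an empty string so encode() receives only strings.
        texts = [doc.content or "" for doc in documents]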
        embeddings = self.model.encode(texts).tolist()
        return {"embeddings": embeddings}


### Comparing the Implementations

Notice the similarities and differences between our two embedder components:

**Similarities:**
- Identical warm-up logic and state management
- Same model loading and initialization pattern
- Consistent error handling for uninitialized components
- Identical embedding computation process

**Key Differences:**
- **Input processing**: `LocalEmbedderDocs` extracts text from Document objects before embedding
- **Type annotations**: Different input types (`List[str]` vs `List[Document]`)
- **Pipeline compatibility**: Document version integrates seamlessly with other Haystack components

**Design Pattern:**
This demonstrates a common pattern in Haystack development: creating component variants that handle different data types while keeping the same core functionality.

In [5]:
from haystack import Pipeline, Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import (
    DocumentCleaner,
    DocumentSplitter,
)
from haystack.components.writers import DocumentWriter

## Building a Complete Processing Pipeline

Now let's integrate our custom embedder component into a comprehensive document processing pipeline. This demonstrates how components with warm-up methods work within larger workflows.

### Pipeline Architecture

Our pipeline will process PDF documents through several stages:

1. **PDF Conversion**: Extract text content from PDF files
2. **Document Cleaning**: Remove extra whitespace and empty lines
3. **Document Splitting**: Break large documents into manageable chunks
4. **Embedding Generation**: Convert text to vector representations using our custom component
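
To make the splitting stage concrete, here is a plain-Python sketch of fixed-size word chunking. It illustrates the idea only and is not `DocumentSplitter`'s implementation; `split_by_words` and its parameters are hypothetical names.

```python
from typing import List


def split_by_words(text: str, chunk_size: int = 5, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most chunk_size words, optionally overlapping."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("Need chunk_size > 0 and 0 <= overlap < chunk_size.")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


chunks = split_by_words("one two three four five six seven", chunk_size=3)
print(chunks)  # ['one two three', 'four five six', 'seven']
```

Each resulting chunk becomes one input to the embedder, so the number of chunks determines the number of embeddings.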

### Component Integration Strategy

When building pipelines with custom components:
- **Initialization order matters**: Components must be warmed up before pipeline execution
- **Resource management**: Each component manages its own resources independently
- **Error propagation**: Component-level errors should bubble up clearly
- **Performance optimization**: Warm-up once, process many times

In [6]:
document_store = InMemoryDocumentStore()

# Initialize components
pdf_converter = PyPDFToDocument()
cleaner = DocumentCleaner()
splitter = DocumentSplitter()
doc_embedder = LocalEmbedderDocs()
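# Note: `writer` is created here but not added to the pipeline below, because
# LocalEmbedderDocs returns raw embeddings rather than Documents.
writer = DocumentWriter(document_store=document_store)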

# Create pipeline
pipeline = Pipeline()
pipeline.add_component("pdf_converter", pdf_converter)
pipeline.add_component("document_cleaner", cleaner)
pipeline.add_component("document_splitter", splitter)
pipeline.add_component("document_embedder", doc_embedder)


pipeline.connect("pdf_converter.documents", "document_cleaner.documents")
pipeline.connect("document_cleaner.documents", "document_splitter.documents")
pipeline.connect("document_splitter.documents", "document_embedder.documents")


<haystack.core.pipeline.pipeline.Pipeline object at 0x35979b260>
🚅 Components
  - pdf_converter: PyPDFToDocument
  - document_cleaner: DocumentCleaner
  - document_splitter: DocumentSplitter
  - document_embedder: LocalEmbedderDocs
🛤️ Connections
  - pdf_converter.documents -> document_cleaner.documents (list[Document])
  - document_cleaner.documents -> document_splitter.documents (list[Document])
  - document_splitter.documents -> document_embedder.documents (list[Document])

### Understanding Pipeline Construction

Let's examine how our pipeline is structured:

**Component Initialization:**
```python
doc_embedder = LocalEmbedderDocs()  # Custom component requiring warm-up
```

**Pipeline Assembly:**
```python
pipeline = Pipeline()
pipeline.add_component("document_embedder", doc_embedder)
```

**Component Connections:**
```python
pipeline.connect("document_splitter.documents", "document_embedder.documents")
```

### Important Notes About Warm-Up in Pipelines

1. **Automatic Warm-Up**: Haystack automatically calls `warm_up()` on components when the pipeline runs
2. **Order Independence**: Components are warmed up before execution regardless of connection order  
3. **Error Handling**: Warm-up failures prevent pipeline execution and provide clear error messages
4. **Resource Efficiency**: The `if self.model is None` check inside `warm_up()` ensures the model is loaded only once, even if the pipeline calls `warm_up()` again on later runs
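
As a mental model for these notes, the following sketch shows one way a pipeline could warm up its components before running them. `MiniPipeline`, `Stateless`, and `NeedsWarmUp` are hypothetical, heavily simplified classes written for illustration; this is not Haystack's `Pipeline` implementation.

```python
from typing import Any, List, Tuple


class MiniPipeline:
    """Hypothetical linear pipeline: warms up components, then runs them in order."""

    def __init__(self) -> None:
        self.components: List[Tuple[str, Any]] = []

    def add_component(self, name: str, instance: Any) -> None:
        self.components.append((name, instance))

    def warm_up(self) -> None:
        # Only components that define warm_up() are touched.
        for _, instance in self.components:
            if hasattr(instance, "warm_up"):
                instance.warm_up()

    def run(self, data: Any) -> Any:
        self.warm_up()  # called on every run, so warm_up() must be idempotent
        for _, instance in self.components:
            data = instance.run(data)
        return data


class Stateless:
    def run(self, text: str) -> str:
        return text.strip()


class NeedsWarmUp:
    def __init__(self) -> None:
        self.resource = None
        self.loads = 0  # how many times the resource was actually loaded

    def warm_up(self) -> None:
        if self.resource is None:
            self.resource = "!"
            self.loads += 1

    def run(self, text: str) -> str:
        return text + self.resource


pipe = MiniPipeline()
heavy = NeedsWarmUp()
pipe.add_component("cleaner", Stateless())
pipe.add_component("suffixer", heavy)
first = pipe.run("  hello ")
second = pipe.run("world")
print(first, second, heavy.loads)
```

Because the sketch calls `warm_up()` at the start of every `run()`, the guard inside the component is what keeps the resource from being reloaded.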

In [7]:
pipeline.draw(path="./images/warmup_component_pipeline.png")

## Visualizing Pipeline Architecture

Let's generate a visual representation of our pipeline to understand the data flow and component relationships:

![](./images/warmup_component_pipeline.png)

### Pipeline Flow Analysis

The pipeline diagram shows the complete document processing workflow:

**Data Flow:**
1. **PDF Input**: Raw PDF files are fed into the pipeline
2. **Text Extraction**: PyPDFToDocument converts PDF to text
3. **Preprocessing**: DocumentCleaner removes formatting artifacts  
4. **Segmentation**: DocumentSplitter creates manageable chunks
5. **Embedding**: Our custom LocalEmbedderDocs generates vectors

**Component Dependencies:**
- Each component depends on the output of the previous stage
- The custom embedder is the final processing step
- No parallel processing branches in this linear pipeline

**Resource Management:**
- Only the custom embedder requires warm-up in this pipeline
- The other components do not load a model with the default settings used here, so they have nothing heavy to initialize
- The embedding model is loaded once and reused for all document chunks

In [8]:
file_paths_to_process = "./data_for_indexing/howpeopleuseai.pdf"
embedded_docs = pipeline.run({"pdf_converter": {"sources": [file_paths_to_process]}})

## Running the Complete Pipeline

Now let's execute our pipeline with real PDF data to see how the warm-up component performs in practice:

### Execution Process

When we run the pipeline, several things happen automatically:

1. **Component Warm-Up Phase**: Haystack calls `warm_up()` on our custom embedder
2. **Model Loading**: The SentenceTransformer model is downloaded and loaded into memory  
3. **Document Processing**: The PDF is processed through each pipeline stage
4. **Embedding Generation**: Our warmed-up component processes all document chunks efficiently

### Performance Expectations

- **First Run**: Slower due to model loading and potential model downloads
- **Subsequent Runs**: Much faster since the model remains in memory
- **Memory Usage**: Higher after warm-up due to loaded model weights
- **Throughput**: Excellent for batch processing multiple documents
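
The gap between the first and later runs can be measured directly. The sketch below simulates the load cost with `time.sleep`, so the numbers are illustrative only; `SlowToLoad` and `timed_call` are hypothetical names, and real timings depend on the model, the hardware, and whether the weights are already cached.

```python
import time


class SlowToLoad:
    """Hypothetical component whose load cost is simulated with time.sleep."""

    def __init__(self, load_seconds: float = 0.2) -> None:
        self.load_seconds = load_seconds
        self.ready = False

    def warm_up(self) -> None:
        if not self.ready:
            time.sleep(self.load_seconds)  # stands in for reading model weights
            self.ready = True

    def run(self, text: str) -> int:
        if not self.ready:
            raise RuntimeError("Call warm_up() before run().")
        return len(text)


def timed_call(component: SlowToLoad, text: str) -> float:
    """Warm up, then run, and return the wall-clock seconds the pair took."""
    start = time.perf_counter()
    component.warm_up()  # a no-op after the first call
    component.run(text)
    return time.perf_counter() - start


component = SlowToLoad()
cold = timed_call(component, "first call pays the load cost")
warm = timed_call(component, "later calls reuse the loaded state")
print(f"cold: {cold:.3f}s, warm: {warm:.3f}s")
```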

In [9]:
len(embedded_docs['document_embedder']['embeddings'])

77

In [10]:
embedded_docs['document_embedder']['embeddings'][0][0:5]

[-0.04715508595108986,
 -0.08272656798362732,
 -0.017641058191657066,
 0.03134286031126976,
 0.029913973063230515]

### Analyzing Pipeline Output

Let's examine the results to understand what our pipeline accomplished:

**Embedding Count Significance:**
- Each embedding corresponds to one document chunk that was processed
- The count (77 in this run) tells us how many text segments the document splitter created
- Each embedding is a 384-dimensional vector (the output size of `all-MiniLM-L6-v2`) representing the semantic content of its chunk

### Understanding the Processing Chain

The pipeline took our single PDF file and:
1. **Extracted** all text content from the PDF
2. **Cleaned** the text by removing formatting artifacts
3. **Split** the content into chunks of bounded size, according to the `DocumentSplitter` settings
4. **Embedded** each chunk into a vector representation

This transformation enables:
- **Semantic search** across document content
- **Similarity comparisons** between different text segments  
- **Clustering** of related content
- **Integration** with vector databases and retrieval systems
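
As a small illustration of the similarity comparison mentioned above, the sketch below ranks chunk vectors against a query vector by cosine similarity. It uses toy 2-dimensional vectors rather than real model output, and `cosine_similarity` and `rank_chunks` are hypothetical helper names, not Haystack functions.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_chunks(query: List[float], chunks: List[List[float]]) -> List[Tuple[int, float]]:
    """Return (chunk_index, score) pairs, best match first."""
    scores = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(chunks)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional vectors; real embeddings come from the embedder's output.
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranking = rank_chunks([1.0, 0.0], toy_chunks)
print([index for index, _ in ranking])  # [0, 2, 1]
```

With real data, the query vector would come from embedding the query text with the same model that produced the chunk embeddings.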

## Summary and Best Practices

### What We've Learned

In this notebook, we explored the essential concepts of building Haystack components with proper initialization patterns:

1. **Warm-Up Method Implementation**: How to separate resource loading from component logic
2. **State Management**: Proper handling of initialized vs uninitialized component states  
3. **Pipeline Integration**: Seamless incorporation of custom components into larger workflows
4. **Performance Optimization**: Efficient resource utilization through proper initialization timing

### Key Design Principles

**Separation of Concerns:**
- `__init__()` for configuration setup
- `warm_up()` for resource loading  
- `run()` for core processing logic

**Error Handling:**
- Clear validation of component readiness
- Informative error messages for common mistakes
- Graceful handling of initialization failures

**Resource Efficiency:**
- Lazy loading of expensive resources
- Idempotent warm-up operations
- Proper cleanup when needed

### Further Learning

This example demonstrates a basic embedding component. For production use, consider exploring:
- The official [SentenceTransformersDocumentEmbedder](https://github.com/deepset-ai/haystack/blob/main/haystack/components/embedders/sentence_transformers_document_embedder.py) implementation
- Advanced error handling and retry mechanisms  
- Memory optimization techniques for large models
- Async initialization patterns for better performance