# Building blocks in Haystack: components and pipelines

Haystack is built around components, each of which performs a specific function such as text processing or summarization. Components are designed to be connected into pipelines, which orchestrate the flow of data and manage task execution in a structured manner. The `Pipeline` class supports this by letting you add components and connect them; each component exposes distinct input and output points for data transfer.

Pipelines are the backbone of NLP applications in Haystack. A pipeline is a directed graph in which the nodes are components and the edges determine how data flows between them. Pipelines coordinate data processing, handle errors, and support debugging through visualization tools that help developers trace and optimize the path the data takes.
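To make the graph idea above concrete, here is a minimal sketch in plain Python. It is not Haystack's `Pipeline` class: `ToyPipeline`, its method names, and the two example components are hypothetical and exist only to show nodes (components) joined by edges (connections), assuming the graph has no cycles.

```python
from collections import defaultdict, deque
from typing import Callable, Dict, List


class ToyPipeline:
    """A toy directed graph of components; an illustration, not Haystack's API."""

    def __init__(self) -> None:
        self.components: Dict[str, Callable[[str], str]] = {}
        self.edges: Dict[str, List[str]] = defaultdict(list)

    def add_component(self, name: str, func: Callable[[str], str]) -> None:
        # Each node needs its own name so that edges can refer to it.
        if name in self.components:
            raise ValueError(f"Component name '{name}' is already in use")
        self.components[name] = func

    def connect(self, sender: str, receiver: str) -> None:
        # An edge sends the output of `sender` to the input of `receiver`.
        for name in (sender, receiver):
            if name not in self.components:
                raise ValueError(f"Unknown component '{name}'")
        self.edges[sender].append(receiver)

    def run(self, start: str, data: str) -> Dict[str, str]:
        """Push data from `start` along the edges and record each output.

        Assumes the graph is acyclic; a cycle would make this loop forever.
        """
        outputs: Dict[str, str] = {}
        queue = deque([(start, data)])
        while queue:
            name, value = queue.popleft()
            result = self.components[name](value)
            outputs[name] = result
            for receiver in self.edges[name]:
                queue.append((receiver, result))
        return outputs


pipeline = ToyPipeline()
pipeline.add_component("cleaner", lambda text: " ".join(text.split()))
pipeline.add_component("lowercaser", lambda text: text.lower())
pipeline.connect("cleaner", "lowercaser")

results = pipeline.run("cleaner", "  Haystack   builds  NLP pipelines ")
print(results["lowercaser"])  # haystack builds nlp pipelines
```

Keeping components as independent callables and storing the connections separately is what lets the same component be reused in different graphs.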

Haystack emphasizes modularity and flexibility: it provides a range of pre-built components and also supports custom ones for specific needs. Pipelines then assemble these components into a single, cohesive NLP application. In this notebook we will explore key components. In the `pipelines.ipynb` notebook we will see how to connect them into pipelines.

### Set up

If you have already completed this setup, you can skip this section. Otherwise, follow the instructions below.

Throughout this book we will be using pip and conda for package management. We will also create an isolated conda environment with Python 3.10.  

The GitHub repository with the exercises is at https://github.com/PacktPublishing/Building-Natural-Language-Pipelines. The material for this chapter is in the `ch3` folder.

We recommend that you install Miniconda and VSCode. We also recommend that you install Git (Git Bash on Windows, or Git on Linux and Mac) to make it easier to access the material locally.

* Install Miniconda: https://docs.conda.io/projects/miniconda/en/latest/  

* Install VSCode: https://code.visualstudio.com/docs/setup/setup-overview  

To obtain the code and exercises, clone the repository as follows.

Open VSCode, click File -> New Window, then Terminal -> New Terminal. Ensure your terminal is of type “Bash” or “Command line”.

Within the terminal, type the commands below one at a time, pressing Enter after each. Each command is identified by a leading $ sign, which marks the prompt and is not part of the command.

```bash
$ git clone https://github.com/PacktPublishing/Building-Natural-Language-Pipelines.git

$ cd Building-Natural-Language-Pipelines/

$ conda create --name llm-pipelines python=3.10

$ conda activate llm-pipelines

$ pip install haystack-ai
```

Enable the Jupyter Notebook extension in VSCode through the extension marketplace. When you open a notebook, click 'Select Kernel' and choose `llm-pipelines` as the environment.
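Once the kernel is selected, a quick way to confirm that the installation above worked is to ask Python which version of the package it can see. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("haystack-ai")
if version is None:
    print("haystack-ai is not installed in this environment")
else:
    print(f"haystack-ai {version} is installed")
```

If the package is reported as missing, the notebook is probably running on a different kernel than the `llm-pipelines` environment.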

### Introduction to components

Haystack provides the following key ready-made components. There are more, but for now we will focus on these as we get started with Haystack's functionality.

![](./images/haystack-components.png)

### The `DocumentStore` class and the `Document` object

The `DocumentStore` class is an internal component of the Haystack library that serves as a registry for classes that are marked as document stores. A document store in Haystack is a place where documents are stored and retrieved, typically used as part of a pipeline to handle data for search and retrieval tasks. 

The `Document` object is a data structure that represents a document in Haystack. `Document` objects are stored in a `DocumentStore` and are used as input and output for the various components in Haystack.
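The relationship between the two can be sketched without Haystack at all. The classes below, `ToyDocument` and `ToyDocumentStore`, are hypothetical stand-ins written for illustration; they are not Haystack's classes and omit most of what a real document store does. The sketch assumes a store is essentially a mapping from document id to document.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToyDocument:
    """Simplified stand-in for a document: an id, optional text, and metadata."""
    id: str
    content: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)


class ToyDocumentStore:
    """Keeps documents in a dict keyed by id; illustration only."""

    def __init__(self) -> None:
        self._storage: Dict[str, ToyDocument] = {}

    def write_documents(self, documents: List[ToyDocument]) -> int:
        """Store documents and return how many were written."""
        for doc in documents:
            # This sketch rejects duplicate ids instead of overwriting them.
            if doc.id in self._storage:
                raise ValueError(f"Document id '{doc.id}' already exists")
            self._storage[doc.id] = doc
        return len(documents)

    def count_documents(self) -> int:
        return len(self._storage)

    def get(self, doc_id: str) -> Optional[ToyDocument]:
        return self._storage.get(doc_id)


store = ToyDocumentStore()
written = store.write_documents([
    ToyDocument(id="a", content="Components transform data."),
    ToyDocument(id="b", content="Pipelines connect components.", meta={"source": "notes"}),
])
print(written, store.count_documents(), store.get("b").meta)
```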


```python
# These imports use the early `haystack.preview` namespace that this notebook was written against.
# In Haystack 2.x releases that namespace was dropped; the equivalents are
# `from haystack import Document`, `from haystack.dataclasses import ByteStream`, and
# `from haystack.document_stores.in_memory import InMemoryDocumentStore`.
from haystack.preview.dataclasses import Document, ByteStream
from haystack.preview.document_stores.in_memory.document_store import InMemoryDocumentStore
import pandas as pd

# Placeholder binary data; in practice this would be read from a file (for example a PDF or an image)
binary_data = b'Your binary data here'

# Wrap the binary data in a ByteStream object; the MIME type should match your data
binary_blob = ByteStream(data=binary_data, mime_type='application/pdf')

# Example metadata
metadata = {
    "source": "Wikipedia",
    "author": "John Doe",
    "date": "2021-07-21",
    "custom_field": "custom_value"
}

# Pandas dataframe for tabular data
df = pd.DataFrame.from_dict({'first_name': ['John', 'Jane'], 'last_name': ['Doe', 'Doe'], 'age': [35, 38]})

# Create documents holding text, a dataframe, binary data, and text with metadata
documents = [
    Document(content="The population of Germany is 100 million people.", id="1"),
    Document(content="About 65 million people live in France as of today.", id='2'),
    Document(dataframe=df, id='3'),
    Document(blob=binary_blob, meta={"file_name": "example.pdf", "file_type": "PDF"}, id='4'),
    Document(content="A sample text document with metadata.", meta=metadata, id='5')
]

# Write documents to the document store
docstore = InMemoryDocumentStore()
docstore.write_documents(documents=documents)
```


### Filtering document information

To retrieve all documents, call `filter_documents()` with no filter:

```python
docstore.filter_documents()
```

```text
[Document(id='1', content='The population of Germany is 100 million people.', dataframe=None, blob=None, meta={}, score=None),
 Document(id='2', content='About 65 million people live in France as of today.', dataframe=None, blob=None, meta={}, score=None),
 Document(id='3', content=None, dataframe=  first_name last_name  age
 0       John       Doe   35
 1       Jane       Doe   38, blob=None, meta={}, score=None),
 Document(id='4', content=None, dataframe=None, blob=ByteStream(data=b'Your binary data here', metadata={}, mime_type='application/pdf'), meta={'file_name': 'example.pdf', 'file_type': 'PDF'}, score=None),
 Document(id='5', content='A sample text document with metadata.', dataframe=None, blob=None, meta={'source': 'Wikipedia', 'author': 'John Doe', 'date': '2021-07-21', 'custom_field': 'custom_value'}, score=None)]
```

To find a document by exact match on its content, use the `$eq` operator:

```python
filters_exact_match = {
    "content": {"$eq": "The population of Germany is 100 million people."}
}
docstore.filter_documents(filters=filters_exact_match)
```

```text
[Document(id='1', content='The population of Germany is 100 million people.', dataframe=None, blob=None, meta={}, score=None)]
```

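A `Document` created from a dataframe exposes it through its `dataframe` attribute, so individual columns can be accessed as usual:

```python
df_doc = Document(dataframe=df, id='3')

df_doc.dataframe.age
```

```text
0    35
1    38
Name: age, dtype: int64
```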

To find documents that have no text content (that is, whose `content` field is `None`):

```python
filters_no_content = {
    "content": {"$eq": None}
}
docstore.filter_documents(filters=filters_no_content)
```

```text
[Document(id='3', content=None, dataframe=  first_name last_name  age
 0       John       Doe   35
 1       Jane       Doe   38, blob=None, meta={}, score=None),
 Document(id='4', content=None, dataframe=None, blob=ByteStream(data=b'Your binary data here', metadata={}, mime_type='application/pdf'), meta={'file_name': 'example.pdf', 'file_type': 'PDF'}, score=None)]
```
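To build some intuition for how a filter such as `{"content": {"$eq": None}}` selects documents, here is a minimal sketch of filter matching over plain dictionaries. It is not how Haystack implements filtering: the `matches` and `filter_records` helpers are hypothetical, and the `$ne` and `$in` operators are included only for illustration, so check the Haystack documentation for the operators your version supports.

```python
from typing import Any, Callable, Dict, List

# Operators supported by this sketch; a real document store may support more.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "$eq": lambda actual, expected: actual == expected,
    "$ne": lambda actual, expected: actual != expected,
    "$in": lambda actual, expected: actual in expected,
}


def matches(record: Dict[str, Any], filters: Dict[str, Dict[str, Any]]) -> bool:
    """Return True when the record satisfies every condition on every field."""
    for field_name, conditions in filters.items():
        actual = record.get(field_name)  # a missing field is treated as None
        for operator, expected in conditions.items():
            if operator not in OPERATORS:
                raise ValueError(f"Unsupported operator: {operator}")
            if not OPERATORS[operator](actual, expected):
                return False
    return True


def filter_records(records: List[Dict[str, Any]],
                   filters: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the records that match the filters."""
    return [record for record in records if matches(record, filters)]


records = [
    {"id": "1", "content": "Some text"},
    {"id": "2", "content": None},
    {"id": "3", "content": "Other text"},
]

no_content = filter_records(records, {"content": {"$eq": None}})
print([record["id"] for record in no_content])  # ['2']
```

Expressing filters as data rather than as Python functions means the same filter dictionary can, in principle, be translated into whatever query language a particular storage backend uses.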