# Building blocks in Haystack: components and pipelines

In the [previous notebook](data_classes.ipynb), we learned how to store structured and unstructured data in `Document` objects, as well as in dataframe, `ByteStream`, `ChatMessage` and `StreamingChunk` objects. We also learned how to store these objects in a Document Store. In this notebook, we will explore how to embed data, store it in a Haystack Document Store, and read it back. Let's first take a look at Haystack's architecture.

Haystack is built around components, each performing a specific function such as text processing or summarization. Components are designed to be connected into pipelines, which orchestrate the flow of data and manage task execution in a structured manner. The `Pipeline` class supports this by letting us add components and connect them through their input and output points, which is how data is transferred from one component to the next.

Pipelines are the backbone of NLP applications in Haystack, functioning as directed graphs where nodes are components and edges dictate data flow. They ensure smooth data processing, handle errors, and support debugging through visualization tools that help developers trace and optimize the data journey.
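To make the graph idea concrete, here is a minimal sketch in plain Python. It is not Haystack's `Pipeline` implementation: the `Upper`, `Exclaim` and `ToyPipeline` classes are hypothetical, and the sketch only illustrates nodes (components), edges (connections) and data flowing along them.

```python
from collections import deque


class Upper:
    """Toy component: upper-cases its input text."""

    def run(self, text: str) -> dict:
        return {"text": text.upper()}


class Exclaim:
    """Toy component: appends an exclamation mark."""

    def run(self, text: str) -> dict:
        return {"text": text + "!"}


class ToyPipeline:
    """Hypothetical stand-in for a pipeline: a directed graph of components."""

    def __init__(self):
        self.components = {}  # node name -> component instance
        self.edges = {}       # sender name -> list of receiver names

    def add_component(self, name, instance):
        if name in self.components:
            raise ValueError(f"Component name '{name}' is already used")
        self.components[name] = instance
        self.edges[name] = []

    def connect(self, sender, receiver):
        # An edge means: the sender's output dict becomes the receiver's input
        self.edges[sender].append(receiver)

    def run(self, start, **inputs):
        # Walk the graph breadth-first from the start node
        queue = deque([(start, inputs)])
        results = {}
        while queue:
            name, data = queue.popleft()
            output = self.components[name].run(**data)
            results[name] = output
            for receiver in self.edges[name]:
                queue.append((receiver, output))
        return results


pipe = ToyPipeline()
pipe.add_component("upper", Upper())
pipe.add_component("exclaim", Exclaim())
pipe.connect("upper", "exclaim")

results = pipe.run("upper", text="hello haystack")
print(results["exclaim"]["text"])  # HELLO HAYSTACK!
```

Each toy component returns a dictionary of named outputs, which mirrors the dictionary results returned by the `run()` calls in the cells of this notebook.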

Haystack emphasizes modularity and flexibility, providing a range of pre-built components while also supporting custom ones for specific needs. The framework's pipelines enable the assembly of sophisticated NLP applications, integrating various functionalities into a cohesive system. In this notebook, we will explore key components. In the `pipelines.ipynb` notebook, we will see how to connect them into pipelines.

### Introduction to components

Within Haystack, we can find the following key ready-made components. There are more, but for now we will focus on these as we get started with Haystack's functionality.

![](./images/haystack-components.png)

### Embedding components

#### OpenAIDocumentEmbedder

In this example, `docs` is a list of `Document` objects whose text content will be embedded. The `OpenAIDocumentEmbedder` is initialized with an OpenAI API key and generates an embedding for each document. The first two values of each embedding are then printed.

In [1]:
from haystack.preview import Document
from haystack.preview.components.embedders import OpenAIDocumentEmbedder
from dotenv import load_dotenv
import os

# Load the .env file
load_dotenv("./../../.env")
api_key = os.getenv("OPENAI_API_KEY")

# List of documents to embed
docs = [Document(content="The quick brown fox jumps over the lazy dog."), 
        Document(content="To be or not to be, that is the question.")]

# Initialize the embedder with your OpenAI API key
document_embedder = OpenAIDocumentEmbedder(api_key=api_key)

# Run the embedder to get embeddings
result = document_embedder.run(docs)

# Access the embeddings stored in the documents
for doc in result['documents']:
    print(doc.embedding[0:2])

Calculating embeddings: 100%|██████████| 1/1 [00:01<00:00,  1.21s/it]

[0.001609589671716094, 0.005959704983979464]
[0.017629027366638184, -0.022774461656808853]





Let's take a look at the structure of the result. Note that the `Document` representation shown below does not display the embedding itself:

In [2]:
result

{'documents': [Document(id='2e3218009b01cfc57f865bbf81fa70de81b5ebae02c4cc7092e46ffde03f3c49', content='The quick brown fox jumps over the lazy dog.', dataframe=None, blob=None, meta={}, score=None),
  Document(id='63a06e3e867cb70e52a99c00b2de17fe531431c98e7d851268be01d341ea9f20', content='To be or not to be, that is the question.', dataframe=None, blob=None, meta={}, score=None)],
 'metadata': {'model': 'text-embedding-ada-002-v2',
  'usage': {'prompt_tokens': 22, 'total_tokens': 22}}}

The `metadata` entry reports the model that was used and the token usage:

In [3]:
result['metadata']

{'model': 'text-embedding-ada-002-v2',
 'usage': {'prompt_tokens': 22, 'total_tokens': 22}}

#### OpenAITextEmbedder

In this snippet, `text_embedder` is created with an OpenAI API key and used to generate an embedding for the string "I love pizza!". The resulting embedding and its associated metadata are inspected in the cells that follow.

In [4]:
from haystack.preview.components.embedders import OpenAITextEmbedder

# Initialize the text embedder with your OpenAI API key
text_embedder = OpenAITextEmbedder(api_key=api_key)

# Text you want to embed
text_to_embed = "I love pizza!"

# Embed the text; the result is inspected in the next cells
result_text = text_embedder.run(text_to_embed)

In [5]:
result_text.keys()

dict_keys(['embedding', 'metadata'])

As before, we can access the embedding through the `embedding` key:

In [6]:
result_text['embedding'][0:2]

[0.017020374536514282, -0.023255806416273117]

In [7]:
result_text['metadata']

{'model': 'text-embedding-ada-002-v2',
 'usage': {'prompt_tokens': 4, 'total_tokens': 4}}

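#### SentenceTransformersDocumentEmbedder

The `SentenceTransformersDocumentEmbedder` computes embeddings with a model from the Sentence Transformers library instead of the OpenAI API. Note the call to `warm_up()`, which loads the model before `run()` is used.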

In [16]:
from haystack.preview.components.embedders import SentenceTransformersDocumentEmbedder

# Initialize the document embedder with a model from the Sentence Transformers library
doc_embedder = SentenceTransformersDocumentEmbedder(model_name_or_path="sentence-transformers/all-mpnet-base-v2")
doc_embedder.warm_up()

# Create a document to embed
doc = Document(content="I love pizza!")

# Embed the document and print the first two values of its embedding
result = doc_embedder.run([doc])
print(result['documents'][0].embedding[0:2])


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[-0.07804739475250244, 0.14989925920963287]


In [20]:
result['documents'][0]

Document(id='ac2bc369f8115bb5bdee26d31f642520041e731da70d578ef116d3f67ad50c69', content='I love pizza!', dataframe=None, blob=None, meta={}, score=None)

#### SentenceTransformersTextEmbedder

In [12]:
from haystack.preview.components.embedders import SentenceTransformersTextEmbedder

# Initialize the text embedder with a specific model from Sentence Transformers
text_embedder = SentenceTransformersTextEmbedder(model_name_or_path="sentence-transformers/all-mpnet-base-v2")

# Warm up the model before use
text_embedder.warm_up()

# Define the text you want to embed
text_to_embed = "I love pizza!"

# Embed the text and retrieve the embedding
result = text_embedder.run(text_to_embed)

# Print the first two values of the embedding vector (a list of floats)
print(result['embedding'][0:2])


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[-0.07804739475250244, 0.14989925920963287]


In [14]:
result.keys()

dict_keys(['embedding'])
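Notice that the document embedder and the text embedder return the same leading values for "I love pizza!" when they use the same model. Embeddings are typically compared with a similarity measure such as cosine similarity. The following is a minimal sketch in plain Python, using made-up three-dimensional vectors rather than real model outputs; `cosine_similarity` is a helper written for this illustration and is not part of Haystack.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings
pizza = [1.0, 2.0, 0.0]
pasta = [2.0, 4.0, 0.0]    # points in the same direction as pizza
bicycle = [0.0, 0.0, 3.0]  # orthogonal to pizza

sim_close = cosine_similarity(pizza, pasta)
sim_far = cosine_similarity(pizza, bicycle)
print(sim_close, sim_far)
```

Vectors that point in the same direction score close to 1, while orthogonal vectors score 0. With real embeddings, texts with related meanings are expected to score higher than unrelated ones.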

## Writing content into a Document Store

### `DocumentWriter`

#### Writing regular documents

We can write `Document` objects into a Document Store using the `DocumentWriter` class. In this example, we create an `InMemoryDocumentStore` and write two `Document` objects into it.

In [22]:
from haystack.preview.components.writers import DocumentWriter
from haystack.preview.document_stores import InMemoryDocumentStore
from haystack.preview.dataclasses import Document

# Initialize an in-memory document store
doc_store = InMemoryDocumentStore()

# Create the DocumentWriter component with the document store
document_writer = DocumentWriter(document_store=doc_store)

# Define a list of documents to write
documents_to_write = [
    Document(content="Document 1 content"),
    Document(content="Document 2 content"),
]

# Use the DocumentWriter component to write documents to the store
result = document_writer.run(documents=documents_to_write)

# Print the number of documents written
print(f"Documents written: {result['documents_written']}")


Documents written: 2


In [28]:
doc_store.count_documents()

2

In [29]:
doc_store.filter_documents()

[Document(id='10b329f15a2de8355bd9d538c759c45eb6193f51c7576f310400d14a9475deb8', content='Document 1 content', dataframe=None, blob=None, meta={}, score=None),
 Document(id='8d5435d9fd98ef235133c6c0bf4977595b69f10683fcc27a31e56fb15a024ff7', content='Document 2 content', dataframe=None, blob=None, meta={}, score=None)]
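The IDs in these outputs are 64-character hexadecimal strings, and the fox sentence receives the same ID in every cell of this notebook where it appears, which suggests that IDs are derived from the document's content. The sketch below illustrates that idea with hypothetical `ToyDocument` and `ToyStore` classes. It assumes a SHA-256 hash of the content alone; the fields Haystack actually hashes, and how it handles duplicates, are not covered here.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a Document; the ID is derived from the content."""

    content: str
    id: str = field(init=False)

    def __post_init__(self):
        # Assumption: hash only the content; the real Document may hash more fields
        self.id = hashlib.sha256(self.content.encode("utf-8")).hexdigest()


class ToyStore:
    """Hypothetical in-memory store keyed by document ID."""

    def __init__(self):
        self._docs = {}

    def write_documents(self, documents):
        written = 0
        for doc in documents:
            if doc.id not in self._docs:  # skipping duplicates is a choice of this sketch
                self._docs[doc.id] = doc
                written += 1
        return written

    def count_documents(self):
        return len(self._docs)


store = ToyStore()
written = store.write_documents([
    ToyDocument("Document 1 content"),
    ToyDocument("Document 2 content"),
    ToyDocument("Document 1 content"),  # same content, therefore same ID
])
print(written, store.count_documents())  # 2 2
```

Keying the store by a content-derived ID makes writes repeatable: writing the same content twice cannot silently create two entries.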

#### Writing embedded documents

Sometimes, either because of the size of the data or because we want to capture semantic meaning with embedding models, we may want to store embeddings along with the documents.

The key steps are:

* Compute Embeddings: Use the `OpenAIDocumentEmbedder`, the `SentenceTransformersDocumentEmbedder`, or another Haystack embedder integration to compute the embeddings for your documents.

* Store Embeddings: The computed embeddings are stored in the `embedding` field of the `Document` objects.

* Write to DocumentStore: Use the `DocumentWriter` component to write these `Document` objects, now with embeddings, into a Document Store.

Here's an example that uses the `SentenceTransformersDocumentEmbedder` to compute embeddings and the `DocumentWriter` to write the embedded documents into a document store:



In [32]:
from haystack.preview.document_stores import InMemoryDocumentStore
from haystack.preview.components.writers import DocumentWriter
from haystack.preview.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.preview.dataclasses import Document

# Initialize document store and components
doc_store = InMemoryDocumentStore()
doc_embedder = SentenceTransformersDocumentEmbedder(model_name_or_path="sentence-transformers/all-mpnet-base-v2")
document_writer = DocumentWriter(document_store=doc_store)

# Example documents
documents = [
    Document(content="The quick brown fox jumps over the lazy dog."),
    Document(content="When it comes to natural language processing, context is key.")
]

# Warm up the embedder and compute embeddings
doc_embedder.warm_up()
embedded_docs = doc_embedder.run(documents)['documents']

# Write documents with embeddings to the document store
document_writer.run(documents=embedded_docs)


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

{'documents_written': 2}

Let's display each document's content together with the first values of its embedding:

In [47]:
# Retrieve all documents
all_documents = doc_store.filter_documents()

# Print details of each document, including the embedding if it exists
for doc in all_documents:
    print(f"Document ID: {doc.id}")
    print(f"Content: {doc.content}")
    if doc.embedding is not None:
        # Show only the first 5 values of the embedding for brevity
        print(f"Embedding: {doc.embedding[:5]}...")
    print("\n")


Document ID: 2e3218009b01cfc57f865bbf81fa70de81b5ebae02c4cc7092e46ffde03f3c49
Content: The quick brown fox jumps over the lazy dog.
Embedding: [-0.03429264575242996, -0.0013394346460700035, 0.004336129408329725, -0.0018683503149077296, 0.025440821424126625]...


Document ID: 8baba41960a8807c42da6783a39dbbf50873f9700ff861844ec8ccce65d4f50e
Content: When it comes to natural language processing, context is key.
Embedding: [0.049897201359272, -0.023004200309515, -0.03653186932206154, 0.05246769264340401, -0.01983010210096836]...


