# Building blocks in Haystack: Data classes 

When building data pipelines, a core component involved is the use of data structures. With data structures, we can store, manipulate and manage data through code. Having a solid foundation for data structures is key to ease NLP pipeline development, particularly when an LLM is involved.  With Haystack, we can leverage the following built-in data classes: 

* Data classes to represent Documents

* Data classes to represent Byte stream data

* Data classes to represent StreamingChunk data

* Data classes to represent chat messages 

* Data classes to represent question and answer data 

We can store text, dataframe objects and byte stream objects into Documents. 

![](./images/data-structures.png)

Each of these classes act as data structures that can be used to store and process data. We can use these classes to store data in a standardized format, and then use the Haystack API to process the data through data pipelines.

In the next section, we will provide examples of each.

In [None]:
!pip install --upgrade haystack-ai

### Haystack `Document` data class 

The Document is a foundational data class in Haystack that encapsulates a variety of data types that can be queried, such as text snippets, tables, and binary data.

Let's import it and take a look at its functionality.

In [2]:
from haystack.dataclasses import Document


Let's create a simple Document object.

In [4]:
sample_document = Document(content="This is a simple document", meta={"name": "test_doc"})
sample_document

Document(id=ca53157e450d009adb4c2217111faadc9e7c02aefb22717c4901e1c1c1ba314a, content: 'This is a simple document', meta: {'name': 'test_doc'})

In [5]:
sample_document.id

'ca53157e450d009adb4c2217111faadc9e7c02aefb22717c4901e1c1c1ba314a'

We see that an id was automatically generated for the document. Let's access the content and metadata.

In [6]:
sample_document.content

'This is a simple document'

In [7]:
sample_document.meta

{'name': 'test_doc'}

If we prefer to  ID, we control the ID by passing it in as a parameter.

In [8]:
# Create a simple text-based Document with a custom ID
sample_document = Document(
    content="This is a simple document",
    meta={"name": "test_doc"},
    id="custom_doc_id"  
)

sample_document.id

'custom_doc_id'

#### Let's create a dataframe-based Document

In [9]:
import pandas as pd
from sklearn.datasets import load_iris

# Load some example data
iris_df = load_iris(as_frame=True)["frame"]

# Save each row as a Document Object
iris_docs = [Document(dataframe=row.to_frame().T) for _, row in iris_df.iterrows()]

We see that each row was converted into a Document object, each with its own id. Let's access the first Document  and attributes.

In [10]:
iris_docs[0]

Document(id=22cf9396b67c1929c273ed65a6fcea5b8ba8b384ae45d5164be9ca7b6827c66c, dataframe: (1, 5))

In [11]:
iris_docs[0].id

'22cf9396b67c1929c273ed65a6fcea5b8ba8b384ae45d5164be9ca7b6827c66c'

In [12]:
iris_docs[0].dataframe

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0.0


#### Let's create a ByteStream-based data structure

The ByteStream class in Haystack represents a binary object that can be used within the API.

In [13]:
from haystack.dataclasses import  ByteStream

Let's read an image file.

In [14]:
with open("./images/data-struct2.png" ,"rb") as image:
    image_data=image.read()

After we read the image to memory, we can create a `ByteStream` object, a `Document` object and access its attributes.

In [15]:
# Convert binary data to ByteStream object
binary_image = ByteStream(data=image_data, mime_type='application/image')  # MIME type should match your data


Methods we can use

* `data` - returns the binary data as a byte string
* `from_file_path` - creates a ByteStream object from a file path
* `from_string` - creates a ByteStream object from a string
* `metadata` - returns the metadata associated with the ByteStream object
* `mime_type` - returns the mime type of the ByteStream object
* `to_file` - writes the ByteStream object to a file

Let's save the binary data to a `Document` object.

In [16]:
binary_document_im = Document(blob=binary_image, meta={"file_name": "data-strcut2.png", "file_type": "image"})

Let's take a look at the content of the blob through the data structure properties. 

In [17]:
binary_document_im.blob.data[0:23]

b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x02\xa9\x00\x00\x00'

In [18]:
binary_document_im.meta

{'file_name': 'data-strcut2.png', 'file_type': 'image'}

In [19]:
binary_document_im.id

'd59124d4c7495fdd763dc35e434cd4ae69cb198fed6b75270e388467e1c1688d'

Let's read a PDF file.

In [20]:
with open("./images/Sample PDF.pdf" ,"rb") as pdf:
    pdf_data=pdf.read()

In [21]:
# Convert binary data to ByteStream object
binary_pdf = ByteStream(data=pdf_data, mime_type='application/pdf')  # MIME type should match your data
binary_document_pdf = Document(blob=binary_pdf, meta={"file_name": "Sample PDF.pdf", "file_type": "PDF"})

In [22]:
binary_document_pdf.blob.data[0:23]

b'%PDF-1.4\n%\xd3\xeb\xe9\xe1\n1 0 obj\n'

In [23]:
binary_document_pdf.meta

{'file_name': 'Sample PDF.pdf', 'file_type': 'PDF'}

In [24]:
binary_document_pdf.id

'3165cf5c70a2d635022a6aa264b9fea45a05675d3d632263a7603627b46316c9'

### Ranking the Document objects

The next exercise demonstrates how to rank `Document` objects using the iris dataset as an example. 

In [25]:
# Recalling the iris_df dataframe
iris_df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


Let's sort the rows by sepal length (cm) in descending order.

In [26]:
sorted_df = iris_df.sort_values(by=["sepal length (cm)"], ascending=False)
sorted_df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
131,7.9,3.8,6.4,2.0,2
135,7.7,3.0,6.1,2.3,2
122,7.7,2.8,6.7,2.0,2
117,7.7,3.8,6.7,2.2,2
118,7.7,2.6,6.9,2.3,2


In [27]:
# Let's assume we want to use 'sepal length (cm)' as the score for ranking
sorted_docs = []
for  _, row in sorted_df.iterrows():
    doc = Document(
        dataframe=row.to_frame().T,
        score=row["sepal length (cm)"]  
    )
    sorted_docs.append(doc)

# Let's check the first document
sorted_docs[1]

Document(id=fefcdfd715c4f2fc66bdab55a84db31b23f30726b6f593acbf432f312f76a832, dataframe: (1, 5), score: 7.7)

In [28]:
# Let's check the last document
sorted_docs[-1]

Document(id=5063934be73ae00dbddcc65b093948dd6af9bbfd13947c1906509898c0295460, dataframe: (1, 5), score: 4.3)

When we learn about components, we'll see how to use the ranker components to rank documents based on a specific metadata field using the score field.

### Storing the `Documents`: introducing the `DocumentStore` class 

The `DocumentStore` class is an internal component of the Haystack library that serves as a registry for classes that are marked as document stores. A document store in Haystack is a place where documents are stored and retrieved, typically used as part of a pipeline to handle data for search and retrieval tasks. 

Let's begin saving our documents into a DocumentStore.


In [29]:
from haystack.document_stores.in_memory.document_store import InMemoryDocumentStore

We will initialize an InMemoryDocumentStore and save our documents into it. From its documentation:

* Stores data in-memory. It's ephemeral and cannot be saved to disk.
* Uses the BM25 algorithm for document search by default.
* Useful for testing and quick prototyping.


In [32]:
# Write documents to document store
sample_docstore = InMemoryDocumentStore()
sample_docstore.write_documents(documents=sorted_docs)


150

Counting total number of documents in the DocumentStore

In [33]:
sample_docstore.count_documents()

150

Transform DocumentStore into dictionary.

In [34]:
sample_docstore.to_dict()

{'type': 'haystack.document_stores.in_memory.document_store.InMemoryDocumentStore',
 'init_parameters': {'bm25_tokenization_regex': '(?u)\\b\\w\\w+\\b',
  'bm25_algorithm': 'BM25L',
  'bm25_parameters': {},
  'embedding_similarity_function': 'dot_product'}}

Let's add the `Document` associated to the blobs with binary data  to the same `DocumentStore`. Let's add the sample `Document` objects to the `DocumentStore`.

In [35]:
sample_docstore.write_documents(documents=[binary_document_im, binary_document_pdf, sample_document])

3

Let's verify the new record was added.

In [36]:
sample_docstore.count_documents()

153

Let's verify the IDs are unique.

In [37]:
ids = [item.id for item in sample_docstore.filter_documents()]

len(set(ids))

153

## What if the data changes over time, like in a chatbot or in an audio or video file?

Haystack provides a `ChatMessage` class that can be to represent a chat message. This is useful for chatbots, where the data is constantly changing. It also provides a `StreamingChunk` class that can be used to represent a chunk of data that is streamed in real-time.

![](./images/data-struct2.png)

Both data structures can be used to enhance the functionality of our LLM based pipelines. Let's take a look at each.

#### Let's create a `ChatMessage` data structure

`ChatMessage` comes built in with the following roles

```python
class ChatRole(str, Enum):
    """Enumeration representing the roles within a chat."""

    ASSISTANT = "assistant"
    USER = "user"
    SYSTEM = "system"
    FUNCTION = "function"
```

These can be mapped to the roles present in OpenAI's GPT models. 

Read more https://help.openai.com/en/articles/7042661-chatgpt-api-transition-guide

In [38]:
from haystack.dataclasses import ChatMessage

In [39]:
# Create a message from the assistant
assistant_msg = ChatMessage.from_assistant(content="Hello, how can I assist you today?")

print(assistant_msg)


ChatMessage(content='Hello, how can I assist you today?', role=<ChatRole.ASSISTANT: 'assistant'>, name=None, meta={})


In [40]:
# Create a message from the user
user_msg = ChatMessage.from_user(content="Can you show me the weather forecast?")

print(user_msg)

ChatMessage(content='Can you show me the weather forecast?', role=<ChatRole.USER: 'user'>, name=None, meta={})


In [41]:
# Create a system message, for instance, to indicate that a user has joined the chat
system_msg = ChatMessage.from_system(content="A new user has joined the chat.")

print(system_msg)

ChatMessage(content='A new user has joined the chat.', role=<ChatRole.SYSTEM: 'system'>, name=None, meta={})


In [42]:
# Create a function message, for example, to execute a command to retrieve weather data
function_msg = ChatMessage.from_function(content="Retrieving weather data...", name="fetch_weather")

print(function_msg)

ChatMessage(content='Retrieving weather data...', role=<ChatRole.FUNCTION: 'function'>, name='fetch_weather', meta={})


#### `StreamingChunk`  data class

Additionally, Haystack provides a `StreamingChunk` class that can be used to represent a segment of streamed content. This is useful for streaming data, such as audio or video, where the data is constantly changing.

The `StreamingChunk` class is designed to manage segments of streamed content, which could be part of a larger message or data transfer in a streaming context.

In [43]:
from haystack.dataclasses import StreamingChunk

Here's an example of how to create an instance of the StreamingChunk data class, which might represent a segment of a live video stream or an ongoing audio broadcast:

In [44]:
# Metadata for the streaming chunk
stream_metadata = {
    "timestamp": "2023-11-08T12:00:00Z",
    "stream_id": "stream123",
    "segment_number": 1
}

# Content of the streaming chunk
stream_content = "This is the first segment of the live stream."

# Create the StreamingChunk instance
streaming_chunk = StreamingChunk(content=stream_content, metadata=stream_metadata)

print(streaming_chunk)


TypeError: StreamingChunk.__init__() got an unexpected keyword argument 'metadata'

### What about data structures to validate responses in a Q&A system?

We will now turn our attention to a data structure focused on question and answer systems.

#### `Answer`, `ExtractedAnswer` and `GeneratedAnswer` data classes

These classes can be used as additional tooling when building in natural language processing (NLP) pipelines, particularly in the context of question answering systems.

The `Answer`, `ExtractedAnswer`, and `GeneratedAnswer` are data classes in Haystack that represent the structure of answers obtained from different components in a search or question-answering pipeline.

![](./images/qa-data-structures.png)

In [45]:
from haystack.dataclasses import Answer, GeneratedAnswer, ExtractedAnswer

#### `Answer` 

This is a base data class used to encapsulate the answer data along with its associated query and metadata. It's a generic class that can be used in different contexts where an answer object is required.

* data: The content of the answer. 

* query: The original question or query that prompted the answer. 

* metadata: A dictionary containing any additional information about the answer. 

Use Cases for `Answer`

* As a return type for components that generate answers to a query, ensuring a consistent interface.
* To encapsulate answers for further processing in a pipeline, such as ranking or formatting.



In [49]:
# Assume we have a document that contains the answer to a question
doc = Document(content="Berlin is the capital of Germany.", id="123")

answer = Answer(data='Berlin',
                 query='What is the capital of Germany?',
                 meta={})


answer

TypeError: Protocols cannot be instantiated

#### `ExtractedAnswer` 

This is a specialized version of the Answer class for scenarios where an answer is extracted from a document. It's typically used in extractive question-answering systems.

* data: The text of the answer extracted from a document. 

* document: The Document object from which the answer was extracted. 

* probability: A float representing the confidence score of the extracted answer being correct. 

* start: The start index of the answer in the content of the Document. 

* end: The end index of the answer in the content of the Document. 

Use Cases for `ExtractedAnswer`

* In extractive QA systems where answers are directly pulled from the content of documents.
* When there is a need to trace back the answer to its source for validation or display purposes.



In [51]:
# After processing a query, we find the answer and create an ExtractedAnswer object
extracted_answer = ExtractedAnswer(
    data="Berlin",
    query="What is the capital of Germany?",
    meta={},
    document=doc,
    score=0.95,
    start=0,
    end=6
)


# These objects can then be used to present answers, log results, or further processing
print(f"Extracted Answer: {extracted_answer.data} with probability {extracted_answer.probability}")


TypeError: ExtractedAnswer.__init__() got an unexpected keyword argument 'start'

#### `GeneratedAnswer` 

This class is used when an answer is generated by a model, as in generative question-answering systems, and is not a direct excerpt from any document.

* data: The generated text of the answer. 

* documents: A list of Document objects that were used as context or reference to generate the answer. 

Use Cases for `GeneratedAnswer`

* In generative QA systems where answers are composed by the model based on information from multiple documents.
* In dialogue systems where the response is generated based on the context provided by previous conversation turns.

In [48]:
# In another scenario, we might have a generated answer, not directly extracted from a specific location in a document
generated_answer = GeneratedAnswer(
    data="Berlin is the capital of Germany.",
    documents=[doc],
    query="What is the capital of Germany?",
    metadata={},
)

print(f"Generated Answer: {generated_answer.data}")

TypeError: GeneratedAnswer.__init__() got an unexpected keyword argument 'metadata'

In the next section, we will begin to get familiar with components and pipelines. This will enable us to process the data further, and connect the document store to a retriever and an LLM for data extraction using Natural Language.

[Follow next notebook](components.ipynb)