# Building blocks in Haystack: Data classes 

Data structures are a core component of any data pipeline: they let us store, manipulate, and manage data through code. A solid grasp of them eases NLP pipeline development, particularly when an LLM is involved. Haystack provides the following built-in data classes:

* Haystack Documents data class 

* Haystack ByteStream data class 

* Haystack ChatMessage data class 

* Haystack StreamingChunk data class 

Haystack also supports dataframe objects, dictionaries, and JSON objects.

![](./images/data-structures.png)

Each of these classes acts as a data structure for storing and processing data. We can use them to hold data in a standardized format, and then use the Haystack API to process that data through pipelines.

In the next section, we will provide examples of each.

### Haystack Documents data class 

The Document is a foundational data class in Haystack that encapsulates a variety of data types that can be queried, such as text snippets, tables, and binary data.

Let's import it and take a look at its functionality.

In [4]:
from haystack.preview.dataclasses import Document

Using the `help` function lets us see what parameters it accepts. 
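Besides `help`, the standard library's `inspect.signature` and `dataclasses.fields` give a compact view of a data class's parameters. The sketch below uses `MiniDocument`, a hypothetical stand-in defined here so the example runs without Haystack installed; it is not Haystack's `Document`, and its fields are only a subset chosen for illustration. The same calls should work on any Python dataclass.

```python
import inspect
from dataclasses import dataclass, field, fields
from typing import Any, Dict, Optional


@dataclass
class MiniDocument:
    """Hypothetical stand-in for illustration; not Haystack's Document."""

    content: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)
    score: Optional[float] = None


# Show the constructor signature in one line
print(inspect.signature(MiniDocument))

# List each field with its declared type
for f in fields(MiniDocument):
    print(f.name, f.type)
```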

### Let's create a simple Document object.

In [5]:
sample_document = Document(content="This is a simple document", meta={"name": "test_doc"})
sample_document

Document(id='ca53157e450d009adb4c2217111faadc9e7c02aefb22717c4901e1c1c1ba314a', content='This is a simple document', dataframe=None, blob=None, meta={'name': 'test_doc'}, score=None)

In [6]:
sample_document.id

'ca53157e450d009adb4c2217111faadc9e7c02aefb22717c4901e1c1c1ba314a'

We see that an id was automatically generated for the document. Let's access the content and metadata.

In [7]:
sample_document.content

'This is a simple document'

In [8]:
sample_document.meta

{'name': 'test_doc'}
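The generated id above is 64 hexadecimal characters long, which is consistent with a SHA-256 digest. As a minimal sketch of how a deterministic, content-derived id can work, the hypothetical helper `content_id` below hashes the content together with the metadata. Which fields Haystack actually hashes is not shown in this notebook, so treat the choice of inputs here as an assumption; the ids it produces will not match Haystack's.

```python
import hashlib
from typing import Any, Dict


def content_id(content: str, meta: Dict[str, Any]) -> str:
    """Hypothetical helper: derive a stable id from content and metadata."""
    # Sort metadata items so that key order does not change the id
    payload = f"{content}|{sorted(meta.items())}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


id_a = content_id("This is a simple document", {"name": "test_doc"})
id_b = content_id("This is a simple document", {"name": "test_doc"})
id_c = content_id("This is another document", {"name": "test_doc"})

print(id_a == id_b)  # same input, same id
print(id_a == id_c)  # different content, different id
```

A content-derived id means that writing the same document twice can be detected as a duplicate rather than stored as two separate entries.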

### Let's create a dataframe-based Document

In [9]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups, load_iris

# Load some example data
iris_df = load_iris(as_frame=True)["frame"]
news_df = pd.DataFrame(fetch_20newsgroups(subset="train").data, columns=["text"])

# Save each row as a Document Object
iris_docs = [Document(dataframe=row.to_frame().T) for _, row in iris_df.iterrows()]

Each row has been converted into a Document object with its own id. Let's look at the first Document and its attributes.

In [10]:
iris_docs[0]

Document(id='22cf9396b67c1929c273ed65a6fcea5b8ba8b384ae45d5164be9ca7b6827c66c', content=None, dataframe=   sepal length (cm)  sepal width (cm)  ...  petal width (cm)  target
0                5.1               3.5  ...               0.2     0.0

[1 rows x 5 columns], blob=None, meta={}, score=None)

In [11]:
iris_docs[0].id

'22cf9396b67c1929c273ed65a6fcea5b8ba8b384ae45d5164be9ca7b6827c66c'

In [12]:
iris_docs[0].dataframe

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 |


### Let's create a ByteStream-based data structure

The ByteStream class in Haystack represents a binary object that can be used within the API.

In [13]:
from haystack.preview.dataclasses import  ByteStream

In [14]:
# Assuming 'binary_data' is your binary data, for example, read from a file:
binary_data = b'Your binary data here'  # This could be the actual binary content, such as PDF or image data

# Convert binary data to ByteStream object
binary_blob = ByteStream(data=binary_data, mime_type='application/pdf')  # MIME type should match your data
binary_document = Document(blob=binary_blob, meta={"file_name": "example.pdf", "file_type": "PDF"})

In [15]:
binary_document.blob

ByteStream(data=b'Your binary data here', metadata={}, mime_type='application/pdf')

In [16]:
binary_document.id

'93c323201fc3b8509e51056dff8baee6ca9dec1c22cf2ce2f6cfc0bb04397c14'
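The comments in the example above note that the binary data could be read from a file and that the MIME type should match the data. A minimal sketch of both steps using only the standard library follows. The file name `example.pdf` and the placeholder bytes are made up for illustration; the resulting `binary_data` and `mime_type` values are what you would pass on to `ByteStream`.

```python
import mimetypes
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Create a placeholder file so the example is self-contained
    path = Path(tmp) / "example.pdf"
    path.write_bytes(b"%PDF-1.4 placeholder bytes")

    # Read the raw bytes back from disk
    binary_data = path.read_bytes()

    # Guess the MIME type from the file extension
    mime_type, _ = mimetypes.guess_type(path.name)

print(mime_type, len(binary_data))
```

Note that `mimetypes.guess_type` only looks at the file name, not at the bytes, so it can be wrong for mislabeled files.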

### Let's create a ChatMessage Document

ChatMessage comes with the following built-in roles:

```python
class ChatRole(str, Enum):
    """Enumeration representing the roles within a chat."""

    ASSISTANT = "assistant"
    USER = "user"
    SYSTEM = "system"
    FUNCTION = "function"
```

These roles correspond to the message roles used by OpenAI's chat models.

Read more: https://help.openai.com/en/articles/7042661-chatgpt-api-transition-guide
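To make the correspondence concrete, here is a rough sketch that turns a role and some content into the dictionary shape used for messages in OpenAI's chat API (`role`, `content`, and an optional `name`). The `ChatRole` enum is copied from the listing above so the example runs on its own, and `to_openai_dict` is a hypothetical helper written for this illustration, not a Haystack function.

```python
from enum import Enum
from typing import Dict, Optional


class ChatRole(str, Enum):
    """Copy of the roles listed above, so the sketch is self-contained."""

    ASSISTANT = "assistant"
    USER = "user"
    SYSTEM = "system"
    FUNCTION = "function"


def to_openai_dict(role: ChatRole, content: str, name: Optional[str] = None) -> Dict[str, str]:
    """Hypothetical helper: build an OpenAI-style message dictionary."""
    message = {"role": role.value, "content": content}
    if name is not None:
        # Function messages carry the name of the function that produced them
        message["name"] = name
    return message


messages = [
    to_openai_dict(ChatRole.SYSTEM, "You are a helpful assistant."),
    to_openai_dict(ChatRole.USER, "Can you show me the weather forecast?"),
    to_openai_dict(ChatRole.FUNCTION, "Retrieving weather data...", name="fetch_weather"),
]
print(messages)
```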

In [17]:
from haystack.preview.dataclasses import ChatMessage

In [18]:
# Create a message from the assistant
assistant_msg = ChatMessage.from_assistant(content="Hello, how can I assist you today?")

print(assistant_msg)


ChatMessage(content='Hello, how can I assist you today?', role=<ChatRole.ASSISTANT: 'assistant'>, name=None, metadata={})


In [19]:
# Create a message from the user
user_msg = ChatMessage.from_user(content="Can you show me the weather forecast?")

print(user_msg)

ChatMessage(content='Can you show me the weather forecast?', role=<ChatRole.USER: 'user'>, name=None, metadata={})


In [20]:
# Create a system message, for instance, to indicate that a user has joined the chat
system_msg = ChatMessage.from_system(content="A new user has joined the chat.")

print(system_msg)

ChatMessage(content='A new user has joined the chat.', role=<ChatRole.SYSTEM: 'system'>, name=None, metadata={})


In [21]:
# Create a function message, for example, to execute a command to retrieve weather data
function_msg = ChatMessage.from_function(content="Retrieving weather data...", name="fetch_weather")

print(function_msg)

ChatMessage(content='Retrieving weather data...', role=<ChatRole.FUNCTION: 'function'>, name='fetch_weather', metadata={})


Let's populate a Document object with a ChatMessage object.

In [22]:
user_message_doct = Document(content = user_msg)

user_message_doct.id

'740dcdab24b6c171e89af1f1158056d6f09c6cd238a39866dfe7160a47eeba9a'

In [23]:
user_message_doct.content

ChatMessage(content='Can you show me the weather forecast?', role=<ChatRole.USER: 'user'>, name=None, metadata={})

### StreamingChunk data class

The StreamingChunk class is designed to manage segments of streamed content, which could be part of a larger message or data transfer in a streaming context.

In [24]:
from haystack.preview.dataclasses import StreamingChunk

Here's an example of how to create an instance of the StreamingChunk data class. Since its content is a string, a chunk typically represents a piece of streamed text, such as part of an LLM response that arrives incrementally:

In [25]:
# Metadata for the streaming chunk
stream_metadata = {
    "timestamp": "2023-11-08T12:00:00Z",
    "stream_id": "stream123",
    "segment_number": 1
}

# Content of the streaming chunk
stream_content = "This is the first segment of the live stream."

# Create the StreamingChunk instance
streaming_chunk = StreamingChunk(content=stream_content, metadata=stream_metadata)

print(streaming_chunk)


StreamingChunk(content='This is the first segment of the live stream.', metadata={'timestamp': '2023-11-08T12:00:00Z', 'stream_id': 'stream123', 'segment_number': 1})


We can attach metadata to each StreamingChunk to track changes in the stream. We can also store streamed content in Documents.

In [26]:
streaming_document = Document(content = stream_content, meta = stream_metadata)

streaming_document.id

'55b3a7072cc30c752c726922b929f073bf377fb72dbe89431c323031cf5360cd'

In [27]:
streaming_document.content

'This is the first segment of the live stream.'
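In a streaming setting, chunks usually arrive one at a time and are handed to a callback that collects or displays them. The sketch below illustrates that pattern with `MiniChunk`, a hypothetical stand-in that mirrors the two fields shown above (`content` and `metadata`), and `on_chunk`, a made-up callback name. It is an assumption about typical usage rather than a description of Haystack's streaming API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class MiniChunk:
    """Hypothetical stand-in with the same two fields as StreamingChunk."""

    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


received: List[MiniChunk] = []


def on_chunk(chunk: MiniChunk) -> None:
    """Callback invoked once per chunk; here it simply collects them."""
    received.append(chunk)


# Simulate a stream that delivers the text in three pieces
for number, piece in enumerate(["This is ", "the first ", "segment."], start=1):
    on_chunk(MiniChunk(content=piece, metadata={"segment_number": number}))

# Reassemble the full text once the stream is finished
full_text = "".join(chunk.content for chunk in received)
print(full_text)
```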

### The `DocumentStore` class 

The `DocumentStore` class is an internal component of the Haystack library that serves as a registry for classes that are marked as document stores. A document store in Haystack is a place where documents are stored and retrieved, typically used as part of a pipeline to handle data for search and retrieval tasks. 

Let's begin saving our documents into a DocumentStore.


In [28]:
from haystack.preview.document_stores.in_memory.document_store import InMemoryDocumentStore

Recall our collection of Documents built from the iris dataframe:

In [29]:
iris_docs[0:5]

[Document(id='22cf9396b67c1929c273ed65a6fcea5b8ba8b384ae45d5164be9ca7b6827c66c', content=None, dataframe=   sepal length (cm)  sepal width (cm)  ...  petal width (cm)  target
 0                5.1               3.5  ...               0.2     0.0
 
 [1 rows x 5 columns], blob=None, meta={}, score=None),
 Document(id='c4852f58c6c65daaa7b11d7c009d8cbf7198c52c55f63fe27bf888beec64b673', content=None, dataframe=   sepal length (cm)  sepal width (cm)  ...  petal width (cm)  target
 1                4.9               3.0  ...               0.2     0.0
 
 [1 rows x 5 columns], blob=None, meta={}, score=None),
 Document(id='109c0409cdbcf2343ee97efd3ec334e74e73b5eeed3ecc362cbcff8adda10603', content=None, dataframe=   sepal length (cm)  sepal width (cm)  ...  petal width (cm)  target
 2                4.7               3.2  ...               0.2     0.0
 
 [1 rows x 5 columns], blob=None, meta={}, score=None),
 Document(id='3eef63e56ef7174a490478bd4147b70c113521fee93a13f430e14641a330fff3', content

We will initialize an InMemoryDocumentStore and save our documents into it. From its documentation:

* Stores data in-memory. It's ephemeral and cannot be saved to disk.
* Uses the BM25 algorithm for document search by default.
* Useful for testing and quick prototyping.
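For intuition about BM25, the sketch below scores a few text documents against a query in pure Python. It is a simplified illustration under stated assumptions, not the code the document store runs: the `k1` and `b` values are common defaults chosen here, the IDF uses a smoothed variant that stays non-negative, and the function name `bm25_scores` is hypothetical. The tokenization regex is the one that appears in the store's default parameters later in this notebook.

```python
import math
import re
from collections import Counter
from typing import List

# Same pattern as the store's default bm25_tokenization_regex
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text: str) -> List[str]:
    return TOKEN_RE.findall(text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Hypothetical helper: score each document against the query with BM25."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Smoothed IDF: rare terms weigh more, and the value is never negative
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates (k1) and is normalized by document length (b)
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Berlin is the capital of Germany.",
    "Paris is the capital of France.",
    "The iris dataset contains flower measurements.",
]
scores = bm25_scores("capital of Germany", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

BM25 works on text, so it applies to Documents that have textual content; the dataframe-only iris Documents stored below would need a text representation before keyword search is meaningful.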


In [30]:
# Write documents to document store
iris_docstore = InMemoryDocumentStore()
iris_docstore.write_documents(documents=iris_docs)


Count the total number of documents in the DocumentStore:

In [31]:
iris_docstore.count_documents()

150

Convert the DocumentStore's configuration into a dictionary:

In [32]:
iris_docstore.to_dict()

{'type': 'InMemoryDocumentStore',
 'init_parameters': {'bm25_tokenization_regex': '(?u)\\b\\w\\w+\\b',
  'bm25_algorithm': 'BM25Okapi',
  'bm25_parameters': {},
  'embedding_similarity_function': 'dot_product'}}
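Because the output above is a plain dictionary of strings and nested dictionaries, it can be serialized with the standard `json` module. The sketch below copies that dictionary by hand into `store_config` (a variable name chosen for this example) and round-trips it through JSON. Note that the dictionary shown holds only the store's type and initialization parameters, not the 150 documents themselves.

```python
import json

# Dictionary copied from the to_dict() output shown above
store_config = {
    "type": "InMemoryDocumentStore",
    "init_parameters": {
        "bm25_tokenization_regex": "(?u)\\b\\w\\w+\\b",
        "bm25_algorithm": "BM25Okapi",
        "bm25_parameters": {},
        "embedding_similarity_function": "dot_product",
    },
}

# Serialize to a JSON string and read it back
serialized = json.dumps(store_config, indent=2)
restored = json.loads(serialized)

print(serialized)
```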

We now turn to data structures designed for question answering systems.

### Answer, ExtractedAnswer and GeneratedAnswer

The Answer, ExtractedAnswer, and GeneratedAnswer classes are data structures commonly used in natural language processing (NLP) pipelines, particularly in the context of question answering systems.


In [33]:
from haystack.preview.dataclasses import Answer, GeneratedAnswer, ExtractedAnswer

#### `Answer` is a base data class that represents a generic answer structure. It contains the fields: 

* data: The content of the answer. 

* query: The original question or query that prompted the answer. 

* metadata: A dictionary containing any additional information about the answer. 

#### `ExtractedAnswer` inherits from Answer and is more specific to scenarios where the answer is extracted from a text. It includes additional fields: 

* data: The text of the answer extracted from a document. 

* document: The Document object from which the answer was extracted. 

* probability: A float representing the confidence score of the extracted answer being correct. 

* start: The start index of the answer in the content of the Document. 

* end: The end index of the answer in the content of the Document. 

#### `GeneratedAnswer` also inherits from Answer and is used when the answer is generated (for example, by a language model) rather than extracted. Its fields are: 

* data: The generated text of the answer. 

* documents: A list of Document objects that were used as context or reference to generate the answer. 

In [37]:
# Assume we have a document that contains the answer to a question
doc = Document(content="Berlin is the capital of Germany.", id="123")

answer = Answer(data='Berlin',
                 query='What is the capital of Germany?',
                 metadata={})

# After processing a query, we find the answer and create an ExtractedAnswer object
extracted_answer = ExtractedAnswer(
    data="Berlin",
    query="What is the capital of Germany?",
    metadata={},
    document=doc,
    probability=0.95,
    start=0,
    end=6
)

# In another scenario, we might have a generated answer, not directly extracted from a specific location in a document
generated_answer = GeneratedAnswer(
    data="Berlin is the capital of Germany.",
    documents=[doc],
    query="What is the capital of Germany?",
    metadata={},
)

# These objects can then be used to present answers, log results, or further processing
print(f"Extracted Answer: {extracted_answer.data} with probability {extracted_answer.probability}")
print(f"Generated Answer: {generated_answer.data}")

Extracted Answer: Berlin with probability 0.95
Generated Answer: Berlin is the capital of Germany.
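The `start` and `end` values in the example above are character offsets into the document's content, with `end` being exclusive. A quick way to derive and sanity-check such offsets, using only plain strings, is sketched below; the variable names are chosen for this illustration.

```python
content = "Berlin is the capital of Germany."
answer_text = "Berlin"

# Locate the answer inside the content and compute its span
start = content.find(answer_text)
end = start + len(answer_text)

# Slicing with the span should return exactly the answer text
print(start, end, content[start:end])
```

Checking that `content[start:end]` reproduces the answer is a cheap guard against off-by-one errors when building extracted answers by hand.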


In the next section, we will get familiar with components and pipelines. These let us process the data further and connect the document store to a retriever and an LLM, so that we can extract information using natural language.

[Follow next notebook](components.ipynb)