# Exploring Haystack Components: A Pedagogical Walkthrough
This notebook demonstrates how to use Haystack's core components for document cleaning, splitting, prompt building, and LLM generation. Each step is explained with code and commentary to help you understand the pipeline construction process.

## 1. Environment Setup
We start by loading environment variables (such as API keys) from a `.env` file. This is a best practice for keeping sensitive information out of your codebase.

In [2]:
import os
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv(".env")

# You can now access the API key using os.getenv
# openai_api_key = os.getenv("OPENAI_API_KEY")


True
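
Once `load_dotenv` has run, the values are ordinary environment variables. The sketch below shows one way to read a key and fail early when it is missing. `require_env` is a hypothetical helper (not part of `python-dotenv` or Haystack), and the variable name and value are placeholders for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set.")
    return value


# Placeholder for illustration only; in the notebook, load_dotenv() supplies the real value.
os.environ["EXAMPLE_API_KEY"] = "sk-placeholder"

api_key = require_env("EXAMPLE_API_KEY")
print(f"Loaded a key of length {len(api_key)}")
```

Failing at load time gives a clearer error than a failed API call several cells later.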

## 2. Creating Documents
We use Haystack's `Document` class to represent pieces of text along with optional metadata. This is the basic unit that flows through Haystack pipelines.

In [9]:
# Import the Document class from Haystack's data structures
from haystack.dataclasses import Document

# Create a list of Document objects
# Each Document has 'content' (the text) and optional 'meta' (a dictionary for metadata)
documents = [
    Document(
        content="Haystack   is an open-source framework for building     search systems.",
        meta={"source": "haystack_docs", "author": "deepset"},
    ),
    Document(
        content="Transformers     provide state-of-the-art natural     language processing capabilities.",
        meta={"source": "transformers_docs", "author": "huggingface"},
    ),
]

# Print the first document (id, content, and meta) to verify
print(documents[0])


Document(id=5dd82d78fe24f63940aee355fb069d6d2fc21c48b9a8ac14bc6ee12611c68f45, content: 'Haystack   is an open-source framework for building     search systems.', meta: {'source': 'haystack_docs', 'author': 'deepset'})
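
The 64-character hexadecimal `id` in the output above is consistent with a SHA-256 digest. The sketch below illustrates the general idea of a content-derived id, assuming the id is computed deterministically from the content and metadata. `SimpleDocument` is a hypothetical stand-in, and the exact fields and serialization Haystack hashes may differ, so it will not reproduce Haystack's ids.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text, metadata, and a derived id."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    id: str = field(default="", init=False)

    def __post_init__(self) -> None:
        # Hash content and metadata together so identical inputs give identical ids.
        payload = json.dumps({"content": self.content, "meta": self.meta}, sort_keys=True)
        self.id = hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = SimpleDocument("Haystack is a framework.", {"source": "haystack_docs"})
b = SimpleDocument("Haystack is a framework.", {"source": "haystack_docs"})
c = SimpleDocument("Haystack is a framework.", {"source": "other"})

print(a.id == b.id)  # same content and meta
print(a.id == c.id)  # same content, different meta
print(len(a.id))     # length of the hex digest
```

A deterministic id makes it possible to detect duplicates when the same text is ingested twice.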


## 3. Cleaning Documents
Text data is often messy. The `DocumentCleaner` component helps by removing empty lines and extra whitespace, making the text easier to process in later steps.

In [10]:
# Import the DocumentCleaner component
from haystack.components.preprocessors import DocumentCleaner

# 1. Initialize the DocumentCleaner
# We can configure its behavior with parameters. Here, we're telling it to:
# - remove_empty_lines: Delete lines that contain only whitespace.
# - remove_extra_whitespaces: Collapse multiple whitespace characters into a single space.
cleaner = DocumentCleaner(remove_empty_lines=True, remove_extra_whitespaces=True)

# 2. Run the component
# The .run() method takes a list of documents via the 'documents' keyword argument.
# It returns a dictionary where the cleaned documents are under the 'documents' key.
result = cleaner.run(documents=documents)
cleaned_documents = result["documents"]

# Let's inspect the content of the second document, which was messy before.
print("Original unclean document content:")
print(f"'{documents[1].content}'")

print("\nCleaned document content:")
print(f"'{cleaned_documents[1].content}'")




Original unclean document content:
'Transformers     provide state-of-the-art natural     language processing capabilities.'

Cleaned document content:
'Transformers provide state-of-the-art natural language processing capabilities.'
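
To make the two cleaning options concrete, here is a plain-Python sketch of the same idea using only the standard library. `clean_text` is a hypothetical function, not Haystack's implementation, and `DocumentCleaner` may handle newlines and other whitespace differently.

```python
import re


def clean_text(text: str) -> str:
    """Drop whitespace-only lines, then collapse runs of spaces and tabs within each line."""
    lines = [line for line in text.split("\n") if line.strip()]
    collapsed = [re.sub(r"[ \t]+", " ", line).strip() for line in lines]
    return "\n".join(collapsed)


messy = "Transformers     provide state-of-the-art natural     language processing capabilities."
print(clean_text(messy))

# A multi-line example with an empty line and a whitespace-only line.
print(repr(clean_text("first   line\n\n   \nsecond\tline  ")))
```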


## 4. Splitting Documents into Chunks
Long documents are often split into smaller chunks so that each piece fits comfortably in a language model's context. The `DocumentSplitter` component can break up text by units such as words, sentences, passages, or pages, with optional overlap between consecutive chunks so that context is retained across chunk boundaries.

In [None]:
# Import the DocumentSplitter component
from haystack.components.preprocessors import DocumentSplitter

# 1. Initialize the DocumentSplitter
# - split_by="word": We want to split the text based on word count.
# - split_length=5: Each chunk should have a maximum of 5 words.
# - split_overlap=2: Each chunk will share the last 2 words of the previous chunk.
splitter = DocumentSplitter(split_by="word", split_length=5, split_overlap=2)


# 2. Run the component on our cleaned documents
result = splitter.run(documents=cleaned_documents)
split_documents = result["documents"]

# Let's see how the documents were split
print(f"Before splitting, we had {len(cleaned_documents)} document(s).")
print(f"After splitting, we have {len(split_documents)} documents (chunks).")

print("\nChunks from the second document:")
for doc in split_documents:
    # We can check the metadata to see which original document a chunk came from
    if doc.meta['source'] == "transformers_docs":
        print(f"- '{doc.content}'")


Before splitting, we had 2 document(s).
After splitting, we have 5 documents (chunks).

Chunks from the second document:
- 'Transformers provide state-of-the-art natural language '
- 'natural language processing capabilities.'
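
The interaction between `split_length` and `split_overlap` is easiest to see as a sliding window: each new chunk starts `split_length - split_overlap` words after the previous one. The sketch below is a plain-Python illustration of that windowing. `split_words` is a hypothetical function, not Haystack's implementation, and unlike the output above it does not keep trailing spaces.

```python
from typing import List


def split_words(text: str, split_length: int, split_overlap: int) -> List[str]:
    """Split text into word windows of split_length that overlap by split_overlap words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once a window has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


text = "Haystack is an open-source framework for building search systems."
for chunk in split_words(text, split_length=5, split_overlap=2):
    print(f"- '{chunk}'")
```

With 9 words, a window of 5, and a step of 3, the windows start at words 0, 3, and 6, which gives three chunks for this sentence. The 7-word second document gives two, for the total of 5 reported above.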


## 5. Building Prompts for LLMs
Prompt engineering is crucial for getting good results from language models. The `PromptBuilder` component uses Jinja2 templates to dynamically construct prompts from your data and questions.

In [15]:
# Import the PromptBuilder component
from haystack.components.builders import PromptBuilder

# 1. Define a Jinja2 template string
# Placeholders are enclosed in double curly braces, like {{ query }} and {{ documents }}.
# Jinja2 also supports control structures like for-loops.
prompt_template = """
Answer the following question based on the provided context.

Context:
{% for doc in documents %}
- {{ doc.content }}
{% endfor %}

Question: {{ query }}
Answer:
"""

# 2. Initialize the PromptBuilder with the template
prompt_builder = PromptBuilder(template=prompt_template)

# 3. Run the component
# We provide the values for the placeholders ('documents' and 'query') as keyword arguments.
# The component will render the template into a single string.
result = prompt_builder.run(
    query="What is the climate like in Islamabad?",
    documents=split_documents  # Using the chunks from the previous step
)

# The final, rendered prompt is in the 'prompt' key of the output dictionary
final_prompt = result["prompt"]

print(final_prompt)


PromptBuilder has 2 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.



Answer the following question based on the provided context.

Context:

- Haystack is an open-source framework 

- open-source framework for building search 

- building search systems.

- Transformers provide state-of-the-art natural language 

- natural language processing capabilities.


Question: What is the climate like in Islamabad?
Answer:
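
The blank lines between the bullets in the rendered prompt come from the newlines that surround the `{% for %}` and `{% endfor %}` tags in the template. To show what the loop does, the sketch below assembles the same prompt in plain Python, without the extra blank lines. `build_prompt` is a hypothetical function for illustration only; `PromptBuilder` itself renders the Jinja2 template.

```python
from typing import List


def build_prompt(query: str, chunks: List[str]) -> str:
    """Assemble a context-plus-question prompt, one bullet per chunk."""
    context = "\n".join(f"- {chunk.strip()}" for chunk in chunks)
    return (
        "Answer the following question based on the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


prompt = build_prompt(
    query="What is the climate like in Islamabad?",
    chunks=[
        "Haystack is an open-source framework ",
        "building search systems.",
    ],
)
print(prompt)
```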


## 6. Generating Answers with an LLM
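Finally, we use the `OpenAIGenerator` component to send our prompt to an OpenAI language model (here, `gpt-4o-mini`) and receive a generated answer. This step requires an API key and internet access. Because the context says nothing about Islamabad, a model that sticks to the provided context should decline to answer.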

In [16]:
from haystack.utils import Secret
from haystack.components.generators import OpenAIGenerator


# Check if the API key is available
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("Please set the OPENAI_API_KEY environment variable.")

# 1. Initialize the OpenAIGenerator
# We provide the API key securely using the Secret class.
# We also specify the model we want to use.
generator = OpenAIGenerator(
    api_key=Secret.from_env_var("OPENAI_API_KEY"),
    model="gpt-4o-mini"
)

# 2. Run the component with the prompt from the PromptBuilder
# The input key is 'prompt'.
result = generator.run(prompt=final_prompt)

# The generated text is in a list under the 'replies' key.
# The 'meta' key contains additional information like token usage.
llm_reply = result["replies"]
llm_meta = result["meta"]

print("LLM Reply:")
print(llm_reply)

print("\nLLM Meta:")
print(llm_meta)

LLM Reply:
['The provided context does not contain any information regarding the climate in Islamabad. Therefore, I cannot answer that question based on the given context.']

LLM Meta:
[{'model': 'gpt-4o-mini-2024-07-18', 'index': 0, 'finish_reason': 'stop', 'usage': {'completion_tokens': 27, 'prompt_tokens': 70, 'total_tokens': 97, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}}]
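
Both `replies` and `meta` are lists with one entry per generated completion, so in practice you usually index into them. The sketch below uses a literal dictionary copied from the output above, with the `meta` entry trimmed to a few fields, so it runs without an API call.

```python
# Literal copy of the generator output shown above, with 'meta' trimmed for brevity.
result = {
    "replies": [
        "The provided context does not contain any information regarding the climate "
        "in Islamabad. Therefore, I cannot answer that question based on the given context."
    ],
    "meta": [
        {
            "model": "gpt-4o-mini-2024-07-18",
            "finish_reason": "stop",
            "usage": {"completion_tokens": 27, "prompt_tokens": 70, "total_tokens": 97},
        }
    ],
}

# Take the first (and here only) completion and its metadata.
first_reply = result["replies"][0]
first_meta = result["meta"][0]
usage = first_meta["usage"]

print(first_reply)
print(f"Model: {first_meta['model']}, finish reason: {first_meta['finish_reason']}")
print(
    f"{usage['prompt_tokens']} prompt + {usage['completion_tokens']} completion "
    f"= {usage['total_tokens']} tokens"
)
```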


---
## Summary and Next Steps
In this notebook, you learned how to use Haystack components to clean and split documents, build prompts, and generate answers with an LLM. Each component was run on its own here; a natural next step is to connect them in a Haystack `Pipeline`, or to experiment with your own data and templates.