<a href="https://colab.research.google.com/github/RTVIENNA/1450-RAG-Preprocessing/blob/main/1450_preprocessing_index_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Preprocessing Different File Types

- **Level**: Beginner
- **Time to complete**: 15 minutes
- **Goal**: After completing this tutorial, you'll have learned how to build an indexing pipeline that will preprocess files based on their file type, using the `FileTypeRouter`.

> 💡 (Optional): After creating the indexing pipeline in this tutorial, there is an optional section that shows you how to create a RAG pipeline on top of the document store you just created. You must have a [Hugging Face API Key](https://huggingface.co/settings/tokens) for this section.

## Components Used

- [`FileTypeRouter`](https://docs.haystack.deepset.ai/docs/filetyperouter): This component will help you route files to different components based on their MIME type
- [`MarkdownToDocument`](https://docs.haystack.deepset.ai/docs/markdowntodocument): This component will help you convert Markdown files into Haystack Documents
- [`PyPDFToDocument`](https://docs.haystack.deepset.ai/docs/pypdftodocument): This component will help you convert PDF files into Haystack Documents
- [`TextFileToDocument`](https://docs.haystack.deepset.ai/docs/textfiletodocument): This component will help you convert text files into Haystack Documents
- [`DocumentJoiner`](https://docs.haystack.deepset.ai/docs/documentjoiner): This component will help you join Documents coming from different branches of a pipeline
- [`DocumentCleaner`](https://docs.haystack.deepset.ai/docs/documentcleaner) (optional): This component will help you make Documents more readable, for example by removing extra whitespace
- [`DocumentSplitter`](https://docs.haystack.deepset.ai/docs/documentsplitter): This component will help you split your Documents into chunks
- [`SentenceTransformersDocumentEmbedder`](https://docs.haystack.deepset.ai/docs/sentencetransformersdocumentembedder): This component will help you create embeddings for Documents
- [`DocumentWriter`](https://docs.haystack.deepset.ai/docs/documentwriter): This component will help you write Documents into the DocumentStore

## Overview

In this tutorial, you'll build an indexing pipeline that preprocesses different types of files (Markdown, plain text and PDF). Each file type will have its own converter. The rest of the indexing pipeline is fairly standard: trim whitespace, split the documents into chunks, create embeddings and write them to a Document Store.

Optionally, you can keep going to see how to use these documents in a query pipeline as well.

## Preparing the Colab Environment

- [Enable GPU Runtime in Colab](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration)
- [Set logging level to INFO](https://docs.haystack.deepset.ai/docs/logging)

## Installing dependencies


In [None]:
%%bash

nvidia-smi



Wed Mar 19 20:34:44 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   44C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

In [None]:
%%bash
pip install haystack-ai
pip install "sentence-transformers>=3.0.0" "huggingface_hub>=0.23.0"
pip install markdown-it-py mdit_plain pypdf
pip install gdown


In [None]:
import logging
from haystack import tracing
from haystack.tracing.logging_tracer import LoggingTracer

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.DEBUG)

tracing.tracer.is_content_tracing_enabled = True # to enable tracing/logging content (inputs/outputs)
tracing.enable_tracing(LoggingTracer(tags_color_strings={"haystack.component.input": "\x1b[1;31m", "haystack.component.name": "\x1b[1;34m"}))

### Enabling Telemetry

Knowing you’re using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See [Telemetry](https://docs.haystack.deepset.ai/docs/enabling-telemetry) for more details.

In [None]:
from haystack.telemetry import tutorial_running

tutorial_running(30)

INFO:haystack.telemetry._telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems in the [documentation page](https://docs.haystack.deepset.ai/docs/telemetry#how-can-i-opt-out). More information at [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry).
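
The log message above mentions a second opt-out route: the `HAYSTACK_TELEMETRY_ENABLED` environment variable. A minimal sketch of setting it from Python follows. The value `"False"` and the need to set it before `haystack` is imported are assumptions here, so check the linked telemetry documentation for the exact behaviour.

```python
import os

# Assumption: the variable is read when Haystack starts up, so set it
# before the first `import haystack` in the session.
os.environ["HAYSTACK_TELEMETRY_ENABLED"] = "False"

print(os.environ["HAYSTACK_TELEMETRY_ENABLED"])
```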


## Download All Files

Files that you will use in this tutorial are stored in a [GDrive folder](https://drive.google.com/drive/u/0/folders/1YrBIqbbi5uXjR-fuEAMBHL-TwpjtViXu). Either download the files directly from the GDrive folder or run the code below. If you're running this tutorial on Colab, you'll find the downloaded files in the `1450_files` folder in the "Files" tab on the left.

Just like most real life data, these files are a mishmash of different types.

In [None]:
import gdown

url = "https://drive.google.com/drive/u/0/folders/1YrBIqbbi5uXjR-fuEAMBHL-TwpjtViXu"
output_dir = "1450_files"

gdown.download_folder(url, quiet=True, output=output_dir)

['1450_files/Manchester Triage System_ Notaufnahmen Campus Charité Mitte und Campus Virchow-Klinikum.pdf',
 '1450_files/pflegenetz.magazin_Kovacevic.pdf']

## Create a Pipeline to Index Documents

Next, you'll create a pipeline to index documents. To keep things uncomplicated, you'll use an `InMemoryDocumentStore` but this approach would also work with any other flavor of `DocumentStore`.

You'll need a different file converter class for each file type in our data sources: `.pdf`, `.txt`, and `.md` in this case. Our `FileTypeRouter` connects each file type to the proper converter.

Once all our files have been converted to Haystack Documents, we can use the `DocumentJoiner` component to make these a single list of documents that can be fed through the rest of the indexing pipeline all together.
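
To make the routing idea concrete, here is a rough plain-Python sketch of grouping sources by MIME type. It is not Haystack's implementation: the `SUFFIX_TO_MIME` table and the `route_by_mime_type` helper are hypothetical names made up for this illustration, and only the `unclassified` bucket name is taken from the router's outputs.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical suffix-to-MIME table for this sketch only; the real
# FileTypeRouter resolves MIME types with its own logic.
SUFFIX_TO_MIME = {
    ".txt": "text/plain",
    ".pdf": "application/pdf",
    ".md": "text/markdown",
}


def route_by_mime_type(sources):
    """Group paths by MIME type; unknown suffixes land in 'unclassified'."""
    routed = defaultdict(list)
    for source in sources:
        path = Path(source)
        mime_type = SUFFIX_TO_MIME.get(path.suffix.lower(), "unclassified")
        routed[mime_type].append(path)
    return dict(routed)


# Made-up file names, used only to exercise the helper.
example_sources = ["triage.pdf", "notes.txt", "README.md", "photo.jpg"]
routed = route_by_mime_type(example_sources)
for mime_type, paths in routed.items():
    print(mime_type, [p.name for p in paths])
```

Each bucket corresponds to one output of the router, which is why the pipeline later connects `file_type_router.text/plain`, `file_type_router.application/pdf` and `file_type_router.text/markdown` to separate converters.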

In [None]:
from haystack.components.writers import DocumentWriter
from haystack.components.converters import MarkdownToDocument, PyPDFToDocument, TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner
from haystack.components.routers import FileTypeRouter
from haystack.components.joiners import DocumentJoiner
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
file_type_router = FileTypeRouter(mime_types=["text/plain", "application/pdf", "text/markdown"])
text_file_converter = TextFileToDocument()
markdown_converter = MarkdownToDocument()
pdf_converter = PyPDFToDocument()
document_joiner = DocumentJoiner()

DEBUG:haystack.core.component.component:Registering <class 'haystack.components.writers.document_writer.DocumentWriter'> as a component
DEBUG:haystack.core.component.component:Registered Component <class 'haystack.components.writers.document_writer.DocumentWriter'>
DEBUG:haystack.core.component.component:Registering <class 'haystack.components.converters.markdown.MarkdownToDocument'> as a component
DEBUG:haystack.core.component.component:Registered Component <class 'haystack.components.converters.markdown.MarkdownToDocument'>
DEBUG:haystack.core.component.component:Registering <class 'haystack.components.converters.pypdf.PyPDFToDocument'> as a component
DEBUG:haystack.core.component.component:Registered Component <class 'haystack.components.converters.pypdf.PyPDFToDocument'>
DEBUG:haystack.core.component.component:Registering <class 'haystack.components.converters.txt.TextFileToDocument'> as a component
DEBUG:haystack.core.component.component:Registered Component <class 'haystack.compo

From there, the steps in this indexing pipeline are a bit more standard. The `DocumentCleaner` removes extra whitespace. Then the `DocumentSplitter` breaks the documents into chunks of 150 words, with an overlap of 50 words between neighbouring chunks to avoid losing context at the boundaries.
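
As a rough illustration of word-based splitting with overlap, the sketch below slides a 150-word window forward by 100 words at a time (150 minus the 50-word overlap). The `split_by_word` helper is hypothetical and written for this explanation only; the actual `DocumentSplitter` may treat edge cases such as the final chunk differently.

```python
def split_by_word(text, split_length=150, split_overlap=50):
    """Return overlapping windows of words from `text`."""
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current window has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


# A synthetic 350-word text: w0 w1 ... w349
sample = " ".join(f"w{i}" for i in range(350))
chunks = split_by_word(sample)
print(len(chunks))            # number of chunks
print(chunks[1].split()[0])   # first word of the second chunk
```

With these settings, the last 50 words of one chunk reappear as the first 50 words of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk.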

In [None]:
document_cleaner = DocumentCleaner()
document_splitter = DocumentSplitter(split_by="word", split_length=150, split_overlap=50)

Now you'll add a `SentenceTransformersDocumentEmbedder` to create embeddings from the documents. As the last step in this pipeline, the `DocumentWriter` will write them to the `InMemoryDocumentStore`.


In [None]:
document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store)

After creating all the components, add them to the indexing pipeline.

In [None]:
preprocessing_pipeline = Pipeline()
preprocessing_pipeline.add_component(instance=file_type_router, name="file_type_router")
preprocessing_pipeline.add_component(instance=text_file_converter, name="text_file_converter")
preprocessing_pipeline.add_component(instance=markdown_converter, name="markdown_converter")
preprocessing_pipeline.add_component(instance=pdf_converter, name="pypdf_converter")
preprocessing_pipeline.add_component(instance=document_joiner, name="document_joiner")
preprocessing_pipeline.add_component(instance=document_cleaner, name="document_cleaner")
preprocessing_pipeline.add_component(instance=document_splitter, name="document_splitter")
preprocessing_pipeline.add_component(instance=document_embedder, name="document_embedder")
preprocessing_pipeline.add_component(instance=document_writer, name="document_writer")

DEBUG:haystack.core.pipeline.base:Adding component 'file_type_router' (<haystack.components.routers.file_type_router.FileTypeRouter object at 0x7c53cad2f150>

Inputs:
  - sources: List[Union[str, Path, ByteStream]]
  - meta: Union[Dict[str, Any], List[Dict[str, Any]]]
Outputs:
  - unclassified: List[Union[str, Path, ByteStream]]
  - text/plain: List[Union[str, Path, ByteStream]]
  - application/pdf: List[Union[str, Path, ByteStream]]
  - text/markdown: List[Union[str, Path, ByteStream]])
DEBUG:haystack.core.pipeline.base:Adding component 'text_file_converter' (<haystack.components.converters.txt.TextFileToDocument object at 0x7c5298e13fd0>

Inputs:
  - sources: List[Union[str, Path, ByteStream]]
  - meta: Union[Dict[str, Any], List[Dict[str, Any]]]
Outputs:
  - documents: List[Document])
DEBUG:haystack.core.pipeline.base:Adding component 'markdown_converter' (<haystack.components.converters.markdown.MarkdownToDocument object at 0x7c5298281e50>

Inputs:
  - sources: List[Union[str, Path

Next, connect them 👇

In [None]:
preprocessing_pipeline.connect("file_type_router.text/plain", "text_file_converter.sources")
preprocessing_pipeline.connect("file_type_router.application/pdf", "pypdf_converter.sources")
preprocessing_pipeline.connect("file_type_router.text/markdown", "markdown_converter.sources")
preprocessing_pipeline.connect("text_file_converter", "document_joiner")
preprocessing_pipeline.connect("pypdf_converter", "document_joiner")
preprocessing_pipeline.connect("markdown_converter", "document_joiner")
preprocessing_pipeline.connect("document_joiner", "document_cleaner")
preprocessing_pipeline.connect("document_cleaner", "document_splitter")
preprocessing_pipeline.connect("document_splitter", "document_embedder")
preprocessing_pipeline.connect("document_embedder", "document_writer")

DEBUG:haystack.core.pipeline.base:Connecting 'file_type_router.text/plain' to 'text_file_converter.sources'
DEBUG:haystack.core.pipeline.base:Connecting 'file_type_router.application/pdf' to 'pypdf_converter.sources'
DEBUG:haystack.core.pipeline.base:Connecting 'file_type_router.text/markdown' to 'markdown_converter.sources'
DEBUG:haystack.core.pipeline.base:Connecting 'text_file_converter.documents' to 'document_joiner.documents'
DEBUG:haystack.core.pipeline.base:Connecting 'pypdf_converter.documents' to 'document_joiner.documents'
DEBUG:haystack.core.pipeline.base:Connecting 'markdown_converter.documents' to 'document_joiner.documents'
DEBUG:haystack.core.pipeline.base:Connecting 'document_joiner.documents' to 'document_cleaner.documents'
DEBUG:haystack.core.pipeline.base:Connecting 'document_cleaner.documents' to 'document_splitter.documents'
DEBUG:haystack.core.pipeline.base:Connecting 'document_splitter.documents' to 'document_embedder.documents'
DEBUG:haystack.core.pipeline.base:

<haystack.core.pipeline.pipeline.Pipeline object at 0x7c529828fa10>
🚅 Components
  - file_type_router: FileTypeRouter
  - text_file_converter: TextFileToDocument
  - markdown_converter: MarkdownToDocument
  - pypdf_converter: PyPDFToDocument
  - document_joiner: DocumentJoiner
  - document_cleaner: DocumentCleaner
  - document_splitter: DocumentSplitter
  - document_embedder: SentenceTransformersDocumentEmbedder
  - document_writer: DocumentWriter
🛤️ Connections
  - file_type_router.text/plain -> text_file_converter.sources (List[Union[str, Path, ByteStream]])
  - file_type_router.application/pdf -> pypdf_converter.sources (List[Union[str, Path, ByteStream]])
  - file_type_router.text/markdown -> markdown_converter.sources (List[Union[str, Path, ByteStream]])
  - text_file_converter.documents -> document_joiner.documents (List[Document])
  - markdown_converter.documents -> document_joiner.documents (List[Document])
  - pypdf_converter.documents -> document_joiner.documents (List[Docume

Let's test this pipeline with the files you just downloaded.

In [None]:
from pathlib import Path

preprocessing_pipeline.run({"file_type_router": {"sources": list(Path(output_dir).glob("**/*"))}})

INFO:haystack.core.pipeline.base:Warming up component document_splitter...
INFO:haystack.core.pipeline.base:Warming up component document_embedder...
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

INFO:haystack.core.pipeline.pipeline:Running component file_type_router
DEBUG:haystack.tracing.logging_tracer:Operation: haystack.component.run
DEBUG:haystack.tracing.logging_tracer:[1;34mhaystack.component.name=file_type_router[0m
DEBUG:haystack.tracing.logging_tracer:haystack.component.type=FileTypeRouter[0m
DEBUG:haystack.tracing.logging_tracer:haystack.component.input_types={'sources': 'list', 'meta': 'NoneType'}[0m
DEBUG:haystack.tracing.logging_tracer:haystack.component.input_spec={'sources': {'type': 'typing.List[typing.Union[str, pathlib.Path, haystack.dataclasses.byte_stream.ByteStream]]', 'senders': []}, 'meta': {'type': 'typing.Union[typing.Dict[str, typing.Any], typing.List[typing.Dict[str, typing.Any]], NoneType]', 'senders': []}}[0m
DEBUG:haystack.tracing.logging_tracer:haystack.component.output_spec={'unclassified': {'type': 'typing.List[typing.Union[str, pathlib.Path, haystack.dataclasses.byte_stream.ByteStream]]', 'receivers': []}, 'text/plain': {'type': 'typing.L

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

DEBUG:haystack.tracing.logging_tracer:Operation: haystack.component.run
DEBUG:haystack.tracing.logging_tracer:[1;34mhaystack.component.name=document_embedder[0m
DEBUG:haystack.tracing.logging_tracer:haystack.component.type=SentenceTransformersDocumentEmbedder[0m
DEBUG:haystack.tracing.logging_tracer:haystack.component.input_types={'documents': 'list'}[0m
DEBUG:haystack.tracing.logging_tracer:haystack.component.input_spec={'documents': {'type': 'typing.List[haystack.dataclasses.document.Document]', 'senders': ['document_splitter']}}[0m
DEBUG:haystack.tracing.logging_tracer:haystack.component.output_spec={'documents': {'type': 'typing.List[haystack.dataclasses.document.Document]', 'receivers': ['document_writer']}}[0m
DEBUG:haystack.tracing.logging_tracer:[1;31mhaystack.component.input={'documents': [Document(id=ebfe22a7bab9c1fb15b4550c4378711ca749982f267b7e86782c0c95d0e609fb, content: 'wwww.pflegenetz.at www.wundplattform.com pflegenetz.02/11>	1514	>	pflegenetz.02/11 www.wundplat

{'document_writer': {'documents_written': 17}}
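
The `glob("**/*")` pattern used in the run call matches directories as well as files. The download folder here contains only two PDFs, so that makes no difference in this run, but for nested folders it can help to pass regular files only. A small sketch, where `list_files` is a hypothetical helper and the temporary folder exists only for the demonstration:

```python
import tempfile
from pathlib import Path


def list_files(folder):
    """Return every regular file under `folder`, skipping directories."""
    return sorted(p for p in Path(folder).glob("**/*") if p.is_file())


# Build a throwaway folder with one nested directory to show the filtering.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.pdf").write_bytes(b"%PDF-1.4")
    (root / "nested" / "b.txt").write_text("hello")
    found = [p.relative_to(root).as_posix() for p in list_files(root)]

print(found)
```

The resulting list could be passed as `sources` in place of the raw `glob` result.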


**💻 PUSH THE DATA TO A DATASET ON HUGGING FACE**





In [None]:
# Extract the processed documents from the DocumentStore.
# Calling `filter_documents()` without filters returns all stored documents.
documents = document_store.filter_documents()

In [None]:
# Convert the documents into a DataFrame
import pandas as pd
df = pd.DataFrame([{"content": doc.content, "meta": doc.meta} for doc in documents])

# Create a Hugging Face Dataset from the DataFrame
from datasets import Dataset
dataset = Dataset.from_pandas(df)

In [None]:
# Push the dataset to your repository on the Hugging Face Hub.
# This requires the HF_API_TOKEN environment variable to be set beforehand.
import os
dataset.push_to_hub("RTVIENNA/1450-RAG-Preprocessing-Data", token=os.environ["HF_API_TOKEN"])

**💻 END OF: PUSH THE DATA TO A DATASET ON HUGGING FACE**


🎉 If you only wanted to learn how to preprocess documents, you can stop here! If you want to see an example of using those documents in a RAG pipeline, read on.  

## (Optional) Build a pipeline to query documents

Now, let's build a RAG pipeline that answers queries based on the documents you just created in the section above. For this step, we will be using the [`HuggingFaceAPIChatGenerator`](https://docs.haystack.deepset.ai/docs/huggingfaceapichatgenerator), so you must have a [Hugging Face API Key](https://huggingface.co/settings/tokens) for this section. We will be using the `HuggingFaceH4/zephyr-7b-beta` model.

In [None]:
import os
from getpass import getpass

if "HF_API_TOKEN" not in os.environ:
    os.environ["HF_API_TOKEN"] = getpass("Enter Hugging Face token:")

Enter Hugging Face token:··········


In this step you'll build a query pipeline to answer questions about the documents.

This pipeline takes the prompt, searches the document store for relevant documents, and passes those documents along to the LLM to formulate an answer.

> ⚠️ Notice how we used `sentence-transformers/all-MiniLM-L6-v2` to create embeddings for our documents before. This is why we will be using the same model to embed incoming questions.
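
The reason for reusing the same model is that retrieval compares the query vector with the document vectors, and that comparison only makes sense when both live in the same embedding space. The sketch below shows the idea with cosine similarity on made-up 3-dimensional vectors. The vectors, the document names and the `retrieve` helper are hypothetical, and the scoring function the retriever actually uses may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, doc_embeddings, top_k=2):
    """Return the ids of the `top_k` documents most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), doc_id)
        for doc_id, embedding in doc_embeddings.items()
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


# Toy vectors standing in for real embeddings (assumption for illustration).
doc_embeddings = {
    "triage_colors": [0.9, 0.1, 0.0],
    "wound_care": [0.1, 0.9, 0.0],
    "opening_hours": [0.0, 0.1, 0.9],
}
query_embedding = [0.8, 0.2, 0.0]

print(retrieve(query_embedding, doc_embeddings))
```

If the query were embedded with a different model, its vector would have no meaningful relationship to the stored document vectors, and the ranking would be arbitrary.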

In [None]:
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator

template = [
    ChatMessage.from_user(
        """
Answer the questions based on the given context.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""
    )
]
pipe = Pipeline()
pipe.add_component("embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
pipe.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))
pipe.add_component("chat_prompt_builder", ChatPromptBuilder(template=template))
pipe.add_component(
    "llm",
    HuggingFaceAPIChatGenerator(
        api_type="serverless_inference_api", api_params={"model": "HuggingFaceH4/zephyr-7b-beta"}
    ),
)

pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "chat_prompt_builder.documents")
pipe.connect("chat_prompt_builder.prompt", "llm.messages")

DEBUG:haystack.core.component.component:Registering <class 'haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder'> as a component
DEBUG:haystack.core.component.component:Registered Component <class 'haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder'>
DEBUG:haystack.core.component.component:Registering <class 'haystack.components.retrievers.in_memory.embedding_retriever.InMemoryEmbeddingRetriever'> as a component
DEBUG:haystack.core.component.component:Registered Component <class 'haystack.components.retrievers.in_memory.embedding_retriever.InMemoryEmbeddingRetriever'>
DEBUG:haystack.core.component.component:Registering <class 'haystack.components.builders.chat_prompt_builder.ChatPromptBuilder'> as a component
DEBUG:haystack.core.component.component:Registered Component <class 'haystack.components.builders.chat_prompt_builder.ChatPromptBuilder'>
DEBUG:haystack.core.component.component:Register

<haystack.core.pipeline.pipeline.Pipeline object at 0x7c53f215bfd0>
🚅 Components
  - embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - chat_prompt_builder: ChatPromptBuilder
  - llm: HuggingFaceAPIChatGenerator
🛤️ Connections
  - embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> chat_prompt_builder.documents (List[Document])
  - chat_prompt_builder.prompt -> llm.messages (List[ChatMessage])
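
To see roughly what the LLM receives, the sketch below rebuilds the prompt text in plain Python. The `ChatPromptBuilder` renders the template itself, so this is only an approximation of its output (exact whitespace may differ), and the `render_prompt` helper and the placeholder chunk texts are made up for illustration.

```python
def render_prompt(documents, question):
    """Approximate the rendered template: context chunks, then the question."""
    context = "\n".join(f"    {content}" for content in documents)
    return (
        "Answer the questions based on the given context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder strings standing in for the retrieved document contents.
retrieved_contents = ["Retrieved chunk A", "Retrieved chunk B"]
prompt = render_prompt(retrieved_contents, "What does the color blue indicate?")
print(prompt)
```

Every retrieved chunk is pasted into the context, which is why the number of prompt tokens grows with the number and size of retrieved documents.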

Try it out yourself by running the code below. If all has gone well, you should get an answer drawn from the indexed documents.

In [None]:
question = (
    "What does the color blue indicate?"
)

pipe.run({"embedder": {"text": question}, "chat_prompt_builder": {"question": question}})

INFO:haystack.core.pipeline.base:Warming up component embedder...
INFO:haystack.core.pipeline.pipeline:Running component embedder


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

DEBUG:haystack.tracing.logging_tracer:Operation: haystack.component.run
DEBUG:haystack.tracing.logging_tracer:[1;34mhaystack.component.name=embedder[0m
DEBUG:haystack.tracing.logging_tracer:haystack.component.type=SentenceTransformersTextEmbedder[0m
DEBUG:haystack.tracing.logging_tracer:haystack.component.input_types={'text': 'str'}[0m
DEBUG:haystack.tracing.logging_tracer:haystack.component.input_spec={'text': {'type': 'str', 'senders': []}}[0m
DEBUG:haystack.tracing.logging_tracer:haystack.component.output_spec={'embedding': {'type': 'typing.List[float]', 'receivers': ['retriever']}}[0m
DEBUG:haystack.tracing.logging_tracer:[1;31mhaystack.component.input={'text': 'What does the color blue indicate?'}[0m
DEBUG:haystack.tracing.logging_tracer:haystack.component.visits=1[0m
DEBUG:haystack.tracing.logging_tracer:haystack.component.output={'embedding': [-0.012778927572071552, 0.05554584041237831, 0.026603369042277336, 0.04106936603784561, 0.047354623675346375, 0.04058205336332321

{'llm': {'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='In the context provided, the color blue indicates that the patient\'s condition is "nicht dringend" or not urgent, and their treatment should begin within 120 minutes, as per the Manchester Triage System classification.')], _name=None, _meta={'model': 'HuggingFaceH4/zephyr-7b-beta', 'finish_reason': 'stop', 'index': 0, 'usage': {'prompt_tokens': 4808, 'completion_tokens': 49}})]}}

Alternative test for uploading the data.




In [None]:
pip install datasets==2.13.1

In [None]:
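from datasets import Dataset

# Assuming your data is in a list of dictionaries called 'my_data'.
# `Dataset.from_list` accepts a list of row dicts; `Dataset.from_dict` expects a dict of column lists.
dataset = Dataset.from_list(my_data)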

In [None]:
# Replace 'your_username/your_dataset_name' with your desired dataset name
dataset.push_to_hub('your_username/your_dataset_name')

In [None]:
# Example

from datasets import Dataset

# Sample data: a list of row dictionaries
my_data = [
    {"text": "This is the first sample.", "label": 0},
    {"text": "Another example data point.", "label": 1},
]

# Create a Hugging Face dataset from the list of rows
dataset = Dataset.from_list(my_data)

# Push to the Hub (replace with your details)
dataset.push_to_hub('RTVIENNA/1450-RAG-Preprocessing-Data')
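
The `datasets` library distinguishes two input layouts: `Dataset.from_dict` takes a mapping of column names to lists of values, while `Dataset.from_list` takes a list of row dictionaries like the sample data above. The sketch below converts rows into the column layout in plain Python, in case you need the `from_dict` form. The `rows_to_columns` helper is hypothetical and not part of the library.

```python
def rows_to_columns(rows):
    """Turn a list of row dicts into a dict mapping column name -> list of values."""
    columns = {}
    # First pass: collect every column name in order of first appearance.
    for row in rows:
        for key in row:
            columns.setdefault(key, [])
    # Second pass: fill each column, using None where a row lacks a key.
    for row in rows:
        for key in columns:
            columns[key].append(row.get(key))
    return columns


rows = [
    {"text": "This is the first sample.", "label": 0},
    {"text": "Another example data point.", "label": 1},
]
columns = rows_to_columns(rows)
print(columns)
```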

## What's next

Congratulations on building an indexing pipeline that can preprocess different file types. Go forth and ingest all the messy real-world data into your workflows. 💥

If you liked this tutorial, you may also enjoy:
- [Serializing Haystack Pipelines](https://haystack.deepset.ai/tutorials/29_serializing_pipelines)
- [Creating Your First QA Pipeline with Retrieval-Augmentation](https://haystack.deepset.ai/tutorials/27_first_rag_pipeline)

To stay up to date on the latest Haystack developments, you can [sign up for our newsletter](https://landing.deepset.ai/haystack-community-updates). Thanks for reading!