<a href="https://colab.research.google.com/github/Ashish-Soni08/Playground/blob/main/haystack/Advent_of_Haystack_Preprocessing(Ashish_Soni).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advent of Haystack - Day 6
_Make a copy of this Colab to start!_


In this challenge, you will help Elf Bilge preprocess the winter reports before indexing them into a DocumentStore for RAG applications.

Your task is to complete the code in **Section 1** using the following components:

- [`FileTypeRouter`](https://docs.haystack.deepset.ai/v2.0/docs/filetyperouter): This component will help you route files based on their corresponding MIME type to different components

- [`MarkdownToDocument`](https://docs.haystack.deepset.ai/v2.0/docs/markdowntodocument): This component will help you convert markdown files into Haystack Documents

- [`PyPDFToDocument`](https://docs.haystack.deepset.ai/v2.0/docs/pypdftodocument): This component will help you convert PDF files into Haystack Documents

- [`TextFileToDocument`](https://docs.haystack.deepset.ai/v2.0/docs/textfiletodocument): This component will help you convert text files into Haystack Documents

- [`DocumentJoiner`](https://docs.haystack.deepset.ai/v2.0/docs/documentjoiner): This component will help you to join Documents coming from different branches of a pipeline

- [`DocumentCleaner`](https://docs.haystack.deepset.ai/v2.0/docs/documentcleaner) (optional): This component will help you make Documents more readable, for example by removing extra whitespace and empty lines

- [`DocumentSplitter`](https://docs.haystack.deepset.ai/v2.0/docs/documentsplitter): This component will help you to split your Document into chunks

- [`SentenceTransformersDocumentEmbedder`](https://docs.haystack.deepset.ai/v2.0/docs/sentencetransformersdocumentembedder): This component will help you create embeddings for Documents.

- [`DocumentWriter`](https://docs.haystack.deepset.ai/v2.0/docs/documentwriter): This component will help you write Documents into the DocumentStore
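
To make the routing idea concrete, here is a minimal standard-library sketch of grouping files by MIME type with an `unclassified` fallback. It is not Haystack's implementation: the helper name `route_by_mime_type` is hypothetical, and it assumes MIME types are guessed from file extensions alone.

```python
import mimetypes
from collections import defaultdict
from pathlib import Path


def route_by_mime_type(sources, mime_types):
    """Group paths by guessed MIME type; anything not listed goes to 'unclassified'."""
    routed = defaultdict(list)
    for source in sources:
        # Guess the type from the file extension only (an assumption of this sketch).
        guessed, _ = mimetypes.guess_type(Path(source).name)
        key = guessed if guessed in mime_types else "unclassified"
        routed[key].append(source)
    return dict(routed)


routes = route_by_mime_type(
    ["winter_report_one.txt", "winter_report_two.pdf", "winter_report_three.md"],
    mime_types=["text/plain", "application/pdf"],
)
print(routes)
```

With only `text/plain` and `application/pdf` listed, the Markdown report lands in the fallback group, which is why a separate branch is needed for it later in the notebook.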

# Installation
**Note:** Colab ships with `llmx`, which causes a known version conflict, so you might see an `llmx` error during installation. You can safely ignore it, or run `pip uninstall -y llmx`.

In [1]:
%%bash
pip install haystack-ai
pip install "transformers[torch,sentencepiece]==4.32.1" "sentence-transformers>=2.2.0"
pip install markdown-it-py mdit_plain
pip install pypdf

Collecting haystack-ai
  Downloading haystack_ai-2.0.0b3-py3-none-any.whl (189 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 189.7/189.7 kB 1.7 MB/s eta 0:00:00
Collecting boilerpy3 (from haystack-ai)
  Downloading boilerpy3-1.0.7-py3-none-any.whl (22 kB)
Collecting lazy-imports (from haystack-ai)
  Downloading lazy_imports-0.3.1-py3-none-any.whl (12 kB)
Collecting openai<1.0.0 (from haystack-ai)
  Downloading openai-0.28.1-py3-none-any.whl (76 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.0/77.0 kB 5.5 MB/s eta 0:00:00
Collecting posthog (from haystack-ai)
  Downloading posthog-3.1.0-py2.py3-none-any.whl (37 kB)
Collecting rank-bm25 (from haystack-ai)
  Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Collecting monotonic>=1.5 (from posthog->haystack-ai)
  Downloading monotonic-1.6-py2.py3-none-any.whl (8.2 kB)
Collecting backoff>=1.10.0 (from posthog->haystack-ai)
  Downloading backoff-2.2.1-py3-none-any.whl (15 kB)
Installing collected packages: monotonic, rank-bm25,

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
llmx 0.0.15a0 requires tiktoken, which is not installed.


### Enabling Telemetry

Knowing that you're running this challenge helps us understand whether Advent of Haystack is helping people learn about Haystack 2.0-Beta. You can opt out at any time by commenting out the following lines.

In [2]:
# from haystack.telemetry import tutorial_running

# tutorial_running("challenge_6")

## Download All Winter Reports

All required files will be downloaded into this Colab notebook. You can see them in the "Files" tab on the left.

In [3]:
!gdown https://drive.google.com/drive/folders/1vNeCG0Vgnri9DvIr_MRURV0S8QNWs08r -O /content --folder

Retrieving folder list
Processing file 1_2qWYxIfDO-_eQLSJZq_RPwlA7MM46W0 winter_report_one.txt
Processing file 1MvI5ntTxHs1nYXRIFRMCba3uJh_ZYsOV winter_report_three.md
Processing file 1WFswkWuwzMgLs4DFEcfiLXuRy_g-TmRd winter_report_two.pdf
Retrieving folder list completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=1_2qWYxIfDO-_eQLSJZq_RPwlA7MM46W0
To: /content/winter_report_one.txt
100% 2.39k/2.39k [00:00<00:00, 13.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1MvI5ntTxHs1nYXRIFRMCba3uJh_ZYsOV
To: /content/winter_report_three.md
100% 2.51k/2.51k [00:00<00:00, 13.2MB/s]
Downloading...
From: https://drive.google.com/uc?id=1WFswkWuwzMgLs4DFEcfiLXuRy_g-TmRd
To: /content/winter_report_two.pdf
100% 61.1k/61.1k [00:00<00:00, 65.2MB/s]
Download completed
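
Before building the pipeline, it can help to confirm that all three reports are where the later cells expect them. A minimal sketch, assuming the files were saved to `/content` as in the log above; `missing_files` is a hypothetical helper name.

```python
from pathlib import Path


def missing_files(directory, filenames):
    """Return the names from `filenames` that are not regular files in `directory`."""
    base = Path(directory)
    return [name for name in filenames if not (base / name).is_file()]


expected = ["winter_report_one.txt", "winter_report_two.pdf", "winter_report_three.md"]
missing = missing_files("/content", expected)
print("Missing files:", missing or "none")
```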


## 1) Create a Pipeline to Index Documents

In [4]:
from haystack.components.writers import DocumentWriter
from haystack.components.converters import MarkdownToDocument, PyPDFToDocument, TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner
from haystack.components.routers import FileTypeRouter, DocumentJoiner
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.pipeline import Pipeline
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

file_type_router = FileTypeRouter(mime_types=["text/plain", "application/pdf"])
######## Initialize the necessary components with relevant parameters #############

text_file_converter = TextFileToDocument()

markdown_converter = MarkdownToDocument()

pdf_converter = PyPDFToDocument()

cleaner = DocumentCleaner(remove_empty_lines=True, remove_extra_whitespaces=True, remove_repeated_substrings=False)

splitter = DocumentSplitter(split_by="passage", split_length=200, split_overlap=50)

joiner = DocumentJoiner(join_mode="merge")

####################################################################################
document_embedder = SentenceTransformersDocumentEmbedder(model_name_or_path="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store)
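
To get a feel for what the two enabled cleaner options do, here is a simplified, standard-library approximation. It is not Haystack's `DocumentCleaner` code; the function `clean_text` and its exact whitespace rules are assumptions made for illustration.

```python
import re


def clean_text(text, remove_empty_lines=True, remove_extra_whitespaces=True):
    """Approximate the two cleaning options on a plain string."""
    lines = text.split("\n")
    if remove_extra_whitespaces:
        # Collapse runs of spaces/tabs inside each line and trim the ends.
        lines = [re.sub(r"[ \t]+", " ", line).strip() for line in lines]
    if remove_empty_lines:
        # Drop lines that contain nothing but whitespace.
        lines = [line for line in lines if line.strip()]
    return "\n".join(lines)


raw = "Winter   report\n\n\n  Snow was   deep.  \n"
cleaned = clean_text(raw)
print(repr(cleaned))
```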

In [6]:
document_store.count_documents()

0

### Add components to the preprocessing pipeline

In [9]:
preprocessing_pipeline = Pipeline()
preprocessing_pipeline.add_component(instance=file_type_router, name="file_type_router")
preprocessing_pipeline.add_component(instance=text_file_converter, name="text_file_converter")
preprocessing_pipeline.add_component(instance=markdown_converter, name="markdown_converter")
preprocessing_pipeline.add_component(instance=pdf_converter, name="pypdf_converter")
######## Add new components to the pipeline #############
preprocessing_pipeline.add_component(instance=cleaner, name="cleaner")
preprocessing_pipeline.add_component(instance=splitter, name="splitter")
preprocessing_pipeline.add_component(instance=joiner, name="joiner")

preprocessing_pipeline.add_component(instance=document_embedder, name="text_embedder")

preprocessing_pipeline.add_component(instance=document_writer, name="writer")
##########################################################

### Connect all components

In [11]:
preprocessing_pipeline.connect("file_type_router.text/plain", "text_file_converter.sources")
preprocessing_pipeline.connect("file_type_router.application/pdf", "pypdf_converter.sources")
preprocessing_pipeline.connect("file_type_router.unclassified", "markdown_converter.sources")
######## Complete this section with the rest of the connections #############

preprocessing_pipeline.connect("text_file_converter", "joiner")
preprocessing_pipeline.connect("pypdf_converter", "joiner")
preprocessing_pipeline.connect("markdown_converter", "joiner")


preprocessing_pipeline.connect("joiner.documents", "cleaner")
preprocessing_pipeline.connect("cleaner", "splitter")



preprocessing_pipeline.connect("splitter", "text_embedder")
preprocessing_pipeline.connect("text_embedder", "writer")

#############################################################################
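
The connections above form a directed graph. As a rough illustration (not how Haystack schedules components internally), the sketch below lists the same edges by component name and derives one valid processing order with the standard library's `graphlib`. The helper `execution_order` is hypothetical.

```python
from graphlib import TopologicalSorter

# (sender, receiver) pairs mirroring the connect() calls above, sockets omitted.
connections = [
    ("file_type_router", "text_file_converter"),
    ("file_type_router", "pypdf_converter"),
    ("file_type_router", "markdown_converter"),
    ("text_file_converter", "joiner"),
    ("pypdf_converter", "joiner"),
    ("markdown_converter", "joiner"),
    ("joiner", "cleaner"),
    ("cleaner", "splitter"),
    ("splitter", "text_embedder"),
    ("text_embedder", "writer"),
]


def execution_order(edges):
    """Return one order in which every sender appears before its receivers."""
    sorter = TopologicalSorter()
    for sender, receiver in edges:
        sorter.add(receiver, sender)  # receiver depends on sender
    return list(sorter.static_order())


order = execution_order(connections)
print(order)
```

The router always comes first and the writer last; the three converters may appear in any order relative to each other because they sit on parallel branches.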

In [12]:
preprocessing_pipeline.draw("preprocessing_pipeline.png")

In [13]:
preprocessing_pipeline.run({
    "file_type_router": {"sources":["/content/winter_report_one.txt",
                                    "/content/winter_report_two.pdf",
                                    "/content/winter_report_three.md"]}
})

Converting markdown files to Documents: 100%|██████████| 1/1 [00:00<00:00, 356.75it/s]


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

{'writer': {'documents_written': 3}}

In [14]:
document_store.count_documents()

3
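
A count of 3 means each report ended up as a single chunk. One plausible explanation is that with `split_by="passage"` and `split_length=200`, a short report has far fewer than 200 passages and therefore fits into one window. The sketch below shows windowing with overlap in plain Python; it is a simplification rather than Haystack's `DocumentSplitter`, and `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(units, split_length, split_overlap):
    """Slide a window of `split_length` units, stepping by length minus overlap."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + split_length])
        if start + split_length >= len(units):
            break  # the window already reached the end of the input
    return chunks


passages = ["p1", "p2", "p3", "p4", "p5"]
small_windows = split_with_overlap(passages, split_length=3, split_overlap=1)
one_window = split_with_overlap(["p1", "p2", "p3"], split_length=200, split_overlap=50)
print(small_windows)
print(one_window)
```

To get more, smaller chunks from these reports, a much smaller `split_length` or a different `split_by` unit would be worth trying.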

In [17]:
document_store.storage

{'8de849214d1d4fafb079fd37a2ced1d0fd7da9a15c18c48fc1ff269e923eb59f': Document(id=8de849214d1d4fafb079fd37a2ced1d0fd7da9a15c18c48fc1ff269e923eb59f, content: 'In the middle of our magical forest, where the tall trees talk to the wind, our Elf community got re...', meta: {'file_path': '/content/winter_report_one.txt', 'source_id': '36281bffc183e06f4b73d0d972c97fc6e06c2731b903b252cc99251a0a8c9bb8'}, embedding: vector of size 384),
 '4f56e71c374e5eda977496f4dfbd85fb305e3b67d44322a440b7fa430770c717': Document(id=4f56e71c374e5eda977496f4dfbd85fb305e3b67d44322a440b7fa430770c717, content: 'In the heart of our enchanted forest, where the ancient trees share secrets with the breeze, our Elf...', meta: {'file_path': '/content/winter_report_two.pdf', 'source_id': '5efc89eeb10ab87b9a5f0ffcbc822677b1ecfca944fd01b563c9a77f59abfaca'}, embedding: vector of size 384),
 'e5a48899b82f83dae5370c9d403f2cfd2555606842f66ea8fd75be0d4b0e9f27': Document(id=e5a48899b82f83dae5370c9d403f2cfd2555606842f66ea8fd75be0d4

## 2) Test Your System

Run this code and you'll be prompted to enter your OpenAI API key. If you don't have a key, [follow these instructions](https://help.openai.com/en/articles/4936850-where-do-i-find-my-api-key).
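
If you rerun the notebook often, you may prefer to read the key from an environment variable and only fall back to the prompt when it is not set. A minimal sketch; the helper `resolve_api_key` is hypothetical, and using `OPENAI_API_KEY` as the variable name is an assumption about your setup.

```python
import os
from getpass import getpass


def resolve_api_key(env_var="OPENAI_API_KEY", environ=None, prompt=getpass):
    """Return the key from the environment if present, otherwise ask for it."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_var)
    if key:
        return key
    return prompt("OpenAI API Key: ")


# Usage in the notebook (prompts only when the variable is missing):
# api_key = resolve_api_key()
```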

In [18]:
from getpass import getpass

api_key = getpass("OpenAI API Key: ")

OpenAI API Key: ··········


In [19]:
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import GPTGenerator

template = """
You are a wise elf living in the forest with other elves.
You will be provided with some context from Elves' yearly winter reports.
Answer the questions from other elves based on the given context as if you are an elf as well.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""
pipe = Pipeline()
pipe.add_component("embedder", SentenceTransformersTextEmbedder(model_name_or_path="sentence-transformers/all-MiniLM-L6-v2"))
pipe.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component("llm", GPTGenerator(api_key=api_key))
pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")
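
The `embedder` to `retriever` connection is the core of the retrieval step: the query is turned into a vector, and the stored documents whose vectors are most similar are returned. Here is a toy sketch of that idea in plain Python. The three-dimensional vectors are made up for illustration, and cosine similarity is an assumption; the document store may be configured with a different scoring function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, documents, top_k=2):
    """Return the top_k (score, content) pairs, best match first."""
    scored = [(cosine_similarity(query_embedding, emb), content) for content, emb in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy (content, embedding) pairs; real embeddings here have 384 dimensions.
toy_docs = [
    ("snow storage", [1.0, 0.0, 0.0]),
    ("berry harvest", [0.0, 1.0, 0.0]),
    ("snowball fight", [0.7, 0.0, 0.7]),
]
results = retrieve([0.9, 0.1, 0.0], toy_docs, top_k=2)
print([content for _, content in results])
```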

In [20]:
query = "What should we do against water scarcity?"
# query = "Give me one example of nice moment they we had in past winters"
# query = "Which foods should we collect?"

pipe.run({
    "embedder": {"text": query},
    "prompt_builder": {
        "question": query
    }
})

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

{'llm': {'replies': ['In response to the water scarcity, we Elves have taken proactive measures to ensure a stable water supply. We gathered snow during the winter and stored it for future use as drinking water. This allowed us to preserve our water resources and ensure that we have enough water during times of uncertainty. Additionally, it would be beneficial for us to explore other water conservation methods, such as collecting rainwater or finding natural springs within the forest. By being mindful of our water usage and finding sustainable solutions, we can help alleviate the water scarcity issue and ensure that we have enough water for all of us in the enchanted forest.'],
  'metadata': [{'model': 'gpt-3.5-turbo-0613',
    'index': 0,
    'finish_reason': 'stop',
    'usage': {'prompt_tokens': 1589,
     'completion_tokens': 120,
     'total_tokens': 1709}}]}}
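
The result is a nested dictionary, so a small helper makes it easier to read the answer and keep an eye on token usage. This is a minimal sketch based on the output shape shown above; `summarize_result` is a hypothetical helper, and the sample reply text is a placeholder.

```python
def summarize_result(result):
    """Pull the first reply and its total token count out of the pipeline output."""
    llm_output = result["llm"]
    usage = llm_output["metadata"][0]["usage"]
    return {"reply": llm_output["replies"][0], "total_tokens": usage["total_tokens"]}


# Same structure as the output above, with the reply text replaced by a placeholder.
sample_result = {
    "llm": {
        "replies": ["(reply text)"],
        "metadata": [
            {
                "model": "gpt-3.5-turbo-0613",
                "index": 0,
                "finish_reason": "stop",
                "usage": {"prompt_tokens": 1589, "completion_tokens": 120, "total_tokens": 1709},
            }
        ],
    }
}
summary = summarize_result(sample_result)
print(summary)
```

In the notebook, you would pass the return value of `pipe.run(...)` instead of `sample_result`.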

In [21]:
query = "Give me one example of nice moment they we had in past winters"


pipe.run({
    "embedder": {"text": query},
    "prompt_builder": {
        "question": query
    }
})

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

{'llm': {'replies': ['One delightful memory from past winters was a spirited snowball fight that the Elves engaged in. Laughter filled the forest as we dodged and tossed snowballs, enjoying the camaraderie and the simple pleasure of a carefree day. It was a moment of unity and shared joy that brought the community closer together.'],
  'metadata': [{'model': 'gpt-3.5-turbo-0613',
    'index': 0,
    'finish_reason': 'stop',
    'usage': {'prompt_tokens': 1595,
     'completion_tokens': 63,
     'total_tokens': 1658}}]}}

In [22]:
query = "Which foods should we collect?"

pipe.run({
    "embedder": {"text": query},
    "prompt_builder": {
        "question": query
    }
})

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

{'llm': {'replies': ["We should collect a variety of foods to ensure a balanced and plentiful supply for the winter. Some of the foods we can gather include winter berries, wild nuts, and mushrooms from the ground. We are also skilled at growing magical fruit trees, whose fruits not only provide us with nourishment but also a touch of enchantment. So, let's make sure to gather these fruits as well. Additionally, it would be wise to collect and dry herbs for both culinary and medicinal purposes. These herbs will not only add flavor to our winter meals but also provide us with health benefits. Remember, it is important to be fair and make sure that every elf, regardless of age, gets their fair share of collected food. We should also save extra food, just in case winter becomes particularly tough."],
  'metadata': [{'model': 'gpt-3.5-turbo-0613',
    'index': 0,
    'finish_reason': 'stop',
    'usage': {'prompt_tokens': 1587,
     'completion_tokens': 157,
     'total_tokens': 1744}}]}}

In [None]:
query = "What should we do against water scarcity?"
# query = "Give me one example of nice moment they we had in past winters"
# query = "Which foods should we collect?"

pipe.run({
    "embedder": {"text": query},
    "prompt_builder": {
        "question": query
    }
})

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

{'llm': {'replies': ['To combat water scarcity, we should take proactive measures to ensure a stable water supply. One approach we can take is to gather snow during the winter and store it for future use as drinking water. This way, we can preserve our water resources and have a reliable source of water in times of uncertainty. It is important for each elf to contribute to this effort and be mindful of conserving water throughout the year.'],
  'metadata': [{'model': 'gpt-3.5-turbo-0613',
    'index': 0,
    'finish_reason': 'stop',
    'usage': {'prompt_tokens': 1702,
     'completion_tokens': 82,
     'total_tokens': 1784}}]}}