<a href="https://colab.research.google.com/github/Ashish-Soni08/Playground/blob/main/haystack/Advent_of_Haystack_Prompt_Engineering_Challenge(Ashish_Soni).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advent of Haystack - Day 3
_Make a copy of this Colab to start!_

Here, you are given a nearly complete RAG pipeline that does question answering (QA) over the contents of a number of URLs. The aim is to create a [`PromptBuilder`](https://docs.haystack.deepset.ai/v2.0/docs/promptbuilder) whose template produces answers that cite the URL each answer comes from.

1. **Run the indexing pipeline:** This is already complete. Here, we write the contents of various Haystack documentation pages into an `InMemoryDocumentStore`. We also create embeddings for our documents with a `SentenceTransformersDocumentEmbedder`.
2. **Your task is to complete step 2 👇**

# Installation
**Note:** There is a known issue in Colab caused by a version conflict with `llmx`, which comes preinstalled there. If you get an `llmx` error, you can safely ignore it or run `pip uninstall -y llmx`.

In [1]:
%%capture

!pip install haystack-ai
!pip install boilerpy3
!pip install transformers accelerate bitsandbytes sentence_transformers

## 1) Write Documents to InMemoryDocumentStore

Here, we write the contents of a few URLs into an `InMemoryDocumentStore`.

In [2]:
from haystack import Pipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter


document_store = InMemoryDocumentStore()

link_fetcher = LinkContentFetcher()
converter = HTMLToDocument()
splitter = DocumentSplitter(split_length=100, split_overlap=5)
embedder = SentenceTransformersDocumentEmbedder()
writer = DocumentWriter(document_store=document_store)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("link_fetcher", link_fetcher)
indexing_pipeline.add_component("converter", converter)
indexing_pipeline.add_component("splitter", splitter)
indexing_pipeline.add_component("embedder", embedder)
indexing_pipeline.add_component("writer", writer)

indexing_pipeline.connect("link_fetcher", "converter")
indexing_pipeline.connect("converter", "splitter")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")
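
The `DocumentSplitter(split_length=100, split_overlap=5)` above cuts each page into overlapping chunks before they are embedded. As a rough illustration only, the windowing could look like the sketch below. It assumes word-based splitting, it is not Haystack's actual implementation, and `split_words` is a hypothetical helper.

```python
def split_words(text, split_length=100, split_overlap=5):
    """Split text into word windows that share `split_overlap` words (illustrative sketch)."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this window already reached the end of the text
    return chunks


# 250 dummy words, named w0 ... w249 so the overlap is easy to see.
sample = " ".join(f"w{i}" for i in range(250))
chunks = split_words(sample)
print(len(chunks))
print(chunks[0].split()[-5:])  # last words of the first chunk
print(chunks[1].split()[:5])   # first words of the second chunk
```

The last five words of one chunk reappear as the first five of the next, so a sentence that straddles a boundary is less likely to be lost to retrieval.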

In [3]:
indexing_pipeline.run(data={"link_fetcher":{"urls": ["https://docs.haystack.deepset.ai/v2.0/docs/sentencetransformerstextembedder", "https://docs.haystack.deepset.ai/v2.0/docs/openaidocumentembedder"]}})


{'writer': {'documents_written': 7}}

## 2) Build a RAG Pipeline
Here, we have provided a nearly complete RAG pipeline, but the `PromptBuilder` is missing. Create one and add it to the pipeline. Make sure your `PromptBuilder` is able to use the `url` from the documents' metadata. That way, you can ask for a response that includes references!


In [4]:
from getpass import getpass

api_key = getpass("Enter OpenAI Api key: ")

Enter OpenAI Api key: ··········


In [5]:
from pprint import pprint

print("*" * 70)

print(f"The count of Documents is {document_store.count_documents()} and are stored as type -> {type(document_store.storage)}")

print("*" * 70)

pprint(f"The following documents are stored in the InMemory Document Store: \n {document_store.storage}")

print("*" * 70)

**********************************************************************
The count of Documents is 7 and are stored as type -> <class 'dict'>
**********************************************************************
('The following documents are stored in the InMemory Document Store: \n'
 " {'0da0842b0156b3c3f55b4f1144fcec9413be1838cceaa3708843e9eb9f26951a': "
 'Document(id=0da0842b0156b3c3f55b4f1144fcec9413be1838cceaa3708843e9eb9f26951a, '
 "content: 'Enabling GPU Acceleration\n"
 'SentenceTransformersTextEmbedder\n'
 "SentenceTransformersTextEmbedder transfor...', meta: {'content_type': "
 "'text/html', 'url': "
 "'https://docs.haystack.deepset.ai/v2.0/docs/sentencetransformerstextembedder', "
 "'source_id': "
 "'5f6b19c3f8b8faccc7033b2fb49ea55f7dba4214ad63281c99defac513947f8b'}, "
 'embedding: vector of size 768), '
 "'667bbf1de076e41d2b30dd44ccedf0f0512d8effce907b9ebb57e616430e9956': "
 'Document(id=667bbf1de076e41d2b30dd44ccedf0f0512d8effce907b9ebb57e616430e9956, '
 "content: 'the comp

In [6]:
import torch

from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import GPTGenerator

######## Complete this section #############
prompt_template = """ According to these documents,
answer the given question in a structured and comprehensive way.

{% for doc in documents %}
  {{ doc.content }} URL:{{ doc.meta['url'] }}
{% endfor %}


If the answer is contained in the documents, also report the source URL.
If the answer cannot be deduced from the documents, do not give an answer.

Question: {{question}}
Answer:
"""
prompt_builder = PromptBuilder(prompt_template)
############################################

query_embedder = SentenceTransformersTextEmbedder()
retriever = InMemoryEmbeddingRetriever(document_store=document_store, top_k=2)
gpt_llm = GPTGenerator(api_key=api_key)
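
To get a feel for what the language model actually receives, the sketch below imitates the template's `for` loop in plain Python. It is only an approximation: `PromptBuilder` renders the template with Jinja2, `render_prompt` is a hypothetical helper, and the two documents are made-up stand-ins whose URLs are taken from the indexing step.

```python
def render_prompt(documents, question):
    """Imitate the template: one line per document, content followed by its URL."""
    context = "\n".join(
        f"  {doc['content']} URL:{doc['meta']['url']}" for doc in documents
    )
    return (
        "According to these documents,\n"
        "answer the given question in a structured and comprehensive way.\n\n"
        f"{context}\n\n"
        "If the answer is contained in the documents, also report the source URL.\n"
        "If the answer cannot be deduced from the documents, do not give an answer.\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Stand-in documents: plain dicts, not Haystack Document objects.
docs = [
    {"content": "Placeholder text about the text embedder.",
     "meta": {"url": "https://docs.haystack.deepset.ai/v2.0/docs/sentencetransformerstextembedder"}},
    {"content": "Placeholder text about the document embedder.",
     "meta": {"url": "https://docs.haystack.deepset.ai/v2.0/docs/openaidocumentembedder"}},
]
prompt = render_prompt(docs, "How do I use the openai embedder?")
print(prompt)
```

Because each chunk is followed directly by its URL, the model can pair a passage with its source when it writes the answer.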

In [7]:
# Creating the Pipeline
gpt_pipeline = Pipeline()

gpt_pipeline.add_component(name="prompt_builder", instance=prompt_builder)
gpt_pipeline.add_component(name="query_embedder", instance=query_embedder)
gpt_pipeline.add_component(name="retriever", instance=retriever)
gpt_pipeline.add_component(name="llm", instance=gpt_llm)

In [8]:
# Connect the components in the Pipeline

gpt_pipeline.connect("query_embedder.embedding", "retriever.query_embedding")
gpt_pipeline.connect("retriever.documents", "prompt_builder.documents")
gpt_pipeline.connect("prompt_builder", "llm")
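
A pipeline is a directed graph, so a forgotten `connect` call leaves a component without input from the rest of the graph. The following is a hypothetical plain-Python check that lists the components nothing feeds into. It does not use Haystack, and `unfed_components` and the edge list are illustrative names that mirror the cells above.

```python
def unfed_components(components, connections):
    """Return the components that receive no connection from another component."""
    receivers = {receiver.split(".")[0] for _, receiver in connections}
    return [name for name in components if name not in receivers]


components = ["prompt_builder", "query_embedder", "retriever", "llm"]
connections = [
    ("query_embedder.embedding", "retriever.query_embedding"),
    ("retriever.documents", "prompt_builder.documents"),
    ("prompt_builder", "llm"),
]

# Only the entry point should be listed; it gets its input from run(data=...).
print(unfed_components(components, connections))
```

If a connection were left out, the component it should feed would show up in the list as well.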

In [9]:
question = "How do I enable GPU acceleration?"

result = gpt_pipeline.run(data={"query_embedder": {"text": question}, "prompt_builder": {"question": question}})

print(result['llm']['replies'][0])


The documents do not contain information on how to enable GPU acceleration.
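
Whether the model can answer depends on which chunks the retriever hands over: with `top_k=2`, only the two highest-scoring chunks reach the prompt. Below is a minimal sketch of that ranking step. It assumes cosine similarity and toy 3-dimensional vectors; the real embeddings have 768 dimensions, the document store's scoring function may differ, and `cosine` and `top_k` are hypothetical helpers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_embedding, chunks, k=2):
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(query_embedding, chunk["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Toy chunks with made-up embeddings.
chunks = [
    {"content": "chunk A", "embedding": [1.0, 0.0, 0.0]},
    {"content": "chunk B", "embedding": [0.9, 0.1, 0.0]},
    {"content": "chunk C", "embedding": [0.0, 1.0, 0.0]},
    {"content": "chunk D", "embedding": [0.0, 0.0, 1.0]},
]
best = top_k([1.0, 0.2, 0.0], chunks, k=2)
print([chunk["content"] for chunk in best])
```

A chunk that ranks third or lower never reaches the prompt, which is one possible reason for a reply like the one above.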


In [10]:
question = "How do I use the openai embedder?"

result = gpt_pipeline.run(data={"query_embedder": {"text": question}, "prompt_builder": {"question": question}})

print(result['llm']['replies'][0])


To use the OpenAIDocumentEmbedder, you can follow these steps:

1. Import the necessary modules:
```python
from haystack import Document
from haystack.components.embedders import OpenAIDocumentEmbedder
```

2. Create a Document object with the text and metadata you want to embed:
```python
doc = Document(text="some text",
               metadata={"title": "relevant title", "page number": 18})
```
Note: You can include any metadata that is distinctive and semantically meaningful to improve retrieval.

3. Instantiate the OpenAIDocumentEmbedder with the desired configuration:
```python
embedder = OpenAIDocumentEmbedder(metadata_fields_to_embed=["title"])
```
Note: In this example, we are specifying to embed only the "title" metadata field. You can include multiple metadata fields if needed.

4. Embed the document using the embedder:
```python
docs_w_embeddings = embedder.run(documents=[doc])["documents"]
```
Note: The `run()` method takes a list of documents as input and returns a diction

In [11]:
gpt_pipeline.draw("/content/gpt_pipeline_day_3.png")

Haystack is model-agnostic, which means you can easily switch between model providers. For example, instead of using an OpenAI model via an API, you can try an open-source model running in this Colab notebook by replacing the `llm` with the one below. This might take up more resources in Colab. You might also notice that models don't perform the same way, which can mean you need to change your prompt. It's OK to change the task from referenced QA to something else. For example, we're also happy with a poem about the Haystack docs 🤗
```python
from haystack.components.generators import HuggingFaceLocalGenerator
llm = HuggingFaceLocalGenerator("HuggingFaceH4/zephyr-7b-beta",
                                 huggingface_pipeline_kwargs={"device_map":"auto",
                                               "model_kwargs":{"load_in_4bit":True,
                                                "bnb_4bit_use_double_quant":True,
                                                "bnb_4bit_quant_type":"nf4",
                                                "bnb_4bit_compute_dtype":torch.bfloat16}},
                                 generation_kwargs={"max_new_tokens": 350})
llm.warm_up()
```

In [12]:
from haystack.components.generators import HuggingFaceLocalGenerator
llm = HuggingFaceLocalGenerator("HuggingFaceH4/zephyr-7b-beta",
                                 huggingface_pipeline_kwargs={"device_map":"auto",
                                               "model_kwargs":{"load_in_4bit":True,
                                                "bnb_4bit_use_double_quant":True,
                                                "bnb_4bit_quant_type":"nf4",
                                                "bnb_4bit_compute_dtype":torch.bfloat16}},
                                 generation_kwargs={"max_new_tokens": 350})
llm.warm_up()


In [13]:
zep_prompt_template = """<|system|>Using the information contained in the context,
give a comprehensive answer to the question.
If the answer is contained in the context, also report the source URL.
If the answer cannot be deduced from the context, do not give an answer.</s>
<|user|>
Context:
  {% for doc in documents %}
  {{ doc.content }} URL:{{ doc.meta['url'] }}
  {% endfor %}
  Question: {{query}}
  </s>
<|assistant|>
"""
prompt_builder = PromptBuilder(zep_prompt_template)
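
The template above follows a chat layout with a system turn, a user turn, and an open assistant turn, where `</s>` closes the first two. Here is a hedged sketch of that layout as a helper. `zephyr_prompt` is a hypothetical name, and the exact whitespace is an assumption based on the cell above rather than on the model's documentation.

```python
def zephyr_prompt(system, user):
    """Wrap a system and a user message in the chat markers used in the template above."""
    return (
        f"<|system|>\n{system}</s>\n"
        f"<|user|>\n{user}</s>\n"
        "<|assistant|>\n"  # left open so the model writes the answer here
    )


prompt = zephyr_prompt(
    "Using the information contained in the context, answer the question.",
    "Context: (retrieved chunks go here)\nQuestion: How do I use the openai embedder?",
)
print(prompt)
```

Keeping the markers in one place makes it easier to adapt the prompt when you swap in a model that expects a different chat format.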

In [17]:
zep_pipeline = Pipeline()
zep_pipeline.add_component(instance=query_embedder, name="query_embedder")
zep_pipeline.add_component(instance=retriever, name="retriever")
zep_pipeline.add_component(instance=prompt_builder, name="prompt_builder")
zep_pipeline.add_component(instance=llm, name="llm")

# The retriever needs the query embedding, so this connection has to be active.
zep_pipeline.connect("query_embedder.embedding", "retriever.query_embedding")
zep_pipeline.connect("retriever.documents", "prompt_builder.documents")
zep_pipeline.connect("prompt_builder", "llm")

In [18]:
query = "How do I use the openai embedder?"
result = zep_pipeline.run(data={"query_embedder": {"text": query}, "prompt_builder": {"query": query}})
print(result['llm']['replies'][0])

To use the OpenAI Embedder, you first need to sign up for an OpenAI account and obtain an API key. Then, you can make requests to the OpenAI API using the `openai` Python library or other programming languages with their respective OpenAI client libraries.

Here's an example of how to use the `openai` Python library to embed a text using the OpenAI Embedder:

```python
from openai import Embedding

embedder = Embedding(model="text-embedding-ada-002")

text = "The quick brown fox jumps over the lazy dog."
embedding = embedder.embed(text)

print(embedding)
```

In this example, we're using the `text-embedding-ada-002` model, which is a pre-trained language model that can be used for text embedding. The `embed` method takes the text as an argument and returns a list of floating-point numbers, which represent the text's embedding in a high-dimensional space.

You can find more information about the OpenAI Embedder, including documentation and examples, on the OpenAI website: https://platf

In [19]:
zep_pipeline.draw("/content/zep_pipeline_day_3.png")