<a href="https://colab.research.google.com/github/TuanaCelik/anthropic-hackathon/blob/main/Workshop_Antrhopic_Hakathon_Haystacl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building Customized Retrieval-Augmented Pipelines with Haystack and Claude

![](https://raw.githubusercontent.com/TuanaCelik/anthropic-hackathon/main/webretriever_promptnode.png)
![](https://raw.githubusercontent.com/TuanaCelik/anthropic-hackathon/main/hackernews_promptnode.png)

In [None]:
!pip install farm-haystack[inference]

## 1. Using Claude with Haystack

Haystack has 2 main components that define how it interacts with LLMs:
- The `PromptTemplate`: Describes how you want to interact with an LLM.
- The `PromptNode`: The component that prompts the defined LLM. Here, we'll be using "claude-2".

![PromptNode with Anthropic](https://raw.githubusercontent.com/TuanaCelik/anthropic-hackathon/main/anthropic_prompt_node.png)
#### PromptTemplate
You have 2 options:
1. Define your own prompt template with the desired text and inputs
2. Use one of the predefined ones from [PromptHub](https://prompthub.deepset.ai/). For example:
```python
prompt_node = PromptNode(
    model_name_or_path = "claude-2",
    default_prompt_template="deepset/question-answering",
    api_key=anthropic_key,
    max_length=768,
    model_kwargs={"stream": True},
)
```
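
To make the idea of a template concrete, here is a minimal stand-alone sketch in plain Python. It is not Haystack code: `fill_template` and `template_text` are hypothetical names, and the real `PromptTemplate` supports more than simple placeholder substitution. It only shows the core idea of filling a `{query}` placeholder before the text is sent to the model.

```python
# Stand-alone illustration: NOT Haystack code, just the idea behind a prompt template.
def fill_template(template: str, **variables: str) -> str:
    """Replace each {name} placeholder in the template with the matching value."""
    return template.format(**variables)


# Hypothetical template text, mirroring the style used in this notebook
template_text = "Answer the following question.\nQuestion: {query}\nAnswer:"

prompt = fill_template(template_text, query="What is the capital of the UK?")
print(prompt)
```

The filled-in string is what the LLM would receive as its prompt.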

In [None]:
from getpass import getpass

anthropic_key = getpass("Enter Anthropic API key:")

In [None]:
from haystack.nodes import PromptTemplate, PromptNode

prompt_text = """
Answer the following question.
Question: {query}
Answer:
"""

prompt_template = PromptTemplate(prompt=prompt_text)

prompt_node = PromptNode(
    model_name_or_path = "claude-2",
    default_prompt_template=prompt_template,
    api_key=anthropic_key,
    max_length=768,
    model_kwargs={"stream": True},
)

In [None]:
prompt_node.run("What is the capital of the UK?")

## Building a RAG Pipeline

Here, we're building a simple retrieval-augmented generative pipeline that uses the web as its source of knowledge. You can also set the 'source' to be a document store, another API, or a custom-built data fetcher.

What does this pipeline need?

- A [`WebRetriever`](https://docs.haystack.deepset.ai/reference/retriever-api#webretriever): This is a tool designed to extract relevant documents from the web. Depending on the operation mode, the retrieved text can be further broken down into smaller documents with the help of a `PreProcessor`. Here, we will be using Serper Dev, and you can use the following API key: `394722eca5375ac54854c62cef993d9f2768a0e3`
- (Optionally) A ranker like [`DiversityRanker`](https://docs.haystack.deepset.ai/reference/ranker-api#diversityranker): This ranker reorders the documents so that the top of the list covers as diverse a set of content as possible.
- A [`PromptNode`](https://docs.haystack.deepset.ai/docs/prompt_node): Which uses a `PromptTemplate` of our choice and prompts "claude-2"
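
To give a feel for what ranking for diversity means, here is a toy sketch of a greedy ordering: start with the document closest to the query, then keep adding the document that is least similar, on average, to those already chosen. This is an illustration under simplifying assumptions, not the `DiversityRanker` implementation: it uses word overlap (Jaccard similarity) instead of embeddings, and `tokens`, `jaccard` and `diversity_order` are hypothetical helper names.

```python
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lower-cased word set; a crude stand-in for an embedding."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-overlap similarity between 0.0 (disjoint) and 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def diversity_order(query: str, docs: List[str]) -> List[str]:
    """Start with the doc closest to the query, then repeatedly add the doc
    that is least similar, on average, to the docs already selected."""
    if not docs:
        return []
    q = tokens(query)
    remaining = list(docs)
    first = max(remaining, key=lambda d: jaccard(q, tokens(d)))
    selected = [first]
    remaining.remove(first)
    while remaining:
        def avg_similarity(doc: str) -> float:
            return sum(jaccard(tokens(doc), tokens(s)) for s in selected) / len(selected)

        nxt = min(remaining, key=avg_similarity)
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


docs = [
    "climate change raises sea levels",
    "climate change raises global sea levels",
    "elections are shaped by energy policy",
]
ordered = diversity_order("climate change sea levels", docs)
print(ordered)
```

In this toy run the near-duplicate sentence about sea levels is pushed to the end, so the first two results cover two different topics.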

`WebRetriever`->`PromptNode` | `WebRetriever`->`Ranker`->`PromptNode`
:-------------------------:|:-------------------------:
![](https://raw.githubusercontent.com/TuanaCelik/anthropic-hackathon/main/webretriever_promptnode.png) | ![](https://raw.githubusercontent.com/TuanaCelik/anthropic-hackathon/main/webretriever_ranker_promptnode.png)

In [None]:
from getpass import getpass

search_key = getpass("Enter Serperdev API key:")

In [None]:
from haystack.nodes import WebRetriever, PromptNode, PromptTemplate

web_retriever = WebRetriever(api_key=search_key, top_search_results=10, mode="preprocessed_documents", top_k=20)

In [None]:
from haystack.nodes.ranker import DiversityRanker

diversity_ranker = DiversityRanker()

In [None]:
prompt_text = """
Using the provided paragraphs and question, craft a comprehensive answer in full sentences.\n
Don't use bullet points or lists.\n
Paragraphs: {join(documents)} \n\nQuestion: {query} \n\nAnswer:
"""

prompt_node = PromptNode(
    model_name_or_path = "claude-2",
    default_prompt_template=PromptTemplate(prompt_text),
    api_key=anthropic_key,
    max_length=768,
    model_kwargs={"stream": True},
)

In [None]:
from haystack import Pipeline
pipeline = Pipeline()
pipeline.add_node(component=web_retriever, name="retriever", inputs=["Query"])
pipeline.add_node(component=diversity_ranker, name="ranker", inputs=["retriever"])
pipeline.add_node(component=prompt_node, name="prompter", inputs=["ranker"])

In [None]:
pipeline.run("What are the effects of climate change on the environment, politics and more?")

## Building Your Own Custom Components with Haystack

One core value of Haystack is the custom components API. This allows you to build your own nodes that you can then slot into a pipeline. The full guide on how to do this is [here](https://docs.haystack.deepset.ai/docs/custom_nodes):

```python
from typing import List, Optional

from haystack.nodes.base import BaseComponent

class NodeTemplate(BaseComponent):
    # If it's not a decision component, there is only one outgoing edge
    outgoing_edges = 1

    def run(self, query: str, my_arg: Optional[int] = 10):
        # Insert code here to manipulate the input and produce an output dictionary
        ...
        output={
            "documents": ...,
        }
        return output, "output_1"

    def run_batch(self, queries: List[str], my_arg: Optional[int] = 10):
        # Insert code here to manipulate the input and produce an output dictionary
        ...
        output={
            "documents": ...,
        }
        return output, "output_1"
```
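
The key part of the template above is the return value of `run`: an output dictionary plus the name of the outgoing edge. The sketch below imitates that contract in plain Python so you can see how one node's output can feed the next. It is a simplified stand-in and not how Haystack's `Pipeline` is implemented; `UppercaseNode`, `WordSplitNode` and `run_chain` are hypothetical names, and the classes deliberately do not subclass `BaseComponent` so the example runs without Haystack installed.

```python
from typing import Any, Dict, List, Tuple


class UppercaseNode:
    """Hypothetical node: follows the run() contract but does not subclass BaseComponent."""

    def run(self, query: str) -> Tuple[Dict[str, Any], str]:
        return {"query": query.upper()}, "output_1"


class WordSplitNode:
    """Hypothetical node: turns the query into a list of one-word 'documents'."""

    def run(self, query: str) -> Tuple[Dict[str, Any], str]:
        return {"documents": query.split()}, "output_1"


def run_chain(nodes: List[Any], **inputs: Any) -> Dict[str, Any]:
    """Toy runner: merge each node's output dict into the running state."""
    state: Dict[str, Any] = dict(inputs)
    for node in nodes:
        output, _edge = node.run(query=state["query"])
        state.update(output)
    return state


result = run_chain([UppercaseNode(), WordSplitNode()], query="hello haystack")
print(result)
```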

#### Building a 'Hacker News Fetcher'

Below we use the template above to build a fetcher that will fetch the latest posts from Hacker News and create Haystack `Document` types. We can then add this node into a RAG pipeline to act as the data source.


In [None]:
!pip install newspaper3k

In [None]:
import requests
from haystack.nodes import BaseComponent
from haystack.schema import Document
from typing import Optional
from newspaper import Article

class HackernewsNewestFetcher(BaseComponent):
    outgoing_edges = 1

    def __init__(self, last_k: Optional[int] = 5):
        self.last_k = last_k

    def run(self, last_k: Optional[int] = None):
        if last_k is None:
            last_k = self.last_k

        # Fetch the IDs of the newest stories
        newest_list = requests.get(url='https://hacker-news.firebaseio.com/v0/newstories.json?print=pretty')
        articles = []

        for story_id in newest_list.json()[0:last_k]:
            item = requests.get(url=f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json?print=pretty").json()
            # Skip empty items and posts without an external URL
            if item and 'url' in item:
                articles.append(item['url'])

        docs = []
        for url in articles:
            try:
                article = Article(url)
                article.download()
                article.parse()
                docs.append(Document(content=article.text, meta={'title': article.title, 'url': url}))
            except Exception:
                print(f"Couldn't download {url}, skipped")

        output = {"documents": docs}
        return output, "output_1"

    def run_batch(self):
        pass

In [None]:
fetcher = HackernewsNewestFetcher()

In [None]:
from haystack.nodes import PromptNode, PromptTemplate, AnswerParser

prompt_text = """
You will be provided a few of the latest posts on Hacker News, followed by their URL.
For each post, provide a brief summary followed by the URL the full post can be found in.

Posts:{join(documents, delimiter=new_line, pattern='---'+new_line+'$content'+new_line+'URL: $url', str_replace={new_line: ' ', '[': '(', ']': ')'})}
"""

prompt_template = PromptTemplate(
    prompt=prompt_text,
    output_parser=AnswerParser(),
)

prompt_node = PromptNode(
    model_name_or_path = "claude-2",
    default_prompt_template=prompt_template,
    api_key=anthropic_key,
    max_length=768,
    model_kwargs={"stream": True},
)
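
To see roughly what the `join(...)` expression in the prompt above produces, here is a stand-alone approximation using `string.Template`. It is a sketch, not Haystack's implementation: the documents are plain dictionaries standing in for `Document` objects, `render_documents` is a hypothetical helper, and the example URLs are placeholders.

```python
from string import Template
from typing import Dict, List

# Hypothetical stand-ins for Haystack Documents: plain dicts with content and a URL
docs: List[Dict[str, str]] = [
    {"content": "First post.\nIt has [brackets].", "url": "https://example.com/a"},
    {"content": "Second post.", "url": "https://example.com/b"},
]


def render_documents(documents: List[Dict[str, str]]) -> str:
    """Approximate the join(...) call in the prompt above with string.Template."""
    pattern = Template("---\n$content\nURL: $url")
    replacements = {"\n": " ", "[": "(", "]": ")"}
    rendered = []
    for doc in documents:
        content = doc["content"]
        # Clean the content first so it cannot break the one-block-per-post layout
        for old, new in replacements.items():
            content = content.replace(old, new)
        rendered.append(pattern.substitute(content=content, url=doc["url"]))
    return "\n".join(rendered)


rendered_posts = render_documents(docs)
print(rendered_posts)
```

Each post ends up as its own `---` block with the content flattened onto one line and the URL underneath, which is the layout the prompt asks the model to summarize.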


![](https://raw.githubusercontent.com/TuanaCelik/anthropic-hackathon/main/hackernews_promptnode.png)

In [None]:
from haystack.pipelines import Pipeline

pipe = Pipeline()
pipe.add_node(component=fetcher, name="fetcher", inputs=["Query"])
pipe.add_node(component=prompt_node, name="prompt_node", inputs=["fetcher"])

In [None]:
results = pipe.run(params={"fetcher":{"last_k":2}}, debug=True)

## Some Examples by the Community

Some custom nodes by the community have been packaged and made available on the [Haystack Integrations](https://haystack.deepset.ai/integrations) page. Some useful ones 👇

- [Notion Extractor](https://haystack.deepset.ai/integrations/notion-extractor)
- [ReadMe Docs Fetcher](https://haystack.deepset.ai/integrations/readmedocs-fetcher)
- [Mastodon Fetcher](https://haystack.deepset.ai/integrations/mastodon-fetcher)

[🐤 Should I follow? (demo for inspiration)](https://huggingface.co/spaces/deepset/should-i-follow)

# Indexing Documents into a Document Store

Indexing pipelines are used to prepare, preprocess, split and store your data in a `DocumentStore`.

You can see the available Document Stores for Haystack [here](https://haystack.deepset.ai/integrations?type=Document+Store).

Indexing Pipelines:
1. Convert your data from a given filetype to a Haystack `Document` with one of the [Converters](https://docs.haystack.deepset.ai/docs/file_converters)
2. Preprocess your documents into smaller chunks with overlap by creating a [PreProcessor](https://docs.haystack.deepset.ai/docs/preprocessor)
3. (Optionally) Use the [Retriever](https://docs.haystack.deepset.ai/docs/retriever) you intend to use in your RAG pipelines to also create and store embeddings of your documents in your `DocumentStore`.
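
Step 2 is the easiest one to picture with a toy example. The sketch below splits text into word chunks that overlap, so a sentence cut at a chunk boundary still appears whole in one of the chunks. It is a simplified illustration, not the `PreProcessor` itself: `split_with_overlap` is a hypothetical function, and its parameter names merely echo the `PreProcessor`'s `split_length` and `split_overlap` settings.

```python
from typing import List


def split_with_overlap(text: str, split_length: int = 5, split_overlap: int = 2) -> List[str]:
    """Split text into chunks of `split_length` words; consecutive chunks share `split_overlap` words."""
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = split_with_overlap("one two three four five six seven eight", split_length=5, split_overlap=2)
print(chunks)
```

Here the words "four five" appear in both chunks, which is the overlap at work.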

![](https://raw.githubusercontent.com/TuanaCelik/anthropic-hackathon/main/indexing.png)

Below, we use a `WeaviateDocumentStore` for demonstration purposes, using their "Embedded Weaviate" functionality, also accessible with Haystack.

Otherwise, the simplest `DocumentStore` to get started with is the `InMemoryDocumentStore` which requires no setup.

In [None]:
!pip install farm-haystack[weaviate,inference,file-conversion,preprocessing]

In [None]:
import weaviate
from weaviate.embedded import EmbeddedOptions
from haystack.document_stores import WeaviateDocumentStore

client = weaviate.Client(
  embedded_options=EmbeddedOptions()
)

In [None]:
document_store = WeaviateDocumentStore(use_embedded=True, port=6666)

In [None]:
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import LinkContentFetcher, PreProcessor, EmbeddingRetriever

# document_store = InMemoryDocumentStore(embedding_dim=768)
link_content_fetcher = LinkContentFetcher()
preprocessor = PreProcessor()
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
)


In [None]:
from haystack import Pipeline

indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=link_content_fetcher, name="Fetcher", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["Fetcher"])
indexing_pipeline.add_node(component=retriever, name="Retriever", inputs=["PreProcessor"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["Retriever"])

In [None]:
indexing_pipeline.run(params={"Fetcher":{"query": "https://docs.haystack.deepset.ai/docs/retriever"}})

In [None]:
document_store.get_document_count()

# Use Your Pipelines and Components as Agent Tools

Here is one of our [Agent tutorials on answering multi-hop questions](https://haystack.deepset.ai/tutorials/23_answering_multihop_questions_with_agents).

![](https://raw.githubusercontent.com/TuanaCelik/anthropic-hackathon/main/agent_simple.png)![](https://raw.githubusercontent.com/TuanaCelik/anthropic-hackathon/main/agent_detailed.png)

In [None]:
from haystack.agents import Agent
from haystack.nodes import PromptNode

prompt_node = PromptNode(model_name_or_path="claude-2", api_key=anthropic_key, stop_words=["Observation:"])
agent = Agent(prompt_node=prompt_node)

```python
from haystack.agents import Tool

my_tool = Tool(
    name="Name_Of_Your_Tool",
    pipeline_or_node=your_pipeline_or_node,
    description="Description of what it's useful for",
    output_variable="answers",
)
agent.add_tool(my_tool)
```
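
To illustrate why the `PromptNode` above stops at "Observation:", here is a rough stand-alone sketch of a single agent step. The model names a tool and its input, generation stops, and the surrounding code runs the tool to produce the observation itself. Everything here is an assumption made for illustration: the `Tool:` / `Tool Input:` text format, the `Word_Counter` tool, and the `parse_tool_call` helper are not taken from the Haystack source.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Hypothetical tool registry: tool name -> plain Python callable
tools: Dict[str, Callable[[str], str]] = {
    "Word_Counter": lambda text: str(len(text.split())),
}


def parse_tool_call(llm_text: str) -> Optional[Tuple[str, str]]:
    """Extract (tool name, tool input) from text shaped like 'Tool: X' / 'Tool Input: Y'."""
    match = re.search(r"Tool:\s*(\w+)\s*Tool Input:\s*(.*)", llm_text)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


# Assumed model output; generation stopped before the model could write "Observation:"
llm_text = "Thought: I should count the words.\nTool: Word_Counter\nTool Input: retrieval augmented generation"

parsed = parse_tool_call(llm_text)
if parsed is not None:
    name, tool_input = parsed
    observation = tools[name](tool_input)
    print(f"Observation: {observation}")
```

In a real agent loop, the observation would be appended to the transcript and the model prompted again until it produces a final answer.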