# Clarifai Doc-Retrieval using React Docstore

This notebook walks through building a document Q&A system using the Clarifai vectorstore and LangChain's ReAct Docstore agent over web-scraped Clarifai documentation. The result lets a user ask questions about the Clarifai docs.


The steps are as follows:

- Web scraping the Clarifai docs website.
- Processing and storing the docs in the Clarifai vectorstore.
- Building a ReAct agent to search the Clarifai vectorstore.
- Using the agent to answer user queries related to Clarifai.

## Agents

The core idea of agents is to use a language model to choose a sequence of actions to take. In agents, a language model is used as a reasoning engine to determine which actions to take and in which order.
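The control flow behind this idea can be sketched with a toy loop in which a stand-in function plays the role of the language model. Everything named here (`TOOLS`, `fake_llm`, `run_agent`) is hypothetical and not part of LangChain or Clarifai; it is only a minimal illustration of "the model picks the next action, the program executes it, and the observation is fed back".

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tools: each maps an input string to an observation string.
TOOLS: Dict[str, Callable[[str], str]] = {
    "Search": lambda q: f"(search results for '{q}')",
    "Lookup": lambda q: f"(passage mentioning '{q}')",
}

def fake_llm(question: str, history: List[Tuple[str, str, str]]) -> Tuple[str, str]:
    """Stand-in for a language model: returns (action, action_input)."""
    if not history:
        return "Search", question
    if len(history) == 1:
        return "Lookup", "label"
    return "Finish", f"Answer based on {len(history)} observations"

def run_agent(question: str, max_steps: int = 5) -> str:
    """Ask the 'model' for one action at a time until it decides to finish."""
    history: List[Tuple[str, str, str]] = []
    for _ in range(max_steps):
        action, action_input = fake_llm(question, history)
        if action == "Finish":
            return action_input
        observation = TOOLS[action](action_input)
        history.append((action, action_input, observation))
    return "Stopped: step limit reached"

print(run_agent("How to label data in clarifai?"))
```

In a real agent the decision function is an LLM call and the step limit guards against loops that never reach a final answer.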



### Setup

In [None]:
!pip install -U langchain
!pip install clarifai

You can use several language models from the [Clarifai](https://clarifai.com/explore/models?filterData=%5B%7B%22field%22%3A%22use_cases%22%2C%22value%22%3A%5B%22llm%22%5D%7D%5D&page=1&perPage=24) platform. Sign up and get your [PAT](https://clarifai.com/settings/security) to access them.

In [1]:
# Note: the PAT can also be passed directly as an argument when instantiating the classes,
# so you can supply it either as an environment variable or as an argument.
import os
# Replace the placeholder with your own PAT; avoid sharing real tokens in notebooks.
os.environ["CLARIFAI_PAT"] = "YOUR_CLARIFAI_PAT"

### Web Scraping

Extracting docs from https://docs.clarifai.com/ using BeautifulSoup.

Note: Only some pages (the Portal Guide) are stored, for demo purposes.

In [3]:
import requests
from bs4 import BeautifulSoup
import re

url = 'https://docs.clarifai.com/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')

urls = []
for link in soup.find_all('a', attrs={'href': re.compile("^/portal")}):
    portal_url = 'https://docs.clarifai.com'+link.get('href')
    sub_reqs = requests.get(portal_url)
    soup_1 = BeautifulSoup(sub_reqs.text, 'html.parser')
    re_match = portal_url.split('/')[-2]
    for sublink in soup_1.find_all('a', attrs={'href': re.compile("^/portal-guide/"+re_match)}):
        portal_sub_url = sublink.get('href')
        if portal_sub_url.startswith('/'):
            urls.append('https://docs.clarifai.com'+portal_sub_url)

In [4]:
urls[:5],len(urls)

(['https://docs.clarifai.com/portal-guide/data/',
  'https://docs.clarifai.com/portal-guide/data/supported-formats',
  'https://docs.clarifai.com/portal-guide/data/explorer/',
  'https://docs.clarifai.com/portal-guide/data/collectors',
  'https://docs.clarifai.com/portal-guide/datasets/'],
 78)
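The link-collection step can also be written so that relative links are resolved with `urllib.parse.urljoin` and duplicates are dropped while preserving order. This is a minimal sketch under assumptions: `collect_portal_urls` and `sample_hrefs` are hypothetical names, and the hrefs are hard-coded here rather than scraped, so the snippet runs without network access.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://docs.clarifai.com/"

def collect_portal_urls(hrefs, base_url=BASE_URL, prefix="/portal-guide/"):
    """Keep hrefs under `prefix`, make them absolute, and drop duplicates."""
    pattern = re.compile("^" + re.escape(prefix))
    seen = set()
    urls = []
    for href in hrefs:
        if not pattern.match(href):
            continue  # skip links outside the section of interest
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls

# Hypothetical hrefs, standing in for values returned by link.get('href').
sample_hrefs = [
    "/portal-guide/data/",
    "/portal-guide/data/supported-formats",
    "/portal-guide/data/",   # duplicate link
    "/api-guide/predict/",   # different section, filtered out
]
print(collect_portal_urls(sample_hrefs))
```

De-duplicating at this stage avoids fetching and indexing the same page more than once.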

In [9]:
urls = ['https://docs.clarifai.com/portal-guide/data/',
       'https://docs.clarifai.com/portal-guide/annotate/',
       'https://docs.clarifai.com/portal-guide/model/training-basics',
       'https://docs.clarifai.com/portal-guide/model/training-faqs',
       'https://docs.clarifai.com/portal-guide/clarifai-organizations/']

Using LangChain's HTMLHeaderTextSplitter to split the docs into sections based on their headers.

In [10]:
from langchain.document_loaders import AsyncHtmlLoader
from langchain.text_splitter import HTMLHeaderTextSplitter


headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

def parse_website(urls):
    final_docs = []
    loader = AsyncHtmlLoader(urls)
    docs = loader.load()
    html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    #looping the URLS
    for doc in docs:
        html_header_splits = html_splitter.split_text(doc.page_content)
        for header_doc in html_header_splits:
            if len(header_doc.metadata)>0:
                header_doc.metadata.update(doc.metadata)
                final_docs.append(header_doc)
    return final_docs
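
To make header-based splitting concrete, here is a simplified sketch built only on the standard library's `html.parser`. It is not how `HTMLHeaderTextSplitter` is implemented; `HeaderSplitter` and `split_by_headers` are hypothetical names, and the sketch only shows the general idea of starting a new chunk at each header and attaching the headers in scope as metadata.

```python
from html.parser import HTMLParser

class HeaderSplitter(HTMLParser):
    """Toy splitter: starts a new chunk at every h1/h2/h3 tag."""

    HEADERS = {"h1": "Header 1", "h2": "Header 2", "h3": "Header 3"}

    def __init__(self):
        super().__init__()
        self.chunks = []        # list of {"metadata": ..., "text": ...}
        self._metadata = {}     # headers currently in scope
        self._in_header = None  # tag name while inside a header element
        self._buffer = []       # body text collected for the current chunk

    def flush(self):
        """Emit the buffered body text as a chunk, if there is any."""
        text = " ".join(self._buffer).strip()
        if text:
            self.chunks.append({"metadata": dict(self._metadata), "text": text})
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADERS:
            self.flush()
            self._in_header = tag

    def handle_endtag(self, tag):
        if tag == self._in_header:
            self._in_header = None

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self._in_header:
            levels = list(self.HEADERS)
            # Forget headers at the same or a deeper level, then record this one.
            for tag in levels[levels.index(self._in_header):]:
                self._metadata.pop(self.HEADERS[tag], None)
            self._metadata[self.HEADERS[self._in_header]] = data
        else:
            self._buffer.append(data)

def split_by_headers(html):
    parser = HeaderSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit the final chunk
    return parser.chunks

html = "<h1>Data</h1><p>Upload inputs.</p><h2>Formats</h2><p>Images and video.</p>"
for chunk in split_by_headers(html):
    print(chunk)
```

Each chunk carries the headers it sits under, which is the same kind of metadata that `parse_website` merges with the page-level metadata.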

In [11]:
parsed_docs = parse_website(urls)

Fetching pages: 100%|##############################################################################################| 5/5 [00:01<00:00,  3.36it/s]


## Uploading the Docs to Clarifai VectorStore

In [14]:
#importing Clarifai Vectorstore from langchain
from langchain.vectorstores import Clarifai as Clarifaivectorstore

In [15]:
clarifai_vector_db = Clarifaivectorstore.from_documents(
    user_id="sanjaychelliah",
    app_id= "langchain_docstore",
    documents = parsed_docs,
    number_of_docs=1
)

## Retriever function (custom search function for Docstore)

Refer: https://python.langchain.com/docs/modules/agents/agent_types/react_docstore

In [16]:
from langchain.llms import Clarifai as Clarifaillm
from langchain.retrievers.multi_query import MultiQueryRetriever

In [17]:
#Using Clarifai LLM for retriever

In [18]:
#Model URL from Clarifai Community
MODEL_URL = "https://clarifai.com/openai/chat-completion/models/GPT-4"

llm=Clarifaillm(model_url= MODEL_URL)

### MultiQueryRetriever

- The MultiQueryRetriever automates the process of prompt tuning by using an LLM to generate multiple queries from different perspectives for a given user input query.
- By generating multiple perspectives on the same question, the MultiQueryRetriever might be able to overcome some of the limitations of the distance-based retrieval and get a richer set of results.
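
The two points above can be sketched as follows: generate several variants of the question, run each through the retriever, and keep the de-duplicated union of the results. This is a hedged illustration, not LangChain's implementation; `multi_query_retrieve`, `fake_generate_queries`, `fake_retrieve`, and `CORPUS` are hypothetical stand-ins for the LLM and the vector store.

```python
from typing import Callable, List

def multi_query_retrieve(
    question: str,
    generate_queries: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Run every generated query and return the de-duplicated union of results."""
    seen = set()
    unique_docs = []
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

# Hypothetical stand-in for the LLM that rewrites the question.
def fake_generate_queries(question: str) -> List[str]:
    return [f"{question} tutorial", f"steps: {question}"]

# Hypothetical stand-in for the vector store: query -> ranked document ids.
CORPUS = {
    "label data tutorial": ["doc-A", "doc-B"],
    "steps: label data": ["doc-B", "doc-C"],
}

def fake_retrieve(query: str) -> List[str]:
    return CORPUS.get(query, [])

print(multi_query_retrieve("label data", fake_generate_queries, fake_retrieve))
```

Here `doc-B` is returned by both query variants but appears only once in the result, while `doc-C` is found only because of the second phrasing.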

In [19]:
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=clarifai_vector_db.as_retriever(), llm=llm
)

Custom Lookup function for the React agent

In [32]:
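def doc_lookup(search_query):
    # Retrieve documents for the query and return the top-ranked one.
    unique_docs = retriever_from_llm.get_relevant_documents(query=search_query)
    if not unique_docs:
        # Avoid an IndexError when nothing is retrieved.
        return "No matching document found."
    return unique_docs[0]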

In [33]:
lookup_function = doc_lookup

## React Docstore Agent

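This agent uses the ReAct framework to interact with a docstore. It mirrors the setup in the original [ReAct paper](https://arxiv.org/pdf/2210.03629.pdf), specifically the Wikipedia example, in which the model alternates between a thought, an action such as `Search[...]` or `Lookup[...]`, and the resulting observation.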
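
A ReAct-style agent has to extract the chosen action from the model's free-text output before it can call a tool. The snippet below is a minimal sketch of that step, assuming the `Action: Name[input]` format; `ACTION_PATTERN` and `parse_action` are hypothetical and do not reproduce LangChain's own output parser.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "Action: Search[how to label data in clarifai]".
ACTION_PATTERN = re.compile(r"Action:\s*(\w+)\[(.*?)\]", re.DOTALL)

def parse_action(llm_output: str) -> Optional[Tuple[str, str]]:
    """Return (tool_name, tool_input), or None if no action is present."""
    match = ACTION_PATTERN.search(llm_output)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()

well_formed = (
    "Thought: I need to search for the steps of labeling data in Clarifai.\n"
    "Action: Search[how to label data in clarifai]"
)
print(parse_action(well_formed))
print(parse_action("The process of labeling data involves several steps."))
```

When the model answers in plain prose instead of emitting an action, parsing fails; an agent run with `handle_parsing_errors=True` reports the problem back to the model rather than raising an exception.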


In [None]:
#Import necessary libraries.
import langchain
from langchain.agents import AgentType, Tool, initialize_agent
from langchain.agents.react.base import DocstoreExplorer
from langchain.docstore import DocstoreFn
from langchain.llms import Clarifai as Clarifaillm

In [34]:
docstore = DocstoreExplorer(DocstoreFn(lookup_fn=lookup_function))
tools = [
    Tool(
        name="Search",
        func=docstore.search,
        description="useful for when you need to ask with search",
    ),
    Tool(
        name="Lookup",
        func=docstore.lookup,
        description="useful for when you need to ask with lookup",
    ),
]

Initializing the agent with Clarifai LLM

In [35]:
#Model URL from Clarifai Community
MODEL_URL = "https://clarifai.com/openai/chat-completion/models/GPT-4"

llm=Clarifaillm(model_url=MODEL_URL)

In [36]:
react = initialize_agent(
    tools,
    llm,
    agent=AgentType.REACT_DOCSTORE,
    verbose=True,
    handle_parsing_errors=True,
)

## Interacting with the Docstore

In [37]:
question = "How to label data in clarifai?"
react.run(question)



> Entering new AgentExecutor chain...
Thought: To answer this question, I need to search for the steps or process of labeling data in Clarifai.
Action: Search[how to label data in clarifai]


Observation: Clarifai offers fully managed data annotation services for creating high-quality training datasets. Scale your AI initiatives quickly with high-quality, human-annotated data. You can find out more here.
Thought: The information provided does not directly provide the steps for labeling data in Clarifai. I need to search for more specific instructions.
Action: Search[Clarifai data labeling tutorial]


Observation: Label your data  
Labeling (also known as annotating) refers to the process of adding one or more relevant tags, or keywords—usually referred to as concepts—that best describe the state of your inputs.  
For example, annotations might indicate whether an image contains a jogger or a cyclist, which words were spoken in a recorded audio file, or if a concrete block contains cracks.  
Successfully labeling data is a key ingredient to any custom AI solution. You can use a concept to annotate an input if that input has that entity. That’s how you prepare training data to teach your models to recognize new concepts.  
After training your model using the annotated concepts, the model can learn to recognize them when it encounters data without those tags.  
Clarifai Labeler offers custom tools for labeling image and video datasets, and helps you delegate and manage labeling projects of any size.
Thought: Could not parse LLM Output: The process of labe