# Building a reliable Website QA Bot Builder 🛠️
This notebook shows how to use LangChain, Chroma, and reliableGPT 💪 to reliably spin up QA bots for your users' websites.

In [1]:
#@title Environment Set-up
!pip install langchain openai reliableGPT chromadb unstructured sentence_transformers gdown

!gdown 1ovmdu43JnkrwaY6KakaSNTKYcHApOEq2

Collecting langchain
  Downloading langchain-0.0.221-py3-none-any.whl (1.2 MB)
Collecting openai
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
Collecting reliableGPT
  Downloading reliableGPT-0.2.957-py3-none-any.whl (26 kB)
Collecting chromadb
  Downloading chromadb-0.3.26-py3-none-any.whl (123 kB)
Collecting unstructured
  Downloading unstructured-0.7.12-py3-non

## Accept user input

Allow your users to pass in their website for QA.

In [2]:
user_input = "https://stripe.com/docs/india-accept-international-payments" #@param {type:"string"}

website_urls = [user_input] + ["https:\\/\\/test.hosteeva.com\\/properties\\/available\\/details\\/451-peters-unit-401-test-1", "reddit.com/r/reddevils"]

## Load their data into ChromaDB
This can often throw unexpected errors (malformed URLs, no data returned, etc.). Let's do this in a reliable way.

* Fix malformed URLs
* Try different data loaders
* Alert us if there are errors
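
To make the first bullet concrete, here is a minimal sketch of URL cleanup using only the standard library. `normalize_url` is a hypothetical helper written for illustration (it is not part of LangChain or reliableGPT), and it only covers the two failure modes seen in this notebook: escaped slashes and a missing scheme.

```python
import re
from urllib.parse import urlparse

# Matches a leading scheme such as "https://" or "ftp://".
_SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def normalize_url(raw: str, default_scheme: str = "https") -> str:
    """Best-effort cleanup of a user-supplied URL (hypothetical helper)."""
    # Undo JSON-style escaped slashes, e.g. "https:\/\/host\/path".
    url = raw.strip().replace("\\/", "/")
    # Assume https when the user left the scheme off (an assumption, not a rule).
    if not _SCHEME_RE.match(url):
        url = f"{default_scheme}://{url}"
    # Reject anything that still has no host after cleanup.
    if not urlparse(url).netloc:
        raise ValueError(f"No host found in URL: {raw!r}")
    return url


for raw in ["https:\\/\\/test.hosteeva.com\\/properties", "reddit.com/r/reddevils"]:
    print(normalize_url(raw))
```

Raising on an empty host, rather than returning the input unchanged, keeps bad URLs from reaching the loader without anyone noticing.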

In [3]:
#@markdown # 😱 Oh no! Langchain silently fails for 2 of our URLs!
#Initialize langchain document loaders
from langchain.document_loaders import UnstructuredURLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000,
                                               chunk_overlap=200,
                                               length_function=len)


# Initializing our chromadb instance
from langchain.vectorstores import Chroma
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

# Load the data from the urls
def ingest(url):
  loader = UnstructuredURLLoader(urls=[url])
  chunks = loader.load_and_split(text_splitter)
  return chunks

# Load the url data as chunks
for url in website_urls:
  chunks = ingest(url)
  if len(chunks) == 0:
    print(f"🚨 Langchain failed to load url - {url}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
ERROR:langchain.document_loaders.url:Error fetching or processing https:\/\/test.hosteeva.com\/properties\/available\/details\/451-peters-unit-401-test-1, exception: Invalid URL 'https:\\/\\/test.hosteeva.com\\/properties\\/available\\/details\\/451-peters-unit-401-test-1': No host supplied
ERROR:langchain.document_loaders.url:Error fetching or processing reddit.com/r/reddevils, exception: Invalid URL 'reddit.com/r/reddevils': No scheme supplied. Perhaps you meant https://reddit.com/r/reddevils?


🚨 Langchain failed to load url - https:\/\/test.hosteeva.com\/properties\/available\/details\/451-peters-unit-401-test-1
🚨 Langchain failed to load url - reddit.com/r/reddevils
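
The text splitter above is configured with `chunk_size=3000` and `chunk_overlap=200`. As a rough illustration of what those two numbers mean, here is a simplified sketch of fixed-size character chunking with overlap. `chunk_with_overlap` is a hypothetical function, and it is not how `RecursiveCharacterTextSplitter` works internally: the real splitter first tries to split on separators such as paragraph and line breaks.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int = 3000, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size character chunks that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "".join(str(i % 10) for i in range(7000))
pieces = chunk_with_overlap(sample)
print([len(p) for p in pieces])
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.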


In [4]:
#@markdown # Let's wrap our ingest with reliableGPT 💪 and handle these
from reliablegpt import reliableData

# initialize reliableData object. Pass in your email for failed ingestion alerts, any metadata you want to receive in your email alerts, and your initialized langchain text splitter
rDL = reliableData(user_emails=["krrish@berri.ai"], metadata={"environment": "local"}, text_splitter=text_splitter)

# identify the impacted user (can be email/id/etc.)
rDL.set_user("ishaan@berri.ai")

#Load the url data as chunks
chunks_all_up = []
for idx, url in enumerate(website_urls):
  chunks = rDL.reliableDataLoaders(ingest(url), filepath=None, web_url=url)
  if len(chunks) == 0:
    print(f"🚨 Langchain failed to load data from url - {url}")
    continue
  else:
    print(f"✅ Successfully loaded data from url - {url}")
    # add to chromadb document list
    chunks_all_up.extend(chunks)



Import error: No module named 'pypdf'
Installing required packages...
Successfully installed langchain
Successfully installed resend
Successfully installed pypdf
Successfully installed pymupdf
Successfully installed pdfminer
Successfully installed pdfminer.six
Successfully installed unstructured


ERROR:langchain.document_loaders.url:Error fetching or processing https:\/\/test.hosteeva.com\/properties\/available\/details\/451-peters-unit-401-test-1, exception: Invalid URL 'https:\\/\\/test.hosteeva.com\\/properties\\/available\\/details\\/451-peters-unit-401-test-1': No host supplied


✅ Successfully loaded data from url - https://stripe.com/docs/india-accept-international-payments


ERROR:langchain.document_loaders.url:Error fetching or processing reddit.com/r/reddevils, exception: Invalid URL 'reddit.com/r/reddevils': No scheme supplied. Perhaps you meant https://reddit.com/r/reddevils?


✅ Successfully loaded data from url - https:\/\/test.hosteeva.com\/properties\/available\/details\/451-peters-unit-401-test-1
✅ Successfully loaded data from url - reddit.com/r/reddevils
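
This notebook does not show how `reliableData` works internally, so the following is only a sketch of the general pattern described earlier: try several loaders in order, and raise an alert if none of them return data. Every name here (`load_with_fallbacks`, `on_failure`, the toy loaders) is hypothetical.

```python
from typing import Callable, List, Optional, Sequence

Loader = Callable[[str], List[str]]


def load_with_fallbacks(
    url: str,
    loaders: Sequence[Loader],
    on_failure: Optional[Callable[[str, List[str]], None]] = None,
) -> List[str]:
    """Return chunks from the first loader that yields data; alert if all fail."""
    errors: List[str] = []
    for loader in loaders:
        name = getattr(loader, "__name__", repr(loader))
        try:
            chunks = loader(url)
        except Exception as exc:  # a loader may raise anything
            errors.append(f"{name}: {exc}")
            continue
        if chunks:
            return chunks
        errors.append(f"{name}: returned no data")
    # Every loader failed: hand the collected errors to the alert hook.
    if on_failure is not None:
        on_failure(url, errors)
    return []


def broken_loader(url: str) -> List[str]:
    raise ValueError("Invalid URL")


def simple_loader(url: str) -> List[str]:
    return [f"contents of {url}"]


print(load_with_fallbacks("https://example.com", [broken_loader, simple_loader]))
```

In a real setup, `on_failure` could send an email or post to a chat channel. Here it is just a callback so the sketch stays self-contained.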


In [5]:
#@markdown # 🚀 Add our documents, and test if it worked
# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load it into Chroma
db = Chroma.from_documents(chunks_all_up, embedding_function)

# query it
query = "Are international payments accepted in India?" #@param {type:"string"}
docs = db.similarity_search(query)
print(docs)

[Document(page_content='Submit your importer/exporter code (IEC) The IEC is a code issued by the Indian Director General of Foreign Trade (DGFT) to Indian companies that intend to export from India. You can apply for an IEC at the DGFT website. An IEC is required under certain conditions.If you plan to accept Visa or Mastercard, an IEC is required only if you sell physical goods.If you plan to accept AMEX international payments for all export transactions, including selling physical goods and services. This is described by India’s Foreign Trade Policy\n\nSpecify a transaction purpose code. The transaction purpose code describes the nature of a payment received in foreign currency. The list of valid transaction purpose codes is maintained by the Reserve Bank of India (RBI). You must select the code which is closest to your product from the drop-down on the account application.\n\nThe list of transaction purpose codes supported by Stripe is copied below.\n\nOpting in or updating export d
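
For intuition about what a similarity search does, here is a toy sketch that ranks documents by cosine similarity between embedding vectors. It is a conceptual illustration only: the vectors are made up, `top_k` is a hypothetical helper, and Chroma's actual index and distance metric may differ.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          doc_vecs: Dict[str, Sequence[float]],
          k: int = 4) -> List[Tuple[str, float]]:
    """Return the k document ids most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 2-d "embeddings" purely for illustration.
docs = {"payments": [1.0, 0.0], "football": [0.0, 1.0], "mixed": [1.0, 1.0]}
print(top_k([1.0, 0.1], docs, k=2))
```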

## Create a reliable Langchain QA Bot

Calls to the OpenAI API can fail, for example when a prompt exceeds the model's context window. Let's create a QA bot that can reliably answer user questions.

* Model fallback
* Automatic retries
* Error monitoring
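
The three ideas above can be sketched as one generic wrapper. This is not reliableGPT's implementation, just a minimal illustration under stated assumptions: `call_with_retries_and_fallback` and `fake_completion` are hypothetical names, the `call` argument stands in for whatever function sends the request, and "monitoring" is reduced to a logging callback.

```python
import time
from typing import Callable, List, Optional, Sequence


def call_with_retries_and_fallback(
    call: Callable[[str], str],
    models: Sequence[str],
    max_retries: int = 2,
    backoff_seconds: float = 0.0,
    log: Callable[[str], None] = print,
) -> str:
    """Try each model in order, retrying each up to `max_retries` times."""
    last_error: Optional[Exception] = None
    for model in models:
        for attempt in range(1, max_retries + 1):
            try:
                return call(model)
            except Exception as exc:
                last_error = exc
                # Error monitoring: report every failure to the log hook.
                log(f"{model} attempt {attempt} failed: {exc}")
                time.sleep(backoff_seconds * attempt)  # simple linear backoff
    raise RuntimeError("all models failed") from last_error


def fake_completion(model: str) -> str:
    """Stand-in for an API call: only the larger-context model succeeds."""
    if model != "gpt-3.5-turbo-16k":
        raise ValueError("context_length_exceeded")
    return f"answer from {model}"


print(call_with_retries_and_fallback(fake_completion, ["gpt-3.5-turbo", "gpt-3.5-turbo-16k"], max_retries=1))
```

Retrying a context-length error on the same model will not help, so a production version would likely skip straight to the fallback for that error type. The sketch retries uniformly to stay short.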

In [6]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
import os

os.environ["OPENAI_API_KEY"] = "YOUR OPENAI API KEY" #@param {type:"string"}

def docQA(question):
  retriever = db.as_retriever()
  qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(), chain_type="stuff", retriever=retriever)
  return qa.run(question)


# Test if it worked
# docQA(query)

# now let's try putting a large document into our prompt
with open('./sample.txt', 'r') as f:
  text = f.read()

docQA(text + "\n Who is Sherlock Holmes?")

InvalidRequestError: ignored

In [7]:
#@markdown Now let's retry this same query, but this time provide reliableGPT with a model fallback strategy (if gpt-3.5-turbo fails, try gpt-3.5-turbo-16k)
from reliablegpt import reliableGPT
import openai

openai.ChatCompletion.create = reliableGPT(openai.ChatCompletion.create, user_email='ishaan@berri.ai', fallback_strategy=["gpt-3.5-turbo-16k"])

docQA(text + "\n Who is Sherlock Holmes?")

ReliableGPT: Got Exception This model's maximum context length is 4097 tokens. However, your messages resulted in 12769 tokens. Please reduce the length of the messages.
ReliableGPT: invalid request error - context_length_exceeded
ReliableGPT: Checking request model gpt-3.5-turbo-16k {'messages': [{'role': 'system', 'content': 'Use the following pieces of context to answer the users question. \nIf you don\'t know the answer, just say that you don\'t know, don\'t try to make up an answer.\n----------------\nOther Technical Services including scientific/space services.\n\nP1101\n\nAudio-visual and related services like Motion picture and video tape production, distribution and projection services.\n\nP1103\n\nRadio and television production, distribution and transmission services\n\nP1104\n\nEntertainment services\n\nP1105\n\nMuseums, library and archival services\n\nP1106\n\nRecreation and sporting activity services\n\nP1107\n\nEducational services (e.g. fees received for correspondence

'Sherlock Holmes is a fictional detective created by Sir Arthur Conan Doyle. He is known for his keen powers of observation and deduction, as well as his ability to solve complex mysteries. Holmes is often accompanied by his loyal friend and assistant, Dr. John Watson, as they work together to solve cases in London.'