This notebook uses Chroma DB and LangChain to do question answering over documents with a locally persisted database. Embeddings and documents are stored on disk once, then reloaded and reused in later sessions.

This notebook is adapted from: https://github.com/hwchase17/chroma-langchain/blob/master/persistent-qa.ipynb

In [9]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader

from llm_challenge.utils.misc import read_dict_from_json, set_openai_api_key, write_dict_to_json
from pathlib import Path
from tqdm import tqdm


ModuleNotFoundError: No module named 'llm_challenge'

In [3]:
set_openai_api_key()

NameError: name 'set_openai_api_key' is not defined

Load and process documents
Load documents to do question answering over

In [12]:
dset = "tutorial"
DATA_DIR = "../data"
qas_dict = read_dict_from_json(f"{DATA_DIR}/qas/qas_{dset}.json")
# Unique datasheet paths referenced by the questions
doc_fnames = list({f"{DATA_DIR}/datasheets/{datum['datasheet']}" for datum in qas_dict.values()})

In [13]:
# Load the text of every datasheet into LangChain documents
loaders = [TextLoader(fname) for fname in doc_fnames]
documents = [doc for loader in loaders for doc in loader.load()]

Next we split documents into small chunks. This is so we can find the most relevant chunks for a query and pass only those into the LLM.

In [14]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
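
To make the chunking step concrete, here is a minimal pure-Python sketch of fixed-size splitting. It is a simplification, not LangChain's implementation: `RecursiveCharacterTextSplitter` tries to break on separators such as paragraph breaks, newlines and spaces before falling back to raw character cuts, while the hypothetical `split_fixed` helper below only slides a character window over the text.

```python
from typing import List


def split_fixed(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> List[str]:
    """Cut `text` into windows of at most `chunk_size` characters.

    Hypothetical helper for illustration only; it ignores separators.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window start advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "A" * 2500  # stand-in for a datasheet's text
chunks = split_fixed(sample, chunk_size=1000, chunk_overlap=0)
chunk_lengths = [len(c) for c in chunks]
print(chunk_lengths)  # prints [1000, 1000, 500]
```

With `chunk_overlap=0`, as in the cell above, the chunks partition the text; a positive overlap would repeat the tail of each chunk at the head of the next.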

Initialize Persisted ChromaDB

Create embeddings for each chunk and insert into the Chroma vector database. The persist_directory argument tells ChromaDB where to store the database when it's persisted.

In [21]:
# Supplying a persist_directory will store the embeddings on disk
persist_directory = f'{DATA_DIR}/qas/db_{dset}'
embedding = OpenAIEmbeddings()

In [23]:
# Load the persisted database from disk if it already exists.
# Be sure to pass the same persist_directory and embedding_function as
# you did when you instantiated the database.
#
# When the database is built in a notebook (next cell), call persist() to ensure
# the embeddings are written to disk. This isn't necessary in a script - the
# database will be automatically persisted when the client object is destroyed.

if Path(persist_directory).exists():
    print("Database exists... loading from disk")
    # Once loaded, the persisted database can be used as normal
    vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
else:
    print("Database does not exist. Do you want to recompute it?")
    print("If yes, set `is_generate_db` to True in the next cell.")

Database exists... loading from disk


In [24]:
# Embedding every chunk is token-expensive: keep this False unless the database must be (re)built
is_generate_db = False
if is_generate_db:
    vectordb = Chroma.from_documents(documents=texts, embedding=embedding, persist_directory=persist_directory)
    vectordb.persist()
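
The load-or-build logic is spread over two cells, and if the directory is missing while `is_generate_db` is False, `vectordb` is never defined. One possible way to fold this into a single helper is sketched below. The name `load_or_build` and its parameters are hypothetical, not part of Chroma or LangChain; the actual loading and building are passed in as callables.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(
    persist_directory: str,
    load: Callable[[str], T],
    build: Callable[[str], T],
    allow_build: bool = False,
) -> T:
    """Return load(dir) if the directory exists, otherwise build(dir).

    Building is refused unless `allow_build` is True, so that an expensive
    embedding run is never triggered by accident. (Hypothetical helper.)
    """
    if Path(persist_directory).exists():
        return load(persist_directory)
    if not allow_build:
        raise FileNotFoundError(
            f"No database at {persist_directory!r}; pass allow_build=True to create it."
        )
    return build(persist_directory)
```

In this notebook, `load` would wrap `Chroma(persist_directory=..., embedding_function=embedding)`, `build` would wrap `Chroma.from_documents(...)` followed by `persist()`, and `allow_build` would take the value of `is_generate_db`.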

In [34]:
# create the retrieval QA chain with `stuff` mode
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=vectordb.as_retriever())
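
As a rough picture of what the chain above does, the `stuff` chain type places all retrieved chunks into a single prompt for the LLM. The sketch below mimics that flow in pure Python under loud assumptions: a toy word-overlap score stands in for embedding similarity, the prompt wording is invented rather than LangChain's template, and the sample chunks are made up for illustration.

```python
from typing import List


def word_overlap(query: str, chunk: str) -> int:
    """Toy relevance score: number of lowercase words shared by query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks with the highest toy score (stand-in for a vector search)."""
    return sorted(chunks, key=lambda ch: word_overlap(query, ch), reverse=True)[:k]


def build_stuff_prompt(question: str, retrieved: List[str]) -> str:
    """'Stuff' all retrieved chunks into one prompt (hypothetical wording)."""
    context = "\n\n".join(retrieved)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up chunks, not taken from the real datasheets
chunks = [
    "The supply voltage range is 2.7 V to 5.5 V.",
    "The package is a 16-pin TSSOP.",
    "Operating temperature is -40 C to 85 C.",
]
question = "What is the supply voltage range?"
top = retrieve(question, chunks, k=2)
prompt = build_stuff_prompt(question, top)
print(prompt)
```

Only the retrieved chunks reach the prompt, which is why the number of prompt tokens depends on the chunk size and on how many chunks the retriever returns.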

Ask questions!

Now we can use the chain to ask questions!



In [35]:
from langchain.callbacks import get_openai_callback

answers_with_retrieval = {}
with get_openai_callback() as cb:
    for q_id, qa_dict in (pbar := tqdm(qas_dict.items())):
        pbar.set_description(f"Answering question {q_id}")
        answer = qa.run(qa_dict["question"])
        answers_with_retrieval[q_id] = answer
    print(cb)

Answering question 50: 100%|███████████████████████████████████████████████████████████████████████| 5/5 [00:14<00:00,  2.87s/it]

Tokens Used: 6334
	Prompt Tokens: 6205
	Completion Tokens: 129
Successful Requests: 5
Total Cost (USD): $0.12668





In [40]:
# https://github.com/hwchase17/langchain/blob/57f370cde97fb3f2c836cbc0dc764be55a571d5a/langchain/callbacks/openai_info.py#L7
# text-davinci-003
COST_PER_1K_TOKENS_DAV = 0.02
print(f"${6334 * COST_PER_1K_TOKENS_DAV * 1e-3:.5f}")

$0.12668


This is more expensive than the approach discussed in nb02, even though it uses fewer tokens:
```
        NUMPY   | LANGCHAIN
tokens: 37798   |  6334
cost  : $0.0416 | $0.126
```
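
One way to read the table is to divide each cost by its token count, which gives an effective price per 1K tokens. The sketch below does that arithmetic with the figures above. The LangChain rate should match `COST_PER_1K_TOKENS_DAV`; the nb02 rate is only implied by the table, and this notebook does not state which models nb02 billed.

```python
def cost_per_1k(total_cost_usd: float, tokens: int) -> float:
    """Effective price in USD per 1,000 tokens."""
    return total_cost_usd / tokens * 1000


# Figures copied from the table and the callback output above
nb02_rate = cost_per_1k(0.0416, 37798)
langchain_rate = cost_per_1k(0.12668, 6334)

print(f"nb02 (NUMPY): ${nb02_rate:.4f} per 1K tokens")
print(f"LANGCHAIN:    ${langchain_rate:.4f} per 1K tokens")
print(f"ratio:        {langchain_rate / nb02_rate:.1f}x")
```

The effective per-token price here is roughly 18 times higher than in nb02, which outweighs the sixfold reduction in tokens.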

Store the result

In [27]:
write_dict_to_json(f"./langchainRetriever_{dset}.json", answers_with_retrieval)

Evaluating with

```
python scripts/evaluate.py --submission_json_path ./notebooks/langchainRetriever_tutorial.json --reference_json_fname ./data/qas/qas_tutorial.json --submission_evaluation_json_fname ./langchainRetriever_evaluation.json
```

```
python scripts/evaluate.py --submission_json_path ./notebooks/langchainRetriever_train.json --reference_json_fname ./data/qas/qas_train.json --submission_evaluation_json_fname ./langchainRetriever_evaluation.json
```
gets us 73.3%.