# Document Question Answering

An example of using Chroma DB and LangChain to do question answering over documents.

In [1]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import VectorDBQA
from langchain.document_loaders import TextLoader

## Load documents

Load the documents to run question answering over. To use your own documents, replace this section.

In [4]:
loader = TextLoader('../data/cointelegraph_20230221_trunc.json')
documents = loader.load()
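
`TextLoader` reads the file as raw text, so any JSON syntax in it ends up in the chunks as well. If you replace this section with your own JSON data, one option is to pull out the article text first. The sketch below assumes the file holds a JSON array of objects with `title` and `content` keys; those key names are hypothetical, and the real layout of the file used here is not shown in this notebook.

```python
import json
from typing import Dict, List


def articles_from_json(raw: str) -> List[Dict[str, str]]:
    """Extract article text from a JSON array of records.

    Assumes each record has hypothetical "title" and "content" keys.
    Records with empty content are skipped.
    """
    records = json.loads(raw)
    articles = []
    for record in records:
        content = record.get("content", "").strip()
        if not content:
            continue
        articles.append({"title": record.get("title", ""), "text": content})
    return articles


# Made-up sample data, only to show the shape of the input and output.
raw = json.dumps([
    {"title": "Example headline", "content": "Example body text."},
    {"title": "Empty article", "content": ""},
])
print(articles_from_json(raw))
```

The resulting text can then be wrapped in whatever document objects your splitter expects.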

## Split documents

Split the documents into small chunks so that we can retrieve only the chunks most relevant to a query and pass those to the LLM.

In [50]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
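
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size character chunking. It is a simplification for illustration, not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also tries to break on separators such as paragraphs and lines); `split_text` is a hypothetical helper.

```python
from typing import List


def split_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij"
print(split_text(sample, chunk_size=4))                   # no overlap
print(split_text(sample, chunk_size=4, chunk_overlap=2))  # windows share 2 characters
```

With `chunk_overlap=0`, as in the cell above, a sentence that straddles a chunk boundary is cut in two; a small overlap is one way to reduce that.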

## Initialize ChromaDB

Create an embedding for each chunk and insert it into the Chroma vector database.

In [51]:
embeddings = OpenAIEmbeddings()
vectordb = Chroma.from_documents(texts, embeddings)

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.
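
Conceptually, a vector store keeps one vector per chunk and answers a query by returning the chunks whose vectors are closest to the query vector. The sketch below shows that idea with hand-written three-dimensional vectors and cosine similarity. The vectors, the `top_k` helper, and the choice of cosine similarity are assumptions for illustration; Chroma's actual index and distance metric may differ, and real embeddings have many more dimensions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          index: List[Tuple[str, Sequence[float]]],
          k: int = 4) -> List[Tuple[float, str]]:
    """Return the k (score, text) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy index: made-up chunk labels with made-up vectors.
toy_index = [
    ("chunk about ETH staking", [0.9, 0.1, 0.0]),
    ("chunk about wallet malware", [0.1, 0.9, 0.2]),
    ("chunk about layer 2 rollups", [0.7, 0.0, 0.6]),
]
query_vec = [0.2, 0.8, 0.1]

for score, text in top_k(query_vec, toy_index, k=2):
    print(f"{score:.3f}  {text}")
```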


## Create the chain

Initialize the chain we will use for question answering.

In [7]:
qa = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type="stuff", vectorstore=vectordb)
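
The `"stuff"` chain type places all retrieved chunks into a single prompt and makes one LLM call. A rough sketch of that idea follows; `build_stuff_prompt` is a hypothetical helper and the instruction wording is illustrative, not LangChain's exact prompt.

```python
from typing import List


def build_stuff_prompt(question: str, chunks: List[str]) -> str:
    """Concatenate every retrieved chunk into one prompt ("stuffing")."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_stuff_prompt(
    "What is layer 2?",
    ["First retrieved chunk.", "Second retrieved chunk."],
))
```

Because every chunk goes into one prompt, the number of retrieved chunks times the chunk size has to fit in the model's context window.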

## Ask questions!

Now we can use the chain to ask questions!

In [9]:
query = "What did the president say about Ketanji Brown Jackson"
qa.run(query)

" I don't know."

In [21]:
query = "Generate 10 questions that a retail crypto investor might want to ask a chatbot for investing research in crypto and web3 space, focused on topics of ETH, ZK, layer2 that are related to news and events that happened in 2021?"
result = qa.run(query)
print(result)


1. What is the current price of ETH? 
2. What is the current price of ZK? 
3. What is the best way to invest in ETH? 
4. What are the risks associated with investing in ZK? 
5. What are the main features of layer2 scaling? 
6. What is the most recent news about ETH? 
7. What is the most recent news about ZK? 
8. What upcoming events should I know about in the crypto and web3 space? 
9. What are the benefits of investing in layer2 solutions? 
10. How have ETH, ZK, and layer2 solutions performed in 2021?


In [22]:
query = "What is Scrypto?"
result = qa.run(query)
print(result)

 I don't know.


# Query Assisted Generation

In [52]:
hits = vectordb.similarity_search(query=query)

In [53]:
hits_page_content = [h.page_content for h in hits]
hits_sources = [h.metadata['source'] for h in hits]

In [54]:
hits_sources

['../data/cointelegraph_20230221_trunc.json',
 '../data/cointelegraph_20230221_trunc.json',
 '../data/cointelegraph_20230221_trunc.json',
 '../data/cointelegraph_20230221_trunc.json']

In [55]:
from langchain import PromptTemplate


template = """
I want you to act as a financial analyst working at Coinbase.

Base your answer on the following articles:
{article_1}
{article_2}
{article_3}
{article_4}

Answer the following question:
{question}
"""

prompt = PromptTemplate(
    input_variables=["question", "article_1", "article_2", "article_3", "article_4"],
    template=template,
)

In [56]:
# Map each template variable to the query or to one of the four retrieved chunks.
prompt_data = {
    "question": query,
    "article_1": hits_page_content[0],
    "article_2": hits_page_content[1],
    "article_3": hits_page_content[2],
    "article_4": hits_page_content[3],
}
prompt.format(**prompt_data)
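
The template above is hard-wired to exactly four articles, so it breaks if the search returns fewer hits. One way around this is to generate the placeholders and the data dictionary from the hits themselves. The sketch below uses plain `str.format` so it runs without LangChain; `build_template` and `build_prompt_data` are hypothetical helpers, not part of any library.

```python
from typing import Dict, List


def build_template(n_articles: int) -> str:
    """Build a template with one {article_i} placeholder per retrieved chunk."""
    slots = "\n".join(f"{{article_{i}}}" for i in range(1, n_articles + 1))
    return (
        "I want you to act as a financial analyst working at Coinbase.\n\n"
        "Base your answer on the following articles:\n"
        f"{slots}\n\n"
        "Answer the following question:\n"
        "{question}\n"
    )


def build_prompt_data(question: str, contents: List[str]) -> Dict[str, str]:
    """Map the question and each chunk to its template variable name."""
    data = {"question": question}
    for i, content in enumerate(contents, start=1):
        data[f"article_{i}"] = content
    return data


# Placeholder chunks standing in for hits_page_content.
contents = ["first retrieved chunk", "second retrieved chunk"]
prompt_text = build_template(len(contents)).format(
    **build_prompt_data("What is Scrypto?", contents)
)
print(prompt_text)
```

The same generated template string and variable names could be handed to `PromptTemplate` in place of the fixed four-article version.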



In [57]:
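# OpenAI was already imported in the first cell; repeated so this cell runs on its own.
from langchain.llms import OpenAI

# n=3 requests three completions selected from best_of=10 server-side candidates.
# Calling the LLM object directly returns only the first completion as a string.
llm = OpenAI(model_name="text-davinci-003", temperature=0.5, best_of=10, n=3, max_tokens=200)

llm(prompt.format(**prompt_data))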

"\nScrypto is a type of malware designed to steal cryptocurrency from victims. It is designed to target wallets that work as Chromium browser extensions, such as MetaMask, Binance Chain Wallet, or Coinbase Wallet. The malware is designed to convert the victim's stolen timezone data to Moscow Standard Time (MSK) when the data is sent back to the attackers."

In [58]:
hits_page_content

['Related: Bitcoin stealing malware: Bitter reminder for crypto users to stay vigilantInterestingly, the malware is designed to stop itself if it finds out the victim is based in Russia, Ukraine, Belarus and Kazakhstan. Cyble also found that the malware converts the victim’s stolen timezone data to Moscow Standard Time (MSK) when the data is sent back to the attackers. In February, malware named Mars Stealer was identified as targeting crypto wallets that work as Chromium browser extensions such as MetaMask, Binance Chain Wallet or Coinbase Wallet.Chainalysis warned in January that even low-skilled cybercriminals are now using malware to take funds from crypto hodlers, with cryptojacking accounting for 73% of the total value received by malware-related addresses between 2017 and 2021."}',
 'rumored to be connected with the recent death of its founder, as some have posited that the private keys to users’ funds have been lost. A reported $190 million of user funds is allegedly inaccessib