# Document Question Answering with local persistence

This example uses Chroma and LangChain to answer questions over documents, backed by a locally persisted database.
Embeddings and documents are stored on disk, so you can load and reuse them in a later session.

In [9]:
pip install chromadb

In [1]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import VectorDBQA
from langchain.document_loaders import TextLoader

## Load and process documents

Load the documents to do question answering over. To run this over your own documents, replace this section.

Next we split documents into small chunks. This is so we can find the most relevant chunks for a query and pass only those into the LLM.
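
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window character chunking. It is a simplification: `split_into_chunks` is a hypothetical helper written for illustration, and it cuts at fixed offsets, whereas `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and newlines.

```python
def split_into_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list:
    """Cut text into character windows of at most chunk_size (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each window starts `step` characters after the previous one,
    # so consecutive windows share `chunk_overlap` characters.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij" * 3  # 30 characters
chunks = split_into_chunks(sample, chunk_size=12, chunk_overlap=2)
print([len(chunk) for chunk in chunks])  # [12, 12, 10]
```

Smaller chunks make retrieval more precise but give the LLM less surrounding context per chunk; a non-zero overlap reduces the chance that a sentence is cut in half at a chunk boundary.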

In [18]:
# Only the loaders used below are needed; RecursiveCharacterTextSplitter is already imported above
from langchain.document_loaders import OnlinePDFLoader, SeleniumURLLoader

In [21]:
url = ["https://www.nature.com/articles/s41565-023-01366-7",
       "https://www.nature.com/articles/s43586-023-00211-4",
       "https://www.nature.com/articles/s44161-023-00232-y",
       "https://www.nature.com/articles/s42004-023-00852-2",
       "https://www.nature.com/articles/s41467-022-32349-2",
       "https://www.nature.com/articles/s41565-023-01382-7",
       "https://www.nature.com/articles/s44222-023-00035-7"
       ]
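
When a URL list is assembled by hand, a repeated entry is easy to miss, and every repeat is fetched, split, and embedded a second time. A minimal sketch of order-preserving deduplication follows; the `example.com` URLs and the names `candidate_urls` and `unique_urls` are placeholders, not part of this notebook.

```python
# Hypothetical example URLs; the second and fourth entries are identical.
candidate_urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
    "https://example.com/b",
]

# dict keys are unique and keep insertion order, so this drops repeats
# while preserving the order of first appearance.
unique_urls = list(dict.fromkeys(candidate_urls))
print(unique_urls)
```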

In [22]:
url_loader = SeleniumURLLoader(urls=url)
urldata = url_loader.load()

In [23]:
# Load the PDF and combine it with the web pages fetched above
loader1 = OnlinePDFLoader("https://downloads.hindawi.com/journals/bmri/2022/2426960.pdf")
documents = loader1.load() + urldata

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

detectron2 is not installed. Cannot use the hi_res partitioning strategy. Falling back to partitioning with the fast strategy.


## Initialize persisted ChromaDB

Create embeddings for each chunk and insert into the Chroma vector database. The `persist_directory` argument tells ChromaDB where to store the database when it's persisted. 
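
As a rough mental model of what a vector store does with those embeddings, the sketch below embeds each text, keeps the vectors, and returns the stored texts closest to a query by cosine similarity. `toy_embed`, `cosine_similarity`, and `ToyVectorStore` are hypothetical names invented for illustration. The bag-of-words vector is a stand-in for a real embedding model, and the exhaustive search is a simplification, not a description of how Chroma indexes data.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Keeps (text, vector) pairs in memory and ranks them against a query."""

    def __init__(self):
        self._rows = []

    def add_texts(self, texts):
        for text in texts:
            self._rows.append((text, toy_embed(text)))

    def similarity_search(self, query, k=2):
        query_vector = toy_embed(query)
        ranked = sorted(
            self._rows,
            key=lambda row: cosine_similarity(query_vector, row[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "nanoparticles deliver drugs to tumours",
    "machine learning predicts protein structure",
    "graphene sensors detect glucose",
])
top = store.similarity_search("how do nanoparticles deliver drugs", k=1)
print(top)  # ['nanoparticles deliver drugs to tumours']
```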

In [24]:
import os
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # read the key from the environment; never hard-code a real key

In [25]:
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'db'

embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
vectordb = Chroma.from_documents(documents=texts, embedding=embedding, persist_directory=persist_directory)

Using embedded DuckDB with persistence: data will be stored in: db


# Embedding complete: query the database

## Persist the Database
In a notebook, call `persist()` explicitly to make sure the embeddings are written to disk.
In a script this isn't necessary, because the database is persisted automatically when the client object is destroyed.

In [26]:
vectordb.persist()
vectordb = None
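
The persist-then-reload round trip can be pictured with a toy index written to a JSON file. This is only a sketch of the idea: `save_index`, `load_index`, and the `index.json` file name are hypothetical, and Chroma's actual on-disk format is different.

```python
import json
import tempfile
from pathlib import Path


def save_index(rows, persist_directory):
    """Write (text, vector) rows to a JSON file inside persist_directory."""
    directory = Path(persist_directory)
    directory.mkdir(parents=True, exist_ok=True)
    payload = [{"text": text, "vector": vector} for text, vector in rows]
    (directory / "index.json").write_text(json.dumps(payload), encoding="utf-8")


def load_index(persist_directory):
    """Read the rows back from the same directory."""
    raw = (Path(persist_directory) / "index.json").read_text(encoding="utf-8")
    return [(item["text"], item["vector"]) for item in json.loads(raw)]


rows = [("first chunk", [0.1, 0.2]), ("second chunk", [0.3, 0.4])]
with tempfile.TemporaryDirectory() as tmp:
    save_index(rows, tmp)
    rows = None  # drop the in-memory copy, as the cell above does with vectordb
    restored = load_index(tmp)  # everything needed comes back from disk
print(restored)
```

The reload only works when it points at the directory that was written, which is the toy analogue of passing the same `persist_directory` again.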

## Load the Database from disk, and create the chain
Be sure to pass the same `persist_directory` and `embedding_function` as you did when you instantiated the database. Then initialize the chain used for question answering.

In [27]:
# Now we can load the persisted database from disk, and use it as normal. 
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
qa = VectorDBQA.from_chain_type(llm=OpenAI(openai_api_key=OPENAI_API_KEY), chain_type="stuff", vectorstore=vectordb)

Using embedded DuckDB with persistence: data will be stored in: db
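
The `chain_type="stuff"` setting means the retrieved chunks are all placed ("stuffed") into a single prompt together with the question. A minimal sketch of that idea is shown here; `build_stuff_prompt` and the prompt wording are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(question: str, chunks: list) -> str:
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = ["Chunk about nanomedicine.", "Chunk about machine learning."]
prompt = build_stuff_prompt("What is AI nanomedicine?", retrieved)
print(prompt)
```

Because every retrieved chunk goes into one prompt, the combined text has to fit in the model's context window, which is one reason the documents are split into small chunks first.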


## Ask questions!

Now we can use the chain to ask questions!

In [28]:
query = "What is the AI nanomedicine?"
qa.run(query)

' AI nanomedicine is the utilization of artificial intelligence (AI) and nanotechnology to improve the delivery of drugs, diagnostics, and treatments for cancer and cardiovascular diseases.'

## Cleanup

When you're done with the database, you can delete it from disk. You can delete the specific collection you're working with (if you have several), or delete the entire database by nuking the persistence directory.

In [10]:
# # To clean up, you can delete the collection
# vectordb.delete_collection()
# vectordb.persist()

# # Or just nuke the persist directory
# !rm -rf db/
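
Outside a notebook, or on a platform without `rm`, the persistence directory can be removed from Python with `shutil.rmtree`. The sketch below runs against a temporary stand-in directory so that it does not touch a real `db/` folder; `demo_dir` and the `index.bin` file are placeholders.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the real persist directory.
    demo_dir = Path(tmp) / "db"
    demo_dir.mkdir()
    (demo_dir / "index.bin").write_bytes(b"placeholder")

    # Remove the directory and everything in it, like `rm -rf` on this path.
    shutil.rmtree(demo_dir, ignore_errors=True)
    removed = not demo_dir.exists()

print(removed)  # True
```

To delete the real database, pass the same path used for `persist_directory` to `shutil.rmtree`.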

