## Trial: Using LangChain for Document-Based Question Answering

* Project structure
    * documents
        * Contains all documents to run QA over
    * DocSearch.ipynb
        * Contains the LangChain + OpenAI implementation of QA over documents

* Future steps:
    * Add a utility to extract images from PDFs (e.g., tool measurements)
    * Run image-to-text conversion on all extracted images and insert the resulting text at the appropriate positions

In [1]:
# Set the OPENAI_API_KEY environment variable, which the LLM and embedding calls read
import os
os.environ["OPENAI_API_KEY"] = "Insert Key Here"
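
A forgotten placeholder key usually shows up later as an authentication error that is hard to trace. The sketch below is one way to fail early instead. The helper name `require_api_key` and the `PLACEHOLDER` constant are hypothetical and not part of the notebook or of any library.

```python
import os

# Placeholder string used in the cell above (assumption: it was left unchanged)
PLACEHOLDER = "Insert Key Here"

def require_api_key(name="OPENAI_API_KEY"):
    """Return the value of env var `name`; raise if it is unset or still the placeholder."""
    value = os.environ.get(name, "").strip()
    if not value or value == PLACEHOLDER:
        raise RuntimeError(f"Set {name} to a real API key before running the notebook.")
    return value
```

Calling `require_api_key()` right after setting the variable stops the notebook before any paid API call is attempted.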

In [19]:
# Install dependencies (add others if import errors appear; allennlp version warnings can be ignored)

!pip install --upgrade langchain openai -q
!pip install unstructured -q
!pip install detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2 -q
!pip install tiktoken -q
!pip install pdf2image
!pip install pytesseract



In [21]:
# Import libraries pertaining to QA & Text loading
import openai
import os

# QA & Text loader
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.chains.question_answering import load_qa_chain

# Import vectorstore retriever
from langchain.indexes import VectorstoreIndexCreator

# Load every file in a directory (PDF or text files)
from langchain.document_loaders import DirectoryLoader

# Split documents into text chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Embeddings for similarity search
from langchain.embeddings.openai import OpenAIEmbeddings

# Chat model
from langchain.chat_models import ChatOpenAI

In [22]:
# Load all documents found in the directory

directory = 'documents/'

def load_documents(directory):
  loader = DirectoryLoader(directory)
  documents = loader.load()
  return documents

documents = load_documents(directory)
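
To make the loading step concrete, here is a minimal standard-library sketch of what a directory loader does for plain-text files only. It is not how `DirectoryLoader` is implemented (that class delegates parsing to `unstructured` and also handles PDFs); the function name `load_text_files` is hypothetical, and the dictionaries only mimic the `page_content` / `metadata` shape of LangChain documents.

```python
from pathlib import Path

def load_text_files(directory, pattern="*.txt"):
    """Read every file matching `pattern` under `directory`, recursively."""
    records = []
    for path in sorted(Path(directory).rglob(pattern)):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            # Keep the file path so an answer can be traced back to its source
            records.append({"page_content": text, "metadata": {"source": str(path)}})
    return records
```

For example, `load_text_files("documents/")` would return one record per `.txt` file, sorted by path.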

In [23]:
# Split documents based on a specific chunk size

def split_documents(documents, chunk_size=1000, chunk_overlap=20):
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
  docs = text_splitter.split_documents(documents)
  return docs

docs = split_documents(documents)
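
The effect of `chunk_size` and `chunk_overlap` is easier to see on a simplified splitter. The sketch below cuts fixed-width character windows, which is an assumption made for illustration: `RecursiveCharacterTextSplitter` itself prefers to break on separators such as paragraphs and lines, so its chunks will generally differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=20):
    """Split `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

With `chunk_size=4` and `chunk_overlap=1`, the string `"abcdefghij"` yields `["abcd", "defg", "ghij"]`: each chunk repeats the last character of the previous one, which is what keeps a sentence that straddles a boundary visible in both chunks.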

In [24]:
# Embedding model used to vectorize chunks and queries (reads OPENAI_API_KEY)
embeddings = OpenAIEmbeddings()

In [25]:
# Use ChromaDB as vectorstore https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/chroma.html

from langchain.vectorstores import Chroma
index = Chroma.from_documents(docs, embeddings)

Using embedded DuckDB without persistence: data will be transient
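
Conceptually, the vector store keeps one embedding per chunk and answers a query by ranking chunks by closeness to the query embedding. The sketch below illustrates that ranking with hand-written toy vectors and cosine similarity. Both are assumptions: real embeddings have many more dimensions, and the metric a given store uses depends on its configuration. The names `cosine_similarity` and `top_k` are hypothetical helpers, not Chroma or LangChain calls.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, similarity) pairs for the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]

# Toy "embeddings" for three chunks and one query
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_k([1.0, 0.1], doc_vecs, k=2)
```

Here `best` ranks chunk 0 first and chunk 2 second, because the query vector points almost entirely along the first axis.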


In [26]:
# Apply similarity search with the query (and the docs in the db)

def get_similar_documents(query, k=2, score=False):
  # Return the k chunks closest to the query; with score=True, return (document, score) pairs
  if score:
    similar_docs = index.similarity_search_with_score(query, k=k)
  else:
    similar_docs = index.similarity_search(query, k=k)
  return similar_docs

In [27]:
# temperature=0 keeps the answers as deterministic as possible
llm = ChatOpenAI(temperature=0)
chain = load_qa_chain(llm, chain_type="stuff")

# Use similar documents as context to generate an answer with the llm
def get_answer(query):
  similar_docs = get_similar_documents(query)
  answer = chain.run(input_documents=similar_docs, question=query)
  return answer
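
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows that idea only; the template wording and the function name `build_stuff_prompt` are hypothetical and are not LangChain's actual prompt.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate all retrieved chunks into one context block followed by the question."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_stuff_prompt(["A is 1. ", " B is 2."], "What is A?")
```

Because every chunk goes into one prompt, this approach only works while `k` chunks of `chunk_size` characters fit in the model's context window, which is one reason to keep `k` small.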

In [29]:
query = "Who is Gautham?"
get_answer(query)

'Gautham is a person who introduced himself as an MS CSE student at Georgia Tech and mentioned that he loves NLP.'