<a href="https://colab.research.google.com/github/LxYuan0420/nlp/blob/main/DocQA_with_PDFs_with_LangChain_and_OpenAI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### DocQA with PDFs with LangChain and OpenAI






###### Install dependencies and set up the API key

In [None]:
# RUN THIS CELL FIRST!
!pip install -U langchain pypdf tiktoken openai 

In [None]:
!pip install chromadb

In [1]:
import os
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

In [2]:
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
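Hardcoding the key in a cell makes it easy to leak when the notebook is shared. As a minimal sketch, the key could instead be read from the environment and prompted for only when it is missing. The helper name `ensure_api_key` and its `prompt_fn` parameter are hypothetical and not part of LangChain or the OpenAI client.

```python
import os
from getpass import getpass


def ensure_api_key(env_var="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting only if it is missing.

    `ensure_api_key` is a hypothetical helper written for this notebook.
    """
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value, so the key never lands in cell output.
        key = prompt_fn(f"Enter {env_var}: ").strip()
        if not key:
            raise ValueError(f"{env_var} must not be empty")
        os.environ[env_var] = key
    return key


# Usage in a notebook cell (prompts once per session if the variable is unset):
# ensure_api_key()
```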

###### Load PDFs from directory and split them into chunks

In [23]:
!ls /content/my_pdf_dir

'SG221111OTHRYBSC_Stamford Land Corporation Ltd_20221111172130_00_FS_2Q_20220930.pdf'
'SG221111OTHRYC5J_Dragon Group Intl Limited_20221111173823_00_FS_3Q_20220930.pdf'
'SG221114OTHRKF8W_Hs Optimus Holdings Limited_20221114213742_00_FS_2Q_20220930.pdf'


In [5]:
loader = PyPDFDirectoryLoader("/content/my_pdf_dir")
docs = loader.load()

In [6]:
type(docs)

list

In [24]:
# Each element is a Document with `page_content` and `metadata` attributes
docs[0]

Document(page_content=' \n \n \n \n \n \n \n \n \n \n \n \n \nCompany No. 199306761C \n \n  \nDragon International Limited and its Subsidiaries \n \n \nCondensed Financial Statements  \nFor the Nine Months Ended 30 September 2022 \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n ', metadata={'source': '/content/my_pdf_dir/SG221111OTHRYC5J_Dragon Group Intl Limited_20221111173823_00_FS_3Q_20220930.pdf', 'page': 0})

In [26]:
docs[7].page_content

'Dragon Group International Limited   •   Page 8 \n \n \nNOTES TO THE CONDENSED FINANCIAL STATEMENTS \nFOR THE NINE MONTHS ENDED 30 SEPTEMBER 2022 \n \n 1. CORPORATION INFORMATION \n \nDragon Group International Limited (the “Company”) is a limited liability company which is domiciled a nd incorporated \nin Singapore and listed on the Singapore Exchange Secu rities Trading Limited (“SGX-ST”). The immediate and \nultimate holding company is ASTI Holdings Limited (“ ASTI”), also incorporated in Singapore. \n \nThe Company was placed on the watch-list under fina ncial entry criteria pursuant to Rule 1311(1) of th e Listing \nManual of the SGX-ST on 4 March 2015, and under minim um trading price criteria pursuant to Rule 1311(2) of the \nListing Manual of SGX-ST on 3 March 2016. The deadlin e for the Company to meet the financial exit criter ia set out \nin Rule 1314(1) of the Listing Manual (the “Financi al Exit Criteria”) was 3rd March 2017 pursuant to Ru le 1315 of the \nListing Manual.

In [27]:
docs[7].metadata["page"]

7
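Since every loaded page carries its file path and page index in `metadata`, a quick sanity check is to count how many pages came from each PDF. The sketch below uses a hypothetical `FakeDocument` stand-in with the same two attributes so that it runs without LangChain; in the notebook, `pages_per_source(docs)` would be called on the real list instead. The sample paths are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the same attributes as a loaded page."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_per_source(documents):
    """Count loaded pages per PDF, keyed by file name."""
    return Counter(Path(d.metadata["source"]).name for d in documents)


# Made-up sample data, only to show the shape of the input.
sample = [
    FakeDocument("cover", {"source": "/content/my_pdf_dir/a.pdf", "page": 0}),
    FakeDocument("notes", {"source": "/content/my_pdf_dir/a.pdf", "page": 1}),
    FakeDocument("cover", {"source": "/content/my_pdf_dir/b.pdf", "page": 0}),
]
print(pages_per_source(sample))
```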

In [9]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=100)
texts = text_splitter.split_documents(docs)

In [10]:
type(texts), len(texts)

(list, 374)
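To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified fixed-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); the function `sliding_chunks` is a hypothetical illustration of the two parameters only.

```python
def sliding_chunks(text, chunk_size=512, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: real splitters try to respect
    paragraph and sentence boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Each chunk repeats the last character of the previous one (overlap of 1).
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing and embedding some text twice.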

###### Embed text and store embeddings

In [12]:
persist_directory = "./my_pdf_db"
embeddings = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=texts, 
                                 embedding=embeddings,
                                 persist_directory=persist_directory)
vectordb.persist()
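Conceptually, the vector store keeps one embedding vector per chunk and answers a query by ranking chunks by similarity to the query's embedding. The toy sketch below shows that idea with hand-made three-dimensional vectors and cosine similarity. It is not Chroma's implementation, and the texts, vectors, and function names are all made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """Return the texts of the k entries most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up (text, vector) pairs; real embeddings have far more dimensions.
toy_store = [
    ("cash and cash equivalents", [0.9, 0.1, 0.0]),
    ("registered office in Singapore", [0.1, 0.9, 0.0]),
    ("interim dividend", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.0, 0.0], toy_store, k=1))
```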

###### Set up the retrieval QA chain

In [21]:
retriever = vectordb.as_retriever()
llm = ChatOpenAI(model_name='gpt-3.5-turbo')  # or 'gpt-4'
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
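The `"stuff"` chain type places all retrieved chunks into a single prompt and sends that to the model in one call. A rough sketch of the idea follows; the prompt wording and the helper `build_stuff_prompt` are hypothetical and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate retrieved chunks into one prompt, in retrieval order.

    Hypothetical illustration of the "stuff" strategy, not LangChain's template.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up chunks standing in for what the retriever would return.
prompt = build_stuff_prompt(
    "Where is the company located?",
    ["Chunk one text.", "Chunk two text."],
)
print(prompt)
```

Because everything goes into one prompt, this approach is simple and needs a single model call, but the combined chunks must fit within the model's context window.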

In [29]:
# Simple interactive loop: type "exit" to stop.
while True:
    user_input = input("Enter a query: ")
    if user_input == "exit":
        break

    query = f"###Prompt {user_input}"
    try:
        # The chain returns a dict; the answer text is under "result".
        llm_response = qa(query)
        print(llm_response["result"])
        print("\n")
    except Exception as err:
        print("Exception occurred. Please try again:", str(err))

Enter a query: Refer to Dragon Group International financial report, what is the reported "Cash and cash equivalent" amount as of 30 September 2022?
I'm sorry, I cannot find the reported "Cash and cash equivalent" amount as of 30 September 2022 in the given context. The condensed consolidated balance sheet available on page 20 provides some financial information, but it does not include this specific figure.


Enter a query: Refer to the Dragon Group financial report, where is the company located?
According to the financial report, Dragon Group International Limited is domiciled and incorporated in Singapore.


Enter a query: Has the company HS Optimus Holdings Limited proposed any interim dividend for the periods ending on September 30, 2022, and September 30, 2021, based on the given context
According to the given context, there is no proposal or declaration of any dividend by HS Optimus Holdings Limited for the six months ended 30 September 2022 or the corresponding period of the im

KeyboardInterrupt: ignored

In [None]:
# Questions
"""
Refer to Dragon Group International financial report, what is the reported "Cash and cash equivalent" amount as of 30 September 2022?
Refer to the Dragon Group financial report, where is the company located?
Has the company HS Optimus Holdings Limited proposed any interim dividend for the periods ending on September 30, 2022, and September 30, 2021, based on the given context?

Refer to the HS Optimus Holdings Limited report, what is the reported net asset value per share as of 30 September 2022?
Refer to the HS Optimus Holdings Limited report, what is the reported net asset value per share as of 31 Mar 2022?
What was the net book value of the assets disposed of by the Group during the six months ended September 30, 2022?
"""