## Implementing LLM Querying over a PDF File with LangChain
Good tutorial at: https://www.youtube.com/watch?v=ZzgUqFtxgXI

### Navigating

In [3]:
cd /media/pranshumaan/TOSHIBA\ EXT/Dev/Project_Finmin

/media/pranshumaan/TOSHIBA EXT/Dev/Project_Finmin


### Passing secrets

In [4]:
import os
with open('../Secrets/open_ai_api_key.txt', 'r') as file:
    key = file.read().rstrip()

os.environ['OPENAI_API_KEY'] = key
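
A slightly more defensive variant of the cell above, as a sketch only: it prefers a key already present in the environment and falls back to the file otherwise. The helper name `load_api_key` and its parameters are hypothetical, not part of any library.

```python
import os
from pathlib import Path


def load_api_key(env_var="OPENAI_API_KEY", key_file=None):
    """Return the API key from the environment, else from key_file (hypothetical helper)."""
    key = os.environ.get(env_var, "").strip()
    if not key and key_file is not None:
        # strip() removes the trailing newline most editors add to the file
        key = Path(key_file).read_text().strip()
    if not key:
        raise RuntimeError(f"No API key found in ${env_var} or in {key_file}")
    return key
```

Usage would then be `os.environ['OPENAI_API_KEY'] = load_api_key(key_file='../Secrets/open_ai_api_key.txt')`.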

### Architecture

![architecture](attachment:architecture)

### Importing dependencies

In [5]:
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

### Reading in the PDF

In [6]:
doc_reader = PdfReader("/media/pranshumaan/TOSHIBA EXT/Dev/Project_Finmin/Budget_Speeches_PDF/bs199192.pdf")

In [7]:
# Extract the text of every page and concatenate it into raw_text
raw_text = ''
for page in doc_reader.pages:
    text = page.extract_text()
    if text:  # skip pages with no extractable text
        raw_text += text

In [8]:
raw_text[:1000]

'1\nBudget 1991-92\nSpeech of\nShri Manmohan Singh\nMinister of Finance\n24th July, 1991\nPART A\nSir,\nI rise to present the budget for 1991-92. As I rise, I am overpowered by a\nstrange feeling of loneliness. I miss a handsome, smiling, face listening intently tothe Budget Speech. Shri Rajiv Gandhi is no more. But his dream lives on; his\ndream of ushering India into the twenty-first century; his dream of a strong, united,\ntechnologically sophisticated but humane India. I dedicate this budget to his inspiringmemory.\n2 . The new Government, which assumed office barely a month ago, inherited\nan economy in deep crisis. The balance of payments situation is precarious.\nInternational confidence in our economy was strong until November 1989 when\nour Party was in office. However, due to the combined impact of political instabilitywitnessed thereafter, the accentuation of fiscal imbalances and the Gulf crisis,\nthere was a great weakening of international confidence. There has been a sha

In [9]:
len(raw_text)

110663

### Text splitter

In [10]:
# Split the text into smaller, overlapping chunks for indexing
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,  # consecutive chunks share up to 200 characters
    length_function=len,
)
texts = text_splitter.split_text(raw_text)

Created a chunk of size 1445, which is longer than the specified 1000
Created a chunk of size 1344, which is longer than the specified 1000
Created a chunk of size 1054, which is longer than the specified 1000
Created a chunk of size 1093, which is longer than the specified 1000
Created a chunk of size 1101, which is longer than the specified 1000
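
The warnings above most likely come from the fact that the splitter only cuts at the separator: a run of text between two newlines that is longer than `chunk_size` cannot be broken up and is emitted whole. The sketch below is a simplified pure-Python imitation of separator-based chunking with overlap, not LangChain's actual code; the function name `split_with_overlap` is made up for illustration.

```python
def split_with_overlap(text, separator="\n", chunk_size=1000, chunk_overlap=200):
    """Greedy separator-based chunking with overlap (simplified illustration)."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it
            chunks.append(separator.join(current))
            # Keep only trailing pieces that fit in the overlap budget
            # and still leave room for the incoming piece
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Toy input: three short lines and one line longer than chunk_size
demo_text = "\n".join(["a" * 40, "b" * 40, "c" * 40, "d" * 150])
demo_chunks = split_with_overlap(demo_text, chunk_size=100, chunk_overlap=50)
for chunk in demo_chunks:
    print(len(chunk), repr(chunk[:12]))
```

In the toy input, the `b` line appears in both of the first two chunks (the overlap), and the 150-character `d` line becomes a single chunk that exceeds `chunk_size`, which mirrors the oversized chunks reported above.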


In [11]:
len(texts)

140
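
As a rough sanity check on the chunk count, assume every chunk is exactly `chunk_size` long and overlaps the previous one by exactly `chunk_overlap`. The real splitter cuts at newlines, so chunk lengths vary and only approximate agreement with the observed 140 should be expected.

```python
import math

total_chars = 110_663          # len(raw_text) from the cell above
chunk_size, chunk_overlap = 1000, 200

# Each new chunk advances through the text by this many characters
stride = chunk_size - chunk_overlap

# Idealised estimate: fixed-size chunks, fixed overlap
estimated_chunks = math.ceil((total_chars - chunk_overlap) / stride)
print(estimated_chunks)
```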

In [12]:
texts[10]

'investment, to ensure that India’s financial sector is rapidly modernised, and toimprove the performance of the public sector, so that the key sectors of our economy\nare enabled to attain an adequate technological and competitive edge in a fast\nchanging global economy. I am confident that, after a successful implementation of\nstabilisation measures and the essential structural and policy reforms, our economy\nwould return to a path of a high sustained growth with reasonable price stabilityand greater social equity.\n10. Thanks to the efforts of Pandit Jawaharlal Nehru, Indira Gandhi and Rajiv\nGandhi, we have developed a well diversified industrial structure. This constitutesa great asset as we begin to implement various structural reforms. However, barriers\nto entry and limits on growth in the size of firms, have often led to a proliferation\nof licensing and an increase in the degree of monopoly. This has put shackles on'

In [13]:
type(texts[0])

str

### Making the embeddings 

In [14]:
embeddings = OpenAIEmbeddings()

In [15]:
docsearch = FAISS.from_texts(texts, embeddings)
# This is the vector store

In [16]:
docsearch.embedding_function

<bound method OpenAIEmbeddings.embed_query of OpenAIEmbeddings(client=<class 'openai.api_resources.embedding.Embedding'>, model='text-embedding-ada-002', deployment='text-embedding-ada-002', openai_api_version='', openai_api_base='', openai_api_type='', openai_proxy='', embedding_ctx_length=8191, openai_api_key='<redacted>', openai_organization='', allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=6, request_timeout=None, headers=None)>

In [17]:
dir(docsearch)

['_FAISS__add',
 '_FAISS__from',
 '__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_normalize_L2',
 '_similarity_search_with_relevance_scores',
 'aadd_documents',
 'aadd_texts',
 'add_documents',
 'add_embeddings',
 'add_texts',
 'afrom_documents',
 'afrom_texts',
 'amax_marginal_relevance_search',
 'amax_marginal_relevance_search_by_vector',
 'as_retriever',
 'asearch',
 'asimilarity_search',
 'asimilarity_search_by_vector',
 'asimilarity_search_with_relevance_scores',
 'docstore',
 'embedding_function',
 'from_documents',
 'from_embeddings',
 'from_texts',
 'index',
 'index_to_docstore_id',
 'load_local',
 'max_margi

In [19]:
query = "What is the situation of India's foreign exchange reserves?"
docs = docsearch.similarity_search(query)

In [20]:
len(docs)  # similarity_search returns k=4 documents by default

4
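
Conceptually, the similarity search embeds the query and returns the chunks whose embedding vectors lie closest to it. The sketch below shows that idea as a brute-force nearest-neighbour search in plain Python. It assumes Euclidean distance (which I believe is what a default flat FAISS index uses) and made-up 2-dimensional vectors; real embeddings from `text-embedding-ada-002` have far more dimensions.

```python
import math


def top_k(query_vec, doc_vecs, k=4):
    """Return indices of the k vectors closest to query_vec (Euclidean distance)."""
    distances = [(math.dist(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    return [i for _, i in sorted(distances)[:k]]


# Toy "embeddings", invented purely for illustration
toy_vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [-1.0, 0.0]]
nearest = top_k([1.0, 0.0], toy_vectors, k=2)
print(nearest)  # → [1, 2]
```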

In [21]:
type(docs[0])

langchain.schema.Document

In [22]:
dir(docs[0].page_content)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',


In [23]:
type(docs[0].page_content)

str

In [24]:
docs[0].page_content.replace("\n", "")

'there was a great weakening of international confidence. There has been a sharpdecline in capital inflows through commercial borrowing and non-resident deposits.As a result, despite large borrowings from the International Monetary Fund in July1990 and January 1991, there was a sharp reduction in our foreign exchangereserves. We have been at the edge of a precipice since December 1990 and moreso since April 1991. The foreign exchange crisis constitutes a serious threat to thesustainability of growth processes and orderly implementation of our developmentprogrammes. Due to the combination of unfavourable internal and external factors,the inflationary pressures on the price level have increased very substantially sincemid-1990. The people of India have to face double digit inflation which hurts mostthe poorer sections of our society. In sum, the crisis in the economy is both acuteand deep. We have not experienced anything similar in the history of independentIndia.'

### Creating a function to pretty print

In [25]:
def printify(docs_item):
    print(docs_item.page_content.replace("\n", ""))

In [26]:
printify(docs[0])

there was a great weakening of international confidence. There has been a sharpdecline in capital inflows through commercial borrowing and non-resident deposits.As a result, despite large borrowings from the International Monetary Fund in July1990 and January 1991, there was a sharp reduction in our foreign exchangereserves. We have been at the edge of a precipice since December 1990 and moreso since April 1991. The foreign exchange crisis constitutes a serious threat to thesustainability of growth processes and orderly implementation of our developmentprogrammes. Due to the combination of unfavourable internal and external factors,the inflationary pressures on the price level have increased very substantially sincemid-1990. The people of India have to face double digit inflation which hurts mostthe poorer sections of our society. In sum, the crisis in the economy is both acuteand deep. We have not experienced anything similar in the history of independentIndia.
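
One caveat with `printify`: replacing `"\n"` with an empty string fuses the word before a line break with the word after it (for example, `a\nstrange` in `raw_text` would print as `astrange`). A possible variant, sketched below under the hypothetical name `flatten_whitespace`, collapses every run of whitespace into a single space instead. It cannot repair words that were already fused during PDF extraction, such as `tothe` in `raw_text`.

```python
def flatten_whitespace(text: str) -> str:
    """Collapse newlines and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    return " ".join(text.split())


print(flatten_whitespace("I am overpowered by a\nstrange feeling of loneliness."))
```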


**Observation**: The top-ranked chunk is directly relevant to the query "What is the situation of India's foreign exchange reserves?".
Rather than returning the raw paragraph, it would be more useful to have GPT write an answer grounded in it. That is what the QA chains in the following sections do.

### Plain QA Chain

In [27]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

In [28]:
chain = load_qa_chain(OpenAI(), 
                      chain_type="stuff") # we are going to stuff all the docs in at once

In [29]:
# check the prompt
chain.llm_chain.prompt.template

"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"

In [30]:
query = "Who is making this speech?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' Shri Manmohan Singh, Minister of Finance.'

In [31]:
query = "What are some key numbers in this speech?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' The total non-plan expenditure in 1991-92 is estimated to be Rs. 79,697 crores. There is a reduction of 4.9 per cent in non-plan expenditure compared to the revised estimates of 1990-91. The Congress Party committed to reduce non-plan expenditure by 10 per cent in their election manifesto.'
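
To make the "stuff" strategy concrete: all retrieved chunks are concatenated and substituted into the `{context}` slot of the prompt template shown above, and the question goes into `{question}`. The sketch below reproduces that with plain string formatting. The template text is copied from the output above; the `"\n\n"` separator between chunks and the function name `build_stuff_prompt` are assumptions for illustration.

```python
STUFF_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\nQuestion: {question}\nHelpful Answer:"
)


def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Join all chunks into one context block and fill in the template."""
    context = separator.join(chunks)
    return STUFF_TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(["chunk one", "chunk two"], "Who is making this speech?")
print(prompt)
```

Because every chunk lands in a single prompt, this approach only works while the combined chunks fit within the model's context window.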

### RetrievalQA
Returns an answer to the query along with the source chunks that were retrieved as context.

In [32]:
from langchain.chains import RetrievalQA

# set up FAISS as a generic retriever 
retriever = docsearch.as_retriever(search_type="similarity", search_kwargs={"k":4})

# create the chain to answer questions 
rqa = RetrievalQA.from_chain_type(llm=OpenAI(), 
                                  chain_type="stuff", 
                                  retriever=retriever, 
                                  return_source_documents=True)
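
The chain above combines the two earlier steps: retrieve the most relevant chunks, then answer from them, returning both. The sketch below mimics that flow and the shape of the returned dictionary using toy stand-ins. The keyword-overlap retriever and the stub answerer are invented for illustration and do not call FAISS or an LLM.

```python
from typing import Callable, Dict, List


def retrieval_qa(
    query: str,
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, List[str]], str],
) -> Dict[str, object]:
    """Retrieve context for the query, answer from it, and return both."""
    sources = retrieve(query)
    return {
        "query": query,
        "result": answer(query, sources),
        "source_documents": sources,
    }


# Toy corpus and stand-in components (hypothetical)
corpus = [
    "Speech of Shri Manmohan Singh, Minister of Finance",
    "I commend the budget to this august House.",
]


def keyword_retrieve(query: str) -> List[str]:
    """Return chunks sharing at least one lowercase word with the query."""
    words = set(query.lower().split())
    return [c for c in corpus if words & set(c.lower().split())]


def stub_answer(query: str, sources: List[str]) -> str:
    """Placeholder for the LLM call: just reports how much context it got."""
    return f"{len(sources)} chunk(s) retrieved"


toy_response = retrieval_qa("minister of finance", keyword_retrieve, stub_answer)
print(toy_response)
```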

In [33]:
response = rqa("What are some key numbers in this speech?")

In [34]:
response

{'query': 'What are some key numbers in this speech?',
 'result': ' The total non-plan expenditure in 1991-92 is Rs. 79,697 crores. The provision for non-plan expenditure, excluding interest payments, in the current year represents a reduction of 4.9 per cent compared with the provisions in the revised estimates for 1990-91, and a reduction of almost 15 per cent in relation to what we would have had to provide this year.',
 'source_documents': [Document(lc_kwargs={'page_content': '153.Sir, I do not minimise the difficulties that lie ahead on the long and\narduous journey on which we have embarked. But as Victor Hugo once said, “nopower on earth can stop an idea whose time has come.” I suggest to this august\nHouse that the emergence of India as a major economic power in the world happens\nto be one such idea. Let the whole world hear it loud and clear. India is now wideawake. We shall prevail. We shall overcome.\n154.W ith these words, I commend the budget to this august House.\n[24th 

In [35]:
response['source_documents'][0].page_content.replace('\n', "")

'153.Sir, I do not minimise the difficulties that lie ahead on the long andarduous journey on which we have embarked. But as Victor Hugo once said, “nopower on earth can stop an idea whose time has come.” I suggest to this augustHouse that the emergence of India as a major economic power in the world happensto be one such idea. Let the whole world hear it loud and clear. India is now wideawake. We shall prevail. We shall overcome.154.W ith these words, I commend the budget to this august House.[24th July, 1991]'

In [46]:
# list.reverse() reverses in place and returns None, so build a reversed copy instead
reversed_response = list(reversed(response['source_documents']))

### Creating a function to pretty print the query, response and reference text chunks

In [48]:
def responder(query: str):
    response = rqa(query)
    print(response['query'])
    print("\n")
    print(response['result'])
    print("___")
    # Number the source chunks starting from 1
    for index, source_document in enumerate(response['source_documents'], start=1):
        print("\n")
        print(f"Source {index}\n")
        print(source_document.page_content.replace('\n', ""))

In [49]:
responder("What are some key numbers in this speech?")

What are some key numbers in this speech?


 The total non-plan expenditure in 1991-92 is estimated to be Rs. 79,697 crores. The provision for non-plan expenditure, excluding interest payments, is estimated to be a 4.9% reduction compared to the revised estimates for 1990-91 and a 15% reduction compared to what would have been provided this year without corrective measures.
___


Source 1

153.Sir, I do not minimise the difficulties that lie ahead on the long andarduous journey on which we have embarked. But as Victor Hugo once said, “nopower on earth can stop an idea whose time has come.” I suggest to this augustHouse that the emergence of India as a major economic power in the world happensto be one such idea. Let the whole world hear it loud and clear. India is now wideawake. We shall prevail. We shall overcome.154.W ith these words, I commend the budget to this august House.[24th July, 1991]


Source 2

1Budget 1991-92Speech ofShri Manmohan SinghMinister of Finance24th July, 1991PA

### To do

1. Extend to include all speeches
2. Create a front end