<a href="https://colab.research.google.com/github/Kokit0/PDF-Extraction-and-Querying-using-LangChain-and-OpenAI-Embeddings/blob/main/Kokkes_pdf_query_langchain_v1.0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## PDF Query Using LangChain

## Install necessary packages for our PDF querying adventure

In [None]:
!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken

Collecting langchain
  Downloading langchain-0.0.323-py3-none-any.whl (1.9 MB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.1-py3-none-any.whl (27 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langsmith<0.1.0,>=0.0.43 (from langchain)
  Downloading langsmith-0.0.51-py3-none-any.whl (41 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain)
  Downloading marshmallow-3.20.1-py3-none-any.whl (49 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langch

## Import the heroic libraries that power our PDF detective tools

In [None]:
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
import os

In [None]:
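# Equip your trusty sidekick with the right keys (paste your own; never share a notebook with real keys in it)
os.environ["OPENAI_API_KEY"] = ""
os.environ["SERPAPI_API_KEY"] = ""  # not used by the cells in this notebook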
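
Running the later cells with an empty key only fails once the first API call is made. A minimal sketch of failing fast instead, assuming a hypothetical helper `require_key` (not part of LangChain or OpenAI) that checks a mapping such as `os.environ`:

```python
import os
from typing import Mapping


def require_key(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the value stored under `name`, or raise if it is missing or blank."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it before running the notebook")
    return value


# A stand-in mapping keeps the example runnable without a real key.
fake_env = {"OPENAI_API_KEY": "sk-example"}
print(require_key("OPENAI_API_KEY", fake_env))
```

In the notebook itself you would call `require_key("OPENAI_API_KEY")` right after setting the environment variable.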

In [None]:
# Prepare to decode the hidden messages in your PDF documents
pdfreader = PdfReader('budget_speech.pdf')

In [None]:
# Unleash your digital detective to extract clues from the PDF pages
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        raw_text += content
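
Concatenating pages directly can fuse the last word of one page with the first word of the next (the `2023CONTENTS` in the sample output of `raw_text` looks like such a join). A small sketch of an alternative that puts a newline between pages; `join_pages` and `FakePage` are hypothetical names used only for illustration, and the function accepts anything with an `extract_text()` method:

```python
from typing import Iterable, Protocol


class PageLike(Protocol):
    def extract_text(self) -> str: ...


def join_pages(pages: Iterable[PageLike], sep: str = "\n") -> str:
    """Concatenate the text of every page, skipping pages with no extractable text."""
    parts = []
    for page in pages:
        content = page.extract_text()
        if content:
            parts.append(content)
    return sep.join(parts)


class FakePage:
    """Stand-in for a PDF page so the sketch runs without a PDF file."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


print(join_pages([FakePage("February 1, 2023"), FakePage(""), FakePage("CONTENTS")]))
```

With the real reader this would be `raw_text = join_pages(pdfreader.pages)`.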

### Let's see a sample

In [None]:
raw_text

"GOVERNMENT OF INDIA\nBUDGET 2023-2024\nSPEECH\nOF\nNIRMALA SITHARAMAN\nMINISTER OF FINANCE\nFebruary 1,  2023CONTENTS \nPART-A \n Page No.  \n\uf0b7 Introduction 1 \n\uf0b7 Achievements since 2014: Leaving no one behind 2 \n\uf0b7 Vision for Amrit Kaal  – an empowered and inclusive economy 3 \n\uf0b7 Priorities of this Budget 5 \ni. Inclusive Development  \nii. Reaching the Last Mile \niii. Infrastructure and Investment \niv. Unleashing the Potential \nv. Green Growth \nvi. Youth Power  \nvii. Financial Sector  \n \n \n \n \n \n \n \n \n\uf0b7 Fiscal Management 24 \nPART B  \n  \nIndirect Taxes  27 \n\uf0b7 Green Mobility  \n\uf0b7 Electronics   \n\uf0b7 Electrical   \n\uf0b7 Chemicals and Petrochemicals   \n\uf0b7 Marine products  \n\uf0b7 Lab Grown Diamonds  \n\uf0b7 Precious Metals  \n\uf0b7 Metals  \n\uf0b7 Compounded Rubber  \n\uf0b7 Cigarettes  \n  \nDirect Taxes  30 \n\uf0b7 MSMEs and Professionals   \n\uf0b7 Cooperation  \n\uf0b7 Start-Ups  \n\uf0b7 Appeals  \n\uf0b7 Better ta

In [None]:
# Split the text with CharacterTextSplitter so that each chunk stays small enough for the model's token limit
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,       # maximum chunk length, measured by length_function (characters here)
    chunk_overlap=200,    # characters shared between neighbouring chunks
    length_function=len,
)
texts = text_splitter.split_text(raw_text)

In [None]:
len(texts)

149
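
To get a feel for what `chunk_size` and `chunk_overlap` do, here is a simplified stand-in for the splitter. It is a sketch under our own assumptions, not LangChain's actual implementation: `split_with_overlap` is a hypothetical function that greedily packs lines into chunks and repeats the tail of each chunk at the start of the next.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Pack separator-delimited pieces into chunks of at most chunk_size characters,
    carrying up to chunk_overlap characters of trailing pieces into the next chunk.
    A single piece longer than chunk_size becomes its own oversized chunk."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def length(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the carried-over tail fits the
            # overlap budget and still leaves room for the incoming piece.
            while current and (
                length(current) > chunk_overlap
                or length(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "\n".join(f"line {i}" for i in range(10))
for chunk in split_with_overlap(sample, chunk_size=30, chunk_overlap=10):
    print(repr(chunk))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.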

In [None]:
# Borrow superpowers from OpenAI: an embeddings client for our digital companion (it reads OPENAI_API_KEY from the environment)
embeddings = OpenAIEmbeddings()

In [None]:
# Create a digital magnifying glass to search for textual clues
document_search = FAISS.from_texts(texts, embeddings)
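
Under the hood, the vector store keeps one embedding per chunk and answers a query by finding the chunks whose vectors lie closest to the query's vector. A brute-force sketch of that idea, assuming Euclidean (L2) distance, which to our knowledge is what the default setup uses; `nearest_texts` and the toy 2-D vectors are made up for illustration:

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_texts(query_vec, vectors, texts, k=4):
    """Return the k texts whose vectors are closest to query_vec."""
    ranked = sorted(range(len(texts)), key=lambda i: l2_distance(query_vec, vectors[i]))
    return [texts[i] for i in ranked[:k]]


# Toy 2-D "embeddings"; real OpenAI embeddings have far more dimensions.
toy_texts = ["agriculture credit", "green growth", "direct taxes"]
toy_vectors = [(0.9, 0.1), (0.5, 0.5), (0.0, 1.0)]
print(nearest_texts((1.0, 0.0), toy_vectors, toy_texts, k=2))
# → ['agriculture credit', 'green growth']
```

FAISS does the same job with optimized index structures, which matters once there are many more than 149 chunks.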

In [None]:
document_search

<langchain.vectorstores.faiss.FAISS at 0x7f7abd1445b0>

### Load the powerful question-answering chain to solve the mysteries

In [None]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

In [None]:
chain = load_qa_chain(OpenAI(), chain_type="stuff")
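
The `"stuff"` chain type simply stuffs every retrieved chunk into a single prompt and sends it to the model in one call. A rough sketch of that assembly step; the function name `build_stuff_prompt` and the prompt wording are our own assumptions, not LangChain's actual template:

```python
def build_stuff_prompt(chunks, question):
    """Concatenate all retrieved chunks into one prompt (the 'stuff' strategy)."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


toy_prompt = build_stuff_prompt(["Chunk one.", "Chunk two."], "What is in the chunks?")
print(toy_prompt)
```

This is the simplest strategy and works as long as the chunks together fit in the model's context window, which is why the chunk size chosen earlier matters.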

## Time to ask questions
* Now that the embeddings and the chain are ready, let's ask our virtual detective to find some information. We'll start with a topic we know appears in the PDF we're testing.
 * Query: "Vision for Amrit Kaal"

In [None]:
query = "Vision for Amrit Kaal"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)

' Our vision for the Amrit Kaal includes technology-driven and knowledge-based economy with strong public finances, and a robust financial sector. To achieve this, Jan Bhagidari through Sabka Saath Sabka Prayas is essential. The economic agenda for achieving this vision focuses on three things: first, facilitating ample opportunities for citizens, especially the youth, to fulfil their aspirations; second, providing strong impetus to growth and job creation; and third, strengthening macro-economic stability.'

* Now, let's embark on another quest to uncover details about agriculture targets

In [None]:
query = "How much the agriculture target will be increased to and what the focus will be"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)

' The agriculture credit target will be increased to ` 20 lakh crore with focus on animal husbandry, dairy and fisheries.'

Amazing!
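
Both quests follow the same two steps: retrieve the most similar chunks, then hand them to the chain. A small sketch that wraps the steps in one function; `ask`, `FakeStore`, and `FakeChain` are hypothetical names, and the fakes only exist so the example runs without an API key:

```python
def ask(question, store, chain, k=4):
    """Retrieve the k most similar chunks from `store`, then let `chain` answer from them."""
    docs = store.similarity_search(question, k=k)
    return chain.run(input_documents=docs, question=question)


class FakeStore:
    """Stand-in for the vector store."""

    def similarity_search(self, query, k=4):
        return [f"chunk about {query}"][:k]


class FakeChain:
    """Stand-in for the question-answering chain."""

    def run(self, input_documents, question):
        return f"{len(input_documents)} chunk(s) used for: {question}"


print(ask("Vision for Amrit Kaal", FakeStore(), FakeChain()))
# → 1 chunk(s) used for: Vision for Amrit Kaal
```

In the notebook the call would be `ask(query, document_search, chain)`.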

## Fetching from the web
* Expand your horizons beyond local PDFs – fetch documents from the web
* Let's look at one of the most groundbreaking papers of recent years, which shows what a game-changer attention is for Gen AI in general.

In [None]:
from langchain.document_loaders import OnlinePDFLoader

In [None]:
loader = OnlinePDFLoader("https://arxiv.org/pdf/1706.03762.pdf")

* Equip yourself with additional packages to handle unstructured data

In [None]:
!pip install unstructured

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting unstructured
  Downloading unstructured-0.7.5-py3-none-any.whl (1.3 MB)
Collecting argilla (from unstructured)
  Downloading argilla-1.9.0-py3-none-any.whl (2.5 MB)
Collecting filetype (from unstructured)
  Downloading filetype-1.2.0-py2.py3-none-any.whl (19 kB)
Collecting msg-parser (from unstructured)
  Downloading msg_parser-1.2.0-py2.py3-none-any.whl (101 kB)
Collecting pdf2image (from unstructured)
  Downloading pdf2image-1.16.3-py3-none-any.whl (11 kB)
Collecting pdfminer.six (from unstructured)
  Downloading pdfm

In [None]:
data = loader.load()

Let's have a look inside

In [None]:
data

[Document(page_content='A WEAK (k, k)-LEFSCHETZ THEOREM FOR PROJECTIVE TORIC ORBIFOLDS\n\nWilliam D. Montoya\n\n3 2 0 2\n\nInstituto de Matem´atica, Estat´ıstica e Computa¸c˜ao Cient´ıﬁca, Universidade Estadual de Campinas (UNICAMP),\n\nb e F 7\n\nRua S´ergio Buarque de Holanda 651, 13083-859, Campinas, SP, Brazil\n\n]\n\nFebruary 9, 2023\n\nG A . h t a m\n\nAbstract\n\nFirstly we show a generalization of the (1, 1)-Lefschetz theorem for projective toric orbifolds and secondly we prove that on 2k-dimensional quasi-smooth hyper- surfaces coming from quasi-smooth intersection surfaces, under the Cayley trick, every rational (k, k)-cohomology class is algebraic, i.e., the Hodge conjecture holds on them.\n\n[\n\n1 v 3 0 8 3 0 . 2 0 3 2 : v i X r a\n\n1\n\nIntroduction\n\nIn [3] we proved that, under suitable conditions, on a very general codimension s quasi- smooth intersection subvariety X in a projective toric orbifold Pd Σ with d + s = 2(k + 1) the Hodge conjecture holds, that is, every

## Fetch more digital magnifying glasses from OpenAI, then index and query the online PDF

In [None]:
# Set up the OpenAI embeddings client again (embeddings are computed through the API, not downloaded)
embeddings = OpenAIEmbeddings()

* Install Chroma, the vector store that `VectorstoreIndexCreator` uses by default

In [None]:
!pip install chromadb

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting chromadb
  Downloading chromadb-0.3.26-py3-none-any.whl (123 kB)
Collecting requests>=2.28 (from chromadb)
  Downloading requests-2.31.0-py3-none-any.whl (62 kB)
Collecting hnswlib>=0.7 (from chromadb)
  Downloading hnswlib-0.7.0.tar.gz (33 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting clickhouse-connect>=0.5.7 (from chromadb)
  Downloading clickhouse_connect-0.6.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (965 kB)

* Create an index to search for information in the online PDF

In [None]:
from langchain.indexes import VectorstoreIndexCreator
index = VectorstoreIndexCreator().from_loaders([loader])

* Now, let's ask away about "Attention" from that publication by running a text query on the index.

In [None]:
query = "Explain me about Attention is all you need"
index.query(query)

' Attention is All You Need is a paper published in 2017 by researchers from Google Brain. The paper introduces the Transformer, a model architecture that relies entirely on an attention mechanism to draw global dependencies between input and output, instead of using recurrence. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs. Additionally, self-attention could yield more interpretable models.'
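
`VectorstoreIndexCreator` bundles the steps we did by hand for the budget speech: load, split, embed, store, and then retrieve at query time. A toy version of the embed-store-retrieve part, assuming word counts as a stand-in for real embeddings and cosine similarity for ranking; `ToyIndex`, `embed`, and `cosine` are hypothetical names, and no language model is involved:

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lower-cased word counts (a stand-in for OpenAI embeddings)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyIndex:
    """Embed chunks up front, then return the best-matching chunks for a query."""

    def __init__(self, chunks):
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


toy_index = ToyIndex([
    "the transformer relies entirely on attention",
    "recurrent networks process tokens one at a time",
])
print(toy_index.retrieve("what does the transformer rely on"))
# → ['the transformer relies entirely on attention']
```

The real index goes one step further and passes the retrieved chunks to a language model, which writes the final answer.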

# **There you go!**

Now our tool is ready for whatever you want to ask!

**Enjoy!** ~Kokke

