#### Using PyPDF
##### Load a PDF with pypdf into a list of documents, where each document contains the page content and metadata that includes the page number.

In [None]:
#!pip install pypdf

In [1]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("documents/updated_cv.pdf")
pages = loader.load_and_split()

Could not import azure.core python package.


##### An advantage of this approach is that documents can be retrieved with page numbers.

In [2]:
pages[0]

Document(page_content='DEEP AK JAIS WAL\nNear sitla mandir, H.E. School Road, Vistipara, Hirapur, Dhanbad,\nJharkhandsj.deepak.jaiswal@gmail.com\n9304161106\nDOB 01/10/1997\nin\nhttps://www.linkedin.com/in/deepak-\njaiswal-34b0b3174\nObjective Seeking an entry-level position to begin my career in a high-level professional\nenvironment.\nEducation\nSkills c++\nDigital Electronics\nEmbedded and Robotics\nJavascript\nReact.Js\nNode.Js\nProjects\nHobbies\nPersonal\nStrengthsUniversity College of engineering and technology\nB.Tech (Electronics and communication engineering)\n2019 — 7.6\nIndian school of Learning\nIntermediate\n2015 — 82%\nIndian school of Learning\nMatriculation\n2013 — 8 CGPA\nLine following land rover\nWhen robot is placed on the ﬁxed path,it follows the path b y detecting the\nline. The robot direction of motion depends on the two sensors outputs.\nWhen the two sensors are on the line of path, robot moves forward. If the left\nsensor moves awa y from the line, robot move
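
Because each document carries its page number in `metadata`, a retrieved chunk can be turned into a short citation. The block below is a minimal sketch, not LangChain code: `Doc` is a hypothetical stand-in for the `Document` class, `cite` is an illustrative helper, and the sample text is shortened. It assumes the `page` value is 0-based, as the outputs in this notebook suggest.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (assumption, not the real class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def cite(doc, width=40):
    """Build a one-line citation with a human-friendly (1-based) page number."""
    page = doc.metadata["page"] + 1  # assumes the loader stores 0-based pages
    snippet = " ".join(doc.page_content.split())[:width]  # collapse newlines and tabs
    return f"{doc.metadata['source']}, p. {page}: {snippet}"


pages = [
    Doc(
        "Objective\nSeeking an entry-level position",
        {"source": "documents/updated_cv.pdf", "page": 0},
    )
]
print(cite(pages[0]))
```

With real loader output, the same helper should work unchanged as long as the documents expose `page_content` and a `metadata` dict with `source` and `page` keys.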

##### We want to use OpenAIEmbeddings, so we first need to provide the OpenAI API key.

In [2]:
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')

OpenAI API Key:········


In [4]:
pip install faiss-cpu


Collecting faiss-cpu
  Downloading faiss_cpu-1.7.4-cp310-cp310-win_amd64.whl (10.8 MB)
     ---------------------------------------- 10.8/10.8 MB 3.6 MB/s eta 0:00:00
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.7.4
Note: you may need to restart the kernel to use updated packages.


In [5]:
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings

faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())
docs = faiss_index.similarity_search("who is deepak?", k=2)
for doc in docs:
    print(str(doc.metadata["page"]) + ":", doc.page_content[:300])

0: DEEP AK JAIS WAL
Near sitla mandir, H.E. School Road, Vistipara, Hirapur, Dhanbad,
Jharkhandsj.deepak.jaiswal@gmail.com
9304161106
DOB 01/10/1997
in
https://www.linkedin.com/in/deepak-
jaiswal-34b0b3174
Objective Seeking an entry-level position to begin my career in a high-level professional
environ
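
To make the `similarity_search` call less of a black box, here is a toy sketch of a top-k nearest-neighbour lookup in plain Python. It is not how FAISS is implemented. It assumes Euclidean (L2) distance, which to my knowledge is the default for the LangChain FAISS wrapper, and it uses made-up 2-D vectors in place of real OpenAI embeddings. The name `top_k` is illustrative.

```python
import math


def top_k(query, vectors, k=2):
    """Return the indices of the k vectors closest to `query` by Euclidean distance."""
    distances = [(math.dist(query, vec), idx) for idx, vec in enumerate(vectors)]
    return [idx for _, idx in sorted(distances)[:k]]


# Toy "embeddings": one small vector per document chunk (made-up values).
toy_vectors = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.3)]
toy_query = (1.0, 0.0)

nearest = top_k(toy_query, toy_vectors, k=2)
print(nearest)
```

A real index avoids computing every distance for large collections, but the ranking idea is the same: embed the query, then return the chunks whose vectors lie closest to it.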


#### Using Unstructured

In [7]:
from langchain.document_loaders import UnstructuredPDFLoader

In [14]:
loader = UnstructuredPDFLoader("documents/updated_cv.pdf")

In [15]:
data = loader.load()

In [19]:
loader = UnstructuredPDFLoader("documents/Pride-and-Prejudice.pdf",  mode="elements")

In [20]:
data = loader.load()

In [22]:
data[1]

Document(page_content='This\teBook\tis\tfor\tthe\tuse\tof\tanyone\tanywhere\tat\tno\tcost\tand\twith almost\tno\trestrictions\twhatsoever.\t\tYou\tmay\tcopy\tit,\tgive\tit\taway\tor re', metadata={'source': 'documents/Pride-and-Prejudice.pdf', 'filename': 'Pride-and-Prejudice.pdf', 'file_directory': 'documents', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'NarrativeText'})
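
In `mode="elements"` each element carries a `category` and a `page_number` in its metadata, which makes filtering straightforward. The following is a minimal sketch on made-up sample data: plain dicts stand in for `Document` objects, `select` is a hypothetical helper, and the `Title` element is invented for illustration.

```python
from collections import Counter

# Made-up sample standing in for the output of loader.load() in "elements" mode.
elements = [
    {"page_content": "Pride and Prejudice",
     "metadata": {"page_number": 1, "category": "Title"}},
    {"page_content": "This eBook is for the use of anyone anywhere.",
     "metadata": {"page_number": 1, "category": "NarrativeText"}},
    {"page_content": "It is a truth universally acknowledged.",
     "metadata": {"page_number": 2, "category": "NarrativeText"}},
]


def select(elements, category, page_number=None):
    """Keep elements of one category, optionally restricted to a single page."""
    return [
        e for e in elements
        if e["metadata"]["category"] == category
        and (page_number is None or e["metadata"]["page_number"] == page_number)
    ]


category_counts = Counter(e["metadata"]["category"] for e in elements)
narrative_page_1 = select(elements, "NarrativeText", page_number=1)
print(category_counts, len(narrative_page_1))
```

With real `Document` objects the lookups would use attribute access (`e.metadata["category"]`) instead of dict keys.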

#### Fetching remote PDFs using Unstructured
##### This covers how to load online PDFs into a document format that we can use downstream.

##### Note: all other PDF loaders can also be used to fetch remote PDFs, but OnlinePDFLoader is a legacy function and works specifically with UnstructuredPDFLoader.



In [3]:
from langchain.document_loaders import OnlinePDFLoader

In [None]:
loader = OnlinePDFLoader("https://arxiv.org/pdf/2302.03803.pdf")

In [12]:
data = loader.load()

PermissionError: [Errno 13] Permission denied: 'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\tmpe91r5gta'
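
One possible workaround for the `PermissionError` above is to download the file yourself and pass a local path to a regular loader. This is a sketch under an assumption: that the error comes from a temporary file still being held open on Windows, which I have not verified for this run. The helper names `save_pdf_bytes` and `download_pdf` are hypothetical.

```python
import tempfile
import urllib.request
from pathlib import Path


def save_pdf_bytes(data: bytes, directory: str, name: str = "remote.pdf") -> Path:
    """Write the bytes to a regular file; it is closed again when write_bytes returns."""
    path = Path(directory) / name
    path.write_bytes(data)
    return path


def download_pdf(url: str, directory: str) -> Path:
    """Fetch a remote PDF and store it in `directory` (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return save_pdf_bytes(response.read(), directory)


# Intended usage (not executed here, since it needs network access and langchain):
# with tempfile.TemporaryDirectory() as tmp:
#     path = download_pdf("https://arxiv.org/pdf/2302.03803.pdf", tmp)
#     data = UnstructuredPDFLoader(str(path)).load()
```

Writing into a directory you control, rather than into an open temporary file handle, keeps the file closed before the loader opens it.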

In [None]:
print(data)

#### Using PyPDFium2

In [28]:
!pip install pypdfium2

Collecting pypdfium2
  Downloading pypdfium2-4.13.0-py3-none-win_amd64.whl (2.7 MB)
     ---------------------------------------- 2.7/2.7 MB 3.3 MB/s eta 0:00:00
Installing collected packages: pypdfium2
Successfully installed pypdfium2-4.13.0


In [26]:
from langchain.document_loaders import PyPDFium2Loader

In [29]:
loader = PyPDFium2Loader("documents/updated_cv.pdf")

In [30]:
data = loader.load()

In [31]:
data[0]

Document(page_content='DEEPAK JAISWAL\r\nNear sitla mandir, H.E. School Road, Vistipara, Hirapur, Dhanbad,\r\nJharkhand\r\nsj.deepak.jaiswal@gmail.com\r\n9304161106\r\nDOB 01/10/1997\r\nin\r\nhttps://www.linkedin.com/in/deepak\ufffejaiswal-34b0b3174\r\nObjective Seeking an entry-level position to begin my career in a high-level professional\r\nenvironment.\r\nEducation\r\nSkills c++\r\nDigital Electronics\r\nEmbedded and Robotics\r\nJavascript\r\nReact.Js\r\nNode.Js\r\nProjects\r\nHobbies\r\nPersonal\r\nStrengths\r\nUniversity College of engineering and technology\r\nB.Tech (Electronics and communication engineering)\r\n2019 — 7.6\r\nIndian school of Learning\r\nIntermediate\r\n2015 — 82%\r\nIndian school of Learning\r\nMatriculation\r\n2013 — 8 CGPA\r\nLine following land rover\r\nWhen robot is placed on the fixed path,it follows the path by detecting the\r\nline. The robot direction of motion depends on the two sensors outputs.\r\nWhen the two sensors are on the line of path, robot m
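
The loaders in this notebook differ in small ways: `\r\n` versus `\n` line endings, the `ﬁ` ligature versus plain `fi`, tab-separated words, and a stray `\ufffe` where other loaders show a hyphen. A small normalization step makes their outputs easier to compare. The sketch below uses only the standard library. `normalize_text` is a hypothetical helper, and mapping `\ufffe` to `-` is an assumption based on comparing this output with the other loaders' output for the same file.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Normalize extracted PDF text so outputs from different loaders are comparable."""
    text = unicodedata.normalize("NFKC", text)  # expands ligatures such as U+FB01 to "fi"
    text = text.replace("\r\n", "\n")           # unify Windows line endings
    text = text.replace("\ufffe", "-")          # assumption: stands in for a line-break hyphen
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces and tabs
    return text


print(normalize_text("ﬁxed\r\npath\tby"))
print(normalize_text("deepak\ufffejaiswal"))
```

Applying it to `page_content` right after loading is one option; whether to normalize before or after embedding depends on the use case.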

#### Using PDFMiner

In [32]:
from langchain.document_loaders import PDFMinerLoader

In [33]:
loader = PDFMinerLoader("documents/updated_cv.pdf")

In [34]:
data = loader.load()

In [35]:
data[0]

Document(page_content='sj.deepak.jaiswal@gmail.com\n9304161106\nDOB 01/10/1997\nin\nhttps://www.linkedin.com/in/deepak-\njaiswal-34b0b3174\n\nDEEPAK JAISWAL\nNear sitla mandir, H.E. School Road, Vistipara, Hirapur, Dhanbad,\nJharkhand\n\nObjective\n\nEducation\n\nSeeking an entry-level position to begin my career in a high-level professional\nenvironment.\n\nUniversity College of engineering and technology\nB.Tech (Electronics and communication engineering)\n2019 — 7.6\n\nIndian school of Learning\nIntermediate\n2015 — 82%\n\nIndian school of Learning\nMatriculation\n2013 — 8 CGPA\n\nSkills\n\nc++\n\nDigital Electronics\n\nEmbedded and Robotics\n\nJavascript\n\nReact.Js\n\nNode.Js\n\nProjects\n\nLine following land rover\nWhen robot is placed on the ﬁxed path,it follows the path by detecting the\nline. The robot direction of motion depends on the two sensors outputs.\nWhen the two sensors are on the line of path, robot moves forward. If the left\nsensor moves away from the line, robot 

#### Using PDFMiner to generate HTML text
##### This can be helpful for chunking text semantically into sections, as the output HTML can be parsed with BeautifulSoup to get more structured and rich information about font size, page numbers, PDF headers/footers, etc.

In [36]:
from langchain.document_loaders import PDFMinerPDFasHTMLLoader

In [37]:
loader = PDFMinerPDFasHTMLLoader("documents/updated_cv.pdf")

In [38]:
data = loader.load()[0]   # entire pdf is loaded as a single Document

In [39]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(data.page_content,'html.parser')
content = soup.find_all('div')

In [40]:
import re
cur_fs = None
cur_text = ''
snippets = []   # first collect all snippets that have the same font size
for c in content:
    sp = c.find('span')
    if not sp:
        continue
    st = sp.get('style')
    if not st:
        continue
    fs = re.findall(r'font-size:(\d+)px', st)  # raw string avoids an invalid-escape warning for \d
    if not fs:
        continue
    fs = int(fs[0])
    if not cur_fs:
        cur_fs = fs
    if fs == cur_fs:
        cur_text += c.text
    else:
        snippets.append((cur_text,cur_fs))
        cur_fs = fs
        cur_text = c.text
snippets.append((cur_text,cur_fs))
# Note: The above logic is very straightforward. One can also add more strategies such as removing duplicate snippets (as
# headers/footers in a PDF appear on multiple pages, so if we find duplicates it is safe to assume that they are redundant info)
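
The duplicate-removal strategy mentioned in the note could look roughly like this. It is a minimal sketch on made-up `(text, font_size)` tuples in the same shape as `snippets`. `drop_repeated_snippets` and its `max_repeats` parameter are hypothetical, and the assumption that any repeated text is a header or footer will not hold for every PDF.

```python
from collections import Counter


def drop_repeated_snippets(snippets, max_repeats=1):
    """Remove every snippet whose stripped text occurs more than `max_repeats` times."""
    counts = Counter(text.strip() for text, _ in snippets)
    return [(text, fs) for text, fs in snippets if counts[text.strip()] <= max_repeats]


# Made-up sample: the same page header shows up twice.
sample = [
    ("Title\n", 20),
    ("Page header\n", 8),
    ("Body A\n", 10),
    ("Page header\n", 8),
    ("Body B\n", 10),
]

cleaned = drop_repeated_snippets(sample)
print(cleaned)
```

Note that this drops all copies of a repeated snippet, not just the later ones; keeping the first copy instead is an easy variation if the header text is worth retaining.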

In [41]:
from langchain.docstore.document import Document
cur_idx = -1
semantic_snippets = []
# Assumption: headings have higher font size than their respective content
for s in snippets:
    # if current snippet's font size > previous section's heading => it is a new heading
    if not semantic_snippets or s[1] > semantic_snippets[cur_idx].metadata['heading_font']:
        metadata={'heading':s[0], 'content_font': 0, 'heading_font': s[1]}
        metadata.update(data.metadata)
        semantic_snippets.append(Document(page_content='',metadata=metadata))
        cur_idx += 1
        continue
    
    # if current snippet's font size <= previous section's content => content belongs to the same section (one can also create
    # a tree like structure for sub sections if needed but that may require some more thinking and may be data specific)
    if not semantic_snippets[cur_idx].metadata['content_font'] or s[1] <= semantic_snippets[cur_idx].metadata['content_font']:
        semantic_snippets[cur_idx].page_content += s[0]
        semantic_snippets[cur_idx].metadata['content_font'] = max(s[1], semantic_snippets[cur_idx].metadata['content_font'])
        continue
    
    # if current snippet's font size > previous section's content but less than previous section's heading, then also make a new
    # section (e.g. title of a pdf will have the highest font size but we don't want it to subsume all sections)
    metadata={'heading':s[0], 'content_font': 0, 'heading_font': s[1]}
    metadata.update(data.metadata)
    semantic_snippets.append(Document(page_content='',metadata=metadata))
    cur_idx += 1

In [42]:
semantic_snippets[0]

Document(page_content='', metadata={'heading': 'sj.deepak.jaiswal@gmail.com\n9304161106\nDOB 01/10/1997\nin\nhttps://www.linkedin.com/in/deepak-\njaiswal-34b0b3174\n', 'content_font': 0, 'heading_font': 8, 'source': 'documents/updated_cv.pdf'})

#### Using PyMuPDF
##### This is the fastest of the PDF parsing options. It returns one document per page and includes detailed metadata about the PDF and its pages.

In [43]:
from langchain.document_loaders import PyMuPDFLoader




In [44]:
loader = PyMuPDFLoader("documents/updated_cv.pdf")

In [45]:
data = loader.load()

In [46]:
data[0]

Document(page_content='DEEPAK JAISWAL\nNear sitla mandir, H.E. School Road, Vistipara, Hirapur, Dhanbad,\nJharkhand\nsj.deepak.jaiswal@gmail.com\n9304161106\nDOB 01/10/1997\nin\nhttps://www.linkedin.com/in/deepak-\njaiswal-34b0b3174\nObjective\nSeeking an entry-level position to begin my career in a high-level professional\nenvironment.\nEducation\nSkills\nc++\nDigital Electronics\nEmbedded and Robotics\nJavascript\nReact.Js\nNode.Js\nProjects\nHobbies\nPersonal\nStrengths\nUniversity College of engineering and technology\nB.Tech (Electronics and communication engineering)\n2019 — 7.6\nIndian school of Learning\nIntermediate\n2015 — 82%\nIndian school of Learning\nMatriculation\n2013 — 8 CGPA\nLine following land rover\nWhen robot is placed on the ﬁxed path,it follows the path by detecting the\nline. The robot direction of motion depends on the two sensors outputs.\nWhen the two sensors are on the line of path, robot moves forward. If the left\nsensor moves away from the line, robot mo
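
The speed claim is easy to check on your own files with a small timing helper. The sketch below uses only the standard library. `time_call` is a hypothetical helper, and the loader calls are left as comments because they need langchain and the sample PDF.

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of `fn`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Intended usage (not executed here):
# t_mupdf = time_call(lambda: PyMuPDFLoader("documents/updated_cv.pdf").load())
# t_pypdf = time_call(lambda: PyPDFLoader("documents/updated_cv.pdf").load())

elapsed = time_call(lambda: sum(range(1000)))
print(f"{elapsed:.6f} s")
```

Taking the best of several runs reduces noise from disk caching and other processes; results will still vary by machine and by PDF.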

#### PyPDF Directory
##### Load PDFs from directory

In [47]:
from langchain.document_loaders import PyPDFDirectoryLoader



In [48]:
loader = PyPDFDirectoryLoader("documents/")

In [49]:
docs = loader.load()

In [50]:
docs

[Document(page_content='', metadata={'source': 'documents\\Pride-and-Prejudice.pdf', 'page': 0}),
 Document(page_content='The\tProject\tGutenberg\tEBook\tof\tPride\tand\tPrejudice,\tby\tJane\tAusten\nThis\teBook\tis\tfor\tthe\tuse\tof\tanyone\tanywhere\tat\tno\tcost\tand\twith\nalmost\tno\trestrictions\twhatsoever.\t\tYou\tmay\tcopy\tit,\tgive\tit\taway\tor\nre-use\tit\tunder\tthe\tterms\tof\tthe\tProject\tGutenberg\tLicense\tincluded\nwith\tthis\teBook\tor\tonline\tat\twww.gutenberg.org\nTitle:\tPride\tand\tPrejudice\nAuthor:\tJane\tAusten\nRelease\tDate:\tAugust\t26,\t2008\t[EBook\t#1342]\nLast\tUpdated:\tNovember\t12,\t2019\nLanguage:\tEnglish\n***\tSTART\tOF\tTHIS\tPROJECT\tGUTENBERG\tEBOOK\tPRIDE\tAND\tPREJUDICE\t***\nProduced\tby\tAnonymous\tVolunteers,\tand\tDavid\tWidger\nTHERE\tIS\tAN\tILLUSTRATED\tEDITION\tOF\tTHIS\tTITLE\tWHICH\tMAY\nVIEWED\tAT\tEBOOK\t\n[#\t42671\t]', metadata={'source': 'documents\\Pride-and-Prejudice.pdf', 'page': 1}),
 Document(page_content='Pride\tand\t
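
Since the directory loader mixes pages from several files into one list, and the first page above has empty content, a quick summary per source file can be useful. This is a minimal sketch on made-up data: plain dicts stand in for `Document` objects and `pages_per_file` is a hypothetical helper. `PureWindowsPath` is used because it accepts both the backslash paths shown above and forward-slash paths.

```python
from collections import Counter
from pathlib import PureWindowsPath


def pages_per_file(docs):
    """Count non-empty pages per source file name."""
    counts = Counter()
    for doc in docs:
        if doc["page_content"].strip():  # skip pages with no extracted text
            counts[PureWindowsPath(doc["metadata"]["source"]).name] += 1
    return dict(counts)


# Made-up sample in the shape of the output above.
sample_docs = [
    {"page_content": "", "metadata": {"source": "documents\\Pride-and-Prejudice.pdf", "page": 0}},
    {"page_content": "The Project Gutenberg EBook",
     "metadata": {"source": "documents\\Pride-and-Prejudice.pdf", "page": 1}},
    {"page_content": "DEEPAK JAISWAL", "metadata": {"source": "documents/updated_cv.pdf", "page": 0}},
]

summary = pages_per_file(sample_docs)
print(summary)
```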

#### Using pdfplumber
##### Like PyMuPDF, pdfplumber returns one document per page, and the output Documents contain detailed metadata about the PDF and its pages.

In [53]:
!pip install pdfplumber

Collecting pdfplumber
  Downloading pdfplumber-0.9.0-py3-none-any.whl (46 kB)
     -------------------------------------- 46.1/46.1 kB 568.9 kB/s eta 0:00:00
Collecting Wand>=0.6.10
  Downloading Wand-0.6.11-py2.py3-none-any.whl (143 kB)
     -------------------------------------- 143.6/143.6 kB 2.2 MB/s eta 0:00:00
Installing collected packages: Wand, pdfplumber
Successfully installed Wand-0.6.11 pdfplumber-0.9.0


In [51]:
from langchain.document_loaders import PDFPlumberLoader

In [54]:
loader = PDFPlumberLoader("documents/updated_cv.pdf")

In [55]:
data = loader.load()

In [56]:
data[0]

Document(page_content='sj.deepak.jaiswal@gmail.com\n9304161106\nDOB 01/10/1997\nin\nhttps://www.linkedin.com/in/deepak-\njaiswal-34b0b3174\nDEEPAK JAISWAL\nNear sitla mandir, H.E. School Road, Vistipara, Hirapur, Dhanbad,\nJharkhand\nObjective Seeking an entry-level position to begin my career in a high-level professional\nenvironment.\nEducation University College of engineering and technology\nB.Tech (Electronics and communication engineering)\n2019 — 7.6\nIndian school of Learning\nIntermediate\n2015 — 82%\nIndian school of Learning\nMatriculation\n2013 — 8 CGPA\nSkills c++\nDigital Electronics\nEmbedded and Robotics\nJavascript\nReact.Js\nNode.Js\nProjects Line following land rover\nWhen robot is placed on the fixed path,it follows the path by detecting the\nline. The robot direction of motion depends on the two sensors outputs.\nWhen the two sensors are on the line of path, robot moves forward. If the left\nsensor moves away from the line, robot moves towards right. Similarly, if\