# PDF Parsing Libraries

In [1]:
from thefuzz import fuzz
from pathlib import Path
from PyPDF2 import PdfReader
from pdfminer.high_level import extract_text
import pymupdf

In [2]:
pdf_fold = Path.cwd() / "Test PDFs"
library_output_fold = Path.cwd() / "Test PDF's Library Extraction"

## PyPDF2

In [3]:
PdfReader?

Init signature:
PdfReader(
    stream: Union[str, IO, pathlib._local.Path],
    strict: bool = False,
    password: Union[NoneType, str, bytes] = None,
) -> None
Docstring:
Initialize a PdfReader object.

This operation can take some time, as the PDF stream's cross-reference
tables are read into memory.

:param stream: A File object or an object that supports the standard read
    and seek methods similar to a File object. Could also be a
    string representing a path t

### PDF 1

In [4]:
PDF1 = pdf_fold / 'PDF1.pdf'

In [5]:
# PdfReader accepts a path directly, so no separate open() call is needed
reader = PdfReader(PDF1)
page = reader.pages[0]
print(page.extract_text())


Hi,
my
name
is
Prateek.
This
is
PDF
1
that
will
be
my
initial
test
that
the
first
Python
Library
will
extract.
Hope
this
works.
The
point
of
this
test
is
to
see
what
Python
PDF
Parsing
library
will
for
our
use
case.
That’s
it.


#### Observations:
1. By default, each word occupies its own line.
2. After this first test, my initial impression is that this output would be very hard to use for our use case.
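
Since observation 1 is about the one-word-per-line output, a small post-processing step might make the text usable again. The sketch below assumes every newline in the extracted string is spurious, so it would also erase genuine paragraph breaks; `collapse_whitespace` is a hypothetical helper, not part of PyPDF2.

```python
def collapse_whitespace(text: str) -> str:
    """Join whitespace-separated tokens with single spaces."""
    # str.split() with no arguments splits on any run of whitespace,
    # including newlines, and ignores leading/trailing whitespace.
    return " ".join(text.split())


# Sample string that mimics the one-word-per-line output above
sample = "Hi,\nmy\nname\nis\nPrateek.\n"
print(collapse_whitespace(sample))
```

In the notebook this could be applied as `collapse_whitespace(page.extract_text())`.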

---

## PyMuPDF

### PDF 1

In [13]:
doc = pymupdf.open(PDF1)
with open(library_output_fold / "pdf1_pymupdf_output.txt", "wb") as out:
    for page in doc:  # iterate over the document pages
        text = page.get_text().encode("utf8")  # plain text, encoded as UTF-8 bytes
        out.write(text)  # write the text of the page
        out.write(bytes((12,)))  # write a page delimiter (form feed, 0x0C)

#### Observations:
1. In this example, the extracted text is written to a new 'txt' file (in this case, 'pdf1_pymupdf_output.txt').
2. The extraction process seems much simpler with PyPDF2.
3. Compared to PyPDF2, extraction worked more effectively: the output file kept the formatting of the text in my original PDF.
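
The form feed written after each page makes it possible to recover per-page text from the output file later. This is a minimal sketch, assuming the file was written as in the cell above (UTF-8 text with one 0x0C byte after every page) and that the page text itself contains no form feeds; `split_pages` is a hypothetical helper, not a PyMuPDF function.

```python
def split_pages(data: bytes) -> list[str]:
    """Split form-feed-delimited bytes into one string per page."""
    pages = data.decode("utf8").split("\f")
    # A delimiter follows every page, so the final piece is an empty string.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


# Sample bytes that mimic a two-page output file
sample = "first page\n\x0csecond page\n\x0c".encode("utf8")
for number, page_text in enumerate(split_pages(sample), start=1):
    print(number, repr(page_text))
```

With the real output, the input would be `(library_output_fold / "pdf1_pymupdf_output.txt").read_bytes()`.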

---

## pdfminer.six

### PDF 1

In [18]:
text = extract_text(PDF1, codec='utf-8')
# extract_text returns a str, so the file is opened in text mode; writing it to a
# file opened with 'wb' raises "TypeError: a bytes-like object is required, not 'str'"
with open(library_output_fold / 'pdf1_pdfminer_output.txt', 'w', encoding='utf-8') as out:
    out.write(text)

#### Observations:
1. Much simpler than the last two; very intuitive.
2. Like PyMuPDF, it keeps the original formatting of the PDF. Unlike the PyMuPDF example, which saves the text to a separate file, pdfminer's extract_text returns the text as a string that can be shown with a print statement.
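
To put a number on how closely two libraries agree, the extracted strings could be compared after normalizing whitespace. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in; `thefuzz` is imported in the first cell but not used in the cells shown, and its scores are not assumed to match these. The helpers `normalize` and `similarity` are hypothetical names.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse all runs of whitespace so layout differences are ignored."""
    return " ".join(text.split())


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Samples mimicking a word-per-line extraction and a layout-preserving one
word_per_line = "Hi,\nmy\nname\nis\nPrateek.\n"
layout_kept = "Hi, my name is Prateek.\n"
print(similarity(word_per_line, layout_kept))
```

Because whitespace is normalized first, this measures agreement on content only; it says nothing about which library preserved the layout better.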

---