# PDF Parsing Libraries

In [1]:
from thefuzz import fuzz
from pathlib import Path
from pypdf import PdfReader
from pdfminer.high_level import extract_text
import pymupdf

In [2]:
PDF_FILES_PATH = Path.cwd() / '../../data/raw'
LIBRARY_OUTPUT = Path.cwd() / '../../data/text extractions'
MANUAL_PDF = Path.cwd() / '../../data/raw/ground truth'

## pypdf

### PDF 1

#### Text Extraction

In [4]:
# Extract the text of the first page of PDF1 with pypdf and save it to a file.
reader = PdfReader(PDF_FILES_PATH / 'PDF1.pdf')
page = reader.pages[0]  # first page only
with open(LIBRARY_OUTPUT / "pdf1_pypdf_output.txt", "w") as out:
    out.write(page.extract_text())  # write text of page

In [6]:
# Same extraction as above, keeping the text in memory for inspection.
reader = PdfReader(PDF_FILES_PATH / 'PDF1.pdf')
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()

Observations:
1. The extracted text largely keeps the PDF's general format (paragraphs).
2. In the PDF, a blank line separates two of the lines; pypdf did not reproduce it in its output.
3. Extracting text takes several steps: create a reader, select a page, then call extract_text() on it.
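
The cell above writes only the first page. As a minimal sketch of how the per-page steps could be wrapped into one call, the helper below joins the text of every page. `join_page_text`, `SupportsExtractText`, and `FakePage` are hypothetical names introduced here, not part of pypdf; the helper only assumes that each page object has an `extract_text()` method, as pypdf's page objects do.

```python
from typing import Iterable, Protocol


class SupportsExtractText(Protocol):
    """Anything with an extract_text() method, e.g. a pypdf page object."""

    def extract_text(self) -> str: ...


def join_page_text(pages: Iterable[SupportsExtractText], sep: str = "\n") -> str:
    """Concatenate the text of every page, separated by `sep`."""
    # `or ""` guards against a page that yields no text.
    return sep.join(page.extract_text() or "" for page in pages)


class FakePage:
    """Hypothetical stand-in for a PDF page so the sketch runs without a PDF."""

    def __init__(self, text: str) -> None:
        self._text = text

    def extract_text(self) -> str:
        return self._text


demo_text = join_page_text([FakePage("page one"), FakePage("page two")])
print(demo_text)
```

With pypdf, the call would presumably look like `join_page_text(PdfReader(PDF_FILES_PATH / 'PDF1.pdf').pages)`.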

#### Similarity Ratio

In [8]:
with open(MANUAL_PDF / 'pdf1_manual.txt') as orig_file:
    original_txt = orig_file.read()
with open(LIBRARY_OUTPUT / 'pdf1_pypdf_output.txt') as lib_file:
    lib_text = lib_file.read()

In [9]:
fuzz.ratio(original_txt, lib_text)          # Similarity ratio for pypdf - out of 100.

90

This library is fairly accurate, scoring 90 out of 100 against the manual transcription. However, it is relatively complicated to use: it takes a lot of code just to extract text from a PDF.
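
The score is an integer from 0 to 100, where 100 means the two strings are identical. To illustrate the idea without thefuzz installed, here is a rough stand-in built on the standard library's `difflib`. The function name `similarity_percent` is hypothetical, and because `difflib` uses a different matching algorithm, its numbers are not guaranteed to match those of `fuzz.ratio`.

```python
from difflib import SequenceMatcher


def similarity_percent(a: str, b: str) -> int:
    """Return a 0-100 similarity score for two strings (difflib-based stand-in)."""
    # autojunk=False disables difflib's heuristic that ignores very frequent
    # characters in long inputs, which would distort scores for full documents.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return round(100 * matcher.ratio())


print(similarity_percent("abcd", "abcd"))  # identical strings
print(similarity_percent("abcd", "abcf"))  # one character differs
print(similarity_percent("abcd", "wxyz"))  # nothing in common
```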

---

## PyMuPDF

### PDF 1

#### Text Extraction

In [17]:
doc = pymupdf.open(PDF_FILES_PATH / 'PDF1.pdf')
with open(LIBRARY_OUTPUT / "pdf1_pymupdf_output.txt", "w") as out:
    for page in doc:  # iterate over the document pages
        text = page.get_text(sort=True)  # plain text, sorted into reading order
        out.write(text)  # write the text of this page

Observations:
1. Extracting text is simpler than with pypdf.
2. Same problem as pypdf: the blank line between the two lines was not reproduced.

#### Similarity Ratio

In [12]:
with open(MANUAL_PDF / 'pdf1_manual.txt') as orig_file:
    original_txt = orig_file.read()
with open(LIBRARY_OUTPUT / 'pdf1_pymupdf_output.txt') as lib_file:
    lib_text = lib_file.read()

In [13]:
fuzz.ratio(original_txt, lib_text)

97

---

## pdfminer.six

### PDF 1

#### Text Extraction

In [14]:
text = extract_text(PDF_FILES_PATH / 'PDF1.pdf', codec='utf-8')
with open(LIBRARY_OUTPUT / 'pdf1_pdfminer_output.txt', 'w') as out:
    out.write(text)

Observations:
1. Much simpler than the previous two libraries and very intuitive: a single extract_text() call returns the text.
2. Unlike the other two, this library registered the blank line between the two lines and reproduced it in the generated txt file.
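
The blank-line observation can also be checked programmatically rather than by eye. The following is a minimal sketch: `blank_line_count` is a hypothetical helper, and the two sample strings are made up for illustration rather than taken from the PDFs.

```python
def blank_line_count(text: str) -> int:
    """Count lines that are empty or contain only whitespace."""
    return sum(1 for line in text.splitlines() if not line.strip())


# Made-up samples: one keeps the blank line between the two lines, one drops it.
with_gap = "first line\n\nsecond line\n"
without_gap = "first line\nsecond line\n"

print(blank_line_count(with_gap), blank_line_count(without_gap))
```

Comparing `blank_line_count(original_txt)` with `blank_line_count(lib_text)` for each library would show which outputs kept the spacing of the manual transcription.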

#### Similarity Ratio

In [15]:
with open(MANUAL_PDF / 'pdf1_manual.txt') as orig_file:
    original_txt = orig_file.read()
with open(LIBRARY_OUTPUT / 'pdf1_pdfminer_output.txt') as lib_file:
    lib_text = lib_file.read()

In [16]:
fuzz.ratio(original_txt, lib_text)          # Similarity ratio for pdfminer.six - out of 100.

97
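
With a score per library, the comparison can be collected in one place. The sketch below ranks several extractions against a reference text. `rank_outputs` and `difflib_percent` are hypothetical names, the default scorer is a `difflib` stand-in rather than `fuzz.ratio` (which could be passed in as `scorer` instead), and the sample strings are invented, not the notebook's data.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple


def difflib_percent(a: str, b: str) -> int:
    """0-100 similarity score using difflib (a stand-in scorer)."""
    return round(100 * SequenceMatcher(None, a, b, autojunk=False).ratio())


def rank_outputs(
    reference: str,
    outputs: Dict[str, str],
    scorer: Callable[[str, str], int] = difflib_percent,
) -> List[Tuple[str, int]]:
    """Score each output against the reference and sort best-first."""
    scores = [(name, scorer(reference, text)) for name, text in outputs.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Invented sample data, for illustration only.
sample_ranking = rank_outputs(
    "hello world",
    {"garbled": "xyz", "exact": "hello world", "close": "hello wrld"},
)
for name, score in sample_ranking:
    print(f"{name}: {score}")
```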

---