# 2.13 Converting PDFs to text

In this notebook we cover several libraries for converting a PDF to a text file. PDFs contain text, images, and other *unstructured* data, which can make conversion hard. This notebook has you covered.

### Contents
0. Install packages
1. PyPDF2
2. Unstructured
3. Getting tables out of a PDF with pdfplumber
4. Generating PDFs from webpages

## 0. Install packages

In [None]:
pip install PyPDF2

In [None]:
pip install unstructured

In [14]:
pip install pdfkit

Collecting pdfkit
  Using cached pdfkit-1.0.0-py3-none-any.whl (12 kB)
Installing collected packages: pdfkit
Successfully installed pdfkit-1.0.0
Note: you may need to restart the kernel to use updated packages.


In [1]:
pip install pdfplumber

Note: you may need to restart the kernel to use updated packages.
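
Before moving on, it can help to check which of the packages above are actually importable in the current kernel. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the list holds import names, which are assumed here to match what the rest of this notebook imports.

```python
from importlib.util import find_spec

# Import names (not necessarily the pip names) used in this notebook.
NOTEBOOK_PACKAGES = ["PyPDF2", "unstructured", "pdfkit", "pdfplumber"]


def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if find_spec(name) is None]


print("Missing:", missing_packages(NOTEBOOK_PACKAGES))
```

An empty list means every package was found. Anything listed still needs a `pip install` and possibly a kernel restart.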


## 1. PyPDF2

- pypi: https://pypi.org/project/PyPDF2/
- source: https://github.com/py-pdf/pypdf

In [2]:
#show the pdf's in your current working directory
import glob
my_pdfs = glob.glob('*.pdf')
my_pdfs

['out.pdf',
 'TTF.pdf',
 'Didactisch Coachen-Copy1. Lia Voerman en Frans Faber.pdf',
 'pulp-fiction-1994.pdf']

In [30]:
#Do the magic
from PyPDF2 import PdfReader

reader = PdfReader("pulp-fiction-1994.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0] #select only first page
print(page.extract_text())

PULP FICTION
by
Quentin Tarantino & Roger Avary


In [10]:
number_of_pages

126

In [11]:
#a script to print all pages
reader = PdfReader("pulp-fiction-1994.pdf")
number_of_pages = len(reader.pages)

for i in range(number_of_pages):
    page = reader.pages[i] #select page i
    print(page.extract_text())
    print(50*'-') #print page break

PULP FICTION
by
Quentin Tarantino & Roger Avary
--------------------------------------------------
PULP [pulp] n.
1. A soft, moist, shapeless
mass or matter.
2. A magazine or book containing lurid
subject matter and being characteristically
printed on rough, unfinished paper.
American Heritage Dictionary: New College Edition
INT. COFFEE SHOP – MORNING
A normal Denny's, Spires-like coffee shop in Los Angeles. It's
about 9:00 in the morning. While the place isn't jammed, there's a
healthy number of people drinking coffee, munching on bacon and
eating eggs.
Two of these people are a YOUNG MAN and a YOUNG WOMAN. The Young
Man has a slight working-class English accent and, like his fellow
countryman, smokes cigarettes like they're going out of style.
It is impossible to tell where the Young Woman is from or how old
she is; everything she does contradicts something she did. The boy
and girl sit in a booth. Their dialogue is to be said in a rapid-
pace "HIS GIRL FRIDAY" fashion.
YOUNG MAN
No,

EXT. BOXING AUDITORIUM (RAINING) – NIGHT
The cab WHIPS out of the alley, FISH-TAILING on the wet pavement
in front of the auditorium at a rapid pace.
INT. WILLIS LOCKER ROOM (AUDITORIUM) – NIGHT
Locker room door opens, Enghlish Dave fights his way through the
pandemonium which is going on outside in the hall, shutting the
door on the madness. Once inside, English Dave takes time to
adjust his suit and tie. Mia is standing by the door. She sees
Vincent with English Dave.
VINCENT
Mia. How you doin'?
MIA
Great. I never thanked you for the dinner.
In the room, black boxer FLOYD RAY WILLIS lies on a table – dead.
His face looks like he went dunking for bees. His TRAINER is on
his knees, head on Floyd's chest, crying over the body.
The huge figure that is Marsellus Wallace stands at the table,
hand on the Trainer's shoulder, lending emotional support. We
still do not see Marsellus clearly, only that he is big.
Mia sits in a chair at the far end of the room.
Marsellus looks up, sees English D
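
Printing every page is fine for a quick look, but you often want the text in a single string instead. The sketch below collects page text with a separator between pages. It only assumes that each page object has an `extract_text()` method, as used above; `join_pages` and `FakePage` are hypothetical names, and `FakePage` exists only so the example runs without a PDF.

```python
def join_pages(pages, separator="-" * 50):
    """Concatenate the text of all pages, with a separator line between pages."""
    texts = []
    for page in pages:
        # extract_text() can come back empty for image-only (scanned) pages
        texts.append(page.extract_text() or "")
    return f"\n{separator}\n".join(texts)


class FakePage:
    """Hypothetical stand-in for a PDF page, used only to demo the helper."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


demo = join_pages(
    [FakePage("PULP FICTION"), FakePage(""), FakePage("INT. COFFEE SHOP")],
    separator="---",
)
print(demo)
```

With a real file you would call `join_pages(reader.pages)` on the `reader` created above.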

In [None]:
#THE one script to write one .txt file per .pdf in the current directory
import os
from glob import glob
from PyPDF2 import PdfReader

my_pdfs = glob('*.pdf')

for pdf_path in my_pdfs:
    reader = PdfReader(pdf_path)
    file_name, file_extension = os.path.splitext(pdf_path)

    #the with-block closes the text file even if extraction fails halfway
    with open(file_name + ".txt", "w", encoding="utf-8") as textfile:
        for page in reader.pages:
            textfile.write(page.extract_text())
            textfile.write('\n') #newline after each page
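
File names with extra dots or spaces (like the "Didactisch Coachen" file listed earlier) are a common source of surprises when deriving the output name. As an alternative sketch, `pathlib` swaps only the final extension; the helper name `txt_path_for` is hypothetical.

```python
from pathlib import Path


def txt_path_for(pdf_path):
    """Return the path of the .txt file that belongs to a given .pdf file."""
    # with_suffix only replaces the last extension, so dots elsewhere are kept
    return Path(pdf_path).with_suffix(".txt")


for name in [
    "pulp-fiction-1994.pdf",
    "Didactisch Coachen-Copy1. Lia Voerman en Frans Faber.pdf",
]:
    print(txt_path_for(name))
```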

In [12]:
#check if you have the txt's
import glob
my_txts=glob.glob('*.txt')
my_txts

['out.txt',
 'TTF.txt',
 'demo.txt',
 'XML_file.txt',
 'pulp-fiction-1994.txt',
 'requirements.txt',
 'plantuml_complex2.txt',
 'servers.txt',
 'leegmelden_plantuml.txt',
 'NOAA_data.txt',
 'accounts2.txt',
 'input.txt',
 'untitled.txt',
 'Didactisch Coachen_10p.txt',
 'plantuml.txt',
 'a_file.txt',
 'plantuml_complex.txt']

## 2. Unstructured

Warning: installing this package can be difficult because it has many dependencies. Follow the instructions on PyPI or the unstructured.io homepage closely.

- source: www.unstructured.io
- pypi: https://pypi.org/project/unstructured/


### Basic instruction
PDFs are easy for humans to read but hard for computers to parse. Fortunately, since 2022, there is the unstructured package. It converts PDFs to text and at the same time recovers some structure in that text. With unstructured we can extract:
- Elements
- Titles
- Narrative texts

Please keep in mind that there are many more options; this notebook only covers the basics.
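
To get a feel for what a partitioned document contains, you can count how many elements fall into each category. This is a minimal sketch that relies only on the `.category` attribute used in the cells of this section; `count_categories` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the example runs without the unstructured package installed.

```python
from collections import Counter
from types import SimpleNamespace


def count_categories(elements):
    """Count how many elements fall into each category (Title, NarrativeText, ...)."""
    return Counter(elem.category for elem in elements)


# Hypothetical stand-ins for the elements returned by partition_pdf
fake_elements = [
    SimpleNamespace(category="Title", text="PULP FICTION"),
    SimpleNamespace(category="NarrativeText", text="A normal Denny's, Spires-like coffee shop in Los Angeles."),
    SimpleNamespace(category="Title", text="YOUNG MAN"),
]

print(count_categories(fake_elements).most_common())
```

On real data you would pass the `elements` list from `partition_pdf` instead of `fake_elements`.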

In [1]:
#show all pdfs
import glob
my_pdfs = glob.glob('*.pdf')
my_pdfs

['background-checks.pdf', 'pulp-fiction-1994.pdf', 'Didactisch Coachen.pdf']

In [2]:
# Warning: this might take some time
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(filename="pulp-fiction-1994.pdf")

In [3]:
#print the elements
for elem in elements[:100]:
    print(elem)

PULP FICTION
by
Quentin Tarantino & Roger Avary
PULP [pulp] n.
1. A soft, moist, shapeless mass or matter.
2. A magazine or book containing lurid subject matter and being characteristically printed on rough, unfinished paper.
American Heritage Dictionary: New College Edition
INT. COFFEE SHOP — MORNING
A normal Denny's, Spires-like coffee shop in Los Angeles. It's about 9:00 in the morning. While the place isn't jammed, there's a healthy number of people drinking coffee, munching on bacon and eating eggs.
Two of these people are a YOUNG MAN and a YOUNG WOMAN. The Young Man has a slight working-class English accent and, like his fellow countryman, smokes cigarettes like they're going out of style.
It is impossible to tell where the Young Woman is from or how old
she is; everything she does contradicts something she did. The boy and girl sit in a booth. Their dialogue is to be said in a rapid-
pace "HIS GIRL FRIDAY" fashion.
YOUNG MAN No, forget it, it's too risky. I'm through doin' that 

In [4]:
#print the titles
titles = [elem for elem in elements if elem.category == "Title"]

for title in titles:
    print(title.text)

PULP FICTION
by
Quentin Tarantino & Roger Avary
PULP [pulp] n.
1. A soft, moist, shapeless mass or matter.
American Heritage Dictionary: New College Edition
INT. COFFEE SHOP — MORNING
pace "HIS GIRL FRIDAY" fashion.
YOUNG WOMAN
After tonight.
YOUNG WOMAN
Oh yes, thank you.
YOUNG MAN
YOUNG WOMAN
YOUNG WOMAN
YOUNG WOMAN And no more liquor stores?
YOUNG MAN
Not this life.
YOUNG WOMAN
Well what then?
YOUNG MAN
Garcon! Coffee!
YOUNG MAN
This place.
WAITRESS
(snotty)
YOUNG MAN
YOUNG WOMAN
YOUNG WOMAN
Thanks.
YOUNG WOMAN
YOUNG WOMAN
A lot of wallets.
YOUNG MAN
Pretty smart, huh?
YOUNG WOMAN
Pretty smart.
YOUNG WOMAN
Got it.
YOUNG WOMAN
YOUNG MAN I love you, Honey Bunny.
INT. '74 CHEVY (MOVING) — MORNING
JULES
JULES
What?
JULES
Examples?
JULES
VINCENT
Royale with Cheese.
JULES
JULES
JULES
What?
VINCENT
Mayonnaise.
JULES
Goddamn!
VINCENT
JULES
Uuccch!
INT. CHEVY (TRUNK) — MORNING
VINCENT
How many up there?
JULES
Three or four.
VINCENT
JULES
JULES
EXT. APARTMENT BUILDING COURTYARD — MORNING
We T
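
As the output shows, in a screenplay the "Title" category mixes scene headings with speaker names and short lines of dialogue. If you only want the scene headings, one option is to filter on the `INT.` and `EXT.` prefixes. This is a sketch that assumes the screenplay uses those prefixes consistently; `scene_headings` is a hypothetical helper and the sample list is a shortened, hand-typed version of the titles above.

```python
SCENE_PREFIXES = ("INT.", "EXT.")


def scene_headings(title_texts):
    """Keep only the titles that look like screenplay scene headings."""
    return [text for text in title_texts if text.strip().startswith(SCENE_PREFIXES)]


# Shortened sample of title texts, typed by hand for the demo
sample_titles = [
    "PULP FICTION",
    "YOUNG WOMAN",
    "INT. COFFEE SHOP",
    "JULES",
    "EXT. APARTMENT BUILDING COURTYARD",
]

print(scene_headings(sample_titles))
```

With the real data you would call `scene_headings(title.text for title in titles)`.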

In [5]:
#get texts
import textwrap

narrative_texts = [elem for elem in elements if elem.category == "NarrativeText"]

for index, elem in enumerate(narrative_texts[:5]):
    print(f"Narrative text {index + 1}:")
    print("\n".join(textwrap.wrap(elem.text, width=70)))
    print("-" * 50)

Narrative text 1:
2. A magazine or book containing lurid subject matter and being
characteristically printed on rough, unfinished paper.
--------------------------------------------------
Narrative text 2:
A normal Denny's, Spires-like coffee shop in Los Angeles. It's about
9:00 in the morning. While the place isn't jammed, there's a healthy
number of people drinking coffee, munching on bacon and eating eggs.
--------------------------------------------------
Narrative text 3:
Two of these people are a YOUNG MAN and a YOUNG WOMAN. The Young Man
has a slight working-class English accent and, like his fellow
countryman, smokes cigarettes like they're going out of style.
--------------------------------------------------
Narrative text 4:
It is impossible to tell where the Young Woman is from or how old
--------------------------------------------------
Narrative text 5:
she is; everything she does contradicts something she did. The boy and
girl sit in a booth. Their dialogue is to be sai

## 3. Getting tables out of pdf with pdfplumber

- source: https://github.com/jsvine/pdfplumber
- pypi: https://pypi.org/project/pdfplumber/0.1.2/

In [27]:
#get the example
!curl "https://raw.githubusercontent.com/jsvine/pdfplumber/stable/examples/pdfs/background-checks.pdf" > background-checks.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 90468  100 90468    0     0   197k      0 --:--:-- --:--:-- --:--:--  197k


In [28]:
#import packages
import pdfplumber
import pandas as pd

In [17]:
#show all pdf's
import glob
my_pdfs =glob.glob('*.pdf')
my_pdfs

['background-checks.pdf', 'pulp-fiction-1994.pdf', 'Didactisch Coachen.pdf']

In [20]:
#do the pdfplumber magic
import pdfplumber

with pdfplumber.open('background-checks.pdf') as pdf:
    first_page = pdf.pages[0]
    print(first_page.chars[0])

{'matrix': (6.96, 0.0, 0.0, 6.96, 47.04, 534.7), 'fontname': 'DCLTEC+Helvetica-Bold', 'adv': 0.667, 'upright': True, 'x0': 47.04, 'y0': 533.0992, 'x1': 51.68232, 'y1': 540.0592, 'width': 4.642319999999998, 'height': 6.960000000000036, 'size': 6.960000000000036, 'object_type': 'char', 'page_number': 1, 'text': 'S', 'stroking_color': None, 'non_stroking_color': None, 'top': 71.94079999999997, 'bottom': 78.9008, 'doctop': 71.94079999999997}
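
Each entry in `chars` is a dict describing one character, including its text and position (`x0`, `top`). To illustrate how those coordinates relate to readable text, here is a rough sketch that groups characters into lines. It is deliberately crude (fixed-size buckets on `top`) and only meant to show the idea; in practice the page's `extract_text()` method is the usual route. `chars_to_lines` and the sample dicts are hypothetical.

```python
from collections import defaultdict


def chars_to_lines(chars, tolerance=2):
    """Group char dicts into text lines by bucketing their 'top' coordinate."""
    lines = defaultdict(list)
    for char in chars:
        # Characters whose 'top' falls in the same bucket are treated as one line
        lines[round(char["top"] / tolerance)].append(char)
    return [
        "".join(c["text"] for c in sorted(group, key=lambda c: c["x0"]))
        for _, group in sorted(lines.items())
    ]


# Hand-made sample with only the keys the helper needs, deliberately out of order
sample_chars = [
    {"text": "t", "x0": 51.68, "top": 71.94},
    {"text": "A", "x0": 47.04, "top": 90.0},
    {"text": "S", "x0": 47.04, "top": 71.94},
]

print(chars_to_lines(sample_chars))
```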


In [29]:
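#extract the table from the file (reopen the pdf: the earlier with-block closed it)
with pdfplumber.open('background-checks.pdf') as pdf:
    table = pdf.pages[0].extract_table()
print(table[:2]) #print first part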

[['NICS Firearm Background Checks\nNovember - 2015', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None], ['State / Territory', 'Permit Handgun Long Gun *Other **Multiple Admin', None, None, None, None, None, 'Pre-Pawn\nHandgun Long Gun *Other', None, None, 'Redemption\nHandgun Long Gun *Other', None, None, 'Returned/Disposition\nHandgun Long Gun *Other', None, None, 'Rentals\nHandgun Long Gun', None, 'Private Sale\nHandgun Long Gun *Other', None, None, 'Return to Seller - Private Sale\nHandgun Long Gun *Other', None, None, 'Totals']]


In [26]:
#load it into a dataframe and print the first five rows
df = pd.DataFrame(table[1:], columns=table[0])
df.head(5)

Unnamed: 0,NICS Firearm Background Checks\nNovember - 2015,None,None.1,None.2,None.3,None.4,None.5,None.6,None.7,None.8,...,None.9,None.10,None.11,None.12,None.13,None.14,None.15,None.16,None.17,None.18
0,State / Territory,Permit Handgun Long Gun *Other **Multiple Admin,,,,,,Pre-Pawn\nHandgun Long Gun *Other,,,...,,Rentals\nHandgun Long Gun,,Private Sale\nHandgun Long Gun *Other,,,Return to Seller - Private Sale\nHandgun Long ...,,,Totals
1,Alabama\nAlaska\nArizona\nArkansas\nCalifornia,"18,870\n209\n2,303\n3,298\n98452","23,022\n3,062\n12,382\n6,359\n41181","22,650\n3,209\n9,041\n11,611\n35007",859\n191\n707\n168\n4559,"1,178\n184\n618\n376\n0",0\n0\n0\n0\n0,14\n9\n5\n12\n0,15\n3\n3\n6\n0,0\n0\n0\n1\n0,...,0\n1\n1\n0\n0,,,13\n0\n9\n6\n0,14\n0\n6\n12\n0,0\n0\n1\n1\n0,3\n0\n1\n0\n0,2\n0\n1\n0\n0,0\n0\n0\n0\n0,"71,137\n7,095\n27,087\n25,048\n180116"
2,Colorado\nConnecticut\nDelaware\nDistrict of C...,"4,144\n9,631\n204\n8\n15,907","19,784\n11,594\n2,152\n54\n50,796","16,082\n5,072\n2,424\n2\n28,981","932\n134\n65\n0\n2,268","1,151\n0\n72\n0\n1,957",0\n7\n0\n0\n121,0\n0\n3\n0\n8,0\n0\n4\n0\n9,0\n0\n0\n0\n0,...,0\n0\n0\n0\n0,,,0\n0\n59\n0\n36,0\n0\n24\n0\n19,0\n0\n0\n0\n0,0\n0\n4\n0\n0,0\n0\n0\n0\n0,0\n0\n0\n0\n0,"42,271\n26,438\n5,040\n64\n103,532"
3,Georgia\nGuam\nHawaii\nIdaho\nIllinois,"14,111\n0\n1,248\n1,944\n87,190","16,635\n100\n0\n3,609\n24,412","15,227\n55\n0\n5,227\n17,227",448\n12\n0\n190\n0,"758\n3\n0\n189\n1,032",0\n0\n0\n0\n0,10\n0\n0\n0\n0,14\n0\n0\n4\n0,0\n0\n0\n0\n0,...,0\n0\n0\n1\n0,,,10\n0\n0\n1\n0,9\n0\n0\n2\n0,1\n0\n0\n0\n0,0\n0\n0\n0\n0,0\n0\n0\n3\n0,0\n0\n0\n0\n0,"50,795\n171\n1,252\n11,925\n129,861"
4,Indiana\nIowa\nKansas\nKentucky\nLouisiana,"81,935\n8,785\n894\n264,140\n1,945","25,519\n267\n7,086\n12,155\n14,708","20,227\n4,596\n8,702\n14,847\n17,368","1,113\n27\n311\n254\n697",808\n4\n396\n648\n793,0\n37\n3\n1\n0,2\n0\n1\n9\n5,3\n1\n3\n11\n11,0\n0\n1\n0\n2,...,0\n0\n0\n0\n0,,,31\n0\n6\n6\n1,13\n0\n10\n8\n10,1\n0\n0\n0\n1,2\n0\n0\n0\n0,4\n0\n0\n0\n1,0\n0\n0\n0\n0,"130,333\n13,794\n18,433\n295,891\n37,752"
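
The dataframe above is not tidy yet: each cell packs several states (or several numbers) together, separated by `\n`. One way to repair this is to split every cell on the newline and turn each packed row into several rows. The sketch below does that in plain Python; `explode_row` is a hypothetical helper, and the sample row is a shortened copy of the Alabama/Alaska/Arizona row.

```python
def explode_row(row):
    """Split a row whose cells hold newline-separated values into one row per value."""
    columns = [cell.split("\n") if cell else [] for cell in row]
    height = max((len(col) for col in columns), default=0)
    # Pad short or empty columns with None so every output row has the same width
    padded = [col + [None] * (height - len(col)) for col in columns]
    return [list(values) for values in zip(*padded)]


# Shortened version of one packed row from the table above
packed = ["Alabama\nAlaska\nArizona", "18,870\n209\n2,303", None]

for row in explode_row(packed):
    print(row)
```

Applied to every data row of `table`, this should give one row per state, which you can then load into a dataframe with your own column names.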


## 4. Generating PDFs from webpages
Please note that websites are often protected against this kind of automated conversion.

In [6]:
#works properly with google.com
import pdfkit
pdfkit.from_url('http://google.com', 'out.pdf')

True

In [10]:
#see the new .pdf
from glob import glob
my_pdfs = glob('*.pdf')
my_pdfs

['nunl.pdf',
 'out.pdf',
 'background-checks.pdf',
 'pulp-fiction-1994.pdf',
 'Didactisch Coachen.pdf']

In [13]:
#the nu.nl example results in an empty pdf: wkhtmltopdf exits with an error (see output)
import pdfkit
pdfkit.from_url('http://www.nu.nl', 'nunl.pdf')

OSError: wkhtmltopdf exited with non-zero code -11. error:
Unknown Error
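
Because a failed conversion raises an `OSError` (as in the output above), it can be handy to wrap the call so that a loop over many URLs does not stop at the first failure. The following is a minimal sketch; `url_to_pdf` and its `convert` parameter are hypothetical, and `convert` exists only so the wrapper can be tried without pdfkit or wkhtmltopdf installed. Note that pdfkit relies on the external `wkhtmltopdf` program, so the last line checks whether it is on your PATH.

```python
import shutil


def url_to_pdf(url, out_path, convert=None):
    """Try to render url to out_path; return True on success, False on failure."""
    if convert is None:
        # Only import pdfkit when no converter is passed in (keeps the helper testable)
        import pdfkit
        convert = pdfkit.from_url
    try:
        convert(url, out_path)
        return True
    except OSError as error:
        # pdfkit raises OSError when wkhtmltopdf is missing or exits with a non-zero code
        print(f"Could not convert {url}: {error}")
        return False


print("wkhtmltopdf on PATH:", shutil.which("wkhtmltopdf") is not None)
```

Usage would look like `url_to_pdf('http://google.com', 'out.pdf')`; a return value of `False` tells you to skip that site or inspect the printed error.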