# PDF Extraction

This notebook explores how to extract information from PDF files. All PDFs used are patents, taken from [Google Patents](https://patents.google.com/).

Many patents are stored as scanned images inside `.pdf` files rather than as text, so their content cannot be extracted directly. To work with that data, we first need to run optical character recognition (OCR) on the file, using [ocrmypdf](https://ocrmypdf.readthedocs.io/en/latest/).

## Imports

In [14]:
from PyPDF2 import PdfFileReader # Get metadata
from pdfminer.high_level import extract_text # Get full text
import ocrmypdf # Convert image PDFs to text
from glob import glob  # Get files of type in folder
import pandas as pd  # Dataframes

## Open a single PDF

We can use `PdfFileReader()` to extract information directly from a `.pdf` file.

In [2]:
# File name (and path if necessary)

filepath = "patent.pdf"

### Extracting metadata

In [3]:
# Open the file for reading in binary mode (required)

with open(filepath, 'rb') as f:

    # Create the reader object
    pdf = PdfFileReader(f)

    # Extract information from the object (lots of different options)
    pages = pdf.getNumPages()

    print(pages)

1
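
The page count is only one of the fields a reader object exposes. The sketch below gathers a few more. It assumes the newer reader interface (`pages` and `metadata` attributes) that, to my understanding, replaced `getNumPages()` in PyPDF2 3.x and in pypdf; `summarize_pdf` is a hypothetical helper, and a stand-in object is used so that the example runs without a PDF on disk.

```python
from types import SimpleNamespace


def summarize_pdf(reader):
    """Return a few commonly used fields from a PDF reader object.

    Assumes the reader exposes `pages` and `metadata`; this helper is
    hypothetical and not part of PyPDF2 or pypdf.
    """
    info = reader.metadata  # may be None when the file carries no metadata
    return {
        "pages": len(reader.pages),
        "title": getattr(info, "title", None),
        "author": getattr(info, "author", None),
    }


# Stand-in for a real reader, so the sketch is self-contained.
fake_reader = SimpleNamespace(
    pages=[object()],
    metadata=SimpleNamespace(title="Patent title", author=None),
)
print(summarize_pdf(fake_reader))
```

With a real file, the same function would be called on the reader created inside the `with open(...)` block, provided the installed version offers these attributes.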


### Extracting content

This method only works if the information inside the `.pdf` is actually stored as text.

In [4]:
extract_text(filepath)

'Patent title \n\n \nI am a patent and I contain patent information. \n \nSecond paragraph - more like patent-graph \n \nPatent subtitle \n\n \nPatently a patent. \n \n\nPatent heading \n\n \nDon’t get impatent, because I am a patent so that would be inappropriate for this situation. \n\n\x0c'
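
When a file holds only images, text extraction tends to return very little, often just whitespace or page-break characters. A rough way to decide whether a file needs OCR is to count the visible characters per page. This is a minimal sketch: `looks_like_image_pdf` is a hypothetical helper and the threshold of 50 characters per page is an arbitrary assumption.

```python
def looks_like_image_pdf(text, page_count, min_chars_per_page=50):
    """Guess whether a PDF has no usable text layer.

    `text` is whatever extract_text() returned and `page_count` is the
    number of pages. The default threshold is an arbitrary assumption.
    """
    visible = "".join(text.split())  # drop spaces, newlines and form feeds
    return len(visible) < min_chars_per_page * max(page_count, 1)


# A page-break character per page and nothing else: probably scanned images.
print(looks_like_image_pdf("\x0c" * 18, page_count=18))

# A page with ordinary sentences: probably real text.
print(looks_like_image_pdf("I am a patent and I contain patent information. " * 3,
                           page_count=1))
```

The heuristic is deliberately crude; a short cover page with real text could still fall under the threshold.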

## Converting image PDFs to text PDFs

In [5]:
# Get a file containing images rather than text

filepath = './raw/GB2571201B.pdf'

In [6]:
# Convert the file

ocrmypdf.ocr(filepath, 'processed_GB2571201B.pdf')

Scanning contents: 100%|██████████| 18/18 [00:01<00:00, 14.81page/s]
OCR: 100%|██████████| 18.0/18.0 [00:34<00:00,  1.90s/page]
PDF/A conversion: 100%|██████████| 18/18 [00:03<00:00,  4.55page/s]
JPEGs: 0image [00:00, ?image/s]
JBIG2: 100%|██████████| 18/18 [00:00<00:00, 23.52item/s]


<ExitCode.ok: 0>

In [7]:
# Get the processed file

filepath = 'processed_GB2571201B.pdf'

# Extract the processed file's text

text = extract_text(filepath)

In [8]:
# Check it

text[:100]

'JK  Patent \n\n,.GB \n\n2011201 \n\n“yD \n\n(45) Date  of  B  Publication \n\n19.08.2020 \n\n(54)  Title of  the'
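
The OCR output is noisy: doubled spaces, trailing spaces and runs of blank lines. Before any further analysis it may help to normalize the whitespace. The following is a sketch only, using the standard library; `tidy_text` is a hypothetical helper, and it does nothing about misread characters such as `JK` or `“yD`.

```python
import re


def tidy_text(raw):
    """Collapse whitespace noise in extracted text (hypothetical helper)."""
    text = raw.replace("\x0c", "\n")        # treat page breaks as line breaks
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces or tabs become one space
    text = re.sub(r" *\n *", "\n", text)    # trim spaces around line breaks
    text = re.sub(r"\n{2,}", "\n", text)    # drop blank lines
    return text.strip()


sample = "(45) Date  of  B  Publication \n\n19.08.2020 \n\n(54)  Title of  the"
print(tidy_text(sample))
```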

## Processing multiple image PDFs

In [9]:
# Get a list of files in a folder

files = glob('./raw/*.pdf')

for num, filepath in enumerate(files):
    
    print(f'Processing {num + 1} of {len(files)}')
    
    # Process each one
    
    ocrmypdf.ocr(filepath, f'./processed/{filepath.split("/")[-1]}')

Processing 1 of 3


Scanning contents: 100%|██████████| 18/18 [00:01<00:00, 14.77page/s]
OCR: 100%|██████████| 18.0/18.0 [00:27<00:00,  1.55s/page]
PDF/A conversion: 100%|██████████| 18/18 [00:03<00:00,  5.64page/s]
JPEGs: 0image [00:00, ?image/s]
PDF/A conversion: 100%|██████████| 18/18 [00:03<00:00,  5.25page/s]
JBIG2: 100%|██████████| 18/18 [00:00<00:00, 18.18item/s]


Processing 2 of 3


Scanning contents: 100%|██████████| 32/32 [00:01<00:00, 22.60page/s]
OCR:   5%|▍         | 1.5/32.0 [00:07<02:01,  3.98s/page]
[tesseract] lots of diacritics - possibly poor OCR
OCR: 100%|██████████| 32.0/32.0 [00:55<00:00,  1.75s/page]
PDF/A conversion: 100%|██████████| 32/32 [00:05<00:00,  7.20page/s]
JPEGs: 0image [00:00, ?image/s]
PDF/A conversion: 100%|██████████| 32/32 [00:05<00:00,  5.88page/s]
JBIG2: 100%|██████████| 32/32 [00:01<00:00, 25.85item/s]


Processing 3 of 3


Scanning contents: 100%|██████████| 25/25 [00:01<00:00, 19.85page/s]
OCR:  86%|████████▌ | 21.5/25.0 [00:30<00:02,  1.23page/s]
[tesseract] lots of diacritics - possibly poor OCR
OCR: 100%|██████████| 25.0/25.0 [00:31<00:00,  1.28s/page]
PDF/A conversion: 100%|██████████| 25/25 [00:03<00:00,  6.96page/s]
JPEGs: 0image [00:00, ?image/s]
PDF/A conversion: 100%|██████████| 25/25 [00:04<00:00,  5.85page/s]
JBIG2: 100%|██████████| 25/25 [00:00<00:00, 26.24item/s]
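
OCR is slow (roughly half a minute to a minute per file above), so for larger folders it can be worth skipping files that were already converted and carrying on when one file fails. The sketch below is one possible way to do that, not the only one. `ocr_folder` is a hypothetical helper; the `ocr` argument defaults to `ocrmypdf.ocr` but accepts any callable with the same shape, which also makes the loop easy to try out without running real OCR.

```python
from pathlib import Path


def ocr_folder(src_dir, dst_dir, ocr=None, **ocr_options):
    """OCR every PDF in src_dir into dst_dir, skipping files already done.

    `ocr` is any callable taking (input_path, output_path, **options) and
    defaults to ocrmypdf.ocr. This helper is hypothetical.
    """
    if ocr is None:
        import ocrmypdf  # imported lazily so the loop can be tested without it
        ocr = ocrmypdf.ocr

    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)   # make sure the output folder exists

    done, failed = [], []
    for pdf in sorted(src.glob("*.pdf")):
        target = dst / pdf.name              # same file name, different folder
        if target.exists():
            continue                         # converted on an earlier run
        try:
            ocr(str(pdf), str(target), **ocr_options)
            done.append(pdf.name)
        except Exception as err:             # record the failure and keep going
            failed.append((pdf.name, str(err)))
    return done, failed
```

A call such as `ocr_folder('./raw', './processed')` would mirror the loop above. Extra keyword arguments are passed straight through to the OCR function; for the Canadian patents, which contain French text, passing something like `language=['eng', 'fra']` might reduce the diacritics warnings, but that is an assumption to check against the ocrmypdf documentation and the installed Tesseract language packs.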


## Reading multiple text PDFs

In [10]:
# Get a list of files in a folder

files = glob('./processed/*.pdf')

# Holder for text data

holder = []

for num, filepath in enumerate(files):
    
    name = filepath.split("/")[-1][:-4]
    
    print(f'Processing {name} ({num + 1}/{len(files)})')
    
    holder.append({'name': name,
                   'text': extract_text(filepath)})

Processing GB2571201B (1/3)
Processing CA2151947C (2/3)
Processing CA2827558C (3/3)
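
Splitting the path on `"/"` and slicing off the last four characters works here, but it depends on the path separator and on the extension being exactly `.pdf`. As an alternative sketch, `pathlib` can supply the bare file name through `.stem`, and sorting the files gives a repeatable order. `collect_texts` is a hypothetical helper, and `extract` stands for any function that maps a file path to a string, such as `extract_text`.

```python
from pathlib import Path


def collect_texts(folder, extract):
    """Build one {'name', 'text'} record per PDF in `folder`.

    `extract` is any callable mapping a file path to a string.
    This helper is hypothetical.
    """
    records = []
    for pdf in sorted(Path(folder).glob("*.pdf")):   # sorted for a repeatable order
        records.append({"name": pdf.stem,             # file name without ".pdf"
                        "text": extract(str(pdf))})
    return records
```

The returned list has the same shape as `holder`, so it can be passed to `pd.DataFrame()` in the same way.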


In [11]:
# Create a dataframe from the holder

patents = pd.DataFrame(holder)

In [12]:
# Check the dataframe

patents.head()

|   | name | text |
|---|------|------|
| 0 | GB2571201B | JK Patent \n\n,.GB \n\n2011201 \n\n“yD \n\n(4... |
| 1 | CA2151947C | iwi \n\nOffice de la Propriété \nIntellectu... |
| 2 | CA2827558C | ivi \n\nInnovation, Sciences et \nDéveloppem... |


In [13]:
# Output the dataframe to a file

patents.to_csv('patents.csv', sep=',', index=False)
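
Because each `text` value contains many line breaks, a single record is likely to span several physical lines in the CSV file, held together by quoting. The sketch below illustrates the idea with the standard library's `csv` module rather than pandas (an assumption made to keep the example self-contained): the record is written, counted as physical lines, and read back unchanged.

```python
import csv
import io

rows = [{"name": "GB2571201B", "text": "JK  Patent \n\n,.GB \n\n2011201"}]

# Write the record to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "text"])
writer.writeheader()
writer.writerows(rows)

# One record, but several physical lines, because the text field is quoted.
physical_lines = buffer.getvalue().splitlines()

# Reading it back with a CSV parser restores the original text, newlines included.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(len(physical_lines), restored == rows)
```

The practical consequence is that `patents.csv` should be read with a CSV parser rather than line by line.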