Converting _PDF_ tables to `pandas` from images, a.k.a. "rasterised" `pdf`. This workbook is for extracting information from `pdf` files that were generated by scanning or photographing, or otherwise exported as images rather than as text. This is a significantly more complex task than the previous one; it can be broken down into 3 steps:
 1. Converting _PDF_ pages to images
 2. Recognising text (_OCR - optical character recognition_)
 3. Converting the extracted text into a data table

### Installs

#### Poppler

In [1]:
!apt-get install poppler-utils 

Reading package lists... Done
Building dependency tree       
Reading state information... Done
poppler-utils is already the newest version (0.62.0-2ubuntu2.14).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 27 not upgraded.


#### Tesseract

In [2]:
!apt install tesseract-ocr

Reading package lists... Done
Building dependency tree       
Reading state information... Done
tesseract-ocr is already the newest version (4.00~git2288-10f4998a-2).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 27 not upgraded.


In [3]:
!pip install pytesseract

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


#### Restart Python

Read this if you are interested in the process: https://www.geeksforgeeks.org/python-reading-contents-of-pdf-using-ocr-optical-character-recognition/

In [4]:
!pip install Pillow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [5]:
!pip install pdf2image

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Read this if you want to run this outside of `Colab`: https://stackoverflow.com/questions/44439443/python-how-to-pip-install-opencv2-with-specific-version-2-4-9

In [6]:
!pip install opencv-python

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Conversion

# 1

https://hoineki.com/article.php?a=How_to_convert_PDF_to_Image34

In [7]:
from PIL import Image 
import sys 
from pdf2image import convert_from_path 
import os 

In [8]:
# Path of the pdf 
PDF_file = "patent-data.pdf"

In [9]:
if not os.path.exists('pdf/'+PDF_file+'/'):
    os.makedirs('pdf/'+PDF_file+'/')

This creates a folder to store the extracted _PDF_ pages as images.
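The folder creation can also be sketched with `pathlib`, which handles the "already exists" case in one call. The helper name `page_folder` and its `root` parameter are made up for this illustration and are not part of the original workbook.

```python
from pathlib import Path

def page_folder(pdf_file, root="pdf"):
    """Create (if needed) and return the folder that will hold the page images."""
    folder = Path(root) / pdf_file
    # parents=True also creates the root folder; exist_ok=True makes re-runs harmless
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```

Calling `page_folder("patent-data.pdf")` would give the same `pdf/patent-data.pdf/` folder as the cell above.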

In [10]:
# Store all the pages of the PDF in a variable 
# The second argument is the resolution in DPI; try values between 300 and 600
pages = convert_from_path(PDF_file, 600) 
  
# Counter used to number the image file of each page 
image_counter = 1
  
# Iterate through all the pages stored above 
for page in pages: 
  
    # Declaring filename for each page of PDF as JPG 
    # For each page, filename will be: 
    # PDF page 1 -> page_1.jpg 
    # PDF page 2 -> page_2.jpg 
    # PDF page 3 -> page_3.jpg 
    # .... 
    # PDF page n -> page_n.jpg 
    filename = 'pdf/'+PDF_file+"/page_"+str(image_counter)+".jpg"
    print(image_counter,'page done..')
      
    # Save the image of the page in system 
    page.save(filename, 'JPEG') 
  
    # Increment the counter to update filename 
    image_counter = image_counter + 1

1 page done..
2 page done..
3 page done..
4 page done..
5 page done..
6 page done..
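At 600 DPI every page is a large image, and the cell above keeps all of them in memory at once. A possible alternative, sketched below under the assumption that memory is the bottleneck, is to render one page at a time with the `first_page` and `last_page` arguments of `convert_from_path`. The function names `page_path` and `convert_one_page_at_a_time` are hypothetical.

```python
from pathlib import Path

def page_path(out_dir, page_number):
    """Return the JPEG path for a 1-based page number, e.g. out_dir/page_3.jpg."""
    return Path(out_dir) / f"page_{page_number}.jpg"

def convert_one_page_at_a_time(pdf_path, out_dir, dpi=600):
    """Render each page separately so only one page image is held in memory."""
    # Imported here so that page_path can be used without pdf2image installed
    from pdf2image import convert_from_path, pdfinfo_from_path

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    n_pages = int(pdfinfo_from_path(pdf_path)["Pages"])
    for n in range(1, n_pages + 1):
        # first_page/last_page restrict rendering to a single page
        page = convert_from_path(pdf_path, dpi=dpi, first_page=n, last_page=n)[0]
        page.save(page_path(out_dir, n), "JPEG")
        print(n, "page done..")
    return n_pages
```

This trades speed for memory, since Poppler is started once per page. For a six-page document like this one, the original cell is perfectly adequate.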


# 2

Optical character recognition

In [11]:
import pytesseract

In [12]:
# Variable to get count of total number of pages 
filelimit = image_counter-1
  
# Creating a text file to write the output 
outfile = 'pdf/'+PDF_file+"/text.txt"
  
# Open the file in write mode: all pages are written through the  
# same handle, and re-running the cell does not duplicate the text 
f = open(outfile, "w") 
  
# Iterate from 1 to total number of pages 
for i in range(1, filelimit + 1): 
  
    # Set filename to recognize text from 
    # Again, these files will be: 
    # page_1.jpg 
    # page_2.jpg 
    # .... 
    # page_n.jpg 
    filename = 'pdf/'+PDF_file+"/page_"+str(i)+".jpg"
          
    # Recognize the text in the image as a string using pytesseract 
    text = pytesseract.image_to_string(Image.open(filename)) 
    print(i,'page done..')
  
    # The recognized text is stored in the variable text, 
    # and any string processing may be applied to it. 
    # Here, only basic formatting is done: 
    # in many PDFs, a word that does not fit at the end of a line 
    # is split with a hyphen and continued on the next line, 
    # e.g. 'GeeksF-' on one line and 'orGeeks' on the next. 
    # To undo this, we replace every '-\n' with ''. 
    text = text.replace('-\n', '')     
  
    # Finally, write the processed text to the file. 
    f.write(text) 

# Close the file after writing all the text. 
f.close() 

1 page done..
2 page done..
3 page done..
4 page done..
5 page done..
6 page done..
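The only clean-up applied during OCR is the `'-\n'` replacement. A small stand-alone illustration of what it does, and of one side effect to watch for in tables, might look like this (the sample strings are invented):

```python
def join_hyphenated(text):
    """Join words that were split across a line break with a hyphen."""
    return text.replace("-\n", "")

# A word broken at the end of a line is glued back together
print(join_hyphenated("The appli-\ncation was filed in 1895.\n"))

# Side effect: a table row that ends in a dash is merged with the next row
print(join_hyphenated("5 3 6 -\n1 2\n"))
```

In running text this is what we want. In a table, where a trailing `-` can stand for "no value", it silently merges two rows, so it may be worth skipping this step for table-heavy pages.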


# 3

In [13]:
import pandas as pd

Processing the recognised text

In [14]:
with open(outfile, 'r') as f:
    pages = f.read()

Splitting lines by the newline character `\n`

In [15]:
lines=[i for i in pages.split('\n') if i]

In [16]:
good_lines=[line for line in lines if line[0].isdigit()]
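The two cells above keep every non-empty line whose first character is a digit. A slightly more forgiving variant, sketched here with an invented sample and a hypothetical helper name `numeric_lines`, strips leading whitespace first so that indented table rows are not lost:

```python
def numeric_lines(text):
    """Return the non-empty lines whose first non-space character is a digit."""
    stripped = (line.strip() for line in text.splitlines())
    return [line for line in stripped if line and line[0].isdigit()]

sample = "PATENTS, DESIGNS\n\n  16,471\n1,439 1,595 1116\nTotal applications\n"
print(numeric_lines(sample))  # ['16,471', '1,439 1,595 1116']
```

The filter is deliberately crude: it also keeps headings that happen to start with a number, which is why the result still needs inspecting.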

Typical errors: you're on your own from here, as every `pdf` is different.

In [17]:
good_lines

['2. - DESIGNS, AND TRADE MARKS,',
 '2 WIth',
 '5.',
 '3 PRINTERS TO THE KING’S MOST EXCELLENT MAJESTY.',
 '7 t ¢ © bt 8',
 '1895.',
 '16,471',
 '1,185',
 '453',
 '35',
 '280',
 '173',
 '2,146',
 '90,962',
 '1,349',
 '635',
 '33',
 '88',
 '15',
 '1',
 '1897, 1898. 1899. 7 1900. “| 1901.',
 '19,897 17,380 1} 15,354 | 18,777 6,099',
 '1,439 1,595 1116 | 1,154 ? 320',
 '665 503 396 37% ae j',
 '38 29 a0 16',
 '12 14 | g 5 |',
 '13 9 8 id 1',
 '28 13 13 is 92',
 '127 121 97 90 94',
 '7 12 12 10 ll',
 '15 16 13 33 85',
 '9 8 9 38 9',
 '3 l 1 4',
 '6 1 1',
 '253 163 163 156 193',
 '12 1 2 wren 1',
 '69 63 7 68 68',
 '1 -— — —',
 '130 73 53 ": 0',
 '5 19 § 4',
 '1 *',
 '377 414 418 418 889',
 '245 295 208 ist 199',
 '41 63 69 77 80 |',
 '1,194 1,138 1,031 946 948',
 '2,459 2,599 9,991 | 2,631 assa |',
 '1 9 _',
 '59 108 112 100 97',
 '27 36 84 44',
 '3 9 — 2',
 '6 5 G 6',
 '104 115 125 102',
 '98 16 a4 18',
 '110 118 93 lod 14',
 '103 123 137 150 154',
 '5 3 6 —',
 '1 — 1 _ 9',
 '3 2 - 3',
 '
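Some of the errors in the output above look like classic OCR look-alikes, for example `ll` where `11` is likely, or `l` where `1` is likely. One way to handle these is sketched below. The character map is a guess based on this output, not something Tesseract provides, and it should be checked against the scanned pages before being trusted.

```python
# Hypothetical look-alike map; extend it after inspecting your own OCR output
LOOKALIKES = str.maketrans({"l": "1", "I": "1", "O": "0", "o": "0", "§": "5", "G": "6"})

def repair_token(token):
    """Translate look-alike characters, but only if the result is a clean number."""
    candidate = token.translate(LOOKALIKES)
    if candidate.replace(",", "").isdigit():
        return candidate
    # Anything that still is not a number is returned unchanged for manual review
    return token

print([repair_token(t) for t in "7 12 12 10 ll".split()])  # ['7', '12', '12', '10', '11']
```

Tokens such as `wren` or `lod` are left untouched on purpose: guessing a value for them would fabricate data.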

In [None]:
# Replace stray OCR punctuation with spaces and collapse doubled spaces
good_lines = [line.replace('_', ' ').replace('. ', ' ').replace('-', ' ').replace('—', ' ')
              .replace('~', ' ').replace('=', ' ').replace('  ', ' ').replace('  ', ' ')
              .replace('»', ' ') for line in good_lines]
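Once the lines are cleaned, a possible next step towards a data table is to keep only the rows that parse cleanly and set the rest aside for manual checking. The sketch below assumes five value columns, because the OCR'd header line appears to list five years. `parse_row` and `N_COLUMNS` are names invented for this example.

```python
import re

N_COLUMNS = 5  # assumption: one value per year in the OCR'd header line

def parse_row(line, n_columns=N_COLUMNS):
    """Return a list of ints if the line holds exactly n_columns numbers, else None."""
    tokens = line.replace("|", " ").split()
    if len(tokens) != n_columns:
        return None
    values = []
    for token in tokens:
        digits = token.replace(",", "")  # drop thousands separators
        if not re.fullmatch(r"[0-9]+", digits):
            return None
        values.append(int(digits))
    return values

sample = ["127 121 97 90 94", "1,194 1,138 1,031 946 948", "12 1 2 wren 1", "27 36 84 44"]
rows = [row for row in map(parse_row, sample) if row is not None]
print(rows)  # [[127, 121, 97, 90, 94], [1194, 1138, 1031, 946, 948]]
```

With `pandas` imported as `pd`, `pd.DataFrame(rows, columns=[1897, 1898, 1899, 1900, 1901])` would then give one column per year. The year labels are read off the OCR'd header line and should be verified against the scan, and the rejected lines still need to be fixed by hand.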