<center><h1>OCR Google Vision API</h1></center>

<h2><a id="part_intro">Introduction</a></h2>

Optical Character Recognition (OCR) is one of the most popular tasks in computer vision. It detects the characters in a picture and translates them into text. This notebook shows how to use the OCR developed by Google (the Cloud Vision API) and how to save the detected text in a text file.

<h2>Content</h2>

- [Introduction](#part_intro)
- [Packages](#part_packages)
- [Parameters](#part_0)
- [Convert PDF to JPEG](#part_1)
- [Setup credentials](#part_2)
- [API Vision request](#part_3)
    - [Functions needed](#part_3_1)
    - [Call OCR per page](#part_3_2)
    - [Call OCR per document (stack of 10 pages)](#part_3_3)
    - [Multiprocessing](#part_3_4)

In [1]:
#!pip install google-cloud
#!pip install google-cloud-storage
#!pip install google-cloud-pubsub
#!pip install google-cloud-translate
#!pip install google-cloud-vision
#!pip install pdf2image
#!pip install google-api-python-client

<h2><a id="part_packages">Packages</a></h2>

In [2]:
from pdf2image import convert_from_bytes, convert_from_path
import glob
import multiprocessing as mp
from tqdm import tqdm
import base64
import json
import os
from io import BytesIO
import numpy as np
import io
from PIL import Image
from google.cloud import pubsub_v1
from google.cloud import vision

from google.oauth2 import service_account
import googleapiclient.discovery


<h2><a id="part_0">Parameters</a></h2>

In [3]:
NAME_INPUT_FOLDER = "PDF FOLDER NAME"
NAME_OUTPUT_FOLDER= "RESULT TEXTS FOLDER"

In [4]:
save_jpeg         = False
per_page          = False
per_document      = True    # stack 10 pages into one call 
multi_proc        = False    # use multiprocessing to call the OCR

In [5]:
list_pdf = glob.glob(NAME_INPUT_FOLDER+"/*.pdf") # store the paths of the pdf files

<h2><a id="part_1">I - Convert PDF into JPEG</a></h2>

In [8]:
if save_jpeg:

    for pdf_path in list_pdf:
        # convert the pdf into a list of PIL images, one per page, at 500 dpi
        pages = convert_from_path(pdf_path, 500)
        # keep the name of the document and append the page index
        doc_name = os.path.splitext(os.path.basename(pdf_path))[0]

        for page_nb, page in enumerate(tqdm(pages)):
            # save each page as a JPEG
            page.save(os.path.join(NAME_OUTPUT_FOLDER, f"{doc_name}_{page_nb}.jpg"), 'JPEG')
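
Each saved page is named after its PDF, followed by the page index. As a minimal sketch of that naming step, the hypothetical helper `page_jpeg_path` below (not part of the notebook) builds the name with `pathlib`. It assumes forward-slash paths, as in the rest of the notebook.

```python
from pathlib import PurePosixPath


def page_jpeg_path(pdf_path, output_folder, page_nb):
    """Build '<output_folder>/<pdf name>_<page_nb>.jpg' (hypothetical helper)."""
    # .stem drops only the last extension, so 'report.v2.pdf' keeps 'report.v2'
    doc_name = PurePosixPath(pdf_path).stem
    return str(PurePosixPath(output_folder) / f"{doc_name}_{page_nb}.jpg")


print(page_jpeg_path("PDF FOLDER NAME/invoice.pdf", "RESULT TEXTS FOLDER", 0))
```

Using the stem rather than splitting on the first dot keeps file names that contain several dots intact.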

<h2><a id="part_2">II - Set up credentials</a></h2>

The cells below show how to create credentials from a service account key file.

In [9]:
SCOPES = ['https://www.googleapis.com/auth/cloud-vision']
SERVICE_ACCOUNT_FILE = "PUT YOUR SERVICE ACCOUNT JSON FILE HERE"

In [10]:
credentials = service_account.Credentials.from_service_account_file(
        SERVICE_ACCOUNT_FILE, scopes=SCOPES)
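
Loading the credentials fails if the key file is missing or is not a service account key. A quick check before the call can make that easier to diagnose. The sketch below is an assumption, not part of the notebook: `check_service_account_file` is a hypothetical helper, and the field list covers only a few of the keys such a file normally contains.

```python
import json

# A few of the fields found in a service account key file (not exhaustive)
REQUIRED_KEYS = {"type", "client_email", "private_key", "token_uri"}


def check_service_account_file(path):
    """Return a list of problems found in the key file (empty if none)."""
    try:
        with open(path, "r", encoding="utf-8") as key_file:
            info = json.load(key_file)
    except (OSError, ValueError) as err:
        # missing file, unreadable file or invalid JSON
        return [f"cannot read {path}: {err}"]
    if not isinstance(info, dict):
        return ["the key file does not contain a JSON object"]
    problems = [f"missing field: {key}" for key in sorted(REQUIRED_KEYS - set(info))]
    if "type" in info and info["type"] != "service_account":
        problems.append("field 'type' is not 'service_account'")
    return problems


print(check_service_account_file("PUT YOUR SERVICE ACCOUNT JSON FILE HERE"))
```

An empty list means the file looks usable. It does not prove that the key is still valid on the Google Cloud side.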

<h2><a id="part_3">III - API Vision request</a></h2>

<h3><a id="part_3_1">Functions needed</a></h3>

In [11]:
def detect_text_document(content, credentials):
    """
    Function to call the Vision API and return the text detected inside the image.
    Written for google-cloud-vision >= 2.0 (vision.Image, type_); the 1.x
    releases use vision.types.Image, vision.enums and .type instead.
    @param content: (bytes) image in bytes
    @param credentials: credentials of the service account to call the API
    @return: (str) the text detected inside the picture, one paragraph per line
    """

    client = vision.ImageAnnotatorClient(credentials=credentials)

    # wrap the image bytes in a Vision Image message
    image = vision.Image(content=content)
    # call the OCR; the structured result is in full_text_annotation
    response = client.text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)

    breaks = vision.TextAnnotation.DetectedBreak.BreakType
    paragraphs = []
    # rebuild the text paragraph by paragraph from the detected symbols
    for page in response.full_text_annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                para = ""
                line = ""
                for word in paragraph.words:
                    for symbol in word.symbols:
                        line += symbol.text
                        break_type = symbol.property.detected_break.type_
                        if break_type == breaks.SPACE:
                            line += ' '
                        elif break_type == breaks.EOL_SURE_SPACE:
                            # end of line that also stands for a space
                            line += ' '
                            para += line
                            line = ''
                        elif break_type == breaks.LINE_BREAK:
                            para += line
                            line = ''
                # keep trailing symbols that were not followed by a line break
                paragraphs.append(para + line)

    return "\n".join(paragraphs)
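
The way detected breaks turn a stream of symbols back into text can be tried without calling the API. The sketch below is only an illustration: `assemble_paragraph` is a hypothetical helper, and plain strings stand in for the break types of the Vision API.

```python
def assemble_paragraph(symbols):
    """Rebuild paragraph text from (text, break_type) pairs.

    break_type is a plain string standing in for the Vision API break enum:
    None, "SPACE", "EOL_SURE_SPACE" or "LINE_BREAK".
    """
    para, line = "", ""
    for text, break_type in symbols:
        line += text
        if break_type == "SPACE":
            line += " "
        elif break_type == "EOL_SURE_SPACE":
            # the line ends here and the break also counts as a space
            line += " "
            para, line = para + line, ""
        elif break_type == "LINE_BREAK":
            para, line = para + line, ""
    # keep symbols that were not followed by any line break
    return para + line


symbols = [("O", None), ("C", None), ("R", "SPACE"),
           ("i", None), ("s", "EOL_SURE_SPACE"),
           ("f", None), ("u", None), ("n", "LINE_BREAK")]
print(assemble_paragraph(symbols))  # OCR is fun
```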

In [12]:
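def pil_grid(images, max_horiz=np.iinfo(int).max):
    """
    Function to paste a list of PIL images into one grid image on a white background
    @param images: (list) PIL images, placed row by row
    @param max_horiz: (int) maximum number of images per row (1 stacks them vertically)
    @return: (PIL.Image) the grid image
    """
    n_images = len(images)
    n_horiz = min(n_images, max_horiz)
    n_vert = -(-n_images // n_horiz)  # ceiling division, so a partial last row is counted
    # widest image of each column and tallest image of each row
    h_sizes, v_sizes = [0] * n_horiz, [0] * n_vert
    for i, im in enumerate(images):
        h, v = i % n_horiz, i // n_horiz
        h_sizes[h] = max(h_sizes[h], im.size[0])
        v_sizes[v] = max(v_sizes[v], im.size[1])
    # cumulative sums give the top-left offset of each column and each row
    h_sizes, v_sizes = np.cumsum([0] + h_sizes), np.cumsum([0] + v_sizes)
    im_grid = Image.new('RGB', (int(h_sizes[-1]), int(v_sizes[-1])), color='white')
    for i, im in enumerate(images):
        im_grid.paste(im, (int(h_sizes[i % n_horiz]), int(v_sizes[i // n_horiz])))
    return im_grid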
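
The layout arithmetic behind `pil_grid` can be checked on plain numbers, without any image. The sketch below assumes a non-empty list of `(width, height)` sizes. `grid_layout` is a hypothetical helper that returns the canvas size and the top-left offset of every image.

```python
from itertools import accumulate


def grid_layout(sizes, max_horiz):
    """Return (canvas_size, offsets) for images of the given (width, height)."""
    n_horiz = min(len(sizes), max_horiz)
    n_vert = -(-len(sizes) // n_horiz)  # ceiling division
    col_w, row_h = [0] * n_horiz, [0] * n_vert
    for i, (w, h) in enumerate(sizes):
        # each column is as wide as its widest image, each row as tall as its tallest
        col_w[i % n_horiz] = max(col_w[i % n_horiz], w)
        row_h[i // n_horiz] = max(row_h[i // n_horiz], h)
    xs = [0] + list(accumulate(col_w))
    ys = [0] + list(accumulate(row_h))
    offsets = [(xs[i % n_horiz], ys[i // n_horiz]) for i in range(len(sizes))]
    return (xs[-1], ys[-1]), offsets


canvas, offsets = grid_layout([(100, 200), (80, 150), (120, 50)], max_horiz=1)
print(canvas)   # (120, 400)
print(offsets)  # [(0, 0), (0, 200), (0, 350)]
```

With `max_horiz=1` the images form a single column, which is the layout used to stack the pages of a document.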

In [14]:
def concat_file_ocr(path, cred=credentials):
    '''
    Function to stack the pages of the document in groups of 10 and feed each stack to the OCR
    @param path: (str) path of the pdf
    @param cred: google credentials
    '''
    with open(path, 'rb') as pdf_file:
        imgs = convert_from_bytes(pdf_file.read(), fmt="jpeg")
    ocr_step = 10
    text = []
    for start in range(0, len(imgs), ocr_step):
        # stack up to ocr_step pages vertically into a single image
        im_grid = pil_grid(imgs[start:start + ocr_step], 1)
        temp = BytesIO()
        im_grid.save(temp, format='jpeg')
        # one OCR call for the whole stack
        text.append(detect_text_document(temp.getvalue(), cred))
    # save the result into a txt file named after the pdf
    doc_name = os.path.splitext(os.path.basename(path))[0]
    np.savetxt(os.path.join(NAME_OUTPUT_FOLDER, doc_name + '.txt'), text, fmt="%s")
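
The grouping of pages into stacks of 10 comes down to a range with a step. As a small sketch, the hypothetical helper `page_batches` lists the page indices sent in each OCR call.

```python
def page_batches(nb_pages, ocr_step=10):
    """Return the (start, stop) page indices sent in each OCR call."""
    return [(start, min(start + ocr_step, nb_pages))
            for start in range(0, nb_pages, ocr_step)]


print(page_batches(25))  # [(0, 10), (10, 20), (20, 25)]
```

A 25-page document therefore needs 3 calls instead of 25.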

<h3><a id="part_3_2">Call OCR per page</a></h3>

In [15]:
def call_ocr_save_txt(path, cred=credentials):
    '''
    Function to feed the OCR with each page of the pdf converted into jpeg
    @param path: (str) path of the pdf
    @param cred: google credentials
    '''
    # open the pdf and convert each page into a PIL image in jpeg format
    with open(path, 'rb') as pdf_file:
        pages = convert_from_bytes(pdf_file.read(), fmt="jpeg")
    text = []
    # run on each page of the pdf
    for page in pages:
        # cast the jpeg into bytes
        temp = BytesIO()
        page.save(temp, format='jpeg')
        # save the result of the OCR inside the variable text
        text.append(detect_text_document(temp.getvalue(), cred))
    # save the result into a txt file named after the pdf
    doc_name = os.path.splitext(os.path.basename(path))[0]
    np.savetxt(os.path.join(NAME_OUTPUT_FOLDER, doc_name + '.txt'), text, fmt="%s")

In [16]:
if per_page:
    # call the Vision API on each page of each pdf
    for i in tqdm(list_pdf):
        call_ocr_save_txt(i, cred=credentials)

<h3><a id="part_3_3">Call OCR per document (stack of 10 pages)</a></h3>

In [17]:
if per_document:
    # To save money when calling the API, you can
    # stack up to 10 pages of the pdf into one call
    for doc_pdf in tqdm(list_pdf):

        # call the function which converts the pdf into jpeg, stacks 10 images,
        # calls the API and saves the output into a txt file
        concat_file_ocr(doc_pdf)

<h3><a id="part_3_4">Multiprocessing</a></h3>

In [18]:
if multi_proc:
    nb_processes = mp.cpu_count()
    print(f"The number of available CPUs is {nb_processes}")

    if per_page:
        # create one worker process per CPU; each one handles part of the list
        with mp.Pool(processes=nb_processes) as pool:
            result = pool.map(call_ocr_save_txt, list_pdf)

    if per_document:
        with mp.Pool(processes=nb_processes) as pool:
            result = pool.map(concat_file_ocr, list_pdf)
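
Each OCR call spends part of its time waiting on the network, so a thread pool is a possible alternative to a process pool. The sketch below swaps in `concurrent.futures.ThreadPoolExecutor` and a stand-in function `fake_ocr`. Both are assumptions made for illustration, not the notebook's method. Rendering the PDF pages is CPU-bound, so processes may still be faster on large documents.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_ocr(path):
    """Stand-in for call_ocr_save_txt / concat_file_ocr (hypothetical)."""
    return f"text of {path}"


def run_in_threads(func, paths, max_workers=4):
    """Apply func to every path concurrently and keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(func, paths))


results = run_in_threads(fake_ocr, ["a.pdf", "b.pdf", "c.pdf"])
print(results)
```

`executor.map` returns the results in the order of the inputs, as `pool.map` does.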

<h3>Dask</h3>