# Performing OCR on a PDF file in Portuguese

We use pdf2image to convert the PDF file into a series of images, OpenCV to preprocess them (converting each image to black and white), and Tesseract to perform the OCR and convert the images to text.

## Imports and Configuration

Importing the necessary modules

In [1]:
import cv2
import pytesseract
import numpy as np

from pdf2image import convert_from_path

It is necessary to define the Tesseract installation path

In [2]:
pytesseract.pytesseract.tesseract_cmd = r'C:\Users\rodri\AppData\Local\Programs\Tesseract-OCR\tesseract.exe'
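
A hard-coded path ties the notebook to one machine. The following is a minimal sketch of a more portable lookup, assuming Tesseract may already be on the system `PATH`. The helper name `find_tesseract` and its `candidates` parameter are hypothetical and not part of pytesseract.

```python
import shutil
from pathlib import Path


def find_tesseract(candidates=()):
    """Return a path to the Tesseract executable, or None if none is found.

    `candidates` is a hypothetical list of fallback locations to try
    when the executable is not on the PATH.
    """
    # First choice: whatever "tesseract" resolves to on the PATH
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path

    # Otherwise, try each fallback location in order
    for candidate in candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return None
```

When the result is not `None`, it could be assigned to `pytesseract.pytesseract.tesseract_cmd` in place of the fixed string above.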

Configuration

In [3]:
# The PDF path can be either a web link or a file on the local disk
# A higher DPI should make the OCR more accurate, but makes the conversion much slower and uses much more memory
pdf_path = "https://www.ufrgs.br/colegiodeaplicacao/wp-content/uploads/2020/10/Edital03_monitoria.pdf"
conversion_dpi = 350

# Here we use the Portuguese language in Tesseract. Its language data ('por') must be installed
ocr_language = 'por'

# We can use OpenCV to show the image that is feeding Tesseract, for testing purposes
show_image = False
image_scale = 0.4
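
If passing a web link directly does not work in a given setup, one possible fallback is to download the file first and hand pdf2image a local path. This is a sketch under that assumption. The helper names `is_web_link` and `to_local_pdf` are hypothetical, and error handling is left out.

```python
import tempfile
import urllib.request
from urllib.parse import urlparse


def is_web_link(path):
    """Return True when the path looks like an http(s) URL."""
    return urlparse(str(path)).scheme in ("http", "https")


def to_local_pdf(path):
    """Return a local file path, downloading the PDF first if it is a web link."""
    if not is_web_link(path):
        return str(path)

    # Fetch the whole document into memory
    with urllib.request.urlopen(path) as response:
        data = response.read()

    # delete=False keeps the file on disk after the handle is closed
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as handle:
        handle.write(data)
        return handle.name
```

With this in place, `convert_from_path(to_local_pdf(pdf_path), conversion_dpi)` would always receive a local file. The temporary file is not removed automatically.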

## Helper Functions

In [4]:
# Converts an image to black and white
# Uses Otsu's method to estimate the threshold
def get_black_white_image(image):
    # We first convert the image to gray and get a blurred version to filter out high-frequency artifacts
    # (pdf2image returns PIL images, so the arrays are in RGB channel order, not BGR)
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
    blurred = cv2.GaussianBlur(gray, (7, 7), 0)
    
    # We then get the threshold with Otsu's Method
    T, blur_thresh = cv2.threshold(blurred, 0, 255,
            cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # Finally we use the threshold on the grayscale image to get a BW image
    discard, thresh = cv2.threshold(gray, T, 255, cv2.THRESH_BINARY)
    return thresh
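
For intuition, Otsu's method picks the gray level that maximizes the variance between the two classes of pixels it separates. The following is a minimal NumPy sketch of that idea, not OpenCV's implementation. The name `otsu_threshold` is hypothetical, and the input is assumed to be an 8-bit grayscale array.

```python
import numpy as np


def otsu_threshold(gray):
    """Return the gray level (0-255) that maximizes between-class variance."""
    # Normalized histogram of the 256 possible gray levels
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)

    # For each candidate threshold t: share of pixels <= t and above t
    weight_low = np.cumsum(prob)
    weight_high = 1.0 - weight_low

    # Cumulative mean up to t, and the mean of the whole image
    cum_mean = np.cumsum(prob * levels)
    total_mean = cum_mean[-1]

    # Between-class variance; set to 0 where one class is empty
    denom = weight_low * weight_high
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (total_mean * weight_low - cum_mean) ** 2 / denom
    between[denom == 0] = 0.0

    return int(np.argmax(between))


# Usage: a synthetic image with dark "ink" (50) and light "paper" (200)
sample = np.array([[50, 50, 200, 200]] * 4, dtype=np.uint8)
t = otsu_threshold(sample)
binary = np.where(sample > t, 255, 0).astype(np.uint8)
```

Pixels strictly above the threshold become white, which mirrors the `cv2.THRESH_BINARY` rule used in the helper above.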

In [5]:
# Shows an image in an OpenCV window, optionally rescaling the window
def show_scaled_image(image, scale = 1.0):
    width = int(image.shape[1] * scale)
    height = int(image.shape[0] * scale)
    dim = (width, height)
    
    cv2.namedWindow('Image', cv2.WINDOW_NORMAL)
    cv2.imshow("Image", image)
    cv2.resizeWindow('Image', dim)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

## Performing the OCR

Converting the PDF to images, and the resulting PIL (PPM) images to NumPy arrays

In [6]:
%%time
pages = convert_from_path(pdf_path, conversion_dpi)

# Convert each PIL page image to a NumPy array for OpenCV
np_image_array = [np.array(page) for page in pages]

Wall time: 12.8 s
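
To see why the DPI setting affects memory so strongly, the size of one uncompressed page can be estimated from the page dimensions. This is a rough sketch that assumes A4 pages (8.27 x 11.69 inches) and 3 bytes per RGB pixel. The actual page size of the PDF is not known here, and both function names are hypothetical.

```python
def page_pixels(width_in, height_in, dpi):
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def rgb_page_megabytes(width_in, height_in, dpi):
    """Approximate size in MB of one uncompressed RGB page (3 bytes per pixel)."""
    width_px, height_px = page_pixels(width_in, height_in, dpi)
    return width_px * height_px * 3 / 1e6


# Assumed A4 page size in inches
A4_WIDTH_IN, A4_HEIGHT_IN = 8.27, 11.69

size_at_350 = rgb_page_megabytes(A4_WIDTH_IN, A4_HEIGHT_IN, 350)
size_at_700 = rgb_page_megabytes(A4_WIDTH_IN, A4_HEIGHT_IN, 700)
print(f"350 dpi: about {size_at_350:.0f} MB per page")
print(f"700 dpi: about {size_at_700:.0f} MB per page")
```

Under these assumptions a page at 350 DPI takes on the order of 35 MB, and doubling the DPI roughly quadruples that figure because both dimensions scale with it.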


Using Tesseract to extract the text

In [7]:
%%time

pages_text = []

for np_image in np_image_array:
    # Convert the image to black and white
    bw_image = get_black_white_image(np_image)

    # Run Tesseract on the black and white image to get the recognized text
    text = pytesseract.image_to_string(bw_image, lang=ocr_language)
    pages_text.append(text)
    
    if show_image:
        # Show the black and white image that was fed to Tesseract
        show_scaled_image(bw_image, image_scale)

Wall time: 5.44 s
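
The extracted text could also be written to disk rather than only printed. Below is a minimal sketch that joins the per-page strings and saves them as UTF-8, so that Portuguese accented characters survive regardless of the platform's default encoding. The function name `save_pages` and the blank-line separator are assumptions.

```python
from pathlib import Path


def save_pages(pages_text, out_path, separator="\n\n"):
    """Join the per-page strings and write them to a UTF-8 text file.

    Returns the number of characters written.
    """
    text = separator.join(pages_text)
    # Explicit UTF-8 avoids depending on the platform's default encoding
    Path(out_path).write_text(text, encoding="utf-8")
    return len(text)
```

For example, `save_pages(pages_text, "edital.txt")` would store every page in one file, where `edital.txt` is just an illustrative file name.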


In [8]:
for text in pages_text:
    print(text)

SERVIÇO PÚBLICO FEDERAL

o, Fhos UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL ob

UNIVERMDADE FEDERAL

BS NES GRANDE DO TOL COLÉGIO DE APLICAÇÃO

. EDITAL Nº 03/2020
SELEÇÃO PARA PROVIMENTO DE VAGAS REMANESCENTES
DE MONITORIA ACADÊMICA DO COLÉGIO DE APLICAÇÃO

O Colégio de Aplicação da Universidade Federal do Rio Grande do Sule a
Pró-Reitoria de Graduação da UFRGS, no uso de suas atribuições, tornam público
que estão abertas, no período de 15/10/2020 a 20/10/2020, as inscrições para o
processo seletivo simplificado, que será regido pelas regras do presente Edital, para
monitor de disciplinas na forma da legislação vigente, no Colégio de Aplicação.

1. Das Disposições Gerais

O Programa de Monitoria Acadêmica está em conformidade com o
estabelecido no Decreto nº 8.862/81, complementado pelo artigo 84 da Lei nº 9.394
de 20 de dezembro de 1996, e é regido pela Instrução Normativa 02/2008 de
Monitoria Acadêmica — PROGRAD.

As inscrições para as Bolsas de Monitoria Acadêmica Remunerada deverão