 *Artificial Intelligence for Vision & NLP* &nbsp; | &nbsp;  *ATU Donegal - MSc in Big Data Analytics & Artificial Intelligence*
 
# Tesseract
Tesseract is an open-source Optical Character Recognition (OCR) engine that reads and recognises text in images, e.g. of pages, number plates, etc. We'll use it from Python through the `pytesseract` wrapper. To get started, we install the required packages and import the modules.

In [None]:
# Install the Tesseract engine and Poppler (system packages)
!apt-get install -y tesseract-ocr libtesseract-dev poppler-utils
# Install the Python wrappers
!pip install pytesseract pdf2image Pillow

# Import libraries
from PIL import Image
import pytesseract
from pdf2image import convert_from_path
import cv2
import sys
import os

## Part 1: Convert PDF to Image
First we'll import a PDF and convert it to an image for reading by Tesseract.

In [None]:
# Mount Google Drive so the notebook can read the PDF
from google.colab import drive
drive.mount('/content/gdrive')

PDF_file = "/content/gdrive/My Drive/NLP/TextCleansing.pdf"
# Render the first page only (single_file=True) at 500 DPI
pages = convert_from_path(PDF_file, dpi=500, single_file=True)
pages[0].save("/content/gdrive/My Drive/NLP/TextCleansing.jpg", 'JPEG')
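
The value 500 passed to `convert_from_path` is the rendering resolution in dots per inch (DPI); pdf2image uses 200 if none is given. As a rough guide to how large the resulting image is, the sketch below estimates pixel dimensions from a page size. The A4 dimensions and the helper name `page_size_in_pixels` are assumptions for illustration, since the page size of the PDF above isn't stated.

```python
def page_size_in_pixels(width_mm, height_mm, dpi):
    """Estimate the pixel size of a page rendered at the given DPI."""
    mm_per_inch = 25.4
    width_px = round(width_mm / mm_per_inch * dpi)
    height_px = round(height_mm / mm_per_inch * dpi)
    return width_px, height_px

# Assumed A4 page (210 mm x 297 mm) at 500 DPI
a4_at_500 = page_size_in_pixels(210, 297, 500)
print(a4_at_500)
```

Higher DPI generally gives Tesseract more detail to work with, at the cost of larger images and slower processing.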

## Tesseract Options

Next, we can use Tesseract to convert the image to a string.

The `-l` flag sets the language of the input text.

The `--oem` argument, or OCR Engine Mode, selects the type of algorithm used by Tesseract; `3` selects the default engine.

The `--psm` argument sets the Page Segmentation Mode, which tells Tesseract how the text is laid out on the page; `6` assumes a single uniform block of text.
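
These options are passed to pytesseract as a single string through the `config` argument. A minimal sketch of assembling that string follows; `build_tesseract_config` is a hypothetical helper (not part of pytesseract), and `eng` is assumed as the language code.

```python
def build_tesseract_config(oem=3, psm=6, lang=None):
    """Assemble a Tesseract option string (hypothetical helper)."""
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if lang is not None:
        # -l selects the language model(s), e.g. "eng" or "eng+fra"
        parts.insert(0, f"-l {lang}")
    return " ".join(parts)

# Default engine mode, single uniform block of text
print(build_tesseract_config())
# --psm 7 treats the image as a single line of text
print(build_tesseract_config(psm=7, lang="eng"))
```

The resulting string could then be passed as `config=` to `pytesseract.image_to_string`. The language can alternatively be given through that function's `lang` keyword argument.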

In [None]:
import cv2
img = cv2.imread('/content/gdrive/My Drive/NLP/TextCleansing.jpg')

# Adding custom options
custom_config = r'--oem 3 --psm 6'
#Find out more about these options in the Tesseract documentation at https://tesseract-ocr.github.io/tessdoc/
#Convert to String
st = pytesseract.image_to_string(img, config=custom_config)
print(st)

It's also possible to annotate the image with boxes highlighting each character found by the algorithm:

In [None]:
from google.colab.patches import cv2_imshow
img = cv2.imread('/content/gdrive/My Drive/NLP/TextCleansing.jpg')

h, w, c = img.shape
boxes = pytesseract.image_to_boxes(img)
for b in boxes.splitlines():
    b = b.split(' ')
    # Box y values are measured from the bottom of the image, so subtract them from h.
    # The colour (0, 255, 0) is green: OpenCV uses Blue, Green, Red order.
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)

cv2_imshow(img)
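
Each line returned by `image_to_boxes` has the form `char left bottom right top page`, with the origin at the bottom-left of the image, whereas OpenCV places the origin at the top-left. That is why the loop above subtracts the y values from the image height. A small sketch of the conversion, using a made-up box line and image height (both assumptions for illustration; `box_line_to_opencv` is a hypothetical helper):

```python
def box_line_to_opencv(line, image_height):
    """Convert one image_to_boxes line to OpenCV corner points."""
    char, left, bottom, right, top = line.split(' ')[:5]
    # Flip the y axis: Tesseract measures from the bottom, OpenCV from the top
    top_left = (int(left), image_height - int(top))
    bottom_right = (int(right), image_height - int(bottom))
    return char, top_left, bottom_right

# Hypothetical box for the character 'N' in an image 100 pixels high
print(box_line_to_opencv('N 10 20 30 60 0', 100))
```

The two points returned are opposite corners of the rectangle, which is the form `cv2.rectangle` expects.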


We can also search the image for a text pattern and highlight only the words that match:

In [None]:
import re
from pytesseract import Output

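img = cv2.imread('/content/gdrive/My Drive/NLP/TextCleansing.jpg')
# Word-level results: text, confidence and bounding box for each detected element
d = pytesseract.image_to_data(img, output_type=Output.DICT)
keys = list(d.keys())

text_pattern = r'(NLP)'

n_boxes = len(d['text'])
for i in range(n_boxes):
    # Keep only confident detections whose text starts with the pattern
    if float(d['conf'][i]) > 60 and re.match(text_pattern, d['text'][i]):
        (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
        # Draw a red rectangle (OpenCV colours are in Blue, Green, Red order)
        img = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 255), 2)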

cv2_imshow(img)
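
With `Output.DICT`, `image_to_data` returns parallel lists (`text`, `conf`, `left`, `top`, `width`, `height` and others) with one entry per detected element. The filtering logic in the loop above can be sketched as a standalone function; the sample dictionary below is invented for illustration and `find_matches` is a hypothetical helper.

```python
import re

def find_matches(data, pattern, min_conf=60):
    """Return (x, y, w, h) boxes for words matching pattern above min_conf."""
    matches = []
    for i, word in enumerate(data['text']):
        # conf is -1 for non-word elements; float() also copes with string values
        if float(data['conf'][i]) > min_conf and re.match(pattern, word):
            matches.append((data['left'][i], data['top'][i],
                            data['width'][i], data['height'][i]))
    return matches

# Invented sample in the shape of Output.DICT (not real OCR output)
sample = {
    'text': ['', 'NLP', 'is', 'NLP-based'],
    'conf': [-1, 95, 91, 40],
    'left': [0, 12, 60, 90],
    'top': [0, 8, 8, 8],
    'width': [200, 40, 20, 80],
    'height': [50, 14, 14, 14],
}
print(find_matches(sample, r'NLP'))
```

Note that `re.match` only matches at the start of each word; `re.search` would be the choice if the pattern may appear anywhere within a word.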

## Scanned Documents

A more realistic application is reading text from printed documents which have been scanned. Let's try this with the two typewritten documents provided.

In [None]:
img = cv2.imread('/content/gdrive/My Drive/NLP/HistoricalDoc1.jpg')

# Adding custom options
custom_config = r'--oem 3 --psm 6'
#Find out more about these options in the Tesseract documentation at https://tesseract-ocr.github.io/tessdoc/
#Convert to String
pytesseract.image_to_string(img, config=custom_config)

In [None]:
from google.colab.patches import cv2_imshow
img = cv2.imread('/content/gdrive/My Drive/NLP/HistoricalDoc1.jpg')

h, w, c = img.shape
boxes = pytesseract.image_to_boxes(img) 
for b in boxes.splitlines():
    b = b.split(' ')
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2) #OpenCV is in Blue, Green, Red format.

cv2_imshow(img)

In [None]:
img = cv2.imread('/content/gdrive/My Drive/NLP/HistoricalDoc2.jpg')

# Adding custom options
custom_config = r'--oem 3 --psm 6'
#Find out more about these options in the Tesseract documentation at https://tesseract-ocr.github.io/tessdoc/
#Convert to String
pytesseract.image_to_string(img, config=custom_config)

In [None]:
from google.colab.patches import cv2_imshow
img = cv2.imread('/content/gdrive/My Drive/NLP/HistoricalDoc2.jpg')

h, w, c = img.shape
boxes = pytesseract.image_to_boxes(img) 
for b in boxes.splitlines():
    b = b.split(' ')
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2) #OpenCV is in Blue, Green, Red format.

cv2_imshow(img)
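
When `image_to_string` is called without `print`, the notebook displays the raw string, including `\n` escapes and any stray whitespace. A minimal sketch of tidying such output before further processing follows; the sample string is invented and `tidy_ocr_text` is a hypothetical helper.

```python
def tidy_ocr_text(raw):
    """Strip surrounding whitespace from each line and drop empty lines."""
    lines = (line.strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)

# Invented sample; Tesseract output often ends with a form feed ('\x0c')
raw = "  Dear Sir,\n\n I write regarding the matter. \n\x0c"
print(tidy_ocr_text(raw))
```

This only normalises line breaks and whitespace. It does not correct misrecognised characters, which are common in scans of typewritten pages.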

## Exercise
Find an image of a number plate on a vehicle and use Tesseract to read it. 