# <center> Optical Character Recognition </center>
-----
- How do we make machines **read** text? :)
- Purpose: converting images of text (2-dimensional pixel data) into machine-readable text
-----
### Python-tesseract is a wrapper for Google's Tesseract-OCR Engine. 
- It also works as a standalone invocation script for Tesseract and can read the image types supported by the Pillow and Leptonica imaging libraries, including PNG, JPEG, GIF, BMP, and TIFF.

In [None]:
!pip install pytesseract # This module helps convert images to text.

## THEN, you need to grab the downloadable Windows installer from the following website: https://github.com/UB-Mannheim/tesseract/wiki
- Link is in the description (YouTube)
- tesseract-ocr-w64-setup-v5.0.0-alpha.20210811.exe (64 bit) (as of 9/5/2021)

# Run the installer and note the path where tesseract.exe is installed (it is needed below when setting `tesseract_cmd`).
![DownloadMessage](ExeDownload.png)

In [1]:
import pytesseract
pytesseract.pytesseract.tesseract_cmd=r'C:\Program Files\Tesseract-OCR\tesseract.exe'
# OpenCV (imported below as cv2) is an open-source library for computer vision, machine learning, and image processing.
# !pip install opencv-python
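
The hard-coded path above only applies to a default Windows install. As a rough sketch, the executable could instead be looked up on the system `PATH` first, falling back to that default. `find_tesseract` and `DEFAULT_TESSERACT` are hypothetical names introduced for illustration; they are not part of pytesseract.

```python
import shutil

# Default install location used earlier in this notebook (Windows).
DEFAULT_TESSERACT = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def find_tesseract(default=DEFAULT_TESSERACT):
    """Return the tesseract executable found on PATH, or `default` if none is found."""
    found = shutil.which("tesseract")
    return found if found is not None else default

tesseract_path = find_tesseract()
print(tesseract_path)
# The result could then be assigned to pytesseract.pytesseract.tesseract_cmd.
```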

In [2]:
import cv2
from PIL import Image
from matplotlib import pyplot as plt
import numpy as np
import os

In [22]:
# image_name = 'Images/tax_ex.jpg'
# image_name = 'Images/indonesian_passport_example.jpg'
# image_name = 'Images/Stop_Sign.jpg'
image_name = 'Images/Yield_Sign.jpg'

In [23]:
# Reading in sample image (OpenCV loads it as a BGR array; the result is None if the file cannot be read)
image = cv2.imread(image_name)
# If you want to resize image...
# image = cv2.resize(image, (500,500))
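
`cv2.imread` returns `None` instead of raising an error when a file cannot be read, so a mistyped path only surfaces later as a confusing failure. A small guard run before the read can catch the missing-file case early. This is only a sketch, and `require_file` is a hypothetical helper name.

```python
from pathlib import Path

def require_file(path):
    """Return `path` unchanged if it points to an existing file, else raise."""
    if not Path(path).is_file():
        raise FileNotFoundError(f"Image not found: {path}")
    return path

# Example with a path that is assumed not to exist (hypothetical file name):
try:
    require_file("Images/does_not_exist.jpg")
except FileNotFoundError as err:
    print(err)
```

In the cell above, it could be used as `image = cv2.imread(require_file(image_name))`.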

In [24]:
cv2.imshow("Sample Image", image)
# Extraction of text from image
text = pytesseract.image_to_string(image)
print(text)
cv2.waitKey(0)
cv2.destroyAllWindows()

 



# Attempting with grayscale to capture all of the lettering

In [25]:
# Reading in sample image
image = cv2.imread(image_name)
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) # Gray scale
cv2.imshow("Grey Scaled Image", image)
# Extraction of text from image
text = pytesseract.image_to_string(image)
print(text)
cv2.waitKey(0)
cv2.destroyAllWindows()

 



# Gain Division (Normalization)
- Divides each pixel by the local maximum of its neighborhood, which flattens uneven background color and brightness so that the text stands out
- https://stackoverflow.com/questions/67386714/detecting-white-text-on-a-bright-background-with-tesseract
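
To see why dividing by the local maximum helps, here is a tiny worked example with made-up pixel values. The local maximum is written by hand as an assumption for illustration; in the notebook it comes from a morphological closing.

```python
import numpy as np

# One image row: the left half is brightly lit, the right half is in shadow.
# In each half the background is twice as bright as the "text" pixel.
image = np.array([[200, 100, 100, 50]], dtype=np.uint8)

# Local maximum (background estimate) per pixel, written by hand here.
local_max = np.array([[200, 200, 100, 100]], dtype=np.uint8)

# Divide where the local maximum is non-zero; leave 0 elsewhere.
ratio = np.divide(
    image.astype(np.float64),
    local_max,
    out=np.zeros(image.shape, dtype=np.float64),
    where=local_max != 0,
)

# Rescale to [0, 255] and convert back to 8-bit.
normalized = np.clip(255 * ratio, 0, 255).astype(np.uint8)
print(normalized)
```

After the division, the background and text pixels have the same values in both halves, so the lighting difference no longer matters.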

In [26]:
# Reading an image in default mode:
image = cv2.imread(image_name)

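# Get local maximum:
kernelSize = 5
maxKernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernelSize, kernelSize))
# Morphological closing (dilation followed by erosion) estimates the local background level
localMax = cv2.morphologyEx(image, cv2.MORPH_CLOSE, maxKernel, None, None, 1, cv2.BORDER_REFLECT101)

# Perform gain division (pixels where localMax is 0 stay 0, which avoids a divide-by-zero warning)
gainDivision = np.divide(image.astype(np.float64), localMax, out=np.zeros(image.shape, dtype=np.float64), where=(localMax != 0))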

# Clip the values to [0,255]
gainDivision = np.clip((255 * gainDivision), 0, 255)

# Convert the mat type from float to uint8:
gainDivision = gainDivision.astype("uint8")

In [27]:
cv2.imshow("White Background", gainDivision) # (Already white background so not much happening here.)
text = pytesseract.image_to_string(gainDivision)
print(text)
cv2.waitKey(0)
cv2.destroyAllWindows()

 

 



# Otsu's Thresholding
- http://www.labbookpages.co.uk/software/imgProc/otsuThreshold.html
- "Otsu's thresholding method involves iterating through all the possible threshold values and calculating a measure of spread for the pixel levels each side of the threshold, i.e. the pixels that either fall in foreground or background. The aim is to find the threshold value where the sum of foreground and background spreads is at its minimum."
- Essentially, this picks the threshold that minimizes the variance within the foreground and background groups (the within-class variance), so the image separates cleanly into two classes, ideally text and background
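
The search described in the quote can be sketched in plain Python. Minimizing the spread within the two groups is equivalent to maximizing the variance between them (the total variance is fixed), and the sketch below uses that second form because it is cheaper to compute. This is an illustration, not OpenCV's implementation; `otsu_threshold` is a hypothetical name and the pixel values are made up.

```python
def otsu_threshold(pixels, levels=256):
    """Return a threshold t that maximizes the between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    # Histogram of gray levels.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_between = 0, -1.0
    weight_low = 0  # number of pixels at or below the candidate threshold
    sum_low = 0     # sum of their gray levels
    for t in range(levels):
        weight_low += hist[t]
        sum_low += t * hist[t]
        weight_high = total - weight_low
        if weight_low == 0:
            continue  # no pixels in the lower class yet
        if weight_high == 0:
            break     # every pixel is in the lower class; nothing left to split
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        # Between-class variance (up to a constant factor of 1 / total**2).
        between = weight_low * weight_high * (mean_low - mean_high) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t

# Dark "text" pixels and bright "background" pixels (made-up values).
pixels = [10, 10, 12, 11, 200, 205, 198, 202]
t = otsu_threshold(pixels)
binary = [255 if p > t else 0 for p in pixels]
print(t, binary)
```

Any threshold between the two clusters separates them equally well, so the exact value returned can differ between implementations.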

In [28]:
# Convert BGR (OpenCV's channel order) to grayscale:
grayscaleImage = cv2.cvtColor(gainDivision, cv2.COLOR_BGR2GRAY)

# Get binary image via Otsu:
_, binaryImage = cv2.threshold(grayscaleImage, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

In [29]:
cv2.imshow("Otsu Thresholding", binaryImage)
text = pytesseract.image_to_string(binaryImage)
print(text)
cv2.waitKey(0)
cv2.destroyAllWindows()

 



# Morphological closing of characters (seals small gaps before the background color filling)

In [30]:
# Set kernel (structuring element) size:
kernelSize = 3
# Set morph operation iterations:
opIterations = 1

# Get the structuring element:
morphKernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernelSize, kernelSize))

# Perform closing:
binaryImage = cv2.morphologyEx( binaryImage, cv2.MORPH_CLOSE, morphKernel, None, None, opIterations, cv2.BORDER_REFLECT101 )

In [31]:
cv2.imshow("Character filling", binaryImage)
text = pytesseract.image_to_string(binaryImage)
print(text)
cv2.waitKey(0)
cv2.destroyAllWindows()

 



# Flood filling

In [20]:
# Flood fill the background with white, starting from the top-left corner. Tesseract works best with black text on a white background.
cv2.floodFill(binaryImage, mask=None, seedPoint=(0, 0), newVal=255)

(401431,
 array([[255, 255, 255, ..., 255, 255, 255],
        [255, 255, 255, ..., 255, 255, 255],
        [255, 255, 255, ..., 255, 255, 255],
        ...,
        [255, 255, 255, ..., 255, 255, 255],
        [255, 255, 255, ..., 255, 255, 255],
        [255, 255, 255, ..., 255, 255, 255]], dtype=uint8),
 None,
 (0, 0, 825, 550))
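
To illustrate what a flood fill does, here is a minimal pure-Python sketch on a toy grid, assuming 4-connectivity (up, down, left, right). `flood_fill` is a hypothetical helper and takes the seed as (row, column), whereas `cv2.floodFill` takes `seedPoint` as (x, y).

```python
from collections import deque

def flood_fill(grid, seed, new_val):
    """Fill the 4-connected region containing `seed` with `new_val`, in place.

    Returns the number of pixels that were repainted.
    """
    rows, cols = len(grid), len(grid[0])
    r0, c0 = seed
    old_val = grid[r0][c0]
    if old_val == new_val:
        return 0  # nothing to do, and avoids an endless loop
    grid[r0][c0] = new_val
    filled = 1
    queue = deque([(r0, c0)])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == old_val:
                grid[nr][nc] = new_val
                queue.append((nr, nc))
                filled += 1
    return filled

# A white ring (255) on a black background (0), like the letter "O".
grid = [
    [0,   0,   0,   0, 0],
    [0, 255, 255, 255, 0],
    [0, 255,   0, 255, 0],
    [0, 255, 255, 255, 0],
    [0,   0,   0,   0, 0],
]
filled = flood_fill(grid, (0, 0), 255)
print(filled)
for row in grid:
    print(row)
```

Only the background connected to the seed is repainted; the pixel enclosed by the ring keeps its old value. The fill also makes the background the same color as the ring itself, which may be part of why the result on the real image is incomplete.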

In [21]:
cv2.imshow("Flood filling", binaryImage)
text = pytesseract.image_to_string(binaryImage) # Didn't work all the way.
print(text)
cv2.waitKey(0)
cv2.destroyAllWindows()

ALLWAYe



# PDF Images
- pip install pymupdf

In [32]:
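import fitz  # PyMuPDF
pdffile = 'Images/Berkshire_hathaway_68.pdf'
doc = fitz.open(pdffile)
page = doc.load_page(0)  # page number (0-based)
pix = page.get_pixmap()  # render the page to an image
output_path = "Images/Berkshire_hathaway_68.png"
pix.save(output_path)
# Older PyMuPDF releases spelled these loadPage, getPixmap, and writePNG.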

In [33]:
image = cv2.imread(output_path)

In [34]:
cv2.imshow("Sample PDF Image", image)
# Extraction of text from image
text = pytesseract.image_to_string(image)
print(text)
cv2.waitKey(0)
cv2.destroyAllWindows()

Managements Discusio

 

nd Analysis (Continued)
Manufacturing, Service and Realing

|A summary of revenues and eamings of our manufacturing, service and retiling busineses follows (olay in
rillion)

     

    

 

 

 

 

Revenues
Manufacturing S 59079 § 62.790 $ 61883 (38% 14%
Service an retailing 75018 79985 78926 (62) 3
Sisko Sunes Av 160) a
Presta earnings *
Manatictring S S10 $ 952 $ 9.365 (Is9M%
Service and retailing 2879 2982 13
10,889 Bw (119)
Income tes and noncontoling interests 2589, 29a
58300 59368
stv income ta rte 233% 237% 74%
Pretax earings as percentage of revenues E% 87% 8%

 

 

 

+ Excludes certain acquisition accounting expenses, which primal related to the amortization of tdentifid angie
‘assets recorded in conection wit owr business acquisitions. Te fter-tax acquisition accounting expenses excluded
From earnings above sere 753 million in 2020, 3788 millon ts 2019 and $932 millon in 2018. In 2020, such
‘expenses also exclude afertax good! and indeintetvediman

In [35]:
image = Image.open(output_path)
image = image.resize((1782, 2322), Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
image.save(fp="newimage_1.png")
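
The resize above enlarges a page that was rendered at PyMuPDF's default resolution of 72 dpi, so the extra pixels are interpolated. An alternative is to request more pixels at render time. The sketch below only does the arithmetic for that; `zoom_for_dpi` and `rendered_size` are hypothetical helper names, and the US Letter page size is an assumption used purely as an example.

```python
POINTS_PER_INCH = 72  # PDF page sizes are measured in points (1/72 inch)

def zoom_for_dpi(dpi):
    """Scale factor that turns the default 72 dpi render into `dpi`."""
    return dpi / POINTS_PER_INCH

def rendered_size(width_pt, height_pt, dpi):
    """Pixel dimensions of a page of the given size (in points) at `dpi`."""
    zoom = zoom_for_dpi(dpi)
    return round(width_pt * zoom), round(height_pt * zoom)

# US Letter (612 x 792 points) is used purely as an example page size.
print(zoom_for_dpi(216))
print(rendered_size(612, 792, 216))
```

With PyMuPDF, the zoom would be passed as `page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))`. Whether this beats resizing for this particular page has not been checked here.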

In [36]:
image = cv2.imread('newimage_1.png')
cv2.imshow("Sample PDF Image Resized", image)
# Extraction of text from image
text = pytesseract.image_to_string(image)
print(text)
cv2.waitKey(0)
cv2.destroyAllWindows()

Management’s Discussion and Analysis (Continued)
Manufacturing, Service and Retailing

A summary of revenues and eamings of our manufacturing, service and retailing businesses follows (dollars in
millions).

 

 

 

 

 

 

 

 

 

Percentage change
2020 2019 2018 2020 vs 2019 2019 vs 2018
Revenues
Manufacturing $ 59,079 $ 62,730 $ 61,883 {5.89% 14%
Service and retailing 75,018 79,945 78,926 {6.2) 13
S$ 134.097 $ 142.675 $ 140,809 (6.0) 13
—_—_——— Eee —_—_—_———_—
Pre-tax earnings *
Manufacturing $ 8010 $ 9522 $ 9,366 (15.99% 1.7%
Service and retailing 2.879 2,843 2,942 1.3 (3.4)
10,889 12,365 12,308 (11.9) 0.5
Income taxes and noncontrolling interests 2,589 2,993 2,944
S$ 8300 $ 9372 $ 9.364
Effective income tax rate 23.3% 23.7% 23.4%
Pretax earnings as a percentage of revenues 8.1% 8.7% 8.7%

 

. Excludes certain acquisition accounting expenses, which primarily related to the amortization of identified intangible

assets recorded in connection with our business acquisitions. The a