### This notebook demonstrates the capabilities of Tesseract OCR

#### We will see how Tesseract can extract and work with the text in an image

<center>
    <u><h2>Image Text Extraction With Pytesseract</h2></u>
</center>

<h3>What is Tesseract?</h3>

<p>Tesseract is an Optical Character Recognition (OCR) tool that extracts text from many types of image files. It was originally developed at <b>Hewlett-Packard</b>, and its development was later sponsored by <b>Google</b>. Text can be extracted in two ways: from the command line or through an API.</p>

<p>Tesseract has Unicode (UTF-8) support and can recognize more than <b>100 languages</b>.</p>

<p>Tesseract supports several output formats, including <b>plain text</b>, <b>HTML</b> (hOCR), <b>PDF</b>, <b>TSV</b>, and <b>XML</b> (ALTO).</p>
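
As a rough illustration of the TSV format, the sketch below parses a TSV string laid out like the one `pytesseract.image_to_data` returns and keeps only words above a confidence threshold. The helper name `words_from_tsv`, the `min_conf` parameter, and the hand-written sample string are assumptions made for this example; they are not part of pytesseract.

```python
import csv
import io

def words_from_tsv(tsv_text, min_conf=60.0):
    """Return (text, confidence) pairs for recognized words in Tesseract TSV output."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        try:
            conf = float(row.get("conf") or -1)
        except ValueError:
            continue
        # Non-word rows (page, block, line) carry a confidence of -1 and no text.
        if text and conf >= min_conf:
            words.append((text, conf))
    return words

# Hand-written sample in the TSV column layout; this is not real OCR output.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t10\t80\t20\t96.5\tTransaction\n"
    "5\t1\t1\t1\t1\t2\t95\t10\t30\t20\t41.0\tlD\n"
)

print(words_from_tsv(SAMPLE_TSV))
```

With Tesseract installed, a string in this layout would come from `pt.image_to_data(image)`.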

<p>The accuracy of Tesseract's output depends heavily on the <u>quality of the input image</u>.</p>

<h3>How to install Tesseract?</h3>

<ol>
    <li>Download and install the relevant binary for your operating system from <a href='https://github.com/tesseract-ocr/tessdoc#500x'>here</a>.</li>
    <li>Add the install location to your PATH, <b>e.g. </b><i>C:\Program Files\Tesseract-OCR.</i></li>
    <li>Run the command <span style="background-color: lightblue;">tesseract</span> in a terminal to check that the installation works.</li>
</ol>
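
The last step can also be checked from Python. This is a minimal sketch that assumes the executable is named `tesseract`; the helper `tesseract_version` is a hypothetical name introduced for this example.

```python
import shutil
import subprocess

def tesseract_version(executable="tesseract"):
    """Return the first line printed by `<executable> --version`, or None if unavailable."""
    path = shutil.which(executable)
    if path is None:
        return None  # the executable is not on PATH
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some Tesseract builds print the version banner to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None

print(tesseract_version() or "tesseract was not found on PATH")
```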

<h3>What is Pytesseract?</h3>

<p>Python-tesseract (pytesseract) is a wrapper module for Google's Tesseract-OCR engine. It requires Python 3; check the project page for the exact minimum version. Once Tesseract is installed, install this third-party module with a single pip command and start experimenting!</p>

<h3>How to install Pytesseract?</h3>

<ol>
    <li>Run the command <span style="background-color: lightblue;">pip install pytesseract</span> and wait for the installation to finish.</li>
    <li>Then open a Python shell and type <span style="background-color: lightblue;">import pytesseract</span> to check that pytesseract was installed correctly.</li>
</ol>
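
The import check can be scripted as well. The sketch below uses only the standard library to report whether the modules used in this notebook can be found; `is_installed` is a hypothetical helper name.

```python
import importlib.util

def is_installed(module_name):
    """Return True if the named top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None

# Top-level module names of the packages used in this notebook.
for name in ("pytesseract", "cv2", "PIL"):
    print(name, "installed" if is_installed(name) else "missing")
```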

### Import Modules

In [4]:
import os # To import test image files
import cv2 # To work with opencv images
from PIL import Image # Image submodule to work with pillow images
import pytesseract as pt # pytesseract module

#### After importing the modules, load the test image files

In [5]:
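test_image_path = "Images"
def create_path(f):
    """Build the full path of a file inside the test image folder."""
    return os.path.join(test_image_path, f)
# Sort the names so that indexing into the list is reproducible
test_image_files = sorted(os.listdir(test_image_path))
for f in test_image_files:
    print(f)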


1.jpeg
2.jpeg
WhatsApp Image 2022-03-05 at 4.39.10 PM.jpeg
WhatsApp Image 2022-03-05 at 4.39.11 PM.jpeg
WhatsApp Image 2022-03-05 at 4.39.12 PM.jpeg
WhatsApp Image 2022-03-05 at 4.39.13 PM.jpeg
WhatsApp Image 2022-03-05 at 4.39.14 PM.jpeg
WhatsApp Image 2022-03-05 at 4.39.15 PM.jpeg
WhatsApp Image 2022-03-05 at 4.39.16 PM.jpeg
WhatsApp Image 2022-03-05 at 4.39.19 PM.jpeg
WhatsApp Image 2022-03-05 at 4.39.20 PM.jpeg
WhatsApp Image 2022-03-05 at 4.39.21 PM.jpeg
WhatsApp Image 2022-03-05 at 4.39.22 PM.jpeg
WhatsApp Image 2022-03-05 at 4.46.18 PM.jpeg
WhatsApp Image 2022-03-05 at 4.46.19 PM (1).jpeg
WhatsApp Image 2022-03-05 at 4.46.19 PM.jpeg
WhatsApp Image 2022-03-05 at 4.46.20 PM (1).jpeg
WhatsApp Image 2022-03-05 at 4.46.20 PM (2).jpeg
WhatsApp Image 2022-03-05 at 4.46.20 PM.jpeg
WhatsApp Image 2022-03-05 at 4.46.21 PM (1).jpeg
WhatsApp Image 2022-03-05 at 4.46.21 PM.jpeg
WhatsApp Image 2022-03-05 at 4.46.23 PM.jpeg
WhatsApp Image 2022-03-05 at 4.46.24 PM.jpeg
WhatsApp Image 2022-03-05
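
A folder like this can also contain files that are not images. As a small sketch, the helper below keeps only names with common image extensions; `list_image_files` and the `IMAGE_EXTENSIONS` tuple are assumptions for this example, not part of any library.

```python
import os

# Extensions treated as images in this sketch; extend the tuple as needed.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp")

def list_image_files(directory, extensions=IMAGE_EXTENSIONS):
    """Return sorted names of regular files in `directory` with an image extension."""
    return sorted(
        name
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
        and name.lower().endswith(extensions)
    )

if os.path.isdir("Images"):
    print(list_image_files("Images"))
```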

In [6]:
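# Function to show an image in an OpenCV window
def show_image(img_path, size=(500, 500)):
    image = cv2.imread(img_path)
    if image is None:  # cv2.imread returns None when the file cannot be read
        raise FileNotFoundError(f"Could not read image: {img_path}")
    image = cv2.resize(image, size)

    cv2.imshow("IMAGE", image)
    cv2.waitKey(0)  # wait for a key press before closing the window
    cv2.destroyAllWindows()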
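
Resizing every image to a fixed 500 x 500 window distorts images that are not square. One possible alternative, sketched here in plain Python, is to compute a target size that preserves the aspect ratio; `fit_within` is a hypothetical helper name.

```python
def fit_within(width, height, max_size=(500, 500)):
    """Scale (width, height) down to fit inside max_size, preserving aspect ratio."""
    max_w, max_h = max_size
    scale = min(max_w / width, max_h / height, 1.0)  # never upscale
    return max(1, round(width * scale)), max(1, round(height * scale))

print(fit_within(1000, 500))  # (500, 250)
```

The result is a `(width, height)` pair, which is the order `cv2.resize` expects for its size argument; the original dimensions can be read from `image.shape[:2]`, which holds height first and then width.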

<u>
<h4>Configure the Tesseract path in code (no need to add it to the PATH explicitly)</h4>
</u>

In [7]:
# Optional: only needed if tesseract is not on your PATH
pt.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'  # full path to the tesseract executable
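
When the same notebook runs on several machines, the path can be resolved at run time instead of being hard-coded. The sketch below tries a list of candidate locations and falls back to whatever is on PATH. The helper `resolve_tesseract_cmd` is a hypothetical name, and the candidate paths are only the two example locations mentioned in this notebook; adjust them for your machine.

```python
import os
import shutil

def resolve_tesseract_cmd(candidates=()):
    """Return the first candidate that exists as a file, else the PATH lookup (or None)."""
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return shutil.which("tesseract")

# Example install locations only; they may not match your system.
CANDIDATES = [r"C:\Program Files\Tesseract-OCR\tesseract.exe", "/usr/bin/tesseract"]

cmd = resolve_tesseract_cmd(CANDIDATES)
try:
    import pytesseract
except ImportError:
    pytesseract = None  # pytesseract is not installed, so there is nothing to configure

if pytesseract is not None and cmd is not None:
    pytesseract.pytesseract.tesseract_cmd = cmd
print(cmd)
```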

<u>
<h4>Check out available languages</h4>
</u>
<p>Look <a href='https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html'>here</a> for the languages available in Tesseract and their codes.</p>
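
To see which language packs are installed locally, the command line offers `tesseract --list-langs`. The sketch below runs that command and parses its output; it assumes the output consists of a header line starting with "List of" followed by one language code per line. Both helper names are hypothetical.

```python
import shutil
import subprocess

def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if line and not line.startswith("List of"):
            codes.append(line)
    return codes

def installed_languages(executable="tesseract"):
    """Return installed language codes, or an empty list if Tesseract is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return []
    result = subprocess.run([path, "--list-langs"], capture_output=True, text=True)
    return parse_list_langs(result.stdout or result.stderr)

print(installed_languages())
```

Recent pytesseract releases also expose `pt.get_languages()`, which serves the same purpose.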

<u>
<h4>Extract text from an image : Simple</h4>
</u>

In [8]:
image_path = test_image_files[1]  # other indices tried: 2, 3, 12, 1, 13, 15
path = create_path(image_path)

image = Image.open(path)
text = pt.image_to_string(image)

print(text)
#show_image(path)

. L

oy
& DUTCH BANGLA BANK
plh Teihl

Transaction Sum be .
Custe __cQgrff i enterprise
ustomer 1 094643

Transaction Type : Fund Transfer po- 12731
Transaction ID : 102674119

Date : December 02, 2021
Initiator : 01711486765

From Account : 7017512561124
Account Name : MD.FAZLERABBI

To Account : 1031100022462
Account Name : RICECO INTERNATIONAL
Amount :27076.0

Fee :54.15

Generated Time  : 2021/12/02 14:40:55

Teller Signature Customer Signature

This is a system generated report of DBBL
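
Most of the useful lines in this output follow a "label : value" pattern, so a small regular expression can pull them into structured data. This is only a sketch under that assumption: `parse_fields` and `FIELD_PATTERN` are hypothetical names, the sample lines are copied from the output above, and noisy OCR lines may still need manual cleanup.

```python
import re

# A label made of letters and spaces, a colon, then a non-empty value.
FIELD_PATTERN = re.compile(r"^\s*([A-Za-z][A-Za-z ]*?)\s*:\s*(\S.*?)\s*$")

def parse_fields(ocr_text):
    """Collect 'label : value' lines from OCR text as a list of (label, value) pairs."""
    fields = []
    for line in ocr_text.splitlines():
        match = FIELD_PATTERN.match(line)
        if match:
            fields.append((match.group(1), match.group(2)))
    return fields

sample = """Transaction ID : 102674119

Date : December 02, 2021
Amount :27076.0
Generated Time  : 2021/12/02 14:40:55
Teller Signature Customer Signature"""

for label, value in parse_fields(sample):
    print(f"{label} -> {value}")
```

A list of pairs is used instead of a dictionary because a label such as "Account Name" appears more than once in the receipt.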



<u>
    <h4>Extract text from an image : Specifying a language</h4>
</u>
<p>Check out <a href='https://github.com/tesseract-ocr/tessdata/tree/3.04.00'>here</a> to download different language data files.</p>

In [9]:
path = create_path("1.jpeg")

image = Image.open(path)
text = pt.image_to_string(image, lang='eng')

print(text)
show_image(path)

JUTCH-BANGLA BANK LIMITED
A\CADEMY MR FT-2 CHUA
IME TERMINAL 1D
2/21 09:34:01 25134094
M/S Ami
CARD NO # XXXXXKXXXX)\‘XXXX‘J{LvIQ
09)S7Y|

TXN NO. 00325472 RESP CODE: O

FUNDS TRANSFER

FR A/C#: 1351010120022

TO A/C#: 1031100022462

TXN AMOUNT: TK 52.00 52_}1"

AVAIL BAL : TK. AT

Cods 13 .

EMV APP ID: F0504442420110
THANK YOU FOR USING
DBBL NETWORK
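
Tesseract also accepts several languages at once, written as codes joined with `+`. The sketch below builds such a string; `build_lang` is a hypothetical helper, and the `ben` (Bengali) code is used purely as an illustration that assumes the matching language data file is installed.

```python
def build_lang(codes):
    """Join Tesseract language codes into the 'a+b' form accepted by the lang argument."""
    cleaned = [code.strip() for code in codes if code and code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)

print(build_lang(["eng", "ben"]))  # eng+ben
```

The result could then be passed on as `pt.image_to_string(image, lang=build_lang(["eng", "ben"]))`.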


