<center>
    <u><h2>Image Text Extraction With Pytesseract</h2></u>
</center>

<h3>What is Tesseract?</h3>

<p>Tesseract is an Optical Character Recognition (OCR) engine that extracts written text from many types of image files. It was originally developed at <b>Hewlett-Packard</b> and later open-sourced, with <b>Google</b> sponsoring its development for many years. Text can be extracted in two ways: from the command line or through API bindings.</p>

<p>Tesseract has Unicode (UTF-8) support and can recognize more than <b>100 languages</b>.</p>

<p>Tesseract supports several output formats, such as <b>plain text</b>, <b>HTML</b> (hOCR), <b>PDF</b>, <b>TSV</b> and <b>XML</b> (ALTO).</p>

<p>The accuracy and integrity of Tesseract's output depend heavily on the <u>quality of the input image</u>.</p>

<h3>How to install Tesseract?</h3>

<ol>
    <li>Download and install the relevant binary for your operating system from <a href='https://github.com/tesseract-ocr/tessdoc#500x'>here</a>.</li>
    <li>Add the installation directory to the PATH, <b>e.g. </b><i>C:\Program Files\Tesseract-OCR</i>.</li>
    <li>Run the command <span style="background-color: lightblue;">tesseract --version</span> to check the status of your installation.</li>
</ol>
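
<p>The last step can also be checked from Python. The following is a minimal sketch that uses only the standard library; the helper names <i>find_tesseract</i> and <i>tesseract_version</i> are made up for this example and are not part of Tesseract or pytesseract.</p>

```python
import shutil
import subprocess


def find_tesseract(command="tesseract"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(command)


def tesseract_version(command="tesseract"):
    """Return the first line printed by `<command> --version`, or None if it cannot be found."""
    executable = find_tesseract(command)
    if executable is None:
        return None
    result = subprocess.run([executable, "--version"], capture_output=True, text=True)
    # Some builds print the version banner to stderr rather than stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


# Example usage:
# print(find_tesseract() or "tesseract was not found on PATH")
# print(tesseract_version())
```

<p>If <i>find_tesseract()</i> returns <i>None</i>, the installation directory is most likely missing from the PATH.</p>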

<h3>What is Pytesseract?</h3>

<p>Python-tesseract (pytesseract) is a Python wrapper module for the Tesseract OCR engine. It runs on Python 3, although recent releases no longer support the earliest 3.x versions. Once Tesseract is installed, install this 3<sup>rd</sup> party module with a single pip command and play around!</p>

<h3>How to install Pytesseract?</h3>

<ol>
    <li>Run the command <span style="background-color: lightblue;">pip install pytesseract</span> and wait for the installation to finish.</li>
    <li>Then open your Python shell and type <span style="background-color: lightblue;">import pytesseract</span> to check that the installation succeeded.</li>
</ol>
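
<p>The same check can be scripted. This is a small sketch built on the standard library only; <i>module_available</i> and <i>package_version</i> are illustrative helper names, not pytesseract functions.</p>

```python
import importlib.util
from importlib import metadata


def module_available(module_name):
    """Return True if `import module_name` would find a module."""
    return importlib.util.find_spec(module_name) is not None


def package_version(package_name):
    """Return the installed version of a pip package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Example usage:
# print(module_available("pytesseract"), package_version("pytesseract"))
```

<p>Note that the import name and the pip name can differ: OpenCV is imported as <i>cv2</i> but installed as <i>opencv-python</i>, and Pillow is imported as <i>PIL</i>.</p>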

<u>
<h4>Import modules</h4>
</u>

In [1]:
import os # To import test image files
import cv2 # To work with opencv images
from PIL import Image # Image submodule to work with pillow images
import pytesseract as pt # pytesseract module

In [2]:
test_img_path = "../test images/"
create_path = lambda f : os.path.join(test_img_path, f)

test_image_files = os.listdir(test_img_path)

for f in test_image_files:
    print(f)

abc-text.jpg
bound-text-1.jpg
bound-text-2.jpg
contact-1.jpg
hello-text.jpg
hindi-news-1.jpg
hindi-news-2.jpg
hindi-text-1.jpg
hindi-text-2.jpg
image-paths.txt
jap-text-1.png
jap-text-2.png
letter-1.png
magazine-1.jpg
news-1.png
news-2.jpg
portu-text-1.jpg
portu-text-2.jpg
selfie-circle.jpg
sin-text-1.gif
sin-text-2.gif
span-text-1.png
tam-text-1.png


In [3]:
def show_image(img_path, size=(500, 500)):
    image = cv2.imread(img_path)
    image = cv2.resize(image, size)
    
    cv2.imshow("IMAGE", image)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

<u>
<h4>Configure the tesseract path in code (no need to add it to the PATH explicitly)</h4>
</u>

In [None]:
# optional: only if you haven't configured PATH
#pt.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe" # provide full path to tesseract.exe
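
<p>One possible way to make this setting portable is to prefer the executable on the PATH and fall back to a list of known locations. The helper <i>resolve_tesseract_cmd</i> below is a hypothetical name for this sketch, and the candidate paths are whatever you choose to pass in.</p>

```python
import os
import shutil


def resolve_tesseract_cmd(candidates, command="tesseract"):
    """Prefer the executable on PATH; otherwise return the first candidate file that exists."""
    on_path = shutil.which(command)
    if on_path:
        return on_path
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


# Example usage (assumes pytesseract was imported as pt, as in this notebook):
# cmd = resolve_tesseract_cmd([r"C:\Program Files\Tesseract-OCR\tesseract.exe"])
# if cmd is not None:
#     pt.pytesseract.tesseract_cmd = cmd
```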

<u>
<h4>Check available languages</h4>
</u>
<p>Check out <a href='https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html'>here</a> to learn about the languages available in Tesseract and their codes.</p>


In [11]:
# using cmd: tesseract --list-langs
avb_langs = pt.get_languages(config='')

for lang in avb_langs:
    print(lang)

print(len(avb_langs))

afr
amh
ara
asm
aze
aze_cyrl
bel
ben
bod
bos
bre
bul
cat
ceb
ces
chi_sim
chi_sim_vert
chi_tra
chi_tra_vert
chr
cos
cym
dan
deu
div
dzo
ell
eng
enm
epo
equ
est
eus
fao
fas
fil
fin
fra
frk
frm
fry
gla
gle
glg
grc
guj
hat
heb
hin
hrv
hun
hye
iku
ind
isl
ita
ita_old
jav
jpn
jpn_vert
kan
kat
kat_old
kaz
khm
kir
kmr
kor
lao
lat
lav
lit
ltz
mal
mar
mkd
mlt
mon
mri
msa
mya
nep
nld
nor
oci
ori
osd
pan
pol
por
pus
que
ron
rus
san
sin
slk
slv
snd
spa
spa_old
sqi
srp
srp_latn
sun
swa
swe
syr
tam
tat
tel
tgk
tha
tir
ton
tur
uig
ukr
urd
uzb
uzb_cyrl
vie
yid
yor
124
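
<p>Before running OCR it can help to confirm that the language data you need is in this list. The sketch below assumes a language request written in Tesseract's <i>lang</i> format, where several codes are joined with <i>+</i> (for example <i>hin+eng</i>); the function name <i>missing_languages</i> is illustrative.</p>

```python
def missing_languages(requested, available):
    """Return the codes in a request such as 'hin+eng' that are not in `available`."""
    codes = [code for code in requested.split("+") if code]
    installed = set(available)
    return [code for code in codes if code not in installed]


# Example usage with the list from the cell above:
# print(missing_languages("hin+eng", avb_langs))
```

<p>An empty list means every requested language is installed.</p>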


<u>
<h4>Extract text from an image : Simple</h4>
</u>

In [5]:
image_path = test_image_files[12] # letter-1.png in the listing above (other indices to try: 2, 3, 1, 13, 15)
path = create_path(image_path)

image = Image.open(path)
text = pt.image_to_string(image)

print(text)
show_image(path)

Dear David,

It's been a long time since we saw each
other. Do you remember when we met
in September, 2020 in Toronto?

Well, so much has happened since then
and I'm writing to tell you about my »

good news!

My Exam

with
Jonathan




In [10]:
# Print the eighth line (index 7) of the extracted text
print(text.split('\n')[7])

and I'm writing to tell you about my »
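
<p>Because the raw output contains blank lines, a fixed index is fragile. A small sketch that keeps only the non-empty lines (the helper name <i>non_empty_lines</i> is made up for this example):</p>

```python
def non_empty_lines(text):
    """Split OCR output into lines, dropping blank lines and surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Example usage with the text extracted above:
# lines = non_empty_lines(text)
# print(lines[0])
```

<p>With the letter above, <i>lines[0]</i> would be the greeting rather than an arbitrary position in the raw string.</p>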


<u>
    <h4>Extract text from an image : Specifying a language</h4>
</u>
<p>Check out <a href='https://github.com/tesseract-ocr/tessdata/tree/3.04.00'>here</a> to download different language data files.</p>

In [13]:
path = create_path("hindi-news-2.jpg")

image = Image.open(path)
text = pt.image_to_string(image, lang='hin')

print(text)
show_image(path)

सुखद : आसन वेटलैंड में पलास
फिश ईगल की नेस्टिंग शुरू

जागरण संवाददाता, विकासनगर :देश के पहले
कंजरवेशन रिजर्व आसन वेटलैंड में प्रवास पर आए
पलास फिश ईगल ( वैज्ञानिक नाम हालियाइटस
ल्युकॉरीपस ) के जोड़े ने सेमल के पेड़ पर
आशियाना तैयार करना शुरू कर दिया है। चकराता
बन प्रभाग इसे एक सुखद संकेत मान रहा है। वहीं
आसन नमभूमि में प्रवासी परिंदों की संख्या बढ़कर
पांच हजार के करीब हो गई है ।शिकारियों पर अंकुश
को सशस्त्र वन टीम की रात दिन गश्त चल रही है।
रविवार को जीएमवीएन के आसन पर्यटन स्थल पर
आए पक्षी प्रेमियों व पर्यटकों ने बोटिंग के साथ ही बर्ड
वाचिंग का भी आनंद लिया।

डुर्लभ प्रजाति के पलाश फिश ईगल मुख्य रूप से
कजाकिस्तान, मंगोलिया, बंग्लादेश आदि देशों से
प्रवास पर आते हैं ।पिछले 50 साल से पलास फिश
ईगल का जोड़ा प्रवास पर आ रहा है। इस बार जोड़े ने
नमभूमि क्षेत्र के आरक्षित वन क्षेत्र में सेमल के पेड़
पर आशियाना तैयार करना शुरू कर दिया है। आसन
रेंजर जवाहर सिंह तोमर व वन बीट अधिकारी प्रदीप
सक्सेना ने लंबे समय बाद आशियाना तैयार कर रहे
पलाश फिश ईगल की गतिविधियों पर ध्यान देना
शुरू कर दिया है। प्रभाग

<u>
    <h4>Extract text from an image : Multiple images at once</h4>
</u>

In [14]:
# A text file that lists image file paths can be passed in place of a single image
img_name_txt_file = "../test images/image-paths.txt"
text = pt.image_to_string(img_name_txt_file, lang='jpn')

print(text)

未練なく散も

without regret

日本語の表記においては, 漢字や仮名だけで
なく, ローマ字やアラビア数字, さらに名読
点や括弧類などの記述記号を用いる。 これら
を組み合わせて表す日本語の文書では, 表記
上における種々の問題がある。
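
<p>The contents of <i>image-paths.txt</i> are not shown in this notebook. Assuming it simply holds one image path per line, such a file could be generated with a sketch like the following; <i>write_image_list</i> is a hypothetical helper.</p>

```python
def write_image_list(image_paths, list_file):
    """Write one image path per line (an assumed layout) and return how many were written."""
    paths = [p for p in image_paths if p.strip()]
    with open(list_file, "w", encoding="utf-8") as handle:
        for path in paths:
            handle.write(path + "\n")
    return len(paths)


# Example usage:
# write_image_list(["jap-text-1.png", "jap-text-2.png"], "image-paths.txt")
```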



<u>
    <h4>Extract text from an image : Timeout extraction</h4>
</u>

In [16]:
path = create_path("news-2.jpg")

image = Image.open(path)
text = 'NO TEXT TO BE APPEARED'

try:
    text = pt.image_to_string(image, lang='eng', timeout=1)
except RuntimeError:  # raised when the 1-second timeout is exceeded
    print("[TIMEOUT ERROR]")

print(text)
show_image(path)

[TIMEOUT ERROR]
NO TEXT TO BE APPEARED
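
<p>The try/except pattern above can be wrapped in a reusable helper. In this sketch the OCR function is passed in as an argument (for example <i>pt.image_to_string</i>), so the helper itself does not depend on pytesseract; <i>ocr_or_default</i> is an illustrative name.</p>

```python
def ocr_or_default(ocr_func, image, default="", **kwargs):
    """Call ocr_func(image, **kwargs) and return `default` if it raises RuntimeError."""
    try:
        return ocr_func(image, **kwargs)
    except RuntimeError:
        # As in the cell above, a timeout surfaces as RuntimeError.
        return default


# Example usage:
# text = ocr_or_default(pt.image_to_string, image, default="", lang="eng", timeout=1)
```

<p>Catching <i>RuntimeError</i> may also hide other failures reported by Tesseract, so log the exception if you need to tell them apart.</p>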


<u>
    <h4>Get bounding box estimates</h4>
</u>

In [None]:
path = create_path("jap-text-1.png")

image = Image.open(path)
bound_rects = pt.image_to_boxes(image, lang='jpn')

print(bound_rects)
show_image(path)

In [None]:
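img = cv2.imread(path)
h, _, _ = img.shape  # image height, needed to flip the y axis

# Each line reads "<char> <left> <bottom> <right> <top> <page>", with the origin at the bottom-left
for b in bound_rects.splitlines():
    b = b.strip().split(' ')
    # OpenCV's origin is the top-left corner, so y values are subtracted from the height
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)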

cv2.imshow("CHARACTERIZED IMAGE", img)
cv2.waitKey(0)
cv2.destroyAllWindows()
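
<p>The coordinate flip used in the loop above can be separated from the drawing code. The sketch below assumes the box format <i>char left bottom right top page</i> with a bottom-left origin and converts each box to top-left-origin coordinates; <i>parse_boxes</i> is a hypothetical helper.</p>

```python
def parse_boxes(box_text, image_height):
    """Convert box lines into (char, left, top, right, bottom) tuples with a top-left origin."""
    boxes = []
    for line in box_text.splitlines():
        # Split from the right so the character itself is never broken up.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip lines that do not match the expected layout
        char, left, bottom, right, top, _page = parts
        boxes.append((char, int(left), image_height - int(top),
                      int(right), image_height - int(bottom)))
    return boxes


# Example usage with the variables from the cells above:
# for char, x1, y1, x2, y2 in parse_boxes(bound_rects, h):
#     cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
```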

<u>
    <h4>Get verbose data including boxes, confidences, line and page numbers</h4>
</u>

In [None]:
image_path = test_image_files[2]
path = create_path(image_path)

image = Image.open(path)
text = pt.image_to_data(image)

print(text)
show_image(path)
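
<p>The string returned by <i>image_to_data</i> is tab-separated, so it can be filtered with the standard <i>csv</i> module. This sketch assumes the usual header with <i>conf</i> and <i>text</i> columns and keeps only words above a confidence threshold; the function name and the default threshold of 60 are arbitrary choices for illustration.</p>

```python
import csv
import io


def confident_words(tsv_text, min_conf=60.0):
    """Return (word, confidence) pairs whose confidence is at least min_conf."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        word = (row.get("text") or "").strip()
        try:
            conf = float(row.get("conf") or -1)
        except ValueError:
            continue  # skip rows with a malformed confidence value
        # Structural rows (page, block, line) carry conf -1 and no text.
        if word and conf >= min_conf:
            words.append((word, conf))
    return words


# Example usage with the output of the cell above:
# print(confident_words(text, min_conf=80))
```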

<u>
    <h4>Get information about orientation and script detection</h4>
</u>

In [None]:
image_path = "hindi-text-1.jpg" # news-2.jpg hindi-news-1.jpg hindi-news-2.jpg hindi-text-1.jpg
path = create_path(image_path)

print(pt.image_to_osd(path, lang='hin'))
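
<p>The OSD report is plain text made of <i>key: value</i> lines (for example <i>Rotate: 90</i> or <i>Script: Latin</i>). A minimal sketch for turning it into a dictionary, assuming that layout; <i>parse_osd</i> is an illustrative name and all values are kept as strings.</p>

```python
def parse_osd(osd_text):
    """Parse 'key: value' lines from an OSD report into a dictionary of strings."""
    info = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines without a colon
            info[key.strip()] = value.strip()
    return info


# Example usage:
# osd = parse_osd(pt.image_to_osd(path, lang='hin'))
# print(osd.get("Rotate"), osd.get("Script"))
```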

<u>
    <h4>Convert into different file formats (PDF, XML, HOCR)</h4>
</u>

In [None]:
image_path = "news-2.jpg"
path = create_path(image_path)
file_save_path = "../files/"

In [None]:
pdf = pt.image_to_pdf_or_hocr(path, extension='pdf')

# The context manager closes the file even if the write fails
with open(os.path.join(file_save_path, "pdf-content.pdf"), 'wb') as file:
    file.write(pdf)

In [None]:
# hocr: open standard of data representation for formatted text obtained from (OCR)
hocr = pt.image_to_pdf_or_hocr(path, extension='hocr')

# The context manager closes the file even if the write fails
with open(os.path.join(file_save_path, "hocr-content.html"), 'wb') as file:
    file.write(hocr)

In [None]:
xml = pt.image_to_alto_xml(path)

# The context manager closes the file even if the write fails
with open(os.path.join(file_save_path, "xml-content.xml"), 'wb') as file:
    file.write(xml)

<u>
    <h4>Forcefully assigning different assumptions (Custom Configurations)</h4>
</u>

<b>OEM</b> : OCR Engine Mode (the type of algorithm used by Tesseract)<br>
<b>PSM</b> : Page Segmentation Mode (the page segmentation mode used by Tesseract)<br><br>
 
<h4>Page Segmentation Modes</h4><hr>
<div style="font-size:13px;">

    
0 - Orientation and script detection (OSD) only.<br>
1 - Automatic page segmentation with OSD.<br>
2 - Automatic page segmentation, but no OSD, or OCR.<br>
3 - Fully automatic page segmentation, but no OSD. (Default)<br>
4 - Assume a single column of text of variable sizes.<br>
5 - Assume a single uniform block of vertically aligned text.<br>
6 - Assume a single uniform block of text.<br>
7 - Treat the image as a single text line.<br>
8 - Treat the image as a single word.<br>
9 - Treat the image as a single word in a circle.<br>
10 - Treat the image as a single character.<br>
11 - Sparse text. Find as much text as possible in no particular order.<br>
12 - Sparse text with OSD.<br>
13 - Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.<br>
</div>

In [None]:
image_path = "abc-text.jpg"
path = create_path(image_path)
custom_oem_psm_config = r'--oem 3 --psm 9'

image = Image.open(path)
pt.image_to_string(image, config=custom_oem_psm_config)
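
<p>Since the config string is easy to mistype, it can be built and validated in one place. This sketch assumes PSM values 0 to 13 as listed above and OEM values 0 to 3; <i>build_config</i> is a hypothetical helper, not a pytesseract function.</p>

```python
def build_config(oem=3, psm=3):
    """Build a '--oem N --psm M' string after checking both values are in range."""
    if oem not in range(0, 4):
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    if psm not in range(0, 14):
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--oem {oem} --psm {psm}"


# Example usage:
# pt.image_to_string(image, config=build_config(oem=3, psm=9))
```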

<h4>References</h4><hr>

<ul>
    <li><a href='https://github.com/tesseract-ocr/tesseract'>Tesseract</a></li>
    <li><a href='https://github.com/madmaze/pytesseract'>Pytesseract</a></li>
    <li><a href='https://www.py4u.net/discuss/10850'>Multiple config options</a></li>
    <li><a href='https://stackoverflow.com/questions/20831612/getting-the-bounding-box-of-the-recognized-words-using-python-tesseract'>Getting bounding box coordinates</a></li>
</ul>