<a href="https://colab.research.google.com/github/JeanJulesBigeard/Getting-started-with-OpenCV/blob/master/Text_Detection/1_Text_Recognition_Tesseract.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color="blue">Introduction to Tesseract</font>
Tesseract is an open source optical character recognition (OCR) engine used to extract text from images. It is available under the Apache 2.0 license and is one of the best free tools for performing OCR.

Tesseract was originally developed at HP between 1985 and 1998. HP open sourced it in 2005, and Google led its development from 2006 until 2018.

The latest stable version at the time of writing is v4.x, which uses an LSTM-based engine and supports many additional languages (around 116).

We will use pytesseract, a Python wrapper for Tesseract, in this course.

In this notebook, we will look at the functions it supports and how to extract text from images. We will also see how to use it for other languages.

# <font color="blue">Install Tesseract library</font>

In [0]:
!apt install -y libtesseract-dev tesseract-ocr > /dev/null





# <font color="blue">Install Python wrapper for Tesseract </font>

In [0]:
!pip install pytesseract > /dev/null

# <font color="blue">Import Libraries </font>

In [0]:
import pytesseract
import cv2
import glob
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import Image

# <font color="blue">Test Image 1 </font>
We will download a screenshot of text describing the Keras library.

In [0]:
!wget https://www.dropbox.com/s/v3z5l2mq8swea1e/keras-snapshot.jpg?dl=1 -O text1.jpg --quiet

### <font color="green">Downloaded Image</font>
![](https://www.dropbox.com/s/v3z5l2mq8swea1e/keras-snapshot.jpg?dl=1)

# <font color="blue">Perform OCR</font>
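Tesseract provides a very easy to use interface for performing OCR on images. In pytesseract, `image_to_string` takes a file path or an image object, and optional arguments such as `lang` and `config` give you control over how Tesseract runs.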

### <font color="green">Output </font>

In [0]:
text1 = pytesseract.image_to_string('text1.jpg')
print(text1)

Keras is a high-level neural networks API, written in Python and capable of running on top of
TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation.
Being able to go from idea to result with the least possible delay is key to doing good research.


# <font color="blue">Test Image 2 </font>
We will use a screenshot of a full page from Chapter 9 of the Deep Learning book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.

In [0]:
!wget https://www.dropbox.com/s/ai7dsbpsyjb2inx/cnn-snapshot.jpg?dl=1 -O text2.jpg --quiet

### <font color="green">Downloaded Image</font>
![](https://www.dropbox.com/s/ai7dsbpsyjb2inx/cnn-snapshot.jpg?dl=1)

### <font color="green">Output </font>

In [0]:
text2 = pytesseract.image_to_string('text2.jpg')
print(text2)

Chapter 9

Convolutional Networks

Convolutional networks (LeCun, 1989), also known as convolutional neural
networks, or CNNs, are a specialized kind of neural network for processing data
that has a known grid-like topology. Examples include time-series data, which can
be thought of as a 1-D grid taking samples at regular time intervals, and image data,
which can be thought of as a 2-D grid of pixels. Convolutional networks have been
tremendously successful in practical applications. The name “convolutional neural
network” indicates that the network employs a mathematical operation called
convolution. Convolution is a specialized kind of linear operation. Convolutional
networks are simply neural networks that use convolution in place of general matrix
multiplication in at least one of their layers.

In this chapter, we first describe what convolution is. Next, we explain the
motivation behind using convolution in a neural network. We then describe an
operation called pooling, which alm

### <font color="green">Observation </font>
Wow! Tesseract does a really good job even with a large body of text.

It also recognizes punctuation such as `(`, `.`, `,` and `-`.

# <font color="blue">Test Image 3</font>
So far we have seen "nice" images with uniform white backgrounds and black text. Let us make life a little harder for Tesseract with a colored background and non-uniform text.

We will use a scanned image of the back cover of the Computer Vision book by Forsyth and Ponce.

In [0]:
!wget https://www.dropbox.com/s/zrr4tvozzjbfrzv/forsyth_scan.jpg?dl=1 -O text3.jpg --quiet

### <font color="green">Downloaded Image</font>
<img src="https://www.dropbox.com/s/zrr4tvozzjbfrzv/forsyth_scan.jpg?dl=1" width=600>

### <font color="green">Output </font>

In [0]:
text3 = pytesseract.image_to_string('text3.jpg')
print(text3)

Computer Vision

A MODERN APPROACH

DAVID A. FORSYTH

University of California at Berkeley

JEAN PONCE

University of Illinois at Urbana-Champaign

 

Whether in the entertainment industry (building three-dimensiona! computer models), medical imaging,
interpreting satellite images (both for military and civilian purposes), the applications of computer
vision is varied and wide ranging. And this compact yet comprehensive text provides a survey of the
field of computer vision and views it from a modern perspective. It is self-contained, accessible, and
lays emphasis on basic geometry, physics of imaging and probabilistic techniques.

Throughout, the authors attempt to lay bare the essentials of computer vision to the students as
also to the professionals. The text reflects the latest developments in the field and integrates the
learning tools that aid understanding.

 

This uptodate, contemporary text would be useful for students of computer science, IT and MCA
offering courses in compu

### <font color="green">Observation </font>
From the above output, you can see that Tesseract preserves the capitalization and the line structure of the text, which makes it well suited to document analysis. The output is not perfect, though: `three-dimensional` is read as `three-dimensiona!`.

# <font color="blue">Output Type</font>
Before going further, it is worth noting that pytesseract returns its result as a **string by default**. Other output types are supported, depending on the function: **dictionary, bytes, or pandas DataFrame**, selected with the `output_type` argument. Changing the output format often helps when you need to analyze the result further.

# <font color="blue">Tesseract Functions </font>

- **`get_tesseract_version`** - Returns the Tesseract version installed in the system.
- **`image_to_string`** - Returns the result of a Tesseract OCR run on the image as a single string.
- **`image_to_boxes`** - Returns the recognized characters and their box boundaries.
- **`image_to_data`** - Returns the box boundaries/locations, confidences, words, etc.
- **`image_to_osd`** - Returns result containing information about orientation and script detection.
- **`image_to_pdf_or_hocr`** - Returns a searchable PDF (the default) or hOCR output from the input image.
- **`run_and_get_output`** - Returns the raw output from Tesseract OCR. This gives a bit more control over the parameters that are sent to tesseract.


### <font color="green">Check the detected characters </font>

In [0]:
boxes = pytesseract.image_to_boxes("text1.jpg")
print(boxes[:100])

K 12 65 14 76 0
e 14 65 21 76 0
r 22 65 29 73 0
a 31 65 35 73 0
s 36 65 42 73 0
i 43 65 49 73 0
s 55


You can see that the result gives the location of each recognized character, but it is hard to tell from the raw output what each column means. Let us change the output type to a dictionary, whose keys show what each column represents.

In [0]:
boxes = pytesseract.image_to_boxes("text1.jpg", output_type=pytesseract.Output.DICT)
print(boxes.keys())

dict_keys(['char', 'left', 'bottom', 'right', 'top', 'page'])


So each box is described by its (left, bottom) and (right, top) corners. Tesseract's box format measures these coordinates from the bottom-left corner of the image.
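
Because OpenCV and most drawing code place the origin at the top-left corner, the box values usually need converting before they can be drawn. The sketch below shows one way to do that in plain Python. `parse_boxes` and `to_top_left` are hypothetical helper names, the sample lines are copied from the output above, and the image height of 91 is an assumption taken from the page-level row that `image_to_data` reports for this image further down.

```python
def parse_boxes(box_text):
    """Parse lines of the form 'char left bottom right top page'."""
    boxes = []
    for line in box_text.splitlines():
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or truncated lines
        char, left, bottom, right, top, _page = parts
        boxes.append((char, int(left), int(bottom), int(right), int(top)))
    return boxes


def to_top_left(box, image_height):
    """Convert a bottom-left based box to (char, x, y, width, height)
    with the origin at the top-left corner."""
    char, left, bottom, right, top = box
    return (char, left, image_height - top, right - left, top - bottom)


# Sample copied from the printed output above; the last line is truncated.
SAMPLE_BOXES = "K 12 65 14 76 0\ne 14 65 21 76 0\nr 22 65 29 73 0\ns 55"
IMAGE_HEIGHT = 91  # assumed height of text1.jpg in pixels

converted = [to_top_left(b, IMAGE_HEIGHT) for b in parse_boxes(SAMPLE_BOXES)]
for item in converted:
    print(item)
```

Each `(x, y, width, height)` result can then be drawn with `cv2.rectangle(img, (x, y), (x + width, y + height), color, thickness)`.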

### <font color="green">Check Detected Words </font>

In [0]:
data = pytesseract.image_to_data("text1.jpg")
print(data[:500])

level	page_num	block_num	par_num	line_num	word_num	left	top	width	height	conf	text
1	1	0	0	0	0	0	0	688	91	-1	
2	1	1	0	0	0	11	14	662	63	-1	
3	1	1	1	0	0	11	14	662	63	-1	
4	1	1	1	1	0	12	14	645	15	-1	
5	1	1	1	1	1	12	15	30	11	92	Keras
5	1	1	1	1	2	43	15	21	11	95	is
5	1	1	1	1	3	69	18	7	8	95	a
5	1	1	1	1	4	82	14	66	15	95	high-level
5	1	1	1	1	5	154	15	42	11	96	neural
5	1	1	1	1	6	202	15	57	11	95	networks
5	1	1	1	1	7	260	15	34	11	95	API,
5	1	1	1	1	8	296	15	56	13	89	written
5	1	1	1	1	9	358	15	11	11	96	in
5	1


In [0]:
data = pytesseract.image_to_data("text1.jpg", output_type=pytesseract.Output.DICT)
print(data.keys())

dict_keys(['level', 'page_num', 'block_num', 'par_num', 'line_num', 'word_num', 'left', 'top', 'width', 'height', 'conf', 'text'])
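
In this table the `level` column tells you what each row describes: 1 is the page, 2 a block, 3 a paragraph, 4 a line, and 5 a word. Only word rows carry text and a real confidence; the others report `-1`. As a minimal sketch, the code below pulls the words out of the tab-separated output and filters them by confidence. `extract_words` and `tsv_line` are hypothetical helpers, and the sample rows are copied from the output printed above.

```python
import csv
import io


def tsv_line(*fields):
    """Join fields with tabs, the way image_to_data formats each row."""
    return "\t".join(str(f) for f in fields)


# A few rows copied from the image_to_data output shown above.
SAMPLE_TSV = "\n".join([
    tsv_line("level", "page_num", "block_num", "par_num", "line_num",
             "word_num", "left", "top", "width", "height", "conf", "text"),
    tsv_line(1, 1, 0, 0, 0, 0, 0, 0, 688, 91, -1, ""),     # page row
    tsv_line(4, 1, 1, 1, 1, 0, 12, 14, 645, 15, -1, ""),   # line row
    tsv_line(5, 1, 1, 1, 1, 1, 12, 15, 30, 11, 92, "Keras"),
    tsv_line(5, 1, 1, 1, 1, 2, 43, 15, 21, 11, 95, "is"),
    tsv_line(5, 1, 1, 1, 1, 3, 69, 18, 7, 8, 95, "a"),
])


def extract_words(tsv_text, min_conf=0.0):
    """Return word-level rows (level 5) whose confidence is at least min_conf."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row["text"] or "").strip()
        conf = float(row["conf"])
        if int(row["level"]) == 5 and text and conf >= min_conf:
            box = tuple(int(row[k]) for k in ("left", "top", "width", "height"))
            words.append({"text": text, "conf": conf, "box": box})
    return words


words = extract_words(SAMPLE_TSV)
print(" ".join(w["text"] for w in words))
```

In the notebook you would pass the full string returned by `pytesseract.image_to_data("text1.jpg")` instead of `SAMPLE_TSV`. Unlike `image_to_boxes`, these boxes are `(left, top, width, height)` measured from the top-left corner.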


### <font color="green">Create Searchable PDF from Image </font>
`image_to_pdf_or_hocr` returns the PDF as raw bytes, so we write it to a file in binary mode.

In [0]:
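# image_to_pdf_or_hocr returns the file content as bytes ('pdf' is the default extension)
image2pdf = pytesseract.image_to_pdf_or_hocr('text2.jpg', extension='pdf')
# Write in binary mode because the result is bytes, not text
with open('text2.pdf', 'wb') as f:
    f.write(image2pdf)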

Now download the PDF and check for yourself whether it is just a scanned image or is searchable.

### <font color="green">Check Orientation of the Text</font>

In [0]:
!wget https://wiki.openoffice.org/w/images/c/c2/WG3Ch7F14.png -O text3.png --quiet

### <font color="green">Downloaded Image</font>
![](https://wiki.openoffice.org/w/images/c/c2/WG3Ch7F14.png)

### <font color="green">Output </font>

In [0]:
osd = pytesseract.image_to_osd("text3.png")
print(osd)

Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 0.38
Script: Latin
Script confidence: 3.33


You can see that the above document is rotated by 270 (or -90) degrees and that Tesseract has detected this correctly. Note, however, that the reported orientation confidence (0.38) is low.
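
To act on this result in code, it helps to turn the text into a dictionary first. The following is a minimal sketch that parses the `key: value` lines with a regular expression; `parse_osd` is a hypothetical helper, and the sample string is copied from the output above. pytesseract can also return a dictionary directly through `output_type`, so treat this as an illustration of the format rather than the recommended route.

```python
import re

# OSD text copied from the output above.
SAMPLE_OSD = """Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 0.38
Script: Latin
Script confidence: 3.33
"""


def parse_osd(osd_text):
    """Parse 'key: value' lines, converting numeric values to int or float."""
    info = {}
    pattern = r"^([^:\n]+):\s*(.+)$"
    for key, value in re.findall(pattern, osd_text, flags=re.MULTILINE):
        value = value.strip()
        try:
            info[key] = int(value)
        except ValueError:
            try:
                info[key] = float(value)
            except ValueError:
                info[key] = value  # keep non-numeric values such as the script name
    return info


osd_info = parse_osd(SAMPLE_OSD)
print(osd_info["Orientation in degrees"], osd_info["Rotate"], osd_info["Script"])
```

A practical use is to check `osd_info["Orientation confidence"]` against a threshold of your choosing before rotating the image by `osd_info["Rotate"]` degrees.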

# <font color="blue">What about a different language? </font>
You can check out the list of supported languages [**here**](https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages).
We will see how to use Tesseract for the German language.

In [0]:
!apt install -y tesseract-ocr-deu > /dev/null





# <font color="blue">Test Image 4 </font>

In [0]:
!wget https://www.dropbox.com/s/geevhmy62dy4pzh/german.jpg?dl=1 -O german.jpg --quiet

### <font color="green">Downloaded Image</font>
<img src="https://www.dropbox.com/s/geevhmy62dy4pzh/german.jpg?dl=1" width=300>

### <font color="green">Using Tesseract trained for English </font>

In [0]:
text4 = pytesseract.image_to_string('german.jpg',lang='eng')
print(text4)

English - detected ~ rod German ~

Can we check x K6nnen wir

whether Uberpriifen, ob
Tesseract Tesseract Deutsch
understand versteht?

German?


### <font color="green">Using Tesseract trained for German </font>

In [0]:
text4_german = pytesseract.image_to_string('german.jpg',lang='deu')
print(text4_german)

English - detected = eo German =

Can we check x Können wir

whether überprüfen, ob
Tesseract Tesseract Deutsch
understand versteht?

German?


### <font color="green">Observation </font>
With the English model, Tesseract detects most words correctly, but the German-specific characters (umlauts) are lost. For example:
1. **`überprüfen`** is detected as **`Uberpriifen`**
2. **`Können`** is detected as **`K6nnen`**

The German model recognizes both words correctly.
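
Spotting such differences by eye gets tedious on longer texts. As a rough sketch, the two results can be compared word by word with Python's standard `difflib` module. `changed_words` is a hypothetical helper, and the two strings are copied from the outputs above so that the block runs on its own; in the notebook you would pass `text4` and `text4_german` instead.

```python
import difflib

# OCR results copied from the two runs above.
ocr_eng = """English - detected ~ rod German ~

Can we check x K6nnen wir

whether Uberpriifen, ob
Tesseract Tesseract Deutsch
understand versteht?

German?"""

ocr_deu = """English - detected = eo German =

Can we check x Können wir

whether überprüfen, ob
Tesseract Tesseract Deutsch
understand versteht?

German?"""


def changed_words(a, b):
    """Return (old, new) pairs for the word runs that differ between a and b."""
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, a_words, b_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a_words[i1:i2]), " ".join(b_words[j1:j2])))
    return changes


for old, new in changed_words(ocr_eng, ocr_deu):
    print(f"{old!r} -> {new!r}")
```

Besides the two German words, the comparison also surfaces the stray symbols that both models produce for the icons in the screenshot.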

# <font color="blue">Tesseract OCR on Natural Scene Images</font>
We have seen how Tesseract performs on scanned documents. Natural scene images are more challenging because they vary widely: low image quality, poor lighting, occlusion, distortion, and so on.

# <font color="blue">Test Image 5</font>
We will use a camera photo of the same book cover that we scanned earlier.

In [0]:
!wget https://www.dropbox.com/s/jat0z82d76zlkjg/book1.jpg?dl=1 -O book1.jpg --quiet

### <font color="green">Downloaded Image</font>
<img src="https://www.dropbox.com/s/jat0z82d76zlkjg/book1.jpg?dl=1" width=600>

In [0]:
text5 = pytesseract.image_to_string('book1.jpg')
print(text5)

Computer Vision
A MODERN APPROACH

DAVID A. FORSYTH

University of California at Berkeley

JEAN PONCE

University of Illinois at Urbana-Champaign

Whether in the entertainment industry (building three-dimensiona! computer models), medical imaging,
interpreting satellite images (both for military and civilian purposes), the applications of computer
vision is varied and wide ranging. And this compact yet comprehensive text provides a survey of the
field of computer vision and views it from a modern perspective. It is self-contained, accessible, and
lays emphasis on basic geometry, physics of imaging and probabilistic techniques.

Throughout, the authors attempt to lay bare the essentials of computer vision to the students as
also to the professionals. The text reflects the latest developments in the field and integrates the
learning tools that aid understanding.

This uptodate, contemporary text would be useful for students of computer science, IT and MCA
offering courses in computer gra

### <font color="green">Observation</font>
Even though this is a natural image, Tesseract performs OCR with almost no errors. This is good, but you will be surprised by how quickly the output deteriorates with small changes in the image.
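
One way to put a number on "almost no errors" is to compare the camera result with the scan result from Test Image 3. The sketch below uses `difflib` from the standard library and ignores differences in whitespace and line breaks. `normalize` and `similarity` are hypothetical helpers, and the two excerpts are the opening lines of the outputs shown earlier; in the notebook you would call `similarity(text3, text5)` on the full strings.

```python
import difflib


def normalize(text):
    """Collapse all runs of whitespace (including line breaks) to single spaces."""
    return " ".join(text.split())


def similarity(a, b):
    """Return a ratio between 0 and 1; 1.0 means identical after normalization."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b),
                                      autojunk=False)
    return matcher.ratio()


# Opening lines of the scan output and the camera output shown in this notebook.
scan_excerpt = "Computer Vision\n\nA MODERN APPROACH\n\nDAVID A. FORSYTH"
photo_excerpt = "Computer Vision\nA MODERN APPROACH\n\nDAVID A. FORSYTH"

print(similarity(scan_excerpt, photo_excerpt))
```

A ratio well below 1.0 on the full strings would indicate that the camera image cost Tesseract some accuracy.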


# <font color="blue">Test Image 6</font>


In [0]:
!wget https://www.dropbox.com/s/uwrdek4jjac4ysz/book2.jpg?dl=1 -O book2.jpg --quiet

### <font color="green">Downloaded Image</font>

<img src="https://www.dropbox.com/s/uwrdek4jjac4ysz/book2.jpg?dl=1" width=500>

In [0]:
text6 = pytesseract.image_to_string('book2.jpg')
print(text6)

The Impact of the Highly Improbable


### <font color="green">Observation </font>
This time Tesseract detected only the single line shown above. This needs to be fixed or at least improved. We will see how in the next section.