## What is Optical Character Recognition (OCR)?

OCR, or Optical Character Recognition, is the process of recognizing text in images and converting it into machine-readable form. The images may contain handwritten text, printed text (documents, receipts, name cards, and so on), or text in a natural scene photograph.

OCR has two parts. The first is text detection, which locates the regions of the image that contain text. This localization feeds the second part, text recognition, which extracts the characters from each detected region. Combining the two stages is how text is extracted from an image.
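
To make the two stages concrete, here is a toy sketch. It is not how EasyOCR or Tesseract work internally (real engines use learned models for both stages); the `GLYPHS` templates and the `detect` and `recognize` helpers are hypothetical names invented for this illustration. The "image" is a small grid of `#` (ink) and `.` (background).

```python
# Toy illustration only: hypothetical 3x3 templates for two characters.
GLYPHS = {
    "I": ("###", ".#.", "###"),
    "L": ("#..", "#..", "###"),
}

def detect(image):
    """Stage 1 (detection): return (start, end) column spans that contain ink."""
    width = len(image[0])
    inked = [any(row[x] == "#" for row in image) for x in range(width)]
    spans, start = [], None
    for x, has_ink in enumerate(inked + [False]):  # trailing False closes the last span
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            spans.append((start, x))
            start = None
    return spans

def recognize(image, span):
    """Stage 2 (recognition): match the cropped region against known templates."""
    start, end = span
    crop = tuple(row[start:end] for row in image)
    for char, template in GLYPHS.items():
        if crop == template:
            return char
    return "?"  # unknown glyph

image = [
    "###.#..",
    ".#..#..",
    "###.###",
]
boxes = detect(image)
text = "".join(recognize(image, box) for box in boxes)
print(boxes, text)
```

Detection only answers "where is the text"; recognition only answers "what does this region say". The libraries used below perform both steps in a single call.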

## Tutorials I followed
1. https://nanonets.com/blog/ocr-with-tesseract/
2. https://medium.com/@jaafarbenabderrazak.info/ocr-with-tesseract-opencv-and-python-d2c4ec097866
3. https://pypi.org/project/tesserocr/
4. https://github.com/JaidedAI/EasyOCR

In [1]:
# OCR using easyocr

!pip install easyocr

Collecting easyocr
  Downloading easyocr-1.1.4-py3-none-any.whl (22.5 MB)
Installing collected packages: easyocr
Successfully installed easyocr-1.1.4


In [2]:
import torch
import easyocr
import os

In [3]:
# If you have no GPU, or your GPU is low on memory,
# you can run in CPU mode by passing gpu=False:

# reader = easyocr.Reader(['en'], gpu=False)

reader = easyocr.Reader(['en'])

Downloading detection model, please wait
Download complete
Downloading recognition model, please wait
Download complete


## The output is a list; each item holds the bounding box, the text, and the confidence level, in that order.

```
[([[189, 75], [469, 75], [469, 165], [189, 165]], '愚园路', 0.3754989504814148),
 ([[86, 80], [134, 80], [134, 128], [86, 128]], '西', 0.40452659130096436),
 ([[517, 81], [565, 81], [565, 123], [517, 123]], '东', 0.9989598989486694),
 ([[78, 126], [136, 126], [136, 156], [78, 156]], '315', 0.8125889301300049),
 ([[514, 126], [574, 126], [574, 156], [514, 156]], '309', 0.4971577227115631),
 ([[226, 170], [414, 170], [414, 220], [226, 220]], 'Yuyuan Rd.', 0.8261902332305908),
 ([[79, 173], [125, 173], [125, 213], [79, 213]], 'W', 0.9848111271858215),
 ([[529, 173], [569, 173], [569, 213], [529, 213]], 'E', 0.8405593633651733)]
```
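
Because every item is a `(bounding box, text, confidence)` tuple, low-confidence detections can be dropped before the text is joined. The sketch below reuses a few entries from the sample output above; `join_confident_text` and the 0.5 cutoff are illustrative assumptions, not part of EasyOCR.

```python
def join_confident_text(results, min_confidence=0.5):
    """Join recognized strings whose confidence is at least min_confidence."""
    return " ".join(text for _bbox, text, conf in results if conf >= min_confidence)

# A few entries copied from the sample output above
sample = [
    ([[189, 75], [469, 75], [469, 165], [189, 165]], '愚园路', 0.3754989504814148),
    ([[517, 81], [565, 81], [565, 123], [517, 123]], '东', 0.9989598989486694),
    ([[78, 126], [136, 126], [136, 156], [78, 156]], '315', 0.8125889301300049),
    ([[226, 170], [414, 170], [414, 220], [226, 220]], 'Yuyuan Rd.', 0.8261902332305908),
]

filtered_text = join_confident_text(sample)
print(filtered_text)  # the first entry (confidence ~0.38) is dropped
```

A suitable cutoff depends on the images, so it is worth checking a few results by eye before fixing a value.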

In [4]:
# Image to text using easyocr
# The output is a list; each item holds the bounding box, the text, and the confidence level.

img_text = reader.readtext('../input/hackerearthimage/Test1161.jpg')
final_text = ""

for _, text, __ in img_text:  # _ = bounding box, __ = confidence level
    final_text += " "
    final_text += text
final_text

' IF I COULD HAVE CHOSEN TO BE GAY OR STRAIGHT, I THINKI WOULD HAVE SIMPLY CHOSEN TO BE HAPPY 0 k hlrp'

In [5]:
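# Function to list the files in a folder (top level only)

def traverse(directory):
    # os.walk yields (path, subdirectories, files); take only the first, top-level entry
    _path, _subdirs, files = next(os.walk(directory))
    return files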
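
If the folder may also hold non-image files, the listing can be restricted by extension. This is a sketch of one way to do that; `list_images` and the `IMAGE_EXTENSIONS` set are assumptions made for this example, not something the dataset requires.

```python
import os

# Assumed set of extensions; adjust to whatever the dataset actually contains.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def list_images(directory):
    """Return the sorted names of image files directly inside directory (no recursion)."""
    return sorted(
        entry.name
        for entry in os.scandir(directory)
        if entry.is_file()
        and os.path.splitext(entry.name)[1].lower() in IMAGE_EXTENSIONS
    )

# Usage (path taken from this notebook):
# files_list = list_images('../input/hackerearthimage')
```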

In [6]:
# Image directory and list of files

directory = '../input/hackerearthimage'
files_list = traverse(directory)

In [7]:
files_list[:4]

['Test3706.jpg', 'Test2209.jpg', 'Test449.jpg', 'Test1872.jpg']

In [8]:
# Run OCR on every image (EasyOCR uses the GPU when one is available)
# and save each image's text in a dict keyed by file name

images_text = {}
for file_name in files_list:
    img_text = reader.readtext(os.path.join(directory, file_name))
    final_text = ""
    for _, text, __ in img_text:
        final_text += " "
        final_text += text
    images_text[file_name] = final_text

In [9]:
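# Sort the image file names by their numeric part
# k[4:-4] strips the 'Test' prefix and the '.jpg' suffix, e.g. 'Test449.jpg' -> 449

keys = list(images_text.keys())
new_keys = [int(k[4:-4]) for k in keys]
new_keys.sort()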
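
The slicing above relies on every name having exactly the form `Test<number>.jpg`. An alternative sketch sorts the names directly with a key function, so the file names never need to be rebuilt; `numeric_key` is a hypothetical helper introduced here.

```python
import re

def numeric_key(file_name):
    """Return the first run of digits in file_name as an int."""
    match = re.search(r"\d+", file_name)
    if match is None:
        raise ValueError(f"no number found in {file_name!r}")
    return int(match.group())

names = ['Test3706.jpg', 'Test2209.jpg', 'Test449.jpg', 'Test1872.jpg']
sorted_names = sorted(names, key=numeric_key)
print(sorted_names)
```

A plain `sorted(names)` would compare the names as strings and place `Test1872.jpg` before `Test449.jpg`, which is why a numeric key is needed.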

In [10]:
# Save the text to a CSV file, ordered by ascending image number

import csv

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('image_easy_ocr.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["Filename", "Text"])
    
    for n in new_keys:
        writer.writerow(['Test' + str(n) + '.jpg', images_text['Test' + str(n) + '.jpg']])
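
OCR text can contain commas, quotes, and line breaks, so it is worth confirming that the file reads back intact. The following is a small round-trip sketch with made-up rows written to a throwaway temporary file; the file name `ocr_demo.csv` and the row contents are invented for the example.

```python
import csv
import os
import tempfile

# Made-up rows containing characters that CSV has to quote
rows = {
    "Test2.jpg": "second, with a comma",
    "Test1.jpg": "first\nwith a line break",
}

path = os.path.join(tempfile.mkdtemp(), "ocr_demo.csv")  # throwaway location

with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Filename", "Text"])
    for name in sorted(rows):
        writer.writerow([name, rows[name]])

# Read the file back; DictReader uses the header row for the keys
with open(path, newline="", encoding="utf-8") as f:
    loaded = {row["Filename"]: row["Text"] for row in csv.DictReader(f)}

print(loaded == rows)
```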

In [11]:
# OCR using pytesseract

import cv2
import pytesseract
from pytesseract import Output
from PIL import Image, ImageEnhance, ImageFilter

In [12]:
# Grayscale, Gaussian blur, Otsu's threshold
image = cv2.imread("../input/hackerearthimage/Test1161.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]


# Morph open to remove noise and invert image
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
invert = 255 - opening


# Perform text extraction
data = pytesseract.image_to_string(invert, lang='eng', config='--psm 6')
print(data)

, i - -
CHOSEN TO BE
HAPPY

lari |
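
For intuition about the thresholding step, Otsu's method picks the threshold that maximizes the variance between the dark and bright pixel classes. Below is a simplified pure-Python sketch of that idea on a flat list of pixel values. It is not OpenCV's implementation, and `otsu_threshold` is a name chosen for this example.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0-255) that maximizes between-class variance,
    with pixels <= t in one class and pixels > t in the other."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of intensities of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # no pixels in the dark class yet
        if w1 == 0:
            break     # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

# Made-up pixel values: a dark group and a bright group
pixels = [20, 25, 30, 22, 28, 200, 210, 220, 205, 215]
t = otsu_threshold(pixels)

# Same rule as THRESH_BINARY_INV: values above t become 0, the rest 255
binary_inv = [0 if p > t else 255 for p in pixels]
print(t, binary_inv)
```

The weak result above suggests that this preprocessing does not suit this particular image; it tends to help most on clean, high-contrast scans.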


In [13]:
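def text_extraction(file_path):
    """Preprocess the image at file_path and return the text Tesseract finds in it."""
    # Grayscale, Gaussian blur, Otsu's threshold
    image = cv2.imread(file_path)
    if image is None:  # cv2.imread returns None when the file cannot be read
        raise FileNotFoundError(file_path)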
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (3,3), 0)
    thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]


    # Morph open to remove noise and invert image
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
    opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
    invert = 255 - opening


    # Perform text extraction
    data = pytesseract.image_to_string(invert, lang='eng', config='--psm 6')
    return data
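
`pytesseract.image_to_data(image, output_type=Output.DICT)` returns per-word details, including a `conf` entry for each row, which allows noise to be filtered out. The sketch below applies such a filter to a hand-written stand-in dict, so it runs without Tesseract installed. `confident_words`, the cutoff of 60, and all confidence values in `fake_data` are made up for illustration.

```python
def confident_words(data, min_conf=60):
    """Return words from an image_to_data-style dict whose confidence is >= min_conf."""
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        # float() is used in case the confidence values arrive as strings;
        # rows that are not words carry a confidence of -1 and empty text.
        if text.strip() and float(conf) >= min_conf:
            words.append(text)
    return words

# Hand-written stand-in for:
#   pytesseract.image_to_data(invert, output_type=Output.DICT)
fake_data = {
    "text": ["", "CHOSEN", "TO", "BE", "lari", "HAPPY"],
    "conf": ["-1", 96, 95.5, "91", 12, 93],
}

kept = confident_words(fake_data)
print(" ".join(kept))
```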

In [14]:
# Run OCR on every image using pytesseract
# and save each image's text in a dict keyed by file name

images_text = {}
for file_name in files_list:
    # text_extraction already returns a single string, so store it directly
    images_text[file_name] = text_extraction(os.path.join(directory, file_name))

In [15]:
# Sort the image file names by their numeric part ('Test449.jpg' -> 449)

keys = list(images_text.keys())
new_keys = [int(k[4:-4]) for k in keys]
new_keys.sort()

In [16]:
# Save the text to a CSV file, ordered by ascending image number

import csv

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('image_pytesseract_ocr.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["Filename", "Text"])
    
    for n in new_keys:
        writer.writerow(['Test' + str(n) + '.jpg', images_text['Test' + str(n) + '.jpg']])