Welcome to the workshop on Optical Character Recognition (OCR)

Before getting started:

*   Go to File --> Save a copy in Drive. The copied notebook should open in a new tab.
*   Alternatively, go to your Google Drive, find the folder "Colab Notebooks", and open the notebook.

Next, download:

*   Sample images: https://uppsala.box.com/s/qzvg9741dx915a0atc6mydo8qxgf9erc
*   Language models for Swedish (swe), French (fra), Italian (ita) and German (deu): https://uppsala.box.com/s/ovcpsdzj2dlomyghtw7vk50ilvxs3uix

Copy "sample_images" from your Downloads folder to the top level of your Google Drive (the code cells read from /content/drive/My Drive/sample_images).

In the left panel, click the upward arrow icon and go to /usr/share/tesseract-ocr/4.00/tessdata. Click the three dots next to the folder, choose "Upload", and select the four language models.

Close the panel (X).

Now we are ready to get started! 

Go to Runtime, press "Run all".
Alternatively, run individual cells.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Set up for tesseract
!sudo apt-get install -y tesseract-ocr
!pip install pytesseract==0.3.9

If you see a warning to "Restart runtime", click on RESTART RUNTIME.

In [None]:
# Import modules

import pytesseract
from pytesseract import Output
from PIL import Image
import cv2

In [None]:
# To check the location of pytesseract
#!pip show pytesseract

In [None]:
# List of available languages
print(pytesseract.get_languages(config=''))

For example:

*   osd: Orientation and script detection module
*   ita: Italian
*   deu: German
*   fra: French
*   swe: Swedish
*   eng: English


Here you can find the list of all languages supported by tesseract and the language codes to use:
https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html

Note: to run tesseract for other languages, pass the language code in "lang". For example, lang = 'ita' for Italian. 

To download other language models, go to https://github.com/tesseract-ocr/tessdata (download as zip file).
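
Tesseract also accepts several codes joined with "+" in "lang" (for example 'swe+eng'). As a minimal sketch, the helpers below check a "lang" string against the installed models before running OCR. The names `missing_languages` and `check_languages` are hypothetical helpers introduced for illustration; they are not part of pytesseract.

```python
def missing_languages(lang, available):
    """Return the codes in a Tesseract `lang` string that are not installed.

    `lang` may combine several codes with '+', e.g. 'swe+eng'.
    """
    requested = [code for code in lang.split("+") if code]
    return [code for code in requested if code not in available]


def check_languages(lang):
    """Return `lang` unchanged, or raise if a requested model is missing."""
    # Imported here so that missing_languages() works without pytesseract.
    import pytesseract

    available = pytesseract.get_languages(config="")
    missing = missing_languages(lang, available)
    if missing:
        raise ValueError(
            f"Missing language models: {missing}; installed: {available}"
        )
    return lang
```

A possible use is `lang=check_languages('swe+eng')` in the call to `pytesseract.image_to_string`, so that a missing model is reported with the list of installed ones.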

In [None]:
# Read an input image (check left panel for the image name)
file = cv2.imread("/content/drive/My Drive/sample_images/test_eng.png")
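
Note that `cv2.imread` returns `None` without raising an error when the path is wrong, and that OpenCV stores pixels in BGR order while pytesseract assumes RGB. The sketch below wraps both concerns in one function; `load_image` is a hypothetical helper name, not an OpenCV or pytesseract function.

```python
import os


def load_image(path):
    """Read an image with OpenCV, failing loudly if it cannot be loaded."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No such image: {path}")

    # Imported after the path check; the notebook already imports cv2 earlier.
    import cv2

    image = cv2.imread(path)
    if image is None:
        # The file exists but OpenCV could not decode it as an image.
        raise ValueError(f"OpenCV could not decode: {path}")

    # Convert from OpenCV's BGR channel order to the RGB order pytesseract assumes.
    return cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
```

For example, `file = load_image("/content/drive/My Drive/sample_images/test_eng.png")` could replace the plain `cv2.imread` call.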

In [None]:
# Configure parameters for pytesseract
# --oem 3: default OCR engine mode; --psm 6: assume a single uniform block of text
custom_config = r'--oem 3 --psm 6'
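
Other page segmentation modes may suit other layouts, for instance psm 3 (fully automatic, the Tesseract default), psm 4 (a single column of text of variable sizes) or psm 11 (sparse text). As a small sketch, the hypothetical helper `build_config` below assembles the config string and rejects values outside the ranges Tesseract 4 documents (oem 0 to 3, psm 0 to 13).

```python
def build_config(oem=3, psm=6):
    """Build a Tesseract config string such as '--oem 3 --psm 6'."""
    if oem not in range(4):
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    if psm not in range(14):
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--oem {oem} --psm {psm}"
```

With this, `custom_config = build_config(psm=4)` would be one way to try a different segmentation mode on the same image and compare the output.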

In [None]:
# Pytesseract OCR - text file output
result_txt = pytesseract.image_to_string(file, lang='eng', config=custom_config) 

If you see "KeyError: 'PNG'", go to the first cell and click RESTART RUNTIME, or go to Runtime and press "Run all".

In [None]:
print(result_txt)

In [None]:
# Swedish: Read an input image (check left panel for the image name)
file_sv = cv2.imread("/content/drive/My Drive/sample_images/test_swe1.jpg")

In [None]:
# Pytesseract OCR - text file output
result_txt_sv = pytesseract.image_to_string(file_sv, lang='swe', config=custom_config) 

In [None]:
print(result_txt_sv)
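
The `Output` class imported earlier can be used to get per-word details instead of plain text. As a sketch, `pytesseract.image_to_data(..., output_type=Output.DICT)` returns a dictionary that includes parallel lists under the keys 'text' and 'conf'. The helper below (a hypothetical name, with an arbitrary threshold of 60 chosen only for illustration) lists the words Tesseract itself was least sure about.

```python
def low_confidence_words(data, threshold=60.0):
    """Return (word, confidence) pairs whose confidence is below `threshold`.

    `data` is expected to look like the dictionary returned by
    pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        score = float(conf)  # conf may arrive as a string or a number
        # Entries with conf -1 are layout boxes (blocks, lines), not words.
        if text.strip() and 0 <= score < threshold:
            words.append((text, score))
    return words
```

A possible use, assuming the variables defined in the cells above:

```python
data = pytesseract.image_to_data(file_sv, lang='swe', config=custom_config, output_type=Output.DICT)
print(low_confidence_words(data))
```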

In [None]:
# To create a searchable pdf, uncomment this block
#result_pdf = pytesseract.image_to_pdf_or_hocr(file, lang='eng', config=custom_config) 
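
`image_to_pdf_or_hocr` returns the PDF as bytes, so it still has to be written to a file before it can be opened. A minimal sketch follows; `save_bytes` and `image_to_searchable_pdf` are hypothetical helper names and the output path is up to you.

```python
def save_bytes(data, path):
    """Write raw bytes (for example a PDF) to `path` and return the path."""
    with open(path, "wb") as handle:
        handle.write(data)
    return path


def image_to_searchable_pdf(image, path, lang="eng", config=""):
    """Run Tesseract on `image` and save the result as a searchable PDF."""
    # Imported here so that save_bytes() works without pytesseract.
    import pytesseract

    pdf_bytes = pytesseract.image_to_pdf_or_hocr(
        image, lang=lang, config=config, extension="pdf"
    )
    return save_bytes(pdf_bytes, path)
```

For example, `image_to_searchable_pdf(file, "/content/test_eng.pdf", lang='eng', config=custom_config)` would write the PDF to the Colab working area, where it can be downloaded from the left panel.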

Insights on the data:

*   There are 3 Swedish documents from the 19th century written using an old typewriter.
*    test_swe1.jpg is a good quality image.
*    test_swe2.jpg is a poor quality image and OCR will be challenging.
*    test_swe3.jpg consists of a photograph along with the text. Unfortunately, Tesseract does not perform well on text accompanied by photographs. Does cropping the text region before running Tesseract improve the results?
*    test_deu.png is taken from the OCR4ALL project and shows German text written between the 16th and 18th centuries.
*    test_eng.png represents a sample text from Marian’s play written in Old English.
*    test_fra.jpg is challenging because the scan also includes the book edges. Try the cropped version (test_fra_cropped.jpg) and observe whether the OCR results are better.
*    test_ita.png was obtained from an old Italian book online. Also try the cropped version (test_ita_cropped.png) and observe whether the OCR results are better.
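
Cropping can also be tried directly in the notebook, since an image read with OpenCV is an array that can be sliced as `image[y0:y1, x0:x1]`. The sketch below only computes the slice bounds from margin fractions; `crop_box` is a hypothetical helper, and the fractions to trim have to be found by looking at each image.

```python
def crop_box(height, width, top=0.0, bottom=0.0, left=0.0, right=0.0):
    """Return (y0, y1, x0, x1) after trimming the given margin fractions.

    Each margin is a fraction of the image size, e.g. left=0.1 trims the
    leftmost 10% of the width.
    """
    margins = {"top": top, "bottom": bottom, "left": left, "right": right}
    for name, value in margins.items():
        if not 0 <= value < 1:
            raise ValueError(f"{name} must be in [0, 1), got {value}")

    y0 = int(height * top)
    y1 = height - int(height * bottom)
    x0 = int(width * left)
    x1 = width - int(width * right)
    if y0 >= y1 or x0 >= x1:
        raise ValueError("Margins remove the whole image")
    return y0, y1, x0, x1
```

A possible use, assuming `file` holds an image loaded with `cv2.imread`:

```python
y0, y1, x0, x1 = crop_box(*file.shape[:2], left=0.1, right=0.1)
cropped = file[y0:y1, x0:x1]
```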


---

Did you notice that the OCR output has errors and needs post-processing?
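
As a very small illustration of what post-processing could look like, the sketch below tidies whitespace and rejoins words that were hyphenated across a line break. `clean_ocr_text` is a hypothetical helper; it does not correct misrecognised characters, and it will also merge genuine hyphenated compounds that happen to break at the end of a line.

```python
import re


def clean_ocr_text(text):
    """Apply simple, rule-based clean-up to raw OCR output."""
    # Rejoin words hyphenated across a line break: "exam-\nple" -> "example".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Remove leading and trailing whitespace on every line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

For example, `print(clean_ocr_text(result_txt))` can be compared side by side with `print(result_txt)`.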

This is a common challenge with heritage data and thanks to AI, we now have a solution - layout-based OCR!

You can also explore deep-learning-based layout detection, which is performed before OCR and gives much more accurate results.
Here's the link to the Layout Parser: https://github.com/Layout-Parser/layout-parser



Thank you!