# Pytesseract

#### Links
- https://pypi.org/project/pytesseract/
- https://github.com/tesseract-ocr/tesseract
- https://en.wikipedia.org/wiki/Tesseract_(software)

## Pytesseract Info

- Tesseract
    - Hewlett-Packard developed Tesseract in the 1980s and 1990s
    - Written in C and C++
    - Released as open source in 2005
    - Development sponsored by Google since 2006
    - Pretrained models available for over 100 languages
        - Custom training possible for other languages
    - Didn't use deep learning originally
    - Uses LSTM (RNN) starting in v4.0

- Works best on good quality images (even lighting, no rotation, good contrast)
- Common preprocessing steps
    - Correct rotation, skew
    - High-pass filter to even out lighting
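
The lighting step above can be sketched with Pillow alone. This is a minimal illustration, not a tuned pipeline: the helper name `high_pass` and the blur radius are assumptions, and the approach (subtracting a heavily blurred copy of the image from itself) is only one common way to approximate a high-pass filter.

```python
from PIL import Image, ImageChops, ImageFilter


def high_pass(image, radius=25):
    """Approximate a high-pass filter by removing the blurred background.

    `radius` is an assumed starting value; it should be larger than the
    stroke width of the text so that only slow lighting changes are removed.
    """
    gray = image.convert("L")
    # A strong Gaussian blur keeps only low-frequency content (the lighting).
    background = gray.filter(ImageFilter.GaussianBlur(radius))
    # subtract() computes (gray - background) / scale + offset, clipped to 0..255.
    # The offset of 128 re-centres the result on mid-grey.
    return ImageChops.subtract(gray, background, scale=1.0, offset=128)
```

The result can be passed to `pytesseract.image_to_string` in place of the original image. Whether it helps depends on the image, so it is worth comparing the OCR output with and without the filter.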

# Pytesseract Demo

In [None]:
from PIL import Image
import PIL.ImageEnhance
import pytesseract
import pandas as pd

filename = 'example_check.png'

In [None]:
image = Image.open(filename)
display(image)

# Data Extraction in Various Forms

In [None]:
# Simple image to string
print(pytesseract.image_to_string(image))

In [None]:
# Batch processing: pass a text file that lists one image file path per line
# print(pytesseract.image_to_string('images.txt'))

## Bounding Boxes

In [None]:
# Get bounding box estimates
boxes_str = pytesseract.image_to_boxes(image)
print(boxes_str[:200])

In [None]:
# Get bounding box estimates as dataframe
pd.DataFrame(pytesseract.image_to_boxes(image, output_type='dict')).head(10)
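
The string returned by `image_to_boxes` follows the Tesseract box-file layout: one character per line, followed by `left bottom right top page`, with the origin at the bottom-left corner of the image. Pillow and most drawing libraries put the origin at the top-left, so the y values need flipping before the boxes can be drawn. The helper below is a sketch of that conversion; `parse_boxes` is a hypothetical name, and the sample string in the usage note is made up for illustration.

```python
def parse_boxes(boxes_str, image_height):
    """Parse image_to_boxes string output into dicts with a top-left origin."""
    boxes = []
    for line in boxes_str.splitlines():
        # Split from the right so the character itself is kept intact.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char, left, bottom, right, top, page = parts
        boxes.append({
            "char": char,
            "left": int(left),
            # Flip the y axis: Tesseract measures from the bottom edge.
            "top": image_height - int(top),
            "right": int(right),
            "bottom": image_height - int(bottom),
            "page": int(page),
        })
    return boxes
```

In the notebook this could be called as `parse_boxes(boxes_str, image.height)`, and each result drawn with `PIL.ImageDraw.Draw(image).rectangle(...)`.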

## Verbose Data

In [None]:
# Get verbose data including boxes, confidences, line and page numbers
# print(pytesseract.Output.__dict__) # see output options
# pytesseract.image_to_data(image, output_type='string') # string output
# pytesseract.image_to_data(image, output_type='bytes')
# pytesseract.image_to_data(image, output_type='dict')
pytesseract.image_to_data(image, output_type='data.frame').dropna(axis=0, subset=['text']) # df output
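
One use of the verbose output is to drop low-confidence words and rebuild the text line by line. The sketch below works on the `dict` output type and relies on the `text`, `conf`, `page_num`, `block_num`, `par_num`, and `line_num` columns from the Tesseract TSV format. The helper name `words_to_lines` and the threshold of 60 are assumptions chosen for illustration, not recommended values.

```python
from collections import defaultdict


def words_to_lines(data, min_conf=60.0):
    """Group recognised words into lines, skipping low-confidence entries.

    `data` is the dict returned by image_to_data(..., output_type='dict').
    """
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        word = str(text).strip()
        # Structural rows (page, block, line) have empty text and conf -1.
        if not word or float(data["conf"][i]) < min_conf:
            continue
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines[key].append(word)
    # Sorting by key restores reading order across pages, blocks, and lines.
    return [" ".join(words) for _, words in sorted(lines.items())]
```

In the notebook this could be called as `words_to_lines(pytesseract.image_to_data(image, output_type='dict'))`.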

In [None]:
# Get information about orientation and script (writing system) detection
print(pytesseract.image_to_osd(image))
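
The OSD result is printed as `key: value` lines, which makes it easy to turn into a dictionary for later use, for example to read the suggested rotation. The parser below is a sketch; `parse_osd` is a hypothetical helper, and the sample text in the usage note is illustrative rather than real output for the example image.

```python
def parse_osd(osd_str):
    """Parse the 'key: value' lines of image_to_osd output into a dict."""
    info = {}
    for line in osd_str.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key/value line
        value = value.strip()
        try:
            # Keep integers as int and decimals as float.
            info[key.strip()] = float(value) if "." in value else int(value)
        except ValueError:
            info[key.strip()] = value  # e.g. the script name
    return info
```

For a string such as `"Rotate: 90\nScript: Latin\nScript confidence: 4.14"`, this returns the rotation as an integer, the script as a string, and the confidence as a float.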

In [None]:
# Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr(image.filename, extension='pdf')

with open('searchable_pdf.pdf', 'wb') as f:
    f.write(pdf)

In [None]:
# Get hOCR output
# hOCR is an open standard for representing OCR output (text plus layout) as HTML.
hocr = pytesseract.image_to_pdf_or_hocr(image.filename, extension='hocr')

print(hocr[1525:2013].decode())

***

In [None]:
# Use two languages at once (the traineddata file for each must be installed)
# Tesseract's language code for French is 'fra'
txt = pytesseract.image_to_string(image, lang='eng+fra')
txt
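
A multi-language call fails if any of the requested language packs is missing, so it can help to check the `lang` string first. The sketch below assumes a pytesseract version that provides `get_languages`; `missing_languages` is a hypothetical helper name.

```python
def missing_languages(lang, installed):
    """Return the codes in a 'eng+fra'-style string that are not installed."""
    requested = [code for code in lang.split("+") if code]
    return [code for code in requested if code not in installed]


# In the notebook, the installed list could come from pytesseract:
#   installed = pytesseract.get_languages(config='')
#   print(missing_languages('eng+fra', installed))
```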

# Common API Functions

### Functions

- **get_tesseract_version** - Returns the Tesseract version installed in the system.
- **image_to_string** - Returns the result of a Tesseract OCR run on the image as a string
- **image_to_boxes** - Returns result containing recognized characters and their box boundaries
- **image_to_data** - Returns result containing box boundaries, confidences, and other information. Requires Tesseract 3.05+. For more information, please check the Tesseract TSV documentation
- **image_to_osd** - Returns result containing information about orientation and script detection.
- **run_and_get_output** - Returns the raw output from Tesseract OCR. Gives a bit more control over the parameters that are sent to tesseract.


### Other parameters

- **image** - Object or String, PIL Image, NumPy array, or file path of the image to be processed by Tesseract
- **lang** - String, Tesseract language code string
- **config** - String, Any additional configurations as a string, ex: config='--psm 6'
- **nice** - Integer, modifies the processor priority for the Tesseract run. Not supported on Windows. Nice adjusts the niceness of unix-like processes.
- **output_type** - Class attribute, specifies the type of the output, defaults to string. For the full list of all supported types, please check the definition of pytesseract.Output class.
- **timeout** - Integer or Float, duration in seconds for the OCR processing, after which pytesseract will terminate the run and raise RuntimeError.
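
The `config` and `timeout` parameters above can be combined as in the following sketch. `build_config` and `ocr_with_timeout` are hypothetical helpers, and the default page segmentation mode (`--psm 6`) and the 5-second limit are assumptions for illustration. `--psm` and `--oem` are standard Tesseract command-line options.

```python
def build_config(psm=None, oem=None, extra=""):
    """Assemble a Tesseract config string such as '--psm 6 --oem 1'."""
    parts = []
    if psm is not None:
        parts.append(f"--psm {psm}")  # page segmentation mode
    if oem is not None:
        parts.append(f"--oem {oem}")  # OCR engine mode
    if extra:
        parts.append(extra)
    return " ".join(parts)


def ocr_with_timeout(image, lang="eng", psm=6, timeout=5):
    """Run OCR and return None if Tesseract exceeds the time limit."""
    # Imported here so build_config stays usable without Tesseract installed.
    import pytesseract

    try:
        return pytesseract.image_to_string(
            image, lang=lang, config=build_config(psm=psm), timeout=timeout
        )
    except RuntimeError:
        # pytesseract raises RuntimeError when the timeout is reached.
        return None
```

Returning `None` on timeout is a design choice for this sketch; re-raising or logging the error may suit batch jobs better.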