tesschar

Simple character checking for Tesseract. The parameters are:

usage: tesschar.py [-h] [-f FILE] [-b BORDER] [-l LANG] [-o OUTPUT] [-t TEXT]

optional arguments:
  -h, --help            show this help message and exit

named arguments:
  -f FILE, --file FILE  input image, for example: imgs/my_image.tif
  -b BORDER, --border BORDER
                        adjust border value for extracted regions
  -l LANG, --lang LANG  language for OCR
  -o OUTPUT, --output OUTPUT
                        file for output
  -t TEXT, --text TEXT  text to reprocess
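
As an illustration only, a parser that produces a help listing of this shape could be sketched with argparse as below. The option names and help strings are taken from the listing above. The `build_parser` helper, the integer type for the border, and the absence of default values are assumptions, not details of the actual script.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options in the help text (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="tesschar.py")
    # A separate group gives the "named arguments" heading in the help output.
    named = parser.add_argument_group("named arguments")
    named.add_argument("-f", "--file",
                       help="input image, for example: imgs/my_image.tif")
    # Assumption: the border is an integer number of pixels.
    named.add_argument("-b", "--border", type=int,
                       help="adjust border value for extracted regions")
    named.add_argument("-l", "--lang", help="language for OCR")
    named.add_argument("-o", "--output", help="file for output")
    named.add_argument("-t", "--text", help="text to reprocess")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```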

For example:

tesseract sample.jpg sample -c hocr_char_boxes=1 hocr
tesschar.py -f sample.jpg -t O,B

By default, the output is written to a file named after the base of the input filename, sample.txt in this case. Note that a border is put around each extracted character to help improve the results. If an hOCR file is not detected, pytesseract is used to create an in-memory version. The single-character recognition step is also done in memory with pytesseract. This could be done more efficiently with the Tesseract API, but the process adds considerable overhead, so it should first be tested on a large enough sample to confirm that it is worth pursuing.
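
The selection step described above can be sketched roughly as follows, using only the standard library. This is not the script's actual code. It assumes that hOCR produced with hocr_char_boxes=1 marks each character as an `ocrx_cinfo` span carrying `x_bboxes` and `x_conf`, that the text to reprocess is a comma-separated list, and that the border is applied by growing the crop box (the script may instead add a blank margin around the crop). The names `char_boxes` and `default_output_name` are hypothetical.

```python
import os
import re
from html import unescape

# Assumed layout of the per-character spans in Tesseract's hOCR output.
CINFO_RE = re.compile(
    r"<span class=['\"]ocrx_cinfo['\"] "
    r"title=['\"]x_bboxes (\d+) (\d+) (\d+) (\d+); x_conf ([\d.]+)['\"]>"
    r"(.*?)</span>"
)


def char_boxes(hocr, targets, border=10, size=None):
    """Return (char, padded_box, conf) for every character listed in targets."""
    results = []
    for match in CINFO_RE.finditer(hocr):
        left, top, right, bottom = (int(match.group(i)) for i in range(1, 5))
        char = unescape(match.group(6))
        if char not in targets:
            continue
        # Grow the box by the border on every side.
        box = (left - border, top - border, right + border, bottom + border)
        if size is not None:
            # Clamp to the image bounds when the image size is known.
            width, height = size
            box = (max(box[0], 0), max(box[1], 0),
                   min(box[2], width), min(box[3], height))
        results.append((char, box, float(match.group(5))))
    return results


def default_output_name(image_path):
    """Swap the image extension for .txt (assumed naming rule)."""
    return os.path.splitext(image_path)[0] + ".txt"
```

Each returned box could then be cropped from the image and passed back to pytesseract for single-character recognition. Filtering on the target characters first keeps the number of extra OCR calls small.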

This has had minimal testing, YMMV, etc...
