OurDigitalWorld/tesschar

tesschar

Simple character checking for Tesseract. The parameters are:

usage: tesschar.py [-h] [-f FILE] [-b BORDER] [-l LANG] [-o OUTPUT] [-t TEXT]

optional arguments:
  -h, --help            show this help message and exit

named arguments:
  -f FILE, --file FILE  input image, for example: imgs/my_image.tif
  -b BORDER, --border BORDER
                        adjust border value for extracted regions
  -l LANG, --lang LANG  language for OCR
  -o OUTPUT, --output OUTPUT
                        file for output
  -t TEXT, --text TEXT  text to reprocess
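
For readers who want to script against the same options, here is a minimal sketch of an argparse parser that accepts the flags listed above. It is not taken from tesschar.py: the function name `build_parser` is hypothetical, the integer type on `--border` is an assumption, and no defaults are set because the help text does not state any.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented tesschar.py flags (sketch only)."""
    parser = argparse.ArgumentParser(prog="tesschar.py")
    # The help output lists the options under a "named arguments" group.
    named = parser.add_argument_group("named arguments")
    named.add_argument("-f", "--file",
                       help="input image, for example: imgs/my_image.tif")
    # Assumption: the border is a whole number of pixels.
    named.add_argument("-b", "--border", type=int,
                       help="adjust border value for extracted regions")
    named.add_argument("-l", "--lang", help="language for OCR")
    named.add_argument("-o", "--output", help="file for output")
    named.add_argument("-t", "--text", help="text to reprocess")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Options that are not supplied come back as `None` in this sketch, so a caller can tell "not given" apart from an explicit value.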

For example:

tesseract sample.jpg sample -c hocr_char_boxes=1 hocr
tesschar.py -f sample.jpg -t O,B

By default, the output is written to a file named after the base of the input filename, sample.txt in this case. A border is added around each extracted character to help improve the results. If an hOCR file is not detected, pytesseract is used to create an in-memory version. The single-character recognition step is also done in memory with pytesseract. This could be done more efficiently with the Tesseract API, but because the process adds considerable overhead, the key would be to test on a big enough sample first to make sure it is worth pursuing.
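
The steps described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in tesschar.py: it assumes character boxes appear in the hOCR as `ocrx_cinfo` spans whose `title` holds `x_bboxes` and `x_conf` fields (the form produced with `hocr_char_boxes=1`), that the border is a pixel margin clamped to the image edges, and that the default output name simply swaps the image extension for `.txt`. The names `CharBoxParser`, `padded_box`, `default_output`, and `find_targets` are hypothetical, and the OCR step itself is left out.

```python
import os
from html.parser import HTMLParser


class CharBoxParser(HTMLParser):
    """Collect (char, bbox, conf) from ocrx_cinfo spans in hOCR markup."""

    def __init__(self):
        super().__init__()
        self.chars = []        # (text, (x0, y0, x1, y1), confidence)
        self._pending = None   # (bbox, conf) of the character span being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != "span" or attrs.get("class") != "ocrx_cinfo":
            return
        bbox, conf = None, None
        # Assumed title layout: "x_bboxes x0 y0 x1 y1; x_conf NN.N"
        for field in (attrs.get("title") or "").split(";"):
            parts = field.split()
            if len(parts) == 5 and parts[0] == "x_bboxes":
                bbox = tuple(int(p) for p in parts[1:])
            elif len(parts) == 2 and parts[0] == "x_conf":
                conf = float(parts[1])
        if bbox is not None:
            self._pending = (bbox, conf)

    def handle_data(self, data):
        if self._pending is not None and data.strip():
            bbox, conf = self._pending
            self.chars.append((data.strip(), bbox, conf))
            self._pending = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._pending = None


def padded_box(bbox, border, width, height):
    """Grow a box by `border` pixels on each side, clamped to the image."""
    x0, y0, x1, y1 = bbox
    return (max(0, x0 - border), max(0, y0 - border),
            min(width, x1 + border), min(height, y1 + border))


def default_output(image_path):
    """Assumed rule: same path as the image, with the extension set to .txt."""
    return os.path.splitext(image_path)[0] + ".txt"


def find_targets(hocr, targets, border, width, height):
    """Return padded boxes for the characters that should be reprocessed."""
    parser = CharBoxParser()
    parser.feed(hocr)
    return [(ch, padded_box(bbox, border, width, height), conf)
            for ch, bbox, conf in parser.chars if ch in targets]


# Hand-written hOCR fragment for illustration, not real Tesseract output.
SAMPLE = (
    "<span class='ocrx_word' title='bbox 10 10 60 40; x_wconf 90'>"
    "<span class='ocrx_cinfo' title='x_bboxes 10 10 30 40; x_conf 95.5'>O</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 32 10 60 40; x_conf 61.2'>K</span>"
    "</span>"
)

if __name__ == "__main__":
    print(default_output("sample.jpg"))
    print(find_targets(SAMPLE, {"O", "B"}, border=5, width=62, height=42))
```

Each padded box could then be cropped from the image and passed to a single-character OCR call. Clamping matters for characters near the page edge, where an unclamped margin would produce coordinates outside the image.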

This has had minimal testing, YMMV, etc.
