Couple of errors with Tesseract v3.02.02 #3

Open
remon-georgy opened this Issue Dec 23, 2012 · 17 comments


@remon-georgy

Thanks for this handy tool, it's really helpful except that I couldn't get it to work :).

I'm trying to train Tesseract with a new English font called KidKosmic with the following command

python ../TesseractTrainer/__main__.py --tesseract-lang kidkosmic --training-text eng.kidkosmic.exp0 --font-path kidkosmic.ttf --font-name kidkosmic  --font-properties font_properties --verbose

And here is the output

Generating individual tif image page0.tif
Generating multipage-tif kidkosmic.kidkosmic.exp0.tif
convert: no decode delegate for this image format `page0.tif' @ error/constitute.c/ReadImage/550.
convert: no images defined `kidkosmic.kidkosmic.exp0.tif' @ error/convert.c/ConvertImageCommand/3078.
Removing all individual tif images
Generating boxfile kidkosmic.kidkosmic.exp0.box

Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Cannot open input file: kidkosmic.kidkosmic.exp0.tif
Extracting unicharset from kidkosmic.kidkosmic.exp0.box
Wrote unicharset file ./unicharset.

Warning: No shape table file present: shapetable
Reading kidkosmic.kidkosmic.exp0.tr ...

Error: Unable to open kidkosmic.kidkosmic.exp0.tr!
signal_termination_handler:Error:Signal_termination_handler called:Code 3000
Reading kidkosmic.kidkosmic.exp0.tr ...
Error: Unable to open kidkosmic.kidkosmic.exp0.tr!
signal_termination_handler:Error:Signal_termination_handler called:Code 3000
Traceback (most recent call last):
  File "../TesseractTrainer/__main__.py", line 50, in <module>
    trainer.training()  # generate a multipage tif from args.training_text, train on it and generate a traineddata file
  File "[home]bin/TesseractTrainer/lib/tesseract_training.py", line 155, in training
    self._rename_files()
  File "[home]bin/TesseractTrainer/lib/tesseract_training.py", line 131, in _rename_files
    os.rename('%s' % (generated_file), '%s.%s' % (self.dictionary_name, generated_file))
OSError: [Errno 2] No such file or directory

Any clues?

FYI, I'm running the script on Mac OS X 10.8 with the dependencies installed.
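The traceback above ends in an OSError from os.rename because an earlier step never produced its output file. As a minimal sketch (the helper name rename_if_present is hypothetical and is not TesseractTrainer's actual code), a guard like the following would report the missing file before attempting the rename:

```python
import os


def rename_if_present(generated_file, dictionary_name):
    """Prefix a training output file with the dictionary name.

    Hypothetical helper: it mirrors the os.rename call shown in the
    traceback, but fails with a descriptive message when the file that
    an earlier step should have produced does not exist.
    """
    if not os.path.exists(generated_file):
        raise RuntimeError(
            "%s was not generated; check the earlier convert/tesseract output"
            % generated_file
        )
    # Keep the file in its directory and only prefix its base name.
    directory, name = os.path.split(generated_file)
    target = os.path.join(directory, "%s.%s" % (dictionary_name, name))
    os.rename(generated_file, target)
    return target
```

With such a check, the run would stop at the first missing file and name it, rather than surfacing a bare "No such file or directory".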

@brouberol
Collaborator
@remon-georgy

Thanks for your reply!
Yes, the first error is an ImageMagick one. However, I do get plenty of additional errors after fixing it :)
I agree with you that it is a compatibility issue with Tesseract 3.x where x > 0.

@brouberol
Collaborator

Tesseract 3.02 has introduced a new clustering command: shapeclustering
(see https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Clustering)

It seems to be important, as the following message appears in your output:

Warning: No shape table file present: shapetable

I'll add an automatic version check, and if tesseract >= 3.02, then the shapeclustering command will be executed.
Stay tuned :)
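A version check of this kind could look roughly like the sketch below. It assumes the version is read from a banner string such as the "Tesseract Open Source OCR Engine v3.02.02 with Leptonica" line quoted earlier in this thread; the function names are illustrative and not part of TesseractTrainer.

```python
import re


def parse_tesseract_version(banner):
    """Extract a numeric version tuple, e.g. (3, 2, 2), from a banner string."""
    match = re.search(r"(\d+(?:\.\d+)+)", banner)
    if match is None:
        raise ValueError("no version number found in %r" % banner)
    return tuple(int(part) for part in match.group(1).split("."))


def needs_shapeclustering(banner):
    """Return True for tesseract 3.02 and later (assumed cut-off from this thread)."""
    return parse_tesseract_version(banner) >= (3, 2)
```

Comparing integer tuples avoids the pitfalls of comparing version strings or floats, since "3.02.02" does not parse as a single number.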

@brouberol
Collaborator

It seems that we'll have to wait a little more for 3.02 support.
I've added the shapeclustering command and automatic checking of the tesseract version, but tesseract 3.02 fails to perform the blob ↔ coordinates match.

All I get is a super-long error log looking like this

APPLY_BOXES: boxfile line 28/a ((421,580),(446,551)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 29/w ((446,580),(471,551)): FAILURE! Couldn't find a matching blob
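Since the log is very long, it can help to reduce it to the list of boxes that failed. The sketch below assumes every failure line has the same shape as the two lines quoted above; the meaning of the four coordinates is not interpreted here, they are only collected in the order they appear.

```python
import re

# Matches lines such as:
# APPLY_BOXES: boxfile line 28/a ((421,580),(446,551)): FAILURE! ...
FAILURE_RE = re.compile(
    r"APPLY_BOXES: boxfile line (\d+)/(\S+) "
    r"\(\((\d+),(\d+)\),\((\d+),(\d+)\)\): FAILURE!"
)


def failed_boxes(log_text):
    """Return (box_line_number, character, coordinates) for each failure."""
    failures = []
    for match in FAILURE_RE.finditer(log_text):
        line_no = int(match.group(1))
        char = match.group(2)
        coords = tuple(int(value) for value in match.groups()[2:])
        failures.append((line_no, char, coords))
    return failures
```

The resulting line numbers can then be looked up directly in the box file to inspect the offending entries.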

I've found reports of people experiencing the same behaviour with 3.02 and tried to contribute. See here.

As this bug is a pure tesseract one, I hope you understand I cannot guarantee when I'll be able to support tesseract 3.02.

As an alternative solution, I suggest you fall back on tesseract 3.01, which seems to work fairly well with TesseractTrainer.

@brouberol brouberol was assigned Dec 27, 2012
@marcolino

Hi!
Do you have any deadline for supporting tesseract 3.02?
I did try to compile 3.01, but it doesn't compile anymore (out-of-the-box, at least) on the latest distributions (Ubuntu 12.10).
What do you suggest: try hard to compile 3.01, or wait for 3.02 support? (I can't be of any help with the latter, sorry... :-)

@remon-georgy

Well, I believe that the aforementioned errors (couldn't find matching blob...etc) are originating from using training text with very very long words (words that can't fit on one line), and that it has nothing to do with the Tesseract version.

@marcolino

I don't believe so... My text is 26 letters, double spaced... And the author himself suggests that "Couldn't find a matching blob" errors are pure "tesseract ones"...

@zdenop

@marcolino: you are welcome to test your belief with evidence ;-): https://code.google.com/p/tesseract-ocr/issues/detail?id=698#c16

@marcolino

Which evidence are you talking about? Of course I did read that page, but I'm saying my problem is not the same as the one described on comment 16.
This is the tif (transformed to jpeg for size limitations) from my text file, and, as you can see, there are no possible overlaps:
ita francisco-serial-regular exp0

I made two statements:
1) "tesseract-ocr 3.01 doesn't compile on Ubuntu 12.10 (fresh install + build-essentials + tesseract-ocr + libleptonica-dev)"
2) the problem described on comment 16 does not apply to my situation, since my text is 26 single-letter words

TesseractTrainer author made one statement (in the comment of 2012-12-27 09:31:08 in this thread):
"As this bug [...couldn't find a matching blob...] is a pure tesseract one, I hope you understand I cannot guarantee when I'll be able to support tesseract 3.02."

remon-georgy said:
"aforementioned errors (couldn't find matching blob...etc) are originating from using training text with very very long words"

Please, be specific, or don't be... :-)

@brouberol
Collaborator

Hi,
I sadly currently do not have any time to spend on TesseractTrainer, which explains my slow responses and bug fixes.

About v3.02: As you've both read https://code.google.com/p/tesseract-ocr/issues/detail?id=698#c16, you've seen that it was suggested to increase the resolution (>72 DPI) or increase the inter-character spacing.
I tried to generate a 300 DPI tif by multiplying all metrics by 4.16 and setting "300 DPI" in the tif metadata using ImageMagick, but it did not help.
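The factor mentioned here follows from the ratio of the two resolutions: 300 / 72 ≈ 4.17, which the comment rounds to 4.16. A small sketch of that arithmetic, with hypothetical helper names, assuming metrics are pixel values rounded to the nearest integer:

```python
def dpi_scale(source_dpi=72, target_dpi=300):
    """Return the factor by which pixel metrics grow when changing DPI."""
    return float(target_dpi) / source_dpi


def scale_metric(value, source_dpi=72, target_dpi=300):
    """Scale one pixel metric (font size, offset, box edge) to the target DPI."""
    return int(round(value * dpi_scale(source_dpi, target_dpi)))
```

For example, a 12-pixel glyph height at 72 DPI corresponds to 50 pixels at 300 DPI.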

@marcolino your example tif suggests that increasing the inter-character spacing does not have any effect either.

The only solution I can offer for now is to try compiling tesseract 3.01. I recall that you needed leptonica-lib to compile it. Maybe it is not shipped anymore (wild undocumented guess here)?

Thanks for your reports.

B

@zdenop

@marcolino: I was asking for evidence that "Couldn't find a matching blob" errors are pure "tesseract ones"... This is not true, at least for Latin-script inputs (the situation for hieroglyphic, Arabic, and Asian scripts is different, IMO). Whenever I tested reported issues, the problem always turned out to be in:
a) a wrong box file
b) the input image (not following tesseract requirements).

If you post your files somewhere (the image, preferably as a 2-colour png ;-), and the box file), I can analyse them and hopefully offer you some suggestions. Version 3.02 is not (so much ;-) ) sensitive to spacing. BTW: you are aware that 26 single letters do not meet requirements, right?

@BaltoRouberol: The root of the problem in 698#c16 is not in the DPI, but in the boxes. DPI is just a minor issue, IMO. You have to be aware that tesseract will convert (binarize) images to 2 colours and then run training. Maybe if you visualize "your" box file and the tesseract one, you can see what makes the difference.
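To illustrate what binarization does to an anti-aliased glyph edge, here is a minimal sketch using a fixed threshold on a grid of 0-255 grey values. This is only an illustration of the idea; it is not a description of the algorithm tesseract itself uses, and the threshold of 128 is an arbitrary assumption.

```python
def binarize(gray_rows, threshold=128):
    """Map each 0-255 grey value to 0 (ink) or 255 (paper).

    gray_rows is a list of rows of integers. Light anti-aliased edge
    pixels fall on the paper side, so a glyph can end up smaller than
    the box computed from font metrics.
    """
    return [
        [0 if value < threshold else 255 for value in row]
        for row in gray_rows
    ]
```

Comparing boxes against the binarized image rather than the original rendering is one way to see where the two disagree.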

@marcolino

@BaltoRouberol: no problem for your slow response, of course...
I'll go with the "3.01" solution. You are right, libleptonica-dev is not shipped with the latest default Ubuntu, but it's only an "apt-get install libleptonica-dev" away...
The problem is that tesseract-ocr 3.01 doesn't compile anymore with the latest system libraries (while 3.02 does); I didn't investigate deeply, but I suspect a structure change in some system library... I hope I will be able to backport just the changed portions from 3.02 to 3.01 to build it successfully...

@zdenop: thanks for your support... I'm not an OCR expert... My goal is to digitize, as well as I can, a bunch of old books (really old and precious Italian books :-)... So I'm trying to automate the training process with TesseractTrainer... I had hoped not to have to "dirty my hands" with box files and input images, but just to:
1) somehow identify the fonts (most books use the same font) with the help - for example - of some online resource like "www.myfonts.com/WhatTheFont/"
2) scan the books with a professional book scanner
3) process the scanned images with a trained tesseract
4) enjoy... :-)
Now I see I have to dig into the internals of the training process... But I am just starting, so please excuse my ignorance...
So, to answer your requests, my box file and input image are produced for me by TesseractTrainer (before it fails).
I just provide a text file. I'm aware that 26 single letters do not meet requirements, but changing to the suggested minimal text ("The (quick) brown {fox} jumps! over the $3,456.78 #90 dog & duck/goose, as 12.5% of E-mail from aspammer@website.com is spam?") doesn't change anything... (By the way, I can no longer find the URL of the official recommended full text... :-().

I'll try to post here all the data I use:

I hope it's enough... :-)
Please let me know if I can help you some way while investigating this issue...

Thanks again for your interest, everybody!

@zdenop

@marcolino: the problem is in the box file. I posted a correct one on pastebin (it will expire in one month). I suggest you compare it with your version, e.g. in kdiff3. I created it with tesseract and only needed to correct one "1" to "l" and an m-dash to a minus. Are you sure you need to run training, given such a result? The experience of tesseract users is that users are not able to create language data as good as what Google did for the supported languages (e.g. training is reasonable only for an uncommon font like Fraktur). Instead of training, it makes sense to focus on input image quality and image preprocessing.
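For readers without kdiff3, the same comparison can be sketched with Python's standard difflib module. The file labels and the sample box lines used when trying it out are made up for illustration; they are not the files from this thread.

```python
import difflib


def diff_box_lines(mine, reference):
    """Return unified-diff lines between two box files given as lists of lines."""
    return list(
        difflib.unified_diff(
            mine,
            reference,
            fromfile="mine.box",
            tofile="reference.box",
            lineterm="",
        )
    )
```

Feeding it the lines of the generated box file and of the corrected one shows entries such as a "1" that should be an "l", along with any coordinates that differ.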

@BaltoRouberol: The problem with TesseractTrainer is that PIL.ImageFont always returns the same height for different chars ('T', 'g', '.', 'x'). This is not correct. Tesseract 3.02 requires that each box in the box file be the rectangle of the char only, without empty space. I think you are not able to create such a box file with PIL.
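The requirement described here, a box that hugs the ink of each character, can be sketched by scanning the rendered glyph for its outermost ink pixels instead of relying on font metrics. The sketch below works on a plain list-of-rows bitmap and uses hypothetical helper names; Pillow's Image.getbbox() offers a comparable operation on real images, though whether it fits TesseractTrainer's pipeline is untested here. The box line layout (char left bottom right top page, with y measured from the bottom of the image) follows the tesseract box file format; treating right and top as exclusive edges is an assumption of this sketch.

```python
def tight_bbox(bitmap):
    """Return (left, top, right, bottom) around the ink pixels, or None if blank.

    bitmap is a list of rows, top row first; truthy cells are ink.
    right and bottom are exclusive, so width == right - left.
    """
    rows = [y for y, row in enumerate(bitmap) if any(row)]
    if not rows:
        return None
    cols = [x for row in bitmap for x, cell in enumerate(row) if cell]
    return (min(cols), rows[0], max(cols) + 1, rows[-1] + 1)


def box_line(char, bbox, image_height, page=0):
    """Format one box file line: char left bottom right top page.

    Box files measure y upwards from the bottom of the image, so the
    top-down bitmap coordinates are flipped here.
    """
    left, top, right, bottom = bbox
    return "%s %d %d %d %d %d" % (
        char, left, image_height - bottom, right, image_height - top, page
    )
```

With this approach a '.' gets a small box near the baseline and a 'T' a tall one, instead of every character sharing the same height.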

@marcolino

Thanks, zdenop. You are right, I'm not sure I need training... What I lack is an understanding of the kind of work I have to do to perform OCR on many books with different fonts: should I build a box file for each font I have?
And, you say, I should "focus on input image quality and image preprocessing": do you know of any (open source) tool to preprocess images for better OCR processing?
Thanks again!

@zdenop

@marcolino: this is off-topic for this issue. I suggest you post an example image and ask on the tesseract user forum for suggestions. In my opinion, scantailor is the most comprehensive free software tool for this (with a simple user interface). You should not expect a 100% result (even commercial OCR will not provide it).

@brouberol
Collaborator

@zdenop That's very interesting, thanks for your input. I guess that would mean that the whole tif+boxfile generation would have to be re-written using another Image Processing tool (eg: ImageMagick).

See http://www.imagemagick.org/Usage/text/#font_info

At this point, I would be happy to assist anyone willing to fork TesseractTrainer and fix this issue, but I feel I currently do not have the time to fix this (and I'm really sorry about that).

Thanks again!

B

@BrendonKoz

FYI: For anyone looking for further information on this (anyone interested in forking the project, perhaps?), another post was made in the Tesseract bug listing related to this particular issue.

https://code.google.com/p/tesseract-ocr/issues/detail?id=698#c17
