Have you thought about the case of very long words? #4

Open
hairui opened this Issue Dec 26, 2012 · 1 comment


hairui commented Dec 26, 2012

Since I am trying to build training data for Chinese fonts, I prepared a text file that contains a very long "word" made up of about 3000 Chinese characters.
Unsurprisingly, I did not get the good tessdata I expected.
After looking into the code, it seems to handle only the case of short words.
Have you considered that a word could be as long as 3000 characters (a word that does not even fit on a page)? If not, why not mention that as a limitation somewhere?


brouberol commented Dec 26, 2012

Collaborator

Indeed, I did not have this case in mind. As you have probably understood, the algorithm tries to fit each word on the current line. If the word does not fit, it is placed on a new line. This is a very simple approach, and I agree that very long words can break it, as they are too long to fit on an entire line.
The initial idea was to provide Tesseract with "real life" text. I personally use an extract of a Python tutorial. In such texts, you do not encounter the situation you describe.
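
For illustration, here is a minimal sketch of the greedy placement described above. It is not the project's actual code: the names `wrap_words` and `overflowing_words` are hypothetical, and the width of a word is assumed to be its character count rather than a rendered pixel width.

```python
def wrap_words(words, line_width, measure=len):
    """Greedy wrap: keep each word on the current line if it fits,
    otherwise start a new line. `measure` is an assumed width function
    (character count here; a real renderer would use pixel widths)."""
    lines = [[]]
    used = 0
    for word in words:
        width = measure(word)
        gap = 1 if lines[-1] else 0  # one space between words on a line
        if lines[-1] and used + gap + width > line_width:
            lines.append([])  # word does not fit: open a new line
            used, gap = 0, 0
        lines[-1].append(word)
        used += gap + width
    return [" ".join(line) for line in lines]


def overflowing_words(words, line_width, measure=len):
    """Words wider than a whole line: a new line cannot help these."""
    return [w for w in words if measure(w) > line_width]
```

With `line_width=5`, the word list `["a", "x" * 10, "b"]` puts the 10-character word alone on its own line, where it still exceeds the line width. That is the failure mode discussed in this issue, and `overflowing_words` could be used to report such words up front.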

One possible way around it would be to decrease the font size specifically for this word so that it fits, but I think that would just mess things up: different font sizes would mean different inter-character spacing metrics, and the wiki specifies that inter-character spacings are vital and that there is no need to mix different font sizes.

I do not clearly understand why you would require a 3000-character-long word for training. Couldn't you use 3000 independent characters, separated by whitespace?
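
A rough sketch of that preprocessing step, assuming the training text is a plain Python string. The helper name `break_long_words` and the `max_word_length` threshold are hypothetical. Whether space-separated characters make suitable training data for Chinese is not settled in this thread.

```python
def break_long_words(text, max_word_length):
    """Split every word longer than max_word_length into single
    characters, so that each character becomes its own "word".
    Note: text.split() also collapses newlines and repeated spaces."""
    pieces = []
    for word in text.split():
        if len(word) > max_word_length:
            pieces.extend(word)  # extending with a str adds one item per character
        else:
            pieces.append(word)
    return " ".join(pieces)
```

Applied to the training text before rendering, this leaves short words untouched and only breaks up the words that are too long.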

Thanks for your feedback!

