tesseract: switch to only including eng, osd, and snum tessdata #36786
…es sizes from 680MB to 30MB
Relatedly, the |
In general, I don’t think it is a good idea to restrict a package like this down to just one or even a handful of languages, since there isn’t a way to add more languages without making your own tap. Of course this is a problem that could be solved with options, and was, but since that isn’t an option now, here we are. |
Why are all non-English-speaking people excluded from using Tesseract? |
Homebrew's decision to purge all options from core formulae has definitely caused some issues and challenges. I suspect/hope that 3rd party taps will rise to cover the gaps but that's still taking shape. From @retokromer's recent change (which I like!), Tesseract is now a dependency of the very popular ffmpeg formula, and given that most people downloading tesseract are downloading it as a dependency and are unlikely to ever use it at all, I just don't think that making all of them download the full 162 languages/scripts is the right tradeoff. Being English-only, though, is perhaps a little extreme... If we included the top 10 languages that'd make the size ~60-70MB, which could be a better balance but still seems like a half measure. For context, these are all the formulae that now depend on tesseract in some way.
ffmpeg, vapoursynth, mpv, and opencv seem to be among the more popular ones. (possibly bad idea: we could add a caveat message to the formula that provides a few commands on how to manually install extra tessdata and set the appropriate |
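A rough sketch of what such caveat commands might look like. This is only an illustration: the `tessdata_fast` repository name, the `main` branch, and the helper names `tessdata_url` and `install_tessdata_lang` are assumptions, not something the formula provides.

```shell
# Assumption: language data is fetched from the tesseract-ocr/tessdata_fast
# repository on its "main" branch. Override TESSDATA_REPO to use another set.
TESSDATA_REPO="${TESSDATA_REPO:-tessdata_fast}"

# Hypothetical helper: print the download URL for one language code.
tessdata_url() {
  printf 'https://github.com/tesseract-ocr/%s/raw/main/%s.traineddata\n' \
    "$TESSDATA_REPO" "$1"
}

# Hypothetical helper: download one language file into a directory.
install_tessdata_lang() {
  lang="$1"
  dest_dir="$2"
  mkdir -p "$dest_dir" &&
    curl -fL -o "$dest_dir/$lang.traineddata" "$(tessdata_url "$lang")"
}
```

A user could then run something like `install_tessdata_lang deu "$HOME/tessdata"` and point tesseract at that directory, for example through the `TESSDATA_PREFIX` environment variable.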
Is there a way to make |
@zchrykng I don't think so. It could be great if |
@varenc Looking at 793ad82, this seems to have been the case before options became deprecated. The resources eng and osd (currently unused) are exactly these data and should be used as a minimal working subset of tessdata. Since the current folder structure allows tessdata to be separated completely from the tesseract core, it is quite practicable to split tessdata out into a separate formula and make eng and osd into tessdata-min as a dependency of tesseract core, making it just a matter of linking. I'll probably look into this later to see if there have been formulae created with such dependency logic. |
@CL-Jeremy hmm I don't think tesseract has ever installed all language data by default until 012aea3 just recently. Both before and after 793ad82 the default
(To test I had to clone the old formulae but remove the cxx11 references) I think I see now how a |
@varenc Context: we (and many of our clients, i.e. film and broadcast archives) often use Tesseract in FFmpeg to extract cast & credits, subtitles and intertitles in various European languages. Here in Switzerland we have four official languages: German, French, Italian and Romansh. |
My take is that if tesseract is a required dep of the very popular ffmpeg library and that dep is just enabling a very unpopular feature, then it shouldn't be so massive or the dep should be removed. Right now tesseract is adding 1GB of disk space and 330MB of bandwidth for every ffmpeg install. The 2nd largest ffmpeg dep is icu4c with a 25MB bottle. You can see from the old analytics that very few people were enabling tesseract to make use of ffmpeg's I suppose just bringing back options would be great but that ship has sailed. I bet an extra Also @retokromer if you're serious about OCR you should consider switching to tessdata_best anyway. The current formula includes the |
I agree with all of the statements above. I realise some may contradict others in which case I'll happily provide more input. |
I hereby leave this racist discussion. |
Well, if only we could make an experimental tap with "flavoured bottles", that'd be IMHO the most straightforward (and least practicable) solution... |
I'm 👍 on a If a third party formula is to be created, which relies on all tesseract data (which I think @retokromer would like to host for video archival purposes, perhaps in addition to @varenc's tap), then we would cover all use cases. |
No, @slhck, I don’t wish to host an alternate formula at all (cf. the discussion on ffmpeg-devel). |
I would be in favour of:
Quick browsing of the tesseract codebase shows this may be achieved by setting the |
I think @fxcoudert's approach is the best here. |
(this same approach could perhaps be taken with |
@fxcoudert I like the plan and happy to take a swing at it. It sounds like this PR is trending to being blocked on a good Though could you give me some more specifics on how to build that? It seems like tesseract will only check one tessdata set path. I know that setting a TESSDATA_PREFIX env would work, but I don't think there's a great way a formula can add an ENV for the user without telling them to do so in a caveat message, right? Maybe that's fine though? I certainly see it occasionally in other formulae. (or |
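To illustrate the single-path behaviour described here, a minimal sketch of how the effective tessdata directory could be reasoned about from the user's side. The helper names `resolve_tessdata_dir` and `list_tessdata_langs` are hypothetical, and the default directory is passed in rather than assumed, since the real default depends on how tesseract was built.

```shell
# Hypothetical helper: print the one directory tesseract would be pointed at.
# If TESSDATA_PREFIX is set it wins; otherwise fall back to the given default.
resolve_tessdata_dir() {
  default_dir="$1"
  if [ -n "${TESSDATA_PREFIX:-}" ]; then
    printf '%s\n' "$TESSDATA_PREFIX"
  else
    printf '%s\n' "$default_dir"
  fi
}

# Hypothetical helper: list the language codes that have a .traineddata
# file in the given directory.
list_tessdata_langs() {
  for f in "$1"/*.traineddata; do
    [ -e "$f" ] || continue
    basename "$f" .traineddata
  done
}
```

A caveat could then boil down to telling the user to run `export TESSDATA_PREFIX=<directory>` and check the result with `tesseract --list-langs`. Whether the variable should name the tessdata directory itself or its parent has differed between tesseract versions, so that detail is left open here.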
If the data is installed into |
Hmm... And what about the tesseract-training-tools formula that installs extra tools for tesseract? I'm having trouble with it: I can't find the training commands such as unicharset_extractor and mftraining. If they are installed together, can someone tell me where they are? |
I wrote a
I updated also the test block to perform an actual OCR task using the examples provided in the git repository, and everything seems to work fine in both formulas (using German as extra tested language). Should I make a new PR or do you prefer if I update this one, if @varenc agrees? |
@albertosottile Great work! Please create a new PR, thanks! |
Thanks @varenc for starting the discussion, and thanks @albertosottile for the pull request! |
brew install --build-from-source <formula>, where <formula> is the name of the formula you're submitting?
brew test <formula>, where <formula> is the name of the formula you're submitting?
brew audit --strict <formula> (after doing brew install <formula>)?

This reduces tesseract's installed size from 680MB to 30MB. A change from two weeks ago made tesseract include data for 162 languages and scripts. See #36760 for context and more details.
It's possible that other solutions could be better (like including data for the top 10 most popular languages?). If so, it's probably better to just close this and have a discussion on #36760.