tesseract: switch to only including eng, osd, and snum tessdata #36786

Closed
wants to merge 1 commit into from

Conversation

varenc
Contributor

@varenc varenc commented Feb 7, 2019

  • Have you followed the guidelines for contributing?
  • Have you checked that there aren't other open pull requests for the same formula update/change?
  • Have you built your formula locally with brew install --build-from-source <formula>, where <formula> is the name of the formula you're submitting?
  • Is your test running fine brew test <formula>, where <formula> is the name of the formula you're submitting?
  • Does your build pass brew audit --strict <formula> (after doing brew install <formula>)?

This reduces tesseract's installed size from 680MB to 30MB. A change from two weeks ago made tesseract include data for 162 languages and scripts. See #36760 for context and more details.

It's possible that other solutions could be better (like including data for the top 10 most popular languages?). If so, it's probably better to just close this and have the discussion on #36760.

@varenc
Contributor Author

varenc commented Feb 7, 2019

Relatedly, the ffmpeg formula now includes tesseract as a required dependency. Since very few ffmpeg users will ever use its ocr filter feature, reducing tesseract's bloat could have a big impact on them.

@varenc varenc changed the title tesseract, switch to only including eng, osd, and snum tessdata tesseract: switch to only including eng, osd, and snum tessdata Feb 7, 2019
@varenc
Contributor Author

varenc commented Feb 7, 2019

Even more info: this is currently the 13th largest bottle on Homebrew. Its ranking is probably higher when uncompressed on disk. I had a little fun building the full list. Here are the 30 largest.

1280.52 MB https://homebrew.bintray.com/bottles/jumanpp-1.02.mojave.bottle.tar.gz
868.10 MB https://homebrew.bintray.com/bottles/llvm-7.0.1.mojave.bottle.tar.gz
808.57 MB https://homebrew.bintray.com/bottles/llvm@6-6.0.1_1.mojave.bottle.tar.gz
722.01 MB https://homebrew.bintray.com/bottles/llvm@5-5.0.2_1.mojave.bottle.tar.gz
687.01 MB https://homebrew.bintray.com/bottles/geant4-10.5.0.mojave.bottle.tar.gz
630.78 MB https://homebrew.bintray.com/bottles/llvm@4-4.0.1_1.mojave.bottle.tar.gz
598.12 MB https://homebrew.bintray.com/bottles/freeling-4.1_2.mojave.bottle.tar.gz
514.17 MB https://homebrew.bintray.com/bottles/llvm@3.9-3.9.1_2.mojave.bottle.tar.gz
498.21 MB https://homebrew.bintray.com/bottles/swift-4.2.1.mojave.bottle.1.tar.gz
472.28 MB https://homebrew.bintray.com/bottles/mitie-0.6_1.mojave.bottle.tar.gz
397.91 MB https://homebrew.bintray.com/bottles/freeswitch-1.6.20.mojave.bottle.1.tar.gz
374.11 MB https://homebrew.bintray.com/bottles/wesnoth-1.12.6_8.mojave.bottle.tar.gz
339.90 MB https://homebrew.bintray.com/bottles/tesseract-4.0.0.mojave.bottle.2.tar.gz
271.15 MB https://homebrew.bintray.com/bottles/cling-0.5_2.mojave.bottle.tar.gz
236.13 MB https://homebrew.bintray.com/bottles/ghc-8.4.4.mojave.bottle.1.tar.gz
228.93 MB https://homebrew.bintray.com/bottles/emscripten-1.38.25.mojave.bottle.tar.gz
222.09 MB https://homebrew.bintray.com/bottles/rust-1.32.0.mojave.bottle.1.tar.gz
202.33 MB https://homebrew.bintray.com/bottles/pioneer-20180203.mojave.bottle.tar.gz
202.10 MB https://homebrew.bintray.com/bottles/ghc@8.2-8.2.2.mojave.bottle.1.tar.gz
197.06 MB https://homebrew.bintray.com/bottles/voldemort-1.10.26.mojave.bottle.tar.gz
183.86 MB https://homebrew.bintray.com/bottles/widelands-build19_13.mojave.bottle.tar.gz
174.37 MB https://homebrew.bintray.com/bottles/mingw-w64-5.0.4_1.mojave.bottle.tar.gz
173.75 MB https://homebrew.bintray.com/bottles/blast-2.8.1.mojave.bottle.tar.gz
166.13 MB https://homebrew.bintray.com/bottles/mimic-1.2.0.2_5.mojave.bottle.tar.gz
145.14 MB https://homebrew.bintray.com/bottles/couchdb-lucene-2.1.0.mojave.bottle.tar.gz
141.29 MB https://homebrew.bintray.com/bottles/go-1.11.5.mojave.bottle.tar.gz
139.45 MB https://homebrew.bintray.com/bottles/mono-5.18.0.225.mojave.bottle.1.tar.gz
134.23 MB https://homebrew.bintray.com/bottles/mercury-14.01.1_1.mojave.bottle.tar.gz
130.54 MB https://homebrew.bintray.com/bottles/cp2k-6.1.mojave.bottle.tar.gz
125.00 MB https://homebrew.bintray.com/bottles/arangodb-3.4.0.mojave.bottle.tar.gz
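
A ranking like the one above can be checked mechanically. The sketch below is not the script used to build the list; `bottle_rank` is a hypothetical helper that assumes input lines in the same `SIZE MB URL` layout and reports the position of the first bottle whose URL contains a given name.

```shell
# Hypothetical helper (not from the thread): reads "SIZE MB URL" lines on
# stdin, sorts them by size in descending order, and prints the 1-based rank
# of the first line whose URL (third field) contains the name given as $1.
bottle_rank() {
  sort -rn | awk -v name="$1" 'index($3, name) { print NR; exit }'
}
```

Piping the 30 lines above into `bottle_rank tesseract-` should report 13, matching the ranking quoted in the comment.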

@zchrykng

zchrykng commented Feb 7, 2019

In general, I don’t think it is a good idea to restrict a package like this down to just one or even a handful of languages, since there isn’t a way to add more languages without making your own tap.

Of course this is a problem that could be solved with options, and was, but since that isn’t an option now, here we are.

@retokromer
Contributor

Why are all non-English-speaking people excluded from using Tesseract?

@varenc
Contributor Author

varenc commented Feb 7, 2019

Homebrew's decision to purge all options from core formulae has definitely caused some issues and challenges. I suspect/hope that 3rd party taps will rise to cover the gaps but that's still taking shape.

Since @retokromer's recent change (which I like!), Tesseract is now a dependency of the very popular ffmpeg formula. Given that most people downloading tesseract are getting it as a dependency and are unlikely to ever use it, I just don't think that making all of them download the full 162 languages/scripts is the right tradeoff. Being English-only is perhaps a little extreme, though... If we included the top 10 languages, that would bring the size to ~60-70MB, which could be a better balance but still seems like a half measure.

For context, these are all the formulae that now depend on tesseract in some way.

➜  brew uses --recursive tesseract
audacious                    minidlna                     pc6001vx
mkvdts2ac3                   pdfsandwich
caffe                        mkvtomp4                     pianobar
corsixth                     mlt                          ppsspp
echoprint-codegen            moc                          qcli
ffmpeg                       mpd                          qmmp
ffmpeg2theora                mps-youtube                  scrcpy
ffmpegthumbnailer            mpv                          siril
ffms2                        mvtools                      synfig
get_iplayer                  ocrmypdf                     unpaper
gifcap                       opencv                       vapoursynth
gifify                       opencv@2                     vcs
mat2                         opencv@3                     vice
mgba                         openimageio                  visp

ffmpeg, vapoursynth, mpv, and opencv seem to be among the more popular ones.

(possibly bad idea: we could add a caveat message to the formula that provides a few commands on how to manually install extra tessdata and set the appropriate TESSDATA_PREFIX...but that just seems like a manual replacement for the old option.)
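
For illustration, the kind of commands such a caveat might contain could look like the sketch below. Everything in it is an assumption rather than part of the formula: the helper names, the `$HOME/tessdata` directory, and the download URL pattern (which assumes the `4.0.0` tag of the tessdata_fast repository). It also assumes that Tesseract 4 expects `TESSDATA_PREFIX` to point at the tessdata directory itself.

```shell
# Hypothetical helpers, not part of any formula.
# Assumed defaults: a user-owned data directory and the tessdata_fast 4.0.0 tag.
TESSDATA_DIR="${TESSDATA_DIR:-$HOME/tessdata}"
TESSDATA_BASE_URL="${TESSDATA_BASE_URL:-https://github.com/tesseract-ocr/tessdata_fast/raw/4.0.0}"

# Print the download URL for one language code, e.g. "deu".
tessdata_url() {
  printf '%s/%s.traineddata\n' "$TESSDATA_BASE_URL" "$1"
}

# Download one language file into TESSDATA_DIR (needs curl and network access).
install_tessdata_lang() {
  mkdir -p "$TESSDATA_DIR" &&
    curl -fL -o "$TESSDATA_DIR/$1.traineddata" "$(tessdata_url "$1")"
}
```

A user would then run something like `install_tessdata_lang deu`, followed by `export TESSDATA_PREFIX="$TESSDATA_DIR"` in their shell profile. If tesseract reads only a single tessdata directory, eng and osd would have to be copied or downloaded into that directory as well.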

@zchrykng

zchrykng commented Feb 7, 2019

Is there a way to make tesseract-en style packages, similar to how Debian packages aspell's languages (aspell-en, for example)? I'm not sure how the internals of brew work or whether that would be possible.

@varenc
Contributor Author

varenc commented Feb 8, 2019

@zchrykng I don't think so. It could be great if tesseract would just come with basic eng/osd data and then there was a tesseract-lang-all formula that would depend on tesseract and just add more language support but it's not clear to me there's a good way to do that without depending on the user reading and following some caveats message.

@CL-Jeremy
Contributor

@varenc Looking at 793ad82, this seems to have been the case before options were deprecated. The resources eng and osd (currently unused) are exactly this data and should be used as a minimal working subset of tessdata. Since the current folder structure allows tessdata to be separated completely from the tesseract core, it is quite practicable to extract tessdata into a separate formula and make eng and osd into a tessdata-min dependency of tesseract core, making it just a matter of linking. I'll probably look into this later to see whether any formulae have been created with such dependency logic.

@varenc
Contributor Author

varenc commented Feb 8, 2019

@CL-Jeremy hmm, I don't think tesseract has ever installed all language data by default until 012aea3 just recently. Both before and after 793ad82, the default tesseract installation only included eng and osd language data and was less than 40MB.

➜ brew install https://gist.github.com/varenc/72bad151d4eb1cd3c59906a16e027f62/raw/ae87114e37acae571dad7b4ed72efc6700df7c2b/tesseract.rb  #tesseract.rb ca4932cb8
➜  du -d 1 -h $(brew --cellar tesseract)
 39M	/usr/local/Cellar/tesseract/3.05.02
 39M	/usr/local/Cellar/tesseract

➜  brew install https://gist.githubusercontent.com/varenc/6a532958272de35424c396859f9b9c93/raw/9131921175db88866265b115f025bc4464ca303d/tesseract.rb   #tesseract.rb 793ad82
➜  du -d 1 -h $(brew --cellar tesseract)
 22M	/usr/local/Cellar/tesseract/4.0.0
 22M	/usr/local/Cellar/tesseract

(To test I had to clone the old formulae but remove the cxx11 references)

I think I see now how a tesseract-lang-all formula could add more tessdata. Though if it creates a new /usr/local/share/tessdata symlink pointing to the data in the new tesseract-lang-all cellar, won't that conflict with the default symlink installed by tesseract? Hopefully still possible, but I don't think this current PR should be blocked by having a working tesseract-lang-all. Would love some feedback from @MikeMcQuaid on this =)

@retokromer
Contributor

@varenc Context: we (and many of our clients, i.e. film and broadcast archives) often use Tesseract in FFmpeg to extract cast & credits, subtitles and intertitles in various European languages. Here in Switzerland we have four official languages: German, French, Italian and Romansh.

@varenc
Contributor Author

varenc commented Feb 8, 2019

My take is that if tesseract is a required dep of the very popular ffmpeg library and that dep just enables a very unpopular feature, then it shouldn't be so massive, or the dep should be removed. Right now tesseract is adding 1GB of disk space and 330MB of bandwidth to every ffmpeg install. The 2nd largest ffmpeg dep is icu4c with a 25MB bottle. You can see from the old analytics that very few people were enabling tesseract to make use of ffmpeg's ocr filter. MikeMcQuaid weighed in briefly on your PR that added tesseract to say that if tesseract's overall footprint is 1GB, it's not worth making it a dependency, and I'm inclined to agree... but if tesseract were small and the recent bloat was just an unintentional side effect of the "great options purging", then it seems fine to leave tesseract in ffmpeg. It's much easier to add more tesseract language data than it is to recompile ffmpeg with ocr/tesseract support.

I suppose just bringing back options would be great but that ship has sailed. I bet an extra tesseract-lang-all formula would work well for you? Hopefully, that's possible. I might add a tesseract formula that installs all languages to my 3rd party ffmpeg tap and have it use that in the meantime.

Also @retokromer, if you're serious about OCR you should consider switching to tessdata_best anyway. The current formula includes the tessdata_fast models, which offer a speed/accuracy tradeoff that results in a smaller disk size. If you don't mind the disk space and want the best accuracy, tessdata_best is the way to go.

@MikeMcQuaid
Member

“It could be great if tesseract would just come with basic eng/osd data and then there was a tesseract-lang-all formula that would depend on tesseract and just add more language support but it's not clear to me there's a good way to do that without depending on the user reading and following some caveats message.”

“It's possible that other solutions could be better (like including data for the top 10 most popular languages?)”

“My take is if tesseract is a required dep on the very popular ffmpeg library and that dep is just enabling a very unpopular feature then it shouldn't be so massive or the dep should be removed”

I agree with all of the statements above. I realise some may contradict others in which case I'll happily provide more input.

@retokromer
Contributor

“if you're serious about OCR you should”

I hereby leave this racist discussion.

@CL-Jeremy
Contributor

Well, if only we could make an experimental tap with "flavoured bottles", that'd be IMHO the most straightforward (and least practicable) solution...
Another kind of solution involves having something like tesseract-base (or even libtesseract with the binary stripped) for backdependents, plus tesseract-data and tesseract-data-min formulae to supply the missing resources. That would leave tesseract a plain binary formula with everything else stripped away but requiring tesseract-data (the same goes for tesseract-data-min, should the need for a tesseract-min exist). Caveats may need to be added to the backdependents in question regarding this. To me this is also ugly and very much against the Homebrew tradition.

@slhck
Contributor

slhck commented Feb 9, 2019

I'm 👍 on a tesseract-lang-all formula next to a more restricted baseline installation that doesn't take up so much space. We could (should) then link to the default tesseract formula in any ffmpeg formulae.

If a third party formula is to be created, which relies on all tesseract data (which I think @retokromer would like to host for video archival purposes, perhaps in addition to @varenc's tap), then we would cover all use cases.

@retokromer
Contributor

No, @slhck, I don’t wish to host an alternate formula at all (cf. the discussion on ffmpeg-devel).

@fxcoudert
Member

I would be in favour of:

  • a tesseract formula that installs the code and restricted dataset
  • a tesseract-lang formula that installs extra data for many languages, in a place that tesseract knows to check

Quick browsing of the tesseract codebase shows this may be achieved by setting the TESSDATA option at compile time, or using DATA_PATH at runtime.

@MikeMcQuaid
Member

I think @fxcoudert's approach is the best here.

@MikeMcQuaid
Member

(this same approach could perhaps be taken with aspell too)

@varenc
Contributor Author

varenc commented Feb 9, 2019

@fxcoudert I like the plan and I'm happy to take a swing at it. It sounds like this PR is trending towards being blocked on a good tesseract-lang formula existing.

Though could you give me some more specifics on how to build that? It seems like tesseract will only check one tessdata set path. I know that setting a TESSDATA_PREFIX env would work, but I don't think there's a great way a formula can add an ENV for the user without telling them to do so in a caveat message, right? Maybe that's fine though? I certainly see it occasionally in other formulae.

(or tesseract-lang could just re-install tesseract completely with all the data, but it seems bad to duplicate tesseract like that, right?)

@MikeMcQuaid
Member

“Though could you give me some more specifics on how to build that? It seems like tesseract will only check one tessdata set path. I know that setting a TESSDATA_PREFIX env would work, but I don't think there's a great way a formula can add an ENV for the user without telling them to do so in a caveat message, right? Maybe that's fine though? I certainly see it occasionally in other formulae.”

If the data is installed into share, it could be configured to be linked into /usr/local, and that location could then be used as the set path. tesseract-lang could similarly look in /usr/local/share rather than the keg share. Make sense?
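
To make the idea concrete, here is a small simulation of two kegs contributing files to one shared tessdata directory. It is only a sketch under stated assumptions: temporary directories stand in for the Cellar and for /usr/local, `link_keg_tessdata` is a hypothetical helper, and the real linking would be done by Homebrew itself, not by a script like this.

```shell
# Simulation only: temporary directories stand in for the Cellar and the prefix.
# link_keg_tessdata is a hypothetical helper: $1 = keg directory, $2 = prefix.
link_keg_tessdata() {
  mkdir -p "$2/share/tessdata"
  for f in "$1"/share/tessdata/*.traineddata; do
    [ -e "$f" ] || continue   # skip the unexpanded glob if the keg has no data
    ln -sf "$f" "$2/share/tessdata/$(basename "$f")"
  done
}

demo_root="$(mktemp -d)"
mkdir -p "$demo_root/Cellar/tesseract/4.0.0/share/tessdata" \
         "$demo_root/Cellar/tesseract-lang/4.0.0/share/tessdata"
# Empty placeholder files instead of real model data.
touch "$demo_root/Cellar/tesseract/4.0.0/share/tessdata/eng.traineddata" \
      "$demo_root/Cellar/tesseract/4.0.0/share/tessdata/osd.traineddata" \
      "$demo_root/Cellar/tesseract-lang/4.0.0/share/tessdata/deu.traineddata"

# Each keg links its own files into the same shared directory.
link_keg_tessdata "$demo_root/Cellar/tesseract/4.0.0" "$demo_root/prefix"
link_keg_tessdata "$demo_root/Cellar/tesseract-lang/4.0.0" "$demo_root/prefix"

ls "$demo_root/prefix/share/tessdata"
```

Because the links are made per file rather than as one symlink for the whole tessdata directory, the two kegs do not compete for the same path, and a tesseract configured to read the shared directory would see the languages from both.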

@sswink

sswink commented Feb 13, 2019

Hmm... And what about a tesseract-training-tools formula that installs the extra tools for tesseract? I'm having trouble with this now: I can't find the training commands such as unicharset_extractor and mftraining. If they are installed together with tesseract, can someone tell me where they are?

@albertosottile
Contributor

albertosottile commented Feb 14, 2019

I wrote a tesseract-lang formula and updated the original tesseract formula according to the above suggestions from @MikeMcQuaid. As expected, this leads to

/usr/local/Cellar/tesseract/4.0.0 (63 files, 30.0MB) *
/usr/local/Cellar/tesseract-lang/4.0.0 (163 files, 651.8MB) *

I also updated the test block to perform an actual OCR task using the examples provided in the git repository, and everything seems to work fine in both formulae (using German as the extra tested language). Should I make a new PR, or would you prefer that I update this one, if @varenc agrees?

@MikeMcQuaid
Member

“Should I make a new PR or do you prefer if I update this one, if @varenc agrees?”

@albertosottile Great work! Please create a new PR, thanks!

@fxcoudert
Member

Thanks @varenc for starting the discussion, and thanks @albertosottile for the pull request!

@fxcoudert fxcoudert closed this Feb 16, 2019
@lock lock bot added the outdated PR was locked due to age label Mar 18, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Mar 18, 2019