Search path for model files #7

Open
mikegerber opened this issue Oct 28, 2019 · 11 comments
Assignees
Labels
enhancement New feature or request

Comments

@mikegerber
Collaborator

@bertsky in #6:

Moreover, I think it would be useful to not rely on the CWD for relative paths, because this is not reliable across the many layers (e.g. a script which calls ocrd-calamari-recognize which calls calamari.ocr.MultiPredictor). Instead, like TESSDATA for Tesseract, one could define an installation prefix via setuptools (overridable via environment variable), or simply use os.path.dirname(os.path.abspath(__file__)) as reference, i.e. the directory where ocrd_calamari is installed. Absolute pathnames should stay untouched, however.
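
The rule proposed in this quote (absolute paths untouched, relative paths resolved against a fixed base directory instead of the CWD) could look roughly like the sketch below. The function name `resolve_model_path` and its `base_dir` argument are hypothetical and are not part of ocrd_calamari.

```python
import os


def resolve_model_path(path, base_dir):
    """Return `path` unchanged if it is absolute, else resolve it against `base_dir`.

    `base_dir` stands in for whatever reference directory is chosen,
    so the result no longer depends on the current working directory.
    """
    if os.path.isabs(path):
        return path
    return os.path.normpath(os.path.join(base_dir, path))
```

Inside an installed package, `base_dir` could be computed as `os.path.dirname(os.path.abspath(__file__))`, as suggested in the quote.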

@mikegerber
Collaborator Author

If I understand the reporter correctly, he wants to install ocrd_calamari somewhere and relative pathnames in the checkpoint parameter should be relative to the called ocrd_calamari installation, not the cwd of the caller.

I am not sure if I find this useful or intuitive.

@bertsky
Contributor

bertsky commented Oct 28, 2019

If I understand the reporter correctly, he wants to install ocrd_calamari somewhere and relative pathnames in the checkpoint parameter should be relative to the called ocrd_calamari installation, not the cwd of the caller.

Exactly. Either that or make that behaviour configurable at installation time or via environment variable.

I am not sure if I find this useful or intuitive.

How else would you encapsulate a call to ocrd-calamari-recognize (i.e. a parameter file) in a portable way? You need to become independent of both absolute filenames (because they are different on every host) and the CWD (because you do not want to know where the workflow engine happened to start the CLI). Think shared workflow configuration. Cf. Tesseract (using TESSDATA) and ocrd_cis for Ocropy (using the installation path).

@mikegerber
Collaborator Author

So your use case is specifying a parameter file like this:

{
    "checkpoint": "GT4HistOCR/*.ckpt.json"
}

And because of convention, ocrd_calamari would find these model files in e.g. share/calamari/models relative to the installation prefix/virtualenv.

I'll check if Calamari has a mechanism like this - a search path - first, but I think this could be implemented in that way.
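
As a rough illustration of that convention, the sketch below expands a relative glob pattern such as "GT4HistOCR/*.ckpt.json" under a models directory below the installation prefix. The function name `find_checkpoints` is hypothetical, and using `sys.prefix` as the installation prefix is an assumption; only the `share/calamari/models` layout is taken from the comment above.

```python
import glob
import os
import sys


def find_checkpoints(pattern, models_dir=None):
    """Expand a checkpoint glob pattern relative to a models directory."""
    if models_dir is None:
        # Assumed convention: <prefix>/share/calamari/models, with the
        # prefix taken from the active interpreter or virtualenv.
        models_dir = os.path.join(sys.prefix, "share", "calamari", "models")
    if os.path.isabs(pattern):
        # Absolute patterns are expanded as given.
        return sorted(glob.glob(pattern))
    return sorted(glob.glob(os.path.join(models_dir, pattern)))
```

An empty list means that nothing matched, which the caller would probably want to report as an error.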

@bertsky
Contributor

bertsky commented Oct 28, 2019

Exactly. I don't think Calamari itself has it. But ocrd_calamari could independently provide that behaviour. At the very least, non-absolute path expressions could be prefixed by os.path.dirname(os.path.abspath(__file__))/.

@mikegerber mikegerber changed the title Do not rely on the CWD Search path for model files Oct 28, 2019
@mikegerber mikegerber added the enhancement New feature or request label Oct 28, 2019
@bertsky
Contributor

bertsky commented Nov 10, 2019

BTW, I think a good place to look at for this is the Ocropus path search – modulo .gz vs JSON globbing of course.

For a suitable name of the new environment variable, I recommend CALAMARI_DATA or CALAMARI_DATAPREFIX. (Ocropy uses OCROPUS_DATA and Tesseract uses TESSDATA_PREFIX.)

@kba
Member

kba commented Nov 11, 2019

I like the idea and I'd prefer CALAMARI_DATA; I always found the TESSDATA_PREFIX mechanism unintuitive. An environment variable is easy to inject into both shell and Docker execution, so that's a plus.
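
A minimal sketch of how such a variable might be read follows. It assumes that CALAMARI_DATA may hold several directories separated by the platform's path separator, and that a default directory under the installation prefix is appended as a fallback. Neither assumption is settled in this thread, and `model_search_path` is a hypothetical name.

```python
import os
import sys


def model_search_path(environ=None):
    """Return the directories to search for models, in priority order."""
    environ = os.environ if environ is None else environ
    value = environ.get("CALAMARI_DATA", "")
    # Directories from the environment variable come first; empty entries are dropped.
    dirs = [d for d in value.split(os.pathsep) if d]
    # Assumed fallback location below the installation prefix.
    dirs.append(os.path.join(sys.prefix, "share", "calamari", "models"))
    return dirs
```

Passing `environ` explicitly makes the function easy to exercise without touching the real process environment.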

@mikegerber mikegerber self-assigned this Nov 18, 2019
@kba
Member

kba commented Nov 17, 2020

As a coda to this issue: Can we get rid of the wildcard in the checkpoint parameter? Is there ever a reason not to use all the *.ckpt.json files in a model folder?

@mikegerber
Collaborator Author

As a coda to this issue: Can we get rid of the wildcard in the checkpoint parameter? Is there ever a reason not to use all the *.ckpt.json files in a model folder?

Probably not, but the main reason I chose to do the checkpoint parameter this way is because upstream Calamari's calamari-predict also requires giving all files:

  --checkpoint CHECKPOINT [CHECKPOINT ...]
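
For readers unfamiliar with that usage line: it is the form Python's argparse prints for an option declared with `nargs="+"`, i.e. one or more values. The sketch below is a stand-alone illustration with a made-up program name, not Calamari's actual parser.

```python
import argparse

# Stand-alone illustration; "predict-sketch" is a made-up program name.
parser = argparse.ArgumentParser(prog="predict-sketch")
parser.add_argument("--checkpoint", nargs="+", required=True,
                    help="one or more checkpoint files")

# The shell would normally expand a wildcard into this explicit list.
args = parser.parse_args(["--checkpoint", "0.ckpt.json", "1.ckpt.json"])
print(args.checkpoint)  # ['0.ckpt.json', '1.ckpt.json']
```

Since the parameter file is not passed through a shell, the wildcard in `checkpoint` has to be expanded by the processor itself.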

@kba
Member

kba commented Dec 3, 2020

Maybe we could add a checkpoint_dir parameter?

@mikegerber
Collaborator Author

I can imagine having another parameter model (or checkpoint_dir like @kba suggested) to do something like this (pseudo code):

import glob
import os

if not checkpoint:
    checkpoint = model + "/*.ckpt.json"
for path in search_path.split(":"):
    candidates = glob.glob(os.path.join(path, checkpoint))
    if candidates:
        ...  # candidates holds the full paths

I forgot about this issue. I'll have a look at how OCRopus does it first, but I think I'd want something like the pseudo-code above.
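
A runnable take on that idea might look like the sketch below. The function name `locate_checkpoints`, the first-match-wins policy, and the empty-list result when nothing is found are assumptions for illustration, not decisions made in this thread.

```python
import glob
import os


def locate_checkpoints(search_path, checkpoint=None, model=None):
    """Return the checkpoint files from the first search directory with a match."""
    if not checkpoint:
        if not model:
            raise ValueError("either checkpoint or model must be given")
        # Default pattern: every checkpoint in the model directory.
        checkpoint = os.path.join(model, "*.ckpt.json")
    if os.path.isabs(checkpoint):
        # Absolute patterns bypass the search path entirely.
        return sorted(glob.glob(checkpoint))
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue
        matches = sorted(glob.glob(os.path.join(directory, checkpoint)))
        if matches:
            return matches
    return []
```

Here `search_path` is a string of directories joined by `os.pathsep` (":" on POSIX systems), mirroring the `split(":")` in the pseudo-code.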

@paulpestov paulpestov added this to Ideas in coordinate_all Aug 30, 2021
@kba kba removed this from Ideas in coordinate_all Sep 6, 2021
@bertsky
Contributor

bertsky commented Mar 18, 2023

Simply renaming checkpoint_dir to model (or aliasing and deprecating) would also be preferable IMO.
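
The "aliasing and deprecating" option could be handled when the parameters are read, roughly as sketched below. The helper `normalize_parameters` is hypothetical, and the sketch assumes the parameters arrive as a plain dict; it is not taken from ocrd_calamari.

```python
import warnings


def normalize_parameters(params):
    """Return a copy of `params` with the old name mapped to the new one."""
    params = dict(params)
    if "checkpoint_dir" in params:
        warnings.warn("'checkpoint_dir' is deprecated; use 'model' instead",
                      DeprecationWarning, stacklevel=2)
        # An explicitly given 'model' wins over the deprecated alias.
        params.setdefault("model", params.pop("checkpoint_dir"))
    return params
```

Existing parameter files would keep working for a transition period while users are nudged towards the new name.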

3 participants