Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GT should consistently have file extensions #35

Closed
tboenig opened this issue Jul 11, 2019 · 11 comments
Closed

GT should consistently have file extensions #35

tboenig opened this issue Jul 11, 2019 · 11 comments
Assignees
Labels
groundtruth Groundtruth quality issues
Projects

Comments

@tboenig
Copy link
Contributor

tboenig commented Jul 11, 2019

Files referenced in the METS and stored in the data directory should have file extensions to make life easier for PAGE Viewer etc.

@tboenig tboenig self-assigned this Jul 11, 2019
@bertsky
Copy link
Contributor

bertsky commented Jul 11, 2019

I totally agree. We have started to make this a pattern in the processors (preprocessing, OCR), but IMO core should lead by good example in Workspace.add_file – at least if no local_filename was already specified by the caller – and in the bagger (see OCR-D/core#258).

@kba
Copy link
Member

kba commented Jul 11, 2019

Implementing OCR-D/core#258 will most likely fix this since the extensions are present until bagging...

@bertsky
Copy link
Contributor

bertsky commented Jul 11, 2019

Yes, but processors that do not care about this and do not use local_filename when doing their Workspace.add_file currently also effectively suppress extensions. This case is still in core's responsibility.

@kba
Copy link
Member

kba commented Jul 11, 2019

You're right, I just meant that for our provided GT the problem is the bagger. Workspace.add_file must be fixed too ofc.

@bertsky
Copy link
Contributor

bertsky commented Jul 11, 2019

Oh, now I got it. We just keep agreeing you know!

@kba kba added this to Backlog in coordinate Jul 23, 2019
@kba kba self-assigned this Aug 21, 2019
@cneud cneud moved this from Backlog to In progress in coordinate Oct 16, 2019
@wrznr
Copy link
Collaborator

wrznr commented Oct 23, 2019

@kba Fixed?

@kba
Copy link
Member

kba commented Oct 23, 2019

It's fixed for the bagger but I'm still evaluating whether

when doing their Workspace.add_file currently also effectively suppress extensions. This case is still in core's responsibility.

is still an issue in core.

@kba
Copy link
Member

kba commented Oct 23, 2019

Yes, but processors that do not care about this and do not use local_filename when doing their Workspace.add_file currently also effectively suppress extensions.

Not sure whether I can follow. Can you give me an example when this might happen @bertsky ?

@bertsky
Copy link
Contributor

bertsky commented Oct 23, 2019

Yes, but processors that do not care about this and do not use local_filename when doing their Workspace.add_file currently also effectively suppress extensions.

Not sure whether I can follow. Can you give me an example when this might happen @bertsky ?

I can't see it myself right now. What I do understand is that Workspace.add_file does not in itself require either the local_filename or url kwarg, especially if it does not pass a content with it. So OcrdMets.add_file will instantiate a new OcrdFile and then set local_filename=None.

So it all depends on what then happens with that file reference later-on in the processor. If content was passed to Workspace.add_file, then an exception will come up. (I have already complained about this as a documentation issue.) Otherwise, the processor might use OcrdMets.find_files to get a reference and then do things to it. Somewhere along that path local_filename will/must be set. There we have to look whether it is in core's responsibility to ensure filename extensions.

Sorry, that's all I can offer ATM.

@cneud cneud added the groundtruth Groundtruth quality issues label Nov 5, 2019
@cneud
Copy link
Member

cneud commented Feb 4, 2021

It appears all file extensions are available now in assets/data, a related issue OCR-D/core#332 in core was closed - can this be closed too?

@kba
Copy link
Member

kba commented Feb 4, 2021

Yes, fixed in assets and we're doing file extensions in the processors now as well.

@kba kba closed this as completed Feb 4, 2021
coordinate automation moved this from In progress to Done Feb 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
groundtruth Groundtruth quality issues
Projects
No open projects
coordinate
  
Done
Development

No branches or pull requests

5 participants