Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to delete a group #245

Closed
mikegerber opened this issue Jun 19, 2019 · 24 comments
Closed

Ability to delete a group #245

mikegerber opened this issue Jun 19, 2019 · 24 comments
Assignees
Labels
Projects

Comments

@mikegerber
Copy link
Contributor

ocrd workspace delete-group could be a command to delete a group/fileGrp in the workspace/METS file.

My use case for this:

Currently, I use xmlstarlet to remove a group of files so that I can repeat a script run, i.e. to enable repeated (and almost idempotent) runs of the following:

remove_filegrp() {
  filegrp_use=$1
  mets=$2

  xmlstarlet ed --inplace \
    -N mets=http://www.loc.gov/METS/ \
    -d "//mets:fileGrp[@USE='$filegrp_use']" $mets
}

remove_filegrp OCR-D-SEG-LINE mets.xml
ocrd-ocropy-segment -l DEBUG -m mets.xml -I OCR-D-IMG -O OCR-D-SEG-LINE
@bertsky
Copy link
Collaborator

bertsky commented Jun 19, 2019

I see the need, too.

It is also already possible to use the CLI's workspace backup facility for this, but you have to be careful to create a backup beforehand, and undo does not always pick the right version, and restore sometimes fails because the checksum is not unique (and then requires a full backup filename).

But I generally would prefer to have an option for overwriting existing groups and files instead.

@bertsky
Copy link
Collaborator

bertsky commented Jun 19, 2019

@mikegerber BTW you should also remove those file IDs in the structMap/div/fptr references. Here is an additional XPath expression to have xmlstarlet delete (before the actual removal):

//mets:structMap//mets:fptr[@FILEID=//mets:fileGrp[@USE='$filegrp_use']/mets:file/@ID]

@kba
Copy link
Member

kba commented Jun 20, 2019

I see the need, too.

OK then, ocrd workspace rm-group shouldn't be too hard to implement. Should this behave like rmdir and only delete empty fileGroups or like rm -r and also delete non-empty fileGroups?

Probably should also have the equivalent for removing files then, i.e. ocrd workspace rm. What do you think?

But I generally would prefer to have an option for overwriting existing groups and files instead.

Like an option --overwrite/--force for ocrd workspace add?

@kba
Copy link
Member

kba commented Jun 20, 2019

Like an option --overwrite/--force for ocrd workspace add?

We already have --force ("If file with ID already exists, replace it"). What behavior changes do you want here?

@bertsky
Copy link
Collaborator

bertsky commented Jun 20, 2019

Like an option --overwrite/--force for ocrd workspace add?

We already have --force ("If file with ID already exists, replace it"). What behavior changes do you want here?

No, I meant an option for ocrd process and all individual module processor CLIs. (The --force option is not exposed to those CLIs.)

Thus we can have repeated runs without even looking at which files and groups are affected.

OK then, ocrd workspace rm-group shouldn't be too hard to implement. Should this behave like rmdir and only delete empty fileGroups or like rm -r and also delete non-empty fileGroups?

Probably should also have the equivalent for removing files then, i.e. ocrd workspace rm. What do you think?

IMHO, this depends on the above: if we get a simple way to repeat processing steps, then these removal operations would become but an extra, for special circumstances. So it would be okay to be cautious and have the user first delete all files (and file references!) – like rm – and then empty groups – like rmdir.

But if (for some reason) we don't get some ocrd process -f, then removal should be easy to use – like rm -fr.

@kba kba added this to Backlog in coordinate Jun 20, 2019
@mikegerber
Copy link
Contributor Author

@mikegerber BTW you should also remove those file IDs in the structMap/div/fptr references.

My current WIP workflow using ocrd-typegroups-classifier, ocrd-ocropy-segment & ocrd-tesserocr-recognize does not create any such references?

@mikegerber
Copy link
Contributor Author

But if (for some reason) we don't get some ocrd process -f, then removal should be easy to use – like rm -fr.

I agree.

@bertsky
Copy link
Collaborator

bertsky commented Jun 20, 2019

@mikegerber BTW you should also remove those file IDs in the structMap/div/fptr references.

My current WIP workflow using ocrd-typegroups-classifier, ocrd-ocropy-segment & ocrd-tesserocr-recognize does not create any such references?

But it should could some day. Since GROUPID was abandoned in favour of physical page ID, every processor should at least pass on the page id of the input. (And those processors which already do will create such additional references.)

See here

@kba
Copy link
Member

kba commented Jun 20, 2019

My current WIP workflow using ocrd-typegroups-classifier, ocrd-ocropy-segment & ocrd-tesserocr-recognize does not create any such references?

It should do that when using the OcrdMets/OcrdFile APIs and specifying a pageId for workspace.add_file. But it's optional for the ocrd workspace add command and not implemented by the MPs you've listed. When iterating over the input files, the pageId of those input files should be passed on to the output files...

Would it make sense to require the pageId to be set and fail if it is missing for all add operations?

@kba
Copy link
Member

kba commented Jun 20, 2019

OLD:

self.workspace.add_file(
    ID=ID,
    file_grp=self.output_file_grp,
    mimetype=MIMETYPE_PAGE,
    local_filename="%s/%s" % (self.output_file_grp, ID),
    content=to_xml(pcgts)
)

NEW

self.workspace.add_file(
    ID=ID,
    file_grp=self.output_file_grp,
    mimetype=MIMETYPE_PAGE,
    local_filename="%s/%s" % (self.output_file_grp, ID),
    content=to_xml(pcgts),
	pageId=input_file.pageId
)

@kba
Copy link
Member

kba commented Jun 20, 2019

Sorry, that was redundant. What @bertsky said

@bertsky
Copy link
Collaborator

bertsky commented Jun 20, 2019

Would it make sense to require the pageId to be set and fail if it is missing for all add operations?

I think it should be required for Processor when it wants to add page-level result files – this is my understanding of the spec. But there are other cases, where it must not be required: adding files for AlternativeImage for example.

kba added a commit to kba/ocrd-core that referenced this issue Jul 9, 2019
kba added a commit to kba/ocrd-core that referenced this issue Jul 11, 2019
@kba
Copy link
Member

kba commented Aug 13, 2019

There are now (as of https://github.com/OCR-D/core/releases/tag/v1.0.0b13) two commands for this:

  • ocrd workspace remove [--force] ID to delete file with ID ID. If --force: delete from local storage if file references local file
  • ocrd workspace remove-group [--recursive] [--force] to delete file group GROUP. If --recursive is given, also delete containing mets:files. If --force is given, also delete local files.

Programmatic access via remove(ID) and remove_group(GROUP) in Workspace and OcrdMets.

Comment if you need further file removal capabilities in ocrd process.

@kba kba closed this as completed Aug 13, 2019
coordinate automation moved this from Backlog to Done Aug 13, 2019
@bertsky
Copy link
Collaborator

bertsky commented Aug 13, 2019

Sorry to bring this up so late, but could we make both the ID argument for remove and the GROUP argument for remove-group multi-valent (i.e. Click's nargs=-1)?

Also, would it be much extra effort to also provide ocrd process --overwrite (and/or the same option for ocrd.decorators.ocrd_cli_options for all individual CLIs)?

@kba kba reopened this Aug 14, 2019
coordinate automation moved this from Done to In progress Aug 14, 2019
kba added a commit to kba/ocrd-core that referenced this issue Aug 15, 2019
kba added a commit that referenced this issue Aug 15, 2019
make ID/GROUP arguments to remove/remove-group variadic, #245
@bertsky
Copy link
Collaborator

bertsky commented Aug 30, 2019

IMO the current situation with removal is still unsatisfactory. We have --force for remove and remove-group, but this does not behave like rm -f at all (despite our discussion), instead it allows for removal in local filesystem (which should really be unconditional). It complains if the group/file did not exist yet – this is what I would expect to suppress via --force. And without --overwrite / -o in the processor, we always have to write extra steps for removal in our workflow configurations.

I am currently exploring the possibility of makefileization for workflow configuration – squashing configuration syntax (including incremental parameters, what can be done in parallel, and what should be evaluated in the end) and engine logic (calculation of dependencies for rebuilding partially and parallel execution). For that alone the aforementioned changes would be highly valuable.

@bertsky
Copy link
Collaborator

bertsky commented Sep 24, 2019

To elaborate on --force: Currently I have 2 options:

  1. not use -f and leave lots of stale files and directories, which I cannot trivally remove without the risk of removing files and directories along which are still referenced in the METS.
  2. use -f but risk stumbling across cases were files have accidentally been removed already, in which case the METS does not get updated at all, but the remove-group action aborts:
    workspace.remove_file_group(g, recursive, force)
  File "core/ocrd/ocrd/workspace.py", line 148, in remove_file_group
    self.remove_file(f.ID, force=force)
  File "core/ocrd/ocrd/workspace.py", line 130, in remove_file
    unlink(ocrd_file.local_filename)
FileNotFoundError

Neither of those can be scripted easily.

@bertsky
Copy link
Collaborator

bertsky commented Sep 24, 2019

And rmdir is missing in the current implementation as well (it should belong to remove-group).

@bertsky
Copy link
Collaborator

bertsky commented Oct 16, 2019

And a short option variant -f is missing for remove.

@mikegerber
Copy link
Contributor Author

mikegerber commented Nov 6, 2019

I'd like to note that the missing rmdir and the FileNotFoundError is something that I deal with every day when handling workspaces. (Of course I deal with a lot of slightly inconsistent workspaces due to developing, fixing bugs etc., nevertheless it would make life more pleasant if the rmdir was there and the FileNotFoundError was caught.)

@bertsky
Copy link
Collaborator

bertsky commented Nov 6, 2019

Same here. This not only complicates development. I also see it coming with the pilot users.

@kba kba added the bug label Nov 6, 2019
@kba
Copy link
Member

kba commented Nov 6, 2019

I hear you, I'm on it. This, the TEMP-directory-on-validation and input-file-copying, and improving ocrd process are next on my agenda for core.

@bertsky bertsky added this to the Final workshop milestone Jan 10, 2020
@stweil
Copy link
Contributor

stweil commented Jan 13, 2020

Is it already possible to add all files of a directory as a new group, or would that require an add-group command?

@kba
Copy link
Member

kba commented Jan 13, 2020 via email

kba added a commit to kba/ocrd-core that referenced this issue Jan 16, 2020
kba added a commit to kba/ocrd-core that referenced this issue Jan 21, 2020
kba added a commit that referenced this issue Jan 21, 2020
improve workspace.remove_file/remove_group, #245
@kba
Copy link
Member

kba commented Jan 21, 2020

Removal Implemented in #409

@kba kba closed this as completed Jan 22, 2020
coordinate automation moved this from In progress to Done Jan 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
No open projects
coordinate
  
Done
Development

No branches or pull requests

4 participants