Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too restrictive file mode for mets.xml #403

Closed
stweil opened this issue Jan 14, 2020 · 18 comments
Closed

Too restrictive file mode for mets.xml #403

stweil opened this issue Jan 14, 2020 · 18 comments

Comments

@stweil
Copy link
Contributor

stweil commented Jan 14, 2020

When mets.xml is created or modified by ocrd, the new file mode only allows access by the owner which is typically not sufficient for practical use:

-rw------- 1 stweil stweil 822382 Jan 13 22:43 mets.xml

So it is always necessary to add a chmod command after processing with OCR-D.

The code should use the default file mode which respects umask.

@stweil stweil changed the title Wrong file mode for mets.xml Too restrictive file mode for mets.xml Jan 14, 2020
@stweil
Copy link
Contributor Author

stweil commented Jan 14, 2020

The problem is caused by atomic_write.

@stweil
Copy link
Contributor Author

stweil commented Jan 14, 2020

https://www.oreilly.com/library/view/python-cookbook/0596001673/ch04s25.html shows a portable alternative solution.

Do we need atomic writes here at all? Just using open would fix the problem, too, and we could remove a dependency. This would require documentation that certain commands (for example ocrd workspace add) must not run in parallel for the same METS file.

I doubt that atomic_write does what is really needed, namely allowing any number of processes modifying and reading the METS file without loosing written data or reading an inconsistent file. So maybe the problem is not the restrictive file mode, but incomplete file locking:

  • When reading a METS file, a lock is needed which only allows read access for any process.
  • When modifying a METS file, it must be opened with a lock for exclusive access before reading and writing.
  • If a lock fails, the program must wait until it succeeds to get the required lock or fail after a timeout.

@kba
Copy link
Member

kba commented Jan 14, 2020

I doubt that atomic_write does what is really needed

You can check the code here, it's really short https://github.com/untitaker/python-atomicwrites/blob/master/atomicwrites/__init__.py

I don't want to move away from atomicwrites unless there is a real reason not to use it, it seems to do what it says on the package.

The code should use the default file mode which respects umask.

There's some entry points in atomicwrites to influence this, I'll have a look. See also untitaker/python-atomicwrites#42

@stweil
Copy link
Contributor Author

stweil commented Jan 14, 2020

I don't want to move away from atomicwrites unless there is a real reason not to use it, it seems to do what it says on the package.

What is required? Example: Should it be possible to run several ocrd workspace add commands which add new entries to the same METS file in parallel?

Without trying it I would say it cannot work: Task 1 reads the data, task 2 reads / modifies / writes the data, task 1 modifies / writes the data and overwrites the data which was written by task 2. I did not see locks to prevent that scenario, and atomic_write would not help here.

@stweil
Copy link
Contributor Author

stweil commented Jan 14, 2020

You can check the code here.

I checked it before writing that I have doubts.

@bertsky
Copy link
Collaborator

bertsky commented Jan 15, 2020

What is required? Example: Should it be possible to run several ocrd workspace add commands which add new entries to the same METS file in parallel?

One of the reasons for this was safe SIGINT (ctrl+c) IIRC.

@stweil
Copy link
Contributor Author

stweil commented Jan 15, 2020

SIGINT handling could be done with signal.signal. I think it would be sufficient to ignore SIGINT while writing the METS file. Or do we expect that this file can be so large that it should be possible to abort the writing?

Maybe it would even make sense to block SIGINT while processing a page and handle it only after a page was processed.

@kba
Copy link
Member

kba commented Jan 15, 2020

I think it would be sufficient to ignore SIGINT while writing the METS file

That was my initial approach but I was happy to delegate this to a third-party library that handles the nitty-gritty details.

@stweil
Copy link
Contributor Author

stweil commented Jan 15, 2020

I see. So the have several requirements to address:

  • Safe handling of SIGINT
  • Support for parallel r/w access
  • Correct file mode

@bertsky
Copy link
Collaborator

bertsky commented Jan 15, 2020

Could also be SIGABRT, SIGTERM, SIGSEGV, SIGIO, SIGPIPE etc.

Maybe it would even make sense to block SIGINT while processing a page and handle it only after a page was processed.

No, ctrl+c should have immediate effect! Think of a badly written processor...

I don't think parallel access is really the issue here – as you said yourself, with the current API, which allows reading METS separate/independent from the modifications to be made, this is impossible, and we could make use of that kind of parallelization only in non-linear workflows anyway.

@stweil
Copy link
Contributor Author

stweil commented Jan 15, 2020

Could also be SIGABRT, SIGTERM, SIGSEGV, SIGIO, SIGPIPE etc.

Of course only few of those can occur during a file write.

I don't think parallel access is really the issue here.

If we agree that parallel access is not an issue but simply unsupported for the moment, that simplifies things a lot and we only have to document it clearly.

But some day people will ask whether on ocrd process can be made faster. This is less important as long as I have to process many books, so its possible to parallelize on the book level. It gets important if I have a single book and are waiting for the result. Then the single processes could run in parallel with a slight offset, so a page which was processed by the first process could be passed to the second process while the first process starts the next page and so on.

@bertsky
Copy link
Collaborator

bertsky commented Jan 15, 2020

Yes, page pipelining is another desideratum for a better API.

But for certain kinds of GPU-dependent processors, document/book-parallel processing (with a data generator that queues entire books into batches) might be the only efficient option.

Right now all that processors can do is parallelize across the line or page (which is not possible for stateful/contiguous processors).

I would also like to see this recurring theme documented and accessible (perhaps as a permament issue in spec)...

@stweil
Copy link
Contributor Author

stweil commented Sep 6, 2020

This is still an issue:

21:47:30.044 INFO ocrd.resolver - Writing METS to /tmp/xyz/mets.xml
open("/tmp/xyz/mets.xml", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 3
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2335, ...}) = 0
21:47:30.046 INFO ocrd.workspace - Saving mets '/tmp/xyz/mets.xml'
open("/tmp/xyz/tmpzz0d09k3", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = 3
open("/tmp/xyz/tmpzz0d09k3", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 3
rename("/tmp/xyz/tmpzz0d09k3", "/tmp/xyz/mets.xml") = 0

The strace protocol shows that first mets.xml is created with mask 0666, but later another file which was created with mask 0600 is renamed to mets.xml. So the final file can only be read by the user who created it. Would a chmod in the Python code after the rename be an acceptable workaround until there is a better solution?

@stweil
Copy link
Contributor Author

stweil commented Sep 19, 2020

See pull request #608 for a possible fix.

@bertsky bertsky linked a pull request Sep 21, 2020 that will close this issue
@bertsky
Copy link
Collaborator

bertsky commented Sep 21, 2020

Would a chmod in the Python code after the rename be an acceptable workaround until there is a better solution?

As the discussion on untitaker/python-atomicwrites#42 shows, that might create another race IIUC.

@stweil
Copy link
Contributor Author

stweil commented Sep 21, 2020

As far as I see that race condition is only relevant for massiv parallel writes to the METS file. I currently don't have such use cases, and I am afraid that atomic_write would not correctly support them, too, because changes of the METS file might get lost.

@bertsky
Copy link
Collaborator

bertsky commented Sep 21, 2020

As far as I see that race condition is only relevant for massiv parallel writes to the METS file. I currently don't have such use cases, and I am afraid that atomic_write would not correctly support them, too, because changes of the METS file might get lost.

Ah, right, I remember now. We only use this for atomic signal handling, not against potential races.

@bertsky
Copy link
Collaborator

bertsky commented Nov 4, 2020

Fixed by #625

@bertsky bertsky closed this as completed Nov 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants