Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bagit.py and broken soft links #115

Open
ajnelson-nist opened this issue Sep 24, 2018 · 4 comments
Open

bagit.py and broken soft links #115

ajnelson-nist opened this issue Sep 24, 2018 · 4 comments

Comments

@ajnelson-nist
Copy link

Hello,

I encountered an issue today with bagit.py failing to deal with a broken soft link, and halting bagging an otherwise-intact file system tree. I was attempting to use the script on a directory tree that included a soft-linked file apparently meant to be set at a later date (e.g. .../foo.cfg pointing to .../config/populated_by_user/foo.cfg, similar to what the Apache webserver does with config files). The execution environment in this case is in a POSIX-interfaced file system.

The problem in bagit.py appears to stem from the function _can_read, on (today's) Line 1362. The broken soft link in this case pointed at a non-existent directory, raising an error on Line 206.

I suggest that a broken soft link should not prevent a directory tree from being bagged. It may be better for _can_read to only report actual directories and files that are unreadable, possibly with broken links as a new third output.

If helpful, there is a script that converts a file system walk (via os.walk) to DFXML, and it has an if-ladder that goes through all file-system-level file types, not just directories and regular files. See the walk_to_dfxml.py function filepath_to_fileobject, and all assignment statements matching name_type = (starting on Line 36 today). You may want _can_read to skip operating on other file types as well.

--Alex

@acdha
Copy link
Member

acdha commented Sep 24, 2018

Bagit.py is designed to package entire directories. If it can’t do that the only safe thing to do is fail so the unreadable files can be corrected.

The long-term refactoring to make bagit more customizable might allow you to create a subclass which could implement custom ignore logic but I haven’t had time to work on that in awhile.

@ajnelson-nist
Copy link
Author

Could there instead be a flag that allows broken soft links to be preserved, instead of failing the whole process? In the application where I'm using bags, "correcting" files is not an option. They are legitimately included as "direction" files.

@edsu
Copy link
Contributor

edsu commented Nov 19, 2018

v1.0 of the BagIt specification doesn't preclude the use of links in the payload directory, it simply requires the filename be readable as binary data (2.1.2):

Each payload file is treated as an opaque octet stream when verifying file correctness.

In addition to the constraint in _can_read identified by @ajnelson-nist above, the current version of bagit-python does not allow symbolic links (broken or not) to exist in the payload directory. Both bagging and validation operations use Bag._path_is_dangerous to sanity check the path, and if the link points outside the payload directory it will fail.

We could consider relaxing bagging and validation in bagit-python to allow symbolic links, since they don't seem as dangerous as things like .../../, /etc/passwd, etc which Bag._is_dangerous guards against. In fact, symlinks could actually be useful in some situations, like when managing storage volumes.

However if a symbolic link in the payload directory is broken, it is not possible to read it as an opaque octet stream, and thus it is an invalid payload file name. It seems like this use case is out of scope for the bagit-python utility, at least with respect to the v1.0 specification.

I do like the idea of a customizable bagit-python that could allow the Bag class to be extended in various ways for custom behavior. For example you could imagine an allow_broken_symlinks option that would create a new tag file broken_symlinks.txt that could be populated during bagging and consulted during verification.

@acdha I'd be interested in learning more about the refactor you had planned, and if I could perhaps help with it. Full disclosure, I am working with @ajnelson-nist at NIST on the use of BagIt within the National Software Reference Library. Perhaps me doing a heavy lift on the refactor you had in mind could be mutually amenable to LC and NIST?

edsu added a commit to edsu/bagit that referenced this issue Nov 19, 2018
This commit adds a follow_links parameter to Bag.make_bag and
Bag.__init__. When set to true Bag._is_dangerous will allow
links to files that exist outside of the payload directory. By
default it is set to False, but I think a reasonable argument
could be made for making the default set to True.

See LibraryOfCongress#115 for more context.
@edsu
Copy link
Contributor

edsu commented Dec 6, 2018

Just to be transparent about this, I met up with @acdha @dbrunton and Liz Madden from LC and we discussed the possibility of me contributing to a v2.0 version of bagit-python that would be backward compatible with the existing implementation but provide a more modular approach that could be easier to extend for various purposes (like the use case this issue is concerned with).

I'm going to put together a proposal for how we talked about moving things around to get some feedback from people who are using bagit-python.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants