Manifest filename escaping #509

Open
diamondap opened this issue Feb 15, 2022 · 3 comments

Comments

@diamondap
Member

This is a general bug filed against the manifest spec at LibraryOfCongress/bagit-spec#46.

TLDR: No existing baggers correctly implement the BagIt spec's filename percent-encoding requirements. Fixing this in DART likely means breaking compatibility with all other baggers and validators.

Wait to see how the community moves on this.

@pwinckles

I was just starting to file issues for implementations. I would also note:

  1. That, if desired, it would be possible to support validating bags with manifests that are not encoded correctly by falling back to the old behavior if spec-compliant validation fails.
  2. Paths in fetch.txt are supposed to be encoded the same as manifest paths. I have not done a survey of fetch.txt implementations, but I would be surprised if there was not a general problem here as well. I'm not sure how this could be addressed in a backward compatible way.
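
A minimal sketch of the fallback idea in point 1, not taken from any existing bagger. The function name `resolve_manifest_path` and the `payload_files` set are hypothetical, and the sketch assumes the fallback is applied per manifest entry rather than per bag. It first tries the percent-decoded path and only then the raw manifest string, which is how existing implementations read it.

```python
from urllib.parse import unquote


def resolve_manifest_path(manifest_path, payload_files):
    """Return the payload path a manifest entry refers to, or None.

    manifest_path: the path exactly as written in the manifest line.
    payload_files: set of payload paths actually present in the bag
                   (hypothetical input, e.g. gathered by walking data/).
    """
    # Spec-compliant reading: percent-decode the manifest path first.
    decoded = unquote(manifest_path)
    if decoded in payload_files:
        return decoded
    # Fallback (old behavior): treat the manifest path as a literal name.
    if manifest_path in payload_files:
        return manifest_path
    return None


payload = {"data/test%201.txt"}
# A correctly encoded manifest entry resolves through decoding.
print(resolve_manifest_path("data/test%25201.txt", payload))
# An entry written by an existing implementation resolves through the fallback.
print(resolve_manifest_path("data/test%201.txt", payload))
```

Trying the decoded path first means a correctly encoded bag is never misread. The fallback only applies when the decoded name does not exist in the payload.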

@diamondap
Member Author

Thanks. The fallback for backwards compatibility is a good idea.

@pwinckles

It occurred to me that another approach to backwards compatible validation would be to only decode %0D, %0A, and %25 when decoding paths in manifests. Normally, when percent-decoding, you'd decode all encoded characters as described here. However, by only decoding these three a correct BagIt 1.0 implementation would still be able to validate most bags produced by existing implementations.

For example, if a bag contains the file test%201.txt, then an existing implementation would write it to the manifest as data/test%201.txt when it should actually be data/test%25201.txt. However, if you only decode %0D, %0A, and %25, then the paths are equivalent.

This approach does not work for files whose names naturally include one of these three strings. For example, if a bag has a file named test%251.txt, existing implementations would write it to the manifest as data/test%251.txt when it should be data/test%25251.txt. These paths are not equivalent: the first decodes to data/test%1.txt and the second decodes to data/test%251.txt.

While not perfect, I think this approach would greatly improve validation compatibility.
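
The limited-decoding idea can be sketched as follows. This is an illustration under stated assumptions, not code from DART or any other bagger: the helper names `encode_manifest_path` and `decode_limited` are hypothetical, and accepting lowercase hex digits (`%0d`, `%0a`) is an assumption of this sketch.

```python
import re

# Only these three escapes are decoded; everything else (e.g. %20) is left alone.
_LIMITED = re.compile(r"%(0D|0A|25)", re.IGNORECASE)
_DECODE = {"0D": "\r", "0A": "\n", "25": "%"}


def encode_manifest_path(path):
    """Encode a path the way the comment above says a correct bagger should."""
    # Encode '%' first so the '%' introduced for CR and LF is not re-encoded.
    return path.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def decode_limited(path):
    """Decode only %0D, %0A and %25 in a single left-to-right pass."""
    # re.sub does not rescan replaced text, so "%2520" becomes "%20", not " ".
    return _LIMITED.sub(lambda m: _DECODE[m.group(1).upper()], path)


# File test%201.txt: the correct and the legacy manifest entries decode alike.
assert encode_manifest_path("data/test%201.txt") == "data/test%25201.txt"
assert decode_limited("data/test%25201.txt") == decode_limited("data/test%201.txt")

# File test%251.txt: the legacy entry decodes to a different name.
assert decode_limited("data/test%251.txt") == "data/test%1.txt"
assert decode_limited("data/test%25251.txt") == "data/test%251.txt"
```

The last two assertions reproduce the limitation described above: a legacy manifest entry for a file whose name already contains %25 still decodes to the wrong name.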
