Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

various fixes #125

Open
wants to merge 9 commits into
base: master
Choose a base branch
from
Open

various fixes #125

wants to merge 9 commits into from

Conversation

jscancella
Copy link
Contributor

@jscancella jscancella commented Jul 22, 2018

fixes #124 - by renaming to make it more clear that we verify all manifests
fixes #123 - by fixing the typo in the message formatting so the path gets added to the message
fixes #122 - fixed when checking if a file is in at least one manifest and file is hidden. It was not skipping if it was hidden and the ignoreHidden setting was set
fixes #121 - changes the number of threads to be limited by the cpu count instead of unbounded
fixes #119 - by adding test to ensure we don't get warnings when there shouldn't be any

@coveralls
Copy link

coveralls commented Jul 22, 2018

Coverage Status

Coverage increased (+0.07%) to 98.354% when pulling a06001c on jscancella:master into 2b7002e on LibraryOfCongress:master.

@@ -6,9 +6,9 @@
public enum StandardSupportedAlgorithms implements SupportedAlgorithm{
MD5("MD5"),
SHA1("SHA-1"),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this "SHA-1" now different than the others? Now the others don't have this dash anymore, but this one still has it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I will update to include sha1 as well as tests on very basic valid test bags

@rvanheest
Copy link

@jscancella Not sure if I can ask this here or if I should make a new issue for this. However, since you're touching the class I wanted to ask about in this PR, I'd try it here...

I was wondering if there is any special reason for BagVerifier to be final and checkHashes to be package private. In my project we would really benefit from extending from this class and calling checkHashes from another sequence of validation steps (like your isComplete and isValid. I mean, if the final is there only to protect isComplete and/or isValid from being overridden and not conforming to the specs, you could also use final keywords on these methods themselves.

…. Also added tests for valid bags of each standard algorithm
@jscancella
Copy link
Contributor Author

@rvanheest The original reason was to convey that these classes were not designed to be overridden, That's not to say you would be in danger if you did, merely that you were on your own if you wanted to fork this and change the class to be one you can override.

I also personally like to make everything as final as possible to decrease the attack surface should a vulnerability be found in the library or dependency library. But I no longer work at the Library of Congress so if the maintainers wish to make this public that is their choice. What are you doing that you need to check the hashes again later without using BagVerifer?

@rvanheest
Copy link

rvanheest commented Jul 24, 2018

@jscancella I understand why you made these classes final. We defined a another kind of validness (called 'virtually valid'), which is basically the same as 'valid', but with some different rules for the fetch file. So it would be ideal if I could extend from BagVerifier and create a new method that uses the existing functionality for validating the checksums (checkHashes).
By the way, I'm not asking to make checkHashes public! That would expose it too much. Personally I would choose protected in this case. But I'll leave that to the new maintainers, indeed.

Finally, sorry to hear that you left Library of Congress. It was a pleasure for my coworkers and me to collaborate and interact with you via GitHub. Best wishes in your new position.

@jscancella
Copy link
Contributor Author

@rvanheest interesting, virtually valid makes me think you are building a bag and then checking it later, is this a correct assumption? If so my only comment would be that I tend to think of bagit as a method of transferring files correctly using (typically) a "sneakernet".

Thanks for the kind remarks. As for leaving the Library of Congress, it was most unfortunate but as you can see I am still active on bagit and hope to remain so (even if it is on my own time).

@rvanheest
Copy link

@jscancella, as I mentioned in another place to you, we use the concept of Bags, BagIt, etc. as one of our storage formats. Indeed we also use it for transferring datasets from our clients to us, but also use it for storage (after having done some cleanup and metadata enrichment). See our public documentation for a description of virtuallyValid.

In the implementation of virtuallyValid I could use bagit-java's implementation of checkHashes as 'stand-alone'. As you know however, this method is currently package private. I'll prepare a pull request for this later, such that you and the maintainers can comment on my proposed changes.

@jscancella
Copy link
Contributor Author

@rvanheest could you not do something like create a derivative bag that doesn't include the files listed in the fetch file in the manifest(s)? That way you can use all the tools normally and you don't need to define a "virtually" valid bag.

For example, something like this:

BagReader reader = new BagReader();
Bag bag = reader.read(<YOUR PATH HERE>);
for(FetchItem item : bag.getItemsToFetch()){
  for(Manifest manifest : bag.getPayLoadManifests()){
    manifest.getFileToChecksumMap().remove(PathUtils.getDataDir(bag).resolve(item.getPath()));
  }
}

//TODO update tag manifest(s)...
BagWriter.write(bag, <SOME PLACE YOU KEEP STORE BAGS>);

… least one manifest when option is enabled
@jscancella
Copy link
Contributor Author

@acdha just a side note, this PR includes your changes in #120

@smntb
Copy link

smntb commented Aug 20, 2018

@jscancella Would you please share the compiled jar with me that I can use in my project as I am also having the Thread pool issue for the larger bag.

@jscancella
Copy link
Contributor Author

@smntb I would wait until someone from the Library of Congress publishes an official jar. But if you really can't wait, you can clone my repo and use the instructions in the README to build it yourself

@jscancella
Copy link
Contributor Author

@acdha thoughts? It looks like there are people who are interested in this being merged in and deployed to the main java repositories (maven central and jcenter)

@acdha
Copy link
Member

acdha commented Aug 23, 2018

I don't have access to this repo. Perhaps @dbrunton knows who's picking up maintainership?

@smntb
Copy link

smntb commented Sep 4, 2018

Any update on this pull request?

@jscancella
Copy link
Contributor Author

bump

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants