Support SHA1 rather than MD5 as a checksum on a per file basis #3354
Comments
During the 2016-09-08 SBGrid Sprint Planning meeting ( https://docs.google.com/document/d/1wWSdKUOGA1L7UqFsgF3aOs8_9uyjnVpsPAxk7FObOOI/edit ) this issue was given an effort level of "3". Dataverse calculates and stores MD5 checksums, but https://data.sbgrid.org (the system that @bmckinney, @pameyer, and I especially are trying to migrate from) supports and needs to continue to support SHA1 checksums instead. Rather than adding a "sha1" column next to the "md5" column in the "datafile" table at http://phoenix.dataverse.org/schemaspy/latest/tables/datafile.html we'll probably make the system extensible so that we can easily add additional schemes such as "sha256" or whatever else in the future. Also, we should be able to calculate and store multiple schemes for a single file. That is, we should be able to store both the MD5 value and the SHA1 value for a given file.
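The "multiple schemes per single file" idea above can be sketched with the JDK's built-in `MessageDigest` class. This is just an illustration, not actual Dataverse code; the class and method names here are made up:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

public class MultiChecksum {

    // Compute several checksums for the same bytes, keyed by the
    // standard JDK MessageDigest algorithm name ("MD5", "SHA-1", "SHA-256", ...).
    public static Map<String, String> checksums(byte[] data, String... algorithms)
            throws NoSuchAlgorithmException {
        Map<String, String> result = new LinkedHashMap<>();
        for (String algorithm : algorithms) {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            result.put(algorithm, toHex(md.digest(data)));
        }
        return result;
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        // Both hex strings for the same file: MD5 is 32 hex chars, SHA-1 is 40.
        System.out.println(checksums(data, "MD5", "SHA-1"));
    }
}
```

Because `MessageDigest.getInstance` takes the scheme name as a string, adding "SHA-256" later would be a matter of extending a list rather than adding a database column per scheme.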
@djbrooke and I discussed this issue today and I mentioned that one of the issues we'll need to consider is whether the "cards" for files should show MD5, SHA-1, or both. Here's how https://dataverse.harvard.edu looks today (v4.5), showing MD5s, and here's how https://dv.sbgrid.org (demo code) looks, with SHA-1 values in the MD5 column (we know this is a bug, which is what this issue is about). @bmckinney @pameyer I assume you have no use for the MD5 values; you only want SHA-1 values. Perhaps there should be an installation-wide setting for which value to show (or both)? I'm a little concerned that showing both is overkill and clutters the card, but maybe @mheppler can weigh in on that. I guess we squeeze UNF values onto file cards too, if memory serves. @scolapasta has probably thought about a lot of this stuff as well. I'm open to suggestions. I must say that MD5 values and SHA-1 values sure do look similar to a layman like myself.
@djbrooke and I also discussed why one would choose to use SHA-1 rather than MD5 in the first place. From a quick look at http://crypto.stackexchange.com/questions/18612/how-is-sha1-different-from-md5 it seems that SHA-1 is less vulnerable (or perhaps not vulnerable at all?) to collision attacks: https://en.wikipedia.org/wiki/Collision_attack In practice, I don't know if using MD5s to verify file integrity is really a problem. I suspect it's secure enough, but I'm happy to be told I need to get with the times. Citations needed, though! 😄
In general, being able to import checksums using any algorithm via manifest files (like we now do for rsync) and export them (e.g., BagIt) is desirable. Re: display - what was the user story for this? Do people currently copy and paste MD5s and/or UNFs from the display and do something with them?
@scolapasta and I just discussed the design that @sekmiller and I worked on yesterday, and the plan now is that, on a per-file basis, only one checksum would be stored, but it could be either MD5 or SHA-1. @bmckinney this was before I saw what you wrote above. Let's sync up on this at our 2pm meeting today. I added it to the agenda: https://docs.google.com/document/d/1A_CtAOyrUTAmR13xUAtsswWrHxiEQEE50kboxaz5qmk/edit?usp=sharing
@bmckinney @scolapasta and I decided on an approach, which I'll make a pull request for. I'd like to get a list from @pameyer of checksum schemes he thinks should be included in a Java enum, but without this input it'll simply be a list of two schemes: MD5 and SHA-1. That is to say, I'm not blocked.
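For illustration, such an enum might look like the sketch below. The names here are hypothetical, not the actual Dataverse code. One wrinkle worth noting: a Java identifier can't contain a hyphen, so the constant has to be `SHA1` even though the display label (and the standard `MessageDigest` algorithm name) is "SHA-1", which is one source of the "SHA1" vs. "SHA-1" confusion:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of a checksum-scheme enum; the real names may differ.
public enum ChecksumType {
    MD5("MD5"),
    SHA1("SHA-1"); // constant can't contain a hyphen, label can

    private final String label;

    ChecksumType(String label) {
        this.label = label;
    }

    @Override
    public String toString() {
        // The label doubles as the standard MessageDigest algorithm name.
        return label;
    }

    public static ChecksumType fromString(String text) {
        for (ChecksumType type : values()) {
            if (type.label.equals(text)) {
                return type;
            }
        }
        throw new IllegalArgumentException("Unknown checksum type: " + text);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // The label round-trips through MessageDigest without translation.
        MessageDigest md = MessageDigest.getInstance(SHA1.toString());
        System.out.println(SHA1 + " -> " + md.getAlgorithm());
    }
}
```

Adding "SHA-256" later would then be a one-line change to the enum.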
A pull request will eventually be based on this scratch work, which probably still has a number of bugs in it. For now this branch will be pushed to pdurbin's fork for safekeeping. Still need to think about:

- "View differences" on dataset page
- Excising hard-coded "MD5" strings in Bundle.properties
- Calculate something other than MD5 when files are uploaded? Is this a setting?
- Document changes to Search and Datasets API (new "checksum" field)
- Tests, tests, tests
- All the stuff I'm forgetting or am not aware of...
I just pushed some scratch work to my fork at pdurbin@babb225 so that @bmckinney and @scolapasta can make sure that the database schema changes reflect the design we agreed upon yesterday, and so that they can get a sense of the size and scope of the code change. I've changed 22 files so far and I'm not done yet. There are still decisions to make, which I captured in the commit message. More to come.
@djbrooke @bmckinney @pameyer should we change the title of this issue from "Store multiple checksum schemes" to "Store an alternative checksum scheme"? I feel this better captures what we agreed to and what I'm working on per my last comment. Basically, on a per-file level, the scheme that is stored is either MD5 (the default) or some other scheme such as SHA1.
- A database setting has been added.
- Resolved and documented use of "SHA1" vs. "SHA-1".
- Other bug fixes.
Also, explain further SHA-1 vs SHA1.
A dependency for rsync support (#3145) is the ability to persist SHA-1 checksums for files rather than MD5 checksums. A new installation-wide configuration setting called ":FileFixityChecksumAlgorithm" has been added, which can be set to "SHA-1" to have Dataverse calculate and show SHA-1 checksums rather than MD5 checksums.

In order to run this branch you must run the provided SQL upgrade script: scripts/database/upgrades/3354-alt-checksum.sql. In addition, the Solr schema should be updated to the version in this branch.
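As a rough sketch of how such a setting could drive the calculation at upload time (the settings lookup below is a stand-in, not the real Dataverse settings service, and the class name is made up):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FixityExample {

    // Stand-in for the settings service; in Dataverse the value of
    // ":FileFixityChecksumAlgorithm" would come from the database.
    static String getSetting(String name, String defaultValue) {
        return defaultValue;
    }

    // Stream the file through a digest so large files are never held in memory.
    static String checksum(InputStream in) throws IOException, NoSuchAlgorithmException {
        String algorithm = getSetting(":FileFixityChecksumAlgorithm", "MD5");
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (DigestInputStream dis = new DigestInputStream(in, md)) {
            byte[] buffer = new byte[8192];
            while (dis.read(buffer) != -1) {
                // reading the stream updates the digest as a side effect
            }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(checksum(in));
    }
}
```

Since "MD5" and "SHA-1" are both standard `MessageDigest` algorithm names, switching the setting is enough to change what gets calculated; no other code needs to know which scheme is in use.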
I just made pull request #3367 and moved this issue to Code Review at https://waffle.io/IQSS/dataverse @mheppler you are welcome to switch to this branch and give feedback, but I believe that no UI changes will be necessary beyond what I changed. Generally speaking, the UI simply shows SHA-1 values instead of MD5 values when you configure your Dataverse installation to use SHA-1 rather than MD5. See the pull request for details on how to get the branch running. In the PR I'll leave a note for @bmckinney about the sort of feedback I'm looking for.
I just moved this issue to QA in https://waffle.io/IQSS/dataverse since @bmckinney did the Code Review of the SQL script I asked for. It's ready for QA. The code is in pull request #3367 so the "3354-alt-checksum" branch is the one to build from. Please note that to run the code you must:
Please see https://github.com/IQSS/dataverse/blob/b703e27ec2d785adff459d3ff806e10344a2ef06/doc/sphinx-guides/source/installation/config.rst#filefixitychecksumalgorithm for documentation on the new configuration setting I added.
Please note that while it's possible to change the system-wide setting from the default of "MD5" to "SHA-1", we only expect installations of Dataverse to do this at installation time. It's like how we tell people to pick a persistent ID provider at installation time: we don't expect them to change it later. For testing and QA purposes it shouldn't be necessary to stand up a fresh installation of Dataverse. It should be fine to have a mix of some files with MD5 values and some files with SHA-1 values.
Works, closing. By the way, if a file has an MD5, then both the newer "checksum" field and the older "md5" field will appear in the JSON; if SHA-1, only the "checksum" field. The two API calls I used to test were: curl -X GET http://dataverse-internal.iq.harvard.edu:8080/api/search?key=xxx-yyyy-zzzz\&q=*\&subtree=kc93016b\&type=file|jq .
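Illustratively, the difference described above might look roughly like the following. The exact field layout and file names are assumptions for illustration, not copied from the API output:

```json
{
  "name": "file-with-md5.txt",
  "md5": "5d41402abc4b2a76b9719d911017c592",
  "checksum": { "type": "MD5", "value": "5d41402abc4b2a76b9719d911017c592" }
}
```

whereas a file with a SHA-1 checksum would carry only the newer field:

```json
{
  "name": "file-with-sha1.txt",
  "checksum": { "type": "SHA-1", "value": "aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d" }
}
```

Keeping the legacy "md5" field for MD5 files avoids breaking existing API clients while the generic "checksum" field supports either scheme.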