
Support SHA1 rather than MD5 as a checksum on a per file basis #3354

Closed
djbrooke opened this issue Sep 13, 2016 · 11 comments

Comments

@djbrooke
Contributor

djbrooke commented Sep 13, 2016

In support of rsync.

@pdurbin
Member

pdurbin commented Sep 13, 2016

During the 2016-09-08 SBGrid Sprint Planning meeting ( https://docs.google.com/document/d/1wWSdKUOGA1L7UqFsgF3aOs8_9uyjnVpsPAxk7FObOOI/edit ) this issue was given an effort level of "3".

Dataverse calculates and stores MD5 checksums, but https://data.sbgrid.org (the system that @bmckinney, @pameyer, and I especially are trying to migrate from) supports SHA1 checksums instead and needs to continue to support them.

Rather than adding a "sha1" column next to the "md5" column in the "datafile" table at http://phoenix.dataverse.org/schemaspy/latest/tables/datafile.html we'll probably make the system extensible so that we can easily add additional schemes such as "sha256" or whatever else in the future. Also, we should be able to calculate and store multiple schemes for a single file. That is, we should be able to store both the MD5 value and the SHA1 value for a given file.

@pdurbin
Member

pdurbin commented Sep 14, 2016

@djbrooke and I discussed this issue today and I mentioned that one of the issues we'll need to consider is whether the "cards" for files should show MD5, SHA-1, or both.

Here's how https://dataverse.harvard.edu looks today (v4.5), showing MD5s:

[screenshot: file cards on dataverse.harvard.edu (v4.5) showing MD5 values]

Here's how https://dv.sbgrid.org (demo code) looks, with SHA-1 values in the MD5 column (we know this is a bug, which is what this issue is about):

[screenshot: file cards on dv.sbgrid.org (demo code) showing SHA-1 values in the MD5 column]

@bmckinney @pameyer I assume you have no use for the MD5 values. You only want SHA-1 values. Perhaps there should be an installation-wide setting for which value to show (or both)? I'm a little concerned that showing both is overkill and clutters the card but maybe @mheppler can weigh in on that. I guess we squeeze UNF values on file cards too, if memory serves. @scolapasta has probably thought about a lot of this stuff as well. I'm open to suggestions.

I must say that MD5 values and SHA-1 values sure do look similar to a layman like myself:

murphy:Downloads pdurbin$ md5 frame000000002 
MD5 (frame000000002) = c89089b48d513040f8221bc17ee7543b
murphy:Downloads pdurbin$ shasum frame000000002 
046f4c214c3c2bfa73dff3c388a3450eaee2851c  frame000000002
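
For anyone following along, both of those values can be produced with the standard java.security.MessageDigest API. Here is a minimal sketch in plain Java (not the actual Dataverse code; the file name is just the example above) showing that the only obvious surface difference is the length of the hex string (32 characters for MD5, 40 for SHA-1):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChecksumDemo {

    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        String path = "frame000000002"; // example file from the comment above
        System.out.println("MD5   = " + hexDigest(path, "MD5"));    // 32 hex characters
        System.out.println("SHA-1 = " + hexDigest(path, "SHA-1"));  // 40 hex characters
    }

    // Stream the file through the requested digest algorithm and return a lowercase hex string.
    static String hexDigest(String path, String algorithm) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}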

@pdurbin
Member

pdurbin commented Sep 14, 2016

@djbrooke and I also discussed why one would choose to use SHA-1 rather than MD5 in the first place. From a quick look at http://crypto.stackexchange.com/questions/18612/how-is-sha1-different-from-md5 it seems like SHA-1 is less vulnerable (or perhaps not vulnerable at all?) to collision attacks: https://en.wikipedia.org/wiki/Collision_attack

In practice, I don't know if using MD5s to verify file integrity is really a problem. I suspect it's secure enough but I'm happy to be told I need to get with the times. Citations needed though! 😄

@bmckinney
Contributor

In general, being able to import checksums using any algorithm via manifest files (like we now do for rsync) and export them (e.g., bagit) is desirable.

Re: display - what was the user story for this? Do people currently copy and paste MD5s and/or UNFs from the display and do something with them?

@pdurbin
Member

pdurbin commented Sep 15, 2016

@scolapasta and I just discussed the design that @sekmiller and I worked on yesterday, and the plan now is that, on a per-file basis, only one checksum would be stored, but it could be either MD5 or SHA-1. @bmckinney this was before I saw what you wrote above. Let's sync up on this at our 2pm meeting today. I added it to the agenda: https://docs.google.com/document/d/1A_CtAOyrUTAmR13xUAtsswWrHxiEQEE50kboxaz5qmk/edit?usp=sharing

@pdurbin
Member

pdurbin commented Sep 15, 2016

@bmckinney @scolapasta and I decided on an approach, which I'll make a pull request for. I'd like to get a list from @pameyer of checksum schemes he thinks should be included in a Java enum but without this input it'll simply be a list of two schemes: MD5 and SHA-1. That is to say I'm not blocked.
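
To make that concrete, here is a rough sketch of what such an enum could look like (the names are illustrative, not necessarily what will land in the pull request):

/**
 * Sketch of a checksum-type enum: one constant per supported algorithm, carrying
 * the label used for display and for java.security.MessageDigest lookups.
 * Only MD5 and SHA-1 for now; adding SHA-256 later would be one more constant.
 */
public enum ChecksumType {

    MD5("MD5"),
    SHA1("SHA-1");

    private final String label;

    ChecksumType(String label) {
        this.label = label;
    }

    /** The human-readable name, which is also a valid MessageDigest algorithm name. */
    @Override
    public String toString() {
        return label;
    }

    /** Look up a type from a setting value such as "MD5" or "SHA-1". */
    public static ChecksumType fromString(String text) {
        for (ChecksumType type : values()) {
            if (type.label.equals(text)) {
                return type;
            }
        }
        throw new IllegalArgumentException("Unknown checksum type: " + text);
    }
}

Keeping the MessageDigest algorithm name inside the enum would make adding another scheme a one-line change.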

pdurbin added a commit to pdurbin/dataverse that referenced this issue Sep 16, 2016
A pull request will eventually be based on this scratch work which
probably still has a number of bugs in it. For now this branch will be
pushed to pdurbin's fork for safe keeping.

Still need to think about:

- "View differences" on dataset page
- Exercising hard-coded "MD5" strings in Bundle.properties
- Calculate something other than MD5 when files are uploaded? Is this a
  setting?
- Document changes to Search and Datasets API (new "checksum" field)
- Tests, tests, tests
- All the stuff I'm forgetting or am not aware of...
@pdurbin
Member

pdurbin commented Sep 16, 2016

I just pushed some scratch work to my fork at pdurbin@babb225 so that @bmckinney and @scolapasta can make sure that the database schema changes reflect the design we agreed upon yesterday and so that they can get a sense of the size and scope of the code change. I've changed 22 files so far and I'm not done yet. There are still decisions to make, which I captured in the commit message. More to come.

@pdurbin
Member

pdurbin commented Sep 19, 2016

@djbrooke @bmckinney @pameyer should we change the title of this issue from "Store multiple checksum schemes" to "Store an alternative checksum scheme"? I feel like this better captures what we agreed to and what I'm working on per my last comment. Basically, on a per file level, the scheme that is stored is either MD5 (the default) or some other scheme such as SHA1.

@pdurbin pdurbin changed the title Store multiple checksum schemes Store an alternative (non-MD5) checksum scheme on a per file basis such as SHA1 Sep 19, 2016
@pdurbin pdurbin changed the title Store an alternative (non-MD5) checksum scheme on a per file basis such as SHA1 Support SHA1 rather than MD5 as a checksum on a per file basis Sep 19, 2016
pdurbin added a commit to pdurbin/dataverse that referenced this issue Sep 19, 2016
pdurbin added a commit to pdurbin/dataverse that referenced this issue Sep 19, 2016
pdurbin added a commit to pdurbin/dataverse that referenced this issue Sep 20, 2016
- A database setting has been added.
- Resolved and documented use of "SHA1" vs "SHA-1".
- Other bug fixes.
pdurbin added a commit to pdurbin/dataverse that referenced this issue Sep 20, 2016
Also, explain further SHA-1 vs SHA1.
pdurbin added a commit to pdurbin/dataverse that referenced this issue Sep 20, 2016
pdurbin added a commit to pdurbin/dataverse that referenced this issue Sep 20, 2016
pdurbin added a commit to pdurbin/dataverse that referenced this issue Sep 20, 2016
pdurbin added a commit that referenced this issue Sep 20, 2016
A dependency for rsync support (#3145) is the ability to persist SHA-1
checksums for files rather than MD5 checksums.

A new installation-wide configuration setting called
":FileFixityChecksumAlgorithm" has been added which can be set to
"SHA-1" to have Dataverse calculate and show SHA-1 checksums rather than
MD5 checksums.

In order to run this branch you must run the provided SQL upgrade
script: scripts/database/upgrades/3354-alt-checksum.sql

In addition, the Solr schema should be updated to the version in this
branch.
@pdurbin
Member

pdurbin commented Sep 20, 2016

I just made pull request #3367 and moved this issue to Code Review at https://waffle.io/IQSS/dataverse

@mheppler you are welcome to switch to this branch and give feedback, but I believe that no UI changes will be necessary beyond what I changed. Generally speaking, the existing displays simply show SHA-1 values instead of MD5 values once you configure your Dataverse installation to use SHA-1 rather than MD5. See the pull request for details on how to get the branch running.

In the PR I'll leave a note for @bmckinney for the sort of feedback I'm looking for.

@djbrooke djbrooke added this to the 4.6 - File Replace milestone Sep 20, 2016
@pdurbin
Member

pdurbin commented Sep 20, 2016

I just moved this issue to QA in https://waffle.io/IQSS/dataverse since @bmckinney did the Code Review of the SQL script I asked for. It's ready for QA.

The code is in pull request #3367 so the "3354-alt-checksum" branch is the one to build from. Please note that to run the code you must run the provided SQL upgrade script (scripts/database/upgrades/3354-alt-checksum.sql) and update the Solr schema to the version in the branch.

Please see https://github.com/IQSS/dataverse/blob/b703e27ec2d785adff459d3ff806e10344a2ef06/doc/sphinx-guides/source/installation/config.rst#filefixitychecksumalgorithm for documentation on the new configuration setting I added (:FileFixityChecksumAlgorithm). Once the code is deployed, in order to test the change, you'll want to change this setting to "SHA-1" and start uploading files. You should observe the following:

  • GUI shows SHA-1 values for newly uploaded files (SHA-1 values are not calculated for previously uploaded files). Please be sure to check MyData, "diffs" between dataset versions when files are involved, search "cards", and anywhere else you can think of where MD5 used to be hard-coded.
  • The Search and Native APIs used to return "md5" in the JSON output. I kept this in place for existing files for backward compatibility, but newly uploaded files will not show "md5" in the JSON output; instead there is a new key called "checksum" with "type" and "value" inside (see the sample below).
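
To illustrate, here is a sketch of the relevant part of the JSON for a newly uploaded file under the SHA-1 setting (the hash is just the SHA-1 from an earlier comment, and the surrounding fields are illustrative rather than the exact output):

{
  "label": "frame000000002",
  "checksum": {
    "type": "SHA-1",
    "value": "046f4c214c3c2bfa73dff3c388a3450eaee2851c"
  }
}

Existing files keep the old "md5" key alongside the new "checksum" object for backward compatibility.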

Please note that while it's possible to change the system-wide setting from the default of "MD5" to "SHA-1" we only expect installations of Dataverse to do this at installation time. It's like how we tell people to pick a persistent ID provider at installation time. We don't expect them to change it later. For testing and QA purposes it shouldn't be necessary to stand up a fresh installation of Dataverse. It should be fine to have a mix of some files with MD5 values and some files with SHA-1 values.
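
For reference, the database setting can be changed through the admin settings API; assuming a stock installation listening on localhost:8080, something like:

curl -X PUT -d SHA-1 http://localhost:8080/api/admin/settings/:FileFixityChecksumAlgorithm

Deleting the setting (curl -X DELETE on the same URL) should put the installation back on the MD5 default.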

@kcondon
Contributor

kcondon commented Oct 3, 2016

Works, closing.

Btw, if a file has an MD5, then both the newer "checksum" field and the older "md5" field will appear in the JSON; if it has a SHA-1, only the "checksum" field appears.

The 2 api calls I used to test were:

curl -X GET http://dataverse-internal.iq.harvard.edu:8080/api/search?key=xxx-yyyy-zzzz\&q=*\&subtree=kc93016b\&type=file|jq .

curl -X GET http://dataverse-internal.iq.harvard.edu:8080/api/datasets/:persistentId/?persistentId=doi:10.5072/FK2/JMUUK0
