File Upload - allow files with same MD5 (or other checksum) in a dataset #4813
Comments
Is the issue here a hash collision or something else? Otherwise, wouldn't the different polygons have different hashes? |
I'm not sure. The files in question for that support ticket are .prj files, defined here as:
So I think the content is the same, but they're meant for different polygons. (It looks like the journal allowed the author to upload the second file in a zip file.) |
There's some related discussion in the "Add a checkbox to disable unzipping" issue at #3439 (comment) and below. Also, as we think about pulling in files from GitHub (#2739 and #5372), we should consider that identical files are somewhat common. There could be identical |
Here's another example of the need to have two files with same content, but different filenames (3D files at UVA):
|
@shlake great real world, non-code example. Thanks! |
May make sense to discuss at the same time as #6574. |
Technically, removing this check should be straightforward. A couple of questions: from the above, it seems like removing it completely might fill all use cases, but then we do lose something. In its place, should there be a warning in the UI? ("Note: this file has the same MD5 as another file in this dataset") For the API, we would either not warn or need to implement functionality similar to what we do for move dataset, where we return the warnings and require an extra parameter of "force=true". This seems more problematic with file upload, though, since we wouldn't want the user to have to re-upload. Another alternative for either UI or API could be to have this warning appear on publish. |
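To make the "return warnings and require force=true" idea concrete, here is a minimal sketch in Python. It is not Dataverse code: `add_file`, the `dataset` dict (filename to MD5), and the response fields are all hypothetical names chosen for illustration.

```python
import hashlib


def add_file(dataset: dict, name: str, content: bytes, force: bool = False) -> dict:
    """Hypothetical upload handler: warn on a checksum match unless force=True.

    `dataset` maps filename -> MD5 hex digest (an assumption of this sketch).
    """
    checksum = hashlib.md5(content).hexdigest()
    # Names of files already in the dataset with identical content.
    matches = sorted(n for n, c in dataset.items() if c == checksum)
    warnings = []
    if matches:
        warnings.append(f"{name} has the same MD5 as: {', '.join(matches)}")
    if matches and not force:
        # Nothing is stored; the caller must repeat the request with force=True.
        return {"status": "WARNING", "added": False, "warnings": warnings}
    dataset[name] = checksum
    return {"status": "OK", "added": True, "warnings": warnings}
```

Note that the rejected call still had to receive the full file in order to compute the checksum, which is exactly the re-upload concern raised above. A warning at publish time would avoid that second transfer.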
In UVa's case above, the "duplicate" is in the same directory. So I vote "no" to only checking whether the file is in a different folder. I like a "warning" message, versus a "stop - you can't do that" (where the file does not get uploaded). |
Definitely UI warning msg confirmation popup at time of upload. Similar to how we warn users in file replace workflow if the new file is a different type than the original, where we ask the user if they want to continue or not. |
Thanks @mheppler for offering to include a mockup here so that we can bring it into a sprint soon. |
For the duplicate file, do we know that the file is a duplicate, or might it be a file with the same name? |
If it's checking MD5 or other checksums (SHA, etc.), it is the same file, contents and all. |
@mheppler but a file with same content (same MD5) could have a different filename. So would there need to be a different popup for that? I see two types of duplicates: one with same filename & same content AND one with different filename & same content. |
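The two duplicate types described here could be told apart with something like the following Python sketch. The names (`DuplicateKind`, `classify`) and the filename-to-MD5 mapping are assumptions for illustration, not anything in Dataverse.

```python
import hashlib
from enum import Enum


class DuplicateKind(Enum):
    NONE = "no file with the same content"
    SAME_NAME_SAME_CONTENT = "same filename and same content"
    DIFFERENT_NAME_SAME_CONTENT = "different filename, same content"


def classify(existing: dict, name: str, content: bytes) -> DuplicateKind:
    """Classify an incoming file against `existing` (filename -> MD5 hex digest)."""
    checksum = hashlib.md5(content).hexdigest()
    matching = [n for n, c in existing.items() if c == checksum]
    if not matching:
        return DuplicateKind.NONE
    if name in matching:
        return DuplicateKind.SAME_NAME_SAME_CONTENT
    return DuplicateKind.DIFFERENT_NAME_SAME_CONTENT
```

A UI could then pick the popup text from the returned value, so one dialog covers both cases.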
If the user does not want to keep the file, at the time the popup is generated, is the system deleting the file, or canceling the upload/ingest? |
For documentation and QA: |
If someone uploads a file in a dataset that Dataverse notices already has a file with the same content (both files have the same MD5), Dataverse shows an error and doesn't allow the "duplicate" file to be uploaded.
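For QA purposes, the check described above amounts to roughly the following. This is a simplified Python sketch of the behavior, not the actual Dataverse implementation; `find_duplicates` and the in-memory `existing` mapping are hypothetical.

```python
import hashlib


def md5_hex(content: bytes) -> str:
    """Return the MD5 checksum of `content` as a hex string."""
    return hashlib.md5(content).hexdigest()


def find_duplicates(existing: dict, new_content: bytes) -> list:
    """Return names of files in `existing` (filename -> bytes) whose MD5 matches."""
    new_sum = md5_hex(new_content)
    return [name for name, content in existing.items() if md5_hex(content) == new_sum]


# Two .prj files for different polygons that share one projection have
# byte-identical content, so the second upload is flagged as a "duplicate".
existing = {"polygon_a.prj": b'GEOGCS["GCS_WGS_1984"]'}
duplicates = find_duplicates(existing, b'GEOGCS["GCS_WGS_1984"]')
```

The filename never enters the comparison, which is why files meant for different polygons are rejected.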
Issues with this feature have been discussed in another GitHub issue (#2955, closed when File Replace was released in Dataverse 4.6.1), in Dataverse's Google Group here and here, and in a recent Dataverse support ticket, where a depositor wrote that "for uploading shape files for two different polygons but the same projection, it might be nice to be able to upload both at the same time." For this researcher, a common workaround, uploading the file in different double-zipped archive files (7-Zip, tar file, etc.), won't work because the journal policy doesn't allow depositors to upload archived files.