
Zipper utility: duplicate file names #7276

Merged: 1 commit into develop from 80_local-zipper-duplicate-filenames on Sep 21, 2020

Conversation

@landreev (Contributor) commented Sep 21, 2020:

What this PR does / why we need it:
This PR fixes an issue in the zipper utility where it failed to zip multiple files with the same name. This was discovered when investigating a download issue with a specific dataset (IQSS/dataverse.harvard.edu#80 with an RT issue linked).

With this fix applied, duplicate filenames are handled the traditional way - each filename is made unique with an added numeric suffix (foo.bar, foo_1.bar, foo_2.bar, ...).
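For illustration, a minimal sketch of this kind of suffix-based deduplication (the class and method names are hypothetical, not the actual zipper code, and the real utility also has to account for folder paths):

```java
import java.util.HashMap;
import java.util.Map;

class FileNameDeduplicator {

    // How many times each name has been handed out so far.
    private final Map<String, Integer> seen = new HashMap<>();

    // Returns the name unchanged the first time it is seen; after that,
    // inserts _1, _2, ... before the extension (foo.bar -> foo_1.bar).
    // For simplicity this sketch ignores the corner case where a
    // suffixed name (e.g. foo_1.bar) collides with a real file name.
    String uniqueName(String name) {
        Integer count = seen.get(name);
        if (count == null) {
            seen.put(name, 0);
            return name;
        }
        seen.put(name, count + 1);
        int dot = name.lastIndexOf('.');
        return dot > 0
                ? name.substring(0, dot) + "_" + (count + 1) + name.substring(dot)
                : name + "_" + (count + 1);
    }
}
```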

TL;DR:

Without this fix, the failure behavior is not entirely straightforward. I said in the RT ticket last week that it would "stop as soon as the first duplicate is encountered", but that was not entirely accurate. When a duplicate is encountered, it is simply dropped. Meaning, if you try this with just a couple of duplicate filenames, the download is NOT going to fail in any obvious way - you'll get a zip file, with the extra dupes quietly dropped. Past a certain number of duplicates on the list, however, the download actually fails with an error. I haven't tried to figure out what that number is; but it was failing with the dataset referenced in the support issue above, where there are 69 duplicates total. It may have something to do with S3 storage specifically, and the fact that the failed attempts to add files to the zip stream were leaving a growing number of S3 connections open without being closed properly (just a guess). Again, I did not bother to find out for sure, since it is working now.

Which issue(s) this PR closes:

Closes IQSS/dataverse.harvard.edu#80

Special notes for your reviewer:

Suggestions on how to test this:

The new zipper jar is already installed on rserve.dataverse, so it can be dropped in on dataverse-internal if you want to test it there, instead of building it from source.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@coveralls commented:

Coverage remained the same at 19.476% when pulling 5c14006 on 80_local-zipper-duplicate-filenames into efcd24d on develop.

@pdurbin (Member) left a comment:

The code seems fine and we certainly need some kind of fix in production (Harvard Dataverse), where duplicate file names still occur.

What about installations that are nice and clean with no duplicate file names? What if they could use the zipper in a "strict" mode where this extra logic is disabled?

Or what if some day we get all the duplicate names out of production? Could we get rid of this extra logic then?

I guess I'm suggesting that this extra logic could be configurable - that it could be turned off if it isn't needed. Just a thought. I'm also fine with this pull request going to QA as-is.

@scolapasta (Contributor) commented:

> I guess I'm suggesting that this extra logic could be configurable - that it could be turned off if it isn't needed. Just a thought. I'm also fine with this pull request going to QA as-is.

@pdurbin, isn't it by definition turned off for clean installations, though? If they have no duplicate names, it won't do anything differently.

@pdurbin (Member) commented Sep 21, 2020:

You're right, the end result is the same.

@landreev (Contributor, Author) commented:

I don't think we ever want to turn this logic off; even on a system that is "clean" - where there are no duplicate filenames within the same dataset - /api/access/datafiles/ still accepts arbitrary file ids. I.e., a user can ask for files from different datasets to be zipped, and those files can legitimately have the same file path. (Yes, to be accurate, it's not a "duplicate file name" but a "duplicate file path" - the folder + name combination - that we are talking about.)
Also, to reiterate, the logic in this PR is not new - it just syncs the behavior of the standalone zipper utility with that of the internal zipping.

@pdurbin (Member) commented Sep 21, 2020:

Having READMEs from two different datasets in one zip (README, README-1) still kind of blows my mind but thanks for the reminder about this. I agree we always want this logic. I'll go ahead and move it to QA.

@scolapasta (Contributor) commented:

Out of scope for this issue, but for the case where you pass in files from different datasets, would it make sense for the zipper (zippers?) to check the datasets of the files and, if there are multiple, add an extra folder layer (or layers, if it wants to follow the doi/authority/id format we use in our storage)?

@kcondon self-assigned this on Sep 21, 2020
@landreev (Contributor, Author) commented:

> Out of scope for this issue, but for the case where you pass in files from different datasets, would it make sense for the zipper (zippers?) to check the datasets of the files and, if there are multiple, add an extra folder layer (or layers, if it wants to follow the doi/authority/id format we use in our storage)?

We could do something like that. Or, alternatively, we could say that /api/access/datafiles should not accept file ids from different datasets.
But yes, out of scope.
(For the record, either of the two changes above would be entirely in the application - the standalone zipper utility would not need to be modified.)
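As a purely hypothetical sketch of what that folder layering could look like on the application side (names and the example identifier are illustrative only, not an actual implementation):

```java
// Hypothetical: prefix each zip entry with the dataset's persistent
// identifier path, mirroring the doi/authority/identifier layout used
// in storage, so files from different datasets land in separate folders.
static String zipEntryName(String protocol, String authority,
                           String identifier, String filePath) {
    // e.g. zipEntryName("doi", "10.70122", "FK2ABCDE", "data/foo.bar")
    //      -> "doi/10.70122/FK2ABCDE/data/foo.bar"
    return protocol + "/" + authority + "/" + identifier + "/" + filePath;
}
```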

@landreev (Contributor, Author) commented:

I'm still assuming that this zipper thing is an experimental feature. But if we decide to stick with it, I'll want to invest in creating some restassured tests for it. It would just need some extra steps in the spinup scripts (because it lives outside the main application/application server); but then we could have an automated way to compare its behavior to that of the builtin zipper.
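For example, a rough sketch of what such a test might look like with REST Assured (the base URI and file ids are placeholders, assuming two files that share the same name, possibly in different datasets):

```java
import static io.restassured.RestAssured.given;

import java.io.ByteArrayInputStream;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.junit.Test;

public class ZipperIT {

    @Test
    public void duplicateFileNamesAreMadeUnique() throws Exception {
        // Placeholder ids for two files that share the same name.
        byte[] zip = given()
                .baseUri("http://localhost:8080")
                .when()
                .get("/api/access/datafiles/42,43")
                .then()
                .statusCode(200)
                .extract().asByteArray();

        // Every entry name in the resulting zip should be unique.
        Set<String> names = new HashSet<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip))) {
            for (ZipEntry e; (e = in.getNextEntry()) != null; ) {
                if (!names.add(e.getName())) {
                    throw new AssertionError("Duplicate zip entry: " + e.getName());
                }
            }
        }
    }
}
```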

@kcondon merged commit 2d9562a into develop on Sep 21, 2020
The IQSS/dataverse project automation moved this from QA to Done on Sep 21, 2020
@kcondon deleted the 80_local-zipper-duplicate-filenames branch on September 21, 2020
@djbrooke added this to the 5.1 milestone on Sep 21, 2020
Successfully merging this pull request may close these issues:

Zip download: Fails in prod with zip sizes above 1.4MB-2MB