
Zipper utility: duplicate file names #7276

Merged: 1 commit into develop from 80_local-zipper-duplicate-filenames on Sep 21, 2020

Conversation

@landreev (Contributor) commented Sep 21, 2020:

What this PR does / why we need it:
This PR fixes an issue in the zipper utility where it failed to zip multiple files with the same name. This was discovered when investigating a download issue with a specific dataset (IQSS/dataverse.harvard.edu#80 with an RT issue linked).

With this fix applied, duplicate filenames are handled the traditional way - each filename is made unique with an added numeric suffix (foo.bar, foo_1.bar, foo_2.bar, ...).
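For illustration, a minimal sketch of this kind of suffix-based deduplication (the class and method names are hypothetical, not the actual zipper code, and the real utility also has to account for folder paths):

```java
import java.util.HashMap;
import java.util.Map;

class FileNameDeduplicator {

    // How many times each name has been handed out so far.
    private final Map<String, Integer> seen = new HashMap<>();

    // Returns the name unchanged the first time it is seen; after that,
    // inserts _1, _2, ... before the extension (foo.bar -> foo_1.bar).
    // For simplicity this sketch ignores the corner case where a
    // suffixed name (e.g. foo_1.bar) collides with a real file name.
    String uniqueName(String name) {
        Integer count = seen.get(name);
        if (count == null) {
            seen.put(name, 0);
            return name;
        }
        seen.put(name, count + 1);
        int dot = name.lastIndexOf('.');
        return dot > 0
                ? name.substring(0, dot) + "_" + (count + 1) + name.substring(dot)
                : name + "_" + (count + 1);
    }
}
```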

TL;DR:

Without this fix, the failure behavior is not entirely straightforward. I said in the RT ticket last week that it would "stop as soon as the first duplicate is encountered", but that was not entirely accurate. When a duplicate is encountered, it is simply dropped. Meaning, if you try this with just a couple of duplicate filenames, the download is NOT going to fail in any obvious way - you'll get a zip file, with the extra dupes quietly dropped. Past a certain number of duplicates on the list, however, the download actually fails with an error. I haven't tried to figure out what that number is; but it was failing with the dataset referenced in the support issue above, where there are 69 duplicates total. It may have something to do with S3 storage specifically, and the fact that the failed attempts to add files to the zip stream were leaving a growing number of S3 connections open without being closed properly (just a guess). Again, I did not bother to find out for sure, since it is working now.

Which issue(s) this PR closes:

Closes IQSS/dataverse.harvard.edu#80

Special notes for your reviewer:

Suggestions on how to test this:

The new zipper jar is already installed on rserve.dataverse, so it can be dropped in on dataverse-internal if you want to test it there, instead of building it from source.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@coveralls commented:

Coverage remained the same at 19.476% when pulling 5c14006 on 80_local-zipper-duplicate-filenames into efcd24d on develop.

@pdurbin (Member) left a comment:

The code seems fine and we certainly need some kind of fix in production (Harvard Dataverse), where duplicate file names still occur.

What about installations that are nice and clean with no duplicate file names? What if they could use the zipper in a "strict" mode where this extra logic is disabled?

Or what if some day we get all the duplicate names out of production? Could we get rid of this extra logic then?

I guess I'm suggesting that this extra logic could be configurable - that it could be turned off if it isn't needed. Just a thought. I'm also fine with this pull request going to QA as-is.

@scolapasta (Contributor) commented:

> I guess I'm suggesting that this extra logic could be configurable - that it could be turned off if it isn't needed. Just a thought. I'm also fine with this pull request going to QA as-is.

@pdurbin, isn't it by definition turned off for clean installations, though? If they have no duplicate names, it won't do anything differently.

@pdurbin (Member) commented Sep 21, 2020:

You're right, the end result is the same.

@landreev (Contributor, Author) commented:

I don't think we ever want to turn this logic off; even on a system that is "clean" - where there are no duplicate filenames within the same dataset - /api/access/datafiles/ still accepts arbitrary file ids. I.e., a user can ask for files from different datasets to be zipped, and those files can legitimately have the same file path. (Yes, to be accurate, it's not a "duplicate file name" but a "duplicate file path" - the folder + name combination - that we are talking about.)
Also, to reiterate, the logic in this PR is not new - it just syncs the behavior of the standalone zipper utility with that of the internal zipping.

@pdurbin (Member) commented Sep 21, 2020:

Having READMEs from two different datasets in one zip (README, README-1) still kind of blows my mind but thanks for the reminder about this. I agree we always want this logic. I'll go ahead and move it to QA.

@scolapasta (Contributor) commented:

Out of scope for this issue, but for the case where you pass in files from different datasets, would it make sense for the zipper (zippers?) to check the datasets of the files and, if there are multiple, add an extra folder layer (or layers, if it wants to follow the doi/authority/id format we use in our storage)?

@kcondon self-assigned this on Sep 21, 2020
@landreev (Contributor, Author) commented:

> Out of scope for this issue, but for the case where you pass in files from different datasets, would it make sense for the zipper (zippers?) to check the datasets of the files and, if there are multiple, add an extra folder layer (or layers, if it wants to follow the doi/authority/id format we use in our storage)?

We could do something like that. Or, alternatively, we could say that /api/access/datafiles should not accept file ids from different datasets.
But yes, out of scope.
(For the record, either of the two changes above would be entirely in the application - the standalone zipper utility would not need to be modified.)
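As a purely hypothetical sketch of what that folder layering could look like on the application side (names and the example identifier are illustrative only, not an actual implementation):

```java
// Hypothetical: prefix each zip entry with the dataset's persistent
// identifier path, mirroring the doi/authority/identifier layout used
// in storage, so files from different datasets land in separate folders.
static String zipEntryName(String protocol, String authority,
                           String identifier, String filePath) {
    // e.g. zipEntryName("doi", "10.70122", "FK2ABCDE", "data/foo.bar")
    //      -> "doi/10.70122/FK2ABCDE/data/foo.bar"
    return protocol + "/" + authority + "/" + identifier + "/" + filePath;
}
```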

@landreev (Contributor, Author) commented:

I'm still assuming that this zipper thing is an experimental feature. But if we decide to stick with it, I'll want to invest in creating some restassured tests for it. It would just need some extra steps in the spinup scripts (because it lives outside the main application/application server); but then we could have an automated way to compare its behavior to that of the builtin zipper.
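For example, a rough sketch of what such a test might look like with REST Assured (the base URI and file ids are placeholders, assuming two files that share the same name, possibly in different datasets):

```java
import static io.restassured.RestAssured.given;

import java.io.ByteArrayInputStream;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.junit.Test;

public class ZipperIT {

    @Test
    public void duplicateFileNamesAreMadeUnique() throws Exception {
        // Placeholder ids for two files that share the same name.
        byte[] zip = given()
                .baseUri("http://localhost:8080")
                .when()
                .get("/api/access/datafiles/42,43")
                .then()
                .statusCode(200)
                .extract().asByteArray();

        // Every entry name in the resulting zip should be unique.
        Set<String> names = new HashSet<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip))) {
            for (ZipEntry e; (e = in.getNextEntry()) != null; ) {
                if (!names.add(e.getName())) {
                    throw new AssertionError("Duplicate zip entry: " + e.getName());
                }
            }
        }
    }
}
```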

@kcondon merged commit 2d9562a into develop on Sep 21, 2020
The IQSS/dataverse project automation moved this from QA to Done on Sep 21, 2020
@kcondon deleted the 80_local-zipper-duplicate-filenames branch on September 21, 2020
@djbrooke added this to the 5.1 milestone on Sep 21, 2020
Successfully merging this pull request may close these issues:

Zip download: Fails in prod with zip sizes above 1.4MB-2MB