
Configuring file stores (storage drivers) for individual datasets #7272

Merged
merged 10 commits into from
Sep 16, 2020

Conversation

landreev
Contributor

What this PR does / why we need it:
Even though the title of the issue specifically mentions "direct S3 upload", this PR adds something more general: the ability to designate a file store for a specific dataset. The store in question does not have to be S3, but enabling direct S3 upload is the main use case behind this PR. For example, to enable direct upload for a specific dataset in production, without opening it up for everybody, we would:

  1. Create another S3-type store pointing to the same storage bucket (for example, "s3direct").
  2. Enable direct upload on that store.
  3. Configure "s3direct" as the designated storage driver for the dataset in question, using the API added in this PR:
    curl -H "X-Dataverse-key: XXX" -X PUT -d s3direct http://localhost:8080/api/datasets/NNNN/storageDriver

The words "API based" in the issue title refer to the fact that a file store can be configured for a dataset via the API only; i.e., there is no GUI for it (as there is for dataverses). But once a direct-upload-enabled store has been assigned to a dataset, uploads work both via the dataset page and via the API (using the DVUploader utility), just as when direct upload is enabled dataverse-wide.
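The three steps above can be sketched end to end as follows. This is a sketch only: the `dataverse.files.<id>.*` JVM-option names follow the pattern documented in the Dataverse installation guide, and the store id "s3direct", the bucket name, the API token, and the dataset id NNNN are placeholders, not values from this PR.

```shell
# Sketch, assuming the documented dataverse.files.<id>.* option pattern;
# "my-bucket", "XXX", and "NNNN" are placeholders.

# 1. Define a second S3-type store pointing at the same bucket:
./asadmin create-jvm-options '-Ddataverse.files.s3direct.type=s3'
./asadmin create-jvm-options '-Ddataverse.files.s3direct.label=s3direct'
./asadmin create-jvm-options '-Ddataverse.files.s3direct.bucket-name=my-bucket'

# 2. Enable direct upload on that store only:
./asadmin create-jvm-options '-Ddataverse.files.s3direct.upload-redirect=true'

# 3. Designate it as the storage driver for one dataset (the API added here):
curl -H "X-Dataverse-key: XXX" -X PUT -d s3direct \
  http://localhost:8080/api/datasets/NNNN/storageDriver

# Verify the assignment:
curl -H "X-Dataverse-key: XXX" \
  http://localhost:8080/api/datasets/NNNN/storageDriver
```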

Which issue(s) this PR closes:

Closes #6872

Special notes for your reviewer:

Suggestions on how to test this:

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@coveralls

coveralls commented Sep 16, 2020

Coverage Status

Coverage decreased (-0.007%) to 19.476% when pulling 49ebafc on 6872-direct-upload-for-datasets into dbf0bca on develop.

Member

@qqmyers qqmyers left a comment


Looks good to me. The only (minor) issue I see is that there's no UI indicator that a Dataset isn't inheriting the storageDriver from its Dataverse, so it may not be obvious when a Dataset has been switched. That can be handled in a future issue, though.

IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from Code Review 🦁 to QA 🔎✅ Sep 16, 2020
@kcondon kcondon self-assigned this Sep 16, 2020
@landreev
Contributor Author

@qqmyers

Looks good to me. The only (minor) issue I see is that there's no UI indicator that a Dataset isn't inheriting the storageDriver from its Dataverse, so it may not be obvious when a Dataset has been switched. That can be handled in a future issue, though.

We don't otherwise have a UI indicator showing which store the files will go to when it's inherited from the dataverse... do we?
Thanks for bringing this up; it made me check and realize that there were a whole bunch of places where the upload size limit inherited from the dataverse was being used instead of the dataset's own, etc...

(Commit: "… checked... all needed to be changed not to assume that it's inherited from the parent dataverse." #6872)
@qqmyers
Member

qqmyers commented Sep 16, 2020

Good catch! I was just thinking that an admin can check (or change) the Dataverse setting under General Information, but that value may not be what the dataset has.

@kcondon
Contributor

kcondon commented Sep 16, 2020

(I meant to reply to this comment, not to edit it earlier; on my phone, sorry)

  1. Getting the current driver fails silently with a bad id: {} is returned, with a 500 error in the server log. It does report the bad id if the trailing / is dropped.
    This works now.
  2. Getting the available drivers seems to double the driver names:
    curl -H "X-Dataverse-key: xxxx-yyyy-zzz" http://localhost:8080/api/admin/dataverse/storageDrivers
    {"status":"OK","data":{"s31":"s31","s32":"s32","file2":"file2","file1":"file1"}}
    OK, noted: this is because the label and the value are the same, so label:value is what is displayed here.

@landreev
Contributor Author

2. get available drivers seems to double the driver names:
   curl -H "X-Dataverse-key: xxxx-yyyy-zzz" http://localhost:8080/api/admin/dataverse/storageDrivers
   {"status":"OK","data":{"s31":"s31","s32":"s32","file2":"file2","file1":"file1"}}'

Please note that this is not part of my PR; it's an existing API that was merged some months ago.
It doesn't "double" the driver names. These are "id" and "label" pairs describing each driver. We've been using the same value for both in all the practical use cases so far (though one might want something more descriptive for the label); they can be different.
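The apparent doubling can be seen by pulling the pairs out of the sample response quoted above. This self-contained shell sketch (no running server needed) extracts each id:label entry:

```shell
# The sample response from the storageDrivers API, as quoted above:
resp='{"status":"OK","data":{"s31":"s31","s32":"s32","file2":"file2","file1":"file1"}}'

# Extract the "key":"value" pairs; the first match is the status field,
# the rest are the id:label pairs for the four configured drivers.
# The halves only look "doubled" because each id happens to equal its label.
pairs=$(echo "$resp" | grep -o '"[^"]*":"[^"]*"' | tail -n +2)
printf '%s\n' "$pairs"
```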

@qqmyers
Member

qqmyers commented Sep 16, 2020

FWIW: TDL has s3 with label "TDL" and s3tacc with label "TACC". Harvard might want labels like 'Normal' and 'Large Data', etc.
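Distinct labels like these can be set with the per-store label option; this is a sketch using the `dataverse.files.<id>.label` pattern from the Dataverse installation guide, with the ids and labels mirroring the TDL example above:

```shell
# Sketch: the store ids stay "s3" and "s3tacc", but users see "TDL" and "TACC".
./asadmin create-jvm-options '-Ddataverse.files.s3.label=TDL'
./asadmin create-jvm-options '-Ddataverse.files.s3tacc.label=TACC'
```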

@landreev
Contributor Author

FWIW: TDL has s3 with label "TDL" and s3tacc with label "TACC". Harvard might want labels like 'Normal' and 'Large Data', etc.

I was thinking along the same lines: there may be situations where we need to spell out the name of the institution that owns a specific storage location (or even the name of the grant that pays for it)... We may want to display that kind of info on the page later on.

@landreev
Contributor Author

1. get current driver fails silently with bad id: {} and 500 error in server log. 

True; I just checked in a fix. It should now print the same "no such dataset" error message as the PUT and DELETE versions.

@kcondon kcondon merged commit efcd24d into develop Sep 16, 2020
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from QA 🔎✅ to Done 🚀 Sep 16, 2020
@kcondon kcondon deleted the 6872-direct-upload-for-datasets branch September 16, 2020 21:00
@djbrooke djbrooke added this to the 5.1 milestone Sep 17, 2020
Successfully merging this pull request may close these issues.

Direct S3 Upload mechanism for individual datasets (API based)