Merge pull request #6995 from GlobalDataverseCommunityConsortium/IQSS/6763

IQSS/6763-multi-part upload API calls
kcondon committed Aug 31, 2020
2 parents 796e612 + c6d06e5 commit cf1ac95
Showing 10 changed files with 970 additions and 412 deletions.
3 changes: 3 additions & 0 deletions doc/release-notes/6763-multipart-uploads.md
@@ -0,0 +1,3 @@
# Large Data Support (continued)

Direct S3 uploads now support multi-part uploading of large files (> 1 GB by default) via the user interface and the API (which is used in the [Dataverse Uploader](https://github.com/GlobalDataverseCommunityConsortium/dataverse-uploader)). This allows uploads larger than 5 GB when using Amazon AWS S3 stores.
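
As context for how a client of this API uses the per-part upload URLs, here is a minimal Java sketch (not part of this commit) that PUTs one part of a file to a presigned S3 URL and captures the ETag response header, which the client must later report back so the multi-part upload can be completed. The class and method names are illustrative; only the HTTP mechanics are assumed.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Illustrative only: uploads a single part of a multi-part upload to a
 * presigned S3 URL and returns the ETag header that the client reports back
 * when completing the upload. The presigned URL is assumed to have been
 * obtained from Dataverse's direct-upload API.
 */
public class PartUploader {

    public static String uploadPart(URL presignedPartUrl, byte[] partBytes) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) presignedPartUrl.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("PUT");
        conn.setFixedLengthStreamingMode(partBytes.length);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(partBytes);
        }
        int status = conn.getResponseCode();
        if (status != 200) {
            throw new IOException("Part upload failed with HTTP status " + status);
        }
        // The ETag response header is only visible to a browser-based client if
        // the bucket's CORS configuration exposes it (see the cors.json change below).
        return conn.getHeaderField("ETag");
    }
}
```

Exposing the ETag header to browser clients is what the CORS configuration change further down in this commit enables.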
18 changes: 13 additions & 5 deletions doc/sphinx-guides/source/developers/big-data-support.rst
@@ -18,10 +18,17 @@ This option can handle files >40GB and could be appropriate for files up to a TB
To configure these options, an administrator must set two JVM options for the Dataverse server using the same process as for other configuration options:

``./asadmin create-jvm-options "-Ddataverse.files.<id>.download-redirect=true"``

``./asadmin create-jvm-options "-Ddataverse.files.<id>.upload-redirect=true"``


With multiple stores configured, it is possible to configure one S3 store with direct upload and/or download to support large files (in general or for specific dataverses) while configuring only direct download, or no direct access for another store.

The direct upload option now switches between uploading the file in a single piece (up to 1 GB by default) and sending it as multiple parts. This threshold can be changed by setting:

``./asadmin create-jvm-options "-Ddataverse.files.<id>.min-part-size=<size in bytes>"``

For AWS, the minimum allowed part size is 5 MB (5*1024*1024 bytes) and the maximum is 5 GB (5*1024**3 bytes). Other providers may set different limits.
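
As a worked example of how these limits interact, the sketch below (not part of the Dataverse code) chooses a part size for a given file, assuming the store follows AWS's published limits of a 5 MB minimum part size, a 5 GB maximum part size, and at most 10,000 parts per upload.

```java
/**
 * Illustrative only: choose a part size for an S3 multi-part upload, given the
 * configured minimum part size (dataverse.files.<id>.min-part-size, 1 GB by
 * default) and the AWS limits of a 5 MB minimum part size, a 5 GB maximum part
 * size, and at most 10,000 parts per upload.
 */
public class PartSizeCalculator {

    private static final long AWS_MIN_PART_SIZE = 5L * 1024 * 1024;        // 5 MB
    private static final long AWS_MAX_PART_SIZE = 5L * 1024 * 1024 * 1024; // 5 GB
    private static final long AWS_MAX_PARTS = 10_000L;

    public static long choosePartSize(long fileSize, long configuredMinPartSize) {
        long partSize = Math.max(configuredMinPartSize, AWS_MIN_PART_SIZE);
        // Grow the part size until the file fits within the 10,000-part limit.
        while ((fileSize + partSize - 1) / partSize > AWS_MAX_PARTS) {
            partSize *= 2;
        }
        if (partSize > AWS_MAX_PART_SIZE) {
            throw new IllegalArgumentException("File too large for a single multi-part upload");
        }
        return partSize;
    }

    public static void main(String[] args) {
        long tenGb = 10L * 1024 * 1024 * 1024;
        long oneGb = 1024L * 1024 * 1024;
        // A 10 GB file with the default 1 GB threshold is sent in 10 parts of 1 GB each.
        System.out.println(choosePartSize(tenGb, oneGb));
    }
}
```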

It is also possible to set file upload size limits per store. See the :MaxFileUploadSizeInBytes setting described in the :doc:`/installation/config` guide.

@@ -30,8 +37,8 @@ At present, one potential drawback for direct-upload is that files are only part
``./asadmin create-jvm-options "-Ddataverse.files.<id>.ingestsizelimit=<size in bytes>"``


**IMPORTANT:** One additional step that is required to enable direct download to work with previewers is to allow cross site (CORS) requests on your S3 store.
The example below shows how to enable the minimum needed CORS rules on a bucket using the AWS CLI command line tool. Note that you may need to add more methods and/or locations, if you also need to support certain previewers and external tools.
**IMPORTANT:** One additional step that is required to enable direct uploads via Dataverse and for direct download to work with previewers is to allow cross-origin (CORS) requests on your S3 store.
The example below shows how to enable CORS rules (to support upload and download) on a bucket using the AWS CLI. Note that you may want to limit the AllowedOrigins and/or AllowedHeaders further. See https://github.com/GlobalDataverseCommunityConsortium/dataverse-previewers/wiki/Using-Previewers-with-download-redirects-from-S3 for additional information about doing this.

``aws s3api put-bucket-cors --bucket <BUCKET_NAME> --cors-configuration file://cors.json``

@@ -42,9 +49,10 @@ with the contents of the file cors.json as follows:
{
  "CORSRules": [
    {
      "AllowedOrigins": ["https://<DATAVERSE SERVER>"],
      "AllowedOrigins": ["*"],
      "AllowedHeaders": ["*"],
      "AllowedMethods": ["PUT", "GET"]
      "AllowedMethods": ["PUT", "GET"],
      "ExposeHeaders": ["ETag"]
    }
  ]
}
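
For administrators who manage buckets from Java rather than the AWS CLI, the same rules can be applied with the AWS SDK for Java v1. The sketch below (not part of this commit) is equivalent to the cors.json example above; the SDK calls (CORSRule, BucketCrossOriginConfiguration, setBucketCrossOriginConfiguration) are standard v1 APIs, while client construction is left to the default credential and region resolution.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketCrossOriginConfiguration;
import com.amazonaws.services.s3.model.CORSRule;
import java.util.Arrays;

/**
 * Illustrative only: applies the same CORS rules as the cors.json example
 * above, using the AWS SDK for Java v1 instead of the AWS CLI.
 */
public class CorsSetup {

    public static void main(String[] args) {
        String bucketName = "<BUCKET_NAME>"; // placeholder, as in the CLI example

        CORSRule rule = new CORSRule()
                .withAllowedOrigins(Arrays.asList("*"))
                .withAllowedHeaders(Arrays.asList("*"))
                .withAllowedMethods(Arrays.asList(CORSRule.AllowedMethods.PUT, CORSRule.AllowedMethods.GET))
                .withExposedHeaders(Arrays.asList("ETag"));

        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        s3.setBucketCrossOriginConfiguration(bucketName,
                new BucketCrossOriginConfiguration(Arrays.asList(rule)));
    }
}
```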
33 changes: 17 additions & 16 deletions doc/sphinx-guides/source/installation/config.rst
@@ -371,7 +371,7 @@ Amazon S3 Storage (or Compatible)

Dataverse supports Amazon S3 storage as well as other S3-compatible stores (like Minio, Ceph RADOS S3 Gateway and many more) for files uploaded to Dataverse.

The Dataverse S3 driver supports multipart upload for files over 4 GB.
The Dataverse S3 driver supports multi-part upload for large files (over 1 GB by default; see the min-part-size option in the table below to change this).

**Note:** The Dataverse Team is most familiar with AWS S3, and can provide support on its usage with Dataverse. Thanks to community contributions, the application's architecture also allows non-AWS S3 providers. The Dataverse Team can provide very limited support on these other providers. We recommend reaching out to the wider Dataverse community if you have questions.

@@ -526,21 +526,22 @@ Lastly, go ahead and restart your Payara server. With Dataverse deployed and the
S3 Storage Options
##################

=========================================== ================== ================================================================== =============
JVM Option Value Description Default value
=========================================== ================== ================================================================== =============
dataverse.files.storage-driver-id <id> Enable <id> as the default storage driver. ``file``
dataverse.files.<id>.bucket-name <?> The bucket name. See above. (none)
dataverse.files.<id>.download-redirect ``true``/``false`` Enable direct download or proxy through Dataverse. ``false``
dataverse.files.<id>.upload-redirect ``true``/``false`` Enable direct upload of files added to a dataset to the S3 store. ``false``
dataverse.files.<id>.ingestsizelimit <size in bytes> Maximum size of direct-upload files that should be ingested. (none)
dataverse.files.<id>.url-expiration-minutes <?> If direct uploads/downloads: time until links expire. Optional. 60
dataverse.files.<id>.custom-endpoint-url <?> Use custom S3 endpoint. Needs URL either with or without protocol. (none)
dataverse.files.<id>.custom-endpoint-region <?> Only used when using custom endpoint. Optional. ``dataverse``
dataverse.files.<id>.path-style-access ``true``/``false`` Use path style buckets instead of subdomains. Optional. ``false``
dataverse.files.<id>.payload-signing ``true``/``false`` Enable payload signing. Optional ``false``
dataverse.files.<id>.chunked-encoding ``true``/``false`` Disable chunked encoding. Optional ``true``
=========================================== ================== ================================================================== =============
=========================================== ================== ========================================================================= =============
JVM Option Value Description Default value
=========================================== ================== ========================================================================= =============
dataverse.files.storage-driver-id <id> Enable <id> as the default storage driver. ``file``
dataverse.files.<id>.bucket-name <?> The bucket name. See above. (none)
dataverse.files.<id>.download-redirect ``true``/``false`` Enable direct download or proxy through Dataverse. ``false``
dataverse.files.<id>.upload-redirect ``true``/``false`` Enable direct upload of files added to a dataset to the S3 store. ``false``
dataverse.files.<id>.ingestsizelimit <size in bytes> Maximum size of direct-upload files that should be ingested. (none)
dataverse.files.<id>.url-expiration-minutes <?> If direct uploads/downloads: time until links expire. Optional. 60
dataverse.files.<id>.min-part-size <?> Multipart direct uploads will occur for files larger than this. Optional. ``1024**3``
dataverse.files.<id>.custom-endpoint-url <?> Use custom S3 endpoint. Needs URL either with or without protocol. (none)
dataverse.files.<id>.custom-endpoint-region <?> Only used when using custom endpoint. Optional. ``dataverse``
dataverse.files.<id>.path-style-access ``true``/``false`` Use path style buckets instead of subdomains. Optional. ``false``
dataverse.files.<id>.payload-signing ``true``/``false`` Enable payload signing. Optional ``false``
dataverse.files.<id>.chunked-encoding ``true``/``false`` Disable chunked encoding. Optional ``true``
=========================================== ================== ========================================================================= =============
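
JVM options set with ``asadmin create-jvm-options`` surface inside the application as Java system properties, so a per-store option such as min-part-size can be resolved with ``System.getProperty`` plus the documented default. The helper below is an illustrative sketch, not the actual lookup code in the Dataverse S3 driver.

```java
/**
 * Illustrative only: resolve a per-store S3 option such as
 * dataverse.files.<id>.min-part-size from the JVM options set via asadmin,
 * falling back to the documented default of 1 GB (1024**3 bytes).
 */
public class StoreConfig {

    public static long getMinPartSize(String storeId) {
        String key = "dataverse.files." + storeId + ".min-part-size";
        String value = System.getProperty(key);
        if (value == null || value.isEmpty()) {
            return 1024L * 1024L * 1024L; // documented default: 1 GB
        }
        return Long.parseLong(value);
    }

    public static void main(String[] args) {
        // With -Ddataverse.files.s3.min-part-size=536870912 set, this prints 512 MB in bytes.
        System.out.println(getMinPartSize("s3"));
    }
}
```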

Reported Working S3-Compatible Storage
######################################
34 changes: 34 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/EditDatafilesPage.java
@@ -59,6 +59,7 @@
import org.primefaces.model.file.UploadedFile;
import javax.json.Json;
import javax.json.JsonObject;
import javax.json.JsonObjectBuilder;
import javax.json.JsonArray;
import javax.json.JsonReader;
import org.apache.commons.httpclient.HttpClient;
@@ -1722,6 +1723,7 @@ public String getRsyncScriptFilename() {
        return rsyncScriptFilename;
    }

    @Deprecated
    public void requestDirectUploadUrl() {


@@ -1743,6 +1745,38 @@ public void requestDirectUploadUrl() {
PrimeFaces.current().executeScript("uploadFileDirectly('" + url + "','" + storageIdentifier + "')");
}

    public void requestDirectUploadUrls() {

        Map<String, String> paramMap = FacesContext.getCurrentInstance().getExternalContext().getRequestParameterMap();

        String sizeString = paramMap.get("fileSize");
        long fileSize = Long.parseLong(sizeString);

        S3AccessIO<?> s3io = FileUtil.getS3AccessForDirectUpload(dataset);
        if (s3io == null) {
            FacesContext.getCurrentInstance().addMessage(uploadComponentId,
                    new FacesMessage(FacesMessage.SEVERITY_ERROR,
                            BundleUtil.getStringFromBundle("dataset.file.uploadWarning"),
                            "Direct upload not supported for this dataset"));
        }
        JsonObjectBuilder urls = null;
        String storageIdentifier = null;
        try {
            storageIdentifier = FileUtil.getStorageIdentifierFromLocation(s3io.getStorageLocation());
            urls = s3io.generateTemporaryS3UploadUrls(dataset.getGlobalId().asString(), storageIdentifier, fileSize);

        } catch (IOException io) {
            logger.warning(io.getMessage());
            FacesContext.getCurrentInstance().addMessage(uploadComponentId,
                    new FacesMessage(FacesMessage.SEVERITY_ERROR,
                            BundleUtil.getStringFromBundle("dataset.file.uploadWarning"),
                            "Issue in connecting to S3 store for direct upload"));
        }

        PrimeFaces.current().executeScript(
                "uploadFileDirectly('" + urls.build().toString() + "','" + storageIdentifier + "','" + fileSize + "')");
    }

    public void uploadFinished() {
        // This method is triggered from the page, by the <p:upload ... onComplete=...
        // attribute.
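
The new `requestDirectUploadUrls()` method above delegates to `S3AccessIO.generateTemporaryS3UploadUrls()`, which is not included in this diff. As a rough sketch of what generating per-part presigned URLs involves with the AWS SDK for Java v1 (the SDK calls are real v1 APIs; the surrounding structure is an assumption, not the actual S3AccessIO implementation):

```java
import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import java.net.URL;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Rough sketch (not the actual S3AccessIO code): start a multi-part upload and
 * generate one presigned PUT URL per part. The client uploads each part to its
 * URL and the server later completes the upload with the collected ETags.
 */
public class MultipartUrlSketch {

    public static Map<Integer, URL> presignPartUrls(AmazonS3 s3, String bucket, String key,
                                                    long fileSize, long partSize, int urlExpirationMinutes) {
        // Ask S3 for an upload id that ties the individual parts together.
        String uploadId = s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key))
                .getUploadId();

        Date expiration = new Date(System.currentTimeMillis() + urlExpirationMinutes * 60_000L);
        int partCount = (int) ((fileSize + partSize - 1) / partSize);

        Map<Integer, URL> urls = new LinkedHashMap<>();
        for (int partNumber = 1; partNumber <= partCount; partNumber++) {
            GeneratePresignedUrlRequest req = new GeneratePresignedUrlRequest(bucket, key)
                    .withMethod(HttpMethod.PUT)
                    .withExpiration(expiration);
            // partNumber and uploadId must be part of the signed query string.
            req.addRequestParameter("partNumber", Integer.toString(partNumber));
            req.addRequestParameter("uploadId", uploadId);
            urls.put(partNumber, s3.generatePresignedUrl(req));
        }
        return urls;
    }
}
```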
