Commit

Merge pull request #8891 from GlobalDataverseCommunityConsortium/GDCC/DC-1

GDCC/Globus and Big Data Support
pdurbin committed Sep 19, 2022
2 parents 454f3f1 + c554ecc commit 1435dcc
Showing 47 changed files with 2,787 additions and 550 deletions.
28 changes: 28 additions & 0 deletions doc/sphinx-guides/source/developers/big-data-support.rst
@@ -120,6 +120,34 @@ To configure the options mentioned above, an administrator must set two JVM options
``./asadmin create-jvm-options "-Ddataverse.files.<id>.public=true"``
``./asadmin create-jvm-options "-Ddataverse.files.<id>.ingestsizelimit=<size in bytes>"``

.. _globus-support:

Globus File Transfer
--------------------

Note: Globus file transfer is still experimental but feedback is welcome! See :ref:`support`.

Users can transfer files via `Globus <https://www.globus.org>`_ into and out of datasets when their Dataverse installation is configured to use a Globus-accessible S3 store and a community-developed `dataverse-globus <https://github.com/scholarsportal/dataverse-globus>`_ "transfer" app has been properly installed and configured.

Due to differences in the access control models of a Dataverse installation and Globus, enabling the Globus capability on a store will disable the ability to restrict and embargo files in that store.

As Globus aficionados know, Globus endpoints can be in a variety of places, from data centers to personal computers. This means that, from within the Dataverse software, a Globus transfer can feel like an upload or a download (with Globus Connect Personal running on your laptop, for example) or like a true transfer from one server to another (from a cluster in a data center into a Dataverse dataset, or vice versa).

Globus transfer uses a very efficient transfer mechanism and has additional features that make it suitable for large files and large numbers of files:

* robust file transfer capable of restarting after network or endpoint failures
* third-party transfer, which enables a user accessing a Dataverse installation in their desktop browser to initiate a transfer of their files from a remote endpoint (e.g., on an institutional high-performance computing cluster) directly to an S3 store managed by the Dataverse installation

Globus transfer requires use of the Globus S3 connector, which requires a paid Globus subscription at the host institution. Users will need a Globus account, which can be obtained via their institution or directly from Globus (at no cost).

The setup required to enable Globus is described in the `Community Dataverse-Globus Setup and Configuration document <https://docs.google.com/document/d/1mwY3IVv8_wTspQC0d4ddFrD2deqwr-V5iAGHgOy4Ch8/edit?usp=sharing>`_ and the references therein.

As described in that document, Globus transfers can be initiated by choosing the Globus option in the dataset upload panel. (Because Globus performs transfers asynchronously, it is not available during dataset creation.) Analogously, "Globus Transfer" is one of the download options in the "Access Dataset" menu and, optionally, in the file landing page download menu (if/when supported in the dataverse-globus app).

An overview of the control and data transfer interactions between components was presented at the 2022 Dataverse Community Meeting and can be viewed in the `Integrations and Tools Session Video <https://youtu.be/3ek7F_Dxcjk?t=5289>`_ around the 1 hr 28 min mark.

See also :ref:`Globus settings <:GlobusBasicToken>`.
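
A minimal sketch of enabling the upload option via the standard admin settings API (the ``globus`` method token below is an assumption; consult the setup document above for authoritative values):

.. code-block:: bash

  # Add Globus to the allowed upload methods, keeping any methods already in use
  curl -X PUT -d 'native/http,globus' http://localhost:8080/api/admin/settings/:UploadMethods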

Data Capture Module (DCM)
-------------------------

34 changes: 33 additions & 1 deletion doc/sphinx-guides/source/installation/config.rst
@@ -676,7 +676,7 @@ In addition to having the type "remote" and requiring a label, Trusted Remote Stores
These and other available options are described in the table below.

Trusted remote stores can range from being a static trusted website to a sophisticated service managing access requests and logging activity
and/or managing access to a secure enclave. For specific remote stores, consult their documentation when configuring the remote store in your Dataverse installation.
and/or managing access to a secure enclave. See :doc:`/developers/big-data-support` for additional information on how to use a trusted remote store. For specific remote stores, consult their documentation when configuring the remote store in your Dataverse installation.

Note that in the current implementation, activities where Dataverse needs access to data bytes, e.g. to create thumbnails or validate hash values at publication, will fail if a remote store does not allow Dataverse access. Implementers of such trusted remote stores should consider using Dataverse's settings to disable ingest, validation of files at publication, etc., as needed.
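
A sketch of the kind of configuration this refers to, using the per-store ingest size limit shown in :doc:`/developers/big-data-support` (the zero-limit approach to disabling ingest is an assumption, as is the use of ``:FileValidationOnPublishEnabled`` for skipping publication-time checks):

.. code-block:: bash

  # Effectively disable tabular ingest for the remote store (assumed approach)
  ./asadmin create-jvm-options "-Ddataverse.files.<id>.ingestsizelimit=0"

  # Skip checksum validation of files when datasets are published
  curl -X PUT -d false http://localhost:8080/api/admin/settings/:FileValidationOnPublishEnabled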

@@ -2982,3 +2982,35 @@ The URL of an LDN Inbox to which the LDN Announce workflow step will send messages
++++++++++++++++++++++++++

The list of parent dataset field names for which the LDN Announce workflow step should send messages. See :doc:`/developers/workflows` for details.

.. _:GlobusBasicToken:

:GlobusBasicToken
+++++++++++++++++

GlobusBasicToken encodes credentials for Globus integration. See :ref:`globus-support` for details.

:GlobusEndpoint
+++++++++++++++

GlobusEndpoint is the id of the Globus endpoint used with Globus integration. See :ref:`globus-support` for details.

:GlobusStores
+++++++++++++

A comma-separated list of the S3 stores that are configured to support Globus integration. See :ref:`globus-support` for details.

:GlobusAppURL
+++++++++++++

The URL where the `dataverse-globus <https://github.com/scholarsportal/dataverse-globus>`_ "transfer" app has been deployed to support Globus integration. See :ref:`globus-support` for details.

:GlobusPollingInterval
++++++++++++++++++++++

The interval in seconds between Dataverse calls to Globus to check on upload progress. Defaults to 50 seconds. See :ref:`globus-support` for details.

:GlobusSingleFileTransfer
+++++++++++++++++++++++++

A true/false option to add a Globus transfer option to the file download menu. (This option is not yet fully supported in the dataverse-globus app.) See :ref:`globus-support` for details.
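
Like other database settings, these can be set via the admin API. A sketch with placeholder values (the credentials, endpoint id, store id, and URL below are illustrative, not real values):

.. code-block:: bash

  curl -X PUT -d '<base64-credentials>' http://localhost:8080/api/admin/settings/:GlobusBasicToken
  curl -X PUT -d '<globus-endpoint-uuid>' http://localhost:8080/api/admin/settings/:GlobusEndpoint
  curl -X PUT -d 's3globus' http://localhost:8080/api/admin/settings/:GlobusStores
  curl -X PUT -d 'https://globus.example.edu' http://localhost:8080/api/admin/settings/:GlobusAppURL
  curl -X PUT -d '50' http://localhost:8080/api/admin/settings/:GlobusPollingInterval
  curl -X PUT -d 'false' http://localhost:8080/api/admin/settings/:GlobusSingleFileTransfer
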
5 changes: 4 additions & 1 deletion src/main/java/edu/harvard/iq/dataverse/DatasetLock.java
@@ -77,7 +77,10 @@ public enum Reason {

/** DCM (rsync) upload in progress */
DcmUpload,


/** Globus upload in progress */
GlobusUpload,

/** Tasks handled by FinalizeDatasetPublicationCommand:
Registering PIDs for DS and DFs and/or file validation */
finalizePublication,
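Because Globus uploads run asynchronously, the new GlobusUpload lock reason keeps a dataset locked while a transfer is in flight. A sketch of inspecting and, as a superuser, clearing such a lock through the existing dataset locks API (dataset id 42 is illustrative, and acceptance of GlobusUpload as a type value is an assumption):

  # List active locks on a dataset
  curl http://localhost:8080/api/datasets/42/locks

  # Clear a stuck Globus upload lock (superuser API token required)
  curl -X DELETE -H "X-Dataverse-key: $API_TOKEN" "http://localhost:8080/api/datasets/42/locks?type=GlobusUpload"
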
67 changes: 50 additions & 17 deletions src/main/java/edu/harvard/iq/dataverse/DatasetPage.java
@@ -113,6 +113,7 @@
import edu.harvard.iq.dataverse.engine.command.impl.SubmitDatasetForReviewCommand;
import edu.harvard.iq.dataverse.externaltools.ExternalTool;
import edu.harvard.iq.dataverse.externaltools.ExternalToolServiceBean;
import edu.harvard.iq.dataverse.globus.GlobusServiceBean;
import edu.harvard.iq.dataverse.export.SchemaDotOrgExporter;
import edu.harvard.iq.dataverse.externaltools.ExternalToolHandler;
import edu.harvard.iq.dataverse.makedatacount.MakeDataCountLoggingServiceBean;
@@ -251,6 +252,8 @@ public enum DisplayMode {
LicenseServiceBean licenseServiceBean;
@Inject
DataFileCategoryServiceBean dataFileCategoryService;
@Inject
GlobusServiceBean globusService;

private Dataset dataset = new Dataset();

@@ -334,7 +337,7 @@ public void setSelectedHostDataverse(Dataverse selectedHostDataverse) {
private Boolean hasRsyncScript = false;

private Boolean hasTabular = false;


/**
* If the dataset version has at least one tabular file. The "hasTabular"
@@ -1191,7 +1194,7 @@ public String getComputeUrl(FileMetadata metadata) {
} catch (IOException e) {
logger.info("DatasetPage: Failed to get storageIO");
}
if (settingsWrapper.isTrueForKey(SettingsServiceBean.Key.PublicInstall, false)) {
if (isHasPublicStore()) {
return settingsWrapper.getValueForKey(SettingsServiceBean.Key.ComputeBaseUrl) + "?" + this.getPersistentId() + "=" + swiftObject.getSwiftFileName();
}

@@ -1828,15 +1831,21 @@ public void updateOwnerDataverse() {

// initiate from scratch: (isolate the creation of a new dataset in its own method?)
init(true);
// rebuild the bred crumbs display:
// rebuild the bread crumbs display:
dataverseHeaderFragment.initBreadcrumbs(dataset);
}
}

public boolean rsyncUploadSupported() {

return settingsWrapper.isRsyncUpload() && DatasetUtil.isAppropriateStorageDriver(dataset);
return settingsWrapper.isRsyncUpload() && DatasetUtil.isRsyncAppropriateStorageDriver(dataset);
}

// Globus must be one of the upload methods listed in the :UploadMethods setting
// and the dataset's store must be enabled for Globus via the :GlobusStores setting
public boolean globusUploadSupported() {
    return settingsWrapper.isGlobusUpload() && settingsWrapper.isGlobusEnabledStorageDriver(dataset.getEffectiveStorageDriverId());
}



private String init(boolean initFull) {

@@ -2006,10 +2015,10 @@ private String init(boolean initFull) {
}
} catch (RuntimeException ex) {
logger.warning("Problem getting rsync script(RuntimeException): " + ex.getLocalizedMessage());
FacesContext.getCurrentInstance().addMessage(null, new FacesMessage(FacesMessage.SEVERITY_ERROR, "Problem getting rsync script:", ex.getLocalizedMessage()));
FacesContext.getCurrentInstance().addMessage(null, new FacesMessage(FacesMessage.SEVERITY_ERROR, "Problem getting rsync script:", ex.getLocalizedMessage()));
} catch (CommandException cex) {
logger.warning("Problem getting rsync script (Command Exception): " + cex.getLocalizedMessage());
FacesContext.getCurrentInstance().addMessage(null, new FacesMessage(FacesMessage.SEVERITY_ERROR, "Problem getting rsync script:", cex.getLocalizedMessage()));
FacesContext.getCurrentInstance().addMessage(null, new FacesMessage(FacesMessage.SEVERITY_ERROR, "Problem getting rsync script:", cex.getLocalizedMessage()));
}
}

@@ -2065,7 +2074,7 @@ private String init(boolean initFull) {
updateDatasetFieldInputLevels();
}

if (settingsWrapper.isTrueForKey(SettingsServiceBean.Key.PublicInstall, false)){
if (isHasPublicStore()){
JH.addMessage(FacesMessage.SEVERITY_WARN, BundleUtil.getStringFromBundle("dataset.message.label.fileAccess"),
BundleUtil.getStringFromBundle("dataset.message.publicInstall"));
}
@@ -2178,6 +2187,10 @@ private void displayLockInfo(Dataset dataset) {
BundleUtil.getStringFromBundle("file.rsyncUpload.inProgressMessage.details"));
lockedDueToDcmUpload = true;
}
if (dataset.isLockedFor(DatasetLock.Reason.GlobusUpload)) {
JH.addMessage(FacesMessage.SEVERITY_WARN, BundleUtil.getStringFromBundle("file.globusUpload.inProgressMessage.summary"),
BundleUtil.getStringFromBundle("file.globusUpload.inProgressMessage.details"));
}
//This is a hack to remove dataset locks for File PID registration if
//the dataset is released
//in testing we had cases where datasets with 1000 files were remaining locked after being published successfully
@@ -2899,7 +2912,7 @@ public String editFileMetadata(){

public String deleteDatasetVersion() {
DeleteDatasetVersionCommand cmd;

Map<Long, String> deleteStorageLocations = datafileService.getPhysicalFilesToDelete(dataset.getLatestVersion());
boolean deleteCommandSuccess = false;
try {
@@ -2911,7 +2924,7 @@ public String deleteDatasetVersion() {
JH.addMessage(FacesMessage.SEVERITY_FATAL, BundleUtil.getStringFromBundle("dataset.message.deleteFailure"));
logger.severe(ex.getMessage());
}

if (deleteCommandSuccess && !deleteStorageLocations.isEmpty()) {
datafileService.finalizeFileDeletes(deleteStorageLocations);
}
@@ -5026,7 +5039,7 @@ public boolean isFileAccessRequestMultiButtonRequired(){
}
for (FileMetadata fmd : workingVersion.getFileMetadatas()){
//Change here so that if all restricted files have pending requests there's no Request Button
if ((!this.fileDownloadHelper.canDownloadFile(fmd) && (fmd.getDataFile().getFileAccessRequesters() == null
if ((!this.fileDownloadHelper.canDownloadFile(fmd) && (fmd.getDataFile().getFileAccessRequesters() == null
|| ( fmd.getDataFile().getFileAccessRequesters() != null
&& !fmd.getDataFile().getFileAccessRequesters().contains((AuthenticatedUser)session.getUser()))))){
return true;
@@ -5754,7 +5767,7 @@ public boolean isFileDeleted (DataFile dataFile) {

return dataFile.getDeleted();
}

public String getEffectiveMetadataLanguage() {
return getEffectiveMetadataLanguage(false);
}
@@ -5765,16 +5778,16 @@ public String getEffectiveMetadataLanguage(boolean ofParent) {
}
return mdLang;
}

public String getLocaleDisplayName(String code) {
String displayName = settingsWrapper.getBaseMetadataLanguageMap(false).get(code);
if(displayName==null && !code.equals(DvObjectContainer.UNDEFINED_METADATA_LANGUAGE_CODE)) {
//Default (for cases such as when a Dataset has a metadatalanguage code but :MetadataLanguages is no longer defined).
displayName = new Locale(code).getDisplayName();
displayName = new Locale(code).getDisplayName();
}
return displayName;
return displayName;
}

public Set<Entry<String, String>> getMetadataLanguages() {
return settingsWrapper.getBaseMetadataLanguageMap(false).entrySet();
}
@@ -5786,7 +5799,7 @@ public List<String> getVocabScripts() {
public String getFieldLanguage(String languages) {
return fieldService.getFieldLanguage(languages,session.getLocaleCode());
}

public void setExternalStatus(String status) {
try {
dataset = commandEngine.submit(new SetCurationStatusCommand(dvRequestService.getDataverseRequest(), dataset, status));
@@ -6017,7 +6030,7 @@ public void validateTerms(FacesContext context, UIComponent component, Object value) {
}
}
}

public boolean downloadingRestrictedFiles() {
if (fileMetadataForAction != null) {
return fileMetadataForAction.isRestricted();
@@ -6029,4 +6042,24 @@ public boolean downloadingRestrictedFiles() {
}
return false;
}


//Determines whether this Dataset uses a public store and therefore doesn't support embargoed or restricted files
public boolean isHasPublicStore() {
return settingsWrapper.isTrueForKey(SettingsServiceBean.Key.PublicInstall, StorageIO.isPublicStore(dataset.getEffectiveStorageDriverId()));
}

public void startGlobusTransfer() {
    // Look up an API token that the dataverse-globus app can use on the user's behalf
    ApiToken apiToken = null;
    User user = session.getUser();
    if (user instanceof AuthenticatedUser) {
        apiToken = authService.findApiTokenByUser((AuthenticatedUser) user);
    } else if (user instanceof PrivateUrlUser) {
        // Private URL users have no stored API token; wrap the private URL token instead
        PrivateUrlUser privateUrlUser = (PrivateUrlUser) user;
        PrivateUrl privUrl = privateUrlService.getPrivateUrlFromDatasetId(privateUrlUser.getDatasetId());
        apiToken = new ApiToken();
        apiToken.setTokenString(privUrl.getToken());
    }
    // Hand the generated script to the browser to launch the Globus transfer app
    PrimeFaces.current().executeScript(globusService.getGlobusDownloadScript(dataset, apiToken));
}
}
21 changes: 11 additions & 10 deletions src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java
@@ -16,23 +16,17 @@
import edu.harvard.iq.dataverse.engine.command.impl.FinalizeDatasetPublicationCommand;
import edu.harvard.iq.dataverse.engine.command.impl.GetDatasetStorageSizeCommand;
import edu.harvard.iq.dataverse.export.ExportService;
import edu.harvard.iq.dataverse.globus.GlobusServiceBean;
import edu.harvard.iq.dataverse.harvest.server.OAIRecordServiceBean;
import edu.harvard.iq.dataverse.search.IndexServiceBean;
import edu.harvard.iq.dataverse.settings.SettingsServiceBean;
import edu.harvard.iq.dataverse.util.BundleUtil;
import edu.harvard.iq.dataverse.util.SystemConfig;
import edu.harvard.iq.dataverse.workflows.WorkflowComment;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import java.io.*;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.*;
import java.util.logging.FileHandler;
import java.util.logging.Level;
import java.util.logging.Logger;
@@ -96,6 +90,12 @@ public class DatasetServiceBean implements java.io.Serializable {
@EJB
SystemConfig systemConfig;

@EJB
GlobusServiceBean globusServiceBean;

@EJB
UserNotificationServiceBean userNotificationService;

private static final SimpleDateFormat logFormatter = new SimpleDateFormat("yyyy-MM-dd'T'HH-mm-ss");

@PersistenceContext(unitName = "VDCNet-ejbPU")
@@ -1130,4 +1130,5 @@ public void deleteHarvestedDataset(Dataset dataset, DataverseRequest request, Logger hdLogger) {
hdLogger.warning("Failed to destroy the dataset");
}
}

}
21 changes: 17 additions & 4 deletions src/main/java/edu/harvard/iq/dataverse/EditDatafilesPage.java
@@ -650,8 +650,8 @@ public String init() {
setUpRsync();
}

if (settingsService.isTrueForKey(SettingsServiceBean.Key.PublicInstall, false)){
JH.addMessage(FacesMessage.SEVERITY_WARN, getBundleString("dataset.message.publicInstall"));
if (isHasPublicStore()){
JH.addMessage(FacesMessage.SEVERITY_WARN, getBundleString("dataset.message.label.fileAccess"), getBundleString("dataset.message.publicInstall"));
}

return null;
@@ -3051,13 +3051,21 @@ public boolean rsyncUploadSupported() {
// ToDo - rsync was written before multiple store support and currently is hardcoded to use the DataAccess.S3 store.
// When those restrictions are lifted/rsync can be configured per store, the test in the
// Dataset Util method should be updated
if (settingsWrapper.isRsyncUpload() && !DatasetUtil.isAppropriateStorageDriver(dataset)) {
if (settingsWrapper.isRsyncUpload() && !DatasetUtil.isRsyncAppropriateStorageDriver(dataset)) {
//dataset.file.upload.setUp.rsync.failed.detail
FacesMessage message = new FacesMessage(FacesMessage.SEVERITY_ERROR, BundleUtil.getStringFromBundle("dataset.file.upload.setUp.rsync.failed"), BundleUtil.getStringFromBundle("dataset.file.upload.setUp.rsync.failed.detail"));
FacesContext.getCurrentInstance().addMessage(null, message);
}

return settingsWrapper.isRsyncUpload() && DatasetUtil.isAppropriateStorageDriver(dataset);
return settingsWrapper.isRsyncUpload() && DatasetUtil.isRsyncAppropriateStorageDriver(dataset);
}

// Globus must be one of the upload methods listed in the :UploadMethods setting
// and the dataset's store must be in the list allowed by the :GlobusStores setting
public boolean globusUploadSupported() {
return settingsWrapper.isGlobusUpload()
&& settingsWrapper.isGlobusEnabledStorageDriver(dataset.getEffectiveStorageDriverId());
}

private void populateFileMetadatas() {
Expand Down Expand Up @@ -3093,4 +3101,9 @@ public boolean isFileAccessRequest() {
public void setFileAccessRequest(boolean fileAccessRequest) {
this.fileAccessRequest = fileAccessRequest;
}

//Determines whether this Dataset uses a public store and therefore doesn't support embargoed or restricted files
public boolean isHasPublicStore() {
return settingsWrapper.isTrueForKey(SettingsServiceBean.Key.PublicInstall, StorageIO.isPublicStore(dataset.getEffectiveStorageDriverId()));
}
}
(The remaining 41 of 47 changed files are not shown.)
