
1 | 1.1.1 | Minimum Viable Product (MVP) for registering metadata in the repository and connecting the metadata to the data in the research computing remote storage (NESE), including Globus endpoints | 15 #13

Closed
11 of 13 tasks
sync-by-unito bot opened this issue Oct 7, 2022 · 16 comments
Labels
pm.GREI https://docs.google.com/document/d/1RdifpHJDFqx8Y8-Dsv_VnnTgezjNHKpSyRei4cw3C-k/edit?usp=sharing pm.GREI-d-1.1.1 NIH, yr1, aim1, task1: MVP for registering metadata in the repository

Comments

sync-by-unito bot commented Oct 7, 2022

References:

Problem Statement

Prior to this work, Dataverse was capable of storing files up to about 1 TB in S3.

Proposed Solution

The first part was to integrate Globus into Dataverse as a large-file transfer mechanism. This was done by building on work already done in Borealis, the Canadian Dataverse Repository (formerly the Scholars Portal fork of Dataverse), to allow Dataverse to integrate with Globus. This integration allows files larger than a terabyte to be transferred from within Dataverse to an S3 store via Globus.

The second part addresses very large files: items upward of a petabyte or so are not realistic for Dataverse to store. This solution enables Dataverse to manage datasets where one or more of the files is referenced rather than stored directly within a Dataverse repository. In this solution, the large file remains in its original location and is referenced from Dataverse.
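For context on how such a store is wired up: Dataverse defines S3-backed stores through JVM options on the application server. The sketch below is illustrative only; the store id `nese`, label, and bucket name are hypothetical placeholders, not the production NESE values, and the Globus-specific settings live in the installation's own configuration.

```shell
# Define a hypothetical S3 store named "nese" (id, label, and bucket
# name are placeholders, not the real NESE values).
./asadmin create-jvm-options "-Ddataverse.files.nese.type=s3"
./asadmin create-jvm-options "-Ddataverse.files.nese.label=NESE"
./asadmin create-jvm-options "-Ddataverse.files.nese.bucket-name=nese-dataverse-bucket"
# Redirect uploads and downloads so clients move data directly to and
# from the store rather than through the Dataverse server itself,
# which matters for multi-TB files.
./asadmin create-jvm-options "-Ddataverse.files.nese.upload-redirect=true"
./asadmin create-jvm-options "-Ddataverse.files.nese.download-redirect=true"
```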

Acceptance Criteria

  • Discussion with Dataverse community members doing related work
  • Set up a Globus environment at NESE
  • Design and implement code to call the API to interact with Globus endpoints
  • Test the integration
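The "code to call the API" criterion boils down to submitting a transfer task to the Globus Transfer API. As a rough sketch of what such a call sends (the helper function, endpoint UUIDs, and paths below are hypothetical illustrations, not the actual Dataverse implementation), the JSON task document looks like this:

```python
import json
import uuid


def build_transfer_document(source_endpoint, dest_endpoint, items):
    """Build the JSON body for submitting a transfer task to the
    Globus Transfer API. `items` is a list of
    (source_path, destination_path) pairs."""
    return {
        "DATA_TYPE": "transfer",
        # A real client first obtains a submission id from the Transfer
        # service; a random UUID stands in for it in this sketch.
        "submission_id": str(uuid.uuid4()),
        "source_endpoint": source_endpoint,
        "destination_endpoint": dest_endpoint,
        "DATA": [
            {
                "DATA_TYPE": "transfer_item",
                "source_path": src,
                "destination_path": dst,
                "recursive": False,
            }
            for src, dst in items
        ],
    }


# Hypothetical endpoint UUIDs for a researcher's machine and a NESE store.
doc = build_transfer_document(
    "aaaaaaaa-1111-2222-3333-444444444444",
    "bbbbbbbb-5555-6666-7777-888888888888",
    [("/data/scan.tar", "/dataset-storage/scan.tar")],
)
print(json.dumps(doc, indent=2))
```

In the real integration this document would be POSTed to the Transfer service with an OAuth bearer token; the sketch only shows the shape of the request body.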

Associated Issues:

See comments below for latest update.


Issue is synchronized with this Smartsheet row by Unito

@mreekie mreekie self-assigned this Oct 7, 2022

mreekie commented Oct 7, 2022

This issue represents a deliverable funded by the NIH.
This deliverable supports the NIH Initiative to Improve Access to NIH-funded Data.

Aim 1: Support the sharing of very large datasets (>TBs) by integrating the metadata in the repository with the data in the research computing storage

An increasing number of research studies deal with very large datasets (>TB to PBs). When a study is completed or ready to be distributed, it is not always feasible or desirable to deposit the data in the repository. Instead, in this project we propose to publish the metadata to the repository for discoverability of the study and to access the data remotely from the research computing cluster or cloud storage. In this scenario, the data does not need to be downloaded to the user's computer but can be viewed, explored, and analyzed directly in the research computing environment.

The Harvard Dataverse repository will leverage the Northeast Storage Exchange (NESE) and the New England Research Cloud (NERC) to provide storage and compute for these very large datasets, making them findable and accessible through the repository and keeping the metadata and data connected via a persistent link. These two services, NESE and NERC, are large-scale multi-institutional infrastructure components of the Massachusetts Green High Performance Computing Center (MGHPCC), a five-member public-private partnership between Boston University, Harvard University, the Massachusetts Institute of Technology, Northeastern University, and the University of Massachusetts. MGHPCC is a $90 million facility with the capacity to grow to 768 racks, 15 MW of power, and 1 terabit of network capacity in its current 95,000 sq. ft. data center.

One of the key integration points to support large data transfers is incorporating Globus endpoints. Globus is a distributed data transfer technology, developed at the University of Chicago, that is becoming ubiquitous in research computing services. It will allow the realistic transfer of TBs of data in less than an hour. Globus will also be a front end for NESE Tape, a 100+ PB tape library within MGHPCC.
The integration of the repository with research computing is one of the components of a Data Commons that will facilitate collaboration, dissemination, preservation and validation of data-centric research.

Related Deliverables:
2.1.1 | 1 | Test and apply metadata registry for large datasets and integration with research computing for a few NIH-funded projects, piloting the cost recovery model | 10
4.1.1 | 1 | Assess cost recovery model | 10
4.1.2 | 1 | Improve metadata registry UX/UI and integration with remote storage and computation based on user feedback from the previous years | 10


This work also represents a deliverable funded internally.

Harvard Data Commons MVP: Objective 1
Objective: Integrate Harvard Research Computing environments and Harvard repositories to facilitate publishing data and/or metadata throughout the research project lifecycle
Publish datasets with both data and metadata (GB scale)
Publish metadata only and reference the data in research computing (RC) storage (TB scale)

  • Work package 1: Review and assess an existing open-source Globus connector tool
  • Work package 2: Implement the rest of the connector tool to support use cases A and B
  • Work package 3: Extend Dataverse UI to support the connector tool in a user-friendly way
  • Work package 4: Beta test with real data and users

This picture shows how the Harvard Data Commons work maps to Dataverse work.

Image

This is a closer look at the Harvard Datacommons work: GDCC DataCommons Objective 1 Task Tracking

  • This is not a public link


mreekie commented Oct 8, 2022

Next step:

  • Sync up with @qqmyers to summarize this work in a non-technical way.
  • Are we finished with this NIH deliverable? (yes)
  • Write a final update and add it to the description.
    • Update the notes from discussion/update from Jim
    • Run them by Jim & Phil for correctness.
    • Incorporate changes


mreekie commented Oct 11, 2022

Summary
This work enabled a minimum viable process for supporting much larger files.
The code associated with this minimum viable process is included in Dataverse 5.12.

Prior to this work, Dataverse was capable of storing files up to about 1 TB in S3.

The first part was to integrate Globus into Dataverse as a large-file transfer mechanism. This was done by building on work already done in Borealis, the Canadian Dataverse Repository (formerly the Scholars Portal fork of Dataverse), to allow Dataverse to integrate with Globus. This integration allows files larger than a terabyte to be transferred from within Dataverse to an S3 store via Globus.

The second part addresses very large files: items upward of a petabyte or so are not realistic for Dataverse to store. This solution enables Dataverse to manage datasets where one or more of the files is referenced rather than stored directly within a Dataverse repository. In this solution, the large file remains in its original location and is referenced from Dataverse.

What's left to complete the overall minimum viable process?


What's not in this deliverable?

  • The software has not seen use with real users. The Borealis Dataverse-Globus UI was created as a proof of concept. Some modifications were made to enable this work to be completed, but from a usability point of view the interface is still a proof of concept.
  • The under-the-hood setup for using Globus with a Dataverse instance is still complicated.


mreekie commented Oct 12, 2022

Who:

  • Jim
  • definitely Len
  • There's discussion here about what's left and how to handle that.
    • e.g. maintenance at NESE
  • Leonid


mreekie commented Nov 4, 2022

Updated today. Met with Stefano and Len.

  • There was a second meeting with Scott this week.
  • The establishment of the Globus endpoint at NESE is one of the last big things to be done for this deliverable.

(1.1.1) Now that these changes have been released in Dataverse 5.12, the focus has shifted to setting up a production environment for researchers to use the remote large data support. Discussions are continuing over the next weeks around allocating storage and establishing a Globus endpoint for Dataverse from the Northeast Storage Exchange (NESE).


mreekie commented Dec 2, 2022

Globus support needs to be revised.
You can upload the data.
Downloading the data takes so long that the request times out.
This use case needs more work.
There are discussions with the library about who will do the additional work and who will pay.
There are also discussions around payment for the storage itself.


qqmyers commented Dec 2, 2022

Slow download is not a problem with the current implementation using a Globus S3 connector over a file system. It will be a limit when the underlying storage is a tape system, where using the Globus S3 connector may also not be as useful as using their file connector. See IQSS/dataverse#9123 for details. scholarsportal/dataverse-globus#2 is also relevant.


mreekie commented Dec 5, 2022

November 2022

(1.1.1) Focus shifted to setting up a production environment for researchers to use the remote large data support. Discussions are continuing over the next weeks around allocating storage and establishing a Globus endpoint for Dataverse at the Northeast Storage Exchange (NESE).

@mreekie mreekie changed the title 1 | 1.1.1 | Minimum Viable Product (MVP) for registering metadata in the repository and connecting the metadata to the data in the research computing remote storage (NESE), including Globus endpoints | 15 1 | 1.1.1 | Minimum Viable Product (MVP) for registering metadata in the repository and connecting the metadata to the data in the research computing remote storage (NESE), including Globus endpoints | 15 | 100% Dec 6, 2022
@mreekie mreekie changed the title 1 | 1.1.1 | Minimum Viable Product (MVP) for registering metadata in the repository and connecting the metadata to the data in the research computing remote storage (NESE), including Globus endpoints | 15 | 100% 1 | 1.1.1 | Minimum Viable Product (MVP) for registering metadata in the repository and connecting the metadata to the data in the research computing remote storage (NESE), including Globus endpoints | 15 Dec 6, 2022

mreekie commented Dec 15, 2022

Discussions are proceeding around next steps.

Summary from talking with Jim:

There was a meeting Monday where Scott Yokel, Len, Stefano, Jim, and others went over the proposal/requirements/design options doc Jim put together. The plan is to talk with Scholars Portal/Borealis about their interest in this as well.


Next step:

  • Include a brief update.
  • Follow-up with Stefano on this deliverable and any follow-on work in the second year.


mreekie commented Dec 15, 2022

Reviewed today. Closing out 2022. This work closes out at the end of February.

The following deliverables are complete:

  • Discussion with Dataverse community members doing related work
  • Set up Globus environment at NESE
The following two deliverables are still being worked on:

  • Design and implement code to call API to interact with Globus endpoints,
  • Test integration

We have an MVP representing a connection between Harvard Dataverse (HDV) and Globus, but it does not support large files yet. This requires additional design and implementation steps.


mreekie commented Dec 15, 2022

Last updated: Thu Dec 15 2022 before I left for the holiday
Report: Dec 2022

Planning continues around supporting the Globus endpoint for Dataverse at the Northeast Storage Exchange (NESE) and moving beyond the MVP. The MVP enables connection from Harvard Dataverse to the storage but does not support large files.

90%


mreekie commented Jan 10, 2023

Priority discussion with Stefano:

  • Can Jim help us size the last two issues for this?


mreekie commented Feb 2, 2023

Last update: approx. Dec 20, 2022
(1.1.1) Planning continues around supporting the Globus endpoint for Dataverse at the Northeast Storage Exchange (NESE) and moving beyond the MVP. The MVP enables connection from Harvard Dataverse to the Globus endpoint and storage but does not yet support real-time browsing for large files due to specific technological characteristics of tape support.

90%


mreekie commented Feb 7, 2023

Monthly report.
(1.1.1) Planning continues around supporting the Globus endpoint for Dataverse at the Northeast Storage Exchange (NESE) and moving beyond the MVP. The MVP enables connection from Harvard Dataverse to the Globus endpoint and storage but does not yet support real-time browsing for large files due to specific technological characteristics of tape support. The technical plan for this last step is anchored in issue IQSS/dataverse#9123.

90%

@mreekie mreekie transferred this issue from IQSS/dataverse Mar 3, 2023
@mreekie mreekie added the pm.GREI https://docs.google.com/document/d/1RdifpHJDFqx8Y8-Dsv_VnnTgezjNHKpSyRei4cw3C-k/edit?usp=sharing label Mar 3, 2023
@mreekie mreekie added the pm.GREI-d-1.1.1 NIH, yr1, aim1, task1: MVP for registering metadata in the repository label Mar 16, 2023
@mreekie mreekie moved this to Reporting Deliverables in IQSS Dataverse Project Mar 18, 2023

mreekie commented Apr 10, 2023

March:

(1.1.1) This activity was 90% complete in year 1 and was transferred to year 2.


mreekie commented Apr 17, 2023

This activity was 90% complete at the end of year 1. Year 2 work toward completion will be tracked as deliverable 2.1.1a.

Draft summary:

This activity was 90% complete at the end of year 1. The Dataverse integration required for the MVP was released as part of Dataverse 5.12. This involved integrating Dataverse with Globus endpoints to enable remote storage of large data files while keeping the metadata in Dataverse. The focus then shifted to setting up a production environment using a Globus endpoint for Dataverse at the Northeast Storage Exchange (NESE). The challenge is that NESE cannot yet support real-time browsing for large files due to specific technological characteristics of tape support. This is where work will continue during year 2, tracked as aim 1, yr 2, task 1a (2.1.1a), starting at 90% complete.

@mreekie mreekie closed this as completed Apr 17, 2023
@github-project-automation github-project-automation bot moved this from ℹ Reporting Deliverables to Clear of the Backlog in IQSS Dataverse Project Apr 17, 2023