Project: Pilot Large Data Support Service #178

Open
11 of 23 tasks
cmbz opened this issue Feb 9, 2024 · 5 comments
Labels

  • Dataverse Project (Issues related to Dataverse Project software)
  • GREI 5 Use Cases
  • Harvard Dataverse (Issues related to Harvard Dataverse Repository)
  • Project: Large Data Support Pilot (Pilot of large data support services using NESE resources)


cmbz commented Feb 9, 2024

Overview

Pilot Harvard Dataverse large data support services using NESE tape resources for several datasets from Harvard affiliates.

Tasks

  • Identify prospective data collections
  • Confirm pilot participants
  • Coordinate with pilot collection owners to estimate collection size and curation needs
  • Create Harvard Dataverse collections
  • Coordinate with NESE support staff for tape provisioning
  • Coordinate with data collection owners, NESE support, and Dataverse team to upload data
  • Coordinate with IQSS finance regarding costs
  • Assess pilot and propose workflow improvements

Process Development and Management

  • Develop an intake process including basic and extended consultation process
  • Develop and launch RT queue, including service attributes (will map to service offerings)
  • Develop large data collection intake form @sbarbosadataverse
  • Define what is included in the Large Data Technical Support service component
  • Define what is included in the Large Data Administration service component
  • Define what is included in the Large Data Monitoring service component
  • Write end-user instructions for NESE data access and add them to the Dataverse documentation (Guide) and the HDV support website
  • Document large data curation services process (see hdv-curation issues on large data)

Team

Pilot Participants

The Oregon-Massachusetts Mammography Database (OMAMA-DB)

See also: https://fly.cs.umb.edu/omama/ and https://github.com/IQSS/dataverse-HDV-Curation/issues/443

  • Size: 8 TB
  • Contact: Daniel Haehn, SEAS & UMass Boston

GEOS-Chem 1 and 10 year Benchmark data

See also:
https://help.hmdc.harvard.edu/Ticket/Display.html?id=361614
https://github.com/IQSS/dataverse-HDV-Curation/issues/456

DrivAerNet: A Parametric Car Dataset for Data-driven Aerodynamic Design and Graph-Based Drag Prediction

See also: https://github.com/IQSS/dataverse-HDV-Curation/issues/444

  • Moving their "streamlined" dataset (approximately <1 TB) to the DeCoDE lab's dataverse
  • File sizes: largest 1.6 GB, smallest 400 bytes
  • Total number of files: uncertain; the user quoted 10^6 but was unsure of the figure
  • File types: .tar.gz, .txt, .vtk, .stl, .sh; data will be packed in .zip format
  • Full 16 TB dataset expected by mid to end of May
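
Where a collection owner is unsure of file counts or size ranges, as above, a quick directory survey can firm up the estimate before provisioning. The following is a minimal sketch, assuming the data sits on a local or mounted filesystem; `summarize_tree` is a hypothetical helper written for illustration and is not part of Dataverse, Globus, or NESE tooling.

```python
import os
from collections import Counter


def summarize_tree(root):
    """Walk `root` and report file count, total bytes, size extremes, and extension counts.

    Hypothetical helper for intake estimates; not part of any Dataverse or NESE tool.
    """
    file_count = 0
    total_bytes = 0
    largest = None   # (size_in_bytes, path)
    smallest = None  # (size_in_bytes, path)
    extensions = Counter()

    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # Skip files that vanish or cannot be read during the walk.
                continue
            file_count += 1
            total_bytes += size
            if largest is None or size > largest[0]:
                largest = (size, path)
            if smallest is None or size < smallest[0]:
                smallest = (size, path)
            # Only the final suffix is counted, so "x.tar.gz" is tallied as ".gz".
            ext = os.path.splitext(name)[1].lower() or "(none)"
            extensions[ext] += 1

    return {
        "file_count": file_count,
        "total_bytes": total_bytes,
        "largest": largest,
        "smallest": smallest,
        "extensions": dict(extensions),
    }
```

Calling `summarize_tree("/path/to/dataset")` (the path is a placeholder) returns a dictionary that can be pasted into an intake ticket. On trees with around 10^6 files the walk may take several minutes.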

Under discussion for support

  • https://help.hmdc.harvard.edu/Ticket/Display.html?id=362556
  • https://help.hmdc.harvard.edu/Ticket/Display.html?id=358471

Related RT Tickets

Issues

Related

Resources

cmbz added the labels Project: Large Data Support Pilot, Harvard Dataverse, and Dataverse Project on Mar 2, 2024

cmbz commented Apr 1, 2024

Status: March 2024

Closed


cmbz commented Apr 10, 2024

Status: April 2024

Improvements made to Globus: "The PR fixes the issue with multifile Globus transfers out/downloads not working for draft datasets. It also improves handling of cases where ineligible files are selected for download or Globus transfer by only showing the download/transfer mechanisms that will work (on some files) given the files in the dataset and by improving the UI messages to indicate that files may not be eligible either because the user doesn't have permission (restricted or embargoed), or because the files can't be downloaded/transferred (i.e. files not in a Globus store when the user tries a Globus transfer or files in a Globus store that doesn't support normal downloads when the user selects download.)"


sbarbosadataverse commented May 21, 2024

Status: May 2024

  • Added new Process Development and Management section to this issue
  • Tested Globus/NESE download for the OMAMA dataset: the download succeeded, but the guestbook is not recording downloads, possibly because the dataset is still in "draft"
  • Leonid created a demo dataset for testing Globus access for non-Harvard affiliates: https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/URDDBC
  • Ceilyn revised documentation for large data services, budget planner, and internal HDV service components
  • Added a new section to record groups that are interested in, but not yet confirmed for, the pilot


cmbz commented May 28, 2024

Status: June 2024


sbarbosadataverse commented Jul 8, 2024

Status: July 2024
