Data Capture Module (rsync support) #3145

Closed
pdurbin opened this Issue May 27, 2016 · 6 comments

Comments

@pdurbin
Member

pdurbin commented May 27, 2016

@pameyer @bmckinney and I met yesterday to discuss what we're calling a "Data Capture Module" or "DCM" for short. http://guides.dataverse.org/en/4.3.1/installation/prep.html#architecture-and-components lists a number of optional components for Dataverse (Shibboleth, rApache, Rserve, Geoconnect, etc.) and "Data Capture Module" will be added to the list. The DCM's main role in the architecture is facilitating large file transfer (#952), especially via non-HTTP mechanisms such as rsync.

The Minimum Viable Product (MVP) for the Data Capture Module includes support for rsync (#2960) but other mechanisms are under consideration such as Globus (#2728, #952), Aspera, and SFTP. https://data.sbgrid.org already supports rsync and we expect to be reusing code from that service, cleaning it up and generalizing it.

The task list for the Data Capture Module is still very much in flux but I'm creating this issue so that I have an issue number to associate a branch with as I start committing some code on the Dataverse side, especially API endpoints and the ability for Dataverse to talk to the DCM.

@pameyer

Contributor

pameyer commented May 27, 2016

For the DCM (and data uploads generally), "rsync" is shorthand for rsync over ssh. Client-side checksums are also part of the DCM: we'll have to decide how we want to handle the difference in hash functions (switching hashes or multi-hash support in Dataverse). Sorting that out might be out of scope for the DCM MVP.
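
To illustrate the multi-hash option mentioned above, here is a minimal Python sketch that computes several digests in a single pass over a file. The function name `file_checksums`, the default pair of MD5 and SHA-1, and the chunk size are assumptions for illustration; this is not DCM or Dataverse code.

```python
import hashlib
from pathlib import Path


def file_checksums(path, algorithms=("md5", "sha1"), chunk_size=1024 * 1024):
    """Return {algorithm: hex digest} for one file, reading it only once."""
    # One hasher per requested algorithm (names as accepted by hashlib.new).
    hashers = {name: hashlib.new(name) for name in algorithms}
    with Path(path).open("rb") as handle:
        # Read in fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Storing more than one digest per file would let a client-side checksum be compared against whichever algorithm the server uses, at the cost of extra hashing work during upload.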

pdurbin added a commit to pdurbin/dataverse that referenced this issue May 31, 2016

pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 16, 2016

pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 16, 2016

pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 17, 2016

pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 17, 2016

pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 21, 2016

pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 21, 2016

pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 21, 2016

pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 22, 2016

pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 23, 2016

pdurbin added a commit to pdurbin/dataverse that referenced this issue Jun 23, 2016

@pdurbin pdurbin referenced this issue Jun 24, 2016

Closed

SWORD upload limit, change from int to long #2169

0 of 2 tasks complete
@pdurbin

Member

pdurbin commented Jun 24, 2016

@bmckinney and I met yesterday (notes at https://docs.google.com/document/d/1BSVqAqsc_KieqfFfg_CeKdV7HwDO1Y3UuC2VMO-RJDk/edit?usp=sharing ). I demo'ed https://github.com/pdurbin/dataverse/tree/3145-dcm and he's going to try to merge that branch with https://github.com/bmckinney/bio-dataverse/tree/feature/file-system-import so we can deploy the combined code at https://dv.sbgrid.org and hopefully get closer to a prototype of rsync support. I expect we'll need help from @pameyer to switch from my mock version of the Data Capture Module at https://github.com/sbgrid/data-capture-module/blob/master/api/dcm.py to more of the real thing. All code mentioned above is very preliminary at this point. We still need to meet with @landreev to discuss how to make rsync support compatible with file versioning.

@djbrooke

Contributor

djbrooke commented Sep 12, 2016

(Note to self, mostly) This is a parent issue of the items created and estimated in the 9/8 meeting, notes recorded here:

https://docs.google.com/document/d/1wWSdKUOGA1L7UqFsgF3aOs8_9uyjnVpsPAxk7FObOOI/edit

These will be created as new GitHub issues and linked here.

@pdurbin

Member

pdurbin commented Sep 13, 2016

@djbrooke thanks! Here are the related issues we created today:

  • #3347 Administration Changes to Support rsync
  • #3348 Workflow changes to allow rsync uploads
  • #3349 Viewing files uploaded using rsync
  • #3350 Add/Remove download options for files uploaded using rsync
  • #3351 Custom File Handling - rsync files
  • #3352 Add new APIs in support of rsync
  • #3353 File Import Batch job in support of rsync
  • #3354 Support SHA1 rather than MD5 as a checksum on a per file basis

pdurbin added a commit that referenced this issue Sep 20, 2016

Support SHA1 rather than MD5 as a checksum on a per file basis #3354
A dependency for rsync support (#3145) is the ability to persist SHA-1
checksums for files rather than MD5 checksums.

A new installation-wide configuration setting called
":FileFixityChecksumAlgorithm" has been added which can be set to
"SHA-1" to have Dataverse calculate and show SHA-1 checksums rather than
MD5 checksums.

In order to run this branch you must run the provided SQL upgrade
script: scripts/database/upgrades/3354-alt-checksum.sql

In addition, the Solr schema should be updated to the version in this
branch.
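
A hedged sketch of how the setting described in this commit message might be applied from Python: it builds, but does not send, a PUT request to the Dataverse admin settings endpoint. The base URL `http://localhost:8080`, the helper name `build_setting_request`, and the assumption that the admin API is reachable from where the script runs are illustrative, not taken from this issue.

```python
import urllib.request


def build_setting_request(base_url, name, value):
    """Build (but do not send) a PUT request for a Dataverse database setting."""
    url = f"{base_url.rstrip('/')}/api/admin/settings/{name}"
    # The setting value travels as the raw request body.
    return urllib.request.Request(url, data=value.encode("utf-8"), method="PUT")


# Assumed local installation; adjust the base URL for a real deployment.
request = build_setting_request(
    "http://localhost:8080", ":FileFixityChecksumAlgorithm", "SHA-1"
)
# Sending is left out on purpose: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to inspect the URL and body before touching a live installation.
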
@pdurbin

Member

pdurbin commented Oct 30, 2016

#3249 is highly related in that, ultimately, end users will need to know how to download the data via rsync or whatever mechanism is used. The focus of this issue to date has been researchers uploading data, not end users downloading it.

@pdurbin

Member

pdurbin commented Jun 28, 2017

These days we're working in small chunks. To follow along, start at the next small chunk that's currently in the backlog: #3942. Closing.
