
Tutorial for data transfer workflow for large datasets #235

Open
jnywong opened this issue Jun 12, 2024 · 0 comments

Context

Driven by the need to process large bioscientific datasets for the Catalyst partner communities.

We propose a data transfer workflow like the following (sketched in code after the list):

a) users should stage their 'input' datasets in object storage buckets
b) if a workflow supports reading directly from object storage, use that; otherwise, make a local copy from object storage to /tmp
c) use /tmp for any intermediate files created during a workflow pipeline
d) push 'output' data sets to object storage for persistence
e) strongly encourage community users to keep home directory storage to under 1GB per user
f) discourage use of shared storage except for smaller datasets (100GB total per community)

See 2i2c-org/infrastructure#4213
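
A minimal sketch of steps (a)–(d), assuming the hub image has `fsspec` and `gcsfs` (or `s3fs`) installed and that bucket credentials are already configured. The bucket name and file paths below are hypothetical placeholders, not actual Catalyst community buckets:

```python
import fsspec

# Hypothetical bucket; use "s3" instead of "gs" on AWS-backed hubs.
fs = fsspec.filesystem("gs")

# (a)/(b) Read an 'input' dataset directly from object storage if the
# workflow supports it...
with fsspec.open("gs://example-community-bucket/inputs/sample.csv", "rt") as f:
    header = f.readline()

# ...otherwise make a local copy from object storage to /tmp.
fs.get("example-community-bucket/inputs/sample.csv", "/tmp/sample.csv")

# (c) Write any intermediate files to /tmp during the pipeline.
with open("/tmp/intermediate.csv", "w") as f:
    f.write("processed,data\n")

# (d) Push the 'output' dataset back to object storage for persistence.
fs.put("/tmp/intermediate.csv", "example-community-bucket/outputs/result.csv")
```

This keeps home directories small (e) because inputs, intermediates, and outputs all live in object storage or /tmp rather than under /home.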

Proposal

Document this recommended workflow as a tutorial that guides hub admins and end users through each step.

Updates and actions

No response

jnywong self-assigned this Jun 12, 2024