
Transfer US Census data releases and publish in production per release schedule (ongoing effort) #218

Closed
landreev opened this issue Apr 12, 2023 · 11 comments
Labels: Status: Needs Input (applied to issues in need of input from someone currently unavailable)

@landreev (Collaborator)

Details will be added here.
There is an ongoing discussion of what's involved in the dedicated Slack channel and a Google Doc.

landreev self-assigned this Apr 12, 2023
@landreev (Collaborator, Author)

Transferred the first 20 GB data sample as a test and added it to the draft dataset in the new Census collection in prod.: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1OR2A6&version=DRAFT
Some more sensitive details can be found in the email thread.

landreev changed the title from "Placeholder issue for handling US Census data transfer into production" to "Assist US Census with transferring a multi-TB data collection into production" on Apr 19, 2023
@landreev (Collaborator, Author)

Working now on the next stage of the project: attempting a bucket-to-bucket transfer test of a large-ish (~500 GB) chunk of data, to estimate how much time the main, TB-sized transfer is going to take.
To run a bucket-to-bucket copy via the AWS CLI (basically, `aws s3 cp s3://sourcebucket/xxx s3://destinationbucket/yyy`), you need a single AWS role that has access to both buckets: read on the source and write on the destination. (In other words, there is no way to authenticate with two different roles for source and destination; I only learned this last week.)
So we need an IAM role created under the AWS account that owns the prod. bucket, so that Census can grant it read access to theirs. I can't create it myself with the AWS CLI, so we need LTS to create it for us. That is the current step; I've made a request via lts-prodops.

From Census:
For the bucket-to-bucket transfer, the design is to have the recipient (yourself) create the IAM principal, and we would grant the account/principal read access to our bucket. For the larger dataset this is likely going to be critical. We should have taken those steps this time, sorry for the extra work.
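
For reference, the direct copy described above boils down to something like the sketch below; the bucket names, prefixes, and CLI profile are placeholders, not the real ones.

```bash
# Direct S3-to-S3 copy: a single identity (the shared IAM role) reads from
# the source bucket and writes to the destination bucket, so no data has to
# pass through a local machine. All names below are hypothetical.
aws s3 cp s3://census-source-bucket/release/ \
          s3://dataverse-prod-bucket/census/release/ \
          --recursive \
          --profile census-transfer-role
```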

@landreev (Collaborator, Author)

Had the IAM role created and passed it to the Census contact so they can grant it read access to the data source.
Can't wait to try to actually move a few hundred GBs between buckets!
("Between buckets" means the data never has to pass through our own AWS nodes, the prod. servers, where the transfer rate is limited to something like 150 MB/s. That is roughly the rate we get when reading from our own prod. bucket and from the US Census one. I'm really curious to see how fast it is going to be when the copy stays entirely within S3.)

@landreev (Collaborator, Author)

Unfortunately, we haven't been able to run a successful direct transfer today (the local IAM user they tried to grant read access to still can't read the source; it appears to be a configuration problem on their end).
I'm very much interested in testing this sooner rather than later, and genuinely curious about the performance. So I'll either make myself available to re-test next week, if/when they make any tweaks on their end, or pass the task on to somebody else.

@landreev (Collaborator, Author)

I don't know if this is a metaphor of sorts for government work in general, but it increasingly looks like two weeks were spent figuring out how to do this data transfer the "smart" way, only to discover that the smart way was in fact slower than the "dumb", brute-force way that had been available from the get-go.

Specifically, the direct bucket-to-bucket transfer that we finally got working appears to take longer than the "round trip" method of copying the data from their bucket to our server and then copying it to our bucket. (My best guess is that any potential speed advantage of copying directly was entirely offset by the fact that the source and destination buckets live in two different regions, us-east-2 and us-east-1, respectively.)
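
For comparison, the "round trip" method is just two copies staged through one of our servers, roughly as sketched below; the paths and bucket names are placeholders, and raising the CLI's transfer concurrency from its default can help saturate the link in either direction.

```bash
# Round trip: pull from the Census bucket to local scratch space on the
# prod. server, then push to our own bucket. Names are hypothetical.
aws s3 cp s3://census-source-bucket/release/ /scratch/census-staging/ --recursive
aws s3 cp /scratch/census-staging/ s3://dataverse-prod-bucket/census/release/ --recursive

# Optional tuning: allow more concurrent requests per transfer (default is 10).
aws configure set default.s3.max_concurrent_requests 50
```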

landreev changed the title from "Assist US Census with transferring a multi-TB data collection into production" to "Transfer US Census data releases and publish in production per release schedule (ongoing effort)" on Jun 14, 2023
cmbz added the "Status: Needs Input" label Jun 20, 2023
cmbz self-assigned this Jun 28, 2023
cmbz removed the "Status: Needs Input" label Jun 28, 2023
@cmbz (Collaborator) commented Jun 28, 2023

The plan is to host the latest batch of 26 TB of US Census data on NESE tape and make it available via a Globus link from a Harvard Dataverse dataset page.

Current status:

cmbz added the "Status: Needs Input" label Jun 29, 2023
@cmbz (Collaborator) commented Jun 29, 2023

  • The US Census team did not want to pursue the NESE tape option because of its 100 MB minimum file size requirement, which would have required them to reorganize and repackage their dataset files (see the sketch after this list)
  • We will continue to work with them to find alternative storage strategies
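
For reference, a minimal sketch of what checking and repackaging against that 100 MB minimum might have looked like; the staging path and archive name are hypothetical.

```bash
# Count how many files fall below the 100 MB tape minimum.
find /scratch/census-staging -type f -size -100M | wc -l

# One possible repackaging approach: bundle a release directory into a
# single tar archive so each object written to tape clears the minimum.
tar -cf census-release-2023.tar -C /scratch/census-staging release-2023/
```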

landreev removed their assignment Jun 30, 2023
@cmbz (Collaborator) commented Jul 5, 2023

@cmbz (Collaborator) commented Jul 20, 2023

  • We met with Scott Yockel this morning to discuss NESE tape and NESE disk options for supporting the 26 TB USCB dataset.
  • @siacus is considering options for a new proposal to USCB for support.

@cmbz (Collaborator) commented Aug 16, 2023

A meeting has been scheduled for 2023/09/20 with Harvard Library, Harvard Research Computing and Data, and other stakeholders to discuss the proposed model for supporting USCB big data using NERC resources.

@cmbz (Collaborator) commented Jan 3, 2024

2024/01/03
Update: Tim was informed of the new NESE disk option for large data support on 2023/12/20. We are waiting to hear whether they want to move forward with that option.

cmbz closed this as completed Mar 13, 2024