
Transfer US Census data releases and publish in production per release schedule (ongoing effort) #218

Closed
landreev opened this issue Apr 12, 2023 · 11 comments
Labels: Status: Needs Input (applied to issues in need of input from someone currently unavailable)

@landreev (Collaborator)

Details will be added here.
There is an ongoing discussion of what's involved in the dedicated Slack channel and a Google Doc.

landreev self-assigned this Apr 12, 2023
@landreev (Collaborator, Author)

Transferred the first 20 GB data sample as a test and added it to the draft dataset in the new Census collection in prod.: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1OR2A6&version=DRAFT
Some more sensitive details can be found in the email thread.

landreev changed the title from "Placeholder issue for handling US Census data transfer into production" to "Assist US Census with transferring a multi-TB data collection into production" on Apr 19, 2023
@landreev (Collaborator, Author)

Working now on the next stage of the project: attempting a bucket-to-bucket transfer test of a large-ish (~500 GB) chunk of data, to estimate how much time the main, TB-sized transfer is going to take.
To run a bucket-to-bucket copy via the AWS CLI (basically, `aws s3 cp s3://sourcebucket/xxx s3://destinationbucket/yyy`), you need a single AWS role that has access to both buckets: read on the source and write on the destination. (In other words, there is no way to authenticate with two different roles for source and destination; I only learned this last week.)
So we need an IAM role created under the AWS account that owns the prod. bucket, so that Census can grant it read access to theirs. I can't create it myself with the AWS CLI, so we need LTS to create it for us. That is the current step; I've made a request via lts-prodops.

From Census:
For the bucket-to-bucket transfer, the design is to have the recipient (yourself) create the IAM principal, and we would grant the account/principal read access to our bucket. For the larger dataset this is likely going to be critical. We should have taken those steps this time, sorry for the extra work.
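
For reference, the direct copy described above boils down to something like the sketch below; the bucket names, prefixes, and CLI profile are placeholders, not the real ones.

```bash
# Direct S3-to-S3 copy: a single identity (the shared IAM role) reads from
# the source bucket and writes to the destination bucket, so no data has to
# pass through a local machine. All names below are hypothetical.
aws s3 cp s3://census-source-bucket/release/ \
          s3://dataverse-prod-bucket/census/release/ \
          --recursive \
          --profile census-transfer-role
```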

@landreev (Collaborator, Author)

Had the IAM role created and passed it to the Census contact so they can grant it read access to the data source.
Can't wait to try to actually move a few hundred GBs between buckets!
("Between buckets" means the data never has to pass through our own AWS nodes, the prod. servers, where the transfer rate is limited to something like 150 MB/s. That is roughly the rate we get when reading from our own prod. bucket and from the US Census one. I'm really curious to see how fast it is going to be when the copy stays entirely within S3.)

@landreev (Collaborator, Author)

Unfortunately, we haven't been able to run a successful direct transfer today (the local IAM user they tried to grant read access to still can't read the source; it appears to be a configuration problem on their end).
I'm very much interested in testing this sooner rather than later, and genuinely curious about the performance. So I'll either make myself available to re-test next week, if/when they make any tweaks on their end, or pass the task on to somebody else.

@landreev (Collaborator, Author)

I don't know if this is a metaphor of sorts for government work in general, but it increasingly looks like two weeks were spent figuring out how to do this data transfer the "smart" way, only to discover that the smart way was in fact slower than the "dumb", brute-force way that had been available from the get-go.

Specifically, the direct bucket-to-bucket transfer that we finally got working appears to take longer than the "round trip" method of copying the data from their bucket to our server and then copying it to our bucket. (My best guess is that any potential speed advantage of copying directly was entirely offset by the fact that the source and destination buckets live in two different regions, us-east-2 and us-east-1, respectively.)
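
For comparison, the "round trip" method is just two copies staged through one of our servers, roughly as sketched below; the paths and bucket names are placeholders, and raising the CLI's transfer concurrency from its default can help saturate the link in either direction.

```bash
# Round trip: pull from the Census bucket to local scratch space on the
# prod. server, then push to our own bucket. Names are hypothetical.
aws s3 cp s3://census-source-bucket/release/ /scratch/census-staging/ --recursive
aws s3 cp /scratch/census-staging/ s3://dataverse-prod-bucket/census/release/ --recursive

# Optional tuning: allow more concurrent requests per transfer (default is 10).
aws configure set default.s3.max_concurrent_requests 50
```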

landreev changed the title from "Assist US Census with transferring a multi-TB data collection into production" to "Transfer US Census data releases and publish in production per release schedule (ongoing effort)" on Jun 14, 2023
cmbz added the "Status: Needs Input" label Jun 20, 2023
cmbz self-assigned this Jun 28, 2023
cmbz removed the "Status: Needs Input" label Jun 28, 2023
@cmbz (Collaborator) commented Jun 28, 2023

The plan is to host the latest batch of 26 TB of US Census data on NESE tape and make it available via a Globus link from a Harvard Dataverse dataset page.

Current status:

cmbz added the "Status: Needs Input" label Jun 29, 2023
@cmbz (Collaborator) commented Jun 29, 2023

  • The US Census team did not want to pursue the NESE tape option because of its 100 MB minimum file size requirement, which would have required them to reorganize and repackage their dataset files (see the sketch after this list)
  • We will continue to work with them to find alternative storage strategies
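
For reference, a minimal sketch of what checking and repackaging against that 100 MB minimum might have looked like; the staging path and archive name are hypothetical.

```bash
# Count how many files fall below the 100 MB tape minimum.
find /scratch/census-staging -type f -size -100M | wc -l

# One possible repackaging approach: bundle a release directory into a
# single tar archive so each object written to tape clears the minimum.
tar -cf census-release-2023.tar -C /scratch/census-staging release-2023/
```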

landreev removed their assignment Jun 30, 2023
@cmbz (Collaborator) commented Jul 5, 2023

@cmbz (Collaborator) commented Jul 20, 2023

  • We met with Scott Yockel this morning to discuss NESE tape and NESE disk options for supporting the 26 TB USCB dataset.
  • @siacus is considering options for a new proposal to USCB for support.

@cmbz (Collaborator) commented Aug 16, 2023

A meeting has been scheduled for 2023/09/20 with Harvard Library, Harvard Research Computing and Data, and other stakeholders to discuss the proposed model for supporting USCB big data using NERC resources.

@cmbz (Collaborator) commented Jan 3, 2024

2024/01/03
Update: Tim was informed of the new NESE disk option for large data support on 2023/12/20. We are waiting to hear whether they want to move forward with that option.

cmbz closed this as completed Mar 13, 2024