Transfer US Census data releases and publish in production per release schedule (ongoing effort) #218
Transferred the first 20 GB data sample as a test and added it to the draft dataset in the new Census collection in prod: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1OR2A6&version=DRAFT
Now working on the next stage of the project: attempting a bucket-to-bucket transfer test of a largish (~500 GB) chunk of data, to estimate how much time the main, TB-sized transfer will take.
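For reference, the bucket-to-bucket copy can be driven entirely server-side, so the object bytes never pass through our machine. A minimal boto3 sketch, assuming hypothetical bucket names and key prefix (the real ones aren't reproduced here):

```python
# Sketch of a server-side S3 bucket-to-bucket copy (boto3).
# Bucket names and prefix are hypothetical placeholders.
import boto3

s3 = boto3.resource("s3")
src_bucket = "census-source-bucket"   # hypothetical source (us-east-2)
dst_bucket = "dataverse-dest-bucket"  # hypothetical destination (us-east-1)
prefix = "census/2023/"               # hypothetical prefix for the ~500 GB chunk

for obj in s3.Bucket(src_bucket).objects.filter(Prefix=prefix):
    # copy() performs a managed (multipart when needed) server-side copy,
    # so the data moves within S3 rather than through the local host.
    s3.meta.client.copy(
        CopySource={"Bucket": src_bucket, "Key": obj.key},
        Bucket=dst_bucket,
        Key=obj.key,
    )
```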
Had the IAM role created and passed it to the Census contact so they can grant read access to the data source.
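For context, the access pattern we're aiming for looks roughly like the sketch below, assuming the Census side attaches the read policy and trust relationship to the role; the role ARN and bucket name are hypothetical placeholders:

```python
# Sketch: assume the cross-account role, then do a read check on the
# source bucket. Role ARN and bucket name are hypothetical.
import boto3

sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/census-read-role",  # hypothetical
    RoleSessionName="census-transfer",
)["Credentials"]

s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# Simple read check: list a few keys to confirm access was actually granted.
resp = s3.list_objects_v2(Bucket="census-source-bucket", MaxKeys=5)  # hypothetical
print([o["Key"] for o in resp.get("Contents", [])])
```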
Unfortunately, we haven't been able to run a successful direct transfer today (I haven't been able to gain read access with the local IAM user they tried to grant it to... something on their end, it appears).
I don't know if this is a metaphor of sorts for government work in general, but it is increasingly looking like two weeks were spent figuring out how to do this data transfer the "smart way", only to discover that the smart way was in fact slower than the "dumb", brute-force way that had been available from the get-go. Specifically, the direct bucket-to-bucket transfer that we finally got to work appears to be taking longer than the "round trip" method of copying the data from their bucket to our server and then copying it to our bucket. (My best guess is that any potential speed advantage of copying directly was entirely offset by the fact that the source and destination buckets live in two different regions, us-east-2 and us-east-1, respectively.)
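For the record, the "round trip" method is just the two-hop copy sketched below (hypothetical names throughout); boto3's transfer manager parallelizes the multipart chunks on both hops, which is presumably where it wins back the cost of the extra hop:

```python
# Sketch of the "round trip": source bucket -> local server -> destination bucket.
# Bucket names, key, and local path are hypothetical placeholders.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
config = TransferConfig(max_concurrency=16)  # parallel multipart threads

key = "census/part-0001.dat"        # hypothetical object key
local_path = "/scratch/part-0001.dat"

# Hop 1: pull the object from the source bucket to local disk.
s3.download_file("census-source-bucket", key, local_path, Config=config)
# Hop 2: push it from local disk to our bucket.
s3.upload_file(local_path, "dataverse-dest-bucket", key, Config=config)
```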
The plan is to host the latest batch of 26 TB of US Census data on NESE tape and make it available via a Globus link from a Harvard Dataverse dataset page (see the Globus sketch below). Current status:
A meeting has been scheduled for 2023/09/20 with Harvard Library, Harvard Research Computing and Data, and other stakeholders to discuss the proposed model for supporting USCB big data using NERC resources.
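For a rough idea of the Globus piece, submitting a transfer with the globus-sdk looks something like the sketch below; the token, endpoint UUIDs, and paths are hypothetical, and the actual NESE/Dataverse setup may differ:

```python
# Sketch: submit a Globus transfer from the NESE tape endpoint to a user's
# endpoint. Token, endpoint UUIDs, and paths are hypothetical placeholders.
import globus_sdk

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("TRANSFER_TOKEN")  # hypothetical
)

tdata = globus_sdk.TransferData(
    tc,
    source_endpoint="aaaaaaaa-1111-2222-3333-bbbbbbbbbbbb",       # hypothetical NESE endpoint
    destination_endpoint="cccccccc-4444-5555-6666-dddddddddddd",  # hypothetical user endpoint
    label="US Census data pull",
)
tdata.add_item("/census/2023/", "/downloads/census/2023/", recursive=True)

task = tc.submit_transfer(tdata)
print("task id:", task["task_id"])
```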
2024/01/03
Will add details here.
There's an ongoing discussion of what's involved in the dedicated Slack channel and a Google Doc.