
How will we automate the conversion of hub data to parquet after syncing to S3? #20

Closed · bsweger opened this issue Feb 15, 2024 · 5 comments
Labels: cloud (work related to cloud-enabled hubs), enhancement (New feature or request)

bsweger (Collaborator) commented Feb 15, 2024

Per #18, we will default to parquet format for making hub data available in the cloud.

So far, we have a GitHub action that syncs model-output data to S3 exactly as submitted by teams (we sync admin/config files as well, but I'm assuming those aren't relevant to this conversation).

I'd advocate for retaining the submitted data in its original form (perhaps under a "raw data" path) and then landing the parquet version in the client/user-facing location.
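
For concreteness, a minimal sketch of that idea in Python. The `raw/` and `model-output/` prefixes, the bucket name, and the `land_submission` helper are hypothetical, and today's sync happens in a GitHub action rather than a function like this:

```python
import boto3
import pandas as pd  # reading/writing s3:// paths assumes s3fs and pyarrow are installed

s3 = boto3.client("s3")

def land_submission(bucket: str, key: str) -> None:
    # 1. Retain the file exactly as the team submitted it, under a "raw data" path.
    s3.copy_object(
        Bucket=bucket,
        Key=f"raw/{key}",
        CopySource={"Bucket": bucket, "Key": key},
    )
    # 2. Land a parquet copy at the client/user-facing location.
    df = pd.read_csv(f"s3://{bucket}/{key}")
    parquet_key = key.rsplit(".", 1)[0] + ".parquet"
    df.to_parquet(f"s3://{bucket}/model-output/{parquet_key}", index=False)
```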

nickreich commented
One relevant consideration: a hub might choose to have its files submitted as parquet (this is allowable in a hub schema). In that case, would we still need to duplicate the data in the cloud?

bsweger (Collaborator, Author) commented Feb 16, 2024

@nickreich Good note! I think there is value in making a distinction between "raw data" and "user/client-facing data" for all hubs, regardless of their submission format:

  • Using the same S3 structure for every hub simplifies our code (e.g., no need to perform different data sync operations for hubs that submit in parquet)
  • It gives us room for future data manipulations we might decide to do (re-partitioning, for example)

The data I've seen in hubs so far is very small by cloud standards, so I'm not worried about duplication. [edited to add: the "raw data" would be for our internal use--or maybe for use by teams who want access to their human-readable submission data--but we wouldn't want it accessible by clients such as hubUtils]
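
To illustrate why a uniform structure keeps the code simple, here's a hedged sketch (hypothetical prefixes and helper name) of a single conversion path that serves csv and parquet submissions alike:

```python
import pandas as pd  # assumes s3fs and pyarrow for s3:// and parquet I/O

def to_client_parquet(bucket: str, raw_key: str) -> None:
    """Write the user/client-facing parquet copy, whatever the submission format."""
    src = f"s3://{bucket}/raw/{raw_key}"
    # Hubs may accept csv or parquet submissions; either way, the
    # client-facing copy is produced by the same code path.
    df = pd.read_parquet(src) if raw_key.endswith(".parquet") else pd.read_csv(src)
    out_key = raw_key.rsplit(".", 1)[0] + ".parquet"
    df.to_parquet(f"s3://{bucket}/model-output/{out_key}", index=False)
```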

bsweger (Collaborator, Author) commented Feb 16, 2024

At a high level (without getting into implementation details), I'd like to brainstorm how we might trigger follow-up data operations after a hub receives a new model-output submission:

  1. Lean into AWS: Write a lambda function that converts data to parquet and use S3 triggers to run the function every time new data lands (see the sketch after this list).
  2. Lean into GitHub: Package a data conversion function and run it as an additional step of the "S3 sync" GitHub action. Have the GitHub action send both versions of the data (raw and parquet) to S3.
  3. Use both: Write a lambda that converts the data and use the GitHub action to trigger it after the raw data lands in S3.
  4. ?? What am I missing ??
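
As a sketch of option 1 (names and prefixes are placeholders, not a decided implementation), the lambda handler would look roughly like this:

```python
import urllib.parse
import pandas as pd  # the lambda bundle would need pandas, s3fs, and pyarrow

def handler(event, context):
    # Each S3 trigger invocation carries one or more object records.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Only convert newly landed raw files that aren't already parquet.
        if not key.startswith("raw/") or key.endswith(".parquet"):
            continue
        df = pd.read_csv(f"s3://{bucket}/{key}")
        out_key = key.removeprefix("raw/").rsplit(".", 1)[0] + ".parquet"  # Python 3.9+
        df.to_parquet(f"s3://{bucket}/model-output/{out_key}", index=False)
```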

Also noodling on some variables that would influence our choice:

  • What should happen if the data conversion step fails? Do we want our GitHub checks to fail and block the merge?
  • Size of the individual model-output submissions
  • Administrative burden: what is the Hubverse appetite for expanding our reliance on AWS infrastructure from simple cloud storage to "a place where we need to maintain and update data conversion functions while also managing the associated permissions model"? And for option 1, we'd have to manage the S3 triggers for everyone's hub bucket.
  • Cost

I don't have much experience with S3 triggers/lambdas and plan to spend some time learning how they work.
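
For anyone else learning along: attaching an S3 trigger boils down to a bucket notification configuration pointing at the lambda's ARN. A sketch with boto3, using placeholder names (the lambda also needs a resource-based policy allowing S3 to invoke it):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="example-hub-bucket",  # placeholder bucket name
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                # placeholder ARN for the conversion lambda
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:convert-to-parquet",
                "Events": ["s3:ObjectCreated:*"],
                # only fire for objects landing under the raw prefix
                "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "raw/"}]}},
            }
        ]
    },
)
```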

elray1 commented Feb 16, 2024

r.e. "What should happen if the data conversion step fails? Do we want our GitHub checks to fail and block the merge?" -- I think in general we should minimize burden on participating teams. So if data conversion fails but a team's contribution was valid, I'd like to say the submission was valid and the team is done with their work, and hub administrators have to follow up.

bsweger (Collaborator, Author) commented Mar 21, 2024

> At a high level (without getting into implementation details), I'd like to brainstorm how we might trigger follow-up data operations after a hub receives a new model-output submission:
>
>   1. Lean into AWS: Write a lambda function that converts data to parquet and use S3 triggers to run the function every time new data lands.
>   2. Lean into GitHub: Package a data conversion function and run it as an additional step of the "S3 sync" GitHub action. Have the GitHub action send both versions of the data (raw and parquet) to S3.
>   3. Use both: Write a lambda that converts the data and use the GitHub action to trigger it after the raw data lands in S3.
>   4. ?? What am I missing ??

Revisiting the above options now that we've had some additional conversations and learnings from experiments (e.g., provisioning AWS resources via code).

I believe we should use a cloud-based trigger to initiate conversions/transformations on model-output files submitted to a hub.

  • If our eventual goal is to allow submissions directly to cloud storage (i.e., without submitting a PR to the hub's repo), we don't want to rely on a GitHub action to initiate a post-submission data conversion process
  • Now that we have additional clarity for using infrastructure-as-code to provision hub resources, I'm less concerned about administrative burden for a cloud-based trigger

Because we're already using AWS, I propose exploring the use of S3 triggers, which can invoke various actions when data is written to or removed from an S3 bucket.
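
If we go the infrastructure-as-code route mentioned above, the trigger wiring could be expressed declaratively. A sketch (not a commitment) using AWS CDK v2 in Python, with placeholder construct names:

```python
from aws_cdk import Stack, aws_s3 as s3, aws_lambda as lambda_, aws_s3_notifications as s3n
from constructs import Construct

class HubStack(Stack):
    def __init__(self, scope: Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)
        bucket = s3.Bucket(self, "HubBucket")  # placeholder hub bucket
        fn = lambda_.Function(
            self, "ConvertToParquet",  # placeholder conversion lambda
            runtime=lambda_.Runtime.PYTHON_3_11,
            handler="app.handler",
            code=lambda_.Code.from_asset("lambda"),
        )
        # Invoke the conversion lambda whenever a raw submission lands;
        # CDK also grants S3 permission to invoke the function.
        bucket.add_event_notification(
            s3.EventType.OBJECT_CREATED,
            s3n.LambdaDestination(fn),
            s3.NotificationKeyFilter(prefix="raw/"),
        )
```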

Will close this and follow up with more specific issues re: experimenting with S3 triggers.
