
How will we automate the conversion of hub data to parquet after syncing to S3? #20

Closed · bsweger opened this issue Feb 15, 2024 · 5 comments
Labels: cloud (work related to cloud-enabled hubs), enhancement (New feature or request)

bsweger (Collaborator) commented Feb 15, 2024

Per #18, we will default to parquet format for making hub data available in the cloud.

So far, we have a GitHub action that syncs model-output data to S3 exactly as submitted by teams (we sync admin/config files as well, but I'm assuming those aren't relevant to this conversation).

I'd advocate for retaining the submitted data in its original form (perhaps under a "raw data" path) and then landing the parquet version in the client/user-facing location.
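
For concreteness, a minimal sketch of that idea in Python. The `raw/` and `model-output/` prefixes, the bucket name, and the `land_submission` helper are hypothetical, and today's sync happens in a GitHub action rather than a function like this:

```python
import boto3
import pandas as pd  # reading/writing s3:// paths assumes s3fs and pyarrow are installed

s3 = boto3.client("s3")

def land_submission(bucket: str, key: str) -> None:
    # 1. Retain the file exactly as the team submitted it, under a "raw data" path.
    s3.copy_object(
        Bucket=bucket,
        Key=f"raw/{key}",
        CopySource={"Bucket": bucket, "Key": key},
    )
    # 2. Land a parquet copy at the client/user-facing location.
    df = pd.read_csv(f"s3://{bucket}/{key}")
    parquet_key = key.rsplit(".", 1)[0] + ".parquet"
    df.to_parquet(f"s3://{bucket}/model-output/{parquet_key}", index=False)
```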

nickreich commented
One relevant consideration: a hub might choose to have its files submitted as parquet (this is allowable in a hub schema). In that case, would we still need to duplicate the data in the cloud?

bsweger (Collaborator, Author) commented Feb 16, 2024

@nickreich Good note! I think there is value in making a distinction between "raw data" and "user/client-facing data" for all hubs, regardless of their submission format:

  • Using the same S3 structure for every hub simplifies our code (e.g., no need to perform different data sync operations for hubs that submit in parquet)
  • It gives us room for future data manipulations we might decide to do (re-partitioning, for example)

The data I've seen in hubs so far is very small by cloud standards, so I'm not worried about duplication. [edited to add: the "raw data" would be for our internal use--or maybe for use by teams who want access to their human-readable submission data--but we wouldn't want it accessible by clients such as hubUtils]
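
To illustrate why a uniform structure keeps the code simple, here's a hedged sketch (hypothetical prefixes and helper name) of a single conversion path that serves csv and parquet submissions alike:

```python
import pandas as pd  # assumes s3fs and pyarrow for s3:// and parquet I/O

def to_client_parquet(bucket: str, raw_key: str) -> None:
    """Write the user/client-facing parquet copy, whatever the submission format."""
    src = f"s3://{bucket}/raw/{raw_key}"
    # Hubs may accept csv or parquet submissions; either way, the
    # client-facing copy is produced by the same code path.
    df = pd.read_parquet(src) if raw_key.endswith(".parquet") else pd.read_csv(src)
    out_key = raw_key.rsplit(".", 1)[0] + ".parquet"
    df.to_parquet(f"s3://{bucket}/model-output/{out_key}", index=False)
```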

bsweger (Collaborator, Author) commented Feb 16, 2024

At a high level (without getting into implementation details), I'd like to brainstorm how we might trigger follow-up data operations after a hub receives a new model-output submission:

  1. Lean into AWS: Write a lambda function that converts data to parquet and use S3 triggers to run the function every time new data lands (see the sketch after this list).
  2. Lean into GitHub: Package a data conversion function and run it as an additional step of the "S3 sync" GitHub action. Have the GitHub action send both versions of the data (raw and parquet) to S3.
  3. Use both: Write a lambda that converts the data and use the GitHub action to trigger it after the raw data lands in S3.
  4. ?? What am I missing ??
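
As a sketch of option 1 (names and prefixes are placeholders, not a decided implementation), the lambda handler would look roughly like this:

```python
import urllib.parse
import pandas as pd  # the lambda bundle would need pandas, s3fs, and pyarrow

def handler(event, context):
    # Each S3 trigger invocation carries one or more object records.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Only convert newly landed raw files that aren't already parquet.
        if not key.startswith("raw/") or key.endswith(".parquet"):
            continue
        df = pd.read_csv(f"s3://{bucket}/{key}")
        out_key = key.removeprefix("raw/").rsplit(".", 1)[0] + ".parquet"  # Python 3.9+
        df.to_parquet(f"s3://{bucket}/model-output/{out_key}", index=False)
```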

Also noodling on some variables that would influence our choice:

  • What should happen if the data conversion step fails? Do we want our GitHub checks to fail and block the merge?
  • Size of the individual model-output submissions
  • Administrative burden: what is the Hubverse appetite for expanding our reliance on AWS infrastructure from simple cloud storage to "a place where we need to maintain and update data conversion functions while also managing the associated permissions model"? And for option 1, we'd have to manage the S3 triggers for everyone's hub bucket.
  • Cost

I don't have much experience with S3 triggers/lambdas and plan to spend some time learning how they work.
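
For anyone else learning along: attaching an S3 trigger boils down to a bucket notification configuration pointing at the lambda's ARN. A sketch with boto3, using placeholder names (the lambda also needs a resource-based policy allowing S3 to invoke it):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="example-hub-bucket",  # placeholder bucket name
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                # placeholder ARN for the conversion lambda
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:convert-to-parquet",
                "Events": ["s3:ObjectCreated:*"],
                # only fire for objects landing under the raw prefix
                "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "raw/"}]}},
            }
        ]
    },
)
```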

elray1 commented Feb 16, 2024

r.e. "What should happen if the data conversion step fails? Do we want our GitHub checks to fail and block the merge?" -- I think in general we should minimize burden on participating teams. So if data conversion fails but a team's contribution was valid, I'd like to say the submission was valid and the team is done with their work, and hub administrators have to follow up.

bsweger (Collaborator, Author) commented Mar 21, 2024

> At a high level (without getting into implementation details), I'd like to brainstorm how we might trigger follow-up data operations after a hub receives a new model-output submission:
>
>   1. Lean into AWS: Write a lambda function that converts data to parquet and use S3 triggers to run the function every time new data lands.
>   2. Lean into GitHub: Package a data conversion function and run it as an additional step of the "S3 sync" GitHub action. Have the GitHub action send both versions of the data (raw and parquet) to S3.
>   3. Use both: Write a lambda that converts the data and use the GitHub action to trigger it after the raw data lands in S3.
>   4. ?? What am I missing ??

Revisiting the above options now that we've had some additional conversations and learnings from experiments (e.g., provisioning AWS resources via code).

I believe we should use a cloud-based trigger to initiate conversions/transformations on model-output files submitted to a hub.

  • If our eventual goal is to allow submissions directly to cloud storage (i.e., without submitting a PR to the hub's repo), we don't want to rely on a GitHub action to initiate a post-submission data conversion process
  • Now that we have additional clarity for using infrastructure-as-code to provision hub resources, I'm less concerned about administrative burden for a cloud-based trigger

Because we're already using AWS, I propose exploring the use of S3 triggers, which can invoke various actions when data is written to or removed from an S3 bucket.
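
If we go the infrastructure-as-code route mentioned above, the trigger wiring could be expressed declaratively. A sketch (not a commitment) using AWS CDK v2 in Python, with placeholder construct names:

```python
from aws_cdk import Stack, aws_s3 as s3, aws_lambda as lambda_, aws_s3_notifications as s3n
from constructs import Construct

class HubStack(Stack):
    def __init__(self, scope: Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)
        bucket = s3.Bucket(self, "HubBucket")  # placeholder hub bucket
        fn = lambda_.Function(
            self, "ConvertToParquet",  # placeholder conversion lambda
            runtime=lambda_.Runtime.PYTHON_3_11,
            handler="app.handler",
            code=lambda_.Code.from_asset("lambda"),
        )
        # Invoke the conversion lambda whenever a raw submission lands;
        # CDK also grants S3 permission to invoke the function.
        bucket.add_event_notification(
            s3.EventType.OBJECT_CREATED,
            s3n.LambdaDestination(fn),
            s3.NotificationKeyFilter(prefix="raw/"),
        )
```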

Will close this and follow up with more specific issues re: experimenting with S3 triggers.
