Implement a time-based sharding approach to data collection #19

Open
SensibleWood opened this issue Mar 31, 2022 · 0 comments
Labels: data (Issue relates to the tooling data collected from data sources), enhancement (New feature or request)

User Story

As a tooling developer, I want data to be collected consistently, without failures caused by rate limits applied by any source code repository platform.

Detailed Requirement

GitHub (obviously) applies rate limits to API calls, which we rely on heavily to collect data. As we expand the number of topics we collect, we need to be cognisant of these limits and amend our approach to spread the collection period over multiple hours.
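As a rough sizing exercise, the number of hourly collection batches we need can be estimated from the repository count, the calls made per repository, and the rate limit. This is a hypothetical helper (the 5,000 requests/hour figure is GitHub's documented authenticated REST limit; the safety margin is an assumption):

```python
def batches_needed(repo_count: int, calls_per_repo: int,
                   rate_limit: int = 5000, safety_margin: float = 0.8) -> int:
    """Estimate how many hourly batches are needed to stay under the rate limit.

    A safety margin leaves headroom for other jobs sharing the same token.
    """
    budget = int(rate_limit * safety_margin)  # usable calls per hour
    total_calls = repo_count * calls_per_repo
    # Ceiling division: one batch per full hourly budget of calls
    return -(-total_calls // budget)
```

For example, 10,000 repositories at 2 calls each against a 4,000-call hourly budget would need 5 hourly batches.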

There are a few possible approaches:

  1. A simple manual slicing of the workload based on the alphabet (low sophistication, much manual tweaking).
  2. Splitting the build into multiple steps to seed files for later processing (medium sophistication, limited manual tweaking).
  3. Splitting the build as per option 2 and using a dependency mechanism to allow a build to trigger others (high sophistication, largely automated).

Option 3 seems the most feasible. The approach would be:

  • Run a "collection" mechanism to get the superset of repositories we will query for their metadata.
  • Based on the collected data shard the data set into multiple groups, each bound to a given schedule.
  • Write workflow files based on the known rate limits at a given repository platform, target data set and schedule.
  • Allow the builds to run of their own volition.
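The sharding step above could be sketched as follows. This is only a sketch under assumptions: a round-robin split into equal-sized groups, with each shard bound to a staggered hourly cron slot (the cron mapping is illustrative, not a decided schedule):

```python
def shard_repositories(repos: list, shard_count: int) -> dict:
    """Split the collected superset of repositories into round-robin shards,
    each keyed by a staggered hourly cron expression."""
    shards = [repos[i::shard_count] for i in range(shard_count)]
    # Bind shard N to minute 0 of hour N (hypothetical schedule layout)
    return {f"0 {hour} * * *": shard for hour, shard in enumerate(shards)}
```

Round-robin slicing keeps shard sizes within one repository of each other, so no single scheduled run is disproportionately close to the rate limit.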

This approach should scale as we collect more data. The main constraint to watch is the overall build time limit, although that should be "OK" as we have a fair amount of headroom for the time being.
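The "write workflow files" step could be as simple as rendering one GitHub Actions workflow per shard from a template. Everything here is hypothetical except the workflow syntax itself: the collector script name, the shard file layout, and the workflow naming are placeholders:

```python
# Hypothetical template: one scheduled workflow per shard. The `collect.js`
# script and `shards/` directory are placeholders for the real collector.
WORKFLOW_TEMPLATE = """name: collect-shard-{index}
on:
  schedule:
    - cron: "{cron}"
jobs:
  collect:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: node collect.js --shard shards/shard-{index}.json
"""

def render_workflow(index: int, cron: str) -> str:
    """Render a scheduled workflow file for one shard."""
    return WORKFLOW_TEMPLATE.format(index=index, cron=cron)
```

Generated files would be committed alongside the shard definitions, so each build runs of its own volition on its assigned schedule.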

SensibleWood added the enhancement (New feature or request) and data (Issue relates to the tooling data collected from data sources) labels on Jun 16, 2022.