Quick check for commits via Github API for OGM incremental harvests #130
Purpose and background context
With this PR, an incremental (aka "daily") OGM harvest makes a single Github API call per repository to list commits and check whether any fall on or after the incremental harvest "from" date.
Most of the harvest work relies on the git history of the cloned repository and would be prohibitively slow using only the Github API, which is why it is much easier to clone the repository locally and work with it. This single API call, however, avoids the expensive clone entirely when no commits exist on or after the target date.
In this way, OGM incremental harvests can easily be run daily, as 99% of the time they will reach out to the API and determine there are no new commits in the repository in question.
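As a rough illustration of the check described above, here is a minimal sketch built on the public GitHub "list commits" endpoint and its `since` query parameter. The function names (`commits_url`, `has_commits_since`, `_http_get`) and the injectable `fetch` argument are hypothetical and are not taken from the harvester code in this PR.

```python
import json
import urllib.parse
import urllib.request
from datetime import date, datetime, time, timezone


def commits_url(owner: str, repo: str, from_date: date) -> str:
    """Build a "list commits" URL limited to commits on or after from_date (UTC)."""
    since = datetime.combine(from_date, time.min, tzinfo=timezone.utc)
    query = urllib.parse.urlencode(
        # per_page=1 is enough: we only need to know whether any commit exists
        {"since": since.strftime("%Y-%m-%dT%H:%M:%SZ"), "per_page": 1}
    )
    return f"https://api.github.com/repos/{owner}/{repo}/commits?{query}"


def _http_get(url: str) -> str:
    """Unauthenticated GET; returns the response body as text."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


def has_commits_since(owner: str, repo: str, from_date: date, fetch=None) -> bool:
    """Return True if the API lists at least one commit on or after from_date.

    `fetch` (url -> response body) is a hypothetical hook so the check can be
    exercised without network access; it defaults to a real HTTP call.
    """
    fetch = _http_get if fetch is None else fetch
    commits = json.loads(fetch(commits_url(owner, repo, from_date)))
    return len(commits) > 0
```

A caller would clone the repository only when `has_commits_since(...)` returns `True`, and skip it otherwise.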
NOTE: a Github API token is not required for this new functionality, making it a low bar to include and test. The unauthenticated rate limit for this API route is 60 requests per hour, per IP, making it perfectly suitable for OGM "daily" harvests as only 20-30 requests are needed.
However, should we decide to create a Github API token, it can be set via an env var to avoid rate limiting errors.
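The optional token could be wired in roughly as follows. This is only a sketch: the env var name `GITHUB_API_TOKEN` and the warning text are taken from the output described later in this PR, while the helper name `github_headers` and its `environ` parameter are assumptions made for illustration.

```python
import logging
import os

logger = logging.getLogger(__name__)


def github_headers(environ=None) -> dict:
    """Build request headers, adding a bearer token only when one is configured.

    `environ` is a hypothetical parameter that defaults to os.environ and makes
    the helper easy to test.
    """
    environ = os.environ if environ is None else environ
    headers = {"Accept": "application/vnd.github+json"}
    token = environ.get("GITHUB_API_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    else:
        # Unauthenticated requests still work, subject to the lower rate limit.
        logger.warning("Github API token not set, may encounter rate limiting.")
    return headers
```

Without a token the request falls back to the unauthenticated limit mentioned above, so nothing else needs to change.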
How can a reviewer manually see the effects of these changes?
While there is still no CLI command for the harvester, this behavior can be seen fairly quickly via a Python shell:
Here is a snippet from the output:
Two things to note:
WARNING: ... Github API token not set, may encounter rate limiting.
* GITHUB_API_TOKEN is not set; 60 requests per hour is still supported, which should be plenty for a deployed environment that will query 20-30 repositories
INFO: ... No commits found after date '2024-01-10', skipping.
* example of a repository that will not be cloned, because it has no commits after the target date of 2024-01-10
Lastly, inspecting the variable inc_records will show only one record, even though all configured OGM repositories were checked very quickly, without the need to clone them all.
Includes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)