Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quick check for commits via Github API for OGM incremental harvests #130

Merged
merged 4 commits into from
Feb 5, 2024

Conversation

ghukill
Copy link
Collaborator

@ghukill ghukill commented Feb 5, 2024

Purpose and background context

This PR has an incremental (aka "daily") OGM harvest perform a single Github API call for each repository to list commits and see if any are on or after the incremental harvest "from" date.

While most of the harvest work performed on the git history of the cloned repository would be prohibtively slow using only the Github API, hence why it's much easier to clone the repository locally and work with it, this single API call can avoid the expensive operation of cloning repositories if no commits at all exist on or after the target date.

In this way, OGM incremental harvests can easily be run daily, as 99% of the time they will reach out to the api and determine there are no commits in the repository in question.

NOTE: a Github API token is not required for this new functionality, making it a low bar to include and test. The unauthenticated rate limit for this API route is 60 requests per hour, per IP, making it perfectly suitable for OGM "daily" harvests as only 20-30 requests are needed.

However, should we decide to create a Github API token, it can be set with an env var and avoid any rate limiting errors.

How can a reviewer manually see the effects of these changes?

While still no CLI command for the harvester, this behavior can be seen pretty quickly via a python shell:

import logging

from harvester.config import Config, configure_logger
from harvester.harvest.ogm import OGMHarvester

logger = logging.getLogger("harvester")
configure_logger(logger, True)

FROM_DATE = "2024-01-10"  # this date is relatively recent, but confirmed at least one repo has commits after this date

harvester = OGMHarvester(
    harvest_type="incremental",
    from_date=FROM_DATE,
    remove_local_repos=False
)
inc_records = list(harvester.get_source_records())

Here is a snippet from the output:

2024-02-05 09:57:49,395 DEBUG harvester.harvest.ogm._get_source_records() line 103: Working on repository: edu.umd
2024-02-05 09:57:49,395 WARNING harvester.harvest.ogm._remote_repository_has_new_commits() line 294: Github API token not set, may encounter rate limiting.
2024-02-05 09:57:49,655 INFO harvester.harvest.ogm.get_modified_records() line 252: No commits found after date '2024-01-10', skipping.

Two things to note:

  • WARNING: ... Github API token not set, may encounter rate limiting.
    • if the env var GITHUB_API_TOKEN is not set, 60 requests per hour is still supported, which should be plenty of requests for a deployed environment that will query 20-30 repositories
    • however, if we determine we'd like to create a github token for this project to use, it will get picked up and utilized, and rate limits should never be encountered (I think 5k requests per hour)
  • INFO: ... No commits found after date '2024-01-10', skipping.
    * example of a repository that will not be cloned, because no commits after the target date of 2024-01-10

Lastly, inspecting the variable inc_records will show only one record, despite very quickly checking all OGM repositories configured without the need to clone them all.

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
When performing an OGM incremental harvest, it is not ideal to clone an entire repository, only to
interrogate the git history locally and determine that no commits were made on or after the target date.

Though use of the Github API would be prohibitvely slow for the more complex parts of an incremental harvests,
a quick call to look for ANY commits on or after the target date is a great way to avoid cloning repositories
when it can be known in advance they will not produce any records for the harvest.

This change would allow true daily OGM harvests, as most harvests will quickly determine there were no
commits and therefore not need to clone the repository.

How this addresses that need:
* Use Github API to check for commits on or after an incremental harvest target date

Side effects of this change:
* Dramatic reduction in cloned repositories for incremental harvests.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-85
Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, well-documented, and works as expected, might want to revisit GITHUB_API_TOKEN's optional status if we do start hitting those rate limit errors but based on what you've described, I think it's fine for now

harvester/harvest/ogm.py Show resolved Hide resolved
Copy link
Contributor

@jonavellecuerdo jonavellecuerdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good updates here! ✨ Just one non-blocking comment from me (that's really growing on me, and I'm sure future Jonavelle would appreciate 😄 )

harvester/harvest/ogm.py Outdated Show resolved Hide resolved
@ghukill ghukill merged commit 1d6dc88 into main Feb 5, 2024
5 checks passed
@ghukill ghukill deleted the GDT-85-ogm-github-api branch February 7, 2024 14:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants