Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PR Files stream #45

Closed
wants to merge 7 commits into from
Closed

Add PR Files stream #45

wants to merge 7 commits into from

Conversation

Ry-DS
Copy link
Contributor

@Ry-DS Ry-DS commented Nov 10, 2021

Notes:

  • Confirmed that on push, the PR parent's updated_at changes, which means that it would make a good replication key for PR files.

Points of discussion:

  • We could reduce the amount of API calls made if there exists a route that gets all changed files for all PRs in a repo (right now, we are wasting api requests on PRs with only 1-2 changed files). Any idea if one exists? Didn't find one in my own research. It would be something like /org/repo/pr_files (courtesy of @laurentS)
  • Is it needed to add the parent key pull_number with post_processing? I would feel that the target would create the neccessary foreign key relationships making this redundant, but am not too sure.
  • There are also several comments in Add PR Files stream oviohub/tap-github-meltano#6

@edgarrmondragon
Copy link
Member

Hi @Ry-DS!

We could reduce the amount of API calls made if there exists a route that gets all changed files for all PRs in a repo (right now, we are wasting api requests on PRs with only 1-2 changed files). Any idea if one exists? Didn't find one in my own research. It would be something like /org/repo/pr_files

Unfortunately, I don't think it makes sense for there to be an endpoint with all PR files for a repo, and the hierarchy repo > pr > files is perfectly fine. Upstream in the SDK, rate limiting will be addressed by https://gitlab.com/meltano/sdk/-/issues/140, maybe with sleeps every time the limit (the decision up to implementation) is reached.

Another issue with too many children records and switching back and forth between parent and child syncing would be triggering too many small batches in the target (see #31 (comment) for a similar case and how it's handled).

Is it needed to add the parent key pull_number with post_processing? I would feel that the target would create the necessary foreign key relationships making this redundant, but am not too sure.

AFAIK you don't need to add pull_number with post_processing, as the SDK will add any context keys auto-magically to the record for you. You'd just need to declare it in the schema.

On the other hand, filename is definitely not enough as the primary key (PK) for this stream. PKs are normally used by the target to upsert rows, so if two records (think coming from two different PRs that touched the same file) have the same filename, one of them would be lost. So (repo_id, pull_number, filename) is a unique identifier and a better PK.

cc @aaronsteers @laurentS

@ericboucher
Copy link
Contributor

There is no route that does that, but we could envision using a GraphQLStream for this in the future.
@aaronsteers @edgarrmondragon would that be reasonable?

@aaronsteers
Copy link
Contributor

For my part, I'm open to using GraphQL for this use case if it doing so can reduce pressure on rate limits. Architecturally, I can't see any technical blockers or reasons not to explore that option.

@Ry-DS
Copy link
Contributor Author

Ry-DS commented Nov 14, 2021

After chatting with Eric, we decided to keep this tap on REST for simplicity's sake as we don't have a graphql base stream implemented yet for github. We definitely think however this would be a worthwhile investment in the future. I opened #48 to track this, once a graphql base is made, we could consider migrating this stream over to graphql as a start.

@ericboucher
Copy link
Contributor

We know have a Grapqhl endpoint, so it's probably worth revisiting with that in mind. Closing for now.

@ericboucher ericboucher closed this Feb 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants