Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dependents field on repositories stream #34

Closed
laurentS opened this issue Oct 25, 2021 · 7 comments · Fixed by #126 or #127
Closed

Add dependents field on repositories stream #34

laurentS opened this issue Oct 25, 2021 · 7 comments · Fixed by #126 or #127

Comments

@laurentS
Copy link
Contributor

We use the dependents count for a repository, which is currently fetched by grabbing the html page for the project (eg. https://github.com/facebook/react/network/dependents) and parsing the HTML. As I write this ticket, the link above returns 7,878,702 Repositories (and likewise for packages) and we grab these numbers.
Unfortunately, this info does not seem to exist anywhere in either the REST or graphQL APIs.

@aaronsteers Would you have any objection to me adding a request for that page to the repositories stream resulting in an extra field? Possibly behind some config option as it is fairly download heavy (the page above weighs 187kB).
Maybe in the post_process method? Ideally, the data will eventually be available in one of the APIs, and this can then be dropped.

@aaronsteers
Copy link
Contributor

@laurentS - I have no objection to adding the dependencies but what do you think about creating as a dedicated stream, as a child stream of repository?

@laurentS
Copy link
Contributor Author

Just to clarify, I'm talking about dependents, ie: the packages/repos that depend on the currently fetched one. So going "up" the dependency tree, as opposed to "down" with dependencies (for which there seems to be API endpoints, at least in graphQL).

Happy to do this as a child stream if it makes more sense. As far as I can see, it would be a single request/record per repo, with 2 data fields to start with (but potentially more in the future).

@aaronsteers
Copy link
Contributor

aaronsteers commented Oct 25, 2021

@laurentS - Thanks for clarifying the dependents vs dependencies. I think the bigger clarification though is whether you want just the count of dependents or if you'll also (now or in the future) want the listing. I first thought we wanted the list of them, which is why I suggested the child stream. If you do think you'll want the list of repos that depend on the active one, then I think this would be correctly modeled as a child stream of repository since it neatly generates a one-to-many mapping of child records (even though you are correct to say they are technically 'upstream').

If you only want the count of dependents, I could see this being a property of repositories as you suggest. Two considerations come to mind if adding as a property:

  1. For stability and performance, you probably would want to check that the field is selected before making the extra request. (I don't know if we have a pattern for this but it should be feasible and I see it being common enough that we'd want to have a pattern available.)
  2. I don't think the addition/subtraction of dependents will bump the incremental key for repositories. I don't know how important this is but want to call it out as something to consider.

@laurentS
Copy link
Contributor Author

There are all great points!

  • for now, we only need the count of dependents, so I think I'll go with making it a property of repositories. I've not considered getting the full list of them, as it would starting becoming fairly painful (and looking more like web scraping than API consumption). Also, we don't really use this info at the moment ;)
  • good point about checking if the field is selected. I had not really considered this, but it seems important.
  • again, I hadn't thought of this, thanks for calling it out. I'd agree with you, and leave the incremental key untouched. If nothing else, just because the values on github seem to be cached, and therefore probably not 100% up to date.

@ericboucher
Copy link
Contributor

@laurentS I'd love to revive this now that we have the GraphQl endpoints. I think we should aim to grab both dependents and dependencies

@ericboucher
Copy link
Contributor

ericboucher commented Mar 25, 2022

@ericboucher
Copy link
Contributor

Addressed in #126 and #127

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants