Consider making downloads.v1.json available to the public #3532

Open
joelverhagen opened this issue Feb 7, 2017 · 7 comments

@joelverhagen
Member

commented Feb 7, 2017

Motivation

In the past, I've heard various requests for a programmatic way to download statistics. Today, users have two options:

  1. Scrape nuget.org's website, which is not ideal for us or the caller.
  2. Query the search service. This is pretty good for the caller but is an unnecessary use of compute.

We should add an additional option: make our downloads.v1.json file open to the public. This file has the total downloads for every package (ID + version pair, that is), over all time.

@emgarten has an internal partner that is interested in this data and I have been asked more than once about acquiring this data.

Implementation

There is actually precedent for this kind of thing in today's index.json. Currently, we have the TotalStats/3.0.0-rc resource:

{
    "@id": "https://api.nuget.org/v3/stats0/totals.json",
    "@type": "TotalStats/3.0.0-rc",
    "comment": "Endpoint to get stats totals to display in nuget.org home page"
}

We should put this file behind the CDN with a relatively long cache time -- whatever the job frequency is for generating this file.
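
As a rough sketch of tying the cache lifetime to the job frequency, the helper below turns a job interval into a `Cache-Control` value. The function name and the 24-hour interval are assumptions for illustration, not the actual CDN configuration.

```python
from datetime import timedelta


def cache_control_for(job_interval: timedelta) -> str:
    """Build a Cache-Control value whose max-age equals the job interval."""
    max_age = int(job_interval.total_seconds())
    if max_age <= 0:
        raise ValueError("job interval must be positive")
    return f"public, max-age={max_age}"


# Assumed interval: the stats job runs once a day.
print(cache_control_for(timedelta(hours=24)))  # public, max-age=86400
```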

The main point here is that the data is available already to interested parties. But users pounding the search service or gallery just for download count is wasteful. No compute is necessary for this data. CDN + blob storage is cheap! The entry in the index.json should look something like this:

{
    "@id": "https://api.nuget.org/v3/stats0/downloads.json",
    "@type": "PackageDownloadStats/4.0.0",
    "comment": "Endpoint to get download counts for all packages"
}
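
For illustration, a client could discover such an entry by scanning the `resources` array of the service index for a matching `@type`. This is a minimal sketch: the sample index is hand-written, the `PackageDownloadStats/4.0.0` resource is only the proposal above (not a published resource), and `find_resource` is a hypothetical helper name.

```python
import json
from typing import Optional


def find_resource(service_index: dict, resource_type: str) -> Optional[str]:
    """Return the @id of the first resource with the given @type, else None."""
    for resource in service_index.get("resources", []):
        if resource.get("@type") == resource_type:
            return resource.get("@id")
    return None


# Hand-written sample; a real client would fetch and parse index.json instead.
sample_index = json.loads("""
{
  "version": "3.0.0",
  "resources": [
    {"@id": "https://api.nuget.org/v3/stats0/totals.json",
     "@type": "TotalStats/3.0.0-rc"},
    {"@id": "https://api.nuget.org/v3/stats0/downloads.json",
     "@type": "PackageDownloadStats/4.0.0"}
  ]
}
""")

print(find_resource(sample_index, "PackageDownloadStats/4.0.0"))
```

Looking the URL up by `@type` rather than hard-coding it is what lets the endpoint be versioned or moved later.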

Concerns

The download file is big: about 16 megabytes of JSON right now, I think. If we gzip it, we'd bring it down to about 4 megabytes (this is an offline gzip, so online compression might be a bit worse...). Perhaps callers don't want to download something so big when the response from the search service is teensy.
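
One way to check the compression estimate offline is sketched below. The payload is synthetic and its layout (`{package_id: {version: count}}`) is an assumption, so the resulting ratio will not match the real file; the point is only the measurement method.

```python
import gzip
import json
import random

# Synthetic stand-in data with an assumed layout: {package_id: {version: count}}.
random.seed(0)
payload = {
    f"Example.Package{i}": {
        f"{major}.0.0": random.randint(0, 10**6) for major in range(1, 6)
    }
    for i in range(2000)
}

# Serialize compactly, then compress in memory as an offline gzip would.
raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes")
print(f"ratio: {len(raw) / len(compressed):.1f}x")
```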

Also, this file would be a very easy way to enumerate all package IDs and versions on NuGet.org. This is already possible with the catalog, but this would be a one-shot query to get everything. I think this is okay (maybe even good), but I figured I'd call it out.

Mentions

/cc @xavierdecoster, @emgarten

@scottbommarito

Member

commented Feb 7, 2017

I'd expect that most people would vastly prefer hitting the search service rather than having to parse an entire 16MB file. Also, I'm not sure how Azure storage tiers work but I would be slightly concerned that if too many people are downloading such a huge file so frequently (especially people who don't understand that we only update stats once a day) we may run into issues, just like with the gallery crawlers who are making very expensive OData queries.

Regarding enumerating package IDs and versions, does downloads.v1.json contain deleted packages?

I think if we want to do this properly, we should create a blob storage API like v3 that returns data for each package. For example, https://api.nuget.org/v3/stats/{Package Id}/{Version}.json would yield a JSON file with just the statistics for that package, and https://api.nuget.org/v3/stats/{Package Id}/index.json would have the statistics for all the versions.
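
A rough sketch of how a job might fan the single file out into those per-package blobs follows. The input layout (`{package_id: {version: count}}`), the document fields, and the `split_into_blobs` name are all assumptions for illustration; only the URL shapes come from the examples above.

```python
from typing import Dict

BASE_URL = "https://api.nuget.org/v3/stats"  # from the example URLs above


def split_into_blobs(downloads: Dict[str, Dict[str, int]]) -> Dict[str, dict]:
    """Map each blob URL to the JSON document it would serve."""
    blobs: Dict[str, dict] = {}
    for package_id, versions in downloads.items():
        # One index document per package, covering every version.
        blobs[f"{BASE_URL}/{package_id}/index.json"] = {
            "id": package_id,
            "totalDownloads": sum(versions.values()),
            "versions": versions,
        }
        # One small document per ID + version pair.
        for version, count in versions.items():
            blobs[f"{BASE_URL}/{package_id}/{version}.json"] = {
                "id": package_id,
                "version": version,
                "downloads": count,
            }
    return blobs


blobs = split_into_blobs({"Example.Package": {"1.0.0": 10, "2.0.0": 32}})
for url in sorted(blobs):
    print(url)
```

The trade-off is many small blobs to write on each job run in exchange for tiny responses per caller.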

On the other hand, as long as storage can handle a 16MB file being downloaded frequently by all of our users who need statistics, I don't really see any negative to making the file public, even if we introduce another API.

@joelverhagen

Member Author

commented Feb 7, 2017

@scottbommarito, great idea. I like your more structured approach as a full-fledged solution. With V3 index.json, we can naturally version these endpoints to provide different views or resolutions in the future (e.g. download counts in the last 6 weeks, rankings, etc.).

However, I acknowledge that getting the engineering time to build this full-fledged solution would be pretty tough, especially considering this is a request made infrequently by few people. Making downloads.v1.json available has its own benefits (fewer round trips, better for broad analysis of many packages) and has very low engineering cost.

In short, this solution is easy!

> Also, I'm not sure how Azure storage tiers work but I would be slightly concerned that if too many people are downloading such a huge file so frequently (especially people who don't understand that we only update stats once a day) we may run into issues, just like with the gallery crawlers who are making very expensive OData queries.

I don't think this will be a problem, as the file will be behind a CDN. But I don't have much data other than the fact we have a V3 index.json that is very hot (lots of different people asking for it) and many .nupkgs larger than 16 megabytes. This would indeed be a unique combination of those two factors (volume of requests and size).

> Regarding enumerating package IDs and versions, does downloads.v1.json contain deleted packages?

Great question. I did not know before checking just now, but yes, it contains deleted package versions. I think this is okay because the catalog has deleted versions as well.

@maartenba

Contributor

commented May 2, 2019

As mentioned in #7120, I would like to see this file made available to the public.
We currently index NuGet based off of the catalog, but still need to fetch from the V2 API to get download counts for listed and unlisted packages (which feed into our own ranking).

@joelverhagen

Member Author

commented May 2, 2019

How frequently do you scan the catalog from the beginning? In other words, how do you know to refresh a package's download count?

Also, can you get by with only listed versions' download counts? V3 search has all listed versions' downloads.

@maartenba

Contributor

commented May 2, 2019

Listed would work, as unlisted shouldn't be searchable in theory :-)

Right now we scan it every two weeks. Being able to grab this file at a more frequent interval and then run upserts in our implementation would be awesome.
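
To make the "grab the file, then upsert" idea concrete, here is a minimal sketch using SQLite. The table name, the input layout (`{package_id: {version: count}}`), and the use of `INSERT OR REPLACE` as a simple stand-in for an upsert are all assumptions, not a description of any real indexer.

```python
import sqlite3
from typing import Dict


def upsert_downloads(conn: sqlite3.Connection, downloads: Dict[str, Dict[str, int]]) -> None:
    """Insert or overwrite one row per ID + version pair."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS package_downloads ("
        " package_id TEXT NOT NULL,"
        " version TEXT NOT NULL,"
        " downloads INTEGER NOT NULL,"
        " PRIMARY KEY (package_id, version))"
    )
    rows = [
        (package_id, version, count)
        for package_id, versions in downloads.items()
        for version, count in versions.items()
    ]
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO package_downloads VALUES (?, ?, ?)", rows
        )


conn = sqlite3.connect(":memory:")
upsert_downloads(conn, {"Example.Package": {"1.0.0": 10}})
# A later snapshot overwrites the old count and adds a new version.
upsert_downloads(conn, {"Example.Package": {"1.0.0": 25, "2.0.0": 3}})
print(conn.execute("SELECT * FROM package_downloads ORDER BY version").fetchall())
# [('Example.Package', '1.0.0', 25), ('Example.Package', '2.0.0', 3)]
```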

@joelverhagen

Member Author

commented May 2, 2019

In that case, I would recommend using V3 search until we decide what to do about making the download data public. There is no timeline at this point. Thanks for mentioning your scenario.
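
A sketch of the V3 search approach, one query per package, might look like the following. The field names (`data`, `versions`, `downloads`) and the `packageid:` query syntax reflect my understanding of the search API and should be verified against its documentation; `search_base` is a placeholder, since the real URL comes from the `SearchQueryService` resource in index.json, and the sample response is hand-written.

```python
from typing import Dict
from urllib.parse import urlencode


def build_query_url(search_base: str, package_id: str) -> str:
    """Build a search URL restricted to a single package ID."""
    params = {"q": f"packageid:{package_id}", "prerelease": "true"}
    return f"{search_base}?{urlencode(params)}"


def version_downloads(search_response: dict, package_id: str) -> Dict[str, int]:
    """Extract per-version download counts for one package from a response."""
    for item in search_response.get("data", []):
        if item.get("id", "").lower() == package_id.lower():
            return {v["version"]: v["downloads"] for v in item.get("versions", [])}
    return {}


# Placeholder base URL and hand-written response, for illustration only.
print(build_query_url("https://example.test/query", "Example.Package"))
sample_response = {
    "data": [
        {
            "id": "Example.Package",
            "versions": [
                {"version": "1.0.0", "downloads": 10},
                {"version": "2.0.0", "downloads": 32},
            ],
        }
    ]
}
print(version_downloads(sample_response, "example.package"))
```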

@maartenba

Contributor

commented May 2, 2019

Thanks! Was mainly asking because it feels wasteful to do N queries to search instead of grabbing 1 centralized piece of data.

4 participants