Search rankings file is calculated by partial information #3778

Closed
skofman1 opened this issue Apr 12, 2017 · 7 comments

@skofman1
Contributor

The rankings file calculation takes into account only downloads with "update" or "install" operations. However, newer clients do not provide this information; they don't send it as part of the header:
https://github.com/NuGet/NuGet.Jobs/blob/master/src/Search.GenerateAuxiliaryData/SqlScripts/Rankings.sql#L25

https://www.nuget.org/stats/packages/newtonsoft.json?groupby=Version&groupby=Operation

@ryuyu
Contributor

ryuyu commented Apr 20, 2017

After investigating this a bit, it seems that we don't have the information on newer CDN log entries to filter this properly any more, and ignoring the filter results in some strange results.

We have 2 options here:

  1. Redo how we compute search relevance given the data we have.
  2. Ask the client team for changes so that clients send the data we used to filter on again.

@skofman1
Contributor Author

skofman1 commented May 3, 2017

Let's try another approach: stop using the rankings file when calculating search results (but not the default list).

@skofman1
Contributor Author

@ryuyu wrote:
Background
Our search result ordering depends on a few different things, including index match score, download count total, and rankings.
Rankings is an ordering based on the last 6 weeks of download data, weighting for installs over updates.
Today, rankings is based on broken data. The job that creates rankings filters download records from the database by "operation", but this is only reported by NuGet 2.x clients, which misses a LOT of download/installs.
This investigation was to see what happens to our search results when we stop using this broken data to order results.

Methodology
To test, I used a snapshot of the production lucene index and auxiliary search data copied locally.
I also generated a new Rankings file, created without filtering on operation name (i.e. including all NuGet 3.x/4.x downloads).

Query List Used:
Empty Query (this is used to generate front page view)
ASPNET
ASP.Net MVC
ASPNETCORE
ASPNET CORE
MVC
JQuery
JSON
Microsoft
NUnit
BootStrap
Selenium

(This list was chosen arbitrarily, based on packages on the front page and an ASP.NET request.)

Test Matrix:
Using Rankings - This is what search would return today, using the "broken" rankings file
Using No Filtered Data Rankings - Search if we used unfiltered data to compute the rankings file
Ignore Rankings (Thresh=1200 Factor=20)
Ignore Rankings (Thresh=1200 Factor=10)
Ignore Rankings (Thresh=0 Factor=5) - Rankings file was completely ignored for result ordering and instead depended completely on download count. Note that Factor is a divisor and thus lower factor means a larger boost from download count.

Rankings_Investigation.xlsx

Analysis
Using unfiltered data to create rankings generally caused a large number of dependency packages to rise to the top.
This was expected, because downloads from restore were previously discounted. Note the appearance of Microsoft.aspnetcore.mvc.core when using the unfiltered rankings file, and its absence from the other lists. Also note the appearance of many system.* packages in the empty query.
This also appears to push many third-party packages off the list when searching for terms that fall in a Microsoft namespace, e.g. aspnet.
Using only download count was marginally more promising. However, certain packages that should be higher up aren't boosted enough, even if we raise the download boost factor by 4x (newtonsoft.json is not even the second result). (Note: we have a huge boost for exact ID matches, which is why the json package appears first in many of the results.)
Another problem with this approach is that newer packages would take longer to move up the list, as they would effectively have to "catch up" to the download counts of older packages.
Relatively specific queries, e.g. Selenium, were largely unaffected by any of these changes. This stands to reason, as the scope for specific terms is limited.

Conclusion
We have a few choices:

  1. Use unfiltered rankings data
    This would likely result in some rather dramatic changes to the results of some searches, especially for more common terms.
    On the other hand, it also doesn't ignore data from newer NuGet Clients.
  2. Drop rankings altogether and use only download data.
    Would also result in some rather dramatic result changes. Also has the downside of being less friendly to new packages than a 6 week approach.
    Threshold and factor tweaking would also take some time.
  3. Drop rankings and use a rolling 6 week download count.
    I wasn't able to test this one, but it is the halfway compromise between using download counts and using rankings.
    With this, we could theoretically offer an unchanged frontpage by using the old rankings file for empty queries, and using a rolling 6 week download boost for other queries.
  4. Leave things as they are right now and work with client team to get client to report operation again
    This is being investigated, but would seem to be non-trivial effort on the client team's part.
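
To make option 3 more concrete, here is a minimal Python sketch of a rolling 6-week download count. It is only an illustration: the record shape `(package_id, day, downloads)`, the function name `rolling_download_counts`, and the sample package names are assumptions made for this example, not the schema or code used by the actual jobs. Note that it does not look at the operation at all, so downloads from any client version would count.

```python
from collections import defaultdict
from datetime import date, timedelta

# Window length taken from the "6 weeks" mentioned in the thread.
WINDOW = timedelta(weeks=6)


def rolling_download_counts(records, today):
    """Sum downloads per package over the 6 weeks ending at `today`.

    `records` is an iterable of (package_id, day, downloads) tuples.
    This shape is hypothetical and chosen only for the sketch.
    """
    cutoff = today - WINDOW
    totals = defaultdict(int)
    for package_id, day, downloads in records:
        # Keep only downloads inside the window (cutoff, today].
        if cutoff < day <= today:
            totals[package_id] += downloads
    return dict(totals)


# Made-up sample data: an old package with a large historical total,
# and a new package with fewer total downloads but more recent activity.
records = [
    ("old.popular", date(2017, 1, 1), 1_000_000),
    ("old.popular", date(2017, 5, 20), 500),
    ("new.package", date(2017, 5, 25), 2_000),
]

counts = rolling_download_counts(records, today=date(2017, 6, 5))
print(counts)
```

With a window like this, the old package's historical total falls outside the 6 weeks, so a newer package does not have to "catch up" to it.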

@skofman1
Contributor Author

skofman1 commented Jun 5, 2017

We got an additional customer complaint regarding this: When you search for "protobuf", Google.Protobuf is nearly at the bottom with 500k+ downloads. Many many packages with less than 100 downloads are ranked higher. P.S. I was surprised there are no controls to reorder the search results (Most downloads, Last updated, etc.)

@joelverhagen
Member

NUnit is now ranked higher than Newtonsoft.Json, despite having far fewer downloads. This is because the 2.x usage of NUnit has surpassed that of Newtonsoft.Json. The current download and weight values for the top 10 packages in the rankings file are:

| PackageId | Operation | Downloads | Weight |
|---|---|---|---|
| nunit | Install | 19615 | 19615 |
| newtonsoft.json | Install | 17411 | 17411 |
| entityframework | Install | 12920 | 12920 |
| mysql.data | Install | 6279 | 6279 |
| bootstrap | Install | 5824 | 5824 |
| nuget.core | Install | 4523 | 4523 |
| newtonsoft.json | Update | 4448 | 2224 |
| jquery | Install | 4344 | 4344 |
| htmlagilitypack | Install | 4079 | 4079 |
| microsoft.aspnet.mvc | Install | 3356 | 3356 |
| newtonsoft.json.net20.dll | Install | 3289 | 3289 |
| jquery | Update | 2597 | 1298.5 |
| entityframework | Update | 2242 | 1121 |
| microsoft.aspnet.mvc | Update | 2052 | 1026 |
| nunit | Update | 1719 | 859.5 |
| bootstrap | Update | 1405 | 702.5 |
| htmlagilitypack | Update | 231 | 115.5 |
| mysql.data | Update | 167 | 83.5 |
| nuget.core | Update | 19 | 9.5 |

You'll see that updates are worth half the weight of installs.

The weight of NUnit is 19615 + 859.5 = 20474.5.
The weight of Newtonsoft.Json is 17411 + 2224 = 19635.

This is why NUnit is higher.
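
The arithmetic above can be reproduced with a short Python sketch. The rows are copied from the table in this comment, and the 1.0 / 0.5 weights follow the "updates are worth half" observation; the names `OPERATION_WEIGHTS`, `package_weights`, and `rank` are made up for this illustration and are not taken from the Rankings.sql job.

```python
from collections import defaultdict

# Per-operation weights as described in this comment:
# an update is worth half an install.
OPERATION_WEIGHTS = {"Install": 1.0, "Update": 0.5}

# (package id, operation, downloads) rows copied from the table above.
ROWS = [
    ("nunit", "Install", 19615),
    ("newtonsoft.json", "Install", 17411),
    ("newtonsoft.json", "Update", 4448),
    ("nunit", "Update", 1719),
]


def package_weights(rows):
    """Sum the weighted downloads for each package id."""
    totals = defaultdict(float)
    for package_id, operation, downloads in rows:
        totals[package_id] += downloads * OPERATION_WEIGHTS[operation]
    return dict(totals)


def rank(rows):
    """Return package ids ordered by total weight, highest first."""
    weights = package_weights(rows)
    return sorted(weights, key=lambda package_id: weights[package_id], reverse=True)


weights = package_weights(ROWS)
print(weights["nunit"])            # 20474.5
print(weights["newtonsoft.json"])  # 19635.0
print(rank(ROWS))                  # ['nunit', 'newtonsoft.json']
```

Because only rows that carry an operation contribute, downloads from clients that do not report one never enter the sum, which is the root of the skew described in this issue.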

@natebunton

Finding packages in the Visual Studio package manager is difficult without proper relevance sorting. Developers may accidentally use a similar (and potentially malicious) package if they are not very careful.

@joelverhagen
Member

The rankings file is no longer used by the primary search experience. Azure Search uses the total download count primarily and does not consider this install/update metric sent from V2 clients.

Issue #7186 tracks the fundamental problem of download count w.r.t. direct vs. transitive dependencies.

Issue https://github.com/nuget/engineering/issues/1321 tracks the clean-up of these old reports.
