Search rankings file is calculated by partial information #3778

skofman1 · 2017-04-12T16:27:54Z

Rankings file calculation takes into account only downloads with "update" or "install" operations. However, newer clients do not provide this information; they don't send it as part of the header:
https://github.com/NuGet/NuGet.Jobs/blob/master/src/Search.GenerateAuxiliaryData/SqlScripts/Rankings.sql#L25

https://www.nuget.org/stats/packages/newtonsoft.json?groupby=Version&groupby=Operation

ryuyu · 2017-04-20T23:39:52Z

After investigating this a bit, it seems that newer CDN log entries no longer carry the information we need to filter this properly, and dropping the filter produces some strange results.

We have 2 options here:

1. Redo how we compute search relevance given the data we have.
2. Ask the client team for changes so that clients send the data we used to look for again.

skofman1 · 2017-05-03T22:31:58Z

Let's try another approach: remove usage of the rankings file when calculating search results (not the default list).

skofman1 · 2017-05-30T19:31:08Z

@ryuyu wrote:
Background
Our search result ordering depends on a few different things, including index match score, download count total, and rankings.
Rankings is an ordering based on the last 6 weeks of download data, weighting for installs over updates.
Today, rankings is based on broken data. The job that creates rankings filters download records from the database by "operation", but this is only reported by NuGet 2.x clients, so it misses a LOT of downloads/installs.
This investigation was to see what happens to our search results when we stop using this broken data to order results.

Methodology
To test, I used a snapshot of the production lucene index and auxiliary search data copied locally.
I also generated a new Rankings file, created without filtering on operation name (i.e. including all NuGet 3.x/4.x downloads).

Query List Used:
Empty Query (this is used to generate front page view)
ASPNET
ASP.Net MVC
ASPNETCORE
ASPNET CORE
MVC
JQuery
JSON
Microsoft
NUnit
BootStrap
Selenium

(This list was chosen arbitrarily, based on packages on the front page and the ASP NET request.)

Test Matrix:
Using Rankings - This is what search would return today, using the "broken" rankings file
Using No Filtered Data Rankings - Search if we used unfiltered data to compute the rankings file
Ignore Rankings (Thresh=1200 Factor=20)
Ignore Rankings (Thresh=1200 Factor=10)
Ignore Rankings (Thresh=0 Factor=5) - Rankings file was completely ignored for result ordering and instead depended completely on download count. Note that Factor is a divisor and thus lower factor means a larger boost from download count.
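
To make the Thresh/Factor notation above easier to read, here is a minimal sketch of how a threshold and a divisor could interact. The actual boost formula is not given in this thread; the logarithmic shape and the function name `download_boost` are assumptions made purely for illustration. The only property taken from the text is that Factor is a divisor, so a lower factor gives a larger boost.

```python
import math

def download_boost(total_downloads, threshold, factor):
    """Hypothetical download boost (NOT the real NuGet search formula).

    Packages at or below `threshold` downloads get no boost (1.0).
    Above it, the boost grows with log10 of the download count,
    divided by `factor`, so a smaller factor means a larger boost.
    """
    if total_downloads <= threshold or total_downloads < 1:
        return 1.0
    return 1.0 + math.log10(total_downloads) / factor

# The three (threshold, factor) pairs from the test matrix above.
settings = [(1200, 20), (1200, 10), (0, 5)]
boosts = [download_boost(1_000_000, t, f) for t, f in settings]
for (t, f), b in zip(settings, boosts):
    print(f"Thresh={t} Factor={f} -> boost {b:.2f}")
```

Under this assumed formula, a package with one million downloads gets a progressively larger boost as the factor drops from 20 to 10 to 5, while a package below the threshold is left unboosted.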

Rankings_Investigation.xlsx

Analysis
Using unfiltered data to create rankings generally led to a large number of dependency packages rising to the top.
This was expected, because downloads from restore were discounted before. Note the appearance of Microsoft.aspnetcore.mvc.core when using the unfiltered rankings file, and the distinct lack of that package from the other lists. Also note the appearance of many system.* packages in the empty query.
This also appears to push many third party packages off of the list when searching for things that are created in a microsoft namespace, e.g. aspnet.
Using only download count was marginally more promising. However, as we can see, certain packages that should be higher up aren't boosted enough, even if we increase the download boost by 4x (newtonsoft.json is not even the second result). (Note: we have a huge boost for exact ID matches, which is why the json package appears first in many of the results.)
Another problem with this approach is that newer packages would take longer to move up the list, as they would effectively have to "catch up" to the download counts of older packages.
Relatively specific queries, e.g. Selenium, were fairly untouched by any of these changes. This stands to reason, as the scope for specific terms is limited.

Conclusion
We have a few choices:

1. Use unfiltered rankings data.
   - This would likely result in some rather dramatic changes to the results of some searches, especially for more common terms.
   - On the other hand, it doesn't ignore data from newer NuGet clients.
2. Drop rankings altogether and use only download data.
   - Would also result in some rather dramatic result changes. Also has the downside of being less friendly to new packages than a 6 week approach.
   - Threshold and factor tweaking would also take some time.
3. Drop rankings and use a rolling 6 week download count.
   - I wasn't able to test this one, but it is the halfway compromise between using download counts and using rankings.
   - With this, we could theoretically offer an unchanged front page by using the old rankings file for empty queries, and using a rolling 6 week download boost for other queries.
4. Leave things as they are right now and work with the client team to get clients to report the operation again.
   - This is being investigated, but it appears to be a non-trivial effort on the client team's part.
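
The rolling 6 week download count option was not tested in this investigation, so the following is only a sketch of the idea. The record shape `(package_id, download_date)`, the function name, and the sample data are all assumptions for illustration; the real statistics schema is not described here.

```python
from datetime import date, timedelta

WINDOW = timedelta(weeks=6)

def rolling_download_counts(records, today):
    """Count downloads per package id within the last six weeks.

    `records` is an iterable of (package_id, download_date) pairs.
    This shape is hypothetical, not the real statistics schema.
    """
    cutoff = today - WINDOW
    counts = {}
    for package_id, downloaded_on in records:
        if cutoff < downloaded_on <= today:
            counts[package_id] = counts.get(package_id, 0) + 1
    return counts

# Made-up sample data: an older package with more lifetime downloads
# and a newer package with more recent downloads.
today = date(2017, 5, 30)
records = [
    ("old.package", date(2016, 1, 1)),
    ("old.package", date(2016, 6, 1)),
    ("old.package", date(2017, 5, 1)),
    ("new.package", date(2017, 5, 20)),
    ("new.package", date(2017, 5, 25)),
]
counts = rolling_download_counts(records, today)
print(counts)  # {'old.package': 1, 'new.package': 2}
```

In this sample the older package has more total downloads, but the newer one wins inside the window. That is the property that makes a rolling count friendlier to new packages than a lifetime total.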

skofman1 · 2017-06-05T19:57:28Z

We got an additional customer complaint regarding this: When you search for "protobuf", Google.Protobuf is nearly at the bottom with 500k+ downloads. Many many packages with less than 100 downloads are ranked higher. P.S. I was surprised there are no controls to reorder the search results (Most downloads, Last updated, etc.)

joelverhagen · 2018-10-29T17:39:02Z

NUnit is now ranked higher than Newtonsoft.Json, despite having far fewer downloads. This is because the 2.x usage of NUnit has surpassed that of Newtonsoft.Json. The current download and weight values for the top 10 packages in the rankings file are:

PackageId	Operation	Downloads	Weight
nunit	Install	19615	19615
newtonsoft.json	Install	17411	17411
entityframework	Install	12920	12920
mysql.data	Install	6279	6279
bootstrap	Install	5824	5824
nuget.core	Install	4523	4523
newtonsoft.json	Update	4448	2224
jquery	Install	4344	4344
htmlagilitypack	Install	4079	4079
microsoft.aspnet.mvc	Install	3356	3356
newtonsoft.json.net20.dll	Install	3289	3289
jquery	Update	2597	1298.5
entityframework	Update	2242	1121
microsoft.aspnet.mvc	Update	2052	1026
nunit	Update	1719	859.5
bootstrap	Update	1405	702.5
htmlagilitypack	Update	231	115.5
mysql.data	Update	167	83.5
nuget.core	Update	19	9.5

You'll see that updates are worth half the weight of installs.

The weight of NUnit is 19615 + 859.5 = 20474.5.
The weight of Newtonsoft.Json is 17411 + 2224 = 19635.

This is why NUnit is higher.
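
The arithmetic above can be reproduced with a short sketch. The rows are copied from the table; the function names and the data layout are illustrative assumptions, not the actual rankings job (which is the Rankings.sql script linked at the top of this issue).

```python
from collections import defaultdict

# Per the table above, an update is worth half the weight of an install.
OPERATION_WEIGHT = {"Install": 1.0, "Update": 0.5}

# (package id, operation, downloads) rows copied from the table (subset).
ROWS = [
    ("nunit", "Install", 19615),
    ("newtonsoft.json", "Install", 17411),
    ("newtonsoft.json", "Update", 4448),
    ("nunit", "Update", 1719),
]

def package_weights(rows):
    """Sum the weighted downloads per package id."""
    totals = defaultdict(float)
    for package_id, operation, downloads in rows:
        totals[package_id] += downloads * OPERATION_WEIGHT[operation]
    return dict(totals)

def rank(rows):
    """Return package ids ordered by descending total weight."""
    weights = package_weights(rows)
    return sorted(weights, key=weights.get, reverse=True)

weights = package_weights(ROWS)
print(weights["nunit"])            # 20474.5
print(weights["newtonsoft.json"])  # 19635.0
print(rank(ROWS))                  # ['nunit', 'newtonsoft.json']
```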

natebunton · 2019-02-15T18:09:25Z

Finding packages in the Visual Studio package manager is difficult without proper relevance sorting. Developers may accidentally use a similar (and potentially malicious) package if they are not very careful.

joelverhagen · 2019-12-17T20:28:29Z

The rankings file is no longer used by the primary search experience. Azure Search uses the total download count primarily and does not consider this install/update metric sent from V2 clients.

Issue #7186 tracks the fundamental problem of download count w.r.t. direct vs. transitive dependencies.

Issue https://github.com/nuget/engineering/issues/1321 tracks the clean-up of these old reports.

skofman1 added Area: Statistics Priority - 1 Type:Bug labels Apr 12, 2017

skofman1 added this to the S117 - 2017.4.17 milestone Apr 12, 2017

skofman1 added the Pillar: Experience label Apr 15, 2017

skofman1 assigned ryuyu Apr 17, 2017

ryuyu removed this from the S117 - 2017.4.17 milestone Apr 20, 2017

skofman1 mentioned this issue Apr 28, 2017

Search results do not show right results and are different for VS and gallery #3843

Closed

skofman1 added this to the S118 - 2017.5.08 milestone May 3, 2017

skofman1 mentioned this issue May 10, 2017

[Statistics] Reduce DB size by removing unused data - Operation dimension #3625

Closed

skofman1 removed the Type:Bug label May 11, 2017

skofman1 closed this as completed May 26, 2017

skofman1 reopened this May 30, 2017

skofman1 removed this from the S118 - 2017.5.08 milestone May 30, 2017

skofman1 unassigned ryuyu May 30, 2017

skofman1 mentioned this issue May 30, 2017

Investigate improving search results sorting by using last 6 weeks of downloads #4028

Closed

skofman1 mentioned this issue Jul 11, 2017

Heavily weight download count in search results #4368

Closed

skofman1 mentioned this issue Jul 20, 2017

Improve search relevance on NuGet.org #4124

Closed

skofman1 added Area: Search and removed Area: Statistics labels Mar 6, 2018

skofman1 mentioned this issue Aug 29, 2018

Nuget.org search results are not based on 'popularity' #5332

Closed

joelverhagen closed this as completed Dec 17, 2019
