Improve search relevance on NuGet.org #4124

Open
anangaur opened this issue Jun 13, 2017 · 70 comments

Comments

@anangaur
Member

commented Jun 13, 2017

  1. Search for Microsoft Bot Connector vs. Microsoft Bot Connector - obvious that search is broken here.
    Reference: https://twitter.com/adpedley/status/874460132725235712
  2. #2789: Support partial search terms in search: E.g. "Microsoft.Bot.Con". Search has no results:
    image
  3. #782: Provide gestures for search filters
  4. Enable auto complete in the search box.

@anangaur changed the title from "Improve basic search results" to "Improve search on NuGet.org" on Jun 21, 2017

@anangaur

Member Author

commented Jun 21, 2017

Added a couple more.

@unniravindranathan

commented Jul 20, 2017

Terms to test: "UWP"

@skendrot

commented Jul 20, 2017

Search for uwp. The results are not as expected. I'm not sure what I expect, but not what is given, where the UWP Community Toolkit (with 50k downloads) is below items with 500 downloads.

@skofman1

Contributor

commented Jul 20, 2017

Hi @skendrot, NuGet.org search results are heavily based on popularity. Unfortunately, the current popularity calculation is based on partial information and is essentially incorrect: #3778

We are investigating how to fix this issue and provide accurate search results. You can track the attached issue for updates.

@onovotny

commented Jul 21, 2017

Even for popularity, it's not quite accurate either....

Search for Rx has Rx.NET on page 6 of the results…which no one will ever see.
https://preview.nuget.org/packages?q=rx&page=6

System.Reactive, which is Rx.NET, has 347k downloads for the main package alone (for 3.x, there are 7-8 packages).

There is no package on the result set in the previous 5 pages that has more downloads.

@svick

commented Jul 27, 2017

@skofman1 I think they should be based on popularity much more. For example, when I search for "csharp", hoping to find Microsoft.CodeAnalysis.CSharp, some of the results are:

| Position | Package | Downloads |
|---|---|---|
| 2 | iTextSharp | 1 800 000 |
| 3 | CefSharp.WinForms | 180 000 |
| 6 | Tarantool.CSharp | 2 500 |
| 7 | Hallmanac.Funqy.CSharp | 30 |
| 8 | Microsoft.CodeAnalysis.Scripting.CSharp | 36 000 |
| 9 | aliyun.oss.csharp.sdk.netstandard1.6 | 0 |
| 10 | web-csharp-001-core | 330 |
| 13 | Microsoft.CodeAnalysis.CSharp | 6 000 000 |

I don't see any reason why a package that nobody ever downloaded should be higher than a package that was downloaded six million times for this search term.

@anangaur

Member Author

commented Aug 15, 2017

Linking other search issue: #2405

@anangaur

Member Author

commented Aug 15, 2017

@skofman1 perhaps we need a search category for "New and upcoming"/"Rising stars" packages apart from "Popular packages"

@hudo

commented Aug 15, 2017

Would like to add another example: Entity Framework Core. It's similar to EF6, which is always 1st in the list.

@mungojam

commented Aug 15, 2017

Is there any tracking? So if somebody searches for term x and then navigates to package y, perhaps that should increase the prominence of package y for term x. Maybe that's too easy to game though.

@anangaur

Member Author

commented Aug 26, 2017

More reports on broken search. In this case it's related to metadata relevancy (it seems the tags are given more weight than required?)
https://twitter.com/AnthonySteele/status/901018032969068550

@agr

Contributor

commented Sep 15, 2017

Related: #4687

@anangaur

Member Author

commented Oct 3, 2017

++ #4731

@anangaur

Member Author

commented Oct 3, 2017

Similar:
Search for "Microsoft graph" vs. "Microsoft.Graph"

@enyim

commented Oct 19, 2017

Being based on popularity still does not explain why we have tens of unrelated packages before the one that actually has the matching ID.

screen shot 2017-10-20 at 12 13 06 am

@anangaur

Member Author

commented Oct 19, 2017

@enyim Could be same as mentioned in my previous comment. The spaces do not fare well in the NuGet.org search currently -- something that should be fixed.

@anangaur

Member Author

commented Nov 13, 2017

From NuGet/Home#6179
Search for microsoft.aspnetcore.static -- No Results
Need to search for the full ID: microsoft.aspnetcore.staticfiles

@jskeet

commented Nov 17, 2017

Not sure whether this is the right place to raise this, but the popularity sensitivity does seem to be pretty extreme.

Consider a search for "google cloud storage". I'd expect a package that has all three search terms in the title (and as tags) to rank higher than one which only has "storage" in the title, and doesn't have either "google" or "cloud" in the title, text or tags... but supposedly WindowsAzure.Storage is a better match than Google.Cloud.Storage.V1.

I realize search ranking is difficult and often subjective, but this feels like a pretty clear case where popularity is contributing way too much to ranking.

@onovotny

commented Dec 4, 2017

Finding another NuGet search issue:
https://www.nuget.org/packages?q=bouncycastle

Looking at that URL, the main BouncyCastle library is properly at the top. However, my Portable.BouncyCastle library, with 1.2M downloads, is “below the fold,” and below “Portable.BouncyCastle-Signed” with only 50k downloads. My library also has been updated far more recently than the other one (3m ago vs >2yrs ago).

I would expect my library to be fourth on that search list, at the very least.

@anangaur

Member Author

commented Jan 23, 2018

From #5332:
Search for 'pinvoke' on nuget.org.
It gives you packages in seemingly random order, where packages with as few as 120 downloads are at the top of the list.

Now, search for 'pinvoke author:andrew'.
It gives Andrew Arnott's pinvoke libraries, which have hundreds of thousands of downloads.
Shouldn't the first search query have listed these packages at the top?

@pebezo

commented Jan 28, 2018

The ordering is terrible. If I search for linq2db the first entry is a "satellite" package that has the search term at the end of the name, with 5K downloads and last updated over two years ago. The next two are similar, but with even fewer downloads (less than 200).

The right match is the fourth item, with an exact (100%) match and 95K downloads, updated less than 24 hours ago. How in the world are the first three entries above this one?!

@RichiCoder1

commented Mar 1, 2018

Constantly run into this, especially when I'm doing odd searches like looking for all packages in a "namespace" (aka AspNetCore.Security.) and not getting any (or getting terribly ordered results). Also when searching like anangaur's, where I search for a relatively innocuous tag and get a bunch of irrelevant packages before the one I was actually interested in (in an order that doesn't on the surface make much sense).

@anangaur

Member Author

commented Mar 13, 2018

From Abhijeet on mail:
When I search using following text:

  1. cosmosdb (all lower-case)
    a. https://www.nuget.org/packages?q=documentdb
    b. The package was shown in the middle of the page.
  2. CosmosDb (camelCase)
    a. https://www.nuget.org/packages?q=DocumentDb
    b. The search performed worse in this case (the package was shown last in the search results).

Does “case” make a difference in search results?

But what made me ask this question is that the top downloaded package, with 2+ million downloads, is not shown at the top.
image

@leastprivilege

commented May 19, 2019

...same goes for IdentityModel

Screenshot 2019-05-19 09 40 20

@SQL-MisterMagoo

commented May 23, 2019

I don't have a specific search term to add to the list, but would love to see filtering and sorting.

Possible filters/sorts would be :

  • Dependencies
  • Last updated
  • Author
  • License
  • Download total
  • Downloads in last 90 days
  • Prerelease flag
  • Tags

Basically any information you have decided would be useful to show me on the package view would potentially be useful in a search.

@Advanium

commented May 29, 2019

Are there any operators to remove specific terms from the result? Such as "serial number -port" because I'm not interested in libraries interacting with serial ports.

@joelverhagen

Member

commented May 29, 2019

@Advanium, great idea! Do you think you could file a separate issue for this specific case? We will likely tackle search relevancy/ranking separately from search syntax changes.

@bgrainger

commented Jun 25, 2019

Another example: "MySqlConnector" is listed as the 8th result for the query MySqlConnector, even though it's an exact match and has the most downloads of any result.

image

@loic-sharma

Member

commented Jun 26, 2019

@bgrainger We’re working on a new search service which strongly favors exact matches on the package id. Stay tuned! :)

@karann-msft changed the title from "Improve search on NuGet.org" to "Improve search relevance on NuGet.org" on Aug 22, 2019

@karann-msft

Contributor

commented Aug 26, 2019

We just announced the new NuGet search!

image

https://twitter.com/karann9/status/1166065246248656896?s=20

@304NotModified

Contributor

commented Aug 26, 2019

@karann-msft

When searching for "log", NLog isn't on the first page. It was first on item 2. I don't get it. The name almost matches and it has the tag "log".

https://www.nuget.org/packages?q=log

it does list packages which are far less popular, so I'm confused...


https://www.nuget.org/packages?q=logging

With search for "logging", NLog is on the 4th page. And bypassed by packages with 34 downloads per day (compared to 10,374 per day for NLog)

update:

missed the comparison I'm afraid: https://www.nuget.org/experiments/search-sxs?q=log.

image


Worst results for logging to file now :(

https://www.nuget.org/experiments/search-sxs?q=log+file

image

@anas2204

commented Aug 26, 2019

Another one: Search for "FluentAssertion" and the main library isn't even there on the page
image

When you finally put an "s" at the end, it appears. Isn't there a way to rank the library downloaded 21M times higher? (or even IN the list!)
image

@skofman1

@karann-msft
On Nuget.org website, "fluentassertion" returns as "Not found" whereas "fluentassertions" finds the package. Even "fluent assertion" finds the appropriate packages.

Is there a problem with the 1st query?

@loic-sharma

Member

commented Aug 26, 2019

Hey @304NotModified, thanks for reporting that! The "log" case should be fixed very soon by two changes:

  1. Tag boosting (#7381): We'll boost packages whose tags match the search query. This will make NLog the first result on the log query.
  2. Tokenization improvements on initialisms (#6964): Currently we chop inputs like "FooBARBaz" into "Foo" and "BARBaz". We will improve our search service to instead chop "FooBARBaz" into "Foo", "BAR", and "Baz" (and "NLog" into "N" and "Log"). We're still evaluating this change, but I expect this would also make NLog the first result on the log query.
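A rough sketch of the improved splitting described in item 2, for illustration only. The function name `split_initialisms` and the regular expression are assumptions, not the search service's actual tokenizer; the point is just that a run of capitals is kept together until the last capital that starts a lowercase word.

```python
import re

# Hypothetical tokenizer, not the real NuGet search analyzer.
# Alternatives, tried in order at each position:
#   1. a run of capitals NOT followed by a lowercase letter (an initialism, e.g. "BAR")
#   2. a capitalized word (e.g. "Foo", "Log")
#   3. a run of lowercase letters
#   4. a run of digits
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+")


def split_initialisms(text: str) -> list[str]:
    """Split an identifier into word and initialism tokens."""
    return _TOKEN.findall(text)


print(split_initialisms("FooBARBaz"))  # ['Foo', 'BAR', 'Baz']
print(split_initialisms("NLog"))       # ['N', 'Log']
```

With a split like this, a query for "log" can match the "Log" token of "NLog" directly.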

@anas2204 We're considering supporting misspellings (#7386). However, we're concerned about typosquatting attacks: evil Bob may upload a malicious package named "NLoog" and hope that people accidentally choose that package when they search "NLog". We're doing our due diligence in this area to make sure we maintain a safe ecosystem!
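To make the typosquatting concern concrete, here is a small sketch of a standard Levenshtein edit distance. It is not how the search service measures similarity (that is not stated here); it only shows that "NLoog" sits one edit away from "NLog", so any fuzzy match tolerant of a single typo would surface both.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


print(edit_distance("NLog", "NLoog"))  # 1
```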

@304NotModified

Contributor

commented Aug 26, 2019

@loic-sharma thanks, that makes sense.

@ThomasArdal

commented Aug 27, 2019

When searching for elmah.io the search result shows the elmah.io package in the first place (correct). The second place is System.IO which doesn't make a lot of sense in my head. I need to scroll to the 5th place to find the Elmah.Io.Client package.

When searching for serilog.sinks.elmahio, the actual package with that id is on the 6th place. Even though a package like Serilog.Sinks.Console is more popular, it is not more relevant. I think that when inputting an exact match for a package id, the found package should always be first.

@ThomasArdal

commented Aug 27, 2019

> we're concerned about typosquatting attacks: evil Bob may upload a malicious package named "NLoog" and hope that people accidentally choose that package when they search "NLog".

I see the point, but reserved prefixes will help with "attacks" like this 👍

@304NotModified

Contributor

commented Aug 27, 2019

> I think that when inputting an exact match for a package id, the found package should always be first.

Agree on that (but only if there is a really exact match)

@ThomasArdal

commented Aug 27, 2019

> Agree on that (but only if there is a really exact match)

Sure. A search for Nmog should not match NLog, but both nlog, NLog and Nlog should 👍
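A minimal sketch of that rule, assuming "exact" means equal after ignoring letter case. The helper name `is_exact_id_match` is made up for illustration.

```python
def is_exact_id_match(query: str, package_id: str) -> bool:
    """True when the query equals the package ID, ignoring case and outer whitespace."""
    return query.strip().casefold() == package_id.casefold()


for query in ("nlog", "NLog", "Nlog", "Nmog"):
    print(query, is_exact_id_match(query, "NLog"))
```

Only the last query, "Nmog", fails the check: a misspelling is not treated as an exact match.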

@joelverhagen

Member

commented Aug 27, 2019

We chose not to put exact match always first because there are common searches where this falls over.

  • Search "entity". You very likely want EntityFramework or similar. Not the Entity package with 3000 downloads.
  • Similarly "testing".
  • "json". People almost always want Newtonsoft.Json, not the "JSON" package.
  • ... Many other examples of generically named unpopular packages.

Our search click data and user feedback show that people do not usually want the generic name package. They want the popular one.

For cases where the name is more specific ("nlog") and the exact match is very popular, of course it should be first.

For cases where the package is not hugely popular but still an exact match and the ID is long and specific ("serilog.sinks.elmahio") I agree it should probably be first. We will likely improve this with another fix that we are planning.
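One way to picture the trade-off in the cases above is a toy score in which popularity is log-scaled and an exact ID match earns a boost that grows with the length of the ID. This is purely an illustrative assumption: the formula, the `0.25` weight, and every download count except the 3000 quoted for "Entity" are invented here and are not the real ranking.

```python
import math


def toy_score(query: str, package_id: str, downloads: int) -> float:
    """Hypothetical relevance score: log popularity plus a length-scaled exact-match boost."""
    popularity = math.log10(downloads + 1)
    if query.casefold() != package_id.casefold():
        return popularity
    # Longer IDs are more specific, so an exact match on them counts for more.
    return popularity + 0.25 * len(package_id)


# Generic query: the short exact match loses to the popular package.
# (50_000_000 is a made-up download count for illustration.)
print(toy_score("entity", "Entity", 3_000) < toy_score("entity", "EntityFramework", 50_000_000))

# Long, specific query: the exact match wins despite fewer downloads.
# (Both download counts are made up for illustration.)
q = "serilog.sinks.elmahio"
print(toy_score(q, "Serilog.Sinks.ElmahIo", 10_000) > toy_score(q, "Serilog.Sinks.Console", 5_000_000))
```

Both comparisons print `True` under these made-up numbers, matching the behavior described: generic names defer to popularity, long specific IDs do not.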

Another approach we considered was to follow what other package managers do for exact match and put that result in a special colored box or with a special label. This is less a relevancy thing and more a special search result that comes back from some searches in parallel with "relevant" results. That idea is tracked here:
#7463. Please leave feedback on that issue if you have thoughts 😀.

As with all search relevancy changes, it's a balancing act that we tried to get right on this iteration but certainly we missed some cases. From our user feedback and current top queries, we know we improved overall but cases like this still are not perfect yet. Keep the examples flowing and we will weigh them against the rest of the search and non-search work items in our backlog.

@onovotny

commented Aug 28, 2019

Search for microsoft.extensions.hosting.windowsservice turns up nothing on the first several pages of results. Cannot actually locate the package by search.

@joelverhagen

Member

commented Aug 28, 2019

@onovotny, thanks for taking the time to report this issue.

For the microsoft.extensions.hosting.windowsservice case, there are two factors that are messing us up. First, the package ID actually ends in an "s" so the "windowsservice" part is not recognized. We are considering fuzzy matching (#7386) but it needs more evaluation before it's ready.

With the correctly spelled query (microsoft.extensions.hosting.windowsservices) the package is still "losing" in our current algorithm because it has such a low download count (4,302 today). This will likely be mitigated by another change we are going to A/B test where we give a big boost to any package matching all dot-split tokens.
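As a sketch of what "matching all dot-split tokens" could mean, assuming a case-insensitive comparison of whole tokens (the helper name and the exact matching rule are assumptions, not the planned implementation):

```python
def matches_all_dot_tokens(query: str, package_id: str) -> bool:
    """True when every dot-separated token of the query is a token of the package ID."""
    query_tokens = [t for t in query.casefold().split(".") if t]
    id_tokens = set(package_id.casefold().split("."))
    return bool(query_tokens) and all(t in id_tokens for t in query_tokens)


package_id = "Microsoft.Extensions.Hosting.WindowsServices"
print(matches_all_dot_tokens("microsoft.extensions.hosting.windowsservices", package_id))  # True
print(matches_all_dot_tokens("microsoft.extensions.hosting.windowsservice", package_id))   # False
```

The second call shows why such a boost alone would not rescue the original query: the token missing its final "s" does not match.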

The exact match UI mentioned in #7463 is probably the "right fix" here but even that would not pick up your original search query because of the missing "s" at the end. Perhaps you can leave a comment on that issue regarding your thoughts on exact match + misspellings (if that's such a thing? not sure how it would work or of the details).

In the very short term, an exact match can be forced with packageid:microsoft.extensions.hosting.windowsservices.
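For anyone scripting this, a small sketch that builds the nuget.org search URL for such a query. It reuses the `https://www.nuget.org/packages?q=` URL form seen earlier in this thread; the helper name `search_url` is made up.

```python
from urllib.parse import urlencode


def search_url(query: str) -> str:
    """Build a nuget.org package search URL, percent-encoding the query."""
    return "https://www.nuget.org/packages?" + urlencode({"q": query})


print(search_url("packageid:microsoft.extensions.hosting.windowsservices"))
# https://www.nuget.org/packages?q=packageid%3Amicrosoft.extensions.hosting.windowsservices
```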

@Jericho

commented Aug 29, 2019

Has anybody reported an issue since the new search went live when paging through search results using the NuGet.Protocol.Core.v3 version 4.3.0-beta1-2418 nuget package? Specifically, I'm seeing duplicates. For example a given package is returned on page 2 and again on page 3. I will raise a new issue and provide C# code sample to reproduce but I thought I would mention it here first.

@joelverhagen

Member

commented Aug 29, 2019

Thanks @Jericho. Yes, please file a separate issue with repro for non-search relevancy issues.

@Jericho

commented Aug 29, 2019

@joelverhagen Issue #7494 raised

@jorgensigvardsson

commented Sep 11, 2019

Exact matches should always be shown first. If you search for Microsoft.AspNetCore.Mvc.Api.Analyzers, it shows up on the third result page.

@fileman

commented Sep 16, 2019

If you search for Microsoft.AspNetCore.Components.Authorization in Visual Studio or on nuget.org, it shows up on page 7.
