Improve ranking of exact matches in page titles #437

Open
hirasso opened this issue Sep 17, 2023 · 8 comments
Labels
collecting opinions (Leave a +1 or a comment!), improvement (Not a bug)

Comments

@hirasso

hirasso commented Sep 17, 2023

Hi there,

we just updated pagefind to 1.0.2 for the docs for swup and it's amazing! Thanks for all your work on this project.

Playing around with it, I noticed something that might or might not be possible to generalize: When I search for "Plugins", I'm getting the following results:

image

Intuitively I'd think that a page whose main heading (h1) exactly matches the search term ("Plugins", highlighted in the screenshot above) should be rated highest. Sure, it's possible for us to manually give the "plugins" page a very high rating – but maybe there is a way for pagefind to get smart enough to return pages with an exact match in the main heading first?

I'd be happy to hear your opinion on that before we start implementing manual ranking.

@hirasso hirasso changed the title Shouldn't exact matches for headings be weighed highest? Would it be possible for exact matches in headings to be ranked highest? Sep 17, 2023
@hirasso hirasso changed the title Would it be possible for exact matches in headings to be ranked highest? Would it be possible for exact matches in the main heading to be ranked highest? Sep 17, 2023
@bglw
Contributor

bglw commented Sep 17, 2023

Ah! Interesting.

So first, the reasoning for why the results are in that order: headings are given very strong priority, but /api/properties/ contains an h2 element with the text plugins, which is also weighted very favorably, so that page appears high in the rankings.
The other reason is that the page is so short — term frequency plays a big role in ranking, so a short page with multiple matching words will rank better than a long page with a lower density of words.
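
To make the term-frequency effect concrete, here is a minimal sketch. It is not Pagefind's actual scoring function: the formula (sum of match weights divided by page length) and the word counts are assumptions chosen purely for illustration.

```javascript
// Illustrative only: NOT Pagefind's real ranking code.
// Assumed score: total weight of the matching words divided by page length,
// so a higher density of matches yields a higher score.
function densityScore(matches, pageWordCount) {
  const totalWeight = matches.reduce((sum, match) => sum + match.weight, 0);
  return totalWeight / pageWordCount;
}

// Short page (40 words, invented): one heavily weighted heading match plus one body match.
const shortPage = densityScore([{ weight: 6 }, { weight: 1 }], 40);

// Long page (400 words, invented): a slightly heavier heading match plus two body matches.
const longPage = densityScore([{ weight: 7 }, { weight: 1 }, { weight: 1 }], 400);

console.log(shortPage > longPage); // true: the short page wins despite the lighter heading
```

Under these assumed numbers the short page outscores the long one even though the long page holds the more heavily weighted heading match.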

Intuitively I'd think that a page whose main heading (h1) exactly matches the search term should be rated highest

On an intuitive level I agree! The trouble is that currently, with the data Pagefind has on hand at the time of ranking, this isn't known. By the time that level of data is loaded into the front end, the rankings are locked in.

When ranking, all we see for these results is:

// Page A word match locations
[{
  "weight": 6,
  "location": 23
}, {
  "weight": 1,
  "location": 27
}]

// Page B word match locations
[{
  "weight": 7,
  "location": 0
}, {
  "weight": 1,
  "location": 12
}, {
  "weight": 1,
  "location": 21
},
/* -- more -- */
]

That weight: 7, location: 0 word is the h1, but we don't know that it's the only word in the h1.


I'm sure there's a creative solution here, but nothing immediately comes to mind.

One option would just be to bump the default weighting of h1 elements across the board to compensate, but I'd be wary of that impacting other sites in the wrong direction, if they're currently ranking well.

I do want to expose a new configuration option for mapping element selectors to custom weightings, meaning the default h1 rank could be a per-site implementation if people want to tailor their results. But this doesn't solve the "making Pagefind smart enough" goal of doing better by default.

Another option is finding some way to opt h1 elements (or generally high-ranked words) out of the term frequency penalty — but I'll need to think on that further.


Sorry for the essay! Since I have no immediate bright ideas, I think implementing manual ranking is a good first step.

(Changing your h1 elements to data-pagefind-weight="10" would definitely push Plugins into at least second place).

@bglw bglw changed the title Would it be possible for exact matches in the main heading to be ranked highest? Find a way to improve ranking of exact matches in page titles Sep 17, 2023
@bglw bglw added the improvement (Not a bug) and collecting opinions (Leave a +1 or a comment!) labels Sep 17, 2023
@hirasso
Author

hirasso commented Sep 17, 2023

No need to apologize, I love the essay! 😄 ...very interesting to get to know more about pagefind's internals.

From my experience with implementing search, it's very possible that too many built-in assumptions ("smartness") could hurt more than help. After giving it more thought I realized that maybe even the assumption that a perfect match in the h1 should be ranked highest doesn't hold true. What if we change the title of the page to "Plugin Ecosystem" or "Plugins Overview" sometime in the future? This would immediately break the ranking again.

From other search engines I know the concept of "pinning" pages to the very top for specific search terms. Maybe that could be a feature idea for pagefind, as well.

Something like data-pagefind-pin="plugin,plugins" or even with regex support: data-pagefind-pin="plugins?" could pin the page "Plugins" to the very top if users would search for "plugin" or "plugins".

It could also be a meta tag in the <head> of the page, like e.g.:

<!-- pin the page to the top when searching for "plugin" or "plugins": -->
<meta name="pagefind:pin" content="plugin,plugins">

Just an idea without knowing if this could be feasible with the architecture of pagefind 🙂
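
The matching rule this proposal implies could be sketched as follows. Pinning is not an existing Pagefind feature: the function name `matchesPin` and the choice to treat each comma-separated entry as an anchored, case-insensitive regular expression are assumptions made for illustration.

```javascript
// Hypothetical: Pagefind has no "pin" feature; this only sketches the proposed semantics.
// Each comma-separated entry is treated as a regex that must match the whole query.
function matchesPin(pinSpec, query) {
  const normalizedQuery = query.trim();
  return pinSpec
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .some((entry) => new RegExp(`^(?:${entry})$`, "i").test(normalizedQuery));
}

console.log(matchesPin("plugin,plugins", "Plugins")); // true
console.log(matchesPin("plugins?", "plugin")); // true
console.log(matchesPin("plugins?", "plugin ecosystem")); // false
```

Plain terms and patterns share one code path here, which keeps the attribute format simple, but it also means that regex metacharacters in a plain term would need escaping.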

@daun

daun commented Sep 17, 2023

Given the implementation details kindly explained by @bglw, I'd opt for manual ranking here. Intuitively, for most sites, it makes sense to rank a page highest when the search term exactly matches its h1/title, but if that's not encoded in the ranking data, re-ranking during display should work as well. I can imagine treating the h1 as a special title property of a page apart from the other contents, as that's how they're handled in display as well. But that's probably opening a whole different can of worms...
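
Re-ranking at display time could look roughly like the sketch below. The helper `promoteExactTitleMatches` is hypothetical, and it assumes each loaded result exposes its title as `meta.title`.

```javascript
// Sketch: move pages whose title exactly matches the query (case-insensitive) to the front.
// `loadedResults` is assumed to be an array of loaded result data objects,
// each carrying a `meta.title` string.
function promoteExactTitleMatches(loadedResults, query) {
  const wanted = query.trim().toLowerCase();
  const isExact = (result) =>
    ((result.meta && result.meta.title) || "").trim().toLowerCase() === wanted;
  // Filtering twice keeps the original ranking order within each group.
  return [
    ...loadedResults.filter(isExact),
    ...loadedResults.filter((result) => !isExact(result)),
  ];
}

// Possible usage in the browser (not run here):
//   const search = await pagefind.search("plugins");
//   const loaded = await Promise.all(search.results.map((r) => r.data()));
//   const ordered = promoteExactTitleMatches(loaded, "plugins");
```

One trade-off of this approach is that the data for every result has to be loaded before reordering, which may be costly for large result sets.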

@bglw
Contributor

bglw commented Sep 18, 2023

Good directions to think about, thanks to both of ya 🙏

I need to brush up CloudCannon's documentation and search, so that might prove a good place to experiment with some pinning or special casing on a site I'm familiar with. Will update here if I land on anything!

@bglw
Contributor

bglw commented Oct 18, 2023

Some new thoughts on this: I think I'm going to try to get metadata into the index in a way that it can be queried as part of a search. This would allow you to do a freeform search for the word plugins, or run a search specifically for title metadata containing the word plugin, or some combination of the two.

Still more ideation needed, but I think having that data combined with exposing more configuration on how it is used when ranking will allow people to tailor search to their site content.

@bglw bglw changed the title Find a way to improve ranking of exact matches in page titles Improve ranking of exact matches in page titles May 16, 2024
@clydebarrow

My request is not to exactly match titles, but filenames:

It would be nice IMHO if a search that exactly matched the name of a page would deliver that page as the top hit. E.g. searching for "I2C" would put a page called "I2C.html" at the top (without case sensitivity of course.)

Are filenames even used at present?

@bglw
Contributor

bglw commented May 16, 2024

Filenames are currently unused apart from building URLs, but the URLs are present in the result fragments. What you're after would still be solved by #532, as it is the precursor to matching files based on any "non-content" fields.

@quadratecode

I agree that matches in page titles are not adequately weighted. After tinkering with all available parameters I could not get exact matches in the title tag with data-pagefind-weight="10" to show up at the top of the results list. Here is an example:

image

There are also some other rankings which I do not understand - e.g. here:

image

In the above image, the first result has 6 matches while the second one has 49 matches. Changing the parameters did not seem to do much.

If you want, you can try it yourself here: https://www.zhlaw.ch/

Anyways - just also wanted to say: Amazing project, thank you so much for your work and for offering this under a permissive license! 🙏


5 participants