New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matching algorithm #320

Closed
polyvertex opened this Issue Sep 18, 2018 · 13 comments

Comments

Projects
None yet
2 participants
@polyvertex
Member

polyvertex commented Sep 18, 2018

@cpriest noticed that KP's matching algorithm produces unnatural results in some cases.
For instance, item named RHI3 - Click (With Valid Hash) (Bookmark) gets a better score than Google Chrome Dev when searching for ch.

Ref: initial chat

@polyvertex

This comment has been minimized.

Show comment
Hide comment
@polyvertex

polyvertex Sep 20, 2018

Member

To reduce the number of unnatural results, starting with next release, an acronym will match consecutive words only.

For instance, search string kl will be matched as an acronym of Keypirinha Launcher but not of Keypirinha is a Launcher. Resulting in Keypirinha Launcher item getting pushed up compared to Keypirinha is a Launcher.

Member

polyvertex commented Sep 20, 2018

To reduce the number of unnatural results, starting with next release, an acronym will match consecutive words only.

For instance, search string kl will be matched as an acronym of Keypirinha Launcher but not of Keypirinha is a Launcher. Resulting in Keypirinha Launcher item getting pushed up compared to Keypirinha is a Launcher.

@cpriest

This comment has been minimized.

Show comment
Hide comment
@cpriest

cpriest Sep 20, 2018

cpriest commented Sep 20, 2018

@polyvertex

This comment has been minimized.

Show comment
Hide comment
@polyvertex

polyvertex Sep 20, 2018

Member

Does KP pay attention to capital letter input vs capital letter match?
[...] I would think that 'KP' should prioritize the former and 'kp' the latter.

This is subject to debate but I would tend to disagree on that. For instance, you may expect different results for different plugins. However, regarding acronyms, they are currently matched only on capital letters, regardless of the case used in the search string. So that kl (or KL, or kL) is the acronym of Keypirinha Launcher but not keypirinha launcher. The latter is still a match, but with a lesser score than the former.

Member

polyvertex commented Sep 20, 2018

Does KP pay attention to capital letter input vs capital letter match?
[...] I would think that 'KP' should prioritize the former and 'kp' the latter.

This is subject to debate but I would tend to disagree on that. For instance, you may expect different results for different plugins. However, regarding acronyms, they are currently matched only on capital letters, regardless of the case used in the search string. So that kl (or KL, or kL) is the acronym of Keypirinha Launcher but not keypirinha launcher. The latter is still a match, but with a lesser score than the former.

@cpriest

This comment has been minimized.

Show comment
Hide comment
@cpriest

cpriest Sep 21, 2018

Ahh, I miswrote what I meant.

I meant KL and Keypirinha is a Launcher and kl for say kleenex website (Bookmark)

So given these two items:

  1. Keypirinha is a Launcher
  2. kleenex website (Bookmark)

I would rank them (personally, for this trite example) as:

  • KL --> 1, 2
  • kl --> 2, 1

I use an IDE called PhpStorm which uses this to great effect, basically capitals matching capitals at \b boundaries or lowercase/Uppercase boundary (as in DependencyInjectionContainer) gives higher weight to capital acronym matches, but otherwise slightly greater weight to consecutive letter matches.

I don't know if you've ever used fzf but the fuzzy matching algorithm there is documented pretty clearly in comments, at least the thoughts behind them are pretty well.

Incedentally I had worked on my own JS version of the fzf matching using purely regular expressions. I hadn't gotten to the point of testing it with large datasets (besides for speed).

Anyways, just some food for thought.

cpriest commented Sep 21, 2018

Ahh, I miswrote what I meant.

I meant KL and Keypirinha is a Launcher and kl for say kleenex website (Bookmark)

So given these two items:

  1. Keypirinha is a Launcher
  2. kleenex website (Bookmark)

I would rank them (personally, for this trite example) as:

  • KL --> 1, 2
  • kl --> 2, 1

I use an IDE called PhpStorm which uses this to great effect, basically capitals matching capitals at \b boundaries or lowercase/Uppercase boundary (as in DependencyInjectionContainer) gives higher weight to capital acronym matches, but otherwise slightly greater weight to consecutive letter matches.

I don't know if you've ever used fzf but the fuzzy matching algorithm there is documented pretty clearly in comments, at least the thoughts behind them are pretty well.

Incedentally I had worked on my own JS version of the fzf matching using purely regular expressions. I hadn't gotten to the point of testing it with large datasets (besides for speed).

Anyways, just some food for thought.

@polyvertex

This comment has been minimized.

Show comment
Hide comment
@polyvertex

polyvertex Sep 21, 2018

Member

Case-sensitive search has been considered very early during KP's development and is even implemented as an option in some search-related functions internally, including the fuzzy matching algorithm, which also supports character mismatch as an option (permanently disabled).

But in the context of Keypirinha, I'm not quite sure this would be of any benefit to somehow "require" user to enable capital letters in order to enforce a particular type of search that did not give the expected result in the first place using case-insensitive search. Especially once #221 will be implemented, which allows greater benefit at the cost of taking little time (once) to teach Keypirinha how you want things to be found.

Perhaps case-sensitive search could be added as an option (disabled by default). But this still implies some changes internally that I'm quite reluctant to make ATM in addition to the fact that you are the first and only user to ever mention it at all. I do not wish to close the door to it though. Thanks for taking the time to elaborate on your use-case.

Member

polyvertex commented Sep 21, 2018

Case-sensitive search has been considered very early during KP's development and is even implemented as an option in some search-related functions internally, including the fuzzy matching algorithm, which also supports character mismatch as an option (permanently disabled).

But in the context of Keypirinha, I'm not quite sure this would be of any benefit to somehow "require" user to enable capital letters in order to enforce a particular type of search that did not give the expected result in the first place using case-insensitive search. Especially once #221 will be implemented, which allows greater benefit at the cost of taking little time (once) to teach Keypirinha how you want things to be found.

Perhaps case-sensitive search could be added as an option (disabled by default). But this still implies some changes internally that I'm quite reluctant to make ATM in addition to the fact that you are the first and only user to ever mention it at all. I do not wish to close the door to it though. Thanks for taking the time to elaborate on your use-case.

@cpriest

This comment has been minimized.

Show comment
Hide comment
@cpriest

cpriest Sep 23, 2018

no problem. I have quite a few things indexed, finding them is actually more difficult than I would expect. Being a CLI person I'm used to typing the first few letters and tab to auto-complete.

I think the big problem I'm running into is that consecutive letter matches are just not ranked as highly as acronym finding.

Also, you're right about #221 that would take care of most of my needs. Reading early on in the docs that it functions that way just meant I had to find it manually the first time. Sometimes it's not even in the 1st 200 results though (but rarely).

cpriest commented Sep 23, 2018

no problem. I have quite a few things indexed, finding them is actually more difficult than I would expect. Being a CLI person I'm used to typing the first few letters and tab to auto-complete.

I think the big problem I'm running into is that consecutive letter matches are just not ranked as highly as acronym finding.

Also, you're right about #221 that would take care of most of my needs. Reading early on in the docs that it functions that way just meant I had to find it manually the first time. Sometimes it's not even in the 1st 200 results though (but rarely).

@polyvertex

This comment has been minimized.

Show comment
Hide comment
@polyvertex

polyvertex Sep 23, 2018

Member

I hear you. The aforementioned modification related to acronym matching might be useful in your case. On the other hand it's not trivial to find the right balance in score weighting so that it feels right to most use cases. Every time a user opened an issue here to indicate the search algorithm does not work as expected, their use case was different. I hope #221 will make at least some users happier.

Member

polyvertex commented Sep 23, 2018

I hear you. The aforementioned modification related to acronym matching might be useful in your case. On the other hand it's not trivial to find the right balance in score weighting so that it feels right to most use cases. Every time a user opened an issue here to indicate the search algorithm does not work as expected, their use case was different. I hope #221 will make at least some users happier.

@cpriest

This comment has been minimized.

Show comment
Hide comment
@cpriest

cpriest Sep 25, 2018

If it's possible, perhaps exposing those weights into the configuration would let some of us tweak it to the desired levels. Just a thought.

Thanks again, I nominated Keypirinha to Life Hackers best app launcher yesterday. They had rescinded Executors #1 status and replaced it with Launchy but didn't even have Keypirinha mentioned. I was quite surprised.

cpriest commented Sep 25, 2018

If it's possible, perhaps exposing those weights into the configuration would let some of us tweak it to the desired levels. Just a thought.

Thanks again, I nominated Keypirinha to Life Hackers best app launcher yesterday. They had rescinded Executors #1 status and replaced it with Launchy but didn't even have Keypirinha mentioned. I was quite surprised.

@polyvertex

This comment has been minimized.

Show comment
Hide comment
@polyvertex

polyvertex Sep 26, 2018

Member

If it's possible, perhaps exposing those weights into the configuration would let some of us tweak it to the desired levels. Just a thought.

This leads to new questions like how to categorize weights and how to apply them. For instance, would you categorize per category of item (bookmark, URL, Calc result, ...) ? Type of file? Both? ... In that regard, Keypirinha is "item-type agnostic" in the sense that new item categories can be created via the plugin API, which means the user would have to acquire knowledge of the categories created by some plugins in order to have good weight balance in their final config. That's really a lot of time spent on configuring Keypirinha for the user for such little benefits IMHO.

In addition to that, users would still pop up here for exactly the same issue (i.e. search algo gives unexpected results in some specific/corner cases). Probably because at some point, they may want to have those weight to vary for a particular kind of search.

All in all I do not believe it is a good solution and to be honest it's been a long while this feature has been removed from project's to-do list.

Thanks again, I nominated Keypirinha to Life Hackers best app launcher yesterday. [...]

Thanks for the nice feedback and for encouraging the project :)

Member

polyvertex commented Sep 26, 2018

If it's possible, perhaps exposing those weights into the configuration would let some of us tweak it to the desired levels. Just a thought.

This leads to new questions like how to categorize weights and how to apply them. For instance, would you categorize per category of item (bookmark, URL, Calc result, ...) ? Type of file? Both? ... In that regard, Keypirinha is "item-type agnostic" in the sense that new item categories can be created via the plugin API, which means the user would have to acquire knowledge of the categories created by some plugins in order to have good weight balance in their final config. That's really a lot of time spent on configuring Keypirinha for the user for such little benefits IMHO.

In addition to that, users would still pop up here for exactly the same issue (i.e. search algo gives unexpected results in some specific/corner cases). Probably because at some point, they may want to have those weight to vary for a particular kind of search.

All in all I do not believe it is a good solution and to be honest it's been a long while this feature has been removed from project's to-do list.

Thanks again, I nominated Keypirinha to Life Hackers best app launcher yesterday. [...]

Thanks for the nice feedback and for encouraging the project :)

@cpriest

This comment has been minimized.

Show comment
Hide comment
@cpriest

cpriest Sep 26, 2018

Definitely, I was going to get involved in a new JS launcher gaining popularity named 'hain' but the original developer didn't have the time to work on it and though he gave me access to work on it, I wasn't interested in taking it on by myself.

This leads to new questions like how to categorize weights and how to apply them. For instance, would you categorize per category of item (bookmark, URL, Calc result, ...)?

I would expose it as it is presently used/written. The values must be hard-coded now, no? I'm, making the assumption that the ItemCatalog is searched by a central matching algorithm, except in special cases where a 'live search' is being performed (ala WebSuggest).

Type of file? Both? ... In that regard, Keypirinha is "item-type agnostic" in the sense that new item categories can be created via the plugin API, which means the user would have to acquire knowledge of the categories created by some plugins in order to have good weight balance in their final config. That's really a lot of time spent on configuring Keypirinha for the user for such little benefits IMHO.

Are you saying that each category has its own levers for weighting?

In addition to that, users would still pop up here for exactly the same issue (i.e. search algo gives unexpected results in some specific/corner cases). Probably because at some point, they may want to have those weight to vary for a particular kind of search.

If 90% of the people are happy with it as is, then I wouldn't even publicize it much. For those that do ask, tell them how to get to the config for that.


A thought I just had, that may be kindof fun (at least I think it would be), could be using the users interaction between their search and their selection to auto-adjust the weights.

So if I type in ch and scroll down 50 items and launch something, the delta between the 'top hit' and the 'user hit' could be looked at and adjusted. Perhaps even using some standard ML library would be enough, dunno.

Wish I could help you with the coding effort as I'm proficient in c-c#, JS, PHP, Python and some (keep it 10 feet away, please, Java), most of the K&R languages in general.

cpriest commented Sep 26, 2018

Definitely, I was going to get involved in a new JS launcher gaining popularity named 'hain' but the original developer didn't have the time to work on it and though he gave me access to work on it, I wasn't interested in taking it on by myself.

This leads to new questions like how to categorize weights and how to apply them. For instance, would you categorize per category of item (bookmark, URL, Calc result, ...)?

I would expose it as it is presently used/written. The values must be hard-coded now, no? I'm, making the assumption that the ItemCatalog is searched by a central matching algorithm, except in special cases where a 'live search' is being performed (ala WebSuggest).

Type of file? Both? ... In that regard, Keypirinha is "item-type agnostic" in the sense that new item categories can be created via the plugin API, which means the user would have to acquire knowledge of the categories created by some plugins in order to have good weight balance in their final config. That's really a lot of time spent on configuring Keypirinha for the user for such little benefits IMHO.

Are you saying that each category has its own levers for weighting?

In addition to that, users would still pop up here for exactly the same issue (i.e. search algo gives unexpected results in some specific/corner cases). Probably because at some point, they may want to have those weight to vary for a particular kind of search.

If 90% of the people are happy with it as is, then I wouldn't even publicize it much. For those that do ask, tell them how to get to the config for that.


A thought I just had, that may be kindof fun (at least I think it would be), could be using the users interaction between their search and their selection to auto-adjust the weights.

So if I type in ch and scroll down 50 items and launch something, the delta between the 'top hit' and the 'user hit' could be looked at and adjusted. Perhaps even using some standard ML library would be enough, dunno.

Wish I could help you with the coding effort as I'm proficient in c-c#, JS, PHP, Python and some (keep it 10 feet away, please, Java), most of the K&R languages in general.

@polyvertex

This comment has been minimized.

Show comment
Hide comment
@polyvertex

polyvertex Oct 8, 2018

Member

Sorry for the lag.

There are nice features in hain like this double pane thing to have content preview. I had several high CPU and memory usage issues with it though... If I may, why did choose not to take on its dev?

Are you saying that each category has its own levers for weighting?

That is one of the questions raised by the idea of having such a feature implemented. But again, I'm not particularly fond of the idea itself which, I think, just artificially compensates a flaw in the search algorithm.

A thought I just had, that may be kindof fun (at least I think it would be), could be using the users interaction between their search and their selection to auto-adjust the weights.

I tried something very similar - at least it sounds like it - during early stage of KP's development. While it seemed to be an improvement during the first quick tests, which were successful, it eventually shown itself to produce pretty weird results on the long run. Even by tweaking things a little to fit my own use case, at some point there were items being pushed up out of the blue. Bottom line is that it worked just fine technically-wise, but the algo felt incomplete/messy to the end-user. Unusable as-is.

Wish I could help you with the coding effort as I'm proficient in c-c#, JS, PHP, Python and some (keep it 10 feet away, please, Java), most of the K&R languages in general.

Sincerely appreciated, thanks! KP's application (as opposed to its package) is written in C++ but is not open source ATM.

Member

polyvertex commented Oct 8, 2018

Sorry for the lag.

There are nice features in hain like this double pane thing to have content preview. I had several high CPU and memory usage issues with it though... If I may, why did choose not to take on its dev?

Are you saying that each category has its own levers for weighting?

That is one of the questions raised by the idea of having such a feature implemented. But again, I'm not particularly fond of the idea itself which, I think, just artificially compensates a flaw in the search algorithm.

A thought I just had, that may be kindof fun (at least I think it would be), could be using the users interaction between their search and their selection to auto-adjust the weights.

I tried something very similar - at least it sounds like it - during early stage of KP's development. While it seemed to be an improvement during the first quick tests, which were successful, it eventually shown itself to produce pretty weird results on the long run. Even by tweaking things a little to fit my own use case, at some point there were items being pushed up out of the blue. Bottom line is that it worked just fine technically-wise, but the algo felt incomplete/messy to the end-user. Unusable as-is.

Wish I could help you with the coding effort as I'm proficient in c-c#, JS, PHP, Python and some (keep it 10 feet away, please, Java), most of the K&R languages in general.

Sincerely appreciated, thanks! KP's application (as opposed to its package) is written in C++ but is not open source ATM.

@cpriest

This comment has been minimized.

Show comment
Hide comment
@cpriest

cpriest Oct 9, 2018

There are nice features in hain like this double pane thing to have the content preview. I had several high CPU and memory usage issues with it though... If I may, why did choose not to take on its dev?

Long story, I was very excited about it in the beginning, having written a pretty sophisticated plugin for it, submitted a few patches. Ultimately it was formed into an organization and I was made an administrator. Unfortunately, the original creator Heejin didn't have any time for it and I didn't have time to take on the project by myself. After some time of a few PR's I reviewed/merged a few developers expressed interest. I had suggested I was pretty familiar with the base code and re-writing it might be the best course, Heejin himself agreed as well.

I ended up forming a new org with the new members and worked on some important sections of code for the new project (the primary of which was the matching algorithm).

Sometime later, Heejin found time to work on it and afaik, it's been evolving here and there. I don't actively use it any longer now that I've found KP, it replaced both executor and hain for me.


I built the fuzzy matching algorithm in JavaScript based on the amazing junegunn/fzf project. In particular his algo.go file has explicit detail on the levers, reasons for them, etc with a V1/V2 comparison. A good read none-the-less. My JS based version is here: regex_fzf.js though it's not battle-tested. My next step was to build a minimal-GUI to see a live-listed ranking but I haven't gotten back to that just yet.

It may be a bit tough to follow as-is, it was a test-bed of sorts but leveraged the speed of native regex to score 100k lists in <100ms.

Are you saying that each category has its own levers for weighting?

This would be the fzf/regex_fzf versions I was talking about, though they are not exposed to the end-user, they are levers none-the-less.

See this line 68 to see the 'levers' I'm referring to. Mind you, these were not written of my own design entirely. If I were to expose algorithm controls to the end user they would be simpler indicators that affected these internal levers.

That is one of the questions raised by the idea of having such a feature implemented. But again, I'm not particularly fond of the idea itself which, I think, just artificially compensates a flaw in the search algorithm.

Perhaps you should try out fzf and see what you think, it's immensely useful anyways if your a keyboard jockey, available for windows/*nix and macOS probably. Dissecting and understanding the fzf algorithm was interesting to me.

My own implementation of it was a foray into an area I had interest in anyways.

Wish I could help you with the coding effort as I'm proficient in c-c#, JS, PHP, Python and some (keep it 10 feet away, please, Java), most of the K&R languages in general.

Sincerely appreciated, thanks! KP's application (as opposed to its package) is written in C++ but is not open source ATM.

Perhaps there may be a way to expose a pluggable search algorithm in binary form for KP such that an intrepid OCD person such as myself could write an alternate pluggable algorithm. Perhaps it could just be a python plugin, I've often found Python/JS are plenty fast even for demanding 100k scoring algorithms.

cpriest commented Oct 9, 2018

There are nice features in hain like this double pane thing to have the content preview. I had several high CPU and memory usage issues with it though... If I may, why did choose not to take on its dev?

Long story, I was very excited about it in the beginning, having written a pretty sophisticated plugin for it, submitted a few patches. Ultimately it was formed into an organization and I was made an administrator. Unfortunately, the original creator Heejin didn't have any time for it and I didn't have time to take on the project by myself. After some time of a few PR's I reviewed/merged a few developers expressed interest. I had suggested I was pretty familiar with the base code and re-writing it might be the best course, Heejin himself agreed as well.

I ended up forming a new org with the new members and worked on some important sections of code for the new project (the primary of which was the matching algorithm).

Sometime later, Heejin found time to work on it and afaik, it's been evolving here and there. I don't actively use it any longer now that I've found KP, it replaced both executor and hain for me.


I built the fuzzy matching algorithm in JavaScript based on the amazing junegunn/fzf project. In particular his algo.go file has explicit detail on the levers, reasons for them, etc with a V1/V2 comparison. A good read none-the-less. My JS based version is here: regex_fzf.js though it's not battle-tested. My next step was to build a minimal-GUI to see a live-listed ranking but I haven't gotten back to that just yet.

It may be a bit tough to follow as-is, it was a test-bed of sorts but leveraged the speed of native regex to score 100k lists in <100ms.

Are you saying that each category has its own levers for weighting?

This would be the fzf/regex_fzf versions I was talking about, though they are not exposed to the end-user, they are levers none-the-less.

See this line 68 to see the 'levers' I'm referring to. Mind you, these were not written of my own design entirely. If I were to expose algorithm controls to the end user they would be simpler indicators that affected these internal levers.

That is one of the questions raised by the idea of having such a feature implemented. But again, I'm not particularly fond of the idea itself which, I think, just artificially compensates a flaw in the search algorithm.

Perhaps you should try out fzf and see what you think, it's immensely useful anyways if your a keyboard jockey, available for windows/*nix and macOS probably. Dissecting and understanding the fzf algorithm was interesting to me.

My own implementation of it was a foray into an area I had interest in anyways.

Wish I could help you with the coding effort as I'm proficient in c-c#, JS, PHP, Python and some (keep it 10 feet away, please, Java), most of the K&R languages in general.

Sincerely appreciated, thanks! KP's application (as opposed to its package) is written in C++ but is not open source ATM.

Perhaps there may be a way to expose a pluggable search algorithm in binary form for KP such that an intrepid OCD person such as myself could write an alternate pluggable algorithm. Perhaps it could just be a python plugin, I've often found Python/JS are plenty fast even for demanding 100k scoring algorithms.

@polyvertex

This comment has been minimized.

Show comment
Hide comment
@polyvertex

polyvertex Oct 9, 2018

Member

I'm closing this since v2.20 has been released and acronyms are now matched on consecutive words only, as described above.

This does not mean I'm closing the door to this conversation, I'll get back later to it!

Member

polyvertex commented Oct 9, 2018

I'm closing this since v2.20 has been released and acronyms are now matched on consecutive words only, as described above.

This does not mean I'm closing the door to this conversation, I'll get back later to it!

@polyvertex polyvertex closed this Oct 9, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment