REPL.fuzzyscore heuristics could be better #49466

jakobnissen · 2023-04-23T08:04:53Z

The heuristics don't work very well in practice, partly because a mismatch near the beginning of the string completely throws off the algorithm, whereas numerous insertions/deletions barely affect the score.

For example

julia> REPL.fuzzyscore("bupercalifragilisticexpialidocious", "supercalifragilisticexpialidocious")
-68.0

julia> REPL.fuzzyscore("parasid", "supercalifragilisticexpialidocious")
6.59

Or

julia> REPL.fuzzyscore("IJilia", "IJulia")
-3.0933333333333337

julia> REPL.fuzzyscore("IJilia", "MotionCaptureJointCalibration")
5.5633333333333335

Maybe it would be better to use something like Levenshtein distance by default, with some slight heuristics on top?

tecosaur · 2023-04-23T09:58:48Z

For reference, I've just used Optimal String Alignment Distance in a project of mine, and it seems to work rather well.
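For readers unfamiliar with Optimal String Alignment (OSA) distance, here is a minimal sketch of the textbook dynamic-programming formulation. The name `osa_distance` is hypothetical, and this is not the code from the project mentioned above or from REPL; it only illustrates the technique: Levenshtein's insert/delete/substitute operations plus a swap of two adjacent characters.

```julia
# Hypothetical helper, not part of REPL or DataToolkitBase.
# OSA distance: Levenshtein edits plus adjacent transpositions, with the
# restriction that no substring is edited more than once.
function osa_distance(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)   # Vector{Char}, safe for non-ASCII input
    m, n = length(s), length(t)
    # d[i + 1, j + 1] holds the distance between the first i characters of s
    # and the first j characters of t.
    d = zeros(Int, m + 1, n + 1)
    d[:, 1] .= 0:m
    d[1, :] .= 0:n
    for j in 1:n, i in 1:m
        cost = s[i] == t[j] ? 0 : 1
        best = min(d[i, j + 1] + 1,    # deletion
                   d[i + 1, j] + 1,    # insertion
                   d[i, j] + cost)     # substitution or match
        if i > 1 && j > 1 && s[i] == t[j - 1] && s[i - 1] == t[j]
            best = min(best, d[i - 1, j - 1] + 1)   # adjacent transposition
        end
        d[i + 1, j + 1] = best
    end
    return d[m + 1, n + 1]
end

osa_distance("IJilia", "IJulia")   # one substitution apart
```

This sketch allocates the full (m + 1) × (n + 1) table for clarity; an implementation tuned for speed and memory would keep only the last few rows.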

I'm also considering using the Longest Common Subsequence to highlight "this bit looks similar" in the match.
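As a rough illustration of the Longest Common Subsequence idea, the sketch below recovers one LCS of two strings with the standard dynamic-programming table. The function name `longest_common_subsequence` is hypothetical, and how the result would be used for highlighting is left open here.

```julia
# Hypothetical helper: returns one longest common subsequence of a and b.
function longest_common_subsequence(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    m, n = length(s), length(t)
    # len[i + 1, j + 1] is the LCS length of the first i characters of s
    # and the first j characters of t.
    len = zeros(Int, m + 1, n + 1)
    for i in 1:m, j in 1:n
        if s[i] == t[j]
            len[i + 1, j + 1] = len[i, j] + 1
        else
            len[i + 1, j + 1] = max(len[i, j + 1], len[i + 1, j])
        end
    end
    # Walk back from the bottom-right corner to collect the matched characters.
    out = Char[]
    i, j = m, n
    while i > 0 && j > 0
        if s[i] == t[j]
            push!(out, s[i])
            i -= 1
            j -= 1
        elseif len[i, j + 1] >= len[i + 1, j]
            i -= 1
        else
            j -= 1
        end
    end
    return String(reverse(out))
end

longest_common_subsequence("IJilia", "IJulia")
```

The characters in the returned string are the ones both inputs share in order, so they are natural candidates for the "looks similar" highlight.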

vortex73 · 2023-04-27T05:43:26Z

Could you tell me how to reproduce this issue, and point me to the relevant source?

jakobnissen · 2023-04-27T09:03:08Z

@vortex73 You can reproduce it like this:

julia> using REPL

julia> REPL.fuzzyscore("IJilia", "IJulia")
-3.0933333333333337

julia> REPL.fuzzyscore("IJilia", "MotionCaptureJointCalibration")
5.5633333333333335

The code is in stdlib/REPL/src/docview.jl. Search for "Fuzzy Search Algorithm" - it's the set of functions immediately following.

vortex73 · 2023-04-27T13:40:46Z

Thanks! I'm studying the source code and just want to understand what the desired output might be. An example would be hugely helpful.

jakobnissen · 2023-04-27T14:11:56Z

So we'd want a score where similar strings get a higher score than dissimilar strings. We also want the function to be relatively fast and memory-light. There are no strict objective criteria for what a "more similar string" is, but I think we can all agree that "IJilia" is closer to "IJulia" than "MotionCaptureJointCalibration" is.
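The ranking requirement can be written down as a small check. The sketch below orders candidates by plain edit distance using the internal, unexported `REPL.levenshtein`; `rank_candidates` is a hypothetical name, and this is one possible way to state the criterion, not the behaviour of `REPL.fuzzyscore`.

```julia
using REPL  # provides the internal, unexported REPL.levenshtein

# Hypothetical helper: order candidates from most to least similar to `query`,
# treating a smaller edit distance as a better match.
rank_candidates(query, candidates) =
    sort(candidates; by = c -> REPL.levenshtein(query, c))

rank_candidates("IJilia", ["MotionCaptureJointCalibration", "IJulia"])
```

Under this criterion "IJulia" sorts first, because it is a single substitution away from the query, while the longer name needs at least 23 edits just to make up the length difference.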

tecosaur · 2023-04-27T14:20:19Z

It would probably be worth giving some more examples of what OSA-based scores look like in practice:

julia> DataToolkitBase.stringsimilarity("bupercalifragilisticexpialidocious", "supercalifragilisticexpialidocious")
0.9705882352941176

julia> DataToolkitBase.stringsimilarity("parasid", "supercalifragilisticexpialidocious")
0.20588235294117652

julia> DataToolkitBase.stringsimilarity("IJilia", "IJulia")
0.8333333333333334

julia> DataToolkitBase.stringsimilarity("IJilia", "MotionCaptureJointCalibration")
0.1724137931034483

julia> @btime DataToolkitBase.stringsimilarity("IJilia", "MotionCaptureJointCalibration")
  282.022 ns (2 allocations: 576 bytes)
0.1724137931034483
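The printed scores are consistent with normalizing the edit distance by the length of the longer string, for example 1 - 1/34 ≈ 0.9706 for the 34-character pair and 1 - 1/6 ≈ 0.8333 for "IJilia" and "IJulia". That is an inference from the numbers, not a statement about how `DataToolkitBase.stringsimilarity` is implemented. A minimal sketch of that normalization, with the hypothetical name `normalized_similarity`:

```julia
# Hypothetical helper: turn an edit distance `d` between `a` and `b` into a
# similarity where 1 means identical strings. The result stays in [0, 1] as
# long as `d` does not exceed the longer length, which holds for Levenshtein
# and OSA distances.
function normalized_similarity(d::Integer, a::AbstractString, b::AbstractString)
    longest = max(length(a), length(b))
    longest == 0 && return 1.0   # two empty strings are identical
    return 1 - d / longest
end

normalized_similarity(1, "IJilia", "IJulia")   # 1 - 1/6
```

Dividing by the longer length means a single typo costs less in a long name than in a short one, which matches the intuition in the examples above.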

tecosaur · 2023-04-27T14:25:48Z

Also: for helpful "did you mean?"-type messages, I think having that work well could be useful in a fair few other places in the Julia code base. Perhaps it would be worth having a 'private'/not-part-of-the-API Base.stringsimilarity for that purpose?

vortex73 · 2023-04-27T17:02:31Z

I've understood the code base and the problem. @jakobnissen asked for a Levenshtein approach, but doesn't even that fall prey to the mismatch at the start? Please correct me, as I'm new to this and guidance would be helpful.

jakobnissen · 2023-04-28T08:24:24Z

> a Levenshtein approach, but doesn't even that fall prey to the mismatch at the start?

No, that works fine.

julia> using REPL

julia> REPL.levenshtein("bupercalifragilisticexpialidocious", "supercalifragilisticexpialidocious")
1

The old heuristics were not particularly helpful. These new heuristics should be easier to reason about, since the score is between 0 and 1, and also yield much more intuitive results.

Closes #49466
See #49562

Co-authored-by: TEC <git@tecosaur.net>
Co-authored-by: matthieugomez <gomez.matthieu@gmail.com>

jakobnissen added the stdlib:REPL and good first issue labels Apr 26, 2023

PhoenixFlame101 mentioned this issue Apr 29, 2023

Fix REPL.fuzzyscore() function using Levenshtein distance and converting to a range from 0 to 1 #49562

Closed

jakobnissen mentioned this issue May 1, 2023

Validate uses of AbstractString and string indices tecosaur/DataToolkitBase.jl#4

Closed

jakobnissen mentioned this issue Jul 4, 2023

Update REPL.fuzzyscore to use string distance #50412

Merged

KristofferC closed this as completed in #50412 Aug 14, 2023
