Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REPL.fuzzyscore heuristics could be better #49466

Closed
jakobnissen opened this issue Apr 23, 2023 · 9 comments · Fixed by #50412
Closed

REPL.fuzzyscore heuristics could be better #49466

jakobnissen opened this issue Apr 23, 2023 · 9 comments · Fixed by #50412
Labels
good first issue Indicates a good issue for first-time contributors to Julia stdlib:REPL Julia's REPL (Read Eval Print Loop)

Comments

@jakobnissen
Copy link
Contributor

The heuristics don't work very well in practise, partly because a mismatch near the beginning of the string completely messes up the algorithm, whereas numerous insertions/deletions don't do much.

For example

julia> REPL.fuzzyscore("bupercalifragilisticexpialidocious", "supercalifragilisticexpialidocious")
-68.0

julia> REPL.fuzzyscore("parasid", "supercalifragilisticexpialidocious")
6.59

Or

julia> REPL.fuzzyscore("IJilia", "IJulia")
-3.0933333333333337

julia> REPL.fuzzyscore("IJilia", "MotionCaptureJointCalibration")
5.5633333333333335

Maybe it would be better to use something like Lehvenstein by default with some slight heuristics on top?

@tecosaur
Copy link
Contributor

For reference, I've just used Optimal String Alignment Distance in a project of mine, and it seems to work rather well.

image

I'm also considering using the Longest Common Subsequence to highlight "this bit looks similar" in the match.

@jakobnissen jakobnissen added stdlib:REPL Julia's REPL (Read Eval Print Loop) good first issue Indicates a good issue for first-time contributors to Julia labels Apr 26, 2023
@vortex73
Copy link

Could you tell me how to reproduce this issue? And direct me to any related src .

@jakobnissen
Copy link
Contributor Author

@vortex73 You can reproduce it like this:

julia> using REPL

julia> REPL.fuzzyscore("IJilia", "IJulia")
-3.0933333333333337

julia> REPL.fuzzyscore("IJilia", "MotionCaptureJointCalibration")
5.5633333333333335

The code is in stdlib/REPL/src/docview.jl. Search for "Fuzzy Search Algorithm" - it's the set of functions immediately following.

@vortex73
Copy link

vortex73 commented Apr 27, 2023

Thanks! I'm studying the source code and just want to understand what the desired output might be? An example would be hugely helpful

@jakobnissen
Copy link
Contributor Author

So we'd want a score where similar strings get a higher score than dissimilar strings. We also want the function to be relatively fast and memory-light. There is no strict objective criteria for what a "more similar string is", but I think we can all agree that "IJilia" is closer to "IJulia" than "MotionCaptureJointCalibration" is.

@tecosaur
Copy link
Contributor

It would probably be worth giving some more examples of what OSA-based scores looks like in practice:

julia> DataToolkitBase.stringsimilarity("bupercalifragilisticexpialidocious", "supercalifragilisticexpialidocious")
0.9705882352941176

julia> DataToolkitBase.stringsimilarity("parasid", "supercalifragilisticexpialidocious")
0.20588235294117652

julia> DataToolkitBase.stringsimilarity("IJilia", "IJulia")
0.8333333333333334

julia> DataToolkitBase.stringsimilarity("IJilia", "MotionCaptureJointCalibration")
0.1724137931034483

julia> @btime DataToolkitBase.stringsimilarity("IJilia", "MotionCaptureJointCalibration")
  282.022 ns (2 allocations: 576 bytes)
0.1724137931034483

@tecosaur
Copy link
Contributor

Also: for helpful "did you mean?"-type messages, I think having that work well well could be useful across a fair few other places in the Julia code base, perhaps it could be worth having a 'private'/not-part-of-the-API Base.stringsimilarity for that purpose?

@vortex73
Copy link

I've understood the code base and the problem. @jakobnissen had asked for a lehvenstein's approach, but doesn't even that fall prey to the mismatch at the starting? Please correct me as i'm new to this and guidance would be helpful.

@jakobnissen
Copy link
Contributor Author

a lehvenstein's approach, but doesn't even that fall prey to the mismatch at the starting?

No, that works fine.

julia> using REPL

julia> REPL.levenshtein("bupercalifragilisticexpialidocious", "supercalifragilisticexpialidocious")
1

KristofferC pushed a commit that referenced this issue Aug 14, 2023
The old heuristics were not particularly helpful. These new heuristics
should be easier to reason about, since the score is between 0 and 1,
and also yield much more intuitive results.

Closes #49466 
See #49562

Co-authored-by: TEC <git@tecosaur.net>
Co-authored-by: matthieugomez <gomez.matthieu@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Indicates a good issue for first-time contributors to Julia stdlib:REPL Julia's REPL (Read Eval Print Loop)
Projects
None yet
3 participants