REPL.fuzzyscore heuristics could be better #49466
Comments
Could you tell me how to reproduce this issue, and direct me to any related source? |
@vortex73 You can reproduce it like this:
julia> using REPL
julia> REPL.fuzzyscore("IJilia", "IJulia")
-3.0933333333333337
julia> REPL.fuzzyscore("IJilia", "MotionCaptureJointCalibration")
5.5633333333333335
The code is in |
Thanks! I'm studying the source code and just want to understand what the desired output might be. An example would be hugely helpful. |
So we'd want a score where similar strings get a higher score than dissimilar strings. We also want the function to be relatively fast and memory-light. There are no strict objective criteria for what a "more similar" string is, but I think we can all agree that |
It would probably be worth giving some more examples of what OSA-based scores look like in practice:
julia> DataToolkitBase.stringsimilarity("bupercalifragilisticexpialidocious", "supercalifragilisticexpialidocious")
0.9705882352941176
julia> DataToolkitBase.stringsimilarity("parasid", "supercalifragilisticexpialidocious")
0.20588235294117652
julia> DataToolkitBase.stringsimilarity("IJilia", "IJulia")
0.8333333333333334
julia> DataToolkitBase.stringsimilarity("IJilia", "MotionCaptureJointCalibration")
0.1724137931034483
julia> @btime DataToolkitBase.stringsimilarity("IJilia", "MotionCaptureJointCalibration")
282.022 ns (2 allocations: 576 bytes)
0.1724137931034483 |
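For readers unfamiliar with OSA (optimal string alignment, a restricted form of Damerau-Levenshtein distance), here is a minimal sketch of how such a score could be computed. The names `osa_distance` and `osa_similarity` are hypothetical, and the normalization `1 - distance / max(length)` is an assumption: it is consistent with the figures quoted above (for example `1 - 1/6` for "IJilia" vs "IJulia"), but this is not the `DataToolkitBase` implementation and it makes no attempt to match its speed or allocation profile.

```julia
# Optimal string alignment distance: insertions, deletions, substitutions,
# and transpositions of adjacent characters each cost 1.
function osa_distance(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    m, n = length(s), length(t)
    # d[i+1, j+1] is the distance between the first i chars of s
    # and the first j chars of t.
    d = zeros(Int, m + 1, n + 1)
    d[:, 1] .= 0:m
    d[1, :] .= 0:n
    for j in 1:n, i in 1:m
        cost = s[i] == t[j] ? 0 : 1
        d[i+1, j+1] = min(d[i, j+1] + 1,   # deletion
                          d[i+1, j] + 1,   # insertion
                          d[i, j] + cost)  # substitution or match
        if i > 1 && j > 1 && s[i] == t[j-1] && s[i-1] == t[j]
            # adjacent transposition
            d[i+1, j+1] = min(d[i+1, j+1], d[i-1, j-1] + 1)
        end
    end
    return d[m+1, n+1]
end

# Assumed normalization: 1.0 for identical strings, 0.0 for nothing in common.
function osa_similarity(a::AbstractString, b::AbstractString)
    longest = max(length(a), length(b))
    longest == 0 && return 1.0
    return 1 - osa_distance(a, b) / longest
end
```

With this sketch, a single wrong character costs the same wherever it occurs, so a mismatch in the first character does not dominate the score.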
Also, helpful "did you mean?"-type messages that work well could be useful in a fair few other places in the Julia code base. Perhaps it could be worth having a 'private'/not-part-of-the-API |
I've understood the code base and the problem. @jakobnissen had asked for a Levenshtein approach, but doesn't even that fall prey to a mismatch at the start? Please correct me, as I'm new to this, and guidance would be helpful. |
No, that works fine.
julia> using REPL
julia> REPL.levenshtein("bupercalifragilisticexpialidocious", "supercalifragilisticexpialidocious")
1 |
The old heuristics were not particularly helpful. These new heuristics should be easier to reason about, since the score is between 0 and 1, and also yield much more intuitive results.
Closes #49466
See #49562
Co-authored-by: TEC <git@tecosaur.net>
Co-authored-by: matthieugomez <gomez.matthieu@gmail.com>
The heuristics don't work very well in practise, partly because a mismatch near the beginning of the string completely messes up the algorithm, whereas numerous insertions/deletions don't do much.
For example
Or
Maybe it would be better to use something like Levenshtein by default with some slight heuristics on top?
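As a rough illustration of that suggestion, the sketch below ranks candidates by a normalized Levenshtein similarity and drops anything below a cutoff. Everything here is hypothetical: `lev`, `similarity`, `did_you_mean`, and the `threshold = 0.5` default are illustrative names and values, not part of `REPL`, and the cutoff stands in for whatever heuristics one might actually layer on top.

```julia
# Levenshtein distance using two rows of the DP table to keep memory low.
function lev(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))  # distances from the empty prefix of s
    curr = similar(prev)
    for i in eachindex(s)
        curr[1] = i
        for j in eachindex(t)
            cost = s[i] == t[j] ? 0 : 1
            curr[j+1] = min(prev[j+1] + 1,   # deletion
                            curr[j] + 1,     # insertion
                            prev[j] + cost)  # substitution or match
        end
        prev, curr = curr, prev
    end
    return prev[end]
end

# Score in [0, 1]; higher means more similar.
function similarity(a::AbstractString, b::AbstractString)
    longest = max(length(a), length(b))
    return longest == 0 ? 1.0 : 1 - lev(a, b) / longest
end

# Return candidates at or above the (assumed) threshold, best match first.
function did_you_mean(query, candidates; threshold = 0.5)
    scored = [(similarity(query, c), c) for c in candidates]
    filter!(x -> first(x) >= threshold, scored)
    sort!(scored; by = first, rev = true)
    return last.(scored)
end
```

For the example in this issue, `did_you_mean("IJilia", ["MotionCaptureJointCalibration", "IJulia"])` keeps only "IJulia", since the longer name scores well below the cutoff.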