levenshtein automata not matching Japanese Characters correctly #38

Open
PSeitz opened this issue Mar 19, 2017 · 15 comments

PSeitz commented Mar 19, 2017

Hi,
awesome library and documentation. The explanations are really nice to read.

I have an example that should probably match but doesn't. I guess it has something to do with the utf8-ranges matching. (Btw: this means sushi can't be burned :)

let keys = vec!["寿司は焦げられない"];
let set = try!(Set::from_iter(keys.iter()));

let lev = try!(Levenshtein::new("寿司は焦げられない", 2));
let stream = set.search(lev).into_stream();
let keys = try!(stream.into_strs());
BurntSushi (Owner) commented:

Thanks for the report. This does appear to be a bug.

In the future, please provide a minimal runnable program that reproduces your issue. Here's one I came up with:

extern crate fst;

use fst::{Set, Levenshtein, IntoStreamer};

fn main() {
    let k = "寿司は焦げられない";
    let keys = vec![k];
    let set = Set::from_iter(keys.iter()).unwrap();

    let lev = Levenshtein::new(k, 2).unwrap();
    let stream = set.search(lev).into_stream();
    let keys = stream.into_strs().unwrap();
    println!("{:?}", keys);
}
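
(Under the bug this presumably prints [] rather than echoing the key "寿司は焦げられない", even though the query is the key itself.)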

I don't know the cause of it off the top of my head, but I fear this may be due to a bug in the Levenshtein automaton construction itself. If that's the case, I don't know when it will get fixed.

BurntSushi added the bug label Mar 19, 2017

PSeitz commented Apr 19, 2017

Hi,

I took a closer look at the issue and gained some insight, although there are some things I don't completely understand with regard to the states data structures.

The issue seems to occur when there are two characters with the same first byte.
When we alter the original example to these:

let keys = vec!["来探"];
let lev = Levenshtein::new("来探", 1).unwrap();
--> fails

let keys = vec!["来食"];
let lev = Levenshtein::new("来食", 1).unwrap();
--> correct

来 bytes [230, 157, 165]
探 bytes [230, 142, 162]
食 bytes [233, 163, 159]
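
(For reference, these byte values can be reproduced with a small standalone snippet, independent of this crate:)

fn main() {
    // Print the UTF-8 encoding of each character: 来 and 探 share the
    // leading byte 230, while 食 starts with 233.
    for c in &["来", "探", "食"] {
        println!("{} bytes {:?}", c, c.as_bytes());
    }
}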

I printed the states from the DFA, but I'm not sure how to read them: are there multiple entry points, or does it always start at state 0? And what does is_match mean?

My understanding is that, by following the matching from state 0, I should be able to trace out the original byte sequences:

  • 来探 -> 230, 157, 165, 230, 142, 162
  • 来食 -> 230, 157, 165, 233, 163, 159

"来探" Failing state machine
This is the only path from state 0, and the first 142 should be 157?
230 -> 142 -> 162 -> 230 -> 142 -> 162
Also, there are states which are not reachable from state 0.

"来食" Correct state machine
There are multiple paths from state 0 and the original byte sequence is part of it.
230 -> 157 -> 165 -> 233 -> 163 -> 159

Btw, if you like the states representation, I can create a pull request for it.
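
For reference, a states representation of this shape can be printed with a standalone mock along these lines (this is not the crate's actual code; it only assumes a state is a 256-entry byte transition table plus an is_match flag, matching the shape used by add_utf8_range later in this thread):

// Hypothetical standalone mock of the state layout, for illustration only.
struct DfaState {
    next: [Option<usize>; 256],
    is_match: bool,
}

fn print_states(states: &[DfaState]) {
    for (i, state) in states.iter().enumerate() {
        print!("state {} (is_match: {}):", i, state.is_match);
        for (byte, to) in state.next.iter().enumerate() {
            if let Some(to) = *to {
                print!(" {} -> {}", byte, to);
            }
        }
        println!();
    }
}

fn main() {
    // Tiny made-up example: state 0 goes to state 1 on byte 230.
    let mut s0 = DfaState { next: [None; 256], is_match: false };
    s0.next[230] = Some(1);
    let s1 = DfaState { next: [None; 256], is_match: true };
    print_states(&[s0, s1]);
}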

BurntSushi (Owner) commented:

@PSeitz I think you're on your own here. I haven't looked at that code in a long time, and I'd need to spend a couple of hours loading up my context to help you. (At which point, I would probably just try to go ahead and fix the bug.)

You probably want some background reading:

http://julesjacobs.github.io/2015/06/17/disqus-levenshtein-simple-and-fast.html

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata

http://fulmicoton.com/posts/levenshtein/


PSeitz commented Jul 24, 2017

Thanks for the links. I think I understood the basic concepts. These examples, though, all use characters in their states rather than bytes, which makes them a lot easier to follow, because a new character is just one new step.

I don't really understand how a multibyte character is translated into multiple steps here, and as far as I can tell this seems to be where the issue is.

BurntSushi (Owner) commented:

@PSeitz Yes, they are encoded as UTF-8 automata. I don't know any well documented place where they are explained, but you can see the utf8-ranges crate docs for a little bit of explanation. (This crate is used in the implementation here.)
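
For a rough idea of what those docs show, usage looks something like this sketch (assuming the utf8-ranges crate as a dependency); it enumerates the byte-range sequences whose alternation matches exactly the UTF-8 encodings of the scalar values in the given range:

extern crate utf8_ranges;

use utf8_ranges::Utf8Sequences;

fn main() {
    // Each item is a sequence of byte ranges; together they cover the
    // UTF-8 encodings of all scalar values in ['\u{0}', '\u{FFFF}'].
    for seq in Utf8Sequences::new('\u{0}', '\u{FFFF}') {
        println!("{:?}", seq);
    }
}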

WarpspeedSCP commented:

Um, correct me if I'm wrong, but in cases like this, isn't it always better to compare individual bytes rather than "characters", since then it becomes a matter of comparing two numbers?

BurntSushi (Owner) commented:

@WarpspeedSCP I don't understand your question. Please show examples and tie it back to the specific implementation in this crate.


WarpspeedSCP commented Dec 18, 2017

What I meant was that comparing groups of bytes at a time as characters wouldn't be as easy as comparing bytes individually in the string to arrive at an answer. For example, you wouldn't just know which of 😎 and 😕 comes first, but if you looked at the bytes that represent them, it is possible to come to a conclusion. If one has fewer bytes it becomes immediately apparent, for example. But I don't know how utf8 works, so I am probably wrong.


BurntSushi commented Dec 18, 2017

@WarpspeedSCP Sorry, but what you're saying doesn't make any sense to me. I suspect you aren't using the words "bytes" and "character" correctly, but I don't know for sure. Moreover, this talk about "which one coming first" doesn't make any sense to me; what does ordering have to do with this ticket? Please write out an example in more detail.

I'd also like to encourage you to carefully read the comments in this issue. Drive-by comments typically aren't helpful, and it honestly kind of feels like that's what your comments here are. For example, a couple of comments up I linked to the docs for the utf8-ranges crate as a way of explaining how the UTF-8 automata are constructed. If you read those docs, then the fact that the implementation is in fact byte based should be very clear. And if it isn't, then something is wrong with the docs and I'd appreciate it if you pointed out those problems to me.

Making matters more complex is that, regardless of the implementation strategy, the concept of a character in the abstract sense still matters. In particular, while the Levenshtein automata in this crate are byte oriented, they measure differences between strings in terms of characters. The specific interpretation of character used in this crate is a single Unicode scalar value. This is an intentionally less-than-ideal approximation (a better one would be a Unicode grapheme cluster), chosen because it is simpler to implement.
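
To make the unit of measurement concrete, here is a standalone sketch (not this crate's code) of the textbook dynamic-programming edit distance computed over chars, i.e. Unicode scalar values:

// Standalone sketch: classic dynamic-programming Levenshtein distance,
// measured in Unicode scalar values (chars) rather than bytes.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..b.len() + 1).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let best = std::cmp::min(prev[j] + cost, std::cmp::min(prev[j + 1] + 1, cur[j] + 1));
            cur.push(best);
        }
        prev = cur;
    }
    prev[b.len()]
}

fn main() {
    // One substitution in chars, even though all three bytes differ.
    assert_eq!(levenshtein("来", "探"), 1);
    assert_eq!(levenshtein("寿司は焦げられない", "寿司は焦げられない"), 0);
    println!("ok");
}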


WarpspeedSCP commented Dec 19, 2017

What I meant by character, was the set of bytes that represent a Unicode code point.

For example,
来 is represented by the set of bytes [230, 157, 165], and
探 is represented by the set of bytes [230, 142, 162], right?

I now understand that your library works on the bytes of the string and not the characters.
So at that time, I had thought that just comparing the bytes directly at each state would be enough to say whether or not two strings were equal. If I were to do this with your library, would this be right?

A state machine has been constructed for 来, and 探 is queried from it:

let machine = fst::Levenshtein::new("来", 0).unwrap();
let query = vec!["探"];
let set = fst::Set::from_iter(query.iter()).unwrap();
let stream = set.search(machine).into_stream();
let results = stream.into_strs().unwrap();

Now results wouldn't contain anything since there were no matches, right?

The first byte would match, but the second would not, so the machine would output that there is no match. This was after I had skimmed the docs for this library, as well as the article just once. Is this how the library works?


BurntSushi commented Dec 19, 2017

comparing the bytes directly at each state would be enough to say whether or not two strings were equal

This ticket isn't about equality. It's about building a Levenshtein automaton and matching keys that are within a certain edit distance of a query. The edit distance is measured in units of Unicode scalar values. The automaton is built such that its transitions are byte based.

This program emits no results, which is correct:

extern crate fst;
extern crate fst_levenshtein;

use fst::{IntoStreamer, Set};
use fst_levenshtein::Levenshtein;

fn main() {
    let set = Set::from_iter(vec!["探"]).unwrap();
    let query = Levenshtein::new("来", 0).unwrap();
    let results = set.search(query).into_stream().into_strs().unwrap();
    println!("{:?}", results);
}

This program also emits no results, but is incorrect. This is presumably the bug:

extern crate fst;
extern crate fst_levenshtein;

use fst::{IntoStreamer, Set};
use fst_levenshtein::Levenshtein;

fn main() {
    let set = Set::from_iter(vec!["探"]).unwrap();
    let query = Levenshtein::new("来", 1).unwrap();
    let results = set.search(query).into_stream().into_strs().unwrap();
    println!("{:?}", results);
}
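
(With a correct automaton this should print ["探"], since 探 is a single substitution away from 来; under the bug it prints nothing.)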


WarpspeedSCP commented Dec 20, 2017

Thank you; now I understand why this bug is occurring. In the second case, the search returns only the first byte [230], which is not a valid character and won't print, right?

BurntSushi (Owner) commented:

In the second case, the search returns only the first byte [230], which is not a valid character, and won't print, right?

It returns nothing. Returning something that is invalid UTF-8 would be an egregious bug.

WarpspeedSCP commented:

Ok, so it either outputs a valid answer or nothing.


PSeitz commented Dec 28, 2017

I had another look at the issue, and the problem is the overwrite, where the old branch gets orphaned. That's the reason why this appears only with characters that share the same first byte.

This also fits the initial observation, where the next state is overwritten from 230 -> 142 to 230 -> 157.

fn add_utf8_range(
    &mut self,
    overwrite: bool,
    from: usize,
    to: usize,
    range: &Utf8Range,
) {
    for b in range.start as usize..range.end as usize + 1 {
        if overwrite || self.dfa.states[from].next[b].is_none() {
            self.dfa.states[from].next[b] = Some(to);
        }
    }
}

So I tried a simple merge of the old state's next transitions into the new state's next transitions. This fixes the problem in this case, but I think it's generally not allowed, because we allow too much?

Also, some cases would require doing this recursively, and that is not feasible performance-wise.

    fn add_utf8_range(
        &mut self,
        overwrite: bool,
        from: usize,
        to: usize,
        range: &Utf8Range,
    ) {
        for b in range.start as usize..range.end as usize + 1 {
            if overwrite || self.dfa.states[from].next[b].is_none() {
                if let Some(state) = self.dfa.states[from].next[b] {
                    //merge existing into new state
                    for i in 0..256 {
                        if let Some(si) = self.dfa.states[state].next[i] {
                            self.dfa.states[to].next[i] = Some(si);
                        }
                    }
                }
                self.dfa.states[from].next[b] = Some(to);
            }
        }
    }
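
For reference, a minimal check for this case, built the same way as the programs earlier in the thread (with a correct automaton it should print ["来探"]):

extern crate fst;
extern crate fst_levenshtein;

use fst::{IntoStreamer, Set};
use fst_levenshtein::Levenshtein;

fn main() {
    // The key itself is within distance 0 <= 1 of the query, so a correct
    // automaton must report it; under the bug described above it does not.
    let set = Set::from_iter(vec!["来探"]).unwrap();
    let query = Levenshtein::new("来探", 1).unwrap();
    let results = set.search(query).into_stream().into_strs().unwrap();
    println!("{:?}", results);
}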

BurntSushi changed the title from "No Hit for Japanese Characters" to "levenshtein automata not matching Japanese Characters correctly" on Feb 21, 2020