Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fuzzy prefix searches? #2

Closed
emk opened this issue Nov 13, 2015 · 4 comments · Fixed by #3
Closed

Fuzzy prefix searches? #2

emk opened this issue Nov 13, 2015 · 4 comments · Fixed by #3

Comments

@emk
Copy link

emk commented Nov 13, 2015

Wow, what a slick library. I was just evaluating this for use in building custom indices, and I'm really impressed. I think I'm going to get a lot of mileage out of this. :-)

However, I ran into one use case that would be very handy: Fuzzy prefix matching. Given a string Masach, I can search for Levenshtein::new("Masach", 1) or Regex::new("Masach.*"). But what I'd really love to do is autocomplete to Massachusetts, with the extra s.

It seems like it should be possible to build something like:

// Match anything with a prefix that would be matched by A.
impl<A: Automaton> Automaton for Prefix<A> { ... }

But I'm not quite sure how to go about something like that. Is this basically trivial, or is it much harder than it looks?

@BurntSushi
Copy link
Owner

Seems like a great idea!

Your implementation idea is clever, but I'm not sure it can work as is today. In particular, I think the underlying automaton needs to be modified. In today's implementation, as soon as the key gets beyond the edit distance of the query, all subsequent inputs are rejected and that part of the FST is automatically pruned. To support prefix queries, the automaton needs to encode that all future inputs are accepted once the initial prefix is known to match. This seems like a reasonable addition, but it will require diving into the guts of the Levenshtein construction code.

Both the Levenshtein and Regex DFA construction are "proof of concept." They need to be more thoughtfully implemented, because right now, they are slow and not very memory conscious. I'd say the implementation is borderline naive. Unless someone else wants to attack it, I'll keep this use case in mind and see if we can support it easily during the rewrite.

@BurntSushi
Copy link
Owner

You could, in theory, implement this yourself outside of the fst crate without touching automata, but you'd have to pay a small cost. Namely, you could use a plain Levenshtein automaton to search for all matching prefixes in the FST. Once you have all matching prefixes, you should be able to do a range query to find all keys between {PREFIX} and {PREFIX}\xff, where both bounds are inclusive. But you'd have to do a range query for each matching prefix.

You'd have to do this manually because the current lookup code doesn't support returning matches that don't correspond to final states. In the above algorithm, you'd need to return prefixes of keys, which obviously may not end at final states.

The better solution is absolutely to just make the underlying automaton support this use case, because then you can do it one query.

@gereeter
Copy link
Contributor

Shouldn't the following work?

enum PrefixState<A: Automaton> {
    Done,
    Running(A::State) // This cannot be a match state
}

struct Prefix<A>(A);

impl<A: Automaton> Automaton for Prefix<A> {
    type State = PrefixState<A>;
    fn start(&self) -> PrefixState<A> {
        let inner = self.0.start();
        if self.0.is_match(inner) {
            Done
        } else {
            Running(inner)
        }
    }

    fn is_match(&self, state: &PrefixState<A>) -> bool {
        match *state {
            Done => true,
            Running(_) => false
        }
    }

    fn can_match(&self, state: &PrefixState<A>) -> bool {
        match *state {
            Done => true,
            Running(ref inner) => self.0.can_match(inner)
        }
    }

    fn accept(&self, state: &PrefixState<A>, byte: u8) -> PrefixState<A> {
        match *self {
            Done => Done,
            Running(ref inner) => {
                let next_inner = self.0.accept(inner, byte);
                if self.0.is_match(&next_inner) {
                    Done
                } else {
                    Running(next_inner)
                }
            }
        }
    } 
}

Admittedly, it would probably still be more efficient to just merge all the accept states in the underlying automaton.

@BurntSushi
Copy link
Owner

@gereeter I think that might work, ya! Getting to that steady state is the right trick I think. The perf difference should be minimal or non-existent I think.

fulmicoton added a commit to fulmicoton/fst that referenced this issue Jan 20, 2016
fulmicoton added a commit to fulmicoton/fst that referenced this issue Jan 20, 2016
BurntSushi added a commit that referenced this issue Jan 20, 2016
Closes #2 Implement Clone and adds len() to MmapReadOnly
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants