Fuzzy prefix searches? #2

emk · 2015-11-13T20:00:41Z

Wow, what a slick library. I was just evaluating this for use in building custom indices, and I'm really impressed. I think I'm going to get a lot of mileage out of this. :-)

However, I ran into one use case that would be very handy: Fuzzy prefix matching. Given a string Masach, I can search for Levenshtein::new("Masach", 1) or Regex::new("Masach.*"). But what I'd really love to do is autocomplete to Massachusetts, with the extra s.

It seems like it should be possible to build something like:

// Match anything with a prefix that would be matched by A.
impl<A: Automaton> Automaton for Prefix<A> { ... }

But I'm not quite sure how to go about something like that. Is this basically trivial, or is it much harder than it looks?

BurntSushi · 2015-11-14T02:43:55Z

Seems like a great idea!

Your implementation idea is clever, but I'm not sure it can work as is today. In particular, I think the underlying automaton needs to be modified. In today's implementation, as soon as the key gets beyond the edit distance of the query, all subsequent inputs are rejected and that part of the FST is automatically pruned. To support prefix queries, the automaton needs to encode that all future inputs are accepted once the initial prefix is known to match. This seems like a reasonable addition, but it will require diving into the guts of the Levenshtein construction code.
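To make the "accept everything once the prefix matches" idea concrete, here is a minimal standalone sketch. It does not use the fst crate's Levenshtein automaton; it runs the textbook edit-distance table one key byte at a time, which is an assumption made for illustration. The function name `fuzzy_prefix_match` is hypothetical. It stops with success as soon as the query is within the allowed distance of the bytes read so far, and stops with failure once every table entry exceeds the limit, which mirrors the pruning described above.

```rust
/// Returns true if some prefix of `key` is within `max_dist` edits of `query`.
/// Hypothetical helper for illustration; not part of the fst crate.
fn fuzzy_prefix_match(query: &[u8], key: &[u8], max_dist: usize) -> bool {
    // row[i] = edit distance between query[..i] and the key bytes read so far.
    let mut row: Vec<usize> = (0..=query.len()).collect();
    if row[query.len()] <= max_dist {
        return true; // the empty prefix of the key already matches
    }
    for &kb in key {
        let mut next = Vec::with_capacity(row.len());
        next.push(row[0] + 1);
        for i in 1..=query.len() {
            let cost = if query[i - 1] == kb { 0 } else { 1 };
            let best = (row[i - 1] + cost) // match or substitution
                .min(row[i] + 1)           // extra byte in the key
                .min(next[i - 1] + 1);     // extra byte in the query
            next.push(best);
        }
        row = next;
        if row[query.len()] <= max_dist {
            // The prefix read so far matches, so every longer key matches too.
            return true;
        }
        if row.iter().all(|&d| d > max_dist) {
            // The row minimum never decreases, so no later prefix can match.
            return false;
        }
    }
    false
}
```

With this sketch, `fuzzy_prefix_match(b"Masach", b"Massachusetts", 1)` succeeds because the prefix "Massach" is one insertion away from the query, while the same call with a distance of 0 fails.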

Both the Levenshtein and Regex DFA construction are "proof of concept." They need to be more thoughtfully implemented, because right now, they are slow and not very memory conscious. I'd say the implementation is borderline naive. Unless someone else wants to attack it, I'll keep this use case in mind and see if we can support it easily during the rewrite.

BurntSushi · 2015-11-14T02:47:18Z

You could, in theory, implement this yourself outside of the fst crate without touching automata, but you'd have to pay a small cost. Namely, you could use a plain Levenshtein automaton to search for all matching prefixes in the FST. Once you have all matching prefixes, you should be able to do a range query to find all keys between {PREFIX} and {PREFIX}\xff, where both bounds are inclusive. But you'd have to do a range query for each matching prefix.
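The second half of that workaround (one range query per matching prefix) can be sketched without the fst crate. The example below uses a `BTreeSet` as a stand-in for the FST's ordered key set, and it assumes the list of matching prefixes has already been produced by some fuzzy search. The names `keys_with_prefix` and `expand_prefixes` are hypothetical. Instead of an upper bound of {PREFIX}\xff, it scans forward from the prefix and stops at the first key that no longer starts with it.

```rust
use std::collections::BTreeSet;

/// Collects every key that starts with `prefix`, using one ordered range scan.
/// The BTreeSet is only a stand-in for an FST's sorted key set.
fn keys_with_prefix<'a>(keys: &'a BTreeSet<Vec<u8>>, prefix: &[u8]) -> Vec<&'a [u8]> {
    keys.range(prefix.to_vec()..)
        .take_while(|k| k.starts_with(prefix))
        .map(|k| k.as_slice())
        .collect()
}

/// Runs one range scan per matching prefix and merges the results.
/// `prefixes` is assumed to come from an earlier fuzzy search step.
fn expand_prefixes<'a>(keys: &'a BTreeSet<Vec<u8>>, prefixes: &[&[u8]]) -> Vec<&'a [u8]> {
    let mut out = Vec::new();
    for prefix in prefixes {
        out.extend(keys_with_prefix(keys, prefix));
    }
    // Different prefixes can cover the same key, so sort and drop duplicates.
    out.sort();
    out.dedup();
    out
}
```

The per-prefix loop is the "small cost" mentioned above: each matching prefix triggers its own scan, whereas a prefix-aware automaton would need only one pass.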

You'd have to do this manually because the current lookup code doesn't support returning matches that don't correspond to final states. In the above algorithm, you'd need to return prefixes of keys, which obviously may not end at final states.

The better solution is absolutely to just make the underlying automaton support this use case, because then you can do it in one query.

gereeter · 2015-11-15T00:36:18Z

Shouldn't the following work?

use fst::Automaton;
use self::PrefixState::{Done, Running};

enum PrefixState<A: Automaton> {
    // The inner automaton has matched; every later byte is accepted.
    Done,
    Running(A::State) // This cannot be a match state
}

struct Prefix<A>(A);

impl<A: Automaton> Automaton for Prefix<A> {
    type State = PrefixState<A>;

    fn start(&self) -> PrefixState<A> {
        let inner = self.0.start();
        if self.0.is_match(&inner) {
            Done
        } else {
            Running(inner)
        }
    }

    fn is_match(&self, state: &PrefixState<A>) -> bool {
        match *state {
            Done => true,
            Running(_) => false
        }
    }

    fn can_match(&self, state: &PrefixState<A>) -> bool {
        match *state {
            Done => true,
            Running(ref inner) => self.0.can_match(inner)
        }
    }

    fn accept(&self, state: &PrefixState<A>, byte: u8) -> PrefixState<A> {
        match *state {
            Done => Done,
            Running(ref inner) => {
                let next_inner = self.0.accept(inner, byte);
                if self.0.is_match(&next_inner) {
                    Done
                } else {
                    Running(next_inner)
                }
            }
        }
    }
}

Admittedly, it would probably still be more efficient to just merge all the accept states in the underlying automaton.
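To see the adapter's behaviour without pulling in the fst crate, here is a small self-contained sketch. The `Automaton` trait below is a trimmed stand-in that only mirrors the methods used in this thread (it is not the crate's definition), and `Exact` and `accepts` are hypothetical helpers added for the demonstration.

```rust
/// Trimmed stand-in for the trait shape discussed in this thread.
trait Automaton {
    type State;
    fn start(&self) -> Self::State;
    fn is_match(&self, state: &Self::State) -> bool;
    fn accept(&self, state: &Self::State, byte: u8) -> Self::State;
}

/// Hypothetical toy automaton that accepts exactly one byte string.
struct Exact<'a>(&'a [u8]);

impl<'a> Automaton for Exact<'a> {
    // Some(n): the first n bytes matched so far. None: dead state.
    type State = Option<usize>;
    fn start(&self) -> Option<usize> {
        Some(0)
    }
    fn is_match(&self, state: &Option<usize>) -> bool {
        *state == Some(self.0.len())
    }
    fn accept(&self, state: &Option<usize>, byte: u8) -> Option<usize> {
        match *state {
            Some(n) if self.0.get(n) == Some(&byte) => Some(n + 1),
            _ => None,
        }
    }
}

enum PrefixState<S> {
    Done,
    Running(S),
}

/// Matches any input that has a prefix accepted by the inner automaton.
struct Prefix<A>(A);

impl<A: Automaton> Automaton for Prefix<A> {
    type State = PrefixState<A::State>;
    fn start(&self) -> Self::State {
        let inner = self.0.start();
        if self.0.is_match(&inner) {
            PrefixState::Done
        } else {
            PrefixState::Running(inner)
        }
    }
    fn is_match(&self, state: &Self::State) -> bool {
        matches!(state, PrefixState::Done)
    }
    fn accept(&self, state: &Self::State, byte: u8) -> Self::State {
        match state {
            // Once the inner automaton has matched, stay in the steady state.
            PrefixState::Done => PrefixState::Done,
            PrefixState::Running(inner) => {
                let next = self.0.accept(inner, byte);
                if self.0.is_match(&next) {
                    PrefixState::Done
                } else {
                    PrefixState::Running(next)
                }
            }
        }
    }
}

/// Feeds `input` through the automaton and reports whether it ends in a match.
fn accepts<A: Automaton>(aut: &A, input: &[u8]) -> bool {
    let mut state = aut.start();
    for &b in input {
        state = aut.accept(&state, b);
    }
    aut.is_match(&state)
}
```

Here `accepts(&Exact(b"Mass"), b"Massachusetts")` is false because the exact automaton dies on the fifth byte, while `accepts(&Prefix(Exact(b"Mass")), b"Massachusetts")` is true because the wrapper latches into `Done` after "Mass".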

BurntSushi · 2015-11-15T01:00:57Z

@gereeter I think that might work, ya! Getting to that steady state is the right trick. The perf difference should be minimal or non-existent, I think.

BurntSushi added the enhancement label Nov 14, 2015

gereeter mentioned this issue Nov 15, 2015

Add various Automaton adapters #3

Merged

BurntSushi closed this as completed in #3 Nov 15, 2015

fulmicoton added a commit to fulmicoton/fst that referenced this issue Jan 20, 2016

Closes BurntSushi#2 Implement Clone and adds len() to MmapReadOnly

8400271

fulmicoton added a commit to fulmicoton/fst that referenced this issue Jan 20, 2016

Closes BurntSushi#2 Implement Clone and adds len() to MmapReadOnly

068a853

BurntSushi added a commit that referenced this issue Jan 20, 2016

Merge pull request #9 from fulmicoton/master

6ead7f3

Closes #2 Implement Clone and adds len() to MmapReadOnly
