-
-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fuzzy prefix searches? #2
Comments
Seems like a great idea! Your implementation idea is clever, but I'm not sure it can work as is today. In particular, I think the underlying automaton needs to be modified. In today's implementation, as soon as the key gets beyond the edit distance of the query, all subsequent inputs are rejected and that part of the FST is automatically pruned. To support prefix queries, the automaton needs to encode that all future inputs are accepted once the initial prefix is known to match. This seems like a reasonable addition, but it will require diving into the guts of the Levenshtein construction code. Both the Levenshtein and Regex DFA construction are "proof of concept." They need to be more thoughtfully implemented, because right now, they are slow and not very memory conscious. I'd say the implementation is borderline naive. Unless someone else wants to attack it, I'll keep this use case in mind and see if we can support it easily during the rewrite. |
You could, in theory, implement this yourself outside of the You'd have to do this manually because the current lookup code doesn't support returning matches that don't correspond to final states. In the above algorithm, you'd need to return prefixes of keys, which obviously may not end at final states. The better solution is absolutely to just make the underlying automaton support this use case, because then you can do it one query. |
Shouldn't the following work? enum PrefixState<A: Automaton> {
Done,
Running(A::State) // This cannot be a match state
}
struct Prefix<A>(A);
impl<A: Automaton> Automaton for Prefix<A> {
type State = PrefixState<A>;
fn start(&self) -> PrefixState<A> {
let inner = self.0.start();
if self.0.is_match(inner) {
Done
} else {
Running(inner)
}
}
fn is_match(&self, state: &PrefixState<A>) -> bool {
match *state {
Done => true,
Running(_) => false
}
}
fn can_match(&self, state: &PrefixState<A>) -> bool {
match *state {
Done => true,
Running(ref inner) => self.0.can_match(inner)
}
}
fn accept(&self, state: &PrefixState<A>, byte: u8) -> PrefixState<A> {
match *self {
Done => Done,
Running(ref inner) => {
let next_inner = self.0.accept(inner, byte);
if self.0.is_match(&next_inner) {
Done
} else {
Running(next_inner)
}
}
}
}
} Admittedly, it would probably still be more efficient to just merge all the accept states in the underlying automaton. |
@gereeter I think that might work, ya! Getting to that steady state is the right trick I think. The perf difference should be minimal or non-existent I think. |
Closes #2 Implement Clone and adds len() to MmapReadOnly
Wow, what a slick library. I was just evaluating this for use in building custom indices, and I'm really impressed. I think I'm going to get a lot of mileage out of this. :-)
However, I ran into one use case that would be very handy: Fuzzy prefix matching. Given a string
Masach
, I can search forLevenshtein::new("Masach", 1)
orRegex::new("Masach.*")
. But what I'd really love to do is autocomplete toMassachusetts
, with the extras
.It seems like it should be possible to build something like:
But I'm not quite sure how to go about something like that. Is this basically trivial, or is it much harder than it looks?
The text was updated successfully, but these errors were encountered: