Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limited Multi-Regex as the keys in the FST #58

Closed
yazaddaruvala opened this issue Apr 6, 2018 · 4 comments
Closed

Limited Multi-Regex as the keys in the FST #58

yazaddaruvala opened this issue Apr 6, 2018 · 4 comments

Comments

@yazaddaruvala
Copy link

yazaddaruvala commented Apr 6, 2018

Note: I know I can search with a regex, fst.search(Regex::new(...)), but it doesn't satisfy my usecase.

Note: This suggestion is very similar to multi-regex, but hopefully by only introducing a limited feature set from regex we can keep the state machine small, and therefore really fast. I think an FST with "complex states" would be more efficient than RegexSet, however, you're the expert, so I'd trust you if you said otherwise.

Summary: My usecase basically needs a RegexSet with a few million Regexs, I know its crazy. But this is why I'd prefer the FST to RegexSet, I'm hoping for better memory efficiency (Ideally, improved throughput too).

What are your thoughts around adding limited regex features into the keys of the FST?
For example: A system which parses log files for a 500+ route API

// The URLs in log files have unique ids, and so they need to be mapped back into their original "route patten" for appropriate metrics aggregations.
// There are simpler ways to solve this problem, like logging the "routing pattern" along with the URL. However, this was the simplest example to describe my usecase.

// Used to map the values in the FST to other things.
let names = vec!["GET_FOO_BAR_ALL", "GET_FOO_BAR_ONE", "PUT_FOO_BAR_ONE"]

let mut build = MapBuilder::new(wtr)?;
build.insert("GET:/foo/.*/bar", 0).unwrap();
build.insert("GET:/foo/.*/bar/.*", 1).unwrap();
build.insert("POST:/foo/.*/bar/.*", 2).unwrap();
...

log_lines.map(line -> route_mapper.search(line))

The only real change would be to create states which can transition back to itself (i.e. the graph becomes kind-of cyclic), but "transitioning back to itself" is equivalent to "the state remains unchanged" (i.e. the graph remains acyclic).

From my minimal understanding, the features (this might not be exhaustive) which cause regexs to be significantly less performant than an FST:

  • Any Greedy matcher/state [Don't need]
  • Capture Groups [Don't need]
  • Possibly, alternations (but I really hope not) [Do need]

Edit: The big performance consideration I didn't take into account actually was that because states can be shared, any state which can loop back to itself will basically get stuck on the traversal "frontier", and need to be re-checked for each character streaming through the machine. But I suppose this is alright, because its "pay for what you use"?

@BurntSushi
Copy link
Owner

I think you have an interesting use case. My suspicion at this point is that it would be better to prototype this out of tree, and maybe even maintain a separate crate and get experience with it. If it works well, then we can see about merging your changes back in. There are certainly some public API questions to answer here, and that's hard to do without actually playing with an implementation.

I am in part saying this because I just personally don't have the bandwidth to work on something like this or even code review it thoroughly in a PR. My suspicion is that it will be a pretty gnarly change. The structure of the state machine is itself embedded into the binary format of the FST, so changes would need to trickle down to that level.

I'd be happy to mentor this at a distance and answer questions, especially about parts of the code that are opaque or difficult to understand.

@yazaddaruvala
Copy link
Author

Actually, I need this by tomorrow! You don't have time? This is a disaster :)

Totally fair, I get that this would be a tonne of work. Mostly, at this stage, given I have a limited knowledge of FSTs, I'm trying to evaluate if theoretically FSTs can even handle this usecase. You're clearly an expert on string matching, and so I was just trying to get your opinion on the problem (possible other approaches like RegexSet), and see if you might already know some non-starters which eliminate the possibility for any of these features within an FST.

Regarding mentoring: I don't know how far I'll go with this, but if I do, I want to be cognizant of your time. So know I'm never in a rush for answers, take your time, and please just let me know if my questions ever become too much, too often.

Hopefully, you have time to think about some of these questions:
Given "complex states" like \w, ., or *

  • Do you feel like based on my limited description of the problem, looking at other techniques/datastructures might be beneficial? If so, which?
  • Because of the sharing of states, FSTs with "complex states" would continue to be more efficient than RegexSet?
  • FSTs with "complex states" would still be as performant for state transitions? If not, is the degradation going to be a non-starter.
    • Given we would need some special set of bits to mean these complex transitions (Which might not be doable, or would require extra bits i.e. less compact).
  • Are there any commonly used regex features (other than the ones I called out), which are non-starters for FSTs?
    • Do you think alternations like (A|B) would work reasonably well in a FST?

@BurntSushi
Copy link
Owner

:-)

Hmm, so I basically just don't have any intuition on this because I've never really given it any thought. Part of the point of FSTs is that they do prefix and suffix compression, which is only really useful by taking advantage of the redundancy in the keys you store. If you have a million regexes, is there still redundant structure?

I think the only thing I can say is to try and experiment with this. You might consider experimenting on it from scratch with a simpler in-memory data structure. Jumping straight into this crate will be difficult, because you'll really need to have a firm grasp on the structure of the machine in order to know how to encode it to disk.

I do think you're on the right track with respect to limiting yourself to specific aspects of regexes. People have actually done some remarkable work in characterizing exactly what it is that causes DFAs to exhibit exponential memory growth. (There's a lot of fancy lingo there, but I think that if you read it carefully, you'll find that their results probably match your intuition. If your eyes glaze over, page 3 contains some comparatively light material with some examples.) With respect to your specific use case, I don't think you can say "alternations are fine" or "alternations aren't fine," but rather, it is the nature of those alternations that matter. Do they have significant overlap? If so, then you start getting state blow up. If not, then maybe you're fine.

Apologies for not answering your questions directly, but I just don't have access to those answers at my fingertips. :)

One other thought here: if you're looking at matching many regexes, you might consider checking out Hyperscan, which is supposed to specialize in handling huge pattern databases. I don't know if a million regexes is considered "huge" or "really fucking huge," but it might be an interesting baseline to test against. (Among open source software, Hyperscan is, without a doubt, one of the world's most sophisticated regex engines. Up there with PCRE2. So if any regex engine could handle your use case out of the box, it would be Hyperscan.)

@yazaddaruvala
Copy link
Author

I'm going to close this request for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants