Add a new experimental feature: QueryGraphs #56
This is the first step to QueryGraphs on production. There are no precise measurements yet, but it gives us power to execute queries that would be completely impossible to optimize before. For example:

```
ursadb> select into iterator {1? 2? 3? 4? 5? 6? 7? 8?};
[2020-04-14 23:17:11.177] [info] {
    "result": {
        "file_count": 5,
        "iterator": "1c36d313",
        "mode": "iterator"
    },
    "type": "select"
}
```

There are only 5 files (out of 50k) in the result, a reduction factor of 1/10k. Not bad. The performance is... better than expected (there are basically no optimisations yet), but needs testing on bigger datasets.

The most important caveat is that this is not stable yet. It's relatively easy to kill the DB using a query like:

```
select {?? ?? ?? ??};
```

We need a way to prune query graphs during expansion, but I'll leave it for future PRs (this one is big enough already).

To test, configure cmake with EXPERIMENTAL_QUERY_GRAPHS:

```
cmake -DEXPERIMENTAL_QUERY_GRAPHS=ON ..
make
```
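To get a feel for why `select {?? ?? ?? ??};` is so much more dangerous than the eight-token query above, here is a rough back-of-the-envelope sketch. It is not ursadb code: `token_choices` and `expanded_ngram_count` are hypothetical helpers, and the model (expand every window of `n` consecutive tokens independently, with no pruning) is an assumption made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Number of byte values a two-character hex token can match:
// "41" -> 1, "4?" or "?1" -> 16, "??" -> 256.
std::uint64_t token_choices(const std::string &token) {
    assert(token.size() == 2);
    std::uint64_t choices = 1;
    for (char c : token) {
        if (c == '?') {
            choices *= 16;  // one unknown nibble
        }
    }
    return choices;
}

// Assumed model, not ursadb's algorithm: every window of n consecutive
// tokens is expanded into all concrete n-grams it can match, without
// pruning. Returns the total number of expansions over all windows.
std::uint64_t expanded_ngram_count(const std::vector<std::string> &tokens,
                                   std::size_t n) {
    assert(n >= 1);
    if (tokens.size() < n) {
        return 0;
    }
    std::uint64_t total = 0;
    for (std::size_t start = 0; start + n <= tokens.size(); ++start) {
        std::uint64_t window = 1;
        for (std::size_t i = 0; i < n; ++i) {
            window *= token_choices(tokens[start + i]);
        }
        total += window;
    }
    return total;
}
```

Under this model and with 3-grams, `{1? 2? 3? 4? 5? 6? 7? 8?}` gives 6 windows of 16^3 expansions each (24576 in total), while `{?? ?? ?? ??}` gives 2 windows of 256^3 each (33554432 in total). That gap is the reason some form of pruning during expansion looks necessary.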
    return result;
}

QueryResult QueryGraph::run(const QueryFunc &oracle) const {
Is it not a strange fate that we should suffer so much fear and doubt for so small a thing?
Small benchmark from me:
Times in ms, number of matched files to the right. All tasks performed on a dataset with 51277 random 40kbyte files (straight out of /dev/urandom). Before you comment - yes, I know that's not realistic data. The only index used is gram3, but that's just to keep the data clean - results should generalise to other graph types.

First of all, observe that the new implementation is much faster in almost all (or even all?) tasks involving wildcards.

Second, notice that the old implementation just straight out refused to run some queries. This is not a problem anymore.

And finally, the most important thing - notice how in many cases the old code returned something like 48076 files (so almost the complete database - 93.7% of files), where the new implementation returned just 4001. And there are even more extreme examples.

Even with this change we're still far from optimal though - for example, [...]

Anyway, is it really an improvement for real-world queries? I believe so - for example, masking a single byte like in [...]

To summarise, as soon as I manage to stabilise this feature more we'll get a huge performance and precision boost for ursadb.
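For readers wondering how expanding wildcards can shrink the result set rather than grow it, here is a toy sketch. Everything in it is hypothetical: `ToyIndex`, `window_candidates` and `run_query` are illustrative names, and the "skip windows with wildcards" baseline is an assumed stand-in, not a description of what the old ursadb code did.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

using FileSet = std::set<int>;
using ToyIndex = std::map<std::string, FileSet>;  // trigram -> files containing it
using Token = std::vector<char>;                  // byte values one position may take

FileSet intersect(const FileSet &a, const FileSet &b) {
    FileSet out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.begin()));
    return out;
}

// Union of the posting lists of every concrete trigram that matches
// tokens[start], tokens[start + 1], tokens[start + 2].
FileSet window_candidates(const ToyIndex &index,
                          const std::vector<Token> &tokens, std::size_t start) {
    FileSet out;
    for (char a : tokens[start])
        for (char b : tokens[start + 1])
            for (char c : tokens[start + 2]) {
                auto it = index.find(std::string{a, b, c});
                if (it != index.end()) {
                    out.insert(it->second.begin(), it->second.end());
                }
            }
    return out;
}

// expand_wildcards == false: assumed baseline that skips any window
// containing a wildcard. expand_wildcards == true: every window constrains
// the result. Both intersect the per-window candidate sets.
FileSet run_query(const ToyIndex &index, const std::vector<Token> &tokens,
                  const FileSet &all_files, bool expand_wildcards) {
    assert(tokens.size() >= 3);
    FileSet result = all_files;
    for (std::size_t start = 0; start + 3 <= tokens.size(); ++start) {
        bool exact = true;
        for (std::size_t i = 0; i < 3; ++i) {
            if (tokens[start + i].size() != 1) exact = false;
        }
        if (!exact && !expand_wildcards) continue;
        result = intersect(result, window_candidates(index, tokens, start));
    }
    return result;
}
```

With a toy index where files 1, 2 and 3 all contain `abc` but only files 1 and 2 continue with a matching tail, the query `a b c [c|x] e` returns three files under the baseline and two files once wildcard windows are expanded. The real benchmark numbers come from a very different setup, so treat this only as intuition for the precision gain.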