
Add a new experimental feature: QueryGraphs #56

Merged: 3 commits merged into master from feature/query-graphs2 on Apr 15, 2020

Conversation

msm-code (Contributor)
This is the first step toward QueryGraphs in production. There are no precise
measurements yet, but it gives us the power to execute queries that were
completely impossible to optimise before. For example:

```
ursadb> select into iterator {1? 2? 3? 4? 5? 6? 7? 8?};
[2020-04-14 23:17:11.177] [info] {
    "result": {
        "file_count": 5,
        "iterator": "1c36d313",
        "mode": "iterator"
    },
    "type": "select"
}
```

There are only 5 files (out of 50k) in the result, a reduction factor of
1/10k. Not bad. The performance is... better than expected (there are
basically no optimisations yet), but needs testing on bigger datasets.

The most important caveat is that this is not stable yet. It's relatively
easy to kill the DB using a query like:

```
select {?? ?? ?? ??};
```

We need a way to prune query graphs during expansion, but I'll leave it
for future PRs (this one is big enough already).
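To see why unconstrained wildcards are dangerous, here is a rough back-of-the-envelope sketch (my own illustration, not ursadb's actual expansion code) of how many concrete byte strings a masked query can stand for. Each `?` nibble multiplies the candidate space by 16, so a query of fully-masked bytes explodes combinatorially:

```python
# Toy illustration (not ursadb's implementation): count how many concrete
# byte strings a masked query like {61 62 ?? 63 64} can match.

def expand_byte(pattern):
    """Expand a masked hex byte like '1?' or '??' into concrete byte values."""
    hi, lo = pattern
    his = range(16) if hi == "?" else [int(hi, 16)]
    los = range(16) if lo == "?" else [int(lo, 16)]
    return [h * 16 + l for h in his for l in los]

def candidate_count(query):
    """Number of concrete byte strings a masked query can stand for."""
    total = 1
    for token in query.split():
        total *= len(expand_byte(token))
    return total

print(candidate_count("61 62 ?? 63 64"))  # 256: one fully-masked byte
print(candidate_count("?? ?? ?? ??"))     # 4294967296: hopeless without pruning
```

This is why naive expansion of `{?? ?? ?? ??}` can kill the DB: the graph would have to represent billions of candidates unless it is pruned during expansion.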

To test, configure cmake with EXPERIMENTAL_QUERY_GRAPHS:

```
cmake -DEXPERIMENTAL_QUERY_GRAPHS=ON ..
make
```
msm-code requested a review from chivay on Apr 14, 2020 21:45

```cpp
    return result;
}

QueryResult QueryGraph::run(const QueryFunc &oracle) const {
```
msm-code (Contributor, Author) commented:
Is it not a strange fate that we should suffer so much fear and doubt for so small a thing?

msm-code (Contributor, Author) commented on Apr 15, 2020:

Small benchmark from me:

OLD:

```
[nix-shell:~/opt]$ python3 nanobench.py
select "abc";                            average      1.300 files: 179
select "abcdefgh";                       average      0.491 files: 25
select {61 62 ??};                       average     34.951 files: 20532
select {61 62 ?? 63};                    average     21.709 files: 8912
select {61 62 ?? 63 64};                 average     19.058 files: 3882
select {?1 ?2 ?3};                       average    281.624 files: 50075
select {?1 ?2 ?3 ?4};                    average    510.384 files: 48855
select {?1 ?2 ?3 ?4 ?5};                 average    717.313 files: 48076
select {?1 ?2 ?3 ?4 ?5 ?6};              average    915.995 files: 47369
select {?1 ?2 ?3 ?4 ?5 ?6 ?7};           average   1206.581 files: 46487
select {?1 ?2 ?3 ?? ?5 ?6 ?7};           average  13239.535 files: 48496
select {62 ?? 63 ?? 64};                 average      0.689 files: [ERRORED]
select {61 62 ?? 63 ?? 64 65};           average      0.516 files: [ERRORED]
select {61 62 ?? 63 ?? 64 ?? 65 66};     average      0.717 files: [ERRORED]
select {61 62 63 64 64 1? 2? 3? 4? 5?};  average    681.255 files: 0
select {1? 2? 3? 4? 5? 61 62 63 64 64};  average    683.184 files: 0
select {61 62 63 64 ?? ??};              average      0.506 files: [ERRORED]
select {?? ?? 61 62 63 64};              average      0.574 files: [ERRORED]
select {61 62 63 64 ?? ?? 65};           average      0.396 files: [ERRORED]
select {65 ?? ?? 61 62 63 64};           average      0.662 files: [ERRORED]
```

CURRENT:

```
[nix-shell:~/opt]$ python3 nanobench.py
select "abc";                            average      1.228 files: 179
select "abcdefgh";                       average      0.569 files: 25
select {61 62 ??};                       average     34.758 files: 20532
select {61 62 ?? 63};                    average      2.938 files: 183
select {61 62 ?? 63 64};                 average      2.998 files: 12
select {?1 ?2 ?3};                       average    281.533 files: 50075
select {?1 ?2 ?3 ?4};                    average    214.231 files: 26125
select {?1 ?2 ?3 ?4 ?5};                 average    137.889 files: 4001
select {?1 ?2 ?3 ?4 ?5 ?6};              average    138.636 files: 500
select {?1 ?2 ?3 ?4 ?5 ?6 ?7};           average    153.307 files: 114
select {?1 ?2 ?3 ?? ?5 ?6 ?7};           average   2453.104 files: 691
select {62 ?? 63 ?? 64};                 average    221.472 files: 471
select {61 62 ?? 63 ?? 64 65};           average    147.124 files: 39
select {61 62 ?? 63 ?? 64 ?? 65 66};     average    313.255 files: 27
select {61 62 63 64 64 1? 2? 3? 4? 5?};  average     37.451 files: 0
select {1? 2? 3? 4? 5? 61 62 63 64 64};  average    115.385 files: 0
select {61 62 63 64 ?? ??};              average    150.113 files: 34
select {?? ?? 61 62 63 64};              average    261.572 files: 34
select {61 62 63 64 ?? ?? 65};           average    317.216 files: 31
```

Times in ms, number of matched files to the right.

All tasks were performed on a dataset of 51277 random 40 KB files (straight out of /dev/urandom). Before you comment: yes, I know that's not realistic data.

The only index used is gram3, but that's just to keep the data clean - results should generalise to other graph types.
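For readers unfamiliar with the setup, a gram3 index is essentially a map from each 3-byte sequence to the set of files containing it. The following is a toy model of my own (assumed semantics, not ursadb's code) of how a query with one masked byte is answered: OR the posting lists of every concrete trigram the masked position could match, then AND across positions:

```python
# Toy model of a gram3 index (assumption for illustration, not ursadb code):
# trigram -> set of file ids that contain it.
index = {
    b"abc": {1, 2, 3},
    b"bcd": {2, 3},
    b"abd": {4},
}

def files_for_position(candidates):
    """Union of posting lists for all trigrams a masked position allows (OR)."""
    result = set()
    for gram in candidates:
        result |= index.get(gram, set())
    return result

# A query like "ab? cd" yields two trigram positions: "ab?" (here narrowed to
# two candidates) and "?cd" (one candidate). Intersect the per-position unions.
hits = files_for_position([b"abc", b"abd"]) & files_for_position([b"bcd"])
print(sorted(hits))  # [2, 3]
```

The key point is that each masked position multiplies the number of trigram unions, which is exactly what the query graph has to keep under control.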

First of all, observe that the new implementation is much faster in almost all (or even all?) tasks involving wildcards.

Second, notice that the old implementation outright refused to run some queries. That is no longer a problem.

And finally, the most important thing: notice how in many cases the old code returned something like 48076 files (almost the complete database, 93.7% of files), while the new implementation returned just 4001. And there are even more extreme examples.

Even with this change we're still far from optimal though - for example, {61 62 63 64 64 1? 2? 3? 4? 5?} and {1? 2? 3? 4? 5? 61 62 63 64 64} should take roughly equally long, but there is a 5x difference in execution time because the first one gets to narrow the result set quickly while the second one doesn't.
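The ordering effect described above can be demonstrated with a toy left-to-right evaluation (my assumption about the evaluation order, not the actual planner): intersecting a selective posting list first keeps every intermediate set small, while putting it last means carrying huge intermediates through most of the query.

```python
# Toy illustration (assumed left-to-right evaluation, not the real planner):
# intersect posting lists in order, using intermediate set sizes as a
# crude proxy for per-step cost.
def run_query(posting_lists):
    result = posting_lists[0]
    work = len(result)
    for plist in posting_lists[1:]:
        result = result & plist
        work += len(result)  # smaller intermediates = cheaper steps
    return result, work

selective = set(range(10))       # e.g. "61 62 63 64": very few matches
broad = set(range(100_000))      # e.g. "1?": matches almost every file

_, cheap = run_query([selective, broad, broad])   # narrows immediately
_, costly = run_query([broad, broad, selective])  # narrows only at the end
print(cheap < costly)  # True
```

A smarter planner would reorder (or at least cost-estimate) the positions so both forms take roughly the same time, which is the kind of optimisation left for future PRs.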

Anyway, is it really an improvement for real-world queries? I believe so. For example, masking a single byte as in {61 62 ?? 63 64} is pretty common in yara rules (just as often, only the register part of an instruction is masked). The old implementation found 3882 files, while the new one found just 12 (a ~300x improvement!).

To summarise: as soon as I manage to stabilise this feature, we'll get a huge performance and precision boost for ursadb.

msm-code merged commit 0835ea7 into master on Apr 15, 2020
msm-code deleted the feature/query-graphs2 branch on Apr 15, 2020 13:13