
Add a new experimental feature: QueryGraphs #56

Merged: 3 commits merged into master from feature/query-graphs2 on Apr 15, 2020

Conversation

msm-code (Contributor)
This is the first step toward QueryGraphs in production. There are no precise
measurements yet, but it gives us the power to execute queries that were
completely impossible to optimise before. For example:

```
ursadb> select into iterator {1? 2? 3? 4? 5? 6? 7? 8?};
[2020-04-14 23:17:11.177] [info] {
    "result": {
        "file_count": 5,
        "iterator": "1c36d313",
        "mode": "iterator"
    },
    "type": "select"
}
```

There are only 5 files (out of 50k) in the result, a reduction factor of
1/10k. Not bad. The performance is... better than expected (there are
basically no optimisations yet), but needs testing on bigger datasets.

The most important caveat is that this is not stable yet. It's relatively
easy to kill the DB using a query like:

```
select {?? ?? ?? ??};
```

We need a way to prune query graphs during expansion, but I'll leave it
for future PRs (this one is big enough already).
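To see why unconstrained wildcards are dangerous, here is a rough back-of-the-envelope sketch (my own illustration, not ursadb's actual expansion code) of how many concrete byte strings a masked query can stand for. Each `?` nibble multiplies the candidate space by 16, so a query of fully-masked bytes explodes combinatorially:

```python
# Toy illustration (not ursadb's implementation): count how many concrete
# byte strings a masked query like {61 62 ?? 63 64} can match.

def expand_byte(pattern):
    """Expand a masked hex byte like '1?' or '??' into concrete byte values."""
    hi, lo = pattern
    his = range(16) if hi == "?" else [int(hi, 16)]
    los = range(16) if lo == "?" else [int(lo, 16)]
    return [h * 16 + l for h in his for l in los]

def candidate_count(query):
    """Number of concrete byte strings a masked query can stand for."""
    total = 1
    for token in query.split():
        total *= len(expand_byte(token))
    return total

print(candidate_count("61 62 ?? 63 64"))  # 256: one fully-masked byte
print(candidate_count("?? ?? ?? ??"))     # 4294967296: hopeless without pruning
```

This is why naive expansion of `{?? ?? ?? ??}` can kill the DB: the graph would have to represent billions of candidates unless it is pruned during expansion.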

To test, configure cmake with EXPERIMENTAL_QUERY_GRAPHS:

```
cmake -DEXPERIMENTAL_QUERY_GRAPHS=ON ..
make
```
msm-code requested a review from chivay on Apr 14, 2020 21:45

```cpp
    return result;
}

QueryResult QueryGraph::run(const QueryFunc &oracle) const {
```
msm-code (Contributor, Author) commented:
Is it not a strange fate that we should suffer so much fear and doubt for so small a thing?

msm-code (Contributor, Author) commented on Apr 15, 2020:

Small benchmark from me:

OLD:

```
[nix-shell:~/opt]$ python3 nanobench.py
select "abc";                            average      1.300 files: 179
select "abcdefgh";                       average      0.491 files: 25
select {61 62 ??};                       average     34.951 files: 20532
select {61 62 ?? 63};                    average     21.709 files: 8912
select {61 62 ?? 63 64};                 average     19.058 files: 3882
select {?1 ?2 ?3};                       average    281.624 files: 50075
select {?1 ?2 ?3 ?4};                    average    510.384 files: 48855
select {?1 ?2 ?3 ?4 ?5};                 average    717.313 files: 48076
select {?1 ?2 ?3 ?4 ?5 ?6};              average    915.995 files: 47369
select {?1 ?2 ?3 ?4 ?5 ?6 ?7};           average   1206.581 files: 46487
select {?1 ?2 ?3 ?? ?5 ?6 ?7};           average  13239.535 files: 48496
select {62 ?? 63 ?? 64};                 average      0.689 files: [ERRORED]
select {61 62 ?? 63 ?? 64 65};           average      0.516 files: [ERRORED]
select {61 62 ?? 63 ?? 64 ?? 65 66};     average      0.717 files: [ERRORED]
select {61 62 63 64 64 1? 2? 3? 4? 5?};  average    681.255 files: 0
select {1? 2? 3? 4? 5? 61 62 63 64 64};  average    683.184 files: 0
select {61 62 63 64 ?? ??};              average      0.506 files: [ERRORED]
select {?? ?? 61 62 63 64};              average      0.574 files: [ERRORED]
select {61 62 63 64 ?? ?? 65};           average      0.396 files: [ERRORED]
select {65 ?? ?? 61 62 63 64};           average      0.662 files: [ERRORED]
```

CURRENT:

```
[nix-shell:~/opt]$ python3 nanobench.py
select "abc";                            average      1.228 files: 179
select "abcdefgh";                       average      0.569 files: 25
select {61 62 ??};                       average     34.758 files: 20532
select {61 62 ?? 63};                    average      2.938 files: 183
select {61 62 ?? 63 64};                 average      2.998 files: 12
select {?1 ?2 ?3};                       average    281.533 files: 50075
select {?1 ?2 ?3 ?4};                    average    214.231 files: 26125
select {?1 ?2 ?3 ?4 ?5};                 average    137.889 files: 4001
select {?1 ?2 ?3 ?4 ?5 ?6};              average    138.636 files: 500
select {?1 ?2 ?3 ?4 ?5 ?6 ?7};           average    153.307 files: 114
select {?1 ?2 ?3 ?? ?5 ?6 ?7};           average   2453.104 files: 691
select {62 ?? 63 ?? 64};                 average    221.472 files: 471
select {61 62 ?? 63 ?? 64 65};           average    147.124 files: 39
select {61 62 ?? 63 ?? 64 ?? 65 66};     average    313.255 files: 27
select {61 62 63 64 64 1? 2? 3? 4? 5?};  average     37.451 files: 0
select {1? 2? 3? 4? 5? 61 62 63 64 64};  average    115.385 files: 0
select {61 62 63 64 ?? ??};              average    150.113 files: 34
select {?? ?? 61 62 63 64};              average    261.572 files: 34
select {61 62 63 64 ?? ?? 65};           average    317.216 files: 31
```

Times in ms, number of matched files to the right.

All tasks were performed on a dataset of 51277 random 40 KB files (straight out of /dev/urandom). Before you comment: yes, I know that's not realistic data.

The only index used is gram3, but that's just to keep the data clean - results should generalise to other graph types.
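For readers unfamiliar with the setup, a gram3 index is essentially a map from each 3-byte sequence to the set of files containing it. The following is a toy model of my own (assumed semantics, not ursadb's code) of how a query with one masked byte is answered: OR the posting lists of every concrete trigram the masked position could match, then AND across positions:

```python
# Toy model of a gram3 index (assumption for illustration, not ursadb code):
# trigram -> set of file ids that contain it.
index = {
    b"abc": {1, 2, 3},
    b"bcd": {2, 3},
    b"abd": {4},
}

def files_for_position(candidates):
    """Union of posting lists for all trigrams a masked position allows (OR)."""
    result = set()
    for gram in candidates:
        result |= index.get(gram, set())
    return result

# A query like "ab? cd" yields two trigram positions: "ab?" (here narrowed to
# two candidates) and "?cd" (one candidate). Intersect the per-position unions.
hits = files_for_position([b"abc", b"abd"]) & files_for_position([b"bcd"])
print(sorted(hits))  # [2, 3]
```

The key point is that each masked position multiplies the number of trigram unions, which is exactly what the query graph has to keep under control.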

First of all, observe that the new implementation is much faster in almost all (or even all?) tasks involving wildcards.

Second, notice that the old implementation outright refused to run some queries. That is no longer a problem.

And finally, the most important thing: notice how in many cases the old code returned something like 48076 files (almost the complete database, 93.7% of files), while the new implementation returned just 4001. And there are even more extreme examples.

Even with this change we're still far from optimal though - for example, {61 62 63 64 64 1? 2? 3? 4? 5?} and {1? 2? 3? 4? 5? 61 62 63 64 64} should take roughly equally long, but there is a 5x difference in execution time because the first one gets to narrow the result set quickly while the second one doesn't.
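The ordering effect described above can be demonstrated with a toy left-to-right evaluation (my assumption about the evaluation order, not the actual planner): intersecting a selective posting list first keeps every intermediate set small, while putting it last means carrying huge intermediates through most of the query.

```python
# Toy illustration (assumed left-to-right evaluation, not the real planner):
# intersect posting lists in order, using intermediate set sizes as a
# crude proxy for per-step cost.
def run_query(posting_lists):
    result = posting_lists[0]
    work = len(result)
    for plist in posting_lists[1:]:
        result = result & plist
        work += len(result)  # smaller intermediates = cheaper steps
    return result, work

selective = set(range(10))       # e.g. "61 62 63 64": very few matches
broad = set(range(100_000))      # e.g. "1?": matches almost every file

_, cheap = run_query([selective, broad, broad])   # narrows immediately
_, costly = run_query([broad, broad, selective])  # narrows only at the end
print(cheap < costly)  # True
```

A smarter planner would reorder (or at least cost-estimate) the positions so both forms take roughly the same time, which is the kind of optimisation left for future PRs.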

Anyway, is it really an improvement for real-world queries? I believe so. For example, masking a single byte as in {61 62 ?? 63 64} is pretty common in yara rules (just as often, only the register part of an instruction is masked). The old implementation found 3882 files, while the new one found just 12 (a ~300x improvement!).

To summarise: as soon as I manage to stabilise this feature, we'll get a huge performance and precision boost for ursadb.

msm-code merged commit 0835ea7 into master on Apr 15, 2020
msm-code deleted the feature/query-graphs2 branch on Apr 15, 2020 13:13