Implement a naive graph pruning algorithm #73

Merged Apr 17, 2020 (1 commit)

@msm-code commented Apr 16, 2020

Constants in OnDiskIndex.cpp explain it all:

// Maximum number of possible values for the edge to be considered.
// If token has more than MAX_EDGE possible values, it will never start
// or end a subgraph. This is to avoid starting a subquery with `??`.
constexpr uint32_t MAX_EDGE = 16;

// Maximum number of possible values for ngram to be considered. If ngram
// has more than MAX_NGRAM possible values, it won't be included in the
// graph and the graph will be split into one or more subgraphs.
constexpr uint32_t MAX_NGRAM = 256 * 256;
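
To make the two thresholds concrete, here is a minimal sketch of how they could be applied to hex tokens such as `BB`, `A?` and `??`. The helper names (`token_values`, `can_be_edge`, `ngram_values`, `ngram_allowed`) are hypothetical and not taken from OnDiskIndex.cpp; the sketch also assumes each `?` stands for one wildcard nibble (16 values) and that "more than" means a strict comparison.

```cpp
#include <cstdint>
#include <string>
#include <vector>

constexpr uint32_t MAX_EDGE = 16;
constexpr uint32_t MAX_NGRAM = 256 * 256;

// Number of byte values a two-character hex token can match.
// Assumption: every '?' is a wildcard nibble and contributes a factor of 16,
// so "BB" -> 1, "A?" -> 16, "??" -> 256.
uint64_t token_values(const std::string &token) {
    uint64_t values = 1;
    for (char c : token) {
        if (c == '?') {
            values *= 16;
        }
    }
    return values;
}

// A token may start or end a subgraph only if it does not have
// more than MAX_EDGE possible values.
bool can_be_edge(const std::string &token) {
    return token_values(token) <= MAX_EDGE;
}

// Number of concrete ngrams a sequence of tokens expands to:
// the product of the per-token counts.
uint64_t ngram_values(const std::vector<std::string> &tokens) {
    uint64_t total = 1;
    for (const auto &token : tokens) {
        total *= token_values(token);
    }
    return total;
}

// An ngram stays in the graph only if it does not have
// more than MAX_NGRAM possible values.
bool ngram_allowed(const std::vector<std::string> &tokens) {
    return ngram_values(tokens) <= MAX_NGRAM;
}
```

Under these assumptions `A?` (16 values) can still be an edge while `??` (256 values) cannot, and `?? DD ??` sits exactly at the `MAX_NGRAM` limit, so it is kept.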

On the one hand, this is far from perfect. For example, there is a huge
performance difference between:

BB CC ?? DD ?? EE FF

And

A? B? C? ?? D? E? F?

The first one executes in milliseconds, while the second one takes several
seconds. Fair enough, the second is literally the worst case (I can't think
of a more complex expression that won't be rejected), but both have 4 ?s
in their worst ngram, yet one runs in 233ms and the other in 4135ms.
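
The "4 ?s in the worst ngram" observation can be checked with a small sketch. It assumes ngrams are windows of three consecutive tokens (trigrams) and that each `?` is a wildcard nibble; the function names are hypothetical and only illustrate the arithmetic, not the actual query planner.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Split a space-separated hex pattern such as "BB CC ?? DD" into tokens.
std::vector<std::string> split_tokens(const std::string &query) {
    std::istringstream in(query);
    std::vector<std::string> tokens;
    std::string token;
    while (in >> token) {
        tokens.push_back(token);
    }
    return tokens;
}

// Assumption: every '?' is a wildcard nibble worth a factor of 16.
uint64_t token_values(const std::string &token) {
    uint64_t values = 1;
    for (char c : token) {
        if (c == '?') {
            values *= 16;
        }
    }
    return values;
}

// Largest number of concrete expansions over all windows of n consecutive
// tokens. Returns 0 when the query is shorter than n tokens.
uint64_t worst_ngram_values(const std::string &query, std::size_t n = 3) {
    const std::vector<std::string> tokens = split_tokens(query);
    uint64_t worst = 0;
    for (std::size_t i = 0; i + n <= tokens.size(); ++i) {
        uint64_t product = 1;
        for (std::size_t j = i; j < i + n; ++j) {
            product *= token_values(tokens[j]);
        }
        worst = std::max(worst, product);
    }
    return worst;
}
```

With these assumptions both patterns above give `worst_ngram_values(...) == 65536`: the worst window is `?? DD ??` in the first and `C? ?? D?` in the second. The difference is that only one window of the first pattern reaches that bound, while three of the five windows of the second do, so the worst ngram alone is a poor predictor of the running time.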

On the other hand, they all finish running successfully, and I guess that's
what matters in the end (no DB crashes).

@msm-code force-pushed the feature/pseudo-pruning branch 2 times, most recently from 6a0aa2c to be8237c on April 16, 2020 21:55
@msm-code merged commit af80816 into master on Apr 17, 2020
@msm-code deleted the feature/pseudo-pruning branch on April 17, 2020 02:43
@msm-code linked an issue on Apr 17, 2020 that may be closed by this pull request
Linked issue: Implement query graph pruning