Revamp JanusGraphMultiQueryStrategy for better parent step usage [cql-tests] [tp-tests]

This commit improves JanusGraphMultiQueryStrategy to better support multi-query compatible parent steps.

1. This commit improves support for the `repeat` step by introducing a next-iteration registration process.
Previously, child steps of `repeat` had traversers registered from the start of all outer `repeat` steps, which
could result in duplicate or unnecessary retrievals for the first batch. Moreover, next iterations were not considered.
This commit introduces `repeat` step modes that change the batch registration behaviour
to account for only the closest `repeat` step, all `repeat` steps, or only the starts of all `repeat` steps.

2. This commit adds support for almost all known TinkerPop parent steps.
The exception is the `match` step: it had no proper outer start registration before this commit, and it still has none.

Fixes #3733
Fixes #3735
Fixes #2996

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov committed Jun 5, 2023
1 parent fbcc39f commit 6e73ef4
Showing 32 changed files with 2,172 additions and 238 deletions.
40 changes: 35 additions & 5 deletions docs/changelog.md
@@ -180,7 +180,10 @@ Support for Gryo MessageSerializer [has been dropped in TinkerPop 3.6.0](https:/
and we therefore also no longer support it in JanusGraph.
GraphBinary is now used as the default MessageSerializer.

##### Batch Processing enabled by default
##### Batch Processing enabled by default. Configuration changes.

`query.batch` is now a configuration namespace, so the previous `query.batch` option is replaced by `query.batch.enabled`.
The `query.limit-batch-size` configuration option is renamed to `query.batch.limited`.
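
The two renames above could be applied to an existing properties file with a small helper along these lines. This is a minimal sketch, not part of JanusGraph: the class `BatchConfigMigration` and its `migrate` method are hypothetical names, and the only key mappings it knows are the two stated in this changelog entry. It assumes that an explicitly set new key should win over a renamed old one.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper (not a JanusGraph class): rewrites the two renamed
// option keys mentioned in the changelog to their `query.batch.*` names.
final class BatchConfigMigration {
    // old key -> new key, taken from the changelog text
    private static final Map<String, String> RENAMED = new LinkedHashMap<>();
    static {
        RENAMED.put("query.batch", "query.batch.enabled");
        RENAMED.put("query.limit-batch-size", "query.batch.limited");
    }

    // Returns a copy of `old` with renamed keys; all other keys are copied as-is.
    static Properties migrate(Properties old) {
        Properties result = new Properties();
        for (String key : old.stringPropertyNames()) {
            String newKey = RENAMED.get(key);
            if (newKey == null) {
                // Not a renamed option: keep it unchanged.
                result.setProperty(key, old.getProperty(key));
            } else if (old.getProperty(newKey) == null) {
                // Renamed option, and the new key is not set explicitly.
                result.setProperty(newKey, old.getProperty(key));
            }
            // Otherwise the explicit new key wins and the old key is dropped.
        }
        return result;
    }

    public static void main(String[] args) {
        Properties old = new Properties();
        old.setProperty("query.batch", "false");
        old.setProperty("storage.backend", "cql");
        System.out.println(migrate(old));
    }
}
```

Dropping the old key rather than keeping both avoids leaving a plain `query.batch` value next to options that now live under the `query.batch` namespace.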

[Batch processing](https://docs.janusgraph.org/operations/batch-processing/) allows JanusGraph to fetch a batch of
vertices from the storage backend together instead of requesting each vertex individually which leads to a high number
@@ -194,12 +197,39 @@ This mode therefore solves the problem of having potentially unlimited batch siz
That is why we now enable this mode by default as most users should benefit from this limited batch processing.

If you want to continue using JanusGraph without batch processing, then you have to manually disable it by setting
`query.batch` to `false`.
`query.batch.enabled` to `false`.

`limit-batch-size` configuration option is changed to `limited-batch`.
A new configuration option `limited-batch-size` exists to configure default barrier step size for batch processing
The size of the batches can be limited by using `barrier()` steps if limited batch processing is used (`query.batch.limited` set to `true`).
The `LazyBarrierStrategy` already inserts `barrier()` steps by default for some steps.
A new configuration option `query.batch.limited-size` sets the default barrier step size for batch processing
in cases where `LazyBarrierStrategy` did not insert a `barrier()` step and no user-provided barrier step exists for the
batchable query part. Notice, that `limited-batch-size` is only used when `limited-batch` is `true`.
batchable query part. Note that `query.batch.limited-size` is only used when `query.batch.limited` is `true` (the default in this version).
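
The way these three options interact can be summarised with a small decision function. The sketch below is only an illustrative model of the rules described in this entry and in the configuration reference, not JanusGraph's actual implementation; the class `BatchSizeModel` and the method `effectiveBatchSize` are hypothetical names.

```java
import java.util.OptionalInt;

// Illustrative model only (hypothetical class, not JanusGraph code).
final class BatchSizeModel {
    // The configuration reference treats 2147483647 as "unlimited".
    static final int UNLIMITED = Integer.MAX_VALUE;

    /**
     * @param enabled              value of query.batch.enabled
     * @param limited              value of query.batch.limited
     * @param precedingBarrierSize size of a barrier() step placed by the user or by
     *                             LazyBarrierStrategy, or empty if there is none
     * @param limitedSize          value of query.batch.limited-size
     * @return empty when batching is off, otherwise the modelled batch size
     */
    static OptionalInt effectiveBatchSize(boolean enabled, boolean limited,
                                          OptionalInt precedingBarrierSize, int limitedSize) {
        if (!enabled) {
            return OptionalInt.empty();          // no batching at all
        }
        if (!limited) {
            return OptionalInt.of(UNLIMITED);    // batches are not capped
        }
        if (precedingBarrierSize.isPresent()) {
            return precedingBarrierSize;         // an existing barrier decides the size
        }
        return OptionalInt.of(limitedSize);      // fall back to query.batch.limited-size
    }

    public static void main(String[] args) {
        // Defaults from the configuration reference: enabled, limited, size 2500.
        System.out.println(effectiveBatchSize(true, true, OptionalInt.empty(), 2500));
    }
}
```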

##### Batch registration for nested batch compatible steps is changed for `repeat` step

Previously, batch-compatible steps such as `out`, `in`, and `values` received vertices for batch registration
from all parent `repeat` steps, but in the case of multi-nested `repeat` steps only from their starts
(registration for their subsequent iterations was skipped).
With JanusGraph 1.0.0, vertices from the subsequent iterations of multi-nested `repeat` steps are registered for batching as well.

```groovy
g.V(startVertexId).emit().
repeat(__.repeat(__.in("connects")).emit()).
until(__.loops().is(P.gt(2)))
```
In the example above, the multi-nested `repeat` case previously did not register the vertices returned from the inner `emit()` step
for the next outer iteration, which resulted in sequential calls of `in("connects")` in that iteration. The behaviour is
now changed to register these vertices for the start of the next child `repeat` step.

The behaviour can be controlled by the `query.batch.repeat-step-mode` configuration option.
If the old behaviour is preferable, set `query.batch.repeat-step-mode` to `starts_only_of_all_repeat_parents`.

However, when the transaction cache is small and a `repeat` step traverses more than one level
deep, some vertices may be fetched again unnecessarily.
In such situations `closest_repeat_parent` mode may be preferable to `all_repeat_parents`.
With `closest_repeat_parent` mode, vertices for batch registration are received from the start of the closest
`repeat` step as well as from its end (for the next iteration). All other parent `repeat` steps
are ignored.
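
The difference between the three modes can be written down as a small enum. This is a sketch based on one reading of the mode descriptions in the configuration reference, not the strategy's real code: the enum `RepeatStepModeSketch` and its methods are hypothetical, and the assumption that `all_repeat_parents` covers both starts and next iterations of every parent is an interpretation of that text. Here `depth` 0 stands for the closest parent `repeat` step, 1 for the `repeat` step enclosing it, and so on.

```java
import java.util.Locale;

// Hypothetical model of query.batch.repeat-step-mode (not JanusGraph code).
enum RepeatStepModeSketch {
    CLOSEST_REPEAT_PARENT,
    ALL_REPEAT_PARENTS,
    STARTS_ONLY_OF_ALL_REPEAT_PARENTS;

    // Parses the configuration value, e.g. "closest_repeat_parent".
    static RepeatStepModeSketch fromConfig(String value) {
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    // Does a child start step receive vertices from the START of the
    // repeat parent at the given nesting depth (0 = closest parent)?
    boolean registersFromStart(int depth) {
        return depth == 0 || this != CLOSEST_REPEAT_PARENT;
    }

    // Does a child start step receive vertices from the END of the repeat
    // parent at the given depth, i.e. for that parent's next iteration?
    boolean registersFromNextIteration(int depth) {
        return depth == 0 || this == ALL_REPEAT_PARENTS;
    }

    public static void main(String[] args) {
        for (RepeatStepModeSketch mode : values()) {
            System.out.println(mode + ": outer start=" + mode.registersFromStart(1)
                + ", outer next iteration=" + mode.registersFromNextIteration(1));
        }
    }
}
```

Under this reading, the modes differ only in how they treat `repeat` parents beyond the closest one, which is why a single-level `repeat` behaves the same in all three.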

##### Breaking change for Geoshape GraphBinary serialization

14 changes: 11 additions & 3 deletions docs/configs/janusgraph-cfg.md
@@ -346,19 +346,27 @@ Configuration options for query processing

| Name | Description | Datatype | Default Value | Mutability |
| ---- | ---- | ---- | ---- | ---- |
| query.batch | Whether traversal queries should be batched when executed against the storage backend. This can lead to significant performance improvement if there is a non-trivial latency to the backend. | Boolean | true | MASKABLE |
| query.batch-property-prefetch | Whether to do a batched pre-fetch of all properties on adjacent vertices against the storage backend prior to evaluating a has condition against those vertices. Because these vertex properties will be loaded into the transaction-level cache of recently-used vertices when the condition is evaluated this can lead to significant performance improvement if there are many edges to adjacent vertices and there is a non-trivial latency to the backend. | Boolean | false | MASKABLE |
| query.fast-property | Whether to pre-fetch all properties on first singular vertex property access. This can eliminate backend calls on subsequent property access for the same vertex at the expense of retrieving all properties at once. This can be expensive for vertices with many properties | Boolean | true | MASKABLE |
| query.force-index | Whether JanusGraph should throw an exception if a graph query cannot be answered using an index. Doing so limits the functionality of JanusGraph's graph queries but ensures that slow graph queries are avoided on large graphs. Recommended for production use of JanusGraph. | Boolean | false | MASKABLE |
| query.hard-max-limit | If smart-limit is disabled and no limit is given in the query, query optimizer adds a limit in light of possibly large result sets. It works in the same way as smart-limit except that hard-max-limit is usually a large number. Default value is Integer.MAX_VALUE which effectively disables this behavior. This option does not take effect when smart-limit is enabled. | Integer | 2147483647 | MASKABLE |
| query.ignore-unknown-index-key | Whether to ignore undefined types encountered in user-provided index queries | Boolean | false | MASKABLE |
| query.index-select-strategy | Name of the index selection strategy or full class name. Following shorthands can be used: <br>- `brute-force` (Try all combinations of index candidates and pick up optimal one)<br>- `approximate` (Use greedy algorithm to pick up approximately optimal index candidate)<br>- `threshold-based` (Use index-select-threshold to pick up either `approximate` or `brute-force` strategy on runtime) | String | threshold-based | MASKABLE |
| query.index-select-threshold | Threshold of deciding whether to use brute force enumeration algorithm or fast approximation algorithm for selecting suitable indexes. Selecting optimal indexes for a query is a NP-complete set cover problem. When number of suitable index candidates is no larger than threshold, JanusGraph uses brute force search with exponential time complexity to ensure the best combination of indexes is selected. Only effective when `threshold-based` index select strategy is chosen. | Integer | 10 | MASKABLE |
| query.limited-batch | Configure a maximum batch size for queries against the storage backend. This can be used to ensure responsiveness if batches tend to grow very large. The used batch size is equivalent to the barrier size of a preceding barrier() step. If a step has no preceding barrier(), the default barrier of TinkerPop will be inserted. This option only takes effect if query.batch is enabled. | Boolean | true | MASKABLE |
| query.limited-batch-size | Default batch size (barrier() step size) for queries. This size is applied only for cases where `LazyBarrierStrategy` strategy didn't apply `barrier` step and where user didn't apply barrier step either. This option is used only when `query.limited-batch` is `true`. Notice, value `2147483647` is considered to be unlimited. | Integer | 2500 | MASKABLE |
| query.optimizer-backend-access | Whether the optimizer should be allowed to fire backend queries during the optimization phase. Allowing these will give the optimizer a chance to find more efficient execution plan but also increase the optimization overhead. | Boolean | true | MASKABLE |
| query.smart-limit | Whether the query optimizer should try to guess a smart limit for the query to ensure responsiveness in light of possibly large result sets. Those will be loaded incrementally if this option is enabled. | Boolean | false | MASKABLE |

### query.batch
Configuration options to configure batch queries optimization behavior


| Name | Description | Datatype | Default Value | Mutability |
| ---- | ---- | ---- | ---- | ---- |
| query.batch.enabled | Whether traversal queries should be batched when executed against the storage backend. This can lead to significant performance improvement if there is a non-trivial latency to the backend. If `false` then all other configuration options under `query.batch` namespace are ignored. | Boolean | true | MASKABLE |
| query.batch.limited | Configure a maximum batch size for queries against the storage backend. This can be used to ensure responsiveness if batches tend to grow very large. The used batch size is equivalent to the barrier size of a preceding `barrier()` step. If a step has no preceding `barrier()`, the default barrier of TinkerPop will be inserted. This option only takes effect if `query.batch.enabled` is `true`. | Boolean | true | MASKABLE |
| query.batch.limited-size | Default batch size (barrier() step size) for queries. This size is applied only for cases where `LazyBarrierStrategy` strategy didn't apply `barrier` step and where user didn't apply barrier step either. This option is used only when `query.batch.limited` is `true`. Notice, value `2147483647` is considered to be unlimited. | Integer | 2500 | MASKABLE |
| query.batch.repeat-step-mode | Batch mode for the `repeat` step. Used only when `query.batch.enabled` is `true`.<br>These modes control how child steps with batch support behave if they are placed at the start of the `repeat`, `emit`, or `until` traversals.<br>Supported modes:<br>- `closest_repeat_parent` Child start steps receive vertices for batching from the closest `repeat` step parent only.<br>- `all_repeat_parents` Child start steps receive vertices for batching from all `repeat` step parents.<br>- `starts_only_of_all_repeat_parents` Child start steps receive vertices for batching from the closest `repeat` step parent (both for the parent start and for next iterations) and also from all `repeat` step parents for the parent start. | String | all_repeat_parents | MASKABLE |

### schema
Schema related configuration options


1 comment on commit 6e73ef4

@github-actions

Benchmark

| Benchmark suite | Current: 6e73ef4 | Previous: 3a7ba53 | Ratio |
| ---- | ---- | ---- | ---- |
| org.janusgraph.JanusGraphSpeedBenchmark.basicAddAndDelete | 14448.221230944444 ms/op | 20132.68260361965 ms/op | 0.72 |
| org.janusgraph.GraphCentricQueryBenchmark.getVertices | 1396.0177500724933 ms/op | 1617.8956294193085 ms/op | 0.86 |
| org.janusgraph.MgmtOlapJobBenchmark.runClearIndex | 221.18195505217392 ms/op | 222.76279991304347 ms/op | 0.99 |
| org.janusgraph.MgmtOlapJobBenchmark.runReindex | 467.2156776563637 ms/op | 541.8996275500001 ms/op | 0.86 |
| org.janusgraph.JanusGraphSpeedBenchmark.basicCount | 411.9777834992106 ms/op | 368.0788711730627 ms/op | 1.12 |
| org.janusgraph.CQLMultiQueryBenchmark.getIdToOutVerticesProjection | 425.0886989393221 ms/op | | |
| org.janusgraph.CQLMultiQueryBenchmark.getElementsWithUsingEmitRepeatSteps | 33721.92819198809 ms/op | | |
| org.janusgraph.CQLMultiQueryBenchmark.getAllElementsTraversedFromOuterVertex | 16939.769371643957 ms/op | | |
| org.janusgraph.CQLMultiQueryBenchmark.getNeighborNames | 16673.675699318654 ms/op | 19824.388658859913 ms/op | 0.84 |
| org.janusgraph.CQLMultiQueryBenchmark.getVerticesWithDoubleUnion | 626.635692786763 ms/op | | |
| org.janusgraph.CQLMultiQueryBenchmark.getElementsWithUsingRepeatUntilSteps | 18067.736305916 ms/op | | |
| org.janusgraph.CQLMultiQueryBenchmark.getAdjacentVerticesLocalCounts | 16971.690529527226 ms/op | | |
| org.janusgraph.CQLMultiQueryBenchmark.getNames | 16367.7759909025 ms/op | 19643.66567439762 ms/op | 0.83 |
| org.janusgraph.CQLMultiQueryBenchmark.getVerticesFilteredByAndStep | 666.589312380716 ms/op | | |
| org.janusgraph.CQLMultiQueryBenchmark.getVerticesFromMultiNestedRepeatStepStartingFromSingleVertex | 23044.72647966667 ms/op | | |
| org.janusgraph.CQLMultiQueryBenchmark.getVerticesWithCoalesceUsage | 592.2298148586152 ms/op | | |

This comment was automatically generated by workflow using github-action-benchmark.
