Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement multi-range slice queries for CQL storage implementation #3816

Open
porunov opened this issue Jun 12, 2023 · 6 comments
Open

Implement multi-range slice queries for CQL storage implementation #3816

porunov opened this issue Jun 12, 2023 · 6 comments

Comments

@porunov
Copy link
Member

porunov commented Jun 12, 2023

Discussion: #3812

As for now JanusGraph can fetch requested properties either using a separate slice query per each property or using a single slice query to fetch all vertex properties. The user should always weight pros and cons of each approach and decide how they want to fetch those properties.
With multiple ranges per single slice query it will be possible to fetch only requested set of properties in a single slice query. Thus, the user won't need to decide between trade-offs of each solution.

@porunov
Copy link
Member Author

porunov commented Jun 13, 2023

I think also the main performance problem for retrieving selected properties via different slice queries is the fact that we await (block) until we fetch one property for all batched vertices and only then we request other vertices. If this process is parallel - we would have better performance for this case.

porunov added a commit to porunov/janusgraph that referenced this issue Jun 14, 2023
…mnValueStore

- Adds possibility to execute groups of same key sets slice queries together
- Implement parallel execution of all provided Slice queries to CQL storage backend
- Adds a basic implementation (i.e. current non-optimized implementation) to any other storage backend which doesn't have optimized implementation right now

Fixes JanusGraph#3824
Related to JanusGraph#3816

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Jun 15, 2023
…mnValueStore

- Adds possibility to execute groups of same key sets slice queries together
- Implement parallel execution of all provided Slice queries to CQL storage backend
- Adds a basic implementation (i.e. current non-optimized implementation) to any other storage backend which doesn't have optimized implementation right now

Fixes JanusGraph#3824
Related to JanusGraph#3816

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Jun 15, 2023
…mnValueStore [cql-tests] [tp-tests]

- Adds possibility to execute groups of same key sets slice queries together
- Implement parallel execution of all provided Slice queries to CQL storage backend
- Adds a basic implementation (i.e. current non-optimized implementation) to any other storage backend which doesn't have optimized implementation right now

Fixes JanusGraph#3824
Related to JanusGraph#3816

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Jun 16, 2023
…mnValueStore [cql-tests] [tp-tests]

- Adds possibility to execute groups of same key sets slice queries together
- Implement parallel execution of all provided Slice queries to CQL storage backend
- Adds a basic implementation (i.e. current non-optimized implementation) to any other storage backend which doesn't have optimized implementation right now

Fixes JanusGraph#3824
Related to JanusGraph#3816

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Jun 16, 2023
…mnValueStore [cql-tests] [tp-tests]

- Adds possibility to execute groups of same key sets slice queries together
- Implement parallel execution of all provided Slice queries to CQL storage backend
- Adds a basic implementation (i.e. current non-optimized implementation) to any other storage backend which doesn't have optimized implementation right now
- Adds queries grouping algorithm (`MultiSliceQueriesGroupingUtil`) required for JanusGraph#3816

Fixes JanusGraph#3824
Related to JanusGraph#3816

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
@porunov
Copy link
Member Author

porunov commented Jun 17, 2023

I think I'm stuck with this issue.
At least I was trying to figure out the implementation for Cassandra and couldn't figure out how to retrieve multiple different specific columns using a single CQL query. CQL doesn't support OR operator. Also a Batch statement doesn't support select operator.
It is known to be anti-pattern to use in operator to query for multiple partition keys, but in our situation we want to query the same key and just find columns in specific ranges. I'm not sure yet how can we achieve it via CQL.
It seems sending a separate CQL query per each column is the only way of doing it, but my benchmarks shows that retrieving 10 columns via a single CQL query is faster than retrieving 3 columns via 3 CQL queries. If we try to retrieve 10 columns via 10 CQL queries the performance is about 3.6 times slower than retrieving the same columns via a single CQL query.

In #3825 I'm experimenting with benchmarks and I find that retrieving all columns for many cases is much faster than retrieving only necessary columns.
Here I tried to add 100 String properties to each vertex (each property is 42 to 44 chars in size).

If we retrieve all 100 properties via a single query the benchmark is like below:

Benchmark                                                 (verticesAmount)  Mode  Cnt     Score     Error  Units
CQLMultiQueryMultiSlicesBenchmark.getValuesAllProperties              5000  avgt    5   429.254 ±  12.147  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesAllProperties             50000  avgt    5  5329.649 ± 871.086  ms/op

Benchmarks to retrieve multiple specific properties (2 properties, 3 properties, 12 properties, 24 properties) is like below:

Benchmark                                                   (verticesAmount)  Mode  Cnt     Score     Error  Units
CQLMultiQueryMultiSlicesBenchmark.getValues2Properties              5000  avgt    5    70.182 ±   9.294  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValues2Properties             50000  avgt    5   884.719 ±  58.783  ms/op

Benchmark                                                   (verticesAmount)  Mode  Cnt     Score     Error  Units
CQLMultiQueryMultiSlicesBenchmark.getValues3Properties              5000  avgt    5   100.436 ± 14.759  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValues3Properties             50000  avgt    5  1249.490 ± 50.597  ms/op

Benchmark                                                   (verticesAmount)  Mode  Cnt     Score     Error  Units
CQLMultiQueryMultiSlicesBenchmark.getValues12Properties              5000  avgt    5   357.143 ±  29.807  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValues12Properties             50000  avgt    5  4805.386 ± 503.603  ms/op

Benchmark                                                   (verticesAmount)  Mode  Cnt     Score     Error  Units
CQLMultiQueryMultiSlicesBenchmark.getValues24Properties              5000  avgt    5   719.683 ±  62.495  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValues24Properties             50000  avgt    5  9816.612 ± 235.833  ms/op

It shows that retrieving all 100 String properties (each 42-44 chars in size) is faster than retrieving 24 properties. I didn't investigate why exactly, but I guess each CQL query adds some overhead and slows down the channel (just assumption).
If we could write a CQL query where we can retrieve columns in several ranges (not in one range as right now) or maybe if there were a way of batching multiple select statements together then I assume we might have a better performance of retrieving multiple different columns for the same key, but I'm not sure Cassandra supports that.

Nevertheless, if anyone has any ideas or thoughts, please, let me know.

cc @li-boxuan @cdegroc @FlorianHockmann

@porunov porunov changed the title Implement multi-range slice queries Implement multi-range slice queries for CQL storage implementation Jun 17, 2023
@porunov
Copy link
Member Author

porunov commented Jun 17, 2023

I converted this issue into CQL related issue because I think with #3825 each storage backend will be responsible for deciding how exactly it's better to fetch multiple slices.

porunov added a commit to porunov/janusgraph that referenced this issue Jun 17, 2023
…mnValueStore [cql-tests] [tp-tests]

- Adds possibility to execute groups of same key sets slice queries together
- Implement parallel execution of all provided Slice queries to CQL storage backend
- Adds a basic implementation (i.e. current non-optimized implementation) to any other storage backend which doesn't have optimized implementation right now
- Adds queries grouping algorithm (`MultiSliceQueriesGroupingUtil`) required for JanusGraph#3816

Fixes JanusGraph#3824
Related to JanusGraph#3816

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Jun 17, 2023
…mnValueStore [cql-tests] [tp-tests]

- Adds possibility to execute groups of same key sets slice queries together
- Implement parallel execution of all provided Slice queries to CQL storage backend
- Adds a basic implementation (i.e. current non-optimized implementation) to any other storage backend which doesn't have optimized implementation right now
- Adds queries grouping algorithm (`MultiSliceQueriesGroupingUtil`) (required for JanusGraph#3816 and also mimimizes keys duplicate collections creation)

Fixes JanusGraph#3824
Related to JanusGraph#3816

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Jun 17, 2023
…mnValueStore [cql-tests] [tp-tests]

- Adds possibility to execute groups of same key sets slice queries together
- Implement parallel execution of all provided Slice queries to CQL storage backend
- Adds a basic implementation (i.e. current non-optimized implementation) to any other storage backend which doesn't have optimized implementation right now
- Adds queries grouping algorithm (`MultiSliceQueriesGroupingUtil`) (required for JanusGraph#3816 and also mimimizes keys duplicate collections creation)

Fixes JanusGraph#3824
Related to JanusGraph#3816

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
@porunov
Copy link
Member Author

porunov commented Jun 17, 2023

I started wondering if we really need to execute range when the difference between sliceStart and sliceEnd is equal to 1 (i.e. key == $key and column1 >= $sliceStart and column1 < $sliceEnd). I could be wrong, but it feels like key = $key and column1 = $sliceStart should give the same result.
If this theory is correct (really didn't investigate that yet) then we could actually combine multiple columns into a single CQL query with something like:
key == $key and column1 IN ($sliceStart1, $sliceStart2, $sliceStart3, ...).
In such case we could actually resolve this issue, but I didn't investigate if this theory actually works.

@porunov
Copy link
Member Author

porunov commented Jun 19, 2023

I started wondering if we really need to execute range when the difference between sliceStart and sliceEnd is equal to 1 (i.e. key == $key and column1 >= $sliceStart and column1 < $sliceEnd). I could be wrong, but it feels like key = $key and column1 = $sliceStart should give the same result.
If this theory is correct (really didn't investigate that yet) then we could actually combine multiple columns into a single CQL query with something like:
key == $key and column1 IN ($sliceStart1, $sliceStart2, $sliceStart3, ...).
In such case we could actually resolve this issue, but I didn't investigate if this theory actually works.

This theory works for properties with cardinality Single.
That said, I doubt it will works for properties with cardinality LIST / SET.
I think this logic also applies for some edge fetching cases. It probably won’t work for MULTI edges but it might work for some other cases.
I tried to combine multiple slice queries into a single CQL query using IN operator and it did improve performance quite good.

Benchmark tests can be compared to benchmarks provided in the PR:
#3825 (comment)

Benchmark                                                                                                          (verticesAmount)  Mode  Cnt     Score      Error  Units
CQLMultiQueryMultiSlicesBenchmark.getValuesAllPropertiesWithAllMultiQuerySlicesUnderMaxRequestsPerConnection                   5000  avgt    5    80.896 ±    4.602  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesAllPropertiesWithAllMultiQuerySlicesUnderMaxRequestsPerConnection                  50000  avgt    5   826.616 ±   35.267  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesAllPropertiesWithUnlimitedBatch                                                     5000  avgt    5    67.762 ±    1.113  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesAllPropertiesWithUnlimitedBatch                                                    50000  avgt    5   752.281 ±   15.674  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesMultiplePropertiesWithAllMultiQuerySlicesUnderMaxRequestsPerConnection              5000  avgt    5   106.457 ±    8.336  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesMultiplePropertiesWithAllMultiQuerySlicesUnderMaxRequestsPerConnection             50000  avgt    5  1164.789 ±  129.096  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesMultiplePropertiesWithSmallBatch                                                    5000  avgt    5   201.621 ±   22.163  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesMultiplePropertiesWithSmallBatch                                                   50000  avgt    5  2099.141 ±  316.161  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesMultiplePropertiesWithUnlimitedBatch                                                5000  avgt    5   108.949 ±   10.711  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesMultiplePropertiesWithUnlimitedBatch                                               50000  avgt    5  1426.985 ±  116.058  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesThreePropertiesWithAllMultiQuerySlicesUnderMaxRequestsPerConnection                 5000  avgt    5    61.831 ±    4.911  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesThreePropertiesWithAllMultiQuerySlicesUnderMaxRequestsPerConnection                50000  avgt    5   655.754 ±   46.568  ms/op
CQLMultiQueryMultiSlicesBenchmark.vertexCentricPropertiesFetching                                                              5000  avgt    5   813.705 ±  247.815  ms/op
CQLMultiQueryMultiSlicesBenchmark.vertexCentricPropertiesFetching                                                             50000  avgt    5  8214.669 ± 3842.645  ms/op

@porunov
Copy link
Member Author

porunov commented Jun 19, 2023

I started wondering if we really need to execute range when the difference between sliceStart and sliceEnd is equal to 1 (i.e. key == $key and column1 >= $sliceStart and column1 < $sliceEnd). I could be wrong, but it feels like key = $key and column1 = $sliceStart should give the same result.
If this theory is correct (really didn't investigate that yet) then we could actually combine multiple columns into a single CQL query with something like:
key == $key and column1 IN ($sliceStart1, $sliceStart2, $sliceStart3, ...).
In such case we could actually resolve this issue, but I didn't investigate if this theory actually works.

This theory works for properties with cardinality Single. That said, I doubt it will works for properties with cardinality LIST / SET. I think this logic also applies for some edge fetching cases. It probably won’t work for MULTI edges but it might work for some other cases. I tried to combine multiple slice queries into a single CQL query using IN operator and it did improve performance quite good.

Benchmark tests can be compared to benchmarks provided in the PR: #3825

Benchmark                                                                                                          (verticesAmount)  Mode  Cnt     Score      Error  Units
CQLMultiQueryMultiSlicesBenchmark.getValuesAllPropertiesWithAllMultiQuerySlicesUnderMaxRequestsPerConnection                   5000  avgt    5    80.896 ±    4.602  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesAllPropertiesWithAllMultiQuerySlicesUnderMaxRequestsPerConnection                  50000  avgt    5   826.616 ±   35.267  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesAllPropertiesWithUnlimitedBatch                                                     5000  avgt    5    67.762 ±    1.113  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesAllPropertiesWithUnlimitedBatch                                                    50000  avgt    5   752.281 ±   15.674  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesMultiplePropertiesWithAllMultiQuerySlicesUnderMaxRequestsPerConnection              5000  avgt    5   106.457 ±    8.336  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesMultiplePropertiesWithAllMultiQuerySlicesUnderMaxRequestsPerConnection             50000  avgt    5  1164.789 ±  129.096  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesMultiplePropertiesWithSmallBatch                                                    5000  avgt    5   201.621 ±   22.163  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesMultiplePropertiesWithSmallBatch                                                   50000  avgt    5  2099.141 ±  316.161  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesMultiplePropertiesWithUnlimitedBatch                                                5000  avgt    5   108.949 ±   10.711  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesMultiplePropertiesWithUnlimitedBatch                                               50000  avgt    5  1426.985 ±  116.058  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesThreePropertiesWithAllMultiQuerySlicesUnderMaxRequestsPerConnection                 5000  avgt    5    61.831 ±    4.911  ms/op
CQLMultiQueryMultiSlicesBenchmark.getValuesThreePropertiesWithAllMultiQuerySlicesUnderMaxRequestsPerConnection                50000  avgt    5   655.754 ±   46.568  ms/op
CQLMultiQueryMultiSlicesBenchmark.vertexCentricPropertiesFetching                                                              5000  avgt    5   813.705 ±  247.815  ms/op
CQLMultiQueryMultiSlicesBenchmark.vertexCentricPropertiesFetching                                                             50000  avgt    5  8214.669 ± 3842.645  ms/op

Tested with SET / LIST properties - it doesn't work. Thus, the optimization with IN statement can be used for SIGNLE properties only. That said, it feels more like a hack rather than optimization. Ideally it would be great to be able to send multiple slice queries via a single CQL query. I.e. something like SELECT * FROM edgestore WHERE key = ?key AND (column1 >= ?sliceStart1 AND column1 < ?sliceEnd1 OR column1 >= ?sliceStart2 AND column1 < ?sliceStart2 OR ...).
That said, OR statement doesn't exist in CQL.
Another solution could be to have some ranges in the IN statement. I.e. SELECT * FROM edgestore WHERE key =? key AND column1 IN(range(sliceStart1, sliceEnd1), range(sliceStart2, sliceEnd2), range(sliceStart3, sliceEnd3), ...). Again, there is no possibility to use range query in IN statement.
Unfortunately, I don't have any ideas on how to properly solve this issue without having CQL support on this.
I think we could speak with Cassandra community about this case. The reason for not having OR operator is understood, but it limits even valid cases when we explicitly interested in just one big raw and want to return several range parts of this raw.

porunov added a commit to porunov/janusgraph that referenced this issue Jun 19, 2023
…operties fetching

Related to JanusGraph#3816 (partially resolves the issue, but only for SINGLE properties use-cases)

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Jun 19, 2023
…operties fetching [cql-tests] [tp-tests]

Related to JanusGraph#3816 (partially resolves the issue, but only for SINGLE properties use-cases)

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Jun 19, 2023
…operties fetching [cql-tests] [tp-tests]

Related to JanusGraph#3816 (partially resolves the issue, but only for SINGLE properties use-cases)

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Jun 21, 2023
…mnValueStore [cql-tests] [tp-tests]

- Adds possibility to execute groups of same key sets slice queries together
- Implement parallel execution of all provided Slice queries to CQL storage backend
- Adds a basic implementation (i.e. current non-optimized implementation) to any other storage backend which doesn't have optimized implementation right now
- Adds queries grouping algorithm (`MultiSliceQueriesGroupingUtil`) (required for JanusGraph#3816 and also mimimizes keys duplicate collections creation)

Fixes JanusGraph#3824
Related to JanusGraph#3816

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Jun 22, 2023
…mnValueStore [cql-tests] [tp-tests]

- Adds possibility to execute groups of same key sets slice queries together
- Implement parallel execution of all provided Slice queries to CQL storage backend
- Adds a basic implementation (i.e. current non-optimized implementation) to any other storage backend which doesn't have optimized implementation right now
- Adds queries grouping algorithm (`MultiSliceQueriesGroupingUtil`) (required for JanusGraph#3816 and also mimimizes keys duplicate collections creation)

Fixes JanusGraph#3824
Related to JanusGraph#3816

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Jun 22, 2023
…mnValueStore [cql-tests] [tp-tests]

- Adds possibility to execute groups of same key sets slice queries together
- Implement parallel execution of all provided Slice queries to CQL storage backend
- Adds a basic implementation (i.e. current non-optimized implementation) to any other storage backend which doesn't have optimized implementation right now
- Adds queries grouping algorithm (`MultiSliceQueriesGroupingUtil`) (required for JanusGraph#3816 and also mimimizes keys duplicate collections creation)

Fixes JanusGraph#3824
Related to JanusGraph#3816

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit that referenced this issue Jun 22, 2023
…mnValueStore [cql-tests] [tp-tests]

- Adds possibility to execute groups of same key sets slice queries together
- Implement parallel execution of all provided Slice queries to CQL storage backend
- Adds a basic implementation (i.e. current non-optimized implementation) to any other storage backend which doesn't have optimized implementation right now
- Adds queries grouping algorithm (`MultiSliceQueriesGroupingUtil`) (required for #3816 and also mimimizes keys duplicate collections creation)

Fixes #3824
Related to #3816

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Jun 23, 2023
…operties fetching [cql-tests] [tp-tests]

Related to JanusGraph#3816 (partially resolves the issue, but only for SINGLE properties use-cases)

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Jun 23, 2023
…operties fetching [cql-tests] [tp-tests]

Related to JanusGraph#3816 (partially resolves the issue, but only for SINGLE properties use-cases)

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Jun 23, 2023
…operties fetching [cql-tests] [tp-tests]

Related to JanusGraph#3816 (partially resolves the issue, but only for SINGLE properties use-cases)

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Jun 23, 2023
…operties fetching [cql-tests] [tp-tests]

Related to JanusGraph#3816 (partially resolves the issue, but only for SINGLE properties use-cases)

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Jun 24, 2023
…operties fetching [cql-tests] [tp-tests]

Related to JanusGraph#3816 (partially resolves the issue, but only for SINGLE properties use-cases)

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Jun 24, 2023
…operties fetching [cql-tests] [tp-tests]

Related to JanusGraph#3816 (partially resolves the issue, but only for SINGLE properties use-cases)

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit that referenced this issue Jun 26, 2023
…operties fetching [cql-tests] [tp-tests]

Related to #3816 (partially resolves the issue, but only for SINGLE properties use-cases)

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

No branches or pull requests

1 participant