
Conversation

murphyatwork
Contributor

  1. change cast(data->'field' as varchar) to get_json_string(data, 'field'), which is a little bit faster
  2. add ddl_flat.sql, which leverages the FlatJson feature to boost query performance while keeping the schema simple
  3. the test results are not added yet; will do that later
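The point-1 rewrite looks roughly like this (a sketch; the path 'commit.collection' is taken from the queries discussed later in this thread):

```sql
-- Before: extract the sub-value as JSON, then cast it to VARCHAR
SELECT cast(data->'commit.collection' AS VARCHAR) AS event
FROM bluesky;

-- After: extract the string value directly, skipping the intermediate JSON value
SELECT get_json_string(data, 'commit.collection') AS event
FROM bluesky;
```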

@@ -0,0 +1,4 @@
CREATE TABLE bluesky (
Member

ddl_flat.sql is currently not called from starrocks/main.sh or starrocks/create_and_load.sh.

Here is the current ddl.sql:

CREATE TABLE bluesky (
    `id` BIGINT AUTO_INCREMENT,
    -- Main JSON column (comes after key columns)
    `data` JSON NULL COMMENT "Main JSON object",
    -- Key columns (must come first in the schema and in the same order as ORDER BY)
    `kind` VARCHAR(255) AS get_json_string(data, '$.kind'),
    `operation` VARCHAR(255) AS get_json_string(data, '$.commit.operation'),
    `collection` VARCHAR(255) AS get_json_string(data, '$.commit.collection'),
    `did` VARCHAR(255) AS get_json_string(data, '$.did'),
    `time_us` BIGINT AS get_json_int(data, '$.time_us')
) ENGINE=OLAP
ORDER BY(`kind`, `operation`, `collection`, `did`, `time_us`)
PROPERTIES (
"compression" = "ZSTD"
);

I don't understand this:

  • is it okay that ddl_flat.sql does not have key columns kind, operation, collection, did and time_us?
  • ddl_flat.sql specifies no ENGINE, is it implicitly OLAP?
  • ddl_flat.sql specifies no compression, is it implicitly ZSTD?
  • what does the term "flat" refer to exactly?

Regarding the latter: JSONBench currently distinguishes between ZSTD and LZ4 compression. I would be fine with measuring only one kind of compression (e.g. for SingleStore, only the default compression is measured, too). If we do that, it would be nice to replace the existing ddl.sql file with ddl_flat.sql.

Contributor Author

  1. yes, the default key would be id. It may not be ideal for sorting or filtering, but it works adequately in most cases.
  2. actually StarRocks only has the OLAP engine right now, and it is the default
  3. the default compression is LZ4
  4. flat means FlatJSON, which automatically extracts common JSON keys into columnar storage during data ingestion & compaction
  5. I would modify main.sh to include this ddl_flat

FlatJSON aims to simplify JSON usage by reducing the need for manually extracting keys. Ideally, StarRocks should implement all possible optimizations for FlatJSON to ensure both efficiency and simplicity. Therefore, I propose adding ddl_flat alongside ddl_lz4/ddl_zstd.
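Given the answers above, the flat variant presumably boils down to something like the following (a sketch based on this discussion, not the actual ddl_flat.sql from the PR):

```sql
-- Sketch of a FlatJSON-style table: no generated key columns,
-- no explicit ENGINE clause (OLAP is the only engine and the default),
-- no compression property (the default is LZ4).
CREATE TABLE bluesky (
    `id` BIGINT AUTO_INCREMENT,                  -- default key; not ideal for sorting/filtering
    `data` JSON NULL COMMENT "Main JSON object"  -- FlatJSON extracts common keys into columnar storage
);
```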

Member

Thoughts:

yes, the default key would be id. It may not be ideal for sorting or filtering, but it works adequately in most cases.

Could you please add this as a comment to ddl_flat.sql?

actually StarRocks only has OLAP engine now, which is the default

In that case, please remove ENGINE=OLAP from ddl_lz4.sql and ddl_zstd.sql, otherwise people might be confused why the engine is specified explicitly.

default is LZ4

Please add a comment in the DDL file that the default compression is LZ4.

flat means FlatJSON, which automatically extracts common JSON keys into columnar storage during data ingestion & compaction

The term "flat" seems a bit loaded. In the context of JSONBench, it could be associated with "flattening" which is forbidden by the benchmark rules (see here).

What about simply "ddl_default.sql" (or "ddl_no_index.sql") as name?

FlatJSON aims to simplify JSON usage by reducing the need for manually extracting keys. Ideally, StarRocks should implement all possible optimizations for FlatJSON to ensure both efficiency and simplicity. Therefore, I propose adding ddl_flat alongside ddl_lz4/ddl_zstd.

A single benchmark run for all scale factors (1, 10, 100, 1000) takes multiple hours. If the majority of StarRocks users use something equivalent to "ddl_flat" anyway, then I suggest removing "ddl_lz4" and "ddl_zstd" from the benchmark (but that is really just a suggestion to simplify benchmarking).

Contributor Author

Understood. "Flat" is a StarRocks-specific term but can be adapted for others. I can rename it to ddl_default.sql, while the other can be labeled ddl_materialized.sql (materialized using Generated Columns). Additionally, since SR doesn't show any advantage with ZSTD, we can stick with the default LZ4.

SELECT cast(data->'commit.collection' AS VARCHAR) AS event,
       hour(from_unixtime(round(divide(cast(data->'time_us' AS BIGINT), 1000000)))) AS hour_of_day,
       count() AS count
FROM bluesky
WHERE (data->'kind' = 'commit')
  AND (data->'commit.operation' = 'create')
  AND array_contains(['app.bsky.feed.post', 'app.bsky.feed.repost', 'app.bsky.feed.like'], cast(data->'commit.collection' AS VARCHAR))
GROUP BY event, hour_of_day
ORDER BY hour_of_day, event;

SELECT cast(data->'$.did' AS VARCHAR) AS user_id,
       min(from_unixtime(round(divide(cast(data->'time_us' AS BIGINT), 1000000)))) AS first_post_date
FROM bluesky
WHERE (data->'kind' = 'commit')
  AND (data->'commit.operation' = 'create')
  AND (data->'commit.collection' = 'app.bsky.feed.post')
GROUP BY user_id
ORDER BY first_post_date ASC
LIMIT 3;

SELECT cast(data->'$.did' AS VARCHAR) AS user_id,
       date_diff('millisecond',
                 min(from_unixtime(round(divide(cast(data->'time_us' AS BIGINT), 1000000)))),
                 max(from_unixtime(round(divide(cast(data->'time_us' AS BIGINT), 1000000))))) AS activity_span
FROM bluesky
WHERE (data->'kind' = 'commit')
  AND (data->'commit.operation' = 'create')
  AND (data->'commit.collection' = 'app.bsky.feed.post')
GROUP BY user_id
ORDER BY activity_span DESC
LIMIT 3;

SELECT get_json_string(data, 'commit.collection') AS event,
       count() AS count
FROM bluesky
GROUP BY event
ORDER BY count DESC;
Member

Ideally, we update the queries and the runtime measurements in the same PR, otherwise the measurements become stale. If you would like me to run the measurements on my local machine, please let me know.

(but let's first clarify the questions in my other comment)

Contributor Author

Yes, please help me run the measurements. Appreciate it.
I attempted to run this benchmark in a 32-core Docker container but obtained results that differed from yours. I'll work on reproducing your results; however, for now, I believe it's best to rely solely on your data.

Contributor Author

BTW, I'm running on the latest version, StarRocks 3.4.1; your results seem to be from StarRocks 3.4.0. If possible, please use 3.4.1 when running the measurements, thanks.

Member

Running the benchmarks right now.

Member

BTW, I'm running on the latest version, StarRocks 3.4.1; your results seem to be from StarRocks 3.4.0. If possible, please use 3.4.1 when running the measurements, thanks.

I did not choose the version intentionally; I am just testing whichever version is loaded by install.sh.

Member

My local measurements for scale factors 1, 10, and 100 succeeded, then for scale factor 1000 only a single file was processed. I think the reason was that I was experimenting with #43 in parallel ... redoing the measurements now.

Contributor Author

By default, SR ensures atomic loading of a file, meaning the load will fail if any records are unqualified. However, SR offers a max_filter_ratio parameter in Stream Load to control this behavior; consider whether you need to adjust it.
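For reference, the parameter is passed as an HTTP header on the Stream Load request; a sketch (host, credentials, database/table names, and file name here are placeholders, not the benchmark's actual values):

```shell
# Stream Load request that tolerates up to 10% unqualified rows
# instead of failing the whole file atomically (max_filter_ratio: 0.1).
curl --location-trusted -u root: \
     -H "Expect: 100-continue" \
     -H "format: json" \
     -H "max_filter_ratio: 0.1" \
     -T file_0001.json \
     -XPUT http://localhost:8030/api/bluesky/bluesky/_stream_load
```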

Member

@murphyatwork I am not very familiar with StarRocks, but feel free to change the scripts so they use the setting. I could then re-benchmark.

@rschu1ze rschu1ze changed the title optimize for starrocks Optimize Starrocks Mar 21, 2025
@rschu1ze rschu1ze mentioned this pull request Mar 23, 2025
@rschu1ze rschu1ze merged commit a41549a into ClickHouse:main Mar 24, 2025