
Conversation

murphyatwork
Contributor

  1. change cast(data->'field' as varchar) to get_json_string(data, 'field'), which is a little bit faster
  2. add ddl_flat.sql, which leverages the FlatJson feature to boost query performance while keeping the schema simple
  3. the test results are not added yet; will do that later
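The point-1 rewrite looks roughly like this (a sketch; the path 'commit.collection' is taken from the queries discussed later in this thread):

```sql
-- Before: extract the sub-value as JSON, then cast it to VARCHAR
SELECT cast(data->'commit.collection' AS VARCHAR) AS event
FROM bluesky;

-- After: extract the string value directly, skipping the intermediate JSON value
SELECT get_json_string(data, 'commit.collection') AS event
FROM bluesky;
```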

@@ -0,0 +1,4 @@
CREATE TABLE bluesky (
Member

ddl_flat.sql is currently not called from starrocks/main.sh or starrocks/create_and_load.sh.

Here is the current ddl.sql:

CREATE TABLE bluesky (
    `id` BIGINT AUTO_INCREMENT,
    -- Main JSON column (comes after key columns)
    `data` JSON NULL COMMENT "Main JSON object",
    -- Key columns (must come first in the schema and in the same order as ORDER BY)
    `kind` VARCHAR(255) AS get_json_string(data, '$.kind'),
    `operation` VARCHAR(255) AS get_json_string(data, '$.commit.operation'),
    `collection` VARCHAR(255) AS get_json_string(data, '$.commit.collection'),
    `did` VARCHAR(255) AS get_json_string(data, '$.did'),
    `time_us` BIGINT AS get_json_int(data, '$.time_us')
) ENGINE=OLAP
ORDER BY(`kind`, `operation`, `collection`, `did`, `time_us`)
PROPERTIES (
"compression" = "ZSTD"
);

I don't understand this:

  • is it okay that ddl_flat.sql does not have key columns kind, operation, collection, did and time_us?
  • ddl_flat.sql specifies no ENGINE, is it implicitly OLAP?
  • ddl_flat.sql specifies no compression, is it implicitly ZSTD?
  • what does the term "flat" refer to exactly?

Regarding the latter: JSONBench currently distinguishes between ZSTD and LZ4 compression. I would be fine with measuring only one kind of compression (e.g. for SingleStore, only the default compression is measured, too). If we do that, it would be nice to replace the existing ddl.sql file with ddl_flat.sql.

Contributor Author

  1. yes, the default key would be id. It may not be ideal for sorting or filtering, but it works adequately in most cases.
  2. actually StarRocks only has the OLAP engine right now, and it is the default
  3. the default compression is LZ4
  4. flat means FlatJSON, which automatically extracts common JSON keys into columnar storage during data ingestion & compaction
  5. I would modify main.sh to include this ddl_flat

FlatJSON aims to simplify JSON usage by reducing the need for manually extracting keys. Ideally, StarRocks should implement all possible optimizations for FlatJSON to ensure both efficiency and simplicity. Therefore, I propose adding ddl_flat alongside ddl_lz4/ddl_zstd.
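Given the answers above, the flat variant presumably boils down to something like the following (a sketch based on this discussion, not the actual ddl_flat.sql from the PR):

```sql
-- Sketch of a FlatJSON-style table: no generated key columns,
-- no explicit ENGINE clause (OLAP is the only engine and the default),
-- no compression property (the default is LZ4).
CREATE TABLE bluesky (
    `id` BIGINT AUTO_INCREMENT,                  -- default key; not ideal for sorting/filtering
    `data` JSON NULL COMMENT "Main JSON object"  -- FlatJSON extracts common keys into columnar storage
);
```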

Member

Thoughts:

yes, the default key would be id. It may not be ideal for sorting or filtering, but it works adequately in most cases.

Could you please add this as a comment to ddl_flat.sql?

actually StarRocks only has OLAP engine now, which is the default

In that case, please remove ENGINE=OLAP from ddl_lz4.sql and ddl_zstd.sql, otherwise people might be confused why the engine is specified explicitly.

default is LZ4

Please add a comment in the DDL file that the default compression is LZ4.

flat means FlatJSON, which automatically extracts common JSON keys into columnar storage during data ingestion & compaction

The term "flat" seems a bit loaded. In the context of JSONBench, it could be associated with "flattening" which is forbidden by the benchmark rules (see here).

What about simply "ddl_default.sql" (or "ddl_no_index.sql") as name?

FlatJSON aims to simplify JSON usage by reducing the need for manually extracting keys. Ideally, StarRocks should implement all possible optimizations for FlatJSON to ensure both efficiency and simplicity. Therefore, I propose adding ddl_flat alongside ddl_lz4/ddl_zstd.

A single benchmark run for all scale factors (1, 10, 100, 1000) takes multiple hours. If the majority of StarRocks users use something equivalent to "ddl_flat" anyway, then I suggest removing "ddl_lz4" and "ddl_zstd" from the benchmark (but that is really just a suggestion to simplify benchmarking).

Contributor Author

Understood. "Flat" is a StarRocks-specific term but can be adapted for others. I can rename it to ddl_default.sql, while the other can be labeled ddl_materialized.sql (materialized using Generated Columns). Additionally, since SR doesn't show any advantage with ZSTD, we can stick with the default LZ4.

SELECT cast(data->'commit.collection' AS VARCHAR) AS event,
       hour(from_unixtime(round(divide(cast(data->'time_us' AS BIGINT), 1000000)))) AS hour_of_day,
       count() AS count
FROM bluesky
WHERE (data->'kind' = 'commit')
  AND (data->'commit.operation' = 'create')
  AND array_contains(['app.bsky.feed.post', 'app.bsky.feed.repost', 'app.bsky.feed.like'], cast(data->'commit.collection' AS VARCHAR))
GROUP BY event, hour_of_day
ORDER BY hour_of_day, event;

SELECT cast(data->'$.did' AS VARCHAR) AS user_id,
       min(from_unixtime(round(divide(cast(data->'time_us' AS BIGINT), 1000000)))) AS first_post_date
FROM bluesky
WHERE (data->'kind' = 'commit')
  AND (data->'commit.operation' = 'create')
  AND (data->'commit.collection' = 'app.bsky.feed.post')
GROUP BY user_id
ORDER BY first_post_date ASC
LIMIT 3;

SELECT cast(data->'$.did' AS VARCHAR) AS user_id,
       date_diff('millisecond',
                 min(from_unixtime(round(divide(cast(data->'time_us' AS BIGINT), 1000000)))),
                 max(from_unixtime(round(divide(cast(data->'time_us' AS BIGINT), 1000000))))) AS activity_span
FROM bluesky
WHERE (data->'kind' = 'commit')
  AND (data->'commit.operation' = 'create')
  AND (data->'commit.collection' = 'app.bsky.feed.post')
GROUP BY user_id
ORDER BY activity_span DESC
LIMIT 3;

SELECT get_json_string(data, 'commit.collection') AS event,
       count() AS count
FROM bluesky
GROUP BY event
ORDER BY count DESC;
Member

Ideally, we update the queries and the runtime measurements in the same PR, otherwise the measurements become stale. If you would like me to run the measurements on my local machine, please let me know.

(but let's first clarify the questions in my other comment)

Contributor Author

Yes, please help me run the measurements. Appreciate it.
I attempted to run this benchmark in a 32-core Docker container but obtained results that differed from yours. I'll work on reproducing your results; however, for now, I believe it's best to rely solely on your data.

Contributor Author

BTW, I'm running on the latest version, StarRocks 3.4.1; your results seem to be from StarRocks 3.4.0. If possible, please use 3.4.1 when running the measurements, thanks.

Member

Running the benchmarks right now.

Member

BTW, I'm running on the latest version, StarRocks 3.4.1; your results seem to be from StarRocks 3.4.0. If possible, please use 3.4.1 when running the measurements, thanks.

I did not choose the version intentionally; I am just testing whichever version is loaded by install.sh.

Member

My local measurements for scale factors 1, 10, and 100 succeeded, then for scale factor 1000 only a single file was processed. I think the reason was that I was experimenting with #43 in parallel ... redoing the measurements now.

Contributor Author

By default, SR ensures atomic loading of a file, meaning the load will fail if any records are unqualified. However, SR offers a max_filter_ratio parameter in Stream Load to control this behavior; consider whether you need to adjust it.
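For reference, the parameter is passed as an HTTP header on the Stream Load request; a sketch (host, credentials, database/table names, and file name here are placeholders, not the benchmark's actual values):

```shell
# Stream Load request that tolerates up to 10% unqualified rows
# instead of failing the whole file atomically (max_filter_ratio: 0.1).
curl --location-trusted -u root: \
     -H "Expect: 100-continue" \
     -H "format: json" \
     -H "max_filter_ratio: 0.1" \
     -T file_0001.json \
     -XPUT http://localhost:8030/api/bluesky/bluesky/_stream_load
```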

Member

@murphyatwork I am not very familiar with StarRocks, but feel free to change the scripts so they use the setting. I could then re-benchmark.

@rschu1ze rschu1ze changed the title optimize for starrocks Optimize Starrocks Mar 21, 2025
@rschu1ze rschu1ze mentioned this pull request Mar 23, 2025
@rschu1ze rschu1ze merged commit a41549a into ClickHouse:main Mar 24, 2025