Optimize Starrocks #32
Conversation
starrocks/ddl_flat.sql
Outdated
@@ -0,0 +1,4 @@
CREATE TABLE bluesky (
ddl_flat.sql is currently not called from starrocks/main.sh or starrocks/create_and_load.sh.
Here is the current ddl.sql:
CREATE TABLE bluesky (
`id` BIGINT AUTO_INCREMENT,
-- Main JSON column (comes after key columns)
`data` JSON NULL COMMENT "Main JSON object",
-- Key columns (must come first in the schema and in the same order as ORDER BY)
`kind` VARCHAR(255) AS get_json_string(data, '$.kind'),
`operation` VARCHAR(255) AS get_json_string(data, '$.commit.operation'),
`collection` VARCHAR(255) AS get_json_string(data, '$.commit.collection'),
`did` VARCHAR(255) AS get_json_string(data, '$.did'),
`time_us` BIGINT AS get_json_int(data, '$.time_us')
) ENGINE=OLAP
ORDER BY(`kind`, `operation`, `collection`, `did`, `time_us`)
PROPERTIES (
"compression" = "ZSTD"
);
I don't understand this:
- Is it okay that ddl_flat.sql does not have the key columns kind, operation, collection, did, and time_us?
- ddl_flat.sql specifies no ENGINE, is it implicitly OLAP?
- ddl_flat.sql specifies no compression, is it implicitly ZSTD?
- What does the term "flat" refer to exactly?
Regarding the latter: JSONBench currently distinguishes ZSTD and LZ4 compression. I would be fine with measuring only one compression kind (e.g. for SingleStore, also only the default compression is measured). If we do that, then it would be nice to replace the existing ddl.sql file with ddl_flat.sql.
- Yes, the default key would be id. It may not be ideal for sorting or filtering, but it works adequately in most cases.
- Actually, StarRocks only has the OLAP engine now, which is the default.
- The default compression is LZ4.
- "flat" means FlatJSON, which automatically extracts common JSON keys into columnar storage during data ingestion & compaction.
- I would modify main.sh to include this ddl_flat.
FlatJSON aims to simplify JSON usage by reducing the need for manually extracting keys. Ideally, StarRocks should implement all possible optimizations for FlatJSON to ensure both efficiency and simplicity. Therefore, I propose adding ddl_flat alongside ddl_lz4/ddl_zstd.
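For illustration, this is the kind of query FlatJSON aims to keep simple: reading JSON paths directly, with no manually extracted key columns. This is only a sketch adapted from the queries further down in this PR, not necessarily an exact benchmark query:
-- Sketch: count events by collection, filtering on the raw JSON column.
-- FlatJSON is expected to have materialized the hot paths (kind, commit.collection)
-- into columnar storage during ingestion/compaction.
SELECT get_json_string(data, 'commit.collection') AS event, count(*) AS cnt
FROM bluesky
WHERE get_json_string(data, 'kind') = 'commit'
GROUP BY event
ORDER BY cnt DESC;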
Thoughts:

> yes, the default key would be id. It may not be ideal for sorting or filtering, but it works adequately in most cases.

Could you please add this as a comment to ddl_flat.sql?

> actually StarRocks only has OLAP engine now, which is the default

In that case, please remove ENGINE=OLAP from ddl_lz4.sql and ddl_zstd.sql, otherwise people might be confused why the engine is specified explicitly.

> default is LZ4

Please add a comment in ddl.sql that the default compression is LZ4.

> flat means FlatJSON, which will extract common keys into columnar storage automatically during data ingestion & compaction

The term "flat" seems a bit loaded. In the context of JSONBench, it could be associated with "flattening", which is forbidden by the benchmark rules (see here).
What about simply "ddl_default.sql" (or "ddl_no_index.sql") as the name?

> FlatJSON aims to simplify JSON usage by reducing the need for manually extracting keys. Ideally, StarRocks should implement all possible optimizations for FlatJSON to ensure both efficiency and simplicity. Therefore, I propose adding ddl_flat alongside ddl_lz4/ddl_zstd.

A single benchmark run for all scale factors (1, 10, 100, 1000) takes multiple hours. If the majority of StarRocks users use something equivalent to "ddl_flat" anyway, then I suggest removing "ddl_lz4" and "ddl_zstd" from the benchmark (but that is really just a suggestion to simplify benchmarking).
Understood. "Flat" is a specific term for StarRocks but can be adapted for others. I can rename it to ddl_default.sql, while the other can be labeled as ddl_materialized.sql (materialized using Generated Columns). Additionally, since SR doesn't show any advantages with ZSTD, we can stick with the default LZ4.
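For reference, a hedged sketch of what the renamed ddl_default.sql could contain, incorporating the comments requested above (the exact file contents are an assumption, based on the ddl.sql shown earlier):
-- ddl_default.sql (hypothetical sketch): rely on StarRocks defaults.
-- Default key: `id` (may not be ideal for sorting/filtering, but adequate in most cases).
-- ENGINE omitted: OLAP is currently the only, and therefore the default, engine.
-- Compression omitted: the default is LZ4.
CREATE TABLE bluesky (
    `id` BIGINT AUTO_INCREMENT,
    `data` JSON NULL COMMENT "Main JSON object"
);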
SELECT cast(data->'commit.collection' AS VARCHAR) AS event, hour(from_unixtime(round(divide(cast(data->'time_us' AS BIGINT), 1000000)))) as hour_of_day, count() AS count FROM bluesky WHERE (data->'kind' = 'commit') AND (data->'commit.operation' = 'create') AND (array_contains(['app.bsky.feed.post', 'app.bsky.feed.repost', 'app.bsky.feed.like'], cast(data->'commit.collection' AS VARCHAR))) GROUP BY event, hour_of_day ORDER BY hour_of_day, event;
SELECT cast(data->'$.did' as VARCHAR) as user_id, min(from_unixtime(round(divide(cast(data->'time_us' AS BIGINT), 1000000)))) AS first_post_date FROM bluesky WHERE (data->'kind' = 'commit') AND (data->'commit.operation' = 'create') AND (data->'commit.collection' = 'app.bsky.feed.post') GROUP BY user_id ORDER BY first_post_date ASC LIMIT 3;
SELECT cast(data->'$.did' as VARCHAR) as user_id, date_diff('millisecond', min(from_unixtime(round(divide(cast(data->'time_us' AS BIGINT), 1000000)))), max(from_unixtime(round(divide(cast(data->'time_us' AS BIGINT), 1000000))))) AS activity_span FROM bluesky WHERE (data->'kind' = 'commit') AND (data->'commit.operation' = 'create') AND (data->'commit.collection' = 'app.bsky.feed.post') GROUP BY user_id ORDER BY activity_span DESC LIMIT 3;
SELECT get_json_string(data, 'commit.collection') AS event, count() AS count FROM bluesky GROUP BY event ORDER BY count DESC;
Ideally we update the queries and the runtime measurements in the same PR, otherwise the measurements become stale. If you would like me to run the measurements on my local machine, please let me know.
(But let's first clarify the questions in my other comment.)
Yes, please help me run the measurements. Appreciate it.
I attempted to run this benchmark in a 32-core Docker container but obtained results that differed from yours. I’ll work on reproducing your results; however, for now, I believe it’s best to rely solely on your data.
BTW, I'm running on the latest version, StarRocks-3.4.1; your results seem to be on StarRocks-3.4.0. If possible, please use that version when running the measurements, thanks.
Running the benchmarks right now.
> BTW, I'm running on the latest version, StarRocks-3.4.1; your results seem to be on StarRocks-3.4.0. If possible, please use that version when running the measurements, thanks.

I did not choose the version intentionally; I am just testing whichever version is loaded in install.sh.
My local measurements for scale factors 1, 10, and 100 succeeded, but for scale factor 1000 only a single file was processed. I think the reason was that I was experimenting with #43 in parallel ... redoing the measurements now.
By default, SR ensures atomic loading of a file, meaning the load will fail if any records are unqualified. However, SR offers a max_filter_ratio parameter in stream load to control this behavior; consider whether you need to adjust it.
@murphyatwork I am not very familiar with StarRocks, but feel free to change the scripts so they use the setting. I could then re-benchmark.
Signed-off-by: Murphy <mofei@starrocks.com>
Force-pushed from 456d74b to 9c80733, then from 9c80733 to 0da5e48.
- Changed cast(data->'field' as varchar) to get_json_string(data, 'field'), which is a little bit faster.
- Added ddl_flat.sql, which leverages the FlatJson feature to boost query performance while keeping the schema simple.
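For illustration, the kind of rewrite the first point refers to, shown on one of the benchmark queries (a hypothetical before/after; the actual diff may differ):
-- before: extract the collection via the JSON arrow operator plus a cast
SELECT cast(data->'commit.collection' AS VARCHAR) AS event, count(*) AS cnt FROM bluesky GROUP BY event ORDER BY cnt DESC;
-- after: extract it with get_json_string directly (reported as slightly faster)
SELECT get_json_string(data, 'commit.collection') AS event, count(*) AS cnt FROM bluesky GROUP BY event ORDER BY cnt DESC;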