update starrocks result to 4.0.0-rc01 #85
Signed-off-by: Murphy <mofei@starrocks.com>
starrocks/ddl.sql
Outdated
| get_json_string(data, 'did') | ||
| ) | ||
| ) | ||
| DISTRIBUTED BY HASH(sort_key) BUCKETS 128 |
The other databases in JSONBench (except Doris) also don't partition. Is partitioning a prerequisite for using StarRocks? If so, we can keep it; otherwise, it would be nice to avoid it.
There are several important reasons:
- Unlike ClickHouse, which supports single-node tables, StarRocks operates as a distributed database, meaning all tables are inherently distributed.
- If DISTRIBUTED BY HASH is not specified, the table defaults to RANDOM DISTRIBUTION. While this is suitable for small, non-critical tables, it is not ideal for larger or performance-sensitive tables. For such cases, selecting an appropriate DISTRIBUTION KEY is strongly recommended to enhance performance.
> Unlike ClickHouse, which supports single-node tables, StarRocks operates as a distributed database, meaning all tables are inherently distributed.
ClickHouse supports single and multi-node deployments (docs). Also, StarRocks is not exclusively a distributed database; single-node deployments are (of course) supported too.
Fair benchmarking is hard, and every tested database has many tuning knobs that could improve performance. JSONBench intentionally runs all databases with their default configuration. The same applies to (hash) partitioning. As long as partitioning is not the default (i.e. implicitly enabled by StarRocks when someone creates a table), we should test without partitioning. JSONBench focuses on analytics over JSON data, not physical database tuning.
I repeated the measurements locally on m6i.8xlarge with and without partitioning:
With partitioning:
--1 mio rows:
[0.03,0.02,0.02],
[0.06,0.04,0.04],
[0.03,0.03,0.03],
[0.03,0.03,0.03],
[0.03,0.03,0.03],
--10 mio rows:
[0.05,0.03,0.02],
[1.28,0.15,0.16],
[0.61,0.05,0.06],
[0.16,0.04,0.05],
[0.08,0.04,0.04],
--100 mio rows:
[0.10,0.08,0.08],
[6.18,0.93,0.96],
[2.42,0.18,0.20],
[1.30,0.13,0.12],
[0.13,0.13,0.13],
--1000 mio rows:
[1.71,0.85,0.81],
[45.19,6.36,6.37],
[26.87,2.15,2.16],
[29.34,2.52,1.80],
[7.24,6.19,5.43],
Without partitioning:
--1 mio rows:
[0.03,0.03,0.03],
[0.11,0.05,0.04],
[0.04,0.04,0.03],
[0.03,0.02,0.02],
[0.02,0.02,0.02],
--10 mio rows:
[0.07,0.05,0.05],
[0.33,0.32,0.32],
[0.11,0.11,0.09],
[0.03,0.03,0.02],
[0.03,0.03,0.03],
--100 mio rows:
[0.57,0.42,0.43],
[8.04,1.28,1.19],
[0.97,0.77,0.78],
[0.76,0.73,0.78],
[0.78,0.75,0.77],
--1000 mio rows:
[1.02,0.81,0.85],
[8.93,8.70,20.44],
[2.05,2.03,29.45],
[3.25,1.76,21.13],
[7.16,4.21,
Removing partitioning caused slightly higher runtimes for most queries. In the 1000 mio rows case, the third run somehow got a lot slower (the last query did not go through).
So let's revert to non-partitioning?
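To make the comparison above easier to read, here is a minimal sketch that computes a per-query slowdown factor for the 100 mio rows case. The timings are copied from the measurements above; the function names and the choice of "hot" time (the faster of the second and third run) are assumptions made for illustration, not something the benchmark scripts in this PR are known to do.

```python
# Timings in seconds, three runs per query, copied from the
# "100 mio rows" measurements in this thread.
with_partitioning = [
    [0.10, 0.08, 0.08],
    [6.18, 0.93, 0.96],
    [2.42, 0.18, 0.20],
    [1.30, 0.13, 0.12],
    [0.13, 0.13, 0.13],
]
without_partitioning = [
    [0.57, 0.42, 0.43],
    [8.04, 1.28, 1.19],
    [0.97, 0.77, 0.78],
    [0.76, 0.73, 0.78],
    [0.78, 0.75, 0.77],
]


def hot_time(runs):
    """Assumed definition: the faster of the second and third run."""
    return min(runs[1:])


def slowdown(baseline, candidate):
    """Per-query ratio candidate/baseline of hot times (>1 means slower)."""
    return [
        round(hot_time(c) / hot_time(b), 2)
        for b, c in zip(baseline, candidate)
    ]


ratios = slowdown(with_partitioning, without_partitioning)
for query_number, ratio in enumerate(ratios, start=1):
    print(f"Q{query_number}: {ratio}x")
```

The same helper can be pointed at the other dataset sizes; the 1000 mio rows case is left out here because its last row is incomplete.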
Alright, to ensure fairness, we can use the non-partitioned table.
After that, I’ll look into the unusual performance issue causing the third run to be slower.
Signed-off-by: Murphy <mofei@starrocks.com>
This happens if I run the script locally: How to debug this?
I think the error is due to my last commit. However, even if I revert that commit, I am getting tons of these errors:
I tested the script and found the issue lies with the Docker setup. In my case, I manually set up Docker and loaded the dataset, so I did not experience this error.
I made a small update in this commit (194b582), monitoring container logs until the cluster is ready. I believe this adjustment will help the script run more smoothly.
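For readers unfamiliar with the "monitor container logs until the cluster is ready" approach, the following is a rough sketch of the idea. It is not the code from commit 194b582: the helper names, the container name, and the readiness marker string are all hypothetical, and the only external command assumed is `docker logs --follow`.

```python
import subprocess
import time
from typing import Iterable, Iterator


def wait_for_marker(lines: Iterable[str], marker: str, timeout_s: float) -> bool:
    """Return True once a line containing `marker` is seen.

    The timeout is only checked when a new line arrives, so a completely
    silent log stream can block longer than `timeout_s`.
    """
    deadline = time.monotonic() + timeout_s
    for line in lines:
        if marker in line:
            return True
        if time.monotonic() > deadline:
            return False
    return False  # stream ended without the marker


def follow_container_logs(container: str) -> Iterator[str]:
    """Yield log lines from `docker logs --follow <container>`."""
    proc = subprocess.Popen(
        ["docker", "logs", "--follow", container],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    try:
        yield from proc.stdout
    finally:
        proc.terminate()


# Hypothetical usage; "starrocks" and "cluster ready" are placeholders:
# ok = wait_for_marker(follow_container_logs("starrocks"), "cluster ready", 300)
```

Waiting on a log line rather than sleeping for a fixed time avoids running the load step against a cluster that is still starting, which matches the kind of error described above.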
5b0370e to
194b582
starrocks/queries_formatted.sql
Outdated
| FROM bluesky | ||
| to_datetime(min(get_json_int(data, 'time_us')), 6), | ||
| to_datetime(max(get_json_int(data, 'time_us')), 6)) AS activity_span | ||
| FROM bluesky_sorted |
Here we refer to bluesky_sorted but the other four queries in this file refer to bluesky. I could not find bluesky_sorted in ddl.sql.
That's a typo, right? (queries_formatted.sql is only for prettyprinting and not executed)
a typo
No description provided.