Conversation

@murphyatwork
Contributor

No description provided.

Signed-off-by: Murphy <mofei@starrocks.com>
get_json_string(data, 'did')
)
)
DISTRIBUTED BY HASH(sort_key) BUCKETS 128
Member

The other databases in JSONBench (except Doris) also don't partition. Is partitioning a prerequisite for using Starrocks? If so, we can keep it; otherwise, it would be nice to avoid it.

Contributor Author

There are several important reasons:

  1. Unlike ClickHouse, which supports single-node tables, StarRocks operates as a distributed database, meaning all tables are inherently distributed.
  2. If DISTRIBUTED BY HASH is not specified, the table defaults to RANDOM DISTRIBUTION. While this is suitable for small, non-critical tables, it is not ideal for larger or performance-sensitive tables. For such cases, selecting an appropriate DISTRIBUTION KEY is strongly recommended to enhance performance.
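For illustration, a minimal DDL sketch of the difference (t_hash and t_random are hypothetical table names; sort_key is the key from the DDL above; syntax per the StarRocks CREATE TABLE docs):

-- Explicit hash distribution, as in this PR:
CREATE TABLE t_hash (
    sort_key BIGINT,
    data JSON
)
DUPLICATE KEY (sort_key)
DISTRIBUTED BY HASH(sort_key) BUCKETS 128;

-- No DISTRIBUTED BY clause: StarRocks falls back to random distribution:
CREATE TABLE t_random (
    sort_key BIGINT,
    data JSON
)
DUPLICATE KEY (sort_key);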

Member

Unlike ClickHouse, which supports single-node tables, StarRocks operates as a distributed database, meaning all tables are inherently distributed.

ClickHouse supports single and multi-node deployments (docs). Also, Starrocks is not exclusively a distributed database; single-node deployments are (of course) supported too.

Fair benchmarking is hard, and there are many tuning knobs in any of the tested databases which could improve performance. JSONBench intentionally runs all databases with their default configuration. The same applies to (hash) partitioning. As long as partitioning is not the default (i.e. implicitly enabled by Starrocks when someone creates a table), we'd better test without it. JSONBench focuses on analytics over JSON data, not physical database tuning.

I repeated the measurements locally on m6i.8xlarge with and without partitioning:

With partitioning:

--1 mio rows:
[0.03,0.02,0.02],
[0.06,0.04,0.04],
[0.03,0.03,0.03],
[0.03,0.03,0.03],
[0.03,0.03,0.03],

--10 mio rows:
[0.05,0.03,0.02],
[1.28,0.15,0.16],
[0.61,0.05,0.06],
[0.16,0.04,0.05],
[0.08,0.04,0.04],

--100 mio rows:
[0.10,0.08,0.08],
[6.18,0.93,0.96],
[2.42,0.18,0.20],
[1.30,0.13,0.12],
[0.13,0.13,0.13],

--1000 mio rows:
[1.71,0.85,0.81],
[45.19,6.36,6.37],
[26.87,2.15,2.16],
[29.34,2.52,1.80],
[7.24,6.19,5.43],

Without partitioning:

--1 mio rows:
[0.03,0.03,0.03],
[0.11,0.05,0.04],
[0.04,0.04,0.03],
[0.03,0.02,0.02],
[0.02,0.02,0.02],

--10 mio rows:
[0.07,0.05,0.05],
[0.33,0.32,0.32],
[0.11,0.11,0.09],
[0.03,0.03,0.02],
[0.03,0.03,0.03],

--100 mio rows:
[0.57,0.42,0.43],
[8.04,1.28,1.19],
[0.97,0.77,0.78],
[0.76,0.73,0.78],
[0.78,0.75,0.77],

--1000 mio rows:
[1.02,0.81,0.85],
[8.93,8.70,20.44],
[2.05,2.03,29.45],
[3.25,1.76,21.13],
[7.16,4.21,

Removing partitioning caused slightly higher runtimes for most queries. In the 1000 mio rows case, the third run somehow got a lot slower (the last query did not go through).

So let's revert to non-partitioning?

Contributor Author

Alright, to ensure fairness, we can use the non-partitioned table.
After that, I’ll look into the unusual performance issue causing the third run to be slower.

@rschu1ze
Member

This happens if I run the script locally:

/data/JSONBench/starrocks (murphy_sr_4.0.0 %=) $ ./main.sh 1
docker 28.1.1+1 from Canonical✓ installed
Hit:1 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu noble InRelease
Hit:2 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu noble-updates InRelease
Hit:3 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu noble-backports InRelease
Hit:4 https://apt.llvm.org/noble llvm-toolchain-noble-19 InRelease
Hit:5 http://security.ubuntu.com/ubuntu noble-security InRelease
Get:6 https://pkgs.tailscale.com/stable/ubuntu jammy InRelease
Fetched 6,578 B in 0s (15.2 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  mysql-client
0 upgraded, 1 newly installed, 0 to remove and 38 not upgraded.
Need to get 0 B/9,412 B of archives.
After this operation, 42.0 kB of additional disk space will be used.
Selecting previously unselected package mysql-client.
(Reading database ... 119056 files and directories currently installed.)
Preparing to unpack .../mysql-client_8.0.43-0ubuntu0.24.04.1_all.deb ...
Unpacking mysql-client (8.0.43-0ubuntu0.24.04.1) ...
Setting up mysql-client (8.0.43-0ubuntu0.24.04.1) ...
Scanning processes...
Scanning candidates...
Scanning linux images...

Pending kernel upgrade!
Running kernel version:
  6.14.0-1010-aws
Diagnostics:
  The currently running kernel version is not the expected kernel version 6.14.0-1012-aws.

Restarting the system to load the new kernel will not be handled automatically, so you should consider rebooting.

Restarting services...

Service restarts being deferred:
 systemctl restart networkd-dispatcher.service
 systemctl restart unattended-upgrades.service

No containers need to be restarted.

No user sessions are running outdated binaries.

No VM guests are running outdated hypervisor (qemu) binaries on this host.
Unable to find image 'starrocks/allin1-ubuntu:latest' locally
latest: Pulling from starrocks/allin1-ubuntu
60d98d907669: Pull complete
ff80fb62b140: Pull complete
5d77066bb972: Pull complete
483a9650bad6: Pull complete
3528f7cb87a3: Pull complete
4098ce6f4e97: Pull complete
d13d950c5ae9: Pull complete
4e8fa276af51: Pull complete
c80d080efd32: Pull complete
Digest: sha256:711dbcdec06a93858bf37ab6e197f36fa707e0488bd4f9c06fa1c666d9ef8149
Status: Downloaded newer image for starrocks/allin1-ubuntu:latest
67f31973527542dff93f4137a2b0e1b2078b10f0cdf49c6e1000052824085511
Create database
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Execute DDL
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Load data
Processing file: /home/ubuntu/data/bluesky/file_0001.json.gz

ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
------------------------------------------------------------------------------------------------------------------------
Physical query plan for query Q1:

ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
------------------------------------------------------------------------------------------------------------------------
Physical query plan for query Q2:
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
------------------------------------------------------------------------------------------------------------------------
Physical query plan for query Q3:

ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
------------------------------------------------------------------------------------------------------------------------
Physical query plan for query Q4:

ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
------------------------------------------------------------------------------------------------------------------------
Physical query plan for query Q5:

ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Running queries on database: bluesky_1m
Clearing file system cache...
File system cache cleared.
Running query: SELECT get_json_string(data, 'commit.collection') AS event, count() AS count FROM bluesky GROUP BY event ORDER BY count DESC;
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Response time:  s
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Response time:  s
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Response time:  s
Clearing file system cache...
File system cache cleared.
Running query: SELECT get_json_string(data, 'commit.collection') AS event, count() AS count, count(DISTINCT get_json_string(data, 'did')) AS users FROM bluesky WHERE (get_json_string(data, 'kind') = 'commit') AND (get_json_string(data, 'commit.operation') = 'create') GROUP BY event ORDER BY count DESC;
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Response time:  s
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Response time:  s
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Response time:  s
Clearing file system cache...
File system cache cleared.
Running query: SELECT get_json_string(data, 'commit.collection') AS event, hour_from_unixtime(get_json_int(data, 'time_us')/1000000) as hour_of_day, count() AS count FROM bluesky WHERE (get_json_string(data, 'kind') = 'commit') AND (get_json_string(data, 'commit.operation') = 'create') AND (array_contains(['app.bsky.feed.post', 'app.bsky.feed.repost', 'app.bsky.feed.like'], get_json_string(data, 'commit.collection'))) GROUP BY event, hour_of_day ORDER BY hour_of_day, event;
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Response time:  s
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Response time:  s
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Response time:  s
Clearing file system cache...
File system cache cleared.
Running query: SELECT get_json_string(data, 'did') as user_id, to_datetime(min(get_json_int(data, 'time_us')), 6) AS first_post_date FROM bluesky WHERE (get_json_string(data, 'kind') = 'commit') AND (get_json_string(data, 'commit.operation') = 'create') AND (get_json_string(data, 'commit.collection') = 'app.bsky.feed.post') GROUP BY user_id ORDER BY first_post_date ASC LIMIT 3;
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Response time:  s
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Response time:  s
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Response time:  s
Clearing file system cache...
File system cache cleared.
Running query: SELECT get_json_string(data, 'did') as user_id, date_diff('millisecond', to_datetime(min(get_json_int(data, 'time_us')), 6), to_datetime(max(get_json_int(data, 'time_us')), 6)) AS activity_span FROM bluesky WHERE (get_json_string(data, 'kind') = 'commit') AND (get_json_string(data, 'commit.operation') = 'create') AND (get_json_string(data, 'commit.collection') = 'app.bsky.feed.post') GROUP BY user_id ORDER BY activity_span DESC LIMIT 3;
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Response time:  s
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Response time:  s
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
Response time:  s
Result written to _m6i.8xlarge_bluesky_1m.results_runtime
Dropping table: bluesky_1m.bluesky
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2
starrocks
starrocks
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following packages were automatically installed and are no longer required:
  mysql-client-8.0 mysql-client-core-8.0 mysql-common
Use 'sudo apt autoremove' to remove them.
The following packages will be REMOVED:
  mysql-client
0 upgraded, 0 newly installed, 1 to remove and 38 not upgraded.
After this operation, 42.0 kB disk space will be freed.
(Reading database ... 119057 files and directories currently installed.)
Removing mysql-client (8.0.43-0ubuntu0.24.04.1) ...
docker removed

How to debug this?

@rschu1ze
Member

I think the error is due to my last commit.

However, even if I revert that commit, I am getting tons of these errors:

mysql: [ERROR] mysql: Empty value for 'port' specified.

@murphyatwork
Contributor Author

ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2

I tested the script and found that the issue lies in the Docker entrypoint.sh script. It starts all processes (FE and BE) asynchronously, meaning the cluster is not immediately ready after docker run returns. It typically takes around 20 seconds for the cluster to initialize; during this time, you may encounter errors such as "Lost connection."

In my case, I manually set up Docker and loaded the dataset, so I did not experience this error.

@murphyatwork
Contributor Author

murphyatwork commented Sep 11, 2025

I made a small update in this commit (194b582) that monitors the container logs until the cluster is ready. I believe this adjustment will help the script run more smoothly.
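For reference, a minimal sketch of such a readiness wait, polling the FE MySQL port rather than tailing the logs (the 9030 query port and passwordless root user are assumptions based on StarRocks defaults):

# Retry until the FE answers on its MySQL port, instead of assuming
# the cluster is up right after `docker run` returns.
for i in $(seq 1 60); do
  if mysql -h 127.0.0.1 -P 9030 -u root -e "SELECT 1" >/dev/null 2>&1; then
    echo "StarRocks is ready."
    break
  fi
  echo "Waiting for StarRocks to start ($i/60)..."
  sleep 2
done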

FROM bluesky
to_datetime(min(get_json_int(data, 'time_us')), 6),
to_datetime(max(get_json_int(data, 'time_us')), 6)) AS activity_span
FROM bluesky_sorted
Member

Here we refer to bluesky_sorted, but the other four queries in this file refer to bluesky. I could not find bluesky_sorted in ddl.sql.

That's a typo, right? (queries_formatted.sql is only for pretty-printing and not executed)

Contributor Author

a typo

