@rschu1ze
Member

Added a few more measurements for Arc after the scripts were re-submitted.

@rschu1ze rschu1ze changed the title from "A" to "Add more measurements for Arc" Oct 27, 2025
@rschu1ze
Member Author

@xe-nvdk

@rschu1ze rschu1ze merged commit 63ea73e into main Oct 27, 2025
@xe-nvdk
Contributor

xe-nvdk commented Oct 27, 2025

Thank you.

@nwoolmer

Hey @xe-nvdk @rschu1ze

Good job getting this merged back in; the results look much more consistent with DuckDB on Parquet 🎉

However, I was curious about the divergence at the bottom:

image

When running the benchmark myself, it seems that the bottom 7 queries return no results, which could explain how they execute so quickly. Is this reproducible on your end(s)?

sql: SELECT URL, COUNT(*) AS PageViews FROM clickbench.hits WHERE CounterID = 62 AND EventDate >= 16262 AND EventDate <= 16292 AND DontCountHits = 0 AND IsRefresh = 0 AND URL <> '' GROUP BY URL ORDER BY PageViews DESC LIMIT 10
response: {'success': True, 'columns': ['URL', 'PageViews'], 'data': [], 'row_count': 0, 'execution_time_ms': 47.34, 'timestamp': '2025-10-27T18:13:25.479810', 'error': None}
0.0639

sql: SELECT Title, COUNT(*) AS PageViews FROM clickbench.hits WHERE CounterID = 62 AND EventDate >= 16262 AND EventDate <= 16292 AND DontCountHits = 0 AND IsRefresh = 0 AND Title <> '' GROUP BY Title ORDER BY PageViews DESC LIMIT 10
response: {'success': True, 'columns': ['Title', 'PageViews'], 'data': [], 'row_count': 0, 'execution_time_ms': 41.21, 'timestamp': '2025-10-27T18:13:25.703730', 'error': None}
0.0536

sql: SELECT URL, COUNT(*) AS PageViews FROM clickbench.hits WHERE CounterID = 62 AND EventDate >= 16262 AND EventDate <= 16292 AND IsRefresh = 0 AND IsLink <> 0 AND IsDownload = 0 GROUP BY URL ORDER BY PageViews DESC LIMIT 10 OFFSET 1000
response: {'success': True, 'columns': ['URL', 'PageViews'], 'data': [], 'row_count': 0, 'execution_time_ms': 33.03, 'timestamp': '2025-10-27T18:13:25.915403', 'error': None}
0.0450

sql: SELECT TraficSourceID, SearchEngineID, AdvEngineID, CASE WHEN (SearchEngineID = 0 AND AdvEngineID = 0) THEN Referer ELSE '' END AS Src, URL AS Dst, COUNT(*) AS PageViews FROM clickbench.hits WHERE CounterID = 62 AND EventDate >= 16262 AND EventDate <= 16292 AND IsRefresh = 0 GROUP BY TraficSourceID, SearchEngineID, AdvEngineID, Src, Dst ORDER BY PageViews DESC LIMIT 10 OFFSET 1000
response: {'success': True, 'columns': ['TraficSourceID', 'SearchEngineID', 'AdvEngineID', 'Src', 'Dst', 'PageViews'], 'data': [], 'row_count': 0, 'execution_time_ms': 50.31, 'timestamp': '2025-10-27T18:13:26.141661', 'error': None}
0.0660

sql: SELECT URLHash, EventDate, COUNT(*) AS PageViews FROM clickbench.hits WHERE CounterID = 62 AND EventDate >= 16262 AND EventDate <= 16292 AND IsRefresh = 0 AND TraficSourceID IN (-1, 6) AND RefererHash = 3594120000172545465 GROUP BY URLHash, EventDate ORDER BY PageViews DESC LIMIT 10 OFFSET 100
response: {'success': True, 'columns': ['URLHash', 'EventDate', 'PageViews'], 'data': [], 'row_count': 0, 'execution_time_ms': 40.86, 'timestamp': '2025-10-27T18:13:26.374304', 'error': None}
0.0538

sql: SELECT WindowClientWidth, WindowClientHeight, COUNT(*) AS PageViews FROM clickbench.hits WHERE CounterID = 62 AND EventDate >= 16262 AND EventDate <= 16292 AND IsRefresh = 0 AND DontCountHits = 0 AND URLHash = 2868770270353813622 GROUP BY WindowClientWidth, WindowClientHeight ORDER BY PageViews DESC LIMIT 10 OFFSET 10000
response: {'success': True, 'columns': ['WindowClientWidth', 'WindowClientHeight', 'PageViews'], 'data': [], 'row_count': 0, 'execution_time_ms': 43.0, 'timestamp': '2025-10-27T18:13:26.609484', 'error': None}
0.0555

sql: SELECT DATE_TRUNC('minute', CAST(EventTime AS TIMESTAMP)) AS M, COUNT(*) AS PageViews FROM clickbench.hits WHERE CounterID = 62 AND EventDate >= 16275 AND EventDate <= 16276 AND IsRefresh = 0 AND DontCountHits = 0 GROUP BY DATE_TRUNC('minute', CAST(EventTime AS TIMESTAMP)) ORDER BY DATE_TRUNC('minute', CAST(EventTime AS TIMESTAMP)) LIMIT 10 OFFSET 1000
response: {'success': True, 'columns': ['M', 'PageViews'], 'data': [], 'row_count': 0, 'execution_time_ms': 47.99, 'timestamp': '2025-10-27T18:13:26.855125', 'error': None}
0.0609
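A quick way to catch this kind of problem automatically is to scan the responses for successful queries that return zero rows. The following is only a sketch: it assumes responses shaped like the dicts in the log above (`success`, `row_count`), the helper name `find_empty_results` is hypothetical, and the second sample entry is made up for illustration.

```python
from typing import Iterable


def find_empty_results(responses: Iterable[dict]) -> list[int]:
    """Return 1-based positions of successful responses that contain no rows."""
    empty = []
    for position, response in enumerate(responses, start=1):
        # A query that "succeeds" with zero rows is suspicious in a benchmark:
        # it may finish quickly only because the filter matched nothing.
        if response.get("success") and response.get("row_count", 0) == 0:
            empty.append(position)
    return empty


sample = [
    # Shaped like the first response in the log above.
    {"success": True, "data": [], "row_count": 0, "execution_time_ms": 47.34},
    # Hypothetical non-empty response, for contrast only.
    {"success": True, "data": [["x", 1]], "row_count": 1, "execution_time_ms": 120.0},
]

print(find_empty_results(sample))  # [1]
```

Note that an empty result is not always wrong (a large `OFFSET` can legitimately skip every row), so a flagged query is a prompt to compare against another engine, not proof of a bug.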

@rschu1ze
Member Author

That's kind of weird indeed.

@xe-nvdk ?

@xe-nvdk
Contributor

xe-nvdk commented Oct 27, 2025

Let me see if I can replicate this. The data is the downloaded Parquet file, which we query directly without any modification.

@xe-nvdk
Contributor

xe-nvdk commented Oct 27, 2025

Yep, indeed, it is not returning data.

=== Query 37 Results ===
{
"success": true,
"columns": [
"URL",
"PageViews"
],
"data": [],
"row_count": 0,
"execution_time_ms": 37.83,
"timestamp": "2025-10-27T18:49:18.591339",
"error": null
}
...

=== Query 38 Results ===
{
"success": true,
"columns": [
"Title",
"PageViews"
],
"data": [],
"row_count": 0,
"execution_time_ms": 37.13,
"timestamp": "2025-10-27T18:49:18.794525",
"error": null
}
...

=== Query 39 Results ===
{
"success": true,
"columns": [
"URL",
"PageViews"
],
"data": [],
"row_count": 0,
"execution_time_ms": 49.21,
"timestamp": "2025-10-27T18:49:19.005502",
"error": null
}
...

=== Query 40 Results ===
{
"success": true,
"columns": [
"TraficSourceID",
"SearchEngineID",
"AdvEngineID",
"Src",
"Dst",
"PageViews"
],
"data": [],
"row_count": 0,
"execution_time_ms": 64.51,
"timestamp": "2025-10-27T18:49:19.248933",
"error": null
}
...

=== Query 41 Results ===
{
"success": true,
"columns": [
"URLHash",
"EventDate",
"PageViews"
],
"data": [],
"row_count": 0,
"execution_time_ms": 43.5,
"timestamp": "2025-10-27T18:49:19.521065",
"error": null
}
...

=== Query 42 Results ===
{
"success": true,
"columns": [
"WindowClientWidth",
"WindowClientHeight",
"PageViews"
],
"data": [],
"row_count": 0,
"execution_time_ms": 47.22,
"timestamp": "2025-10-27T18:49:19.800042",
"error": null
}
...

=== Query 43 Results ===
{
"success": true,
"columns": [
"M",
"PageViews"
],
"data": [],
"row_count": 0,
"execution_time_ms": 45.42,
"timestamp": "2025-10-27T18:49:20.007235",
"error": null
}
...

Let me see what we are missing in the queries.

@nwoolmer

nwoolmer commented Oct 27, 2025

The deviation may be due to your use of integers in the filters instead of dates. I suspect if you use the same queries as DuckDB (parquet, single), you should get the same results.

Do you recall why you changed them before?

@xe-nvdk
Contributor

xe-nvdk commented Oct 27, 2025

> The deviation may be due to your use of integers in the filters instead of dates. I suspect if you use the same queries as DuckDB (parquet, single), you should get the same results.
>
> Do you recall why you changed them before?

Yes, to match how we save the data, which is in Unix time. But we have been changing so much these days that I'm not 100% sure.

I will try with the DuckDB (parquet) queries, which I guess are the ones you used.

We don't query the Parquet file directly with DuckDB; the query goes through an API endpoint, so I guess there is going to be a little bit of overhead there.

I will post results or findings.

@nwoolmer

nwoolmer commented Oct 27, 2025

They map the EventDate field to an actual date on ingest (view creation):

REPLACE (make_date(EventDate) AS EventDate)

They also use single-shot CLI calls, so I think it should even out with (or be slower than) your API version: https://github.com/ClickHouse/ClickBench/blob/main/duckdb-parquet/run.sh

@xe-nvdk
Contributor

xe-nvdk commented Oct 27, 2025

Found the issue! The EventDate range in the dataset is 15888 to 15917, but queries 37-43 are searching for 16262-16292 (which is way out of range). That's why they're returning no results!
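To make the mismatch concrete, the integer bounds can be converted to calendar dates. This is a minimal sketch that assumes `EventDate` holds days since 1970-01-01 (consistent with the `make_date(EventDate)` mapping mentioned above); the helper names are hypothetical.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def days_to_date(days: int) -> date:
    """Convert a day count since the Unix epoch to a calendar date."""
    return EPOCH + timedelta(days=days)


def date_to_days(d: date) -> int:
    """Convert a calendar date to a day count since the Unix epoch."""
    return (d - EPOCH).days


# Bounds taken from the comment above.
for label, days in [
    ("dataset min", 15888),
    ("dataset max", 15917),
    ("query lower bound", 16262),
    ("query upper bound", 16292),
]:
    print(f"{label}: {days} -> {days_to_date(days).isoformat()}")

# dataset min: 15888 -> 2013-07-02
# dataset max: 15917 -> 2013-07-31
# query lower bound: 16262 -> 2014-07-11
# query upper bound: 16292 -> 2014-08-10
```

Under that assumption the data covers July 2013 while the filters ask for mid-July to mid-August 2014, so the ranges never overlap. Going the other way, `date_to_days(date(2013, 7, 31))` gives 15917.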

I will run this and update the results for c6a.4xlarge.json

@xe-nvdk
Contributor

xe-nvdk commented Oct 27, 2025

OK, we have new values; I'm pushing them now. I rechecked all the queries and every one of them returned rows. Totals: 121.69 s cold run, 36.40 s hot run. So the comparison with DuckDB is going to make more sense now. Thank you @nwoolmer for bringing this to my attention.

    [0.3240, 0.2876, 0.3705],
    [0.4465, 0.4760, 0.3664],
    [0.4526, 0.4729, 0.3246],
    [0.4812, 0.0791, 0.0758],
    [1.1639, 0.3212, 0.3300],
    [1.0129, 0.5477, 0.5545],
    [0.0726, 0.0568, 0.0562],
    [0.0774, 0.0678, 0.0519],
    [0.7381, 0.4311, 0.4511],
    [1.0924, 0.5699, 0.5547],
    [0.4432, 0.1487, 0.1403],
    [1.0393, 0.1834, 0.1746],
    [1.3479, 0.5573, 0.5705],
    [2.4321, 0.9395, 0.9058],
    [0.9376, 0.6361, 0.6543],
    [0.4990, 0.3867, 0.4148],
    [2.3501, 1.0664, 1.0703],
    [2.1065, 0.8039, 0.7907],
    [4.6851, 3.4149, 3.3916],
    [0.1581, 0.0603, 0.0956],
    [9.9643, 0.9384, 0.9358],
    [11.0750, 0.8700, 0.8688],
    [19.9851, 1.7030, 1.6963],
    [2.6626, 0.5353, 0.5198],
    [0.2502, 0.2039, 0.1895],
    [0.9021, 0.2900, 0.2902],
    [0.1929, 0.1430, 0.1550],
    [10.0546, 0.7772, 0.7815],
    [8.9975, 9.0713, 9.0197],
    [0.1288, 0.1200, 0.0923],
    [2.1978, 0.5900, 0.6071],
    [5.8035, 0.7783, 0.8516],
    [5.4261, 2.1362, 2.1842],
    [10.0824, 2.5418, 2.5405],
    [10.0871, 2.6066, 2.5556],
    [0.6097, 0.5709, 0.5771],
    [0.1811, 0.1435, 0.1563],
    [0.1419, 0.1236, 0.1257],
    [0.1530, 0.1019, 0.0939],
    [0.4868, 0.2732, 0.3178],
    [0.0913, 0.0716, 0.0646],
    [0.0867, 0.0712, 0.0579],
    [0.2705, 0.2326, 0.2172]
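The totals quoted above appear to match the per-run column sums of this list (first column for the cold run, second for the hot run). A small sketch for recomputing them follows; the function name `run_totals` is hypothetical, and only the first three rows are used here to keep the example short.

```python
def run_totals(timings: list[list[float]]) -> list[float]:
    """Sum each run (column) across all queries, rounded to 4 decimals."""
    # zip(*timings) transposes rows of [run1, run2, run3] into three columns.
    return [round(sum(column), 4) for column in zip(*timings)]


# First three rows of the list above; pass the full list to check the totals.
sample = [
    [0.3240, 0.2876, 0.3705],
    [0.4465, 0.4760, 0.3664],
    [0.4526, 0.4729, 0.3246],
]

first_run, second_run, third_run = run_totals(sample)
print(first_run, second_run, third_run)  # 1.2231 1.2365 1.0615
```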

4 participants