Implement Trino entries (single + partitioned Parquet)#856

Merged
alexey-milovidov merged 4 commits into main from add-trino-entry on May 7, 2026
Conversation

@alexey-milovidov
Member

Summary

  • Replaces the placeholder trino/ entry with a working Hive-connector setup over a single 14.7 GB Parquet file. The catalog uses Trino's file metastore and the native local filesystem driver, so the benchmark runs in one Trino Docker container with no Hive Metastore Service or Hadoop dependency.
  • Adds a matching trino-partitioned/ entry over the 100-file athena_partitioned dataset.
  • Both entries handle the BIGINT/UINT16-encoded EventTime/EventDate columns with a thin hits view, so queries.sql matches the canonical ClickBench text.
  • Includes c8g.24xlarge results for both.
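The single-container setup in the first bullet boils down to one Hive catalog file. A minimal sketch, assuming the Parquet data and metastore directory live under /data inside the container (property names follow Trino's Hive connector and local file system documentation; the paths are illustrative, not the PR's actual values):

```properties
# etc/catalog/hive.properties (hypothetical paths)
connector.name=hive
# File-based metastore: schema/table metadata is written as files on disk,
# so no Hive Metastore Service or Hadoop is needed.
hive.metastore=file
hive.metastore.catalog.dir=file:///data/catalog
# Native local filesystem driver for reading the Parquet files directly.
fs.native-local.enabled=true
local.location=/data
```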

Test plan

  • trino/benchmark.sh runs end-to-end on a fresh Ubuntu 24.04 host (Docker only).
  • trino-partitioned/benchmark.sh runs end-to-end on the same setup.
  • All 43 queries return non-empty results on both setups.

alexey-milovidov and others added 4 commits May 6, 2026 23:08
The existing trino/ directory had only a placeholder benchmark.sh. Replace
it with a working Hive-connector setup that uses the file metastore and the
native local filesystem driver, so the entire benchmark runs in a single
Trino Docker container with no Hive Metastore Service or Hadoop required.

The Parquet files store EventTime/ClientEventTime/LocalEventTime as Unix
epoch BIGINT and EventDate as a UINT16 day count, so create.sql registers
hits_raw with the on-disk types and exposes a hits view that converts
these columns to TIMESTAMP and DATE. queries.sql then matches the
canonical ClickBench text without modification.
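The view conversion described above can be sketched as follows (column list abridged; hits_raw and the ClickBench column names are from the PR description, the exact function calls are an assumption based on Trino's date/time functions):

```sql
-- Sketch: expose a hits view over hits_raw so queries.sql needs no edits.
CREATE VIEW hits AS
SELECT
    -- Unix epoch BIGINT -> TIMESTAMP
    from_unixtime(EventTime) AS EventTime,
    from_unixtime(ClientEventTime) AS ClientEventTime,
    from_unixtime(LocalEventTime) AS LocalEventTime,
    -- UINT16 day count since epoch -> DATE
    date_add('day', CAST(EventDate AS integer), DATE '1970-01-01') AS EventDate,
    WatchID,
    JavaEnable
    -- ...remaining columns pass through unchanged
FROM hits_raw;
```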

trino-partitioned/ uses the same setup against the 100-file
athena_partitioned dataset so multi-file scan performance can be compared.

Both entries include c8g.24xlarge results.
The Trino container runs as uid 1000 ("trino"), but cloud-init runs
benchmark.sh as root, so data/ ends up root-owned and the file
metastore CREATE SCHEMA fails with "Could not write database schema".
Chown the data dir to 1000:1000 before starting the container.
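A sketch of that fix as it would sit in benchmark.sh before the `docker run` (directory name and variables hypothetical):

```shell
# The Trino image runs as uid 1000 ("trino"), but cloud-init runs this
# script as root, so data/ ends up root-owned and the file metastore's
# CREATE SCHEMA cannot write its schema files. Hand the directory to
# uid 1000 before starting the container.
DATA_DIR="${DATA_DIR:-./data}"
TRINO_UID=1000

mkdir -p "$DATA_DIR/hits"
# Needs root (which is the cloud-init case); otherwise warn and continue.
chown -R "$TRINO_UID:$TRINO_UID" "$DATA_DIR" \
    || echo "warning: could not chown $DATA_DIR; re-run as root" >&2
```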

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexey-milovidov alexey-milovidov self-assigned this May 7, 2026
@alexey-milovidov alexey-milovidov merged commit 7799c23 into main May 7, 2026
alexey-milovidov added a commit that referenced this pull request May 9, 2026
…-system layout

These four entries were added on main while this branch was in flight (the
existing trino/ scripts here were a memory-connector stub that never worked
end-to-end). Rebuild each one against the new
install/start/check/stop/load/query/data-size contract so they share
lib/benchmark-common.sh:

- trino, trino-partitioned: Hive connector + file metastore + local Parquet
  hardlinked into data/hits/ (matches main's working impl from PR #856).
- trino-datalake{,-partitioned}: same, plus the AnonymousAWSCredentials shim
  to read clickhouse-public-datasets/hits_compatible/athena from anonymous
  S3 (the published bucket size is reported by data-size since the data is
  read on demand). BENCH_DOWNLOAD_SCRIPT="" — no local dataset to fetch.
- benchmark.sh in all four becomes a 4-line shim. Old run.sh deleted.