Add Hive and Impala by alexey-milovidov · Pull Request #907 · ClickHouse/ClickBench

alexey-milovidov · 2026-05-15T23:21:26Z

Summary

Hive: apache/hive:4.0.1 in a single container — SERVICE_NAME=hiveserver2 runs HiveServer2 + embedded Derby metastore in one JVM. External table over a single hits.parquet bind-mounted at /clickbench; beeline for the CLI; check probes the HS2 web-UI /jmx for readiness.
Impala: apache/impala:4.4.1 quickstart layout via docker compose — hms / statestored / catalogd / impalad-1 on a private bridge. External table over the same parquet bind-mount; impala-shell.sh in the coordinator; check probes /healthz on port 25000.
Both follow the existing per-system script interface (install/start/stop/check/load/query/data-size + benchmark.sh thin shim), so lib/benchmark-common.sh handles cold cycles, drop_caches, and the QPS sweep without per-system glue.
queries.sql is the canonical ClickBench set adapted for each engine: LIMIT n OFFSET m instead of Trino's OFFSET m LIMIT n, and engine-specific regex backreference quoting in Q29.
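The PR does not say how the pagination clauses were adapted. One way to mechanize the rewrite, assuming each keyword is followed by an integer literal on the same line, is a `sed` substitution. The helper name `swap_offset_limit` and the sample query are illustrative and are not taken from queries.sql.

```shell
#!/usr/bin/env bash
# Sketch: rewrite Trino-style "OFFSET m LIMIT n" into "LIMIT n OFFSET m".
# Assumes m and n are integer literals and both clauses sit on one line.
swap_offset_limit() {
    sed -E 's/OFFSET ([0-9]+) LIMIT ([0-9]+)/LIMIT \2 OFFSET \1/'
}

# Illustrative query, not copied from queries.sql.
rewritten=$(echo 'SELECT URL FROM hits ORDER BY URL OFFSET 1000 LIMIT 10;' | swap_offset_limit)
echo "$rewritten"
```

Lines without an `OFFSET ... LIMIT ...` pair pass through unchanged, so the filter could be applied to a whole query file.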

Closes #889.

Test plan

./run-benchmark.sh with system=hive machine=c6a.4xlarge
./run-benchmark.sh with system=impala machine=c6a.4xlarge
Inspect result.csv for null/skipped queries (Hive may fall back on OFFSET in some 4.0 paths; flag any cases here for follow-up)
Compare wall-clock to existing Trino/Presto numbers as a sanity check

🤖 Generated with Claude Code

Hive: apache/hive:4.0.1 in a single container; SERVICE_NAME=hiveserver2 runs HiveServer2 + embedded Derby metastore in one JVM. External table over a single hits.parquet bind-mounted at /clickbench, beeline for the SQL client, check probes HS2's web-UI /jmx for readiness.

Impala: apache/impala:4.4.1 quickstart layout via docker compose (hms / statestored / catalogd / impalad-1), all on a private bridge. External table over the same parquet bind-mount, impala-shell.sh in the coordinator for SQL, check probes impalad's /healthz on port 25000.

Both follow the existing per-system script interface (install/start/stop/check/load/query/data-size + benchmark.sh thin shim), so the lib/benchmark-common.sh driver handles cold cycles, drop_caches, and the QPS sweep without per-system glue.

queries.sql is the canonical ClickBench set adapted for each engine: LIMIT n OFFSET m instead of Trino's OFFSET m LIMIT n, and the regex backreference in Q29 is written per engine, with the escaping adjusted for shell quoting (Hive: $1; Impala: \\1).

Closes #889.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Hive:
- BENCH_DURABLE=no. apache/hive's entrypoint writes /opt/hive/conf/hiveserver2.pid and refuses to restart HS2 if the file exists, so a `docker stop` + `docker start` leaves the second start a no-op. Mounting the Derby store on a named volume hits a separate ERROR XBM0J ("Directory metastore_db already exists"; Derby refuses to create a database in a directory that already exists, which is always the case for a docker mount). Simpler: rm -f the container on every cold cycle and re-run ./load to rebuild the schema; the load wall-clock rolls into the cold-try timing per the standard contract.
- Bump SERVICE_OPTS to -Xmx<60% RAM>. Default 1 GB heap OOMs the parquet vectorized reader on every non-trivial query ("GC overhead limit exceeded" inside MapRecordProcessor).
- Q43: FLOOR_MINUTE() instead of DATE_TRUNC('MINUTE', ...). Hive 4.0 doesn't ship DATE_TRUNC; SemanticException [Error 10011] otherwise.
- Make ./load idempotent (cold-cycle re-runs only mv the source on the first invocation, then reuse the staged file).

Impala:
- Fix image tags. The non-HMS quickstart images are tagged plainly (4.4.1-statestored, 4.4.1-catalogd, 4.4.1-impalad_coord_exec); only the HMS image carries the `impala_quickstart_` prefix.
- Add a `client` sidecar from apache/impala:4.4.1-impala_quickstart_client with `entrypoint: [sleep, infinity]`. The coordinator image does NOT bundle impala-shell; the binary lives only in the client image at /usr/local/bin/impala-shell. ./load and ./query now `docker exec` into the sidecar and connect over the impala-net bridge to impalad-1:21050.
- README: call out the AVX requirement. Impala's C++ daemons refuse to start on any CPU without AVX, so Graviton hosts (and amd64-via-QEMU on aarch64 dev boxes) cannot run this entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
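The commit message gives the heap setting only as "-Xmx<60% RAM>". A minimal sketch of how such a flag might be derived on Linux follows; `heap_opt_from_kb` is a hypothetical helper, and reading `MemTotal` from /proc/meminfo is an assumption about where the RAM figure comes from.

```shell
#!/usr/bin/env bash
# Sketch: derive a JVM heap flag equal to 60% of total RAM.
# heap_opt_from_kb is a hypothetical helper, not part of the PR.
heap_opt_from_kb() {
    local total_kb=$1
    # 60% of the total, converted from kB to MB (integer division rounds down).
    echo "-Xmx$(( total_kb * 60 / 100 / 1024 ))m"
}

# On Linux, MemTotal in /proc/meminfo is reported in kB.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    echo "SERVICE_OPTS=$(heap_opt_from_kb "$total_kb")"
fi
```

Computing the value at start time rather than hard-coding it keeps the same script usable across machine sizes.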

The default 300s BENCH_CHECK_TIMEOUT was hit on the first cold start. The four-container quickstart stack has to bootstrap the Hive metastore schema before catalogd can register tables with statestored and impalad-1 starts answering /healthz=OK, which doesn't fit in 5 minutes on c6a.4xlarge. 900s gives that path room; real crashes still surface in roughly the same wall-clock because the loop polls every second.

./check now emits docker ps to stderr when curl fails. bench_check_loop captures the last call's stderr and prints it once the timeout fires, so a future timeout reports which container is missing or unhealthy instead of just "did not succeed within Ns".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
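The capture-and-report behaviour described above can be sketched as a small polling loop. This is not the code in lib/benchmark-common.sh: `check_loop_sketch` and its arguments (a maximum number of attempts, a sleep interval, then the check command) are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch of a readiness loop that keeps only the last failed check's stderr.
# Usage: check_loop_sketch <max_attempts> <interval_seconds> <command...>
check_loop_sketch() {
    local attempts=$1 interval=$2
    shift 2
    local last_err="" i=0
    while [ "$i" -lt "$attempts" ]; do
        # Capture stderr only; stdout of the check is discarded.
        if last_err=$("$@" 2>&1 >/dev/null); then
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    echo "did not succeed within $attempts attempts; last check output:" >&2
    printf '%s\n' "$last_err" >&2
    return 1
}
```

Because `last_err` is overwritten on every iteration, only the final check's diagnostics are printed when the loop gives up.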

The previous diagnostic told us impalad-1 exited 139 (SIGSEGV) and catalogd exited 1, but not *why*. `docker logs` output is preserved for exited containers until they're removed; tail the last 50 lines for each so the next readiness-timeout report shows the JVM/native stack trace, config-validation error, or OOM line that actually killed the daemon.

bench_check_loop overwrites last_err on each iteration, so only the final check's output survives and there is no risk of log spam in the run record.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…sspath)

4.4.1's impalad_coord_exec and catalogd images crash at startup with

    NoClassDefFoundError: javax/jdo/JDOException
      at o.a.h.hive.metastore.HiveMetaStore.newRetryingHMSHandler
      at o.a.h.hive.metastore.HiveMetaStoreClient.<init>

before the /healthz endpoint ever comes up; both containers exit 139 (SIGSEGV from the C++ "Check failure stack trace" abort path) within seconds of start. jdo-api is required by the embedded HMS client and the 4.4.1 minimal images don't ship it. 4.5.0 is the newest tagged release and bundles the missing jar.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

4.5.0 didn't bring jdo-api back: both impalad_coord_exec and catalogd still crash at startup with

    NoClassDefFoundError: javax/jdo/JDOException
      at o.a.h.hive.metastore.HiveMetaStore.newRetryingHMSHandler

→ SIGSEGV in the C++ "Check failure" abort path, before /healthz ever opens. We can't fix the image packaging downstream, so we have to supply the jar ourselves.

./install now curls jdo-api-3.0.1.jar from Maven Central into data/extra-lib/ once per VM, and docker-compose.yml bind-mounts that file into /opt/impala/lib/jdo-api-3.0.1.jar on both daemon containers, where the /opt/impala/lib/* glob already puts it on the impala daemon classpath.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
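The "once per VM" download can be sketched as a guard around `curl`. The helper name `fetch_once` is hypothetical, and the URL in the usage comment is an assumption based on the standard Maven Central directory layout; the commit message names only the jar and the target directory.

```shell
#!/usr/bin/env bash
# Sketch: download a file only if it is not already present and non-empty.
fetch_once() {
    local url=$1 dest=$2
    [ -s "$dest" ] && return 0    # already fetched on this VM
    mkdir -p "$(dirname "$dest")"
    # Download to a temporary name so an interrupted transfer never looks complete.
    curl -fsSL -o "$dest.tmp" "$url" && mv "$dest.tmp" "$dest"
}

# Usage (URL is an assumption, see above):
# fetch_once \
#   "https://repo1.maven.org/maven2/javax/jdo/jdo-api/3.0.1/jdo-api-3.0.1.jar" \
#   data/extra-lib/jdo-api-3.0.1.jar
```

The temp-file-then-rename step matters here because a half-written jar at the final path would be bind-mounted into the daemons as if it were valid.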

With jdo-api on the classpath the HMS client now starts, but both impalad and catalogd hit

    MetaException: Failed to get driver instance for jdbcUrl=jdbc:derby:;databaseName=metastore_db;create=true
    CAUSED BY: SQLException: No suitable driver

They're trying to spin up their own embedded Derby metastore instead of talking to the dedicated `hms` compose service.

The apache/impala:4.5.0-{catalogd,impalad_coord_exec} images ship /opt/impala/conf/ empty; daemon_entrypoint.sh puts that dir at the head of CLASSPATH, so dropping a hive-site.xml in it is the canonical config injection point. Set hive.metastore.uris to thrift://hms:9083 and bind-mount the file into both daemons.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
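A minimal hive-site.xml carrying just that one property might be generated as below. The property name and value come from the commit message; the output directory is a temporary one chosen so the snippet is runnable anywhere, and the actual file in the PR may contain more settings.

```shell
#!/usr/bin/env bash
# Sketch: write a hive-site.xml that points the HMS client at the `hms` service.
# conf_dir is a scratch directory; in the PR the file is bind-mounted into
# /opt/impala/conf/ on both daemon containers.
conf_dir=$(mktemp -d)
cat > "$conf_dir/hive-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hms:9083</value>
  </property>
</configuration>
EOF
echo "wrote $conf_dir/hive-site.xml"
```

With a remote metastore URI configured, the client connects over Thrift instead of instantiating an embedded Derby database, which is the failure mode described above.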

bench_main drives the query loop as `while IFS= read -r query; do ...; done < queries.sql`, so every command inside the loop body inherits queries.sql as its stdin. hive/load and impala/load both run `sudo docker exec -i hive|impala-client beeline|impala-shell -f ...` with -i but without redirecting stdin. The inner SQL client never reads stdin (it's invoked with -f or -e/-q), but docker exec -i unconditionally forwards host stdin into the container until EOF, draining the queries.sql fd along the way.

For BENCH_DURABLE=no (hive's case), bench_run_query re-runs ./load inside the loop for every cold cycle, so Q1's ./load drains queries.sql and the next bench_main read hits EOF. The loop exits silently after the first query, which explains why the hive logs show exactly one `[t,t,t]` line followed immediately by data-size, with no error message.

Add `< /dev/null` to every docker exec -i call in {hive,impala}/{load,query}. hive/query and impala/query aren't affected today because $(cat) above already drains the printf pipe, but pin them too so the scripts stay safe if anyone reorders the cat or calls them outside the bench wrapper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
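The failure mode can be reproduced without Docker. In the sketch below, `drain_stdin` is a hypothetical stand-in for `docker exec -i` (it reads stdin to EOF and discards it), and the three-line temporary file stands in for queries.sql.

```shell
#!/usr/bin/env bash
# Stand-in for `docker exec -i ...`: consumes all of stdin, like a forwarded fd.
drain_stdin() { cat > /dev/null; }

queries=$(mktemp)
printf 'SELECT 1;\nSELECT 2;\nSELECT 3;\n' > "$queries"

# Count how many lines the read loop sees, with or without the redirect.
count_iterations() {
    local mode=$1 n=0
    while IFS= read -r query; do
        n=$((n + 1))
        if [ "$mode" = fixed ]; then
            drain_stdin < /dev/null   # the fix: give the child its own stdin
        else
            drain_stdin               # the bug: child inherits the loop's stdin
        fi
    done < "$queries"
    echo "$n"
}

buggy=$(count_iterations buggy)
fixed=$(count_iterations fixed)
echo "buggy=$buggy fixed=$fixed"
rm -f "$queries"
```

The unredirected variant stops after one iteration because the child swallows the remaining lines; with `< /dev/null` the loop sees all three.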

alexey-milovidov and others added 8 commits May 15, 2026 23:21

alexey-milovidov wants to merge 8 commits into main from add-hive-impala
