Add Hive and Impala #907
Open
alexey-milovidov wants to merge 8 commits into
Hive: apache/hive:4.0.1 in a single container; SERVICE_NAME=hiveserver2 runs HiveServer2 and the embedded Derby metastore in one JVM. External table over a single hits.parquet bind-mounted at /clickbench, beeline as the SQL client, and check probes HS2's web-UI /jmx for readiness.
Impala: apache/impala:4.4.1 quickstart layout via docker compose (hms / statestored / catalogd / impalad-1), all on a private bridge. External table over the same parquet bind-mount, impala-shell.sh in the coordinator for SQL, and check probes impalad's /healthz on port 25000.
Both follow the existing per-system script interface (install/start/stop/check/load/query/data-size + benchmark.sh thin shim), so the lib/benchmark-common.sh driver handles cold cycles, drop_caches, and the QPS sweep without per-system glue.
queries.sql is the canonical ClickBench set adapted for each engine: LIMIT n OFFSET m instead of Trino's OFFSET m LIMIT n, and the regex backreference in Q29 is written per engine (Hive: $1; Impala: \\1).
Closes #889.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hive:
- BENCH_DURABLE=no. apache/hive's entrypoint writes
/opt/hive/conf/hiveserver2.pid and refuses to restart HS2 if the file
exists, so a `docker stop` + `docker start` leaves the second start a
no-op. Mounting the Derby store on a named volume hits a separate
ERROR XBM0J ("Directory metastore_db already exists" — Derby refuses
to create a database into a dir that pre-exists, which a docker mount
always does). Simpler: rm -f the container on every cold cycle and
re-run ./load to rebuild the schema; the load wall-clock rolls into
the cold-try timing per the standard contract.
- Bump SERVICE_OPTS to -Xmx<60% RAM>. Default 1 GB heap OOMs the
parquet vectorized reader on every non-trivial query
("GC overhead limit exceeded" inside MapRecordProcessor).
- Q43: FLOOR_MINUTE() instead of DATE_TRUNC('MINUTE', ...). Hive 4.0
doesn't ship DATE_TRUNC; SemanticException [Error 10011] otherwise.
- Make ./load idempotent (the source file is mv'd only on the first
invocation; cold-cycle re-runs reuse the staged file).
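The heap bump in the list above can be sketched as follows. This is a minimal illustration, not the PR's actual script: the helper name `heap_mb_for` is hypothetical, and it assumes a Linux host where `/proc/meminfo` reports `MemTotal` in kB.

```shell
# Hypothetical helper (not from the PR): 60% of a MemTotal value given in kB,
# expressed in whole megabytes (integer division rounds down).
heap_mb_for() {
    local mem_kb="$1"
    echo $(( mem_kb * 60 / 100 / 1024 ))
}

# Assumption: Linux, where /proc/meminfo has a "MemTotal: <n> kB" line.
if [ -r /proc/meminfo ]; then
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    SERVICE_OPTS="-Xmx$(heap_mb_for "$mem_kb")m"
    echo "SERVICE_OPTS=$SERVICE_OPTS"
fi
```

The resulting value would then be handed to the container as the `SERVICE_OPTS` environment variable; how the PR's `start` script does that is not shown here.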
Impala:
- Fix image tags. The non-HMS quickstart images are tagged plainly
(4.4.1-statestored, 4.4.1-catalogd, 4.4.1-impalad_coord_exec) — only
the HMS image carries the `impala_quickstart_` prefix.
- Add a `client` sidecar from apache/impala:4.4.1-impala_quickstart_client
with `entrypoint: [sleep, infinity]`. The coordinator image does NOT
bundle impala-shell — the binary lives only in the client image at
/usr/local/bin/impala-shell. ./load and ./query now `docker exec` into
the sidecar and connect over the impala-net bridge to impalad-1:21050.
- README: call out the AVX requirement. Impala's C++ daemons refuse to
start on any CPU without AVX, so Graviton hosts (and amd64-via-QEMU
on aarch64 dev boxes) cannot run this entry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
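The AVX requirement called out for the README could be checked up front with something like the sketch below. The function name `has_avx` is hypothetical and not part of the PR; it assumes the Linux `/proc/cpuinfo` layout, where x86 CPUs list their features on a `flags` line (aarch64 uses `Features` instead, so the check fails there as intended).

```shell
# Hypothetical pre-flight check (not from the PR): succeed only if the first
# "flags" line of a cpuinfo-style file contains the standalone word "avx".
# $1 is optional and defaults to /proc/cpuinfo.
has_avx() {
    grep -m1 '^flags' "${1:-/proc/cpuinfo}" 2>/dev/null | grep -qw avx
}

if has_avx; then
    echo "AVX present"
else
    echo "no AVX reported: Impala daemons are not expected to start here" >&2
fi
```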
The default 300s BENCH_CHECK_TIMEOUT was hit on the first cold start: the four-container quickstart stack has to bootstrap the Hive metastore schema before catalogd can register tables with statestored and impalad-1 starts answering /healthz=OK, which doesn't fit in 5 minutes on c6a.4xlarge. 900s gives that path room; real crashes still surface in roughly the same wall-clock because the loop polls every second.
./check now emits docker ps to stderr when curl fails. bench_check_loop captures the last call's stderr and prints it once the timeout fires, so a future timeout reports which container is missing or unhealthy instead of just "did not succeed within Ns".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
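The readiness loop described in this commit might look roughly like the sketch below. It is an illustration under assumptions, not the real bench_check_loop: the function name `check_loop` and its argument order are made up, and the real driver may differ in how it stores and prints the captured output.

```shell
# Hypothetical stand-in for bench_check_loop: run the check command once per
# second, keep only the most recent stderr, and print it if the timeout fires.
# Usage: check_loop <timeout-seconds> <command> [args...]
check_loop() {
    local timeout="$1"; shift
    local err_file last_err="" i=0
    err_file=$(mktemp)
    while [ "$i" -lt "$timeout" ]; do
        if "$@" >/dev/null 2>"$err_file"; then
            rm -f "$err_file"
            return 0
        fi
        last_err=$(cat "$err_file")   # overwritten on every iteration
        sleep 1
        i=$((i + 1))
    done
    rm -f "$err_file"
    echo "check did not succeed within ${timeout}s" >&2
    if [ -n "$last_err" ]; then
        printf '%s\n' "$last_err" >&2
    fi
    return 1
}
```

A call such as `check_loop 900 ./check` would then report the final `docker ps` output from `./check` only when the deadline passes, which keeps the run record short.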
The previous diagnostic told us impalad-1 exited 139 (SIGSEGV) and catalogd exited 1, but not *why*.
docker logs is preserved on exited containers until they're removed; tail the last 50 lines for each so the next readiness-timeout report shows the JVM/native stack trace, config-validation error, or OOM line that actually killed the daemon.
bench_check_loop overwrites last_err on each iteration so only the final check's output survives, so there is no risk of log spam in the run record.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sspath)
4.4.1's impalad_coord_exec and catalogd images crash at startup with
NoClassDefFoundError: javax/jdo/JDOException
at o.a.h.hive.metastore.HiveMetaStore.newRetryingHMSHandler
at o.a.h.hive.metastore.HiveMetaStoreClient.<init>
before the /healthz endpoint ever comes up. Both containers exit 139 (SIGSEGV from the C++ "Check failure stack trace" abort path) within seconds of start.
jdo-api is required by the embedded HMS client and the 4.4.1 minimal images don't ship it. 4.5.0 is the newest tagged release and bundles the missing jar.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4.5.0 didn't bring jdo-api back: both impalad_coord_exec and catalogd still crash at startup with NoClassDefFoundError: javax/jdo/JDOException at o.a.h.hive.metastore.HiveMetaStore.newRetryingHMSHandler → SIGSEGV in the C++ "Check failure" abort path, before /healthz ever opens.
The jar isn't a packaging detail we can fix downstream; we just have to supply it. ./install now curls jdo-api-3.0.1.jar from Maven Central into data/extra-lib/ once per VM, and docker-compose.yml bind-mounts that file into /opt/impala/lib/jdo-api-3.0.1.jar on both daemon containers, where it is already on the Impala daemon classpath via the /opt/impala/lib/* glob.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
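The "once per VM" download described in this commit can be sketched as a small guard around curl. The helper name `fetch_once` is hypothetical, and the URL in the usage comment is an assumption based on the standard Maven Central layout for javax.jdo:jdo-api:3.0.1; the PR's actual ./install may be written differently.

```shell
# Hypothetical helper (not from the PR): download a file only if a non-empty
# copy is not already present, writing via a temp name so a failed download
# never leaves a partial file at the destination.
fetch_once() {
    local url="$1" dest="$2"
    if [ -s "$dest" ]; then
        return 0                      # already fetched on this VM
    fi
    mkdir -p "$(dirname "$dest")"
    curl -fsSL -o "$dest.tmp" "$url" && mv "$dest.tmp" "$dest"
}

# Example call (URL is an assumption, see the note above):
# fetch_once \
#   "https://repo1.maven.org/maven2/javax/jdo/jdo-api/3.0.1/jdo-api-3.0.1.jar" \
#   "data/extra-lib/jdo-api-3.0.1.jar"
```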
With jdo-api on the classpath the HMS client now starts, but both
impalad and catalogd hit
MetaException: Failed to get driver instance for
jdbcUrl=jdbc:derby:;databaseName=metastore_db;create=true
CAUSED BY: SQLException: No suitable driver
— they're trying to spin up their own embedded Derby metastore
instead of talking to the dedicated `hms` compose service. The
apache/impala:4.5.0-{catalogd,impalad_coord_exec} images ship
/opt/impala/conf/ empty; daemon_entrypoint.sh puts that dir at the
head of CLASSPATH, so dropping a hive-site.xml in it is the
canonical config injection point. Set hive.metastore.uris to
thrift://hms:9083 and bind-mount the file into both daemons.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
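A minimal hive-site.xml carrying just the setting named in this commit could be generated as below. The function name `write_hive_site` and the idea of generating the file from a script are assumptions for illustration; the PR may simply check the XML file into the repository.

```shell
# Hypothetical generator (not from the PR): write a hive-site.xml that points
# the embedded HMS client at the dedicated `hms` compose service.
write_hive_site() {
    local dest="$1"
    cat > "$dest" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hms:9083</value>
  </property>
</configuration>
EOF
}
```

The generated file would then be bind-mounted into /opt/impala/conf/ on both daemon containers, as the commit message describes.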
bench_main drives the query loop as
while IFS= read -r query; do ...; done < queries.sql
so every command inside the loop body inherits queries.sql as its
stdin. hive/load and impala/load both run
sudo docker exec -i hive|impala-client beeline|impala-shell -f ...
with -i but without redirecting stdin. The inner SQL client never
reads stdin (it's invoked with -f or -e/-q), but docker exec -i
unconditionally forwards host stdin into the container until EOF,
draining the queries.sql fd along the way.
For BENCH_DURABLE=no (hive's case), bench_run_query re-runs ./load
inside the loop for every cold cycle, so Q1's ./load drains
queries.sql and the next bench_main read hits EOF. The loop exits
silently after the first query — explaining why the hive logs show
exactly one `[t,t,t]` line followed immediately by data-size, with
no error message.
Add `< /dev/null` to every docker exec -i call in {hive,impala}/{load,query}.
hive/query and impala/query aren't affected today because $(cat) above
already drains the printf pipe, but pin them too so the scripts stay
safe if anyone reorders the cat or calls them outside the bench
wrapper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
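The stdin-draining failure described in this commit can be reproduced without Docker. In the sketch below, `cat >/dev/null` stands in for `docker exec -i ...` (both read stdin until EOF), and the function name `count_iterations` is made up for the demonstration.

```shell
# Demonstration (stand-in, not the PR's scripts): count how many lines a
# `while read` loop sees when a command in its body inherits the loop's stdin.
count_iterations() {
    local mode="$1" n=0 line
    while IFS= read -r line; do
        n=$((n + 1))
        if [ "$mode" = "guarded" ]; then
            cat >/dev/null </dev/null   # stdin pinned: loop input untouched
        else
            cat >/dev/null              # inherits loop stdin and drains it
        fi
    done <<'EOF'
Q1
Q2
Q3
EOF
    echo "$n"
}

count_iterations unguarded   # prints 1: the loop ends after the first query
count_iterations guarded     # prints 3: every query is read
```

The unguarded case matches the symptom in the hive logs: one result line, then a silent exit with no error.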
Summary
- `apache/hive:4.0.1` in a single container: `SERVICE_NAME=hiveserver2` runs HiveServer2 + embedded Derby metastore in one JVM. External table over a single `hits.parquet` bind-mounted at `/clickbench`; beeline for the CLI; `check` probes the HS2 web-UI `/jmx` for readiness.
- `apache/impala:4.4.1` quickstart layout via `docker compose`: `hms` / `statestored` / `catalogd` / `impalad-1` on a private bridge. External table over the same parquet bind-mount; `impala-shell.sh` in the coordinator; `check` probes `/healthz` on port 25000.
- Both follow the existing per-system script interface (`install` / `start` / `stop` / `check` / `load` / `query` / `data-size` + `benchmark.sh` thin shim), so `lib/benchmark-common.sh` handles cold cycles, `drop_caches`, and the QPS sweep without per-system glue.
- `queries.sql` is the canonical ClickBench set adapted for each engine: `LIMIT n OFFSET m` instead of Trino's `OFFSET m LIMIT n`, and engine-specific regex backreference quoting in Q29.
Closes #889.
Test plan
- `./run-benchmark.sh` with `system=hive machine=c6a.4xlarge`
- `./run-benchmark.sh` with `system=impala machine=c6a.4xlarge`
- Check `result.csv` for null/skipped queries (Hive may fall back on OFFSET in some 4.0 paths; flag any cases here for follow-up)
🤖 Generated with Claude Code