Add Hive and Impala #907

Open
alexey-milovidov wants to merge 8 commits into main from add-hive-impala

@alexey-milovidov
Member

Summary

  • Hive: apache/hive:4.0.1 in a single container — SERVICE_NAME=hiveserver2 runs HiveServer2 + embedded Derby metastore in one JVM. External table over a single hits.parquet bind-mounted at /clickbench; beeline for the CLI; check probes the HS2 web-UI /jmx for readiness.
  • Impala: apache/impala:4.4.1 quickstart layout via docker compose: hms / statestored / catalogd / impalad-1 on a private bridge. External table over the same parquet bind-mount; impala-shell.sh in the coordinator; check probes /healthz on port 25000.
  • Both follow the existing per-system script interface (install/start/stop/check/load/query/data-size + benchmark.sh thin shim), so lib/benchmark-common.sh handles cold cycles, drop_caches, and the QPS sweep without per-system glue.
  • queries.sql is the canonical ClickBench set adapted for each engine: LIMIT n OFFSET m instead of Trino's OFFSET m LIMIT n, and engine-specific regex backreference quoting in Q29.
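
To make the clause-order difference concrete, here is a minimal sketch of the rewrite. The `to_limit_offset` helper and the sample query are illustrative assumptions, not part of this PR or of queries.sql, and nothing here implies the adapted files were produced this way.

```shell
# Hypothetical helper, not part of the PR: move a trailing
# "OFFSET m LIMIT n" (Trino order) to "LIMIT n OFFSET m".
to_limit_offset() {
    sed -E 's/OFFSET ([0-9]+) LIMIT ([0-9]+)/LIMIT \2 OFFSET \1/'
}

# Illustrative query only; it is not copied from queries.sql.
trino_query='SELECT URL FROM hits ORDER BY EventTime OFFSET 1000 LIMIT 10;'
converted=$(printf '%s\n' "$trino_query" | to_limit_offset)
printf '%s\n' "$converted"
```

The pattern only handles literal integer operands, which is enough for fixed benchmark queries.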

Closes #889.

Test plan

  • ./run-benchmark.sh with system=hive machine=c6a.4xlarge
  • ./run-benchmark.sh with system=impala machine=c6a.4xlarge
  • Inspect result.csv for null/skipped queries (Hive may fall back on OFFSET in some 4.0 paths; flag any cases here for follow-up)
  • Compare wall-clock to existing Trino/Presto numbers as a sanity check

🤖 Generated with Claude Code

alexey-milovidov and others added 8 commits May 15, 2026 23:21
Hive: apache/hive:4.0.1 in a single container, SERVICE_NAME=hiveserver2
runs HiveServer2 + embedded Derby metastore in one JVM. External table
over a single hits.parquet bind-mounted at /clickbench, beeline for the
SQL client, check probes HS2's web-UI /jmx for readiness.

Impala: apache/impala:4.4.1 quickstart layout via docker compose —
hms / statestored / catalogd / impalad-1, all on a private bridge.
External table over the same parquet bind-mount, impala-shell.sh in the
coordinator for SQL, check probes impalad's /healthz on port 25000.

Both follow the existing per-system script interface
(install/start/stop/check/load/query/data-size + benchmark.sh thin
shim), so the lib/benchmark-common.sh driver handles cold cycles,
drop_caches, and the QPS sweep without per-system glue.

queries.sql is the canonical ClickBench set adapted for each engine:
LIMIT n OFFSET m instead of Trino's OFFSET m LIMIT n, and the Q29 regex
backreference is written per engine, with escaping that depends on
shell quoting (Hive: $1; Impala: \\1).

Closes #889.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hive:
- BENCH_DURABLE=no. apache/hive's entrypoint writes
  /opt/hive/conf/hiveserver2.pid and refuses to restart HS2 if the file
  exists, so a `docker stop` + `docker start` leaves the second start a
  no-op. Mounting the Derby store on a named volume hits a separate
  ERROR XBM0J ("Directory metastore_db already exists" — Derby refuses
  to create a database into a dir that pre-exists, which a docker mount
  always does). Simpler: rm -f the container on every cold cycle and
  re-run ./load to rebuild the schema; the load wall-clock rolls into
  the cold-try timing per the standard contract.
- Bump SERVICE_OPTS to -Xmx<60% RAM>. Default 1 GB heap OOMs the
  parquet vectorized reader on every non-trivial query
  ("GC overhead limit exceeded" inside MapRecordProcessor).
- Q43: FLOOR_MINUTE() instead of DATE_TRUNC('MINUTE', ...). Hive 4.0
  doesn't ship DATE_TRUNC; SemanticException [Error 10011] otherwise.
- Make ./load idempotent (cold-cycle re-runs only mv the source on the
  first invocation, then reuse the staged file).
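
A minimal sketch of the heap sizing described in the SERVICE_OPTS item above, assuming a Linux host where /proc/meminfo reports MemTotal in kB. The `heap_mb_for` helper is a hypothetical name, and how the value reaches the container (for example via `-e SERVICE_OPTS=...`) is an assumption rather than something this commit states.

```shell
# Print 60% of a memory size given in kB, as whole megabytes.
heap_mb_for() {
    echo $(( $1 * 60 / 100 / 1024 ))
}

# MemTotal is reported in kB on Linux; fall back to 0 if it cannot be read.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)
mem_kb=${mem_kb:-0}

# Assumption: the value is handed to the Hive container as SERVICE_OPTS.
SERVICE_OPTS="-Xmx$(heap_mb_for "$mem_kb")m"
echo "$SERVICE_OPTS"
```

On a 32 GiB host (33554432 kB) the helper yields 19660, i.e. `-Xmx19660m`.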

Impala:
- Fix image tags. The non-HMS quickstart images are tagged plainly
  (4.4.1-statestored, 4.4.1-catalogd, 4.4.1-impalad_coord_exec) — only
  the HMS image carries the `impala_quickstart_` prefix.
- Add a `client` sidecar from apache/impala:4.4.1-impala_quickstart_client
  with `entrypoint: [sleep, infinity]`. The coordinator image does NOT
  bundle impala-shell — the binary lives only in the client image at
  /usr/local/bin/impala-shell. ./load and ./query now `docker exec` into
  the sidecar and connect over the impala-net bridge to impalad-1:21050.
- README: call out the AVX requirement. Impala's C++ daemons refuse to
  start on any CPU without AVX, so Graviton hosts (and amd64-via-QEMU
  on aarch64 dev boxes) cannot run this entry.
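
The AVX requirement can be checked before starting the stack. The following is a hedged sketch of such a preflight test; `has_avx` is a hypothetical helper and is not one of the scripts added in this PR.

```shell
# Return 0 when the given cpuinfo-style file lists the avx flag.
# -w keeps "avx2" or "avx512f" on their own from counting as a match.
has_avx() {
    grep -qw avx "$1"
}

if has_avx /proc/cpuinfo 2>/dev/null; then
    echo "AVX present: the Impala daemons can start on this host"
else
    echo "no AVX: skip the impala entry on this host"
fi
```

On aarch64 the cpuinfo line is named `Features` and carries no `avx` entry, so the check falls through to the second branch.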

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The default 300s BENCH_CHECK_TIMEOUT was hit on the first cold start —
the four-container quickstart stack has to bootstrap the Hive metastore
schema before catalogd can register tables with statestored and
impalad-1 starts answering /healthz=OK, which doesn't fit in 5 minutes
on c6a.4xlarge. 900s gives that path room; real crashes still surface
in roughly the same wall-clock because the loop polls every second.

./check now emits docker ps to stderr when curl fails. bench_check_loop
captures the last call's stderr and prints it once the timeout fires, so
a future timeout reports which container is missing or unhealthy instead
of just "did not succeed within Ns".
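
The capture-and-report behaviour can be sketched as follows. This is an illustrative re-implementation under assumptions: `check_loop` and `failing_check` are hypothetical names, and the real bench_check_loop in lib/benchmark-common.sh may be structured differently.

```shell
# Hypothetical sketch; the real bench_check_loop may differ.
check_loop() {
    # $1: timeout in seconds; remaining arguments: the check command to poll.
    timeout=$1
    shift
    elapsed=0
    last_err=""
    errfile=$(mktemp)
    while [ "$elapsed" -lt "$timeout" ]; do
        if "$@" > /dev/null 2> "$errfile"; then
            rm -f "$errfile"
            return 0
        fi
        # Overwrite rather than append: only the final attempt's stderr survives.
        last_err=$(cat "$errfile")
        sleep 1
        elapsed=$((elapsed + 1))
    done
    rm -f "$errfile"
    echo "check did not succeed within ${timeout}s" >&2
    if [ -n "$last_err" ]; then
        printf '%s\n' "$last_err" >&2
    fi
    return 1
}

# Stand-in for ./check: always fails and reports which attempt it was.
attempt=0
failing_check() {
    attempt=$((attempt + 1))
    echo "attempt $attempt: impalad-1 is not running" >&2
    return 1
}

report=$(check_loop 2 failing_check 2>&1)
printf '%s\n' "$report"
```

Because `last_err` is overwritten on each pass, the report contains the timeout line plus the stderr of the last attempt only.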

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous diagnostic told us impalad-1 exited 139 (SIGSEGV) and
catalogd exited 1, but not *why*. docker logs is preserved on exited
containers until they're removed; tail the last 50 lines for each so the
next readiness-timeout report shows the JVM/native stack trace,
config-validation error, or OOM line that actually killed the daemon.

bench_check_loop overwrites last_err on each iteration so only the final
check's output survives — no risk of log spam in the run record.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sspath)

4.4.1's impalad_coord_exec and catalogd images crash at startup with

  NoClassDefFoundError: javax/jdo/JDOException
  at o.a.h.hive.metastore.HiveMetaStore.newRetryingHMSHandler
  at o.a.h.hive.metastore.HiveMetaStoreClient.<init>

before the /healthz endpoint ever comes up — both containers exit
139 (SIGSEGV from the C++ "Check failure stack trace" abort path)
within seconds of start. jdo-api is required by the embedded HMS
client and the 4.4.1 minimal images don't ship it. 4.5.0 is the
newest tagged release and bundles the missing jar.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4.5.0 didn't bring jdo-api back — both impalad_coord_exec and
catalogd still crash at startup with

  NoClassDefFoundError: javax/jdo/JDOException
  at o.a.h.hive.metastore.HiveMetaStore.newRetryingHMSHandler

→ SIGSEGV in the C++ "Check failure" abort path, before /healthz
ever opens. The jar isn't a packaging detail we can fix downstream;
we just have to supply it.

./install now curls jdo-api-3.0.1.jar from Maven Central into
data/extra-lib/ once per VM, and docker-compose.yml bind-mounts that
file into /opt/impala/lib/jdo-api-3.0.1.jar on both daemon
containers — already on the impala daemon classpath via the
/opt/impala/lib/* glob.
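
A minimal sketch of the once-per-VM download, assuming the standard Maven Central layout for javax.jdo:jdo-api. The `fetch_once` helper is a hypothetical name; the actual ./install may be written differently.

```shell
JDO_VERSION=3.0.1
JDO_JAR="jdo-api-${JDO_VERSION}.jar"
JDO_URL="https://repo1.maven.org/maven2/javax/jdo/jdo-api/${JDO_VERSION}/${JDO_JAR}"

fetch_once() {
    # $1: URL, $2: destination path.
    # Skips the download when a non-empty file is already in place.
    if [ -s "$2" ]; then
        echo "already present: $2"
        return 0
    fi
    mkdir -p "$(dirname "$2")"
    # Download to a temporary name so a failed transfer never leaves a partial jar.
    curl -fsSL -o "$2.tmp" "$1" && mv "$2.tmp" "$2"
}

# In ./install this would be called as:
#   fetch_once "$JDO_URL" "data/extra-lib/$JDO_JAR"
echo "$JDO_URL"
```

The resulting file is what docker-compose.yml bind-mounts to /opt/impala/lib/jdo-api-3.0.1.jar on both daemon containers.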

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With jdo-api on the classpath the HMS client now starts, but both
impalad and catalogd hit

  MetaException: Failed to get driver instance for
    jdbcUrl=jdbc:derby:;databaseName=metastore_db;create=true
  CAUSED BY: SQLException: No suitable driver

— they're trying to spin up their own embedded Derby metastore
instead of talking to the dedicated `hms` compose service. The
apache/impala:4.5.0-{catalogd,impalad_coord_exec} images ship
/opt/impala/conf/ empty; daemon_entrypoint.sh puts that dir at the
head of CLASSPATH, so dropping a hive-site.xml in it is the
canonical config injection point. Set hive.metastore.uris to
thrift://hms:9083 and bind-mount the file into both daemons.
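
A minimal sketch of the injected config, written here from a shell heredoc. Only the hive.metastore.uris property comes from this commit; the temporary directory is a stand-in for whatever host path is bind-mounted to /opt/impala/conf, and the PR's actual file may carry more properties.

```shell
# Stand-in for the host directory that gets bind-mounted to /opt/impala/conf.
conf_dir=$(mktemp -d)

# Point the embedded HMS client at the dedicated hms compose service
# instead of letting it create its own embedded Derby metastore.
cat > "$conf_dir/hive-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hms:9083</value>
  </property>
</configuration>
EOF

cat "$conf_dir/hive-site.xml"
```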

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bench_main drives the query loop as

  while IFS= read -r query; do ...; done < queries.sql

so every command inside the loop body inherits queries.sql as its
stdin. hive/load and impala/load both run

  sudo docker exec -i hive|impala-client beeline|impala-shell -f ...

with -i but without redirecting stdin. The inner SQL client never
reads stdin (it's invoked with -f or -e/-q), but docker exec -i
unconditionally forwards host stdin into the container until EOF,
draining the queries.sql fd along the way.

For BENCH_DURABLE=no (hive's case), bench_run_query re-runs ./load
inside the loop for every cold cycle, so Q1's ./load drains
queries.sql and the next bench_main read hits EOF. The loop exits
silently after the first query — explaining why the hive logs show
exactly one `[t,t,t]` line followed immediately by data-size, with
no error message.

Add `< /dev/null` to every docker exec -i call in {hive,impala}/{load,query}.
hive/query and impala/query aren't affected today because $(cat) above
already drains the printf pipe, but pin them too so the scripts stay
safe if anyone reorders the cat or calls them outside the bench
wrapper.
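
The failure mode reproduces without docker. In this sketch `drain_stdin` is a hypothetical stand-in for `docker exec -i`, and the three-line heredoc stands in for queries.sql.

```shell
# Stand-in for `docker exec -i ...`: like docker exec -i, it consumes
# everything on its stdin even though the inner client never needs it.
drain_stdin() {
    cat > /dev/null
}

count_iterations() {
    # $1: "leaky" leaves stdin inherited from the loop; anything else redirects it.
    n=0
    while IFS= read -r query; do
        n=$((n + 1))
        if [ "$1" = leaky ]; then
            drain_stdin
        else
            drain_stdin < /dev/null
        fi
    done <<'EOF'
SELECT 1;
SELECT 2;
SELECT 3;
EOF
    echo "$n"
}

leaky=$(count_iterations leaky)
fixed=$(count_iterations fixed)
echo "leaky=$leaky fixed=$fixed"   # prints: leaky=1 fixed=3
```

The leaky variant stops after one iteration with no error, matching the single `[t,t,t]` line seen in the hive logs.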

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>