Fixes for various systems #903

Merged
alexey-milovidov merged 67 commits into main from clickbench-fixes
May 14, 2026

@alexey-milovidov
Member

No description provided.

alexey-milovidov and others added 30 commits May 14, 2026 20:22
Several systems' load scripts do `sudo mv hits_*.parquet
/var/lib/<engine>/user_files/` or `sudo cp hits.csv .../extern/`
followed by `chown` to the daemon's user. The mv/cp copies 14-75 GB
of data the daemon reads once during INSERT and we delete right
after — a complete waste of bytes on disk and time on the wire.

Replace with `ln -s` + `chown -h` where the daemon's user-files dir
is on a different filesystem from the dataset. `chown -h` chowns
the symlink itself rather than following into the (often read-only)
original; the underlying dataset is mode 644 anyway, so daemon
processes can read through the symlink as their own user.

Systems updated: clickhouse, clickhouse-tencent, pg_clickhouse,
kinetica, oxla, ursa, arc, cockroachdb.

Motivated by the ClickBench playground (Firecracker microVM service)
where the dataset is mounted read-only and shared across all VMs;
the copy step was the dominant cost on parquet/csv-format systems
and pulled 14 GB into the per-VM snapshot golden disk unnecessarily.
The change is also benign for the regular benchmark — daemons still
read the same bytes, just through a symlink.
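
A minimal Python sketch of the link-instead-of-copy step described above. The load scripts themselves are shell (`ln -s` + `chown -h`); the helper name `link_dataset` and its parameters are hypothetical and only illustrate the idea.

```python
import os
import tempfile


def link_dataset(src, dest_dir, uid=None, gid=None):
    """Symlink src into dest_dir instead of copying it; return the link path."""
    link = os.path.join(dest_dir, os.path.basename(src))
    if os.path.lexists(link):
        os.remove(link)  # replace a stale link or an earlier copy
    os.symlink(os.path.abspath(src), link)
    if uid is not None and gid is not None:
        # Equivalent of `chown -h`: change the link itself, never its target,
        # so a read-only dataset mount is left untouched.
        os.chown(link, uid, gid, follow_symlinks=False)
    return link


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "hits.parquet")
        with open(src, "wb") as f:
            f.write(b"demo")
        user_files = os.path.join(tmp, "user_files")
        os.mkdir(user_files)
        print(link_dataset(src, user_files))
```

Passing `uid`/`gid` needs the privileges that `sudo chown` would need; without them the link is simply created as the calling user.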

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ClickBench: fix elasticsearch load.py bytes/str mix

VM tweaks for the long tail of failures:
  - chdb-dataframe / duckdb-dataframe materialize the full hits dataset
    in process memory and need >32 GB. Default to 48 GB.
  - Druid / Pinot / similar JVM stacks take 5-10 min to come up
    (Zookeeper → Coordinator → Broker → Historical, in sequence). The
    agent's 300 s check-loop wasn't enough; widen to 900 s.

elasticsearch/load.py: gzip.open in mode='rt' returns str docs, but
bulk_stream yields bytes for ACTION_META_BYTES and str for the doc.
requests.adapters.send() calls sock.sendall() on the mixed iterable
and crashes with `TypeError: a bytes-like object is required, not
'str'`. Open in 'rb' so docs are bytes — matches the rest of the
generator.
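
A hedged sketch of the bytes-only generator. `bulk_stream` and `ACTION_META_BYTES` are the names used in the message above, but the constant's value and the function body here are assumptions, not the contents of load.py.

```python
import gzip

# Assumed shape of the bulk action line; the real constant lives in load.py.
ACTION_META_BYTES = b'{"index":{}}\n'


def bulk_stream(path):
    """Yield bulk-request chunks as bytes only, never str."""
    # Mode 'rb' makes every line a bytes object, matching ACTION_META_BYTES.
    with gzip.open(path, "rb") as docs:
        for doc in docs:
            yield ACTION_META_BYTES
            # Make sure each document is newline-terminated.
            yield doc if doc.endswith(b"\n") else doc + b"\n"
```

With `mode='rt'` the second `yield` would produce `str`, which is the mix that made `sock.sendall()` raise.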

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Snapshot pipeline:
- /opt/clickbench-playground reformatted as XFS so cp --reflink=always
  can clone golden->working in milliseconds.
- _snapshot_disks and _restore_disks switched to reflink (parallel,
  O(1) extent-list copies).
- snapshot.bin no longer compressed; Firecracker mmaps it on restore,
  pages fault in lazily.
- Snapshot is taken with the daemon running: pre-snapshot stop+fstrim
  +drop_caches is followed by start+check, so restore resumes a live
  daemon and the first query pays no cold-start cost.
- _snapshot_disks runs while VM is paused, before resume. Without this
  the daemon's post-snapshot kernel writes (journal commits, atime)
  leaked into the golden disk and surfaced as ext4 EBADMSG on restore.

Agent + host wiring:
- New /ready endpoint on the in-VM agent; _restore_snapshot waits for
  /ready (up to 10 min) before reporting state="ready" so slow JVMs
  like Doris/Druid don't time out on the user's first query.
- dockerd restart hook at agent boot — without it docker-using systems
  fail to launch containers after snapshot restore.
- Output streamed and capped at OUTPUT_LIMIT+1 bytes (default 64 KB)
  with head-style early-kill; default query timeout 600 -> 60 s.
- /api/query no longer triggers initial provisioning. Only restore.
  Initial provision requires explicit /api/admin/provision/<name>.
- /api/queries/<name> returns the system's example queries.
- _call_agent_provision: no aiohttp idle timeout, 7-day total cap.
- ClickHouse-family stays on the internet after snapshot (datalake
  variants need S3); rest stays offline.
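
The output cap with head-style early-kill can be sketched roughly as follows. This is an illustration under assumptions, not the agent's code: `run_capped` is a hypothetical name, and reading one byte past the limit is used here only to detect truncation.

```python
import subprocess
import sys

OUTPUT_LIMIT = 64 * 1024  # default cap named in the commit message


def run_capped(argv, limit=OUTPUT_LIMIT):
    """Run argv, read at most limit+1 bytes of stdout, then kill the child.

    Returns (data, truncated); data never exceeds `limit` bytes.
    """
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
    chunks, total = [], 0
    try:
        while total < limit + 1:
            chunk = proc.stdout.read(min(65536, limit + 1 - total))
            if not chunk:  # EOF: the child finished within the cap
                break
            chunks.append(chunk)
            total += len(chunk)
    finally:
        proc.kill()  # like `head`: stop the producer once enough is read
        proc.stdout.close()
        proc.wait()
    data = b"".join(chunks)
    return data[:limit], len(data) > limit


if __name__ == "__main__":
    out, cut = run_capped([sys.executable, "-c", "print('x' * 1000)"], limit=100)
    print(len(out), cut)
```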

Catalog:
- paradedb-partitioned (pg_lakehouse removed upstream) and
  pg_duckdb-motherduck (needs cloud creds) excluded.
- ClickHouse + chdb variants emit Pretty format.
- ClickBench: trino/presto-datalake javac classpath uses find for AWS
  SDK / Hadoop jars instead of pinning a stale jar filename.
- ClickBench: cedardb/cedardb-parquet/mongodb start scripts hardened
  (systemctl restart docker, longer wait windows, better diagnostics).
- ClickBench: duckdb start scripts scrub stale *.wal.
- ClickBench: arc start broadened admin-token regex.

UI:
- Catalog rendered as horizontal slabs, colored by state.
- Per-system result cache (output + timing) keyed by system name.
- Example-query selector populated from /api/queries/<name>.
- Down systems swap the query pane for a "Last error" pane.
- Stats row trimmed to time + truncated marker.
- monospace font, no rounded corners, black selected outline.
- Spellcheck / autocomplete / Grammarly opt-outs on the textarea.

Bootstrap:
- install-firecracker.sh: chown only the top-level state dirs, not
  recursively (a chown -R was descending into a base-rootfs build's
  loop mount and flipping /etc/sudoers to uid 1000).
- install-firecracker.sh checks the state dir supports reflink and
  exits with an XFS-format hint if not.
- download-datasets.sh fetches hits.json.gz (used by parseable).
…uery

The bench-vortex binaries (clickbench / query_bench) emit only gh-json
timing blobs — no rows — which makes the system useless in the
interactive playground. Replace ./query with the datafusion (Parquet)
flow: datafusion-cli reads parquet directly via create.sql.

- install: cargo install datafusion-cli 49.0.2 (vortex bench build
  stays for benchmark.sh continuity).
- create.sql (new): same shape as datafusion's, points at
  hits.parquet / partitioned/.
- load: symlink parquet files via the playground stub instead of
  invoking the vortex driver's warmup pass.
- query: identical to datafusion / datafusion-partitioned.
Recent Arc debs (25.12+) link against GLIBC_2.38; Ubuntu 22.04 has
2.35 and the daemon dies with libm.so.6 'GLIBC_2.38 not found'.
Extract Ubuntu noble's libc6 into /opt/glibc-noble and replace
/usr/bin/arc with a wrapper that exec's the real binary via that
ld-linux + --library-path. Leaves the rest of the system on 22.04's
glibc.
opteryx.query() returns a Cursor / Relation; list(cursor) iterates
arrow batches rather than rows in current versions, so the prior
script ended up printing nothing for every query. Probe arrow() /
to_arrow_table() / fetchall() in turn and render the result as TSV
(header + rows) when it's a pyarrow Table, falling back to iteration.
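
The probing order can be sketched as below. The function names are hypothetical, and the pyarrow Table check is duck-typed on `column_names` and `to_pylist()` so the sketch runs without pyarrow installed; the actual script may differ.

```python
def to_rows(result):
    """Return (header, rows) by probing whichever accessor the object has."""
    for name in ("arrow", "to_arrow_table", "fetchall"):
        method = getattr(result, name, None)
        if not callable(method):
            continue
        value = method()
        # Duck-typed stand-in for "is this a pyarrow Table?".
        if hasattr(value, "column_names") and hasattr(value, "to_pylist"):
            header = [str(col) for col in value.column_names]
            rows = [[rec[col] for col in value.column_names]
                    for rec in value.to_pylist()]
            return header, rows
        return None, [list(row) for row in value]
    # Last resort: iterate whatever the object yields.
    return None, [list(row) for row in result]


def render_tsv(result):
    """Render the result as TSV, with a header line when one is known."""
    header, rows = to_rows(result)
    lines = []
    if header is not None:
        lines.append("\t".join(header))
    lines.extend("\t".join(str(v) for v in row) for row in rows)
    return "\n".join(lines)
```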
docker compose down --volumes deleted the anonymous volumes that
hold byconity's HDFS namenode state and the bench database, so after
the playground's pre-snapshot stop and post-snapshot restore the
schema was gone. Use docker compose stop instead — it leaves
containers (and their volumes) intact.
drill-embedded crashes with ExceptionInInitializerError /
"Could not initialize class RootAllocator" on JDK 17+ when
java.base/java.nio (and neighbours) aren't opened to ALL-UNNAMED.
Set DRILL_JAVA_OPTS on the docker run so every fresh JVM start in
./query has the right flags.

playground: trust duckdb-datalake{,-partitioned} for outbound internet

These two read from S3 at query time, same as clickhouse-datalake.
mongod 8.0.23 hardcodes "MongoDB cannot start: Linux kernel versions
6.19 and newer" and exits; the playground's guest kernel is 7.0.
The MongoDB 7.0 release line predates that check and runs fine.
The CLI was invoked with --output-format=NULL, which suppresses the
result. Switch to ALIGNED so the playground shows actual rows. The
benchmark still gets its timing from the wall-clock around the
java -jar invocation.
Apache Doris -> Doris across the doris and doris-parquet template.json,
every historical results/*.json, the get-result-json.sh helpers, and
the rendered data.generated.js. README.md and install scripts keep
'Apache Doris' since those reference the upstream project name.
…tart'

After a snapshot+restore the on-disk monetdbd lock files
(.merovingian.lock, .gdk_lock) outlive the process. `monetdbd start`
then exits with "another monetdbd is already running" without
relaunching, mclient has no server to reach, and ./check spins for
the agent's full 15-minute timeout.

- If no monetdbd is actually running, wipe the lock files and any
  stale mserver5 before invoking monetdbd start.
- After daemon-side relaunch, `monetdb start test` is also needed
  to actually start mserver5 for the database; `monetdb release` only
  un-marks maintenance.
- Wait up to 60 s for mclient to go through; bail loudly with status
  output instead of leaving the agent's blind 15-min poll loop.
dpkg-deb -x preserves the package's internal layout — the deb stores
the loader at /usr/lib64/ld-linux-x86-64.so.2, so extraction at
$NOBLE_DIR puts it at $NOBLE_DIR/usr/lib64/..., not
$NOBLE_DIR/lib64/... arc was failing with 'No such file or
directory' on every restart.
don't hang on the credentials timeout

DuckDB's S3 driver probes 169.254.169.254 for IAM credentials before
each S3 request. The playground's SNI proxy blocks that IP (correctly,
it's host metadata); each query then waits the full IMDS timeout
before falling through to anonymous access, which the user sees as a
hang.

- duckdb-datalake: switch from s3://... to the HTTPS URL directly.
  httpfs reads HTTPS with no credential lookup at all.
- duckdb-datalake-partitioned: keep s3:// (httpfs has no HTTPS glob),
  but add a CREATE SECRET ... TYPE S3 ... KEY_ID '' that short-
  circuits the credential chain to anonymous and skips IMDS.
Same root cause as the duckdb-datalake hang: ClickHouse's S3 engine
probes 169.254.169.254 for IAM credentials before each request. The
playground's SNI proxy blocks IMDS (correctly — it's host metadata),
and each query waits the full timeout before falling through to
anonymous. NOSIGN tells the S3 engine to skip the credential chain
entirely and make anonymous requests, which the public bucket accepts.
Upstream bench-vortex no longer exposes a 'clickbench' bin target;
only compress, public_bi, query_bench, random_access remain. Match
the partitioned variant and build query_bench. The playground's
./query path uses datafusion-cli either way, but install needs to
succeed for the system to provision.
CREATE TABLE in ./load failed with "Table replication num should be
less than of equal to the number of available BE nodes" because the
blind 'sleep 30' after ALTER SYSTEM ADD BACKEND wasn't enough on a
fresh cold start — BE registration was still in progress when the
load script proceeded.

Replace with two active waits: first for FE to accept connections,
then poll SHOW BACKENDS until Alive=true. Also make the script
idempotent on both fronts (FE up + BE alive).
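
An active wait of this kind, as opposed to a blind sleep, might look like the sketch below. `wait_until` and its parameters are hypothetical; the real script is shell and polls SHOW BACKENDS.

```python
import time


def wait_until(probe, timeout_s, interval_s=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns truthy or timeout_s elapses.

    A probe that raises is treated as "not ready yet", which is the usual
    state of a daemon that is still starting. Returns True on success.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # connection refused and similar: keep polling
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The two stages then become two calls in sequence, first with a probe that checks the FE accepts connections, then with one that checks a backend reports Alive=true, and the script bails out if either returns False.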
Two dataframe systems were carrying both files:
- queries.sql with SQL equivalents that nothing ran
- queries.py with the Python expressions the server actually eval'd
plus a BENCH_QUERIES_FILE=queries.py override in benchmark.sh.

Drop the unused queries.sql, rename queries.py -> queries.sql, drop
the override. The lib/benchmark-common.sh default
BENCH_QUERIES_FILE=queries.sql now picks up the Python expressions
unchanged. Updated docstrings in server.py to note the new file
contains Python (the filename matches the cross-system convention).
…ectly

polars/server.py kept a 43-entry list of (sql_string, lambda) tuples
and the /query endpoint did a dict lookup. Replace with the
polars-dataframe pattern: /query takes a Python expression and
eval()s it against the loaded LazyFrame, with hits/pl/date in scope.
queries.sql now holds those Python expressions (one per line), same
shape as polars-dataframe.

The load remains lazy (pl.scan_parquet without collect) so the
streaming behaviour that distinguishes polars from polars-dataframe
is preserved.
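
A rough sketch of the eval-based endpoint logic, without polars so it stays self-contained. `run_expression` is a hypothetical name, and binding `date` to `datetime.date` is an assumption; in server.py the scope also carries `pl` and the loaded LazyFrame as `hits`.

```python
import datetime


def run_expression(expr, hits, extra_scope=None):
    """Evaluate one line of queries.sql as a Python expression."""
    # Names visible to the expression. `pl` would be passed via extra_scope.
    scope = {"hits": hits, "date": datetime.date}
    if extra_scope:
        scope.update(extra_scope)
    return eval(expr, scope)


if __name__ == "__main__":
    # A plain list stands in for the LazyFrame here.
    print(run_expression("len(hits)", [1, 2, 3]))
```

eval() runs arbitrary code, so this pattern only makes sense where the caller is already trusted or sandboxed.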
install was bumped from 33.0.0 to 37.0.0 (33.0.0 was retired from the
apache mirror) but start/load/data-size still pointed at the old
$DRUID_DIR. start launched nothing — directory didn't exist — and
the agent timed out the 900 s check loop.
The previous attempt set DRILL_JAVA_OPTS only — that env var is
consumed by drillbit.sh, but drill-embedded launches sqlline which
reads JAVA_OPTS instead. The RootAllocator's static init still
failed on JDK 17+ inside the embedded JVM with
ExceptionInInitializerError.

Set DRILL_JAVA_OPTS / DRILL_SHELL_JAVA_OPTS / JAVA_OPTS /
_JAVA_OPTIONS so whichever path the apache/drill launcher follows
picks the flags up. Also widen the list of --add-opens (lang.reflect,
util.concurrent, jdk.internal.misc/ref) which Arrow's allocator
touches.
Previous fix got the daemon running ('test' DB shows R 100% in
monetdb status) but mclient was still failing. Without -h, mclient
hunts for a unix socket that doesn't always exist after a restart,
and a hostname mismatch can stall TCP. Use -h 127.0.0.1 (both in
check and start), and extend the wait to 120 s while logging
mclient's actual error on failure for diagnosis.
CREATE SECRET in create.sql is session-scoped: it lives only inside
the duckdb process that ran ./load. When ./query reopens hits.db the
secret is gone, so DuckDB falls back to the us-east-1 default,
ListObjects against the eu-central-1 bucket gets a 301 redirect that
the playground's SNI-filter proxy can't follow cleanly, and Q1 on
the partitioned variant hangs the full 60 s.

Apply the SET s3_region + CREATE OR REPLACE SECRET on every duckdb
invocation in ./query so the region is always correct and the
credential chain never probes IMDS.
monetdb:
  Feed SQL via stdin instead of -s 'STMT' on the mclient command line.
  The previous flag form caused mclient to dump --help (and exit
  non-zero) instead of running the SELECT 1 health probe, even though
  the daemon was listening on 127.0.0.1:50000. check + start both
  switched to echo | mclient.

siglens:
  siglens 1.0.54's go.mod requires Go 1.21+. Ubuntu's `golang` package
  is 1.18 — go mod tidy fails. Install go 1.22.7 from the official
  tarball into /usr/local/go and prepend to PATH.

starrocks:
  FE was reporting "FE saved address not match backend address" after
  snapshot+restore because `hostname -i` could return different IPs
  between runs. Pin priority_networks=127.0.0.1/32 in both fe.conf
  and be.conf, register the BE under 127.0.0.1:9050, and drop any
  stale BE entry before re-adding. Stable rendezvous across restarts.

tidb:
  cluster_info reports a tiflash row before the placement driver is
  ready to accept SET TIFLASH REPLICA. ./load failed with
  "tiflash server count: 0" on a fresh playground. Add a second
  phase to start: probe with a throwaway table until ALTER TABLE ...
  SET TIFLASH REPLICA 1 actually succeeds, then drop the probe.
parseable and victorialogs both decompress hits.json.gz inside the
VM at load time. The 217 GB output overflows the 200 GB sysdisk
(after the install / wget overhead) and load fails with
"No space left on device". Pre-decompress on the host once and
make the file available to load scripts as a symlink.

- playground/images/build-base-rootfs.sh: add lib stubs
  download-hits-json and download-hits-json-gz so per-system load
  scripts can pick up the file via the same pattern as the other
  formats.
- playground/scripts/download-datasets.sh: extends the host-side
  download flow to decompress hits.json.gz into datasets/hits.json.
- parseable/load: prefer the pre-decompressed file; the legacy
  wget + gunzip path stays as a fallback for standalone use.
- victorialogs/load: same — use the read-only hits.json directly
  with split -n r/8.
datafusion-vortex{,-partitioned}:
  Check looked for the renamed `clickbench` binary; switch to
  `command -v datafusion-cli` (matches what ./query actually uses)
  and have install symlink ~/.cargo/bin/datafusion-cli into
  /usr/local/bin so the agent's stripped PATH finds it.

duckdb-datalake{,-partitioned}:
  readlink -f the installed duckdb before symlinking into
  /usr/local/bin so the link stays valid regardless of $HOME at
  provision/query time.

gizmosql:
  v1.26+ links GLIBC_2.38; Ubuntu 22.04 has 2.35. Wrap both binaries
  with the same noble-loader trick used for arc.

pinot:
  start/load referenced the old 1.3.0 dir; install was bumped to
  1.5.0. Sync.

systems.py:
  - disable pandas (peak RSS ~30 GB OOMs the 16 GB VM).
  - disable paradedb (postgres crashes during index VACUUM under
    16 GB).
MongoDB doesn't publish a 7.0 apt repository for noble, only 8.0+.
But 8.0 has the hardcoded "Linux kernel >= 6.19" refusal we need to
avoid (the playground guest runs kernel 7.0). The MongoDB 7.0 packages
built for jammy depend on libssl3, which noble also provides, so they
install fine on a 24.04 base.
…body

pinot: schema 1.5.0 requires schemaName == tableName; rename schema
'hitsSchema' to 'hits'. Also wait for the controller's REST API to
be live before AddTable and drop the silent '|| true' that masked
genuine failures.

daft-parquet-partitioned: drop daft==0.7.4 version pin — the old
release doesn't have col(...).decode('utf-8'), which made /load
return 500 'Internal Server Error' before the data ever loaded.

vm_manager: when a /provision fails, save the full agent response
body to logs/provision-<system>.log so the real failure (often in
start/check/load) is recoverable; the 2000-byte tail in the
exception message usually catches only the install epilogue.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…epared hits.json

monetdb: drop the query.expect wrapper. The interactive 'password:'
expect cycle silently dumped --help on certain mclient builds and
broke load/check. Use 'mclient -h 127.0.0.1 -P monetdb' driven from
stdin instead — credentials inline, no PTY, no expect timeout.

siglens: prefer /opt/clickbench/datasets_ro/hits.json (pre-decompressed
217 GB on the readonly dataset disk) over running pigz inside the VM.
Previously the in-VM gunzip blew the 200 GB rootfs ENOSPC partway in.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
alexey-milovidov and others added 28 commits May 14, 2026 20:22
Pinot's QuickStart -type batch keeps the controller's segment store
in /tmp (a tmpfs). The pre-snapshot stop/start cycle wipes that,
and the snapshotted daemon never re-registers the hits table on
restore: /ready stays false, the host's 600s budget expires, and
/api/query times out. Treat pinot like the dataframe systems:
preserve the running daemon's state across snapshot.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tion

Pandas (and the other in-process dataframe systems) had /query
return only {"elapsed": <s>}. The playground UI then surfaced
'{"elapsed": 0.026}' as the entire query output, which is the
timing — not the value the user asked for.

server.py for pandas, chdb-dataframe, duckdb-dataframe,
polars-dataframe, daft-parquet, daft-parquet-partitioned now also
returns {"result": str(...)} (ClickHouse Pretty for chdb, repr
for everything else; the agent's OUTPUT_LIMIT caps it before it
crosses the host boundary).

query scripts rewritten to feed the JSON body through a python3
heredoc so they print {result} on stdout and {elapsed} on stderr —
matches the cross-system shell contract.
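
The stdout/stderr split could look roughly like this. The function names are hypothetical and the heredoc in the real query scripts may be shaped differently; only the `result` and `elapsed` keys come from the message above.

```python
import json
import sys


def split_response(body):
    """Return (result, elapsed) from the server's JSON body."""
    payload = json.loads(body)
    return payload.get("result", ""), payload["elapsed"]


def emit(body, out=sys.stdout, err=sys.stderr):
    """Print the result on stdout and the elapsed seconds on stderr."""
    result, elapsed = split_response(body)
    print(result, file=out)
    print(elapsed, file=err)


if __name__ == "__main__":
    emit('{"result": "42", "elapsed": 0.026}')
```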

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cross-system convention is queries.sql. Renaming victorialogs and
mongodb to match means handle_queries can stay simple — drop the
multi-extension fallback added a moment ago.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… S3 plugin is gone)

trino:latest no longer ships the hadoop-S3 plugin, so the
fs.hadoop.enabled=true + custom AWSCredentialsProvider shim path is
broken: 'External location is not a valid file system URI: s3://...'.

Switch hive.properties to fs.native-s3.enabled=true with region +
endpoint set explicitly; the public bucket allows unauthenticated
GETs and the AWS SDK falls through its default credentials chain
to anonymous when no creds are configured. The shim + core-site.xml
mounts in start stay around as no-ops for now.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The original benchmark setup used --output-format=NULL because it
measures timing only; under the playground that produces 200 OK
with an empty body, which the UI faithfully shows as '(no output)'.
Switch to ALIGNED — same human-readable table presto* uses — so
the saved row + the UI both have something to display.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…host swap

umbra still OOMs at create.sql:109 with the 256 GB swap on the
host: docker's default cgroup setup gives the container the host's
full memory but Umbra's own allocator caps itself at the cgroup's
'available' figure, which lands near the 16 GB physical RAM. Pin
the container to 128 GB with unlimited swap so Umbra's allocator
sees enough room to load the table.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
apache/drill ships a JDK whose CgroupV2Subsystem.getInstance() NPEs
when 'anyController is null' — happens on the playground VM where
cgroup2 is mounted with no controllers visible to the container.
The NPE killed RootAllocator init and every query returned 'Could
not initialize class org.apache.drill.exec.memory.RootAllocator'
with no other visible output beyond the JVM picking up the
_JAVA_OPTIONS env line.

Turning off the JVM's container-aware sysinfo path with
-XX:-UseContainerSupport skips the broken code; SELECT now works
end-to-end (verified: SELECT 1 -> '1 row selected (1.105 seconds)').

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The original runner used '.mode trash' to keep timing parsing
clean by throwing away result rows. Under the playground that
yielded an empty result body even when the query succeeded.
'.mode box' renders a readable table; the 'Run Time:' line still
matches the timing regex.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…nvisible to parallel/sh

GNU parallel runs each job under /bin/sh (not bash) by default,
and 'export -f ingest_chunk' only carries the function into bash
children. The chunks were silently routed into a non-existent
command name, parallel exited 0, the load took 4700+ s, and the
table came back with 0 rows.

Inline the awk + curl pipeline as parallel's literal command string
so it's interpreted directly by /bin/sh. Add curl --fail --show-error
so an HTTP error from /api/v1/ingest now propagates to the load
script's exit code.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…es.sql

The error-detection branch in quickwit/query only looked at
.error / .status. Quickwit returns the JSON-parse failure as
{"message": "expected value at line 1 column 1"} which the
old check missed, so the playground recorded the failed query
as a success. Add .message + a 'no .took' fallback so any
shape of malformed response surfaces as exit 1.
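
The widened check can be sketched as a predicate over the response body. This is an illustration only: the actual script is shell, `is_failed_response` is a hypothetical name, and treating a numeric `status` of 400 or above as failure is an assumption about how the `.status` branch behaves.

```python
import json


def is_failed_response(body):
    """Return True when a query response should count as a failure."""
    try:
        doc = json.loads(body)
    except ValueError:
        return True  # not JSON at all
    if not isinstance(doc, dict):
        return True
    if "error" in doc or "message" in doc:
        return True
    status = doc.get("status")
    if isinstance(status, int) and status >= 400:
        return True  # assumed meaning of the .status check
    # Fallback: a successful search response carries a `took` field.
    return "took" not in doc
```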

Also rename the workload file from queries.json to queries.sql
(removing the cosmetic SQL one that was sitting alongside) so
the playground UI picks it up via the standard handle_queries
path. Quickwit consumes Elasticsearch DSL JSON; the .sql name
is just the cross-system convention for the file the playground
reads.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…starts

tiup playground generates a fresh data dir per invocation if no
--tag is given. Pre-snapshot stop killed the loaded cluster and
the subsequent pre-snapshot start spun up a brand new one; the
snapshot captured the empty replacement and queries against the
restored VM returned 'Table test.hits doesn't exist'.

Pin --tag clickbench so the load survives stop/start.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Post-snapshot-restore, ES responds on :9200 but shards are still
recovering. Queries land before allocation completes and fail with
  no_shard_available_action_exception (status 503).

Make start/check poll _cluster/health/hits and require active shards
before returning success.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
gizmosql_client is a DuckDB-cli fork. With stdout piped, DuckDB-cli
truncates any table taller than the default page to a "<N> rows
(<M> columns)" summary — even under .mode box. Setting .maxrows -1
and .maxwidth 0 disables both axes of truncation so the user sees
the actual rows.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…nymous shim

trino:latest no longer ships the legacy hadoop-S3 plugin (removed in
v461). The replacement native-S3 filesystem has no anonymous-creds
mode, so it can't read the public clickhouse-public-datasets bucket
even with region/endpoint set — the URI is rejected outright with
'External location is not a valid file system URI: s3://...'.

Pin trino:455 (last release with hadoop-S3) and restore the
fs.hadoop.enabled=true + S3AnonymousProvider shim path that was
working until the recent :latest bump.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The recent 16 -> 96 GiB override was unfair to every other engine.
Revert it; do what we did for the dataframe systems instead:

  - raise the docker cap from --memory=128g to --memory=256g to
    allow swap-backed growth, keep --memory-swap=-1, add
    --memory-swappiness=100 so the cgroup pages out anon memory
    aggressively the moment we exceed physical RAM
  - flip the guest's vm.overcommit_memory to 1 and vm.swappiness
    to 100 inside ./start so the kernel stops refusing the large
    mmap requests Umbra issues during COPY

Removes MEM_OVERRIDES_MIB and the vm_manager plumbing for it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The AWS-SDK + Hadoop-jar name-glob no longer matches anything in
trino:455 (the dependency tree shifted between releases), so the
S3AnonymousProvider compile dies with 'package com.amazonaws.auth
does not exist'. Always use the full /usr/lib/trino/**/*.jar
classpath; the shim has no class-name collisions to worry about.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Druid's JVMs survive a snapshot restore but the SQL stack stays
dead for 10+ minutes — likely ZK session skew across the snapshot
boundary. The old druid/start only checked /status (router up
fast), so it returned 'idempotent: nothing to do' and queries kept
landing on the broken SQL endpoint.

  - druid/start: probe SELECT 1 with a 5s curl, and on failure
    pkill -KILL every druid JVM and cold-start the stack.
  - druid/check already uses SELECT 1 so it's the right gate.

Independently, even with start fixed, /ready was reporting
ready=true throughout the post-restore window because
_daemon_started.is_set() is restored from the snapshot's Python
memory. The host's _wait_for_daemon_ready passed instantly, /query
landed mid-rebuild, and the 60s host budget fired.

Fix:
  - add a btime watcher thread that calls _maybe_reconcile_for_restore
    every second, so the moment the VM resumes the watcher clears
    _daemon_started and spawns _ensure_daemon_started off-thread.
  - /ready also calls _maybe_reconcile_for_restore so a host probe
    can't beat the watcher.
  - _maybe_reconcile_for_restore now kicks _ensure_daemon_started
    in a thread itself (it was previously synchronous-only from
    /query; the watcher must not block).
  - bump _ensure_daemon_started's check loop from 60s to 10 min so
    slow daemons (Druid, Doris, Pinot) actually pass the check before
    /ready flips.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Parseable filters every query by [startTime, endTime] against the
row's ingest timestamp. The benchmark script used today's calendar
day, which is fine in a one-shot run-on-the-day-you-loaded-it
benchmark — but in the playground we load during provisioning,
snapshot the result, and then queries run hours-to-days later.
Every row falls outside today's window and the result is always
zero.

Use [2000, 2099] so any plausible load + query date is included.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
heavyai/check sent 'SELECT 1' over omnisql stdin. omnisql 5.10.2
parses that as incomplete, exits with 'Missing semicolon at end of
SQL command.' without ever contacting the daemon, and the agent's
check loop spins for the full 900 s. Add the ';'.

oxla's only public docker image (public.ecr.aws/oxla/release) was
de-listed; the repo no longer surfaces in the ECR public gallery
and there's no replacement on Docker Hub or GitHub Releases. Drop
it from the catalog (alongside sirius).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
postgresql-orioledb:
  - Park PGDATA on the per-VM sysdisk instead of the container's
    overlay layer (which lives on the 200 GiB rootfs). The orioledb
    undo log doubles the write footprint of the base table and
    blew up at line ~70M of hits.tsv.
  - Bump the sysdisk for this engine to 400 GiB via a new
    SYSDISK_OVERRIDES_GB hook in systems.py. The image is sparse so
    physical cost is what postgres actually writes.
  - Rootfs is left at 200 GiB — build-system-rootfs.sh clones the
    base via sparse-cp with no resize2fs, so a rootfs override
    would need a deeper change. Moving PGDATA to sysdisk sidesteps
    that.

UI:
  - Hovering a slab in the top system picker now highlights the
    matching row in the competition leaderboard, so the user can
    scan from picker to result without losing context. New
    .slab-hover CSS class toggled via mouseenter/mouseleave.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ooter

The image's sqlline used to print '(N rows in X.YYY seconds)' below
the result rows; current builds print 'N row(s) selected (X.YYY
seconds)' instead. Our grep matched only the old form, so the
result body kept the summary line and the timing extractor returned
empty, failing every query with 'no marker in drill output'.

Match either form for stripping, and pull the timing from any
'(X.YYY seconds)' suffix.
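
A sketch of the two-form match in Python. The real script uses grep in shell; the patterns below are assumptions built from the two footer shapes quoted above, with "row(s)" read as "row" or "rows".

```python
import re

# Old footer: "(N rows in X.YYY seconds)"
OLD_FOOTER = re.compile(r"^\(\d+ rows? in \d+(?:\.\d+)? seconds\)$")
# New footer: "N row(s) selected (X.YYY seconds)"
NEW_FOOTER = re.compile(r"^\d+ rows? selected \(\d+(?:\.\d+)? seconds\)$")
# Timing is taken from any "(... X.YYY seconds)" suffix.
TIMING = re.compile(r"(\d+(?:\.\d+)?) seconds\)$")


def split_output(text):
    """Return (body, elapsed): footer lines removed, timing extracted."""
    body, elapsed = [], None
    for line in text.splitlines():
        stripped = line.strip()
        if OLD_FOOTER.match(stripped) or NEW_FOOTER.match(stripped):
            elapsed = float(TIMING.search(stripped).group(1))
        else:
            body.append(line)
    return "\n".join(body), elapsed
```

When no footer is present `elapsed` stays `None`, which is the case the script reports as "no marker in drill output".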

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tiup playground does not reuse the data dir across restarts even
with --tag — each invocation initialises a fresh cluster, drops PD
metadata about previously-stored TiKV regions, and the test.hits
table becomes invisible. The agent's normal pre-snapshot
stop-then-start cycle therefore destroys the data tidb-lightning
just spent an hour loading.

Mark .preserve-state so the snapshot captures TiDB running as-is
(no stop/start cycle around the snapshot), and the restored VM
resumes with the table intact. The post-restore btime watcher
still re-runs ./start, which is idempotent (returns early when
MySQL on :4000 already responds), so this remains compatible with
the docker-reconcile path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mongosh routes console.error() through its own log formatter rather
than to process.stderr the way Node REPL does, so the elapsed time
the eval block was printing never reached the agent's
_extract_script_timing(stderr) parser. The UI's Time: column was
empty for every mongo query.

Wrap the mongosh invocation in shell-side date arithmetic and emit
the seconds to stderr ourselves.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previous attempt set --memory=256g --memory-swap=-1 --memory-swappiness=100,
but on cgroup v2 the swappiness flag is silently discarded and any
--memory cap creates a hard cgroup ceiling that the kernel will OOM
on regardless of swap. Let Umbra run with no docker memory cgroup
and rely on the host kernel + 256 GiB swap drive.

Also raise vm.max_map_count to 1048576 — Umbra issues many small
mmaps for its memory-mapped storage and a 100M-row COPY blows past
the 65530 default well before any OOM-killer fires.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… binary

The trino:455 image ships no /usr/bin/find, so the previous
'find /usr/lib/trino -name "*.jar"' classpath collector silently
returned empty and javac failed with 'package com.amazonaws.auth
does not exist'. Use a brace-glob over the two specific HDFS-plugin
jars (aws-java-sdk-core and hadoop-apache) and match either the
legacy 'com.amazonaws_' / 'io.trino.hadoop_' name prefix used by
older Trino builds or the bare modern name.

Tested: javac produces S3AnonymousProvider.class against
  /usr/lib/trino/plugin/hive/hdfs/aws-java-sdk-core-1.12.770.jar
  /usr/lib/trino/plugin/hive/hdfs/hadoop-apache-3.3.5-3.jar

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
omnisci/core-os-cpu:v5.10.2 ships with an empty
allowed-import-paths, so the load script's
  COPY hits FROM '/tmp/hits.csv'
fails with 'File or directory path "/tmp/hits.csv" is not
whitelisted.' Drop an omnisci.conf with [/tmp/] on the allowlist
into heavyai-storage before launching the container — the
startomnisci wrapper picks it up automatically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tursodb has been panicking partway through .import:
  thread 'main' panicked at core/storage/sqlite3_ondisk.rs:818:5:
  assertion failed: !*syncing.borrow()
  note: run with `RUST_BACKTRACE=1` environment variable ...
The note speaks for itself. Set RUST_BACKTRACE=1 so the panic line
in the provision log (and any UI-facing panic from /query) ships
with a call stack for the upstream bug report.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
SHOW BACKENDS TSV columns are
  1 BackendId  2 IP  3 HeartbeatPort  4 BePort  5 HttpPort
  6 BrpcPort   7 LastStartTime  8 LastHeartbeat  9 Alive  ...
We were inspecting column 10 (SystemDecommissioned), which is always
"false" once the BE is registered — so the wait loop in ./start
timed out even when the backend was alive and serving.
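
The corrected column pick, sketched in Python. The wait loop itself is shell; the function names are hypothetical and the sample row in the usage note is made up for illustration.

```python
ALIVE_COLUMN = 9  # 1-based position of Alive in the column list above


def backend_alive(tsv_line):
    """Return True if one SHOW BACKENDS row reports Alive=true."""
    fields = tsv_line.rstrip("\n").split("\t")
    if len(fields) < ALIVE_COLUMN:
        return False
    return fields[ALIVE_COLUMN - 1].strip().lower() == "true"


def any_backend_alive(tsv_output):
    """Return True if at least one row of the TSV output is alive."""
    return any(backend_alive(line)
               for line in tsv_output.splitlines() if line.strip())
```

For a row whose ninth field is `true` and tenth field is `false`, `backend_alive` returns True, whereas a check on column 10 would see `false` and keep waiting.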

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@alexey-milovidov alexey-milovidov self-assigned this May 14, 2026
@alexey-milovidov alexey-milovidov merged commit 65fc071 into main May 14, 2026