ci: set SKIP_CAPACITY on gen2 jobs to fit class RAM #1246

Merged
mbouaziz merged 1 commit into main from ci-skip-capacity-gen2
May 27, 2026

@mbouaziz
Contributor

Summary

Gen2 CircleCI machines enforce virtual memory limits at the cgroup level. The Skip runtime's default SKIP_CAPACITY of 16 GB (skiplang/prelude/runtime/palloc.c:449-451) fails to mmap on any class with <16 GB RAM, producing ERROR (MAP FAILED): Cannot allocate memory early in the job.

Observed on skipruntime / large.gen2 in PR #1245 (job 16682) and on an earlier skdb-wasm / large.gen2 run that died at 61 s (workflow f78dd673, 2026-05-27 09:37). Gen1 tolerated the same 16 GB virtual mmap because only RSS was enforced and the mmap is lazy — gen2 evidently does not.

Set SKIP_CAPACITY explicitly per gen2 job, leaving ~25% RAM headroom for OS + tooling:

| Job | Class | RAM | SKIP_CAPACITY |
|---|---|---|---|
| check-ts | medium.gen2 | 4 GB | 3G |
| skdb | large.gen2 | 8 GB | 6G |
| skdb-wasm | large.gen2 | 8 GB | 6G |
| skipruntime | large.gen2 (+ pg/kafka sidecars) | 8 GB primary | 6G |
| compiler | xlarge.gen2 | 16 GB | 12G |
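
The values above follow from the "~25% headroom" rule. A minimal sketch of that arithmetic, assuming a hypothetical helper named `skip_capacity_for_ram_gb` (it is not part of the repo; the CI config sets the values by hand):

```shell
# Hypothetical helper: derive a SKIP_CAPACITY value from a resource class's
# RAM in GB, keeping 25% of RAM free for the OS and tooling.
skip_capacity_for_ram_gb() {
  # 75% of RAM, rounded down to a whole number of GB.
  echo "$(( $1 * 3 / 4 ))G"
}

# Print the capacity for each gen2 class size used in this PR.
for ram in 4 8 16; do
  echo "${ram} GB -> $(skip_capacity_for_ram_gb "$ram")"
done
```

For 4, 8 and 16 GB this yields 3G, 6G and 12G, matching the table.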

Why also set it on jobs that haven't (yet) failed: the same mmap happens on every Skip-toolchain invocation, so any gen2 job using skiplabs/skip* images is one coin flip away from the same failure. skipruntime on large.gen2 succeeded 3× in a row on the Phase 1 PR before failing on #1245, so the variance is real.

check-examples is on large.gen2 but uses cimg/base and only runs the Skip toolchain inside docker-compose containers (separate cgroups), so it's not affected.

Test plan

  • Land this, then trigger a PR that exercises each gen2 workflow (touch skiplang/prelude/src/foo for compiler + skdb + skdb-wasm + skipruntime, and a ts workspace file for check-ts).
  • Confirm skipruntime no longer dies with MAP FAILED at skargo build.
  • Confirm compiler still passes: the 12G cap is well above the measured peak (~10 GB).
  • Watch skdb and skdb-wasm over the next few PRs for any new failures specifically at allocation time; if they appear, drop the cap further.

🤖 Generated with Claude Code

gen2 CircleCI machines enforce virtual memory limits at the cgroup
level. The Skip runtime's default SKIP_CAPACITY of 16 GB (palloc.c)
fails to mmap on any class with <16 GB RAM, producing
"ERROR (MAP FAILED): Cannot allocate memory" early in the job.

Observed on skipruntime/large.gen2 in PR #1245 and on an earlier
skdb-wasm/large.gen2 run that died at 61 s. gen1 tolerated the same
16 GB virtual mmap because only RSS was enforced and the mmap is lazy.

Set SKIP_CAPACITY explicitly per gen2 job, leaving ~25% RAM headroom
for OS and tooling:

  check-ts        medium.gen2 (4 GB)   -> 3G
  skdb            large.gen2  (8 GB)   -> 6G
  skdb-wasm       large.gen2  (8 GB)   -> 6G
  skipruntime     large.gen2  (8 GB)*  -> 6G
  compiler        xlarge.gen2 (16 GB)  -> 12G

* postgres + kafka sidecars get their own RAM allocation

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mbouaziz mbouaziz enabled auto-merge May 27, 2026 14:23
@mbouaziz mbouaziz merged commit 6e2ee26 into main May 27, 2026
2 checks passed
@mbouaziz mbouaziz deleted the ci-skip-capacity-gen2 branch May 27, 2026 14:24
mbouaziz added a commit that referenced this pull request May 27, 2026
…1247)

## Summary

Follow-up to #1246. The previous PR set `SKIP_CAPACITY` on the CircleCI
primary container, but the value doesn't propagate into nested `docker
build` steps that run on the remote-docker host. Skip toolchain
invocations inside those builds reverted to the runtime's 16 GB default
`mmap` and OOM'd on gen2 hosts.

Observed in [pipeline 4507 / job
16690](https://app.circleci.com/pipelines/github/SkipLabs/skip/4507/workflows/bb0eb244-0515-4641-8e4f-974a89f63aac/jobs/16690)
(skipruntime job, "Run native addon unreleased test" step): the docker
build of `skiplabs/skiplang-bin-builder` failed at
[skiplang/Dockerfile:41](skiplang/Dockerfile#L41) with `ERROR (MAP
FAILED): Cannot allocate memory` during `make STAGE=0`.

## Fix

Two parts:

1. **Declare `ARG SKIP_CAPACITY=` + `ENV SKIP_CAPACITY=$SKIP_CAPACITY`**
in each Dockerfile that invokes the Skip toolchain at build time:
   - [Dockerfile](Dockerfile) (top-level, `bootstrap` stage inherits ENV
     via `skiplang-base`)
   - [skiplang/Dockerfile](skiplang/Dockerfile) (`skiplang` stage inherits
     via `base`)
   - [skipruntime-ts/Dockerfile](skipruntime-ts/Dockerfile)

ARG defaults to empty so local dev builds preserve the runtime's 16 GB
default — `palloc.c:449-451` treats an empty `SKIP_CAPACITY` as unset.

2. **[bin/docker_build.sh](bin/docker_build.sh) forwards
`$SKIP_CAPACITY`** (when set) as a build arg to all bake targets that
invoke the toolchain (`skiplang`, `skip`, `skiplang-bin-builder`,
`skipruntime`). Mirrors the existing `STAGE` forwarding pattern. In CI
this picks up the value already set on the job by #1246, so no
additional `.circleci/base.yml` change is needed.
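
The forwarding in part 2 can be sketched roughly as follows. This is an illustration only, not the actual contents of `bin/docker_build.sh`: the function name `skip_capacity_bake_flags` is hypothetical, and it assumes the script passes build args through `docker buildx bake --set <target>.args.<NAME>=<value>`.

```shell
# Hypothetical sketch; the real bin/docker_build.sh may be structured differently.
# Prints one "--set <target>.args.SKIP_CAPACITY=<value>" flag per bake target,
# or nothing at all when SKIP_CAPACITY is unset or empty, so local builds keep
# the runtime's default capacity.
skip_capacity_bake_flags() (
  capacity="${SKIP_CAPACITY:-}"
  [ -n "$capacity" ] || exit 0   # exits only this subshell, with status 0
  for target in skiplang skip skiplang-bin-builder skipruntime; do
    printf '%s %s.args.SKIP_CAPACITY=%s\n' --set "$target" "$capacity"
  done
)

# Possible use (unquoted on purpose so the flags are split into words):
#   docker buildx bake $(skip_capacity_bake_flags) skipruntime
```

Emitting nothing for an empty value keeps the behaviour described in part 1: a build without `SKIP_CAPACITY` in the environment sees no extra build arg.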

## Test plan

- [ ] After merge, watch the next PR that triggers `skipruntime` — the
"Run native addon unreleased test" step's docker build should no longer
fail with `MAP FAILED`.
- [ ] Confirm local `bin/docker_build.sh skipruntime` (without
`SKIP_CAPACITY` env var) still uses the runtime's 16 GB default and
builds fine on a developer machine.
- [ ] Confirm `STAGE=1 bin/docker_build.sh skiplang` still works — the
new SKIP_CAPACITY block sits next to STAGE and shouldn't interfere.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
pull Bot pushed a commit to edisplay/skip that referenced this pull request May 27, 2026
Following SkipLabs#1246, SKIP_CAPACITY is set on the CircleCI primary
container but does not propagate into nested `docker build` steps,
which run on a separate (remote-docker) host. Skip toolchain
invocations inside those builds reverted to the runtime's 16 GB
default mmap and OOM'd on gen2 hosts that enforce virtual memory
limits.

Observed in pipeline 4507 / job 16690 (skipruntime): the docker
build of skiplabs/skiplang-bin-builder failed at
skiplang/Dockerfile:41 with "ERROR (MAP FAILED): Cannot allocate
memory" during `make STAGE=0`.

Fix in two parts:

1. Declare `ARG SKIP_CAPACITY=` + `ENV SKIP_CAPACITY=$SKIP_CAPACITY`
   in each Dockerfile whose build invokes the Skip toolchain
   (Dockerfile, skiplang/Dockerfile, skipruntime-ts/Dockerfile).
   Default empty so local dev builds keep the runtime's 16 GB
   default; palloc.c treats empty SKIP_CAPACITY as unset.

2. docker_build.sh forwards $SKIP_CAPACITY (if set) as a build arg
   to all bake targets that invoke the toolchain, mirroring the
   existing STAGE forwarding pattern. In CI this picks up the
   value already set on the job, so no additional config is needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mbouaziz added a commit that referenced this pull request May 27, 2026
sk_create_mapping printed "CAPACITY SET TO: <bytes>" to stdout for
every Skip process started with a non-default capacity. This was
debug output that escaped into normal stdout — fine when nothing
relied on stdout exactness, but the skdb diff tests
(test/diff/*.sql) compare runtime stdout to .expected golden files
and now fail on every test once SKIP_CAPACITY is set in CI (e.g. via
the gen2 fix in #1246).

Just remove the print. The value is observable via /proc/self/maps
or by recognising the cgroup limit; no observability is being lost
that needed to be on stdout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mbouaziz added a commit that referenced this pull request May 27, 2026
## Summary

`sk_create_mapping` in
[skiplang/prelude/runtime/palloc.c:567-569](skiplang/prelude/runtime/palloc.c#L567-L569)
printed `CAPACITY SET TO: <bytes>` to **stdout** for every Skip process
started with a non-default capacity.

This was harmless until #1246 set `SKIP_CAPACITY` on gen2 CI jobs to
avoid `MAP FAILED` OOMs. Now every Skip process emits the line, and the
skdb diff tests (`test/diff/*.sql`) — which compare runtime stdout to
`.expected` golden files — fail on every test.
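
A minimal sketch of why one extra stdout line breaks a golden-file comparison. The helper `check_golden` and the sample query output are hypothetical; the real diff-test harness is not shown in this PR.

```shell
# Hypothetical stand-in for the diff-test comparison: exact match or failure.
check_golden() {
  if [ "$1" = "$2" ]; then echo "PASSED"; else echo "FAILED"; fi
}

expected="1|2|3"        # made-up .expected content
clean_stdout="1|2|3"    # runtime stdout without the capacity line
noisy_stdout="CAPACITY SET TO: 6442450944
1|2|3"                  # runtime stdout with the capacity line prepended

check_golden "$clean_stdout" "$expected"
check_golden "$noisy_stdout" "$expected"
```

The first comparison passes and the second fails, even though the query result itself is identical.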

Observed in [job
16738](https://app.circleci.com/pipelines/github/SkipLabs/skip/4522/workflows/523a19b3-8783-4f62-a2c2-3a188b0e0716/jobs/16738)
(skdb on `large.gen2`, `SKIP_CAPACITY=6G` → `6442450944`):

```
CAPACITY SET TO: 6442450944
04 - test/diff/select1_views.sql (part-2):              FAILED
...
```

Repeated for ~40 diff tests. The skdb workflow has been silently broken
for any PR that triggers it (most don't, because of generate_config.sh's
per-package diff heuristic; this PR triggered it because it touches a
`skiplang/prelude/` file).
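
For reference, the byte count in the log line is 6 × 1024³. A small sketch of that conversion, assuming binary suffixes (consistent with `6G` → `6442450944` above); `capacity_to_bytes` is a hypothetical helper, and the runtime's own parser in `palloc.c` may accept other forms:

```shell
# Hypothetical converter for values such as "6G"; plain numbers pass through.
capacity_to_bytes() (
  value="$1"
  number="${value%[KMG]}"   # strip a single trailing K, M or G if present
  case "$value" in
    *K) echo $(( number * 1024 )) ;;
    *M) echo $(( number * 1024 * 1024 )) ;;
    *G) echo $(( number * 1024 * 1024 * 1024 )) ;;
    *)  echo "$value" ;;
  esac
)

capacity_to_bytes 6G
```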

## Fix

Remove the print. If anyone needs to confirm the runtime's capacity at
startup, `/proc/self/maps` or the `--capacity` CLI flag echo are cleaner
channels — stdout is reserved for program output that golden-file tests
can rely on.

## Test plan

- [ ] After merge, a PR that touches `skiplang/prelude/` and triggers
`skdb` should now pass.
- [ ] `compiler` and `skipruntime` (also affected, since they run with
`SKIP_CAPACITY=12G` / `6G`) should remain green — they don't compare
stdout, so they weren't failing visibly, but the line was still being
emitted and cluttering logs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)