Skip to content

[2.8] Fix Docker SJ workspace tmpfs permissions#4574

Merged
pcnudde merged 5 commits into
NVIDIA:2.8from
YuanTingHsieh:codex/docker-sj-tmpfs-permissions
May 11, 2026
Merged

[2.8] Fix Docker SJ workspace tmpfs permissions#4574
pcnudde merged 5 commits into
NVIDIA:2.8from
YuanTingHsieh:codex/docker-sj-tmpfs-permissions

Conversation

@YuanTingHsieh
Copy link
Copy Markdown
Collaborator

@YuanTingHsieh YuanTingHsieh commented May 11, 2026

Summary

  • make the Docker job launcher workspace tmpfs writable with sticky 1777 permissions
  • keep startup/ and local/ read-only and the current job directory as the only host-backed writable mount
  • update the Docker launcher unit expectation and design doc to describe the ephemeral SJ/CJ storage behavior
  • make the Docker example startup commands copy/pasteable by backgrounding parent processes with per-site logs
  • update the Docker example job submit command to use --startup-kit
  • make Docker example paths consistently relative to examples/docker

Root Cause

deploy prepare rewrites server storage paths to /var/tmp/nvflare/workspace/snapshot-storage and /var/tmp/nvflare/workspace/jobs-storage. Server job containers inherit those resources, but the Docker job launcher mounted /var/tmp/nvflare/workspace as a read-only-style 0555 tmpfs. During SJ boot, snapshot_persistor and job_manager both try to create their storage directories under the tmpfs root and fail with EACCES.

Impact

SJ/CJ containers still do not receive the parent site workspace bind mount. The top-level workspace root is now writable only as ephemeral tmpfs so startup-time storage initialization can succeed, matching the K8s emptyDir behavior more closely. Host-backed writes remain limited to /var/tmp/nvflare/workspace/<job_id>.

@YuanTingHsieh YuanTingHsieh changed the base branch from main to 2.8 May 11, 2026 22:53
@YuanTingHsieh YuanTingHsieh force-pushed the codex/docker-sj-tmpfs-permissions branch from e86e9e3 to ce1db5a Compare May 11, 2026 22:58
@YuanTingHsieh YuanTingHsieh force-pushed the codex/docker-sj-tmpfs-permissions branch from ce1db5a to 2fabdc2 Compare May 11, 2026 23:14
@YuanTingHsieh YuanTingHsieh marked this pull request as ready for review May 11, 2026 23:15
Copilot AI review requested due to automatic review settings May 11, 2026 23:15
@YuanTingHsieh YuanTingHsieh changed the title [codex] Fix Docker SJ workspace tmpfs permissions [2.8] Fix Docker SJ workspace tmpfs permissions May 11, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a Docker-mode Server Job (SJ) startup failure by making the SJ/CJ workspace tmpfs mount writable (sticky 1777), while preserving the existing isolation model where startup/ and local/ are read-only and only the per-job directory is host-backed writable storage. It also updates documentation and examples to reflect the intended ephemeral vs persistent storage behavior.

Changes:

  • Change the Docker launcher’s workspace tmpfs mode from 0555 to sticky world-writable 1777 so non-root job containers can create required top-level ephemeral directories.
  • Update the Docker launcher unit test expectation to match the new tmpfs permissions.
  • Update Docker design docs and the Docker example README to better describe/reflect workspace behavior and commands.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
nvflare/app_opt/job_launcher/docker_launcher.py Makes the SJ/CJ workspace tmpfs root writable (1777) and clarifies the workspace isolation behavior in comments/docstring.
tests/unit_test/app_opt/job_launcher/docker_launcher_test.py Updates mount expectation to assert tmpfs_mode: 0o1777.
examples/docker/README.md Adjusts example commands to be more copy/pasteable and changes the assumed working directory.
docs/design/docker_job_launcher_design.md Updates the design doc to describe the new tmpfs permissions and ephemeral storage implications for SJ/CJ containers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread examples/docker/README.md
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR fixes a permissions bug in the Docker job launcher where SJ/CJ containers failed to start because snapshot_persistor and job_manager could not create their ephemeral storage directories under a 0o555 (read-only) tmpfs root. The fix changes the workspace tmpfs mode to 0o1777 (sticky world-writable), matching the standard Linux /tmp behavior, and updates all documentation and tests accordingly.

  • Core fix (docker_launcher.py): _WORKSPACE_TMPFS_MODE changed from 0o555 to 0o1777 so non-root container processes can initialize ephemeral top-level directories (e.g., snapshot-storage, jobs-storage) at SJ boot; startup/ and local/ bind mounts remain read-only.
  • Test (docker_launcher_test.py): Expected tmpfs_mode assertion updated from 0o555 to 0o1777 to keep the unit test in sync.
  • README (examples/docker/README.md): Startup instructions consolidated from three separate terminals into backgrounded nohup processes with per-site log files, paths made relative to examples/docker, and nvflare job submit updated to use --startup-kit.

Confidence Score: 5/5

Safe to merge — the change is a targeted, well-scoped one-liner fix with an updated test and consistent documentation.

The single code change (tmpfs mode constant) is the minimal correct fix for the EACCES boot failure. The sticky 1777 mode is the established Linux standard for a shared writable scratch space; the tmpfs is container-local and ephemeral so the broader permission does not increase the host or cross-container attack surface. The unit test was updated in lock-step, and both the design doc and README accurately describe the new behavior.

No files require special attention.

Important Files Changed

Filename Overview
nvflare/app_opt/job_launcher/docker_launcher.py Changes _WORKSPACE_TMPFS_MODE from 0o555 to 0o1777 (sticky world-writable) to allow non-root container users to initialize ephemeral storage directories at startup; comments updated accordingly.
tests/unit_test/app_opt/job_launcher/docker_launcher_test.py Updates the expected tmpfs_mode assertion from 0o555 to 0o1777 to match the new permission constant; no other test logic changed.
docs/design/docker_job_launcher_design.md Updates design doc to reflect the 1777 tmpfs mode and adds a paragraph explaining that SJ-container ephemeral paths are writable but not connected to the parent server's host-backed storage.
examples/docker/README.md Rewrites startup instructions to use relative paths from examples/docker, consolidates three-terminal workflow into backgrounded nohup processes with per-site logs, and updates job submit to use --startup-kit.

Sequence Diagram

sequenceDiagram
    participant SP as SP/CP Container
    participant DL as DockerJobLauncher
    participant D as Docker Daemon
    participant SJ as SJ/CJ Container

    SP->>DL: launch_job(job_id, workspace)
    DL->>D: "containers.run(mounts=[...])"
    Note over DL,D: Mount 1: tmpfs at /var/tmp/nvflare/workspace (mode=0o1777, ephemeral)
    Note over DL,D: Mount 2: bind startup/ → read-only
    Note over DL,D: Mount 3: bind local/  → read-only
    Note over DL,D: Mount 4: bind job_workspace/ → read-write (host-backed)
    D->>SJ: start container
    SJ->>SJ: snapshot_persistor creates /var/tmp/nvflare/workspace/snapshot-storage
    SJ->>SJ: job_manager creates /var/tmp/nvflare/workspace/jobs-storage
    SJ->>SJ: writes job outputs to /var/tmp/nvflare/workspace/job_id/ (persisted on host)
    SJ-->>SP: job complete
Loading

Reviews (1): Last reviewed commit: "Fix Docker README config paths" | Re-trigger Greptile

@pcnudde pcnudde merged commit 949dcf3 into NVIDIA:2.8 May 11, 2026
24 checks passed
@YuanTingHsieh YuanTingHsieh deleted the codex/docker-sj-tmpfs-permissions branch May 11, 2026 23:42
YuanTingHsieh added a commit to YuanTingHsieh/NVFlare that referenced this pull request May 13, 2026
## Summary

- make the Docker job launcher workspace tmpfs writable with sticky
`1777` permissions
- keep `startup/` and `local/` read-only and the current job directory
as the only host-backed writable mount
- update the Docker launcher unit expectation and design doc to describe
the ephemeral SJ/CJ storage behavior
- make the Docker example startup commands copy/pasteable by
backgrounding parent processes with per-site logs
- update the Docker example job submit command to use `--startup-kit`
- make Docker example paths consistently relative to `examples/docker`

## Root Cause

`deploy prepare` rewrites server storage paths to
`/var/tmp/nvflare/workspace/snapshot-storage` and
`/var/tmp/nvflare/workspace/jobs-storage`. Server job containers inherit
those resources, but the Docker job launcher mounted
`/var/tmp/nvflare/workspace` as a read-only-style `0555` tmpfs. During
SJ boot, `snapshot_persistor` and `job_manager` both try to create their
storage directories under the tmpfs root and fail with `EACCES`.

## Impact

SJ/CJ containers still do not receive the parent site workspace bind
mount. The top-level workspace root is now writable only as ephemeral
tmpfs so startup-time storage initialization can succeed, matching the
K8s emptyDir behavior more closely. Host-backed writes remain limited to
`/var/tmp/nvflare/workspace/<job_id>`.

(cherry picked from commit 949dcf3)
pcnudde pushed a commit that referenced this pull request May 13, 2026
## Summary

Port the selected 2.8 fixes back to `main` in 2.8 merge order:

- #4528 Add warnings for missing study data mappings
- #4538 Update deploy prepare launcher docs
- #4550 Align `Run.get_result()` with the `clean_up` parameter spelling
- #4561 Clarify `remove_client` token cleanup semantics
- #4563 Respect `CUDA_VISIBLE_DEVICES` in the GPU resource manager
- #4574 Fix Docker SJ workspace tmpfs permissions
- #4576 Narrow client failure reporting for generic launcher execution
errors
- #4583 Fix tracking recipe integration test

---------

Signed-off-by: YuanTingHsieh <yuantingh@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants