Fix VNC container startup failure when x11 volume is in corrupt state #14
Agent-Logs-Url: https://github.com/LCAS/aoc_container_base/sessions/8c372695-b798-4c45-887e-2f66c08ba314 Co-authored-by: cooperj <28831674+cooperj@users.noreply.github.com>
@copilot Integrate this as a Docker health check, so that when there are issues the container is marked 'unhealthy' and both dependent containers can hold on.
…LTHCHECK Agent-Logs-Url: https://github.com/LCAS/aoc_container_base/sessions/ca8bbf03-ea7a-4a24-a2ff-de0d7e97b8bd Co-authored-by: cooperj <28831674+cooperj@users.noreply.github.com>
Done in commit 66b09ef. Added … The Dockerfile now also declares …
Pull request overview
This PR hardens the VNC container startup path so stale X11 state does not leave the service hanging indefinitely, and so Compose can gate dependent services on VNC readiness.
Changes:
- Adds explicit X11 cleanup, TurboVNC/XFCE startup timeouts, retry-on-failure logic, and failure log output in the VNC entrypoint.
- Adds an in-image VNC health check script and Dockerfile `HEALTHCHECK` for X11 and noVNC readiness.
- Updates both Compose variants to use the new health check script with a startup grace period.
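
The timeout and retry-on-failure behaviour listed above could look roughly like the following. This is a minimal sketch, not the code in `docker/vnc-entrypoint.sh`: the helper names `wait_for` and `start_with_retry`, and their argument order, are hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# Sketch only: wait_for and start_with_retry are hypothetical helper names,
# not functions taken from the PR's entrypoint.

# wait_for TIMEOUT_SECONDS CMD...
# Polls CMD once per second; returns 1 if it has not succeeded after
# TIMEOUT_SECONDS seconds of waiting.
wait_for() {
    remaining=$1; shift
    until "$@" >/dev/null 2>&1; do
        [ "$remaining" -le 0 ] && return 1
        sleep 1
        remaining=$((remaining - 1))
    done
}

# start_with_retry START_FN READY_FN TIMEOUT LOGFILE
# Runs START_FN, waits for READY_FN to succeed, and retries once on timeout.
# On each failed attempt the log file (if readable) is dumped to stderr.
start_with_retry() {
    start_fn=$1 ready_fn=$2 timeout=$3 logfile=$4
    for attempt in 1 2; do
        "$start_fn"
        if wait_for "$timeout" "$ready_fn"; then
            return 0
        fi
        echo "startup attempt $attempt timed out after ${timeout}s" >&2
        [ -r "$logfile" ] && cat "$logfile" >&2
    done
    return 1
}
```

A caller would pass its own start and readiness functions, for example one that launches the VNC server and one that probes the display. Any cleanup between attempts is left to the start function in this sketch.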
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `vnc.dockerfile` | Copies the healthcheck script into the VNC image and adds an image-level `HEALTHCHECK`. |
| `docker/vnc-healthcheck.sh` | New healthcheck script that probes the X display and the noVNC HTTP endpoint. |
| `docker/vnc-entrypoint.sh` | Reworks VNC startup with X11 cleanup, timeouts, retry handling, and failure diagnostics. |
| `compose.yml` | Switches the VNC service healthcheck to the bundled script and adds `start_period`. |
| `compose.cuda.yml` | Mirrors the Compose healthcheck update for the CUDA variant. |
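
For orientation, a probe of the kind summarised for `docker/vnc-healthcheck.sh` might be structured as below. This is a hedged sketch rather than the script's actual contents: the function name `check_vnc_health` and the `VNC_DISPLAY` and `NOVNC_URL` variables are assumptions, with defaults taken from the display `:1` and the `localhost:5801/vnc.html` endpoint named in the PR description.

```shell
#!/bin/sh
# Sketch only: not the contents of docker/vnc-healthcheck.sh.
# VNC_DISPLAY, NOVNC_URL and check_vnc_health are hypothetical names.
VNC_DISPLAY="${VNC_DISPLAY:-:1}"
NOVNC_URL="${NOVNC_URL:-http://localhost:5801/vnc.html}"

check_vnc_health() {
    # Probe 1: the X server must answer on the expected display.
    if ! xdpyinfo -display "$VNC_DISPLAY" >/dev/null 2>&1; then
        echo "unhealthy: X display $VNC_DISPLAY is not responding" >&2
        return 1
    fi
    # Probe 2: noVNC must serve its page; -f turns HTTP errors into a
    # non-zero exit status, --max-time bounds a hung connection.
    if ! curl -fsS -o /dev/null --max-time 5 "$NOVNC_URL"; then
        echo "unhealthy: noVNC is not serving $NOVNC_URL" >&2
        return 1
    fi
    return 0
}
```

A health check script built this way would end with `check_vnc_health || exit 1`, since Docker treats exit status 0 as healthy and 1 as unhealthy.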
    echo "display is up"
    if [ -e /tmp/.X1-lock ]; then
      echo " [x11] Found stale lock file /tmp/.X1-lock — removing..."
      rm -f /tmp/.X1-lock || sudo rm -f /tmp/.X1-lock

    if [ -e /tmp/.X11-unix/X1 ]; then
      echo " [x11] Found stale socket /tmp/.X11-unix/X1 — removing..."
      rm -f /tmp/.X11-unix/X1 || sudo rm -f /tmp/.X11-unix/X1

    if $had_stale; then
      echo " [x11] Stale X11 files from previous run cleared successfully."
    else
      echo " [x11] No stale X11 files found."

    echo "To recover manually, remove the x11 volume and restart:" >&2
    echo " docker compose down -v && docker compose up" >&2
This allows us to clear the x11 volume on crashes Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
The named `x11` Docker volume can retain stale lock files and sockets (`/tmp/.X1-lock`, `/tmp/.X11-unix/X1`) from a crashed or restarted container, causing TurboVNC to silently fail and the entrypoint to hang indefinitely.

Changes

- … `sudo rm -rf ... > /dev/null 2>&1` with a `cleanup_x11()` function that reports what it finds and removes, and warns if removal fails; falls back to `sudo` if a plain `rm` is denied
- … `/tmp/vnc.log` to stderr, kills the failed screen session, re-runs cleanup, and retries once
- … `/tmp/xfce4.log` before exiting
- … `docker/vnc-healthcheck.sh`, which verifies the X11 display (`xdpyinfo -display :1`) and noVNC endpoint (`curl localhost:5801/vnc.html`); the container is marked unhealthy if either check fails, allowing dependent containers using `condition: service_healthy` to hold until the full VNC stack is ready
- `HEALTHCHECK` instruction: the health check script is baked into the image (`HEALTHCHECK --start-period=60s`) so it works correctly even without Docker Compose
- `start_period: 60s` added to the healthcheck in `compose.yml` and `compose.cuda.yml` to allow the VNC stack time to fully initialise before retries start counting
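
The cleanup behaviour described above (report what is found, try a plain `rm`, fall back to `sudo`, warn on failure) can be sketched as follows. This is an illustration under stated assumptions, not the PR's `cleanup_x11()`: the `X11_LOCK` and `X11_SOCKET` override variables are hypothetical and exist only so the function can be tried against scratch paths, and the non-interactive `sudo -n` is a choice made for this sketch.

```shell
#!/bin/sh
# Sketch only: an approximation of the described cleanup, not the PR's code.
# X11_LOCK and X11_SOCKET are hypothetical overrides for experimentation.
X11_LOCK="${X11_LOCK:-/tmp/.X1-lock}"
X11_SOCKET="${X11_SOCKET:-/tmp/.X11-unix/X1}"

cleanup_x11() {
    had_stale=false
    for f in "$X11_LOCK" "$X11_SOCKET"; do
        # -e is false for dangling symlinks, so test -L as well.
        if [ -e "$f" ] || [ -L "$f" ]; then
            had_stale=true
            echo "  [x11] Found stale file $f, removing..."
            # Plain rm first; sudo -n (never prompt) only if that is denied.
            if ! rm -f "$f" 2>/dev/null && ! sudo -n rm -f "$f" 2>/dev/null; then
                echo "  [x11] WARNING: could not remove $f" >&2
                return 1
            fi
        fi
    done
    if $had_stale; then
        echo "  [x11] Stale X11 files from previous run cleared successfully."
    else
        echo "  [x11] No stale X11 files found."
    fi
}
```

Returning non-zero when removal fails lets the caller decide whether to abort startup or print manual recovery instructions, instead of hiding the failure as the earlier silent redirect did.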