feat: add live job log streaming #4454

Merged
nvidianz merged 20 commits into NVIDIA:main from nvidianz:job-log-streamer
Apr 24, 2026

@nvidianz
Collaborator

@nvidianz nvidianz commented Apr 17, 2026

Summary

  • Adds JobLogStreamer (client) and JobLogReceiver (server) for real-time log streaming from clients to the server during federated jobs, replacing the deprecated ErrorLogSender/LogReceiver which only sent static snapshots after job completion.
  • Adds SystemLogStreamer — a system-level widget for resources.json that automatically injects JobLogStreamer into jobs that don't already declare one, by modifying the deployed config_fed_client.json at BEFORE_JOB_LAUNCH.
  • Updates provisioning templates: client uses SystemLogStreamer (streams error_log.txt), server uses JobLogReceiver.
  • Includes hello-log-streaming example and unit tests.
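The injection step in the second bullet can be pictured as a small, idempotent patch to the client config. The sketch below is illustrative only and is not the code from this PR: the class path is inferred from the file location `nvflare/app_common/logging/job_log_streamer.py`, and the component id and the `log_file_name` argument are hypothetical names.

```python
import copy

# Assumed class path, inferred from the file location named in this PR;
# verify against the merged code before relying on it.
STREAMER_PATH = "nvflare.app_common.logging.job_log_streamer.JobLogStreamer"


def inject_streamer(client_config: dict, log_file_name: str = "error_log.txt") -> dict:
    """Return a copy of the client config with a log streamer component added,
    unless the job already declares one."""
    config = copy.deepcopy(client_config)
    components = config.setdefault("components", [])
    if any(c.get("path") == STREAMER_PATH for c in components):
        return config  # the job already declares its own streamer
    components.append(
        {
            "id": "job_log_streamer",  # hypothetical component id
            "path": STREAMER_PATH,
            "args": {"log_file_name": log_file_name},  # hypothetical argument name
        }
    )
    return config
```

Because the function checks for an existing declaration first, applying it twice leaves the config unchanged, which matches the "jobs that don't already declare one" condition.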

Why streaming runs inside the job, not in CLIENT_PARENT

The streamer must run inside the job subprocess rather than being managed from the parent process. With Docker or Kubernetes job launchers, the job may run in a different container or pod that the parent process has no filesystem access to. Streaming from CLIENT_PARENT would only work with the local process launcher. By injecting JobLogStreamer into the job config, the streamer always runs where the log file lives — regardless of the launch mechanism.

Key design points

  • LogStreamer tails growing log files, survives log rotation, and sends liveness heartbeats so the receiver can detect dead senders via idle timeout.
  • JobLogStreamer uses a two-phase stop: ABOUT_TO_END_RUN sets stop_event only (returns immediately so framework log lines land in the file during the drain window); END_RUN joins the thread to keep client_run() alive until EOF is acknowledged.
  • A fresh Signal is injected into the streaming FLContext via put() (not set_prop()) to bypass the mask-consistency check that would otherwise leave the triggered run_abort_signal in place.
  • JobLogReceiver writes chunks directly to a file as they arrive so the log can be followed with tail -f.
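The two-phase stop in the second bullet can be sketched with plain threading primitives. This is a simplified stand-in, not the `JobLogStreamer` implementation: the class name, the `drain_seconds` parameter, and the `finished` flag are all hypothetical, and the 60-second default mirrors the join timeout mentioned in the review.

```python
import threading
import time


class TwoPhaseStopper:
    """Illustrative stand-in for a streamer with a two-phase stop (not the NVFlare class)."""

    def __init__(self, drain_seconds: float = 0.05):
        self.stop_event = threading.Event()
        self.finished = threading.Event()
        self.drain_seconds = drain_seconds
        self.thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self.thread.start()

    def _run(self) -> None:
        # Stream until asked to stop (real code would send chunks here).
        self.stop_event.wait()
        # Drain window: keep going briefly so late log lines are picked up.
        time.sleep(self.drain_seconds)
        self.finished.set()  # stands in for "EOF acknowledged"

    def about_to_end_run(self) -> None:
        # Phase 1: signal only; returns immediately without blocking the caller.
        self.stop_event.set()

    def end_run(self, timeout: float = 60.0) -> bool:
        # Phase 2: block until the streaming thread has finished, or the timeout expires.
        self.thread.join(timeout)
        return not self.thread.is_alive()
```

Separating the signal from the join is what lets log lines written between the two events still reach the file before the final read.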

Test plan

  • Run examples/hello-world/hello-log-streaming/job.py in simulator — verify streamed log matches original
  • Run with POC mode — verify SystemLogStreamer injects JobLogStreamer into jobs without one
  • Verify ALLOW_ERROR_SENDING=false in project.yml removes SystemLogStreamer from provisioned resources.json
  • Run unit tests: pytest tests/unit_test/app_common/streamers/log_streamer_test.py

🤖 Generated with Claude Code

nvidianz and others added 2 commits April 17, 2026 09:06
Introduces LogStreamer, JobLogStreamer, and JobLogReceiver for
real-time log tailing from clients to the server during a federated
job. Deprecates ErrorLogSender and LogReceiver in favour of the new
live-streaming components.

Key design points:
- _LogTailProducer tails a growing log file, survives log rotation,
  sends liveness heartbeats, and does a drain retry after stop_event
  fires to capture bytes written by cleanup code.
- JobLogStreamer uses a two-phase stop: ABOUT_TO_END_RUN sets
  stop_event only (returns immediately so framework log lines land
  in the file during the drain window); END_RUN joins the thread to
  keep client_run() alive until EOF is acknowledged by the server.
- A fresh Signal is injected into the streaming FLContext via put()
  (not set_prop()) to bypass the mask-consistency check that would
  otherwise leave the triggered run_abort_signal in place and cause
  the sender to abort mid-stream.
- JobLogReceiver writes chunks directly to a file as they arrive so
  the log can be followed with tail -f; hands the file to the job
  manager when the stream closes.
- Includes hello-log-streaming example and unit tests.
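
As a rough illustration of tailing a growing file across rotation, the sketch below compares the inode of the open handle with the inode currently at the path and reopens when they differ or when the file has shrunk. It is an assumption-laden sketch, not `_LogTailProducer`: the `LogTail` class and its methods are hypothetical, and inode comparison assumes a POSIX-style filesystem.

```python
import os


class LogTail:
    """Reads bytes appended to a file, reopening it if the file is rotated."""

    def __init__(self, path: str):
        self.path = path
        self._file = None
        self._inode = None

    def _open(self) -> None:
        self._file = open(self.path, "rb")
        self._inode = os.fstat(self._file.fileno()).st_ino

    def read_new(self) -> bytes:
        """Return whatever has been appended since the last call (possibly b"")."""
        if self._file is None:
            if not os.path.exists(self.path):
                return b""  # file not created yet
            self._open()
        data = self._file.read()  # remainder of the currently open file
        try:
            st = os.stat(self.path)
        except FileNotFoundError:
            return data  # rotated away and the new file does not exist yet
        # Rotation by rename changes the inode; truncation in place shrinks the size.
        rotated = st.st_ino != self._inode or st.st_size < self._file.tell()
        if rotated:
            self._file.close()
            self._open()
            data += self._file.read()
        return data

    def close(self) -> None:
        if self._file is not None:
            self._file.close()
            self._file = None
```

Under this sketch, a drain retry would amount to calling `read_new()` once more after `stop_event` fires, so bytes written by cleanup code are still collected.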

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SystemLogStreamer is a system-level widget (resources.json) that
automatically injects a JobLogStreamer into every job that doesn't
already declare one.  It hooks BEFORE_JOB_LAUNCH to modify the
deployed config_fed_client.json before the job subprocess starts —
no duplicate streaming code needed.

Provisioning template changes:
- Client: ErrorLogSender → SystemLogStreamer (streams error_log.txt)
- Server: LogReceiver → JobLogReceiver
- Fixed _modify_error_sender to match the new component id

Also downgraded the "file not created" log from WARNING to INFO in
JobLogStreamer since error_log.txt not existing is expected for
successful jobs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@greptile-apps
Contributor

greptile-apps Bot commented Apr 17, 2026

Greptile Summary

This PR replaces the deprecated ErrorLogSender/LogReceiver (static post-job snapshots) with a real-time streaming pair: JobLogStreamer (client) tails the live log file and streams chunks to JobLogReceiver (server) as they are written, with heartbeat-based liveness detection and a watchdog idle-timeout on the receiver. SystemLogStreamer auto-injects JobLogStreamer into jobs that don't declare one via BEFORE_JOB_LAUNCH config patching, and uploads a fallback post-run error-log snapshot from CLIENT_PARENT for launch/bootstrap failures.

All previously flagged P0/P1 concerns have been addressed: the liveness_interval < idle_timeout guard is now enforced in stream_log(), the watchdog correctly terminates orphaned streams even before the first consume(), the receiver derives client/job identity from the trusted peer context, the fresh FLContext for background threads is created in both JobLogStreamer and SystemLogStreamer, the stale method name has been renamed, and the wire-protocol namespace change is documented as a breaking change in the release notes.
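
The relationship between the liveness heartbeat and the receiver's idle timeout can be sketched as follows. This is a minimal model under stated assumptions, not the code in `log_streamer.py`: `check_intervals` and `IdleWatch` are hypothetical names, and the explicit `now` arguments exist only to make the behaviour easy to test.

```python
import time


def check_intervals(liveness_interval: float, idle_timeout: float) -> None:
    """Reject settings where heartbeats would arrive too slowly to keep the stream alive."""
    if liveness_interval >= idle_timeout:
        raise ValueError(
            f"liveness_interval ({liveness_interval}) must be less than idle_timeout ({idle_timeout})"
        )


class IdleWatch:
    """Tracks the time of the last message; the clock starts at creation, not at first data."""

    def __init__(self, idle_timeout: float, now: float = None):
        self.idle_timeout = idle_timeout
        self.last_seen = time.monotonic() if now is None else now

    def touch(self, now: float = None) -> None:
        # Called for every data chunk or heartbeat.
        self.last_seen = time.monotonic() if now is None else now

    def expired(self, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        return now - self.last_seen > self.idle_timeout
```

Starting the clock when the watch is created is what allows a sender that dies before its first message to be detected.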

Confidence Score: 5/5

Safe to merge; all previously flagged P0/P1 issues are resolved and only minor P2 style/polish items remain.

All blocking issues from the prior review round (path traversal, stale fl_ctx, watchdog early-crash, liveness/idle constraint, wire-protocol rename documentation) have been addressed. The two remaining comments are cosmetic: a misleading 'retained at None' warning message and a non-configurable 60 s join timeout — neither affects correctness or security.

No files require special attention; the minor log-message fix in job_log_receiver.py is optional polish.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nvflare/app_common/streamers/log_streamer.py | New live-tail streamer with heartbeat and idle-timeout watchdog; previously flagged issues (liveness_interval < idle_timeout guard, watchdog crash-before-first-data) are now fixed. |
| nvflare/app_common/logging/job_log_receiver.py | New server-side log receiver; previously flagged path-traversal and trusted-identity issues are resolved; misleading 'retained at None' log message when no data arrives remains a minor concern. |
| nvflare/app_common/logging/job_log_streamer.py | Client-side job log streamer with two-phase stop (ABOUT_TO_END_RUN sets event, END_RUN joins); join timeout is hardcoded at 60 s without a config knob. |
| nvflare/app_common/logging/system_log_streamer.py | System-level widget that injects JobLogStreamer into jobs; fresh context for background thread and post-run error-log upload are correctly handled. |
| nvflare/app_common/streamers/streamer_base.py | Extracted BaseChunkProducer/BaseChunkConsumer shared logic and unified wire-key namespace to 'Streamer.*'; breaking change is now documented in release notes. |
| nvflare/private/stream_runner.py | Adds END_STREAM hook injection into stream_ctx before consumer creation, enabling receiver-side watchdog to close orphaned streams asynchronously; also fixes a typo (stream_done_db → stream_done_cb). |
| nvflare/lighter/impl/static_file.py | Renamed _modify_error_sender to _modify_system_log_streamer and updated component ID from 'error_log_sender' to 'system_log_streamer'; clean rename. |
| tests/unit_test/app_common/streamers/log_streamer_test.py | Good coverage: per-stream callback isolation, independent idle timeout per stream, orphaned stream cleanup without fl_ctx, and liveness_interval >= idle_timeout rejection. |

Sequence Diagram

sequenceDiagram
    participant SLS as SystemLogStreamer(CLIENT_PARENT)
    participant JLS as JobLogStreamer(CLIENT_JOB)
    participant SR as stream_runner(Server)
    participant JLR as JobLogReceiver(Server)
    SLS->>SLS: BEFORE_JOB_LAUNCH patch config_fed_client.json
    Note over JLS: Job subprocess starts with patched config
    JLS->>JLS: START_RUN spawn streaming thread with fresh FLContext
    JLS->>SR: open stream seq=0 with stream_ctx
    SR->>SR: inject END_STREAM hook into stream_ctx
    SR->>JLR: get_consumer creates LogChunkConsumer and watchdog thread
    loop Live tail loop
        JLS->>SR: data chunk
        SR->>JLR: consume writes to file and flushes
        JLR-->>SR: OK continue
        Note over JLS,SR: If log quiet beyond liveness interval
        JLS->>SR: heartbeat message
        SR->>JLR: consume resets idle clock
    end
    JLS->>JLS: ABOUT_TO_END_RUN set stop_event
    JLS->>JLS: drain remaining bytes
    JLS->>SR: EOF chunk
    SR->>JLR: consume returns stop
    JLR->>JLR: finalize stops watchdog
    SR->>JLR: dispatch stream done callback
    JLR->>JLR: close file store via job_manager
    JLS->>JLS: END_RUN join streaming thread up to 60s
    Note over SLS: JOB_COMPLETED error log only
    SLS->>SR: upload snapshot with pre-set stop_event

Reviews (10): Last reviewed commit: "Document FileStreamer wire-key namespace..."

Comment thread nvflare/app_common/streamers/log_streamer.py Outdated
Comment thread nvflare/app_common/streamers/log_streamer.py Outdated
Comment thread nvflare/app_common/logging/job_log_receiver.py
Comment thread nvflare/lighter/impl/static_file.py
nvidianz and others added 7 commits April 17, 2026 09:15
Documents the architecture, two-phase stop strategy, drain retry,
fresh abort signal, and rationale for running streaming inside the
job subprocess (Docker/K8s launcher compatibility).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The streamed log always misses the last few framework teardown lines
written after END_RUN completes — document this as a known limitation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Match the provisioning template by setting error_log.txt as the
streamed file name.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix watchdog to detect dead senders that crash before sending any
  data: seed fl_ctx from get_consumer() instead of waiting for first
  consume(), and stop resetting the idle clock when no message has
  arrived.
- Clean up partial log file on disk when stream ends with non-OK
  return code (e.g. TIMEOUT, TASK_ABORTED).
- Rename _modify_error_sender to _modify_log_streamer to match the
  current component name.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Retain the partial file for debugging instead of deleting it.
Log a warning with the file path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread nvflare/app_common/logging/job_log_receiver.py
nvidianz and others added 5 commits April 17, 2026 11:11
The watchdog polled every 1.0s regardless of idle_timeout, making
sub-second timeouts indistinguishable. Now polls at
min(1.0, idle_timeout/3) so the watchdog resolution matches the
configured threshold. Updated test to stagger heartbeats and verify
independent timeout ordering.
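
The polling rule from this commit message is small enough to state directly. The function name below is hypothetical; only the `min(1.0, idle_timeout/3)` expression comes from the commit message.

```python
def watchdog_poll_interval(idle_timeout: float) -> float:
    """Poll at most once per second, but at least three times per idle_timeout window."""
    return min(1.0, idle_timeout / 3)
```

With a 30-second timeout the watchdog still polls once per second, while a 0.3-second timeout yields a poll interval of roughly 0.1 seconds, so sub-second timeouts are no longer rounded up to the old fixed 1.0-second resolution.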

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reject absolute paths and path traversal (..) to prevent streaming
files outside the job's log directory.
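
A check of this kind could look like the sketch below. It is an assumption about the shape of the validation, not the code in this commit: the function name is hypothetical, and `PurePosixPath` treats only `/` as a separator, so Windows-style paths would need separate handling.

```python
from pathlib import PurePosixPath


def validate_relative_log_name(name: str) -> str:
    """Accept only non-empty relative paths with no '..' component."""
    path = PurePosixPath(name)
    if not path.parts:
        raise ValueError("log file name must not be empty")
    if path.is_absolute():
        raise ValueError(f"absolute paths are not allowed: {name!r}")
    if ".." in path.parts:
        raise ValueError(f"path traversal is not allowed: {name!r}")
    return name
```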

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Apply os.path.basename() to job_id, client name, and log file name
from the stream context before constructing the destination path.
This prevents directory traversal via '..' or absolute paths in
client-supplied values.

Also corrected docstring that falsely claimed the liveness_interval
< idle_timeout constraint is enforced at call time — the values are
configured on different sites.
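
The basename-based sanitization described above might be sketched like this. The directory layout (`root/job_id/client/file`) and the function name are hypothetical, and the extra rejection of empty, `.` and `..` components is an addition for the sketch, since `os.path.basename("..")` returns `..` unchanged.

```python
import os


def safe_destination(root: str, job_id: str, client_name: str, file_name: str) -> str:
    """Build a destination path from untrusted values, keeping only their final components."""
    parts = [os.path.basename(p) for p in (job_id, client_name, file_name)]
    # basename() alone leaves "..", "." and "" untouched, so reject those explicitly.
    if any(p in ("", ".", "..") for p in parts):
        raise ValueError("invalid path component in stream context")
    return os.path.join(root, *parts)
```

For example, a client-supplied file name of `../../etc/passwd` is reduced to `passwd` and stays under the job's directory rather than escaping it.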

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread nvflare/app_common/logging/system_log_streamer.py Outdated
Comment thread nvflare/app_common/logging/system_log_streamer.py Outdated
Collaborator

@chesterxgchen chesterxgchen left a comment

add a few comments

Comment thread nvflare/app_common/logging/system_log_streamer.py
@nvidianz
Collaborator Author

/build

Comment thread nvflare/app_common/streamers/streamer_base.py
nvidianz and others added 2 commits April 23, 2026 18:58
@nvidianz
Collaborator Author

/build

Comment thread docs/release_notes/flare_260.rst
Comment thread docs/resources/log_config.json
Comment thread nvflare/app_common/logging/job_log_streamer.py
Collaborator

@chesterxgchen chesterxgchen left a comment

LGTM overall

a few minor comments (FLARE 2.6.0 release notes change)
and a few questions you can address later

@nvidianz nvidianz merged commit ee3e38b into NVIDIA:main Apr 24, 2026
29 checks passed
@nvidianz nvidianz deleted the job-log-streamer branch April 24, 2026 00:18