Conversation

sovit-max (Contributor) commented Nov 20, 2025

Gets training logs from SkyPilot jobs into Datadog.

The key issue was that SkyPilot's subprocess daemon recursively kills all child processes when a phase ends. Starting the DD agent in the setup phase meant it was killed before training started.

The fix is to move the DD agent setup from the setup phase to the run phase (run.sh), so the agent stays alive during training.

run.sh also `tee`s stdout to a log file for DD to tail; the log path comes from an env var we pass in from launch.py.
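A minimal sketch of that run-phase pattern, assuming a `TRAINING_LOG_FILE` variable, a `start_datadog_agent` helper, and a `python -m train` entry point (all illustrative names, not the PR's actual code):

```bash
#!/usr/bin/env bash
# run.sh (sketch) -- the agent is started here, in the run phase, so
# SkyPilot's subprocess daemon can't kill it when the setup phase ends.
start_datadog_agent   # hypothetical helper; see the startup sketch below

# Log path passed in from launch.py via the task environment.
LOG_FILE="${TRAINING_LOG_FILE:?expected from launch.py}"
mkdir -p "$(dirname "$LOG_FILE")"

# Tee stdout/stderr to the file the Datadog agent tails.
python -m train "$@" 2>&1 | tee "$LOG_FILE"
```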

Fixes agent startup in Docker containers by checking for the agent binary instead of the systemd service, and starts the agent automatically during the run phase. Also fixes the script exiting when `pgrep` finds no match, and adds clearer error messages.
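A sketch of startup logic along these lines (`datadog-agent run` is the agent's real foreground subcommand; the paths and messages are illustrative):

```bash
# On hosts with systemd, restart the service; inside Docker containers
# there is no systemd, so launch the agent binary directly.
if command -v systemctl >/dev/null 2>&1 \
    && systemctl list-unit-files 2>/dev/null | grep -q datadog-agent; then
  sudo systemctl restart datadog-agent
elif command -v datadog-agent >/dev/null 2>&1; then
  sudo datadog-agent run >/tmp/dd-agent.log 2>&1 &
fi

# pgrep exits non-zero when nothing matches, which aborts a `set -e`
# script; guard it and emit a useful message instead.
if ! pgrep -f datadog-agent >/dev/null 2>&1; then
  echo "WARNING: Datadog agent is not running" >&2
fi
```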
- Add `metta_run_id`, `skypilot_task_id`, and other tags to all log entries (see the config sketch further down)
- Enable filtering logs by run ID in the Datadog Logs Explorer
- Fix an issue where logs weren't searchable by the `metta_run_id` tag
- Remove wildcard paths (use explicit file paths only)
- Create empty log files during setup so the agent can start collecting immediately
- Use unbuffered output (`stdbuf`) so logs are written in real time, as sketched below
- Add a programmatic log-checking script (`check_datadog_logs.py`)
- Fix tag formatting in the log collection config
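A sketch of the unbuffered pipeline (the path matches the earlier sketch; note that `stdbuf` only affects C-stdio buffering, which is why a later commit also sets `PYTHONUNBUFFERED=1` for Python):

```bash
# Pre-create the log file during setup so the agent can tail it at once.
mkdir -p /tmp/training_logs
touch /tmp/training_logs/train.log

# Line-buffer stdout/stderr so lines reach the file in real time
# instead of waiting for a multi-KB stdio block to fill.
stdbuf -oL -eL python -m train "$@" 2>&1 | tee /tmp/training_logs/train.log
```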
- Always restart the agent so `logs_enabled` and the log collection config are loaded, ensuring it picks up config changes made during the setup phase
- Add a `logs` section to the main `datadog.yaml` in addition to the separate config file, so logs are collected even if the separate config isn't picked up
- Move the config from `custom_logs.d` to a `custom_logs` directory (standard format), then to `skypilot_training.d` (proper `.d` format)
- Add a `logs_config` section with `auto_multi_line_detection` and `force_use_http`, following Datadog's log collection best practices
- Remove the now-duplicate `logs` section from the main `datadog.yaml` (log entries come from `conf.d` files); the end state is sketched below
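Putting those commits together, the final configuration looks roughly like this (the `skypilot_training.d` directory name and the `logs_config` options come from the commits above; the `service`/`source` values and the `METTA_RUN_ID` variable are assumptions, while `SKYPILOT_TASK_ID` is a standard SkyPilot environment variable):

```bash
# Per-integration log config in the agent's conf.d directory.
CONF_DIR=/etc/datadog-agent/conf.d/skypilot_training.d
sudo mkdir -p "$CONF_DIR"
sudo tee "$CONF_DIR/conf.yaml" >/dev/null <<EOF
logs:
  - type: file
    path: /tmp/training_logs/train.log   # explicit path, no wildcards
    service: skypilot-training
    source: python
    tags:
      - metta_run_id:${METTA_RUN_ID}
      - skypilot_task_id:${SKYPILOT_TASK_ID}
EOF

# Main config: enable log collection plus the logs_config options;
# the log entries themselves stay in conf.d.
sudo tee -a /etc/datadog-agent/datadog.yaml >/dev/null <<EOF
logs_enabled: true
logs_config:
  auto_multi_line_detection: true
  force_use_http: true
EOF

# The agent is then always restarted so it reloads both files.
```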
- Add detailed verification in the startup script to check:
  - agent status and log collection status
  - log collection config file existence (showing its contents when found)
  - training log file existence and size
- Run `datadog-agent configcheck` to verify the agent sees the config, and list the available config files if ours is missing
- Set proper file permissions (644) on the log config file, log its size, and validate that it contains the required fields
- Together these checks help debug why logs aren't appearing in Datadog
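The verification maps onto real agent subcommands; roughly:

```bash
CONF=/etc/datadog-agent/conf.d/skypilot_training.d/conf.yaml

# Agent status, including the Logs Agent section.
sudo datadog-agent status

# Does the agent actually see our config? List what's there if not.
sudo datadog-agent configcheck | grep -q skypilot_training \
  || ls /etc/datadog-agent/conf.d/

# Config file: permissions, size, contents, required fields.
sudo chmod 644 "$CONF"
ls -l "$CONF" && sudo cat "$CONF"
grep -q 'path:' "$CONF" || echo "config is missing a log path" >&2

# Training log file: existence and size.
ls -l /tmp/training_logs/train.log
```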
This ensures the Datadog agent user (`dd-agent`) can read logs generated by the training process user:

- Set `/tmp/training_logs` to mode 777
- Set log files to mode 666
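i.e., roughly:

```bash
chmod 777 /tmp/training_logs        # world-traversable log directory
chmod 666 /tmp/training_logs/*.log  # world-readable log files
```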
- Temporarily disable `pipefail` when starting training to avoid silent failures
- Set `PYTHONUNBUFFERED=1` for real-time logging
- Add debugging to check whether training produces output after 5 s (sketched below)
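A sketch of that start sequence (`$LOG_FILE` and the training command as in the earlier sketches):

```bash
export PYTHONUNBUFFERED=1   # Python flushes stdout on every write

# Relax pipefail just around the training pipeline.
set +o pipefail
python -m train "$@" 2>&1 | tee "$LOG_FILE" &
set -o pipefail

# Debug aid: confirm training produced output within ~5 seconds.
sleep 5
[ -s "$LOG_FILE" ] || echo "WARNING: no training output after 5s" >&2
wait
```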
Delays the Datadog agent startup so the training process (Ray/torchrun) can fully initialize and bind its ports first. This prevents the resource contention that was causing the training process to hang during startup.
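A sketch of the delayed start (the 60-second value is an assumption, not the PR's actual delay; `start_datadog_agent` is the same hypothetical helper as above):

```bash
# Let training claim its ports first, then bring the agent up in the
# background once things have settled.
( sleep 60 && start_datadog_agent ) &

python -m train "$@" 2>&1 | tee "$LOG_FILE"
```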
Nishad and others added 8 commits December 6, 2025 17:59
nishu-builder changed the title from "fix(skypilot): Enable Datadog Logs Agent and add log collection for training jobs" to "Fix Datadog agent for SkyPilot training jobs" on Dec 7, 2025
nishu-builder added this pull request to the merge queue Dec 7, 2025
Merged via the queue into main with commit 7ba1451 Dec 7, 2025
10 of 11 checks passed
nishu-builder deleted the sovitn-7-datadog-agent-fix branch December 7, 2025 08:38
zfogg pushed a commit that referenced this pull request Dec 20, 2025