add agent skill for debugging distributed training log failures #15612

Merged
pzelasko merged 1 commit into NVIDIA-NeMo:main from gaikwadabhishek:add-debug-training-logs-skill
Apr 15, 2026

@gaikwadabhishek
Contributor

  • add /debug-training-logs slash command that analyzes SLURM worker stderr logs and optional AIStore daemon logs to find root causes of distributed training failures
  • covers NCCL timeout analysis: distinguishes straggler ranks (stuck in data loading) from GPU fabric hangs by comparing enqueued vs completed work across ALL ranks
  • includes AIStore log parsing: file time ranges, timezone verification, error counter tracking, proxy/target correlation
  • documents NeMo-specific sync points (PreemptionCallback broadcast, checkpoint broadcasts, DDP allreduce) that can cause rank desync
  • documents Lhotse data loading pitfalls: missing read timeouts, m4a BytesIO extension loss, idle connection resets, fault_tolerant silent drops
  • includes instructions to obtain logs via scp and download AIS daemon logs via ais CLI with env var auth
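
The straggler-versus-fabric-hang distinction in the NCCL bullet above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the helper names (`parse_rank_progress`, `classify`) are hypothetical, and the log line shape matched by the regex (a `[Rank N]` prefix followed by `last enqueued NCCL work: X, last completed NCCL work: Y`) is an assumption that should be checked against the real worker stderr logs.

```python
import re
from typing import Dict, Iterable, List, Tuple

# ASSUMPTION: timeout lines carry a "[Rank N]" prefix and the two counters
# below. Adjust the pattern to whatever your logs actually contain.
PATTERN = re.compile(
    r"\[Rank (?P<rank>\d+)\].*?"
    r"last enqueued NCCL work: (?P<enq>\d+), "
    r"last completed NCCL work: (?P<done>\d+)"
)


def parse_rank_progress(lines: Iterable[str]) -> Dict[int, Tuple[int, int]]:
    """Map rank -> (last enqueued, last completed), keeping the last match per rank."""
    progress: Dict[int, Tuple[int, int]] = {}
    for line in lines:
        m = PATTERN.search(line)
        if m:
            progress[int(m["rank"])] = (int(m["enq"]), int(m["done"]))
    return progress


def classify(progress: Dict[int, Tuple[int, int]]) -> Tuple[str, List[int]]:
    """Heuristic verdict plus the ranks that triggered it."""
    if not progress:
        return "no-data", []
    max_enq = max(enq for enq, _ in progress.values())
    # Ranks that never enqueued the newest collective were busy elsewhere
    # (for example, blocked in data loading) while the others waited.
    stragglers = sorted(r for r, (enq, _) in progress.items() if enq < max_enq)
    if stragglers:
        return "straggler", stragglers
    # Every rank enqueued the same work but some never completed it:
    # this points toward the communication layer rather than one slow rank.
    stuck = sorted(r for r, (enq, done) in progress.items() if done < enq)
    if stuck:
        return "possible-fabric-hang", stuck
    return "inconclusive", []
```

Feeding the concatenated stderr of all workers into `parse_rank_progress` and passing the result to `classify` gives a first guess only. Ranks that are missing from the parsed output entirely deserve a separate look, since a rank stuck outside NCCL may never log a timeout line at all.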

Signed-off-by: Abhishek Gaikwad <gaikwadabhishek1997@gmail.com>
Collaborator

@pzelasko left a comment

Amazing work @gaikwadabhishek!
First Agentic Skill to be merged in NeMo :)

@vmendelev
Collaborator

Thank you for this skill. Looks very useful!

@pzelasko merged commit e967011 into NVIDIA-NeMo:main on Apr 15, 2026
35 checks passed