Lightweight diagnostic callback to identify training bottlenecks #21741
abhinavsriva started this conversation in Show off your work
Hi everyone,
I have built a lightweight diagnostic callback (TraceMLCallback) that plugs directly into the Lightning Trainer. Instead of leaving you to parse text-based profiler logs, it provides an interactive dashboard that flags whether a run is INPUT-BOUND, COMPUTE-BOUND, or suffering from straggler/imbalance issues, with no changes to your existing training loop.
Immediate diagnosis: see where time is being lost (e.g., dataloader_ms vs. forward_pass_ms).
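To give a feel for the kind of comparison behind those labels, here is a minimal, standalone sketch. It is not TraceML's actual logic: the function names, the 50% input-share cutoff, and the 1.2x straggler tolerance are all assumptions chosen for illustration, and it works on plain lists of timings rather than anything collected from the Trainer.

```python
from statistics import mean, median


def diagnose(dataloader_ms, forward_pass_ms, input_share=0.5):
    """Label a run from per-step timings.

    Hypothetical heuristic: if the average time spent waiting on the
    dataloader is at least `input_share` of the combined time, call the
    run INPUT-BOUND; otherwise call it COMPUTE-BOUND.
    """
    data = mean(dataloader_ms)
    compute = mean(forward_pass_ms)
    total = data + compute
    if total == 0:
        return "UNKNOWN"
    return "INPUT-BOUND" if data / total >= input_share else "COMPUTE-BOUND"


def has_straggler(rank_step_ms, tolerance=1.2):
    """Flag imbalance across ranks.

    Assumed rule: a straggler exists when the slowest rank's step time
    exceeds the median rank's step time by more than `tolerance` times.
    """
    return max(rank_step_ms) > tolerance * median(rank_step_ms)


if __name__ == "__main__":
    # Dataloader dominates the step time in this made-up sample.
    print(diagnose(dataloader_ms=[80, 90], forward_pass_ms=[20, 30]))
    # One rank is much slower than the others in this made-up sample.
    print(has_straggler([100, 102, 98, 180]))
```

A real diagnostic would need to account for overlap between data loading and compute (prefetching workers, asynchronous GPU execution), so treat these thresholds as a starting point rather than a rule.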
Quick Example:
I have open-sourced this at https://github.com/traceopt-ai/traceml/ and would love for anyone dealing with performance bottlenecks to try it out and let me know if it helps you.
Looking forward to your feedback.