XLAProfiler on TPU not working #15885
Comments
Hey @Liyang90, would you be interested in turning this patch into a PR? Best,
@Liyang90 Thanks for reporting. Could you provide us with the error trace and the name of the action that caused the issue?
There is no error within Lightning, but the profiler simply does not run because of the updates to the action name in the Trainer, such as: https://github.com/Lightning-AI/lightning/blob/12c74f134e4db63aa2f114fa539f1b45c721a1b8/src/pytorch_lightning/trainer/trainer.py#L1438
I see. So either the dots or the brackets in the name seem to cause an issue. What if we do a string replacement in the XLA profiler and replace these characters with another symbol? We could replace "." with "-" and maybe drop the brackets completely.
It doesn't seem to be caused by the special symbols. It's the way the action names are filtered by
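To illustrate the kind of filtering problem described here, below is a minimal sketch, not the actual XLAProfiler code. It assumes the profiler decides whether to record an action by checking the name against a fixed set of hook names, and that the Trainer now passes names of the form "[Type]ClassName.hook_name". The set contents, the function names, and the example class name are all hypothetical.

```python
# Hypothetical stand-in for the set of hook names the profiler records.
RECORD_FUNCTIONS = {
    "training_step",
    "backward",
    "validation_step",
    "test_step",
    "predict_step",
}


def should_record_exact(action_name: str) -> bool:
    """Exact-membership filter: only bare hook names match."""
    return action_name in RECORD_FUNCTIONS


def should_record_by_hook(action_name: str) -> bool:
    """Filter on the part after the last dot, so prefixed names still match."""
    hook_name = action_name.rsplit(".", 1)[-1]
    return hook_name in RECORD_FUNCTIONS


# Assumed new-style action name; the class name is illustrative only.
new_style = "[Strategy]XLAStrategy.training_step"

print(should_record_exact(new_style))    # the exact filter skips the prefixed name
print(should_record_by_hook(new_style))  # the suffix-based filter matches it
```

Under these assumptions, replacing "." or dropping the brackets would not help, because the prefixed name would still not equal the bare hook name; only comparing the hook-name portion does.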
@Liyang90 Oh, I understand now. Sorry for my misleading comments. It's just that I haven't been very active in that part of the code base.
Bug description
It seems the changes in action_name are preventing the XLAProfiler from starting the profiler server or logging any steps. This patch fixes the bug:
How to reproduce the bug
No response
Error messages and logs
Environment
More info
No response
cc @carmocca @nbcsm @guotuofeng