Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XLAProfiler on TPU not working #15885

Closed
Liyang90 opened this issue Dec 1, 2022 · 6 comments · Fixed by #15886
Closed

XLAProfiler on TPU not working #15885

Liyang90 opened this issue Dec 1, 2022 · 6 comments · Fixed by #15886
Labels
accelerator: tpu Tensor Processing Unit bug Something isn't working profiler
Milestone

Comments

@Liyang90
Copy link
Contributor

Liyang90 commented Dec 1, 2022

Bug description

It seems the changes in action_name is preventing it from starting the profiler server or logging any steps. This patch fixes the bug:

diff --git a/src/pytorch_lightning/profilers/xla.py b/src/pytorch_lightning/profilers/xla.py
index ef103a9a4..958e8de3f 100644
--- a/src/pytorch_lightning/profilers/xla.py
+++ b/src/pytorch_lightning/profilers/xla.py
@@ -50,7 +50,7 @@ class XLAProfiler(Profiler):
     def start(self, action_name: str) -> None:
         import torch_xla.debug.profiler as xp
 
-        if action_name in self.RECORD_FUNCTIONS:
+        if action_name.split(".")[-1] in self.RECORD_FUNCTIONS:
             if not self._start_trace:
                 self.server = xp.start_server(self.port)
                 self._start_trace = True

How to reproduce the bug

No response

Error messages and logs


# Error messages and logs here please

Environment


#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @carmocca @nbcsm @guotuofeng

@Liyang90 Liyang90 added the needs triage Waiting to be triaged by maintainers label Dec 1, 2022
@tchaton
Copy link
Contributor

tchaton commented Dec 2, 2022

Hey @@Liyang90,

Would you be interested into turning this patch into a PR ?

Best,
T.C

@awaelchli awaelchli added bug Something isn't working accelerator: tpu Tensor Processing Unit profiler and removed needs triage Waiting to be triaged by maintainers labels Dec 5, 2022
@awaelchli awaelchli added this to the v1.8.x milestone Dec 5, 2022
@awaelchli
Copy link
Member

@Liyang90 Thanks for reporting. Could you provide us with the error trace and the name of the action that caused the issue?

@Liyang90
Copy link
Contributor Author

Liyang90 commented Dec 5, 2022

There is no error with in the Lightning, but the profiler would just not run because of the updates in the action name in the Trainer, such as: https://github.com/Lightning-AI/lightning/blob/12c74f134e4db63aa2f114fa539f1b45c721a1b8/src/pytorch_lightning/trainer/trainer.py#L1438
The name is not simply "training_step" or "backward" anymore.

@awaelchli
Copy link
Member

I see. So either the dots or the brackets in the name seem to cause an issue. What if we do a string replacement in the xla profiler and replace these characters with another symbol? We could replace "." with "-" and maybe drop the brackets completely.

@Liyang90
Copy link
Contributor Author

Liyang90 commented Dec 5, 2022

It doesn't seem to be caused by the special symbols. It the way the action names are filtered by self.RECORD_FUNCTIONS. It only has https://github.com/Lightning-AI/lightning/blob/12c74f134e4db63aa2f114fa539f1b45c721a1b8/src/pytorch_lightning/profilers/xla.py#L26.
But after some version, these only match the last part of the actual action names passed to the profiler. The linked PR fixes this.

@awaelchli
Copy link
Member

@Liyang90 Oh I understand now. Sorry for my misleading comments. It's just that I haven't been active on that part of the code base very much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accelerator: tpu Tensor Processing Unit bug Something isn't working profiler
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants