
Schedule in PyTorchProfiler doesn't work #14063

Open
fedorovgv opened this issue Aug 6, 2022 · 5 comments
Labels: bug (Something isn't working), help wanted (Open to be worked on), profiler

fedorovgv commented Aug 6, 2022

🐛 Bug

The class pytorch_lightning.profiler.PyTorchProfiler doesn't work correctly with the schedule parameter: it doesn't take into account the repeat parameter of torch.profiler.schedule, which is very important in long-running training.

To Reproduce

Reproduce with the BoringModel:

UPD: updated link with an example of the incorrect behavior on main, and of the correct behavior with the commit changes applied:

https://colab.research.google.com/drive/1UbbLx5N5Th0MsXu1olQwWqo7QLGd-lRY?usp=sharing

Expected behavior

I used pytorch_lightning.profiler.PyTorchProfiler with schedule=torch.profiler.schedule(wait=2, warmup=1, active=3, repeat=5). According to the torch docs, I expect 5 cycles, each consisting of 2 wait + 1 warmup + 3 active = 6 steps, but in fact PyTorchProfiler records information about fewer cycles.

In code terms:

    import torch
    from torch.utils.data import DataLoader

    from pytorch_lightning import Trainer
    from pytorch_lightning.demos.boring_classes import BoringModel, RandomDataset
    from pytorch_lightning.profiler import PyTorchProfiler

    # train_data as in the BoringModel template
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    profiler = PyTorchProfiler(
        schedule=torch.profiler.schedule(
            wait=2,
            warmup=1,
            active=3,
            repeat=5,
        ),
    )

    model = BoringModel()
    trainer = Trainer(
        max_epochs=1,
        profiler=profiler,
    )
    trainer.fit(model, train_dataloaders=train_data)

In this case the profiler should return information about 15 steps (3 active * 5 cycles), however it returns information about fewer steps because it doesn't record some cycles.
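
For reference, calling torch.profiler.schedule directly (outside of Lightning) produces the expected five save points. A minimal sketch, with illustrative variable names:

    import torch.profiler as tp

    sched = tp.schedule(wait=2, warmup=1, active=3, repeat=5)
    actions = [sched(step) for step in range(35)]

    # Each 6-step cycle (2 wait + 1 warmup + 3 active) ends with one
    # RECORD_AND_SAVE, so with repeat=5 there are exactly 5 save points.
    assert actions.count(tp.ProfilerAction.RECORD_AND_SAVE) == 5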

Environment

* CUDA:
	- GPU:
		- Tesla T4
	- available:         True
	- version:           11.3
* Packages:
	- lightning:         None
	- lightning_app:     None
	- numpy:             1.21.6
	- pyTorch_debug:     False
	- pyTorch_version:   1.12.0+cu113
	- pytorch-lightning: 1.7.0
	- tqdm:              4.64.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
	- processor:         x86_64
	- python:            3.7.13
	- version:           #1 SMP Sun Apr 24 10:03:06 PDT 2022

Additional context

I found the same issue: #12611.

cc @carmocca @kaushikb11 @ninginthecloud @rohitgr7 @nbcsm @guotuofeng

fedorovgv added the needs triage label Aug 6, 2022
fedorovgv commented Aug 6, 2022

If we set the schedule parameter in profiler = PyTorchProfiler(..., schedule=..., ...), then profiler._schedule is set to a ScheduleWrapper object:

https://github.com/Lightning-AI/lightning/blob/26d69ceada7f4ad1632e70df6414348170e85574/src/pytorch_lightning/profilers/pytorch.py#L328-L329

After that, profiler._schedule.__call__(num_step) is called on each profiler step:

https://github.com/Lightning-AI/lightning/blob/b25275ccc27652b91d85d49b7bc220b37c921b54/src/pytorch_lightning/profilers/pytorch.py#L189-L211

And we can see that profiler._schedule is ignored if self._current_action is None or self.has_finished is True.

  • the first condition holds when the profiler has ended (as I understand it)
  • the second condition holds when profiler._schedule has previously returned a RECORD_AND_SAVE action

But, in fact, the ProfilerAction.RECORD_AND_SAVE action is not a signal that the profiler's work has ended. The schedule returns this action on the last step of every cycle, i.e. when mod_step = step % num_steps == num_steps - 1 (with num_steps = wait + warmup + active), so with this logic only one schedule cycle is ever recorded.

[torch docs]

  def schedule_fn(step: int) -> ProfilerAction:
      assert step >= 0
      if step < skip_first:
          return ProfilerAction.NONE
      else:
          step -= skip_first
      num_steps = wait + warmup + active
      if repeat > 0 and step / num_steps >= repeat:
          return ProfilerAction.NONE
      mod_step = step % num_steps
      if mod_step < wait:
          return ProfilerAction.NONE
      elif mod_step < wait + warmup:
          return ProfilerAction.WARMUP
      else:
          return ProfilerAction.RECORD if mod_step < num_steps - 1 \
              else ProfilerAction.RECORD_AND_SAVE
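
To illustrate the effect, here is a rough, simplified paraphrase of the wrapper guard described above. The class and attribute names are assumed for illustration; this is not the actual Lightning implementation:

    from torch.profiler import ProfilerAction, schedule

    class ScheduleWrapperSketch:
        # Hypothetical, simplified stand-in for Lightning's ScheduleWrapper.
        def __init__(self, schedule_fn):
            self._schedule = schedule_fn
            self._prev_action = None
            self.has_finished = False

        def __call__(self, num_step: int) -> ProfilerAction:
            if self.has_finished:
                # After the first RECORD_AND_SAVE the schedule is never
                # consulted again, so cycles 2..repeat are silently lost.
                return ProfilerAction.NONE
            if self._prev_action == ProfilerAction.RECORD_AND_SAVE:
                self.has_finished = True
                return ProfilerAction.NONE
            self._prev_action = self._schedule(num_step)
            return self._prev_action

    wrapped = ScheduleWrapperSketch(schedule(wait=2, warmup=1, active=3, repeat=5))
    actions = [wrapped(step) for step in range(35)]
    assert actions.count(ProfilerAction.RECORD_AND_SAVE) == 1  # only cycle 1 saved

With the same schedule as above, this guard reduces the five expected RECORD_AND_SAVE actions to one, matching the observed behavior.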

awaelchli added the bug and profiler labels and removed the needs triage label Aug 8, 2022
stale bot commented Apr 15, 2023

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

The stale bot added the won't fix label Apr 15, 2023
carmocca added this to the v1.9.x milestone Apr 15, 2023
The stale bot removed the won't fix label Apr 15, 2023
jhoareau commented Aug 4, 2023

This is still an issue.

matanninio commented
This is still an issue, and it makes profiling with the PyTorch profiler almost impossible, as the behavior differs greatly from what is expected and is of limited usefulness.

andife commented Dec 10, 2023

I would also be quite interested in that topic.
