
Schedule in PyTorchProfiler doesn't work #14063

Open
fedorovgv opened this issue Aug 6, 2022 · 5 comments
Labels: bug (Something isn't working), help wanted (Open to be worked on), profiler

fedorovgv commented Aug 6, 2022

🐛 Bug

The class pytorch_lightning.profiler.PyTorchProfiler doesn't work correctly with the schedule parameter: it doesn't take into account the repeat parameter of torch.profiler.schedule, which is very important in long-running training.

To Reproduce

Reproduce with the BoringModel:

UPD: updated link with an example of the incorrect behavior on main, and of the correct behavior with the commit changes applied:

https://colab.research.google.com/drive/1UbbLx5N5Th0MsXu1olQwWqo7QLGd-lRY?usp=sharing

Expected behavior

I used pytorch_lightning.profiler.PyTorchProfiler with schedule=torch.profiler.schedule(wait=2, warmup=1, active=3, repeat=5). According to the torch docs, I expect 5 cycles, each consisting of 2 wait + 1 warmup + 3 active = 6 steps, but in fact PyTorchProfiler records information about fewer cycles.

In code terms:

    import torch
    from torch.utils.data import DataLoader

    from pytorch_lightning import Trainer
    from pytorch_lightning.demos.boring_classes import BoringModel, RandomDataset
    from pytorch_lightning.profiler import PyTorchProfiler

    # train_data as in the BoringModel template
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    profiler = PyTorchProfiler(
        schedule=torch.profiler.schedule(
            wait=2,
            warmup=1,
            active=3,
            repeat=5,
        ),
    )

    model = BoringModel()
    trainer = Trainer(
        max_epochs=1,
        profiler=profiler,
    )
    trainer.fit(model, train_dataloaders=train_data)

In this case the profiler should return information about 15 steps (3 active * 5 cycles), however it returns information about fewer steps because it doesn't record some cycles.
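
For reference, calling torch.profiler.schedule directly (outside of Lightning) produces the expected five save points. A minimal sketch, with illustrative variable names:

    import torch.profiler as tp

    sched = tp.schedule(wait=2, warmup=1, active=3, repeat=5)
    actions = [sched(step) for step in range(35)]

    # Each 6-step cycle (2 wait + 1 warmup + 3 active) ends with one
    # RECORD_AND_SAVE, so with repeat=5 there are exactly 5 save points.
    assert actions.count(tp.ProfilerAction.RECORD_AND_SAVE) == 5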

Environment

* CUDA:
	- GPU:
		- Tesla T4
	- available:         True
	- version:           11.3
* Packages:
	- lightning:         None
	- lightning_app:     None
	- numpy:             1.21.6
	- pyTorch_debug:     False
	- pyTorch_version:   1.12.0+cu113
	- pytorch-lightning: 1.7.0
	- tqdm:              4.64.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
	- processor:         x86_64
	- python:            3.7.13
	- version:           #1 SMP Sun Apr 24 10:03:06 PDT 2022

Additional context

I found the same issue: #12611.

cc @carmocca @kaushikb11 @ninginthecloud @rohitgr7 @nbcsm @guotuofeng

fedorovgv added the needs triage label Aug 6, 2022
fedorovgv commented Aug 6, 2022

If we set the schedule parameter in profiler = PyTorchProfiler(..., schedule=..., ...), then profiler._schedule is set to a ScheduleWrapper object:

https://github.com/Lightning-AI/lightning/blob/26d69ceada7f4ad1632e70df6414348170e85574/src/pytorch_lightning/profilers/pytorch.py#L328-L329

After that, profiler._schedule.__call__(num_step) is called on each profiler step:

https://github.com/Lightning-AI/lightning/blob/b25275ccc27652b91d85d49b7bc220b37c921b54/src/pytorch_lightning/profilers/pytorch.py#L189-L211

And we can see that profiler._schedule is ignored if self._current_action is None or self.has_finished is True.

  • the first condition holds when the profiler has ended (as I understand it)
  • the second condition holds when profiler._schedule has previously returned a RECORD_AND_SAVE action

But, in fact, the ProfilerAction.RECORD_AND_SAVE action is not a signal that the profiler's work has ended. The schedule returns this action on the last step of every cycle, i.e. when mod_step = step % num_steps == num_steps - 1 (with num_steps = wait + warmup + active), so with this logic only one schedule cycle is ever recorded.

[torch docs]

  def schedule_fn(step: int) -> ProfilerAction:
      assert step >= 0
      if step < skip_first:
          return ProfilerAction.NONE
      else:
          step -= skip_first
      num_steps = wait + warmup + active
      if repeat > 0 and step / num_steps >= repeat:
          return ProfilerAction.NONE
      mod_step = step % num_steps
      if mod_step < wait:
          return ProfilerAction.NONE
      elif mod_step < wait + warmup:
          return ProfilerAction.WARMUP
      else:
          return ProfilerAction.RECORD if mod_step < num_steps - 1 \
              else ProfilerAction.RECORD_AND_SAVE
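
To illustrate the effect, here is a rough, simplified paraphrase of the wrapper guard described above. The class and attribute names are assumed for illustration; this is not the actual Lightning implementation:

    from torch.profiler import ProfilerAction, schedule

    class ScheduleWrapperSketch:
        # Hypothetical, simplified stand-in for Lightning's ScheduleWrapper.
        def __init__(self, schedule_fn):
            self._schedule = schedule_fn
            self._prev_action = None
            self.has_finished = False

        def __call__(self, num_step: int) -> ProfilerAction:
            if self.has_finished:
                # After the first RECORD_AND_SAVE the schedule is never
                # consulted again, so cycles 2..repeat are silently lost.
                return ProfilerAction.NONE
            if self._prev_action == ProfilerAction.RECORD_AND_SAVE:
                self.has_finished = True
                return ProfilerAction.NONE
            self._prev_action = self._schedule(num_step)
            return self._prev_action

    wrapped = ScheduleWrapperSketch(schedule(wait=2, warmup=1, active=3, repeat=5))
    actions = [wrapped(step) for step in range(35)]
    assert actions.count(ProfilerAction.RECORD_AND_SAVE) == 1  # only cycle 1 saved

With the same schedule as above, this guard reduces the five expected RECORD_AND_SAVE actions to one, matching the observed behavior.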

awaelchli added the bug and profiler labels and removed the needs triage label Aug 8, 2022
stale bot commented Apr 15, 2023

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

The stale bot added the won't fix label Apr 15, 2023
carmocca added this to the v1.9.x milestone Apr 15, 2023
The stale bot removed the won't fix label Apr 15, 2023
jhoareau commented Aug 4, 2023

This is still an issue.

matanninio commented
This is still an issue, and it makes profiling with the PyTorch profiler almost impossible, as the behavior differs greatly from what is expected and is of limited usefulness.

andife commented Dec 10, 2023

I would also be quite interested in that topic.
