
Revamp Device Stats Logging #9032

Closed
ananthsub opened this issue Aug 22, 2021 · 30 comments

@ananthsub
Contributor

ananthsub commented Aug 22, 2021

Proposed refactoring or deprecation

Deprecate log_gpu_memory from the Trainer constructor

Motivation

We are auditing the Lightning components and APIs to assess opportunities for improvements:

Carrying forward some of the discussion from here: #9006

By default, logging GPU stats is disabled. Users must opt in by setting `log_gpu_memory` in the Trainer constructor.

Issues today with this argument:

  • Decoupling this from the Trainer constructor gives us more flexibility to evolve the component that is actually responsible for gathering these stats, without also worrying about backwards-compatibility guarantees for a Trainer constructor argument. For example, the currently supported values for this flag are nvidia-smi specific: if nvidia-smi isn't available, this will fail. But PyTorch now supports getting CUDA memory stats directly (Fetch GPU stats using torch.cuda.memory_stats #8780), which we could take advantage of.

  • Lack of consistency across different devices: we do not offer corresponding flags for TPUs or IPUs, and creating one flag per device type would further bloat the Trainer.

  • There are multiple codepaths Lightning offers to log GPU memory: we have logging both internal to the Trainer and via a GPU stats monitor callback. Having both is confusing and can lead to bugs like log_gpu_memory='min_max' leads to error in parsing metrics keys (#9010).

Pitch

Instead of offering separate flags & callbacks for different device types, why don't we consolidate this on the Accelerator interface, since that is where all the device-specific information should already live?

  1. Offer a method like get_device_stats as an abstractmethod on the Accelerator interface (a minimal sketch is given after this list). CPU, GPU, TPU, and IPU classes would implement this to return whatever stats are needed. This payload should be a black box to the trainer as long as it conforms to the expected return type. The implementations of this function could rely on the existing utilities we already have, which would ensure that all the device-specific code lives roughly in the same place. The stats returned should be per-device: the calling entities fetching these stats should determine whether or how to aggregate these stats (e.g. log stats from rank 0 only, reduce across ranks and log from rank 0, log the min/max across ranks, or log data from all ranks).

  2. In terms of what actually fetches these stats, we could either:

  • offer a generic callback like DeviceStatsMonitor to periodically gather and log this data. This callback would consolidate the GPU and XLA ones and provide a general extension point for any new device types Lightning supports via the Accelerator interface.
  • Or, the trainer/loops themselves could periodically gather & log this information.

My preference is to go with the callback-based approach first: it remains the most extensible for customizing options like logging frequency (e.g. step-based or time-based).
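
As a minimal sketch of what the interface in point 1 could look like (a simplified stand-in for the Accelerator class; everything beyond the get_device_stats name and the accelerator names mentioned above is illustrative):

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

import torch


class Accelerator(ABC):
    # Simplified stand-in for Lightning's Accelerator base class, for illustration only.

    @abstractmethod
    def get_device_stats(self, device: torch.device) -> Dict[str, Any]:
        """Return stats for the given device; the payload is opaque to the Trainer."""


class GPUAccelerator(Accelerator):
    def get_device_stats(self, device: torch.device) -> Dict[str, Any]:
        # CUDA memory counters straight from PyTorch (#8780), no nvidia-smi required.
        return torch.cuda.memory_stats(device)


class TPUAccelerator(Accelerator):
    def get_device_stats(self, device: torch.device) -> Dict[str, Any]:
        # Would delegate to torch_xla memory/metric utilities; left empty in this sketch.
        return {}
```

A DeviceStatsMonitor callback (or the loops) would then call get_device_stats per rank and hand the raw dict to the logger, leaving any aggregation decisions to the caller.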

What do you think? @kaushikb11 @tchaton @awaelchli @carmocca @four4fish @edward-io @yifuwang

Additional context

#8174 (comment)
#9013 (comment)



@ananthsub added the feature, help wanted, refactor, design, logging, and deprecation labels on Aug 22, 2021
@awaelchli
Contributor

I feel ok with this mainly because of what you mentioned here:

There are multiple codepaths Lightning offers to log GPU memory: we have logging both internal to the Trainer and via a GPU stats monitor callback. Having both is confusing and can lead to bugs like log_gpu_memory='min_max' leads to error in parsing metrics keys (#9010).

Plus, several third-party loggers offer this out of the box anyway, so there is even a third way of doing the same thing; that's why I personally never turn the feature on in Lightning.

Querying the accelerator for its performance stats seems like a reasonable approach to me.

@tchaton
Contributor

tchaton commented Aug 22, 2021

Same here!

@ananthsub ananthsub changed the title RFC: Deprecate log_gpu_memory from the Trainer constructor [RFC] Deprecate log_gpu_memory from the Trainer constructor Aug 23, 2021
@tchaton tchaton added the let's do it! approved to implement label Aug 23, 2021
@daniellepintz daniellepintz self-assigned this Sep 14, 2021
@daniellepintz
Contributor

question: is DeviceStatsMonitor callback meant to replace GPUStatsMonitor and XLAStatsMonitor?

@ananthsub
Contributor Author

ananthsub commented Sep 14, 2021

question: is DeviceStatsMonitor callback meant to replace GPUStatsMonitor and XLAStatsMonitor?

Yup!

@daniellepintz
Contributor

daniellepintz commented Sep 15, 2021

Two questions:

  1. Currently in GPUStatsMonitor, in addition to the stats we get from nvidia-smi --query-gpu, we also add some hardcoded stats ("batch_time/inter_step (ms)" and "batch_time/intra_step (ms)") that depend on the on_train_batch_start and on_train_batch_end hooks.

https://github.com/PyTorchLightning/pytorch-lightning/blob/5735b85147a7c94608cb497c4bd146c62bc59d86/pytorch_lightning/callbacks/gpu_stats_monitor.py#L152
https://github.com/PyTorchLightning/pytorch-lightning/blob/5735b85147a7c94608cb497c4bd146c62bc59d86/pytorch_lightning/callbacks/gpu_stats_monitor.py#L177

In the new DeviceStatsMonitor I don't see a clean way to preserve these stats. Is it okay to remove them, especially considering that newer versions of PyTorch will use torch.cuda.memory_stats(device=device) anyway instead of these old stats?

  2. In XLAStatsMonitor we call log_metrics in on_train_epoch_end with step=trainer.current_epoch, but in GPUStatsMonitor we call log_metrics in on_train_batch_start and on_train_batch_end with step=trainer.global_step. Where should I be logging in the new DeviceStatsMonitor?

I created a draft PR below for reference

@ananthsub
Contributor Author

ananthsub commented Sep 15, 2021

  1. Since these stats are not device-specific and rely on the batch hooks, they can be calculated inside the callback, then merged with the ones from the accelerator and logged (see the sketch below). However, timing has come up elsewhere, and I wonder if it makes sense to time this in the loop directly. @carmocca @awaelchli wdyt?

  2. @kaushikb11 any reference for why the XLA stats monitor logs the step differently than the GPU stats monitor? Everywhere else we've been using the global step, so that would be my preference, but I'd like to get @awaelchli's thoughts on this too.
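
For reference, a rough sketch of option 1: timing computed in the callback and merged with the accelerator stats. Hook signatures are approximate, and trainer.accelerator.get_device_stats is the API proposed in this issue rather than an existing one, so treat this as illustrative only.

```python
import time

from pytorch_lightning import Callback


class DeviceStatsMonitor(Callback):
    # Sketch only: merges batch timing (computed here) with device stats from the accelerator.

    def __init__(self):
        self._batch_start = None
        self._batch_end = None

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx, dataloader_idx=0):
        stats = trainer.accelerator.get_device_stats(pl_module.device)  # proposed API, see the pitch
        if self._batch_end is not None:
            # time between the end of the previous batch and the start of this one
            stats["batch_time/inter_step (ms)"] = (time.monotonic() - self._batch_end) * 1000
        self._batch_start = time.monotonic()
        trainer.logger.log_metrics(stats, step=trainer.global_step)

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=0):
        self._batch_end = time.monotonic()
        stats = trainer.accelerator.get_device_stats(pl_module.device)
        # time spent inside the current batch
        stats["batch_time/intra_step (ms)"] = (self._batch_end - self._batch_start) * 1000
        trainer.logger.log_metrics(stats, step=trainer.global_step)
```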

@kaushikb11
Contributor

kaushikb11 commented Sep 15, 2021

any reference for why the XLA stats monitor logs the step differently than the GPU stats monitor?

@daniellepintz @ananthsub Thank you for this proposal! No reason in particular; it was my personal preference. While training on TPUs, the epoch time usually decreases over the initial epochs, and it helped me track performance.

We could switch the XLA stats monitor logs to global_step. Sounds good to me!

@daniellepintz
Contributor

daniellepintz commented Sep 15, 2021

Sounds great, thank you @kaushikb11!

For 1., @ananthsub you are right, it just seemed a bit messy to do that, but sounds good. Thanks!

cc @carmocca @awaelchli in case there are other plans for time

@daniellepintz
Contributor

Another question @kaushikb11: currently with GPUStatsMonitor we only call log_metrics with @rank_zero_only. Is it OK to do this with XLAStats as well, and to log in both on_train_batch_start and on_train_batch_end?

@kaushikb11
Contributor

@daniellepintz Yup, only logging on rank_zero sounds good! But I don't think it's required to log in both on_train_batch_start and on_train_batch_end for XLAStats?

@ananthsub
Contributor Author

Another question @kaushikb11: currently with GPUStatsMonitor we only call log_metrics with @rank_zero_only. Is it OK to do this with XLAStats as well, and to log in both on_train_batch_start and on_train_batch_end?

IMO this should be handled at the logger level, along the lines of #8608

It can be very useful to have device stats logged across ranks in order to detect outlier ranks that slow down training. By not enforcing this check in the callback, we allow custom loggers to access this data across all processes.

Loggers in the framework today already provide this check here: https://github.com/PyTorchLightning/pytorch-lightning/blob/45200fc858b3e1068abe8e7c99a540996947b916/pytorch_lightning/loggers/tensorboard.py#L215-L216
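
A toy illustration of this point, with a hypothetical StatsWriter standing in for a logger backend: the rank-zero guard lives on the logger's log_metrics, not in the callback, so a custom logger can choose to receive stats from every rank.

```python
from pytorch_lightning.utilities import rank_zero_only


class StatsWriter:
    # Hypothetical stand-in for a logger backend, not a real Lightning class.

    @rank_zero_only
    def log_metrics(self, metrics, step=None):
        # Built-in loggers (e.g. TensorBoardLogger.log_metrics, linked above) carry this
        # decorator, so the restriction is enforced here rather than in DeviceStatsMonitor.
        print(f"rank 0 only: step={step} {metrics}")

    def log_metrics_all_ranks(self, metrics, step=None):
        # A custom logger could omit the decorator to see per-rank device stats
        # and spot straggler ranks.
        print(f"any rank: step={step} {metrics}")
```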

@daniellepintz
Contributor

daniellepintz commented Sep 16, 2021

Thanks Kaushik and Ananth. The fundamental problem here is that the implementations of GPUStatsMonitor and XLAStatsMonitor currently differ in whether they log on rank zero only and whether they log in on_train_batch_start, and I am trying to combine them:

https://github.com/PyTorchLightning/pytorch-lightning/blob/5735b85147a7c94608cb497c4bd146c62bc59d86/pytorch_lightning/callbacks/gpu_stats_monitor.py#L132-L179

https://github.com/PyTorchLightning/pytorch-lightning/blob/5735b85147a7c94608cb497c4bd146c62bc59d86/pytorch_lightning/callbacks/xla_stats_monitor.py#L59-L95

When combining them into DeviceStats we can either pick and choose what we want, i.e. not carry over the @rank_zero_only, or we can keep everything and add some ugly if statements like if isinstance(trainer.accelerator, pl.accelerators.GPUAccelerator).

@JackCaoG

@kaushikb11 Please open a GitHub issue on pt/xla and we can discuss there. In short, sometimes we actually expect some process to hold the libtpu.so lock when we launch.

@daniellepintz
Contributor

@kaushikb11 sounds good, will add logging for on_train_batch_start.

epoch_time: @ananthsub and I were thinking it makes more sense for time to be part of the profiler, since it is not really a device stat. I can follow up and make sure we have the same functionality on the profiler?

For GPU with XLA, if we end up enabling that, we could always just pass a flag to get_device_stats on GPU indicating whether to use XLA or not?

@daniellepintz daniellepintz changed the title [RFC] Deprecate log_gpu_memory from the Trainer constructor Revamp Device Stats Logging Sep 17, 2021
@ananthsub
Contributor Author

epoch_time: @ananthsub and I were thinking it makes more sense for time to be part of the profiler, since it is not really a device stat. I can follow up and make sure we have the same functionality on the profiler?

Not fully. I think this could be part of either:

  • the loop directly, using the loop's on run start and on run end hooks
  • the profiler hook over the epoch (but the profiler can include other information like traces, whereas many people just want the overall latency tracked)

I think these serve two different needs: a Timer is very lightweight, whereas a Profiler can be pretty heavyweight.

@daniellepintz
Contributor

Sounds good, thanks @ananthsub. I can add the timing wherever people think it makes the most sense. Let me know what you think is best @kaushikb11.

@twsl
Contributor

twsl commented Oct 8, 2021

I just noticed your PRs, and it seems like you are using just the accelerators for logging. In order to simplify device stats logging, I'd like to propose logging other general system metrics as well. The system memory in particular could be relevant for detecting memory leaks if you are doing preprocessing on the CPU while using a GPU accelerator.

@daniellepintz
Contributor

Thanks @twsl for the suggestion! Do you mind explaining a bit more about how this would simplify the device stats logging?

@twsl
Contributor

twsl commented Oct 14, 2021

Hey @daniellepintz, to me this issue is all about simplifying the logging of device stats. Currently, basic system stats like system memory are not logged yet, even though they can influence the accelerator. My proposal would be that once the CPUAccelerator returns values, the DeviceStatsMonitor should always log CPU and system memory stats, regardless of the accelerator in use.
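
A rough sketch of what that could look like, assuming psutil is available (it is not a Lightning requirement); names are illustrative.

```python
import psutil  # assumed third-party dependency, not required by Lightning


def get_system_stats() -> dict:
    # Host-level stats that could be logged by DeviceStatsMonitor regardless of
    # which accelerator is in use (e.g. to catch memory leaks in CPU preprocessing).
    vm = psutil.virtual_memory()
    return {
        "cpu_percent": psutil.cpu_percent(interval=None),
        "system_memory_used_bytes": vm.used,
        "system_memory_percent": vm.percent,
    }
```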

@carmocca
Contributor

carmocca commented Nov 3, 2021

@daniellepintz @ananthsub Can you confirm all proposed changes here are done? And close the issue in that case.

@twsl I understand you are proposing adding a Callback for general stats that aren't necessarily related to a single accelerator (GPU, CPU, ...). This is not strictly part of the revamp proposed in this issue, so I would suggest you open a separate feature request for it. Thank you!

@cowwoc
Contributor

cowwoc commented Jan 24, 2022

I'm confused by this change. GPUStatsMonitor logs metrics like "gpu utilization", but when I migrate to DeviceStatsMonitor I don't see any GPU metrics logged anywhere. Is there any documentation on using DeviceStatsMonitor and, more concretely, on how to migrate from GPUStatsMonitor to DeviceStatsMonitor?

@daniellepintz
Contributor

@cowwoc The likely reason you are seeing different metrics is that with DeviceStatsMonitor we now use torch.cuda.memory_stats to get the GPU stats if you are using torch >= 1.8, rather than nvidia-smi:
https://github.com/PyTorchLightning/pytorch-lightning/blob/86b177ebe5427725b35fde1a8808a7b59b8a277a/pytorch_lightning/accelerators/gpu.py#L73-L75

Do the stats here https://pytorch.org/docs/stable/generated/torch.cuda.memory_stats.html provide you with the information you need? I believe they should be comparable to nvidia-smi.
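
For anyone comparing the two sources, a quick way to see what torch.cuda.memory_stats actually reports (a small illustrative snippet):

```python
import torch

if torch.cuda.is_available():
    stats = torch.cuda.memory_stats(torch.device("cuda:0"))
    # The result is a flat dict of allocator counters such as
    # "allocated_bytes.all.current" and "reserved_bytes.all.current";
    # it contains memory counters only, no utilization, temperature, or fan fields.
    for key in ("allocated_bytes.all.current", "reserved_bytes.all.current"):
        print(key, stats.get(key))
```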

@twsl
Contributor

twsl commented Feb 2, 2022

The memory stats include a lot of unnecessary information and miss the GPU utilization.

@daniellepintz
Contributor

I see. Well, if it's not useful, I can revert it to use nvidia-smi only. What do people think?

@carmocca
Contributor

carmocca commented Feb 2, 2022

The GPU utilization is included as Total memory allocation, but it's true that there's too much information for humans to read.

I'd say the callback needs a --human mode that shows a curated view. The current callback is only useful for logging the values to a Logger.

@twsl
Contributor

twsl commented Feb 2, 2022

@carmocca as far as I understand torch.cuda.memory_stats, it is only about memory, not the equivalent of https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/accelerators/gpu.py#L101
Additional metrics such as fan speed, temperature, and power draw are also missing.

@cowwoc
Contributor

cowwoc commented Feb 2, 2022

@carmocca

The GPU utilization is included as Total memory allocation, but it's true that there's too much information for humans to read.

To echo what @twsl said, I am looking for "utilization.gpu" (processor use) and "utilization.memory" (memory use). I don't think the new logs include all of this information (and if they do, I agree it's not human-readable).

@daniellepintz
Contributor

Yeah, I agree it does seem that torch.cuda.memory_stats is only about memory and doesn't cover everything that nvidia-smi does. I propose we revert to using nvidia-smi here. cc @edward-io @ananthsub
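
If we do go back to nvidia-smi for these fields, a minimal query for the two metrics mentioned above could look like the following (a sketch, not the callback's actual implementation):

```python
import subprocess


def query_gpu_utilization(gpu_index: int = 0) -> dict:
    # Ask nvidia-smi for processor and memory utilization of a single GPU.
    out = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=utilization.gpu,utilization.memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout.strip()
    gpu_util, mem_util = (float(x.strip()) for x in out.split(","))
    return {"utilization.gpu": gpu_util, "utilization.memory": mem_util}
```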
