
Revamp Device Stats Logging #9032

Closed
ananthsub opened this issue Aug 22, 2021 · 30 comments

@ananthsub
Contributor

ananthsub commented Aug 22, 2021

Proposed refactoring or deprecation

Deprecate log_gpu_memory from the Trainer constructor

Motivation

We are auditing the Lightning components and APIs to assess opportunities for improvements:

Carrying forward some of the discussion from here: #9006

By default, logging GPU stats is disabled. Users must opt in by setting `log_gpu_memory` in the Trainer constructor.

Issues today with this argument:

  • Decoupling this from the Trainer constructor gives us more flexibility to evolve the component that is actually responsible for gathering these stats, without also worrying about backwards-compatibility guarantees for a Trainer constructor argument. For example, the currently supported values for this flag are nvidia-smi specific: if nvidia-smi isn't available, this will fail. But PyTorch now supports getting CUDA memory stats directly (Fetch GPU stats using torch.cuda.memory_stats #8780), which we could take advantage of.

  • Lack of consistency across different devices: we do not offer corresponding flags for TPUs or IPUs, and creating one flag per device type would further bloat the Trainer.

  • There are multiple codepaths Lightning offers to log GPU memory: we have logging both internal to the Trainer and via a GPU stats monitor callback. Having both is confusing and can lead to bugs like log_gpu_memory='min_max' leads to error in parsing metrics keys (#9010).

Pitch

Instead of offering separate flags & callbacks for different device types, why don't we consolidate this on the Accelerator interface, since that is where all the device-specific information should already live?

  1. Offer a method like get_device_stats as an abstractmethod on the Accelerator interface (a minimal sketch is given after this list). CPU, GPU, TPU, and IPU classes would implement this to return whatever stats are needed. This payload should be a black box to the trainer as long as it conforms to the expected return type. The implementations of this function could rely on the existing utilities we already have, which would ensure that all the device-specific code lives roughly in the same place. The stats returned should be per-device: the calling entities fetching these stats should determine whether or how to aggregate these stats (e.g. log stats from rank 0 only, reduce across ranks and log from rank 0, log the min/max across ranks, or log data from all ranks).

  2. In terms of what actually fetches these stats, we could either:

  • offer a generic callback like DeviceStatsMonitor to periodically gather and log this data. This callback would consolidate the GPU and XLA ones and provide a general extension point for any new device types Lightning supports via the Accelerator interface.
  • Or, the trainer/loops themselves could periodically gather & log this information.

My preference is to go with the callback-based approach first: it remains the most extensible for customizing options like logging frequency (e.g. step-based or time-based).
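
As a minimal sketch of what the interface in point 1 could look like (a simplified stand-in for the Accelerator class; everything beyond the get_device_stats name and the accelerator names mentioned above is illustrative):

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

import torch


class Accelerator(ABC):
    # Simplified stand-in for Lightning's Accelerator base class, for illustration only.

    @abstractmethod
    def get_device_stats(self, device: torch.device) -> Dict[str, Any]:
        """Return stats for the given device; the payload is opaque to the Trainer."""


class GPUAccelerator(Accelerator):
    def get_device_stats(self, device: torch.device) -> Dict[str, Any]:
        # CUDA memory counters straight from PyTorch (#8780), no nvidia-smi required.
        return torch.cuda.memory_stats(device)


class TPUAccelerator(Accelerator):
    def get_device_stats(self, device: torch.device) -> Dict[str, Any]:
        # Would delegate to torch_xla memory/metric utilities; left empty in this sketch.
        return {}
```

A DeviceStatsMonitor callback (or the loops) would then call get_device_stats per rank and hand the raw dict to the logger, leaving any aggregation decisions to the caller.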

What do you think? @kaushikb11 @tchaton @awaelchli @carmocca @four4fish @edward-io @yifuwang

Additional context

#8174 (comment)
#9013 (comment)



@ananthsub added the feature, help wanted, refactor, design, logging, and deprecation labels on Aug 22, 2021
@awaelchli
Contributor

I feel ok with this mainly because of what you mentioned here:

There are multiple codepaths Lightning offers to log GPU memory: we have logging both internal to the Trainer and via a GPU stats monitor callback. Having both is confusing and can lead to bugs like log_gpu_memory='min_max' leads to error in parsing metrics keys (#9010).

Plus, several third-party loggers offer this out of the box anyway, so there is even a third way of doing the same thing; that's why I personally never turn the feature on in Lightning.

Querying the accelerator for its performance stats seems like a reasonable approach to me.

@tchaton
Contributor

tchaton commented Aug 22, 2021

Same here!

@ananthsub ananthsub changed the title RFC: Deprecate log_gpu_memory from the Trainer constructor [RFC] Deprecate log_gpu_memory from the Trainer constructor Aug 23, 2021
@tchaton tchaton added the let's do it! approved to implement label Aug 23, 2021
@daniellepintz daniellepintz self-assigned this Sep 14, 2021
@daniellepintz
Contributor

question: is DeviceStatsMonitor callback meant to replace GPUStatsMonitor and XLAStatsMonitor?

@ananthsub
Contributor Author

ananthsub commented Sep 14, 2021

question: is DeviceStatsMonitor callback meant to replace GPUStatsMonitor and XLAStatsMonitor?

Yup!

@daniellepintz
Contributor

daniellepintz commented Sep 15, 2021

Two questions:

  1. Currently in GPUStatsMonitor, in addition to the stats we get from nvidia-smi --query-gpu, we also add some hardcoded stats ("batch_time/inter_step (ms)" and "batch_time/intra_step (ms)") that depend on the on_train_batch_start and on_train_batch_end hooks.

https://github.com/PyTorchLightning/pytorch-lightning/blob/5735b85147a7c94608cb497c4bd146c62bc59d86/pytorch_lightning/callbacks/gpu_stats_monitor.py#L152
https://github.com/PyTorchLightning/pytorch-lightning/blob/5735b85147a7c94608cb497c4bd146c62bc59d86/pytorch_lightning/callbacks/gpu_stats_monitor.py#L177

In the new DeviceStatsMonitor I don't see a clean way to preserve these stats. Is it okay to remove them, especially considering that newer versions of PyTorch will use torch.cuda.memory_stats(device=device) anyway instead of these old stats?

  2. In XLAStatsMonitor we call log_metrics in on_train_epoch_end with step=trainer.current_epoch, but in GPUStatsMonitor we call log_metrics in on_train_batch_start and on_train_batch_end with step=trainer.global_step. Where should I be logging in the new DeviceStatsMonitor?

I created a draft PR below for reference

@ananthsub
Contributor Author

ananthsub commented Sep 15, 2021

  1. Since these stats are not device-specific and rely on the batch hooks, they can be calculated inside the callback, then merged with the ones from the accelerator and logged (see the sketch below). However, timing has come up elsewhere, and I wonder if it makes sense to time this in the loop directly. @carmocca @awaelchli wdyt?

  2. @kaushikb11 any reference for why the XLA stats monitor logs the step differently than the GPU stats monitor? Everywhere else we've been using the global step, so that would be my preference, but I'd like to get @awaelchli's thoughts on this too.
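
For reference, a rough sketch of option 1: timing computed in the callback and merged with the accelerator stats. Hook signatures are approximate, and trainer.accelerator.get_device_stats is the API proposed in this issue rather than an existing one, so treat this as illustrative only.

```python
import time

from pytorch_lightning import Callback


class DeviceStatsMonitor(Callback):
    # Sketch only: merges batch timing (computed here) with device stats from the accelerator.

    def __init__(self):
        self._batch_start = None
        self._batch_end = None

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx, dataloader_idx=0):
        stats = trainer.accelerator.get_device_stats(pl_module.device)  # proposed API, see the pitch
        if self._batch_end is not None:
            # time between the end of the previous batch and the start of this one
            stats["batch_time/inter_step (ms)"] = (time.monotonic() - self._batch_end) * 1000
        self._batch_start = time.monotonic()
        trainer.logger.log_metrics(stats, step=trainer.global_step)

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=0):
        self._batch_end = time.monotonic()
        stats = trainer.accelerator.get_device_stats(pl_module.device)
        # time spent inside the current batch
        stats["batch_time/intra_step (ms)"] = (self._batch_end - self._batch_start) * 1000
        trainer.logger.log_metrics(stats, step=trainer.global_step)
```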

@kaushikb11
Contributor

kaushikb11 commented Sep 15, 2021

any reference for why the XLA stats monitor logs the step differently than the GPU stats monitor?

@daniellepintz @ananthsub Thank you for this proposal! No reason in particular; it was my personal preference. While training on TPUs, the epoch time usually decreases over the initial epochs, and it helped me track performance.

We could switch the XLA stats monitor logs to global_step. Sounds good to me!

@daniellepintz
Contributor

daniellepintz commented Sep 15, 2021

Sounds great, thank you @kaushikb11!

For 1., @ananthsub you are right, it just seemed a bit messy to do that, but sounds good. Thanks!

cc @carmocca @awaelchli in case there are other plans for time

@daniellepintz
Contributor

Another question @kaushikb11: currently with GPUStatsMonitor we only call log_metrics with @rank_zero_only. Is it OK to do this with XLAStats as well, and to log in both on_train_batch_start and on_train_batch_end?

@kaushikb11
Contributor

@daniellepintz Yup, only logging on rank_zero sounds good! But I don't think it's required to log in both on_train_batch_start and on_train_batch_end for XLAStats?

@ananthsub
Contributor Author

Another question @kaushikb11: currently with GPUStatsMonitor we only call log_metrics with @rank_zero_only. Is it OK to do this with XLAStats as well, and to log in both on_train_batch_start and on_train_batch_end?

IMO this should be handled at the logger level, along the lines of #8608

It can be very useful to have device stats logged across ranks in order to detect outlier ranks that slow down training. By not enforcing this check in the callback, we allow custom loggers to access this data across all processes.

Loggers in the framework today already provide this check here: https://github.com/PyTorchLightning/pytorch-lightning/blob/45200fc858b3e1068abe8e7c99a540996947b916/pytorch_lightning/loggers/tensorboard.py#L215-L216
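
A toy illustration of this point, with a hypothetical StatsWriter standing in for a logger backend: the rank-zero guard lives on the logger's log_metrics, not in the callback, so a custom logger can choose to receive stats from every rank.

```python
from pytorch_lightning.utilities import rank_zero_only


class StatsWriter:
    # Hypothetical stand-in for a logger backend, not a real Lightning class.

    @rank_zero_only
    def log_metrics(self, metrics, step=None):
        # Built-in loggers (e.g. TensorBoardLogger.log_metrics, linked above) carry this
        # decorator, so the restriction is enforced here rather than in DeviceStatsMonitor.
        print(f"rank 0 only: step={step} {metrics}")

    def log_metrics_all_ranks(self, metrics, step=None):
        # A custom logger could omit the decorator to see per-rank device stats
        # and spot straggler ranks.
        print(f"any rank: step={step} {metrics}")
```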

@daniellepintz
Contributor

daniellepintz commented Sep 16, 2021

Thanks Kaushik and Ananth. The fundamental problem here is that the implementations of GPUStatsMonitor and XLAStatsMonitor currently differ in whether they log on rank zero only and whether they log in on_train_batch_start, and I am trying to combine them:

https://github.com/PyTorchLightning/pytorch-lightning/blob/5735b85147a7c94608cb497c4bd146c62bc59d86/pytorch_lightning/callbacks/gpu_stats_monitor.py#L132-L179

https://github.com/PyTorchLightning/pytorch-lightning/blob/5735b85147a7c94608cb497c4bd146c62bc59d86/pytorch_lightning/callbacks/xla_stats_monitor.py#L59-L95

When combining them into DeviceStats we can either pick and choose what we want, i.e. not carry over the @rank_zero_only, or we can keep everything and add some ugly if statements like if isinstance(trainer.accelerator, pl.accelerators.GPUAccelerator).

@JackCaoG

@kaushikb11 Please open a GitHub issue on pt/xla and we can discuss there. In short, sometimes we actually expect some process to hold the libtpu.so lock when we launch.

@daniellepintz
Contributor

@kaushikb11 sounds good, will add logging for on_train_batch_start.

epoch_time: @ananthsub and I were thinking it makes more sense for time to be part of the profiler, since it is not really a device stat. I can follow up and make sure we have the same functionality on the profiler?

For GPU with XLA, if we end up enabling that, we could always just pass a flag to get_device_stats on GPU indicating whether to use XLA or not?

@daniellepintz daniellepintz changed the title [RFC] Deprecate log_gpu_memory from the Trainer constructor Revamp Device Stats Logging Sep 17, 2021
@ananthsub
Contributor Author

epoch_time: @ananthsub and I were thinking it makes more sense for time to be part of the profiler, since it is not really a device stat. I can follow up and make sure we have the same functionality on the profiler?

Not fully. I think this could be part of either:

  • the loop directly, using the loop's on run start and on run end hooks
  • the profiler hook over the epoch (but the profiler can include other information like traces, whereas many people just want the overall latency tracked)

I think these serve two different needs: a Timer is very lightweight, whereas a Profiler can be pretty heavyweight.

@daniellepintz
Contributor

Sounds good, thanks @ananthsub. I can add the timing wherever people think it makes the most sense. Let me know what you think is best @kaushikb11.

@twsl
Contributor

twsl commented Oct 8, 2021

I just noticed your PRs, and it seems like you are using just the accelerators for logging. In order to simplify device stats logging, I'd like to propose logging other general system metrics as well. The system memory in particular could be relevant for detecting memory leaks if you are doing preprocessing on the CPU while using a GPU accelerator.

@daniellepintz
Contributor

Thanks @twsl for the suggestion! Do you mind explaining a bit more about how this would simplify the device stats logging?

@twsl
Contributor

twsl commented Oct 14, 2021

Hey @daniellepintz, to me this issue is all about simplifying the logging of device stats. Currently, basic system stats like system memory are not logged yet, even though they can influence the accelerator. My proposal would be that once the CPUAccelerator returns values, the DeviceStatsMonitor should always log CPU and system memory stats, regardless of the accelerator in use.
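
A rough sketch of what that could look like, assuming psutil is available (it is not a Lightning requirement); names are illustrative.

```python
import psutil  # assumed third-party dependency, not required by Lightning


def get_system_stats() -> dict:
    # Host-level stats that could be logged by DeviceStatsMonitor regardless of
    # which accelerator is in use (e.g. to catch memory leaks in CPU preprocessing).
    vm = psutil.virtual_memory()
    return {
        "cpu_percent": psutil.cpu_percent(interval=None),
        "system_memory_used_bytes": vm.used,
        "system_memory_percent": vm.percent,
    }
```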

@carmocca
Contributor

carmocca commented Nov 3, 2021

@daniellepintz @ananthsub Can you confirm all proposed changes here are done? And close the issue in that case.

@twsl I understand you are proposing adding a Callback for general stats that aren't necessarily related to a single accelerator (GPU, CPU, ...). This is not strictly part of the revamp proposed in this issue, so I would suggest you open a separate feature request for it. Thank you!

@cowwoc
Contributor

cowwoc commented Jan 24, 2022

I'm confused by this change. GPUStatsMonitor logs metrics like "gpu utilization", but when I migrate to DeviceStatsMonitor I don't see any GPU metrics logged anywhere. Is there any documentation on using DeviceStatsMonitor and, more concretely, on how to migrate from GPUStatsMonitor to DeviceStatsMonitor?

@daniellepintz
Contributor

@cowwoc The likely reason you are seeing different metrics is that with DeviceStatsMonitor we now use torch.cuda.memory_stats to get the GPU stats if you are using torch >= 1.8, rather than nvidia-smi:
https://github.com/PyTorchLightning/pytorch-lightning/blob/86b177ebe5427725b35fde1a8808a7b59b8a277a/pytorch_lightning/accelerators/gpu.py#L73-L75

Do the stats here https://pytorch.org/docs/stable/generated/torch.cuda.memory_stats.html provide you with the information you need? I believe they should be comparable to nvidia-smi.
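
For anyone comparing the two sources, a quick way to see what torch.cuda.memory_stats actually reports (a small illustrative snippet):

```python
import torch

if torch.cuda.is_available():
    stats = torch.cuda.memory_stats(torch.device("cuda:0"))
    # The result is a flat dict of allocator counters such as
    # "allocated_bytes.all.current" and "reserved_bytes.all.current";
    # it contains memory counters only, no utilization, temperature, or fan fields.
    for key in ("allocated_bytes.all.current", "reserved_bytes.all.current"):
        print(key, stats.get(key))
```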

@twsl
Contributor

twsl commented Feb 2, 2022

The memory stats include a lot of unnecessary information and miss the GPU utilization.

@daniellepintz
Contributor

I see. Well, if it's not useful, I can revert it to use nvidia-smi only. What do people think?

@carmocca
Contributor

carmocca commented Feb 2, 2022

The GPU utilization is included as Total memory allocation, but it's true that there's too much information for humans to read.

I'd say the callback needs a --human mode that shows a curated view. The current callback is only useful for logging the values to a Logger.

@twsl
Contributor

twsl commented Feb 2, 2022

@carmocca as far as I understand torch.cuda.memory_stats, it is only about memory, not the equivalent of https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/accelerators/gpu.py#L101
Additional metrics such as fan speed, temperature, and power draw are also missing.

@cowwoc
Contributor

cowwoc commented Feb 2, 2022

@carmocca

The GPU utilization is included as Total memory allocation, but it's true that there's too much information for humans to read.

To echo what @twsl said, I am looking for "utilization.gpu" (processor use) and "utilization.memory" (memory use). I don't think the new logs include all of this information (and if they do, I agree it's not human-readable).

@daniellepintz
Contributor

Yeah, I agree it does seem that torch.cuda.memory_stats is only about memory and doesn't cover everything that nvidia-smi does. I propose we revert to using nvidia-smi here. cc @edward-io @ananthsub
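
If we do go back to nvidia-smi for these fields, a minimal query for the two metrics mentioned above could look like the following (a sketch, not the callback's actual implementation):

```python
import subprocess


def query_gpu_utilization(gpu_index: int = 0) -> dict:
    # Ask nvidia-smi for processor and memory utilization of a single GPU.
    out = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=utilization.gpu,utilization.memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout.strip()
    gpu_util, mem_util = (float(x.strip()) for x in out.split(","))
    return {"utilization.gpu": gpu_util, "utilization.memory": mem_util}
```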
