Revamp Device Stats Logging #9032
I feel ok with this mainly because of what you mentioned here:
Plus, several third-party loggers have this out of the box anyway, so there is even a third way of doing the same thing, and that's why I personally never turn the feature on in Lightning. Querying the accelerator for its performance stats seems a reasonable approach to me.
Same here!
log_gpu_memory from the Trainer constructor
question: is
Yup!
Two questions:
https://github.com/PyTorchLightning/pytorch-lightning/blob/5735b85147a7c94608cb497c4bd146c62bc59d86/pytorch_lightning/callbacks/gpu_stats_monitor.py#L152 In the new DeviceStatsMonitor I don't see a clean way to preserve these stats - is it okay to remove them, especially considering newer versions of PyTorch will use torch.cuda.memory_stats?
I created a draft PR below for reference
@daniellepintz @ananthsub Thank you for this proposal! No reason in particular. It was my personal preference: while training on TPUs, the epoch time usually decreases over time for the initial epochs, and it helped me track the performance. We could switch the XLA stats monitor logs to
Sounds great, thank you @kaushikb11! For 1., @ananthsub you are right, it just seemed a bit messy to do that, but sg. Thanks! cc @carmocca @awaelchli in case there are other plans for time
Another question @kaushikb11: currently with GPUStatsMonitor we only log on rank zero, right?
@daniellepintz Yup, only logging on rank zero.
IMO this should be handled at the logger level, along the lines of #8608. It can be very useful to have device stats logged across ranks in order to detect outliers for slowdowns in training. By not enforcing this check in the callback, we allow custom loggers to access this data across all processes. Loggers in the framework today already provide this check here: https://github.com/PyTorchLightning/pytorch-lightning/blob/45200fc858b3e1068abe8e7c99a540996947b916/pytorch_lightning/loggers/tensorboard.py#L215-L216
Thanks Kaushik and Ananth. The fundamental problem here is that the GPUStats and XLAStats implementations currently differ in whether they log on rank zero only and whether they log in on_train_batch_start, and I am trying to combine them. When combining them into DeviceStats we can either pick and choose what we want, i.e. not carry over the @rank_zero_only, or we can keep everything and add some ugly if statements.
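To make the trade-off concrete, here is a minimal sketch (not Lightning's actual implementation) of the two options discussed above; `get_stats` is a hypothetical stand-in for whatever fetches the device stats:

```python
# Sketch only: where the rank-zero restriction could live.
from pytorch_lightning.callbacks import Callback
from pytorch_lightning.utilities import rank_zero_only


def get_stats() -> dict:
    """Hypothetical placeholder for the device stats payload."""
    return {"gpu_memory_mb": 0.0}


class RankZeroStatsMonitor(Callback):
    """Option A: the callback enforces rank 0, as GPUStatsMonitor does today."""

    @rank_zero_only
    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        trainer.logger.log_metrics(get_stats(), step=trainer.global_step)


class AllRanksStatsMonitor(Callback):
    """Option B: log from every rank; built-in loggers already apply their own
    rank-zero check, while custom loggers can choose to keep data from all ranks."""

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        trainer.logger.log_metrics(get_stats(), step=trainer.global_step)
```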
@kaushikb11 Please open a GitHub issue on pt/xla and we can discuss it there. In short, sometimes we actually expect some process to hold the libtpu.so lock when we launch.
@kaushikb11 sg will add logging for
For GPU with XLA, if we end up enabling that, we could always just pass a flag to
log_gpu_memory from the Trainer constructor
Not fully. I think this could be part of either:
I think these serve two different purposes: a Timer is very lightweight, whereas a Profiler can be pretty heavyweight.
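For reference, a rough sketch of that split using pieces that already exist in Lightning (arguments shown are illustrative, so treat them as approximate):

```python
# Sketch: lightweight wall-clock timing vs. the heavier built-in profiler.
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import Timer

# Lightweight: track elapsed time (and optionally stop after a budget).
timer = Timer(duration="00:12:00:00")  # DD:HH:MM:SS
trainer = Trainer(callbacks=[timer])
# after trainer.fit(model):
#   timer.time_elapsed("train")  -> seconds spent in the train stage

# Heavyweight: per-hook timing report via the simple profiler.
trainer = Trainer(profiler="simple")
```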
Sounds good, thanks @ananthsub. I can add the timing wherever people think it makes the most sense. lmk what you think is best @kaushikb11
I just noticed your PRs and it seems like you are using just the accelerators for logging. In order to simplify the device stats logging, I'd like to propose logging other general system metrics as well. Especially the system memory could be relevant for detecting memory leaks if you are doing preprocessing on the CPU while using a GPU accelerator.
Thanks @twsl for the suggestion! Do you mind explaining a bit more about how this would simplify the device stats logging?
Hey @daniellepintz, to me this issue is all about simplification of logging device stats. Currently, basic system stats like system memory are not logged yet, even though they can influence the accelerator. My proposal would be that once the
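A minimal sketch of what such host-level stats could look like, assuming psutil is available (the metric names below are made up for illustration):

```python
# Sketch only: host (system) stats that could complement per-accelerator stats,
# e.g. to spot CPU-side memory leaks during data preprocessing.
import psutil


def get_system_stats() -> dict:
    vm = psutil.virtual_memory()
    return {
        "system/ram_used_mb": vm.used / 1e6,
        "system/ram_percent": vm.percent,
        "system/cpu_percent": psutil.cpu_percent(interval=None),
    }
```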
@daniellepintz @ananthsub Can you confirm all proposed changes here are done? And close the issue in that case. @twsl I understand you are proposing adding a Callback for general stats that aren't necessarily related to a single accelerator (GPU, CPU, ...). This is not strictly part of the revamp proposed in this issue, so I would suggest you open a separate feature request for it. Thank you!
I'm confused by this change. |
@cowwoc Likely the reason you are seeing different metrics is that with DeviceStatsMonitor we now use torch.cuda.memory_stats instead of nvidia-smi. Do the stats here https://pytorch.org/docs/stable/generated/torch.cuda.memory_stats.html provide you with the information you need? I believe it should be comparable to what nvidia-smi reported.
The memory stats include a lot of unnecessary information and miss the GPU utilisation.
I see. Well, if it's not useful, I can revert it to use nvidia-smi.
The GPU utilization is included as Total memory allocation, but it's true that there's too much information for humans to read. I'd say the callback needs a --human mode that shows a curated view. The current callback is only useful for logging the values to a
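For illustration, a curated view could be as simple as whitelisting a handful of keys from torch.cuda.memory_stats (the selection below is just an example, not a proposed default):

```python
# Sketch only: a human-friendly subset of torch.cuda.memory_stats.
import torch

CURATED_KEYS = (
    "allocated_bytes.all.current",
    "allocated_bytes.all.peak",
    "reserved_bytes.all.current",
    "num_alloc_retries",
)


def curated_cuda_stats(device: int = 0) -> dict:
    stats = torch.cuda.memory_stats(device)
    return {k: stats[k] for k in CURATED_KEYS if k in stats}
```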
@carmocca as far as I understand the
To echo what @twsl said, I am looking for "utilization.gpu" (processor use) and "utilization.memory" (memory use). I don't think the new logs include all this information (and if they do, I agree it's not human-readable).
Yeah, I agree it does seem that
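For what it's worth, the utilization fields requested above can still be pulled the way the old GPUStatsMonitor did, by querying nvidia-smi directly; a rough sketch:

```python
# Sketch only: utilization.gpu / utilization.memory via nvidia-smi.
import shutil
import subprocess


def nvidia_smi_utilization(gpu_index: int = 0) -> dict:
    smi = shutil.which("nvidia-smi")
    if smi is None:
        return {}  # nvidia-smi not available on this machine
    out = subprocess.run(
        [
            smi,
            "--query-gpu=utilization.gpu,utilization.memory",
            "--format=csv,nounits,noheader",
            f"--id={gpu_index}",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout.strip()
    gpu_util, mem_util = (float(x) for x in out.split(","))
    return {"utilization.gpu": gpu_util, "utilization.memory": mem_util}
```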
Proposed refactoring or deprecation
Deprecate log_gpu_memory from the Trainer constructor

Motivation
We are auditing the Lightning components and APIs to assess opportunities for improvements:
Carrying forward some of the discussion from here: #9006
By default, logging GPU stats is disabled. Users must opt into this by setting log_gpu_memory.
Issues today with this argument:
- Decoupling this from the trainer constructor gives us more flexibility to evolve the component that is actually responsible for gathering these stats, without worrying about backwards compatibility guarantees for the trainer constructor argument. For example, the currently supported values for this flag are nvidia-smi specific. If nvidia-smi isn't available, this will fail. But PyTorch now supports getting CUDA memory stats directly (Fetch GPU stats using torch.cuda.memory_stats #8780), which we could take advantage of.
- Lack of consistency across different devices: we do not offer corresponding flags for TPUs or IPUs. Creating one flag per device type would lead to further bloat in the trainer.
- There are multiple codepaths Lightning offers to log GPU memory. We have logging both internal to the trainer, as well as a GPU stats monitor callback. Having both is confusing and can lead to bugs like log_gpu_memory='min_max' leads to error in parsing metrics keys #9010.
Pitch
Instead of offering separate flags & callbacks for different device types, why don't we consolidate this on the Accelerator interface, as this is where all the device-specific information should already live?
Offer a method like get_device_stats as an abstractmethod on the Accelerator interface. CPU, GPU, TPU, and IPU classes would implement this to return whatever stats are needed. This payload should be a black box to the trainer as long as it conforms to the expected return type. The implementations of this function could rely on the existing utilities we already have. This would ensure that all the device-specific code lives roughly in the same place. The stats returned should be per-device: the calling entities fetching these stats should determine whether or how to aggregate them (e.g. log stats from rank 0 only, reduce across ranks and log from rank 0, log the min/max across ranks, or log data from all ranks).

In terms of what actually fetches these stats, we could either:
- Offer a DeviceStatsMonitor callback to periodically gather and log this data. This callback would consolidate the GPU and XLA ones and provide a general extension point for any new device types Lightning supports via the Accelerator interface.

My preference is to go with the callback-based approach first. It remains the most extensible for customizing options like logging frequency (e.g. step-based or time-based).
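To make the pitch concrete, here is a minimal sketch of the proposed shape (hypothetical signatures, not the final API):

```python
# Sketch only: hypothetical shape of the proposal, not Lightning's final API.
from abc import ABC, abstractmethod
from typing import Any, Dict

import torch


class Accelerator(ABC):  # illustrative stub of the real interface
    @abstractmethod
    def get_device_stats(self, device: Any) -> Dict[str, Any]:
        """Return per-device stats; the payload is opaque to the trainer."""


class GPUAccelerator(Accelerator):
    def get_device_stats(self, device: Any) -> Dict[str, Any]:
        # Uses torch.cuda.memory_stats (#8780) rather than shelling out to nvidia-smi.
        return torch.cuda.memory_stats(device)


class CPUAccelerator(Accelerator):
    def get_device_stats(self, device: Any) -> Dict[str, Any]:
        return {}  # could surface host memory / CPU utilization here
```

A DeviceStatsMonitor callback would then call trainer.accelerator.get_device_stats(pl_module.device) on its chosen hook and forward the resulting dict to the logger, leaving any cross-rank aggregation to the caller.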
What do you think? @kaushikb11 @tchaton @awaelchli @carmocca @four4fish @edward-io @yifuwang
Additional context
#8174 (comment)
#9013 (comment)