Hello, always grateful for your hard work. I'm not sure whether this is a DeepSpeed-only bug.
trainer.validate() loads the optimizers for some reason, and when it does, it doesn't load them properly; this behavior differs from trainer.fit().
Case 0. Small model with DeepspeedCPUAdam + stage 2 offload
trainer.validate() throws AssertionError: CPUAdam param is on cuda:0 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config.
trainer.fit() works fine.
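For reference, the assertion message refers to the offload_optimizer entry of the ZeRO config. A minimal sketch of the stage-2 offload shape it expects (field names follow DeepSpeed's documented config schema; the values are illustrative, not the exact config files used in this repro):

```python
# Sketch of a ZeRO stage-2 config with optimizer offload enabled, i.e. the
# setting the CPUAdam assertion checks for. Illustrative values only.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",  # must be "cpu" when using DeepspeedCPUAdam
        },
    },
}
```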
Case 1. Big model with torch.optim.Adam + stage 2 offload or stage 3
I thought this was a DeepspeedCPUAdam issue, so I tried vanilla Adam; since the error persists, this confirms it's a Lightning issue.
trainer.validate() throws a CUDA OOM error while loading the optimizer.
# Tested on T4 (16GB); tweak the batch size or select GPT-Neo-2.7B if this doesn't throw an OOM error
python run.py --config configs/1.OOM_validate.json  # BUG
python run.py --config configs/1.OOM_fit.json       # No BUG
Error messages and logs
Case 0
Traceback (most recent call last):
File "/mnt/home/dongkeun/knowledge-unlearning-test/run.py", line 109, in <module>
trainer.validate(model)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 774, in validate
return self._call_and_handle_interrupt(self._validate_impl, model, dataloaders, ckpt_path, verbose, datamodule)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 821, in _validate_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1147, in _run
self.strategy.setup(self)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/deepspeed.py", line 376, in setup
self.init_deepspeed()
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/deepspeed.py", line 492, in init_deepspeed
self._initialize_deepspeed_inference(model)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/deepspeed.py", line 602, in _initialize_deepspeed_inference
model, _, _, _ = deepspeed.initialize(
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/__init__.py", line 119, in initialize
engine = DeepSpeedEngine(args=args,
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 291, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1147, in _configure_optimizer
self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1304, in _configure_fp16_optimizer
optimizer = FP16_UnfusedOptimizer(
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 106, in __init__
self.initialize_optimizer_states()
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 406, in initialize_optimizer_states
self.optimizer.step()
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/torch/optim/optimizer.py", line 88, in wrapper
return func(*args, **kwargs)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 145, in step
assert p.device == device, f"CPUAdam param is on {p.device} and must be 'cpu', make " \
AssertionError: CPUAdam param is on cuda:0 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config.
Case 1
Traceback (most recent call last):
File "/mnt/home/dongkeun/knowledge-unlearning-test/run.py", line 109, in <module>
trainer.validate(model)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 774, in validate
return self._call_and_handle_interrupt(self._validate_impl, model, dataloaders, ckpt_path, verbose, datamodule)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 821, in _validate_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1147, in _run
self.strategy.setup(self)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/deepspeed.py", line 376, in setup
self.init_deepspeed()
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/deepspeed.py", line 492, in init_deepspeed
self._initialize_deepspeed_inference(model)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/deepspeed.py", line 602, in _initialize_deepspeed_inference
model, _, _, _ = deepspeed.initialize(
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/__init__.py", line 119, in initialize
engine = DeepSpeedEngine(args=args,
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 291, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1147, in _configure_optimizer
self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1304, in _configure_fp16_optimizer
optimizer = FP16_UnfusedOptimizer(
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 106, in __init__
self.initialize_optimizer_states()
File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 402, in initialize_optimizer_states
param.grad = torch.zeros(param.size(),
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 15.75 GiB total capacity; 14.11 GiB already allocated; 51.62 MiB free; 14.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
How to work around the bug
Removing
self.configure_optimzer
, and other training related methods, then callingtrainer.validate()
will remove the bug.cc @awaelchli @akihironitta
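A minimal sketch of that workaround (MyModel here is a stand-in for the repro's LightningModule in run.py, not real code from it; returning None from configure_optimizers is what tells Lightning there is no optimizer to set up):

```python
class MyModel:
    """Stand-in for the repro's LightningModule; illustration only."""

    def configure_optimizers(self):
        # Training-time setup: DeepspeedCPUAdam in Case 0,
        # torch.optim.Adam in Case 1.
        return "optimizer placeholder"


class EvalOnlyModel(MyModel):
    """Variant intended only for trainer.validate()."""

    def configure_optimizers(self):
        # Returning None tells Lightning no optimizer needs configuring,
        # so DeepSpeed should not try to build optimizer states
        # (and thus should not hit the CPUAdam assertion or the OOM).
        return None
```

With a class like this, calling trainer.validate() on an EvalOnlyModel instance avoids the optimizer path, while the training module keeps its original configure_optimizers.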