trainer.validate() will not load optimizer properly, different behavior from trainer.fit() #14993

Open
MattYoon opened this issue Oct 4, 2022 · 2 comments

Comments


MattYoon commented Oct 4, 2022

Bug description

Hello, always grateful for your hard work. I'm not sure if this is a DeepSpeed-only bug.
trainer.validate() loads the optimizers for some reason, and when it does, it doesn't load them properly. This behavior differs from trainer.fit().

Case 0. When using a small model with DeepSpeedCPUAdam + stage2_offload

  • trainer.validate() throws AssertionError: CPUAdam param is on cuda:0 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config.
  • trainer.fit() works fine

Case 1. When using a big model with torch.optim.Adam + stage2_offload or stage_3
I thought this was a DeepSpeedCPUAdam issue, so I tried with vanilla Adam, but this confirms it's a Lightning issue.

  • trainer.validate() throws a CUDA OOM error while loading the optimizer
  • trainer.fit() works fine

How to reproduce the bug

git clone https://github.com/MattYoon/knowledge-unlearning-test
python run.py --config configs/0.cpuadam_validate.json # BUG
python run.py --config configs/0.cpuadam_fit.json # NoBUG
# Tested on T4(16GB), please tweak batch size or select GPT-Neo-2.7B if this doesn't throw OOM error
python run.py --config configs/1.OOM_validate.json # BUG
python run.py --config configs/1.OOM_fit.json # NoBUG
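
For reference, here is a minimal self-contained sketch of this kind of setup (this is not the linked repository; the model, data, learning rate, and strategy settings below are illustrative assumptions):

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from deepspeed.ops.adam import DeepSpeedCPUAdam


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch[0]).sum()

    def validation_step(self, batch, batch_idx):
        return self.layer(batch[0]).sum()

    def configure_optimizers(self):
        # DeepSpeedCPUAdam expects its parameters to live on the CPU,
        # which ZeRO stage 2 optimizer offload takes care of during fit().
        return DeepSpeedCPUAdam(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)

    def val_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)


if __name__ == "__main__":
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,
        precision=16,
        strategy="deepspeed_stage_2_offload",
        max_epochs=1,
    )
    trainer.validate(BoringModel())  # Case 0: hits the CPUAdam device assertion
    # trainer.fit(BoringModel())     # works fine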

Error messages and logs

Case 0

Traceback (most recent call last):
  File "/mnt/home/dongkeun/knowledge-unlearning-test/run.py", line 109, in <module>
    trainer.validate(model)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 774, in validate
    return self._call_and_handle_interrupt(self._validate_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 821, in _validate_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1147, in _run
    self.strategy.setup(self)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/deepspeed.py", line 376, in setup
    self.init_deepspeed()
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/deepspeed.py", line 492, in init_deepspeed
    self._initialize_deepspeed_inference(model)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/deepspeed.py", line 602, in _initialize_deepspeed_inference
    model, _, _, _ = deepspeed.initialize(
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 291, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1147, in _configure_optimizer
    self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1304, in _configure_fp16_optimizer
    optimizer = FP16_UnfusedOptimizer(
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 106, in __init__
    self.initialize_optimizer_states()
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 406, in initialize_optimizer_states
    self.optimizer.step()
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 145, in step
    assert p.device == device, f"CPUAdam param is on {p.device} and must be 'cpu', make " \
AssertionError: CPUAdam param is on cuda:0 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config.

Case 1

Traceback (most recent call last):
  File "/mnt/home/dongkeun/knowledge-unlearning-test/run.py", line 109, in <module>
    trainer.validate(model)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 774, in validate
    return self._call_and_handle_interrupt(self._validate_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 821, in _validate_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1147, in _run
    self.strategy.setup(self)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/deepspeed.py", line 376, in setup
    self.init_deepspeed()
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/deepspeed.py", line 492, in init_deepspeed
    self._initialize_deepspeed_inference(model)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/deepspeed.py", line 602, in _initialize_deepspeed_inference
    model, _, _, _ = deepspeed.initialize(
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 291, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1147, in _configure_optimizer
    self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1304, in _configure_fp16_optimizer
    optimizer = FP16_UnfusedOptimizer(
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 106, in __init__
    self.initialize_optimizer_states()
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 402, in initialize_optimizer_states
    param.grad = torch.zeros(param.size(),
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 15.75 GiB total capacity; 14.11 GiB already allocated; 51.62 MiB free; 14.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Environment


* CUDA:
        - GPU:
                - Tesla T4
                - Tesla T4
                - Tesla T4
                - Tesla T4
        - available:         True
        - version:           11.1
* Lightning:
        - pytorch-lightning: 1.7.6
        - torch:             1.10.1+cu111
        - torchaudio:        0.10.1+rocm4.1
        - torchmetrics:      0.9.3
        - torchvision:       0.11.2+cu111
* Packages:
        - absl-py:           1.2.0
        - aiohttp:           3.8.1
        - aiosignal:         1.2.0
        - async-timeout:     4.0.2
        - attrs:             22.1.0
        - autopep8:          1.7.0
        - best-download:     0.0.9
        - black:             22.8.0
        - boto3:             1.24.72
        - botocore:          1.27.72
        - cachetools:        5.2.0
        - certifi:           2022.6.15
        - chardet:           5.0.0
        - charset-normalizer: 2.1.1
        - click:             8.1.3
        - colorama:          0.4.5
        - conda-pack:        0.6.0
        - configparser:      5.3.0
        - cython:            0.29.32
        - dataproperty:      0.55.0
        - datasets:          1.15.1
        - deepspeed:         0.5.10
        - dill:              0.3.5.1
        - docker-pycreds:    0.4.0
        - dynet38:           2.1
        - filelock:          3.8.0
        - frozenlist:        1.3.1
        - fsspec:            2022.8.2
        - gitdb:             4.0.9
        - gitpython:         3.1.27
        - google-auth:       2.11.0
        - google-auth-oauthlib: 0.4.6
        - gql:               0.2.0
        - graphql-core:      1.1
        - grpcio:            1.48.1
        - hjson:             3.1.0
        - huggingface-hub:   0.9.1
        - idna:              3.4
        - importlib-metadata: 4.12.0
        - iniconfig:         1.1.1
        - jieba:             0.42.1
        - jmespath:          1.0.1
        - joblib:            1.1.0
        - jsonlines:         2.0.0
        - lm-dataformat:     0.0.20
        - lm-eval:           0.2.0
        - markdown:          3.4.1
        - markupsafe:        2.1.1
        - mbstrdecoder:      1.1.1
        - mock:              4.0.3
        - msgfy:             0.2.0
        - multidict:         6.0.2
        - multiprocess:      0.70.13
        - mypy-extensions:   0.4.3
        - nagisa:            0.2.7
        - ninja:             1.10.2.3
        - nlp:               0.4.0
        - nltk:              3.7
        - numexpr:           2.7.2
        - numpy:             1.23.3
        - nvidia-ml-py:      11.495.46
        - nvidia-ml-py3:     7.352.0
        - nvitop:            0.8.1
        - oauthlib:          3.2.1
        - openai:            0.6.4
        - packaging:         21.3
        - pandas:            1.4.4
        - pathspec:          0.10.1
        - pathtools:         0.1.2
        - pathvalidate:      2.5.2
        - pillow:            9.2.0
        - pip:               22.1.2
        - platformdirs:      2.5.2
        - pluggy:            0.13.1
        - portalocker:       2.5.1
        - promise:           2.3
        - protobuf:          3.19.5
        - psutil:            5.9.2
        - py:                1.11.0
        - py-cpuinfo:        8.0.0
        - pyarrow:           9.0.0
        - pyasn1:            0.4.8
        - pyasn1-modules:    0.2.8
        - pybind11:          2.6.2
        - pycodestyle:       2.9.1
        - pycountry:         20.7.3
        - pydeprecate:       0.3.2
        - pyparsing:         3.0.9
        - pytablewriter:     0.58.0
        - pytest:            6.2.3
        - python-dateutil:   2.8.2
        - pytorch-lightning: 1.7.6
        - pytz:              2022.2.1
        - pyyaml:            6.0
        - regex:             2022.9.13
        - rehash:            1.0.1
        - requests:          2.28.1
        - requests-oauthlib: 1.3.1
        - rouge:             1.0.1
        - rouge-score:       0.0.4
        - rsa:               4.9
        - s3transfer:        0.6.0
        - sacrebleu:         1.5.0
        - scikit-learn:      1.1.2
        - scipy:             1.9.1
        - sentencepiece:     0.1.94
        - sentry-sdk:        1.9.8
        - setproctitle:      1.3.2
        - setuptools:        63.4.1
        - shortuuid:         1.0.9
        - six:               1.16.0
        - smmap:             5.0.0
        - sqlitedict:        1.6.0
        - subprocess32:      3.5.4
        - tabledata:         1.3.0
        - tcolorpy:          0.1.2
        - tensorboard:       2.10.0
        - tensorboard-data-server: 0.6.1
        - tensorboard-plugin-wit: 1.8.1
        - termcolor:         2.0.1
        - threadpoolctl:     3.1.0
        - tokenizers:        0.12.1
        - toml:              0.10.2
        - tomli:             2.0.1
        - torch:             1.10.1+cu111
        - torchaudio:        0.10.1+rocm4.1
        - torchmetrics:      0.9.3
        - torchvision:       0.11.2+cu111
        - tqdm:              4.64.1
        - tqdm-multiprocess: 0.0.11
        - transformers:      4.21.3
        - triton:            1.0.0
        - typepy:            1.3.0
        - typing-extensions: 4.3.0
        - ujson:             5.4.0
        - urllib3:           1.26.12
        - wandb:             0.13.3
        - watchdog:          2.1.9
        - werkzeug:          2.2.2
        - wheel:             0.37.1
        - xxhash:            3.0.0
        - yarl:              1.8.1
        - zipp:              3.8.1
        - zstandard:         0.15.2
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.9.13
        - version:           #48~18.04.1-Ubuntu SMP Tue Apr 13 19:41:38 UTC 2021

More info

How to work around the bug

Removing self.configure_optimizers and the other training-related methods, then calling trainer.validate(), avoids the bug.
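
A sketch of that workaround, assuming the minimal model from the earlier sketch (class and parameter names are illustrative): an evaluation-only LightningModule that implements the validation hooks but defines no configure_optimizers or training_step, so no optimizer is constructed when trainer.validate() sets up DeepSpeed.

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class EvalOnlyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def validation_step(self, batch, batch_idx):
        return self.layer(batch[0]).sum()

    def val_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)


if __name__ == "__main__":
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,
        precision=16,
        strategy="deepspeed_stage_2_offload",
    )
    # With no configure_optimizers defined, there is no optimizer for the
    # DeepSpeed setup path to initialize, so the errors above do not occur.
    trainer.validate(EvalOnlyModel())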

cc @awaelchli @akihironitta

MattYoon added the needs triage label on Oct 4, 2022
rohitgr7 commented Oct 5, 2022

Working on it: #14944

But it looks like it needs more investigation. I'll consider this edge case too and test it.

rohitgr7 added the strategy: deepspeed label and removed the needs triage label on Oct 5, 2022
stale bot commented Nov 13, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

stale bot added the won't fix label on Nov 13, 2022
awaelchli self-assigned this on Mar 18, 2023
stale bot removed the won't fix label on Mar 18, 2023
awaelchli removed their assignment on Nov 25, 2023