trainer.validate() will not load optimizer properly, different behavior from trainer.fit() #14993

Open
MattYoon opened this issue Oct 4, 2022 · 2 comments

Comments


MattYoon commented Oct 4, 2022

Bug description

Hello, always grateful for your hard work. I'm not sure if this is a DeepSpeed-only bug.
trainer.validate() loads the optimizers for some reason, and when it does, it doesn't load them properly. This behavior differs from trainer.fit().

Case 0. When using a small model with DeepSpeedCPUAdam + stage2_offload

  • trainer.validate() throws AssertionError: CPUAdam param is on cuda:0 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config.
  • trainer.fit() works fine

Case 1. When using a big model with torch.optim.Adam + stage2_offload or stage_3
I thought this was a DeepSpeedCPUAdam issue, so I tried with vanilla Adam, but this confirms it's a Lightning issue.

  • trainer.validate() throws a CUDA OOM error while loading the optimizer
  • trainer.fit() works fine

How to reproduce the bug

git clone https://github.com/MattYoon/knowledge-unlearning-test
python run.py --config configs/0.cpuadam_validate.json # BUG
python run.py --config configs/0.cpuadam_fit.json # NoBUG
# Tested on T4(16GB), please tweak batch size or select GPT-Neo-2.7B if this doesn't throw OOM error
python run.py --config configs/1.OOM_validate.json # BUG
python run.py --config configs/1.OOM_fit.json # NoBUG
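
For reference, here is a minimal self-contained sketch of this kind of setup (this is not the linked repository; the model, data, learning rate, and strategy settings below are illustrative assumptions):

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from deepspeed.ops.adam import DeepSpeedCPUAdam


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch[0]).sum()

    def validation_step(self, batch, batch_idx):
        return self.layer(batch[0]).sum()

    def configure_optimizers(self):
        # DeepSpeedCPUAdam expects its parameters to live on the CPU,
        # which ZeRO stage 2 optimizer offload takes care of during fit().
        return DeepSpeedCPUAdam(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)

    def val_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)


if __name__ == "__main__":
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,
        precision=16,
        strategy="deepspeed_stage_2_offload",
        max_epochs=1,
    )
    trainer.validate(BoringModel())  # Case 0: hits the CPUAdam device assertion
    # trainer.fit(BoringModel())     # works fine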

Error messages and logs

Case 0

Traceback (most recent call last):
  File "/mnt/home/dongkeun/knowledge-unlearning-test/run.py", line 109, in <module>
    trainer.validate(model)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 774, in validate
    return self._call_and_handle_interrupt(self._validate_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 821, in _validate_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1147, in _run
    self.strategy.setup(self)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/deepspeed.py", line 376, in setup
    self.init_deepspeed()
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/deepspeed.py", line 492, in init_deepspeed
    self._initialize_deepspeed_inference(model)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/deepspeed.py", line 602, in _initialize_deepspeed_inference
    model, _, _, _ = deepspeed.initialize(
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 291, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1147, in _configure_optimizer
    self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1304, in _configure_fp16_optimizer
    optimizer = FP16_UnfusedOptimizer(
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 106, in __init__
    self.initialize_optimizer_states()
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 406, in initialize_optimizer_states
    self.optimizer.step()
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 145, in step
    assert p.device == device, f"CPUAdam param is on {p.device} and must be 'cpu', make " \
AssertionError: CPUAdam param is on cuda:0 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config.

Case 1

Traceback (most recent call last):
  File "/mnt/home/dongkeun/knowledge-unlearning-test/run.py", line 109, in <module>
    trainer.validate(model)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 774, in validate
    return self._call_and_handle_interrupt(self._validate_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 821, in _validate_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1147, in _run
    self.strategy.setup(self)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/deepspeed.py", line 376, in setup
    self.init_deepspeed()
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/deepspeed.py", line 492, in init_deepspeed
    self._initialize_deepspeed_inference(model)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/pytorch_lightning/strategies/deepspeed.py", line 602, in _initialize_deepspeed_inference
    model, _, _, _ = deepspeed.initialize(
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 291, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1147, in _configure_optimizer
    self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1304, in _configure_fp16_optimizer
    optimizer = FP16_UnfusedOptimizer(
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 106, in __init__
    self.initialize_optimizer_states()
  File "/mnt/home/dongkeun/miniconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 402, in initialize_optimizer_states
    param.grad = torch.zeros(param.size(),
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 15.75 GiB total capacity; 14.11 GiB already allocated; 51.62 MiB free; 14.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Environment


* CUDA:
        - GPU:
                - Tesla T4
                - Tesla T4
                - Tesla T4
                - Tesla T4
        - available:         True
        - version:           11.1
* Lightning:
        - pytorch-lightning: 1.7.6
        - torch:             1.10.1+cu111
        - torchaudio:        0.10.1+rocm4.1
        - torchmetrics:      0.9.3
        - torchvision:       0.11.2+cu111
* Packages:
        - absl-py:           1.2.0
        - aiohttp:           3.8.1
        - aiosignal:         1.2.0
        - async-timeout:     4.0.2
        - attrs:             22.1.0
        - autopep8:          1.7.0
        - best-download:     0.0.9
        - black:             22.8.0
        - boto3:             1.24.72
        - botocore:          1.27.72
        - cachetools:        5.2.0
        - certifi:           2022.6.15
        - chardet:           5.0.0
        - charset-normalizer: 2.1.1
        - click:             8.1.3
        - colorama:          0.4.5
        - conda-pack:        0.6.0
        - configparser:      5.3.0
        - cython:            0.29.32
        - dataproperty:      0.55.0
        - datasets:          1.15.1
        - deepspeed:         0.5.10
        - dill:              0.3.5.1
        - docker-pycreds:    0.4.0
        - dynet38:           2.1
        - filelock:          3.8.0
        - frozenlist:        1.3.1
        - fsspec:            2022.8.2
        - gitdb:             4.0.9
        - gitpython:         3.1.27
        - google-auth:       2.11.0
        - google-auth-oauthlib: 0.4.6
        - gql:               0.2.0
        - graphql-core:      1.1
        - grpcio:            1.48.1
        - hjson:             3.1.0
        - huggingface-hub:   0.9.1
        - idna:              3.4
        - importlib-metadata: 4.12.0
        - iniconfig:         1.1.1
        - jieba:             0.42.1
        - jmespath:          1.0.1
        - joblib:            1.1.0
        - jsonlines:         2.0.0
        - lm-dataformat:     0.0.20
        - lm-eval:           0.2.0
        - markdown:          3.4.1
        - markupsafe:        2.1.1
        - mbstrdecoder:      1.1.1
        - mock:              4.0.3
        - msgfy:             0.2.0
        - multidict:         6.0.2
        - multiprocess:      0.70.13
        - mypy-extensions:   0.4.3
        - nagisa:            0.2.7
        - ninja:             1.10.2.3
        - nlp:               0.4.0
        - nltk:              3.7
        - numexpr:           2.7.2
        - numpy:             1.23.3
        - nvidia-ml-py:      11.495.46
        - nvidia-ml-py3:     7.352.0
        - nvitop:            0.8.1
        - oauthlib:          3.2.1
        - openai:            0.6.4
        - packaging:         21.3
        - pandas:            1.4.4
        - pathspec:          0.10.1
        - pathtools:         0.1.2
        - pathvalidate:      2.5.2
        - pillow:            9.2.0
        - pip:               22.1.2
        - platformdirs:      2.5.2
        - pluggy:            0.13.1
        - portalocker:       2.5.1
        - promise:           2.3
        - protobuf:          3.19.5
        - psutil:            5.9.2
        - py:                1.11.0
        - py-cpuinfo:        8.0.0
        - pyarrow:           9.0.0
        - pyasn1:            0.4.8
        - pyasn1-modules:    0.2.8
        - pybind11:          2.6.2
        - pycodestyle:       2.9.1
        - pycountry:         20.7.3
        - pydeprecate:       0.3.2
        - pyparsing:         3.0.9
        - pytablewriter:     0.58.0
        - pytest:            6.2.3
        - python-dateutil:   2.8.2
        - pytorch-lightning: 1.7.6
        - pytz:              2022.2.1
        - pyyaml:            6.0
        - regex:             2022.9.13
        - rehash:            1.0.1
        - requests:          2.28.1
        - requests-oauthlib: 1.3.1
        - rouge:             1.0.1
        - rouge-score:       0.0.4
        - rsa:               4.9
        - s3transfer:        0.6.0
        - sacrebleu:         1.5.0
        - scikit-learn:      1.1.2
        - scipy:             1.9.1
        - sentencepiece:     0.1.94
        - sentry-sdk:        1.9.8
        - setproctitle:      1.3.2
        - setuptools:        63.4.1
        - shortuuid:         1.0.9
        - six:               1.16.0
        - smmap:             5.0.0
        - sqlitedict:        1.6.0
        - subprocess32:      3.5.4
        - tabledata:         1.3.0
        - tcolorpy:          0.1.2
        - tensorboard:       2.10.0
        - tensorboard-data-server: 0.6.1
        - tensorboard-plugin-wit: 1.8.1
        - termcolor:         2.0.1
        - threadpoolctl:     3.1.0
        - tokenizers:        0.12.1
        - toml:              0.10.2
        - tomli:             2.0.1
        - torch:             1.10.1+cu111
        - torchaudio:        0.10.1+rocm4.1
        - torchmetrics:      0.9.3
        - torchvision:       0.11.2+cu111
        - tqdm:              4.64.1
        - tqdm-multiprocess: 0.0.11
        - transformers:      4.21.3
        - triton:            1.0.0
        - typepy:            1.3.0
        - typing-extensions: 4.3.0
        - ujson:             5.4.0
        - urllib3:           1.26.12
        - wandb:             0.13.3
        - watchdog:          2.1.9
        - werkzeug:          2.2.2
        - wheel:             0.37.1
        - xxhash:            3.0.0
        - yarl:              1.8.1
        - zipp:              3.8.1
        - zstandard:         0.15.2
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.9.13
        - version:           #48~18.04.1-Ubuntu SMP Tue Apr 13 19:41:38 UTC 2021

More info

How to work around the bug

Removing self.configure_optimizers and the other training-related methods, then calling trainer.validate(), avoids the bug.
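
A sketch of that workaround, assuming the minimal model from the earlier sketch (class and parameter names are illustrative): an evaluation-only LightningModule that implements the validation hooks but defines no configure_optimizers or training_step, so no optimizer is constructed when trainer.validate() sets up DeepSpeed.

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class EvalOnlyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def validation_step(self, batch, batch_idx):
        return self.layer(batch[0]).sum()

    def val_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)


if __name__ == "__main__":
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,
        precision=16,
        strategy="deepspeed_stage_2_offload",
    )
    # With no configure_optimizers defined, there is no optimizer for the
    # DeepSpeed setup path to initialize, so the errors above do not occur.
    trainer.validate(EvalOnlyModel())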

cc @awaelchli @akihironitta

MattYoon added the needs triage label on Oct 4, 2022
rohitgr7 commented Oct 5, 2022

Working on it: #14944

But it looks like it needs more investigation. I'll consider this edge case too and test it.

rohitgr7 added the strategy: deepspeed label and removed the needs triage label on Oct 5, 2022
stale bot commented Nov 13, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

stale bot added the won't fix label on Nov 13, 2022
awaelchli self-assigned this on Mar 18, 2023
stale bot removed the won't fix label on Mar 18, 2023
awaelchli removed their assignment on Nov 25, 2023