RuntimeError: expected scalar type Float but found Half .. : when training image-gpt model with deepspeed #8125

Closed
GeorgeQ-Q opened this issue Jun 25, 2021 · 4 comments · Fixed by Lightning-Universe/lightning-bolts#694
Labels: bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task)

Comments

@GeorgeQ-Q

🐛 Bug

To Reproduce

#### 1.

git clone https://github.com/teddykoker/image-gpt.git

#### 2. Add the deepspeed_stage_2 plugin to pl.Trainer, i.e. change:

trainer = pl.Trainer(
    max_steps=config["steps"],
    gpus=config["gpus"],
    precision=config["precision"],
    accumulate_grad_batches=config["accumulate_grad_batches"],
    checkpoint_callback=checkpoint,
    logger=logger,
)

into:

trainer = pl.Trainer(
    max_steps=config["steps"],
    gpus=config["gpus"],
    precision=config["precision"],
    accumulate_grad_batches=config["accumulate_grad_batches"],
    checkpoint_callback=checkpoint,
    logger=logger,
    plugins='deepspeed_stage_2',
)

#### 3. Run:

python src/run.py --dataset mnist train configs/s_gen.yml

Expected behavior

Training should run without errors. Instead, the sanity-check validation step fails with:

Traceback (most recent call last):
  File "src/run.py", line 96, in <module>
    args.func(args)
  File "src/run.py", line 65, in train
    trainer.fit(model, train_dl, valid_dl)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 842, in run_train
    self.run_sanity_check(self.lightning_module)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1107, in run_sanity_check
    self.run_evaluation()
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 962, in run_evaluation
    output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 174, in evaluation_step
    output = self.trainer.accelerator.validation_step(args)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 226, in validation_step
    return self.training_type_plugin.validation_step(*args)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 326, in validation_step
    return self.model(*args, **kwargs)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 1098, in forward
    loss = self.module(*inputs, **kwargs)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 62, in forward
    return super().forward(*inputs, **kwargs)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/pytorch_lightning/overrides/base.py", line 57, in forward
    output = self.module.validation_step(*inputs, **kwargs)
  File "/qiuzihan/image-gpt/src/image_gpt.py", line 125, in validation_step
    logits = self.gpt(x)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/qiuzihan/image-gpt/src/gpt2.py", line 74, in forward
    h = layer(h)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/qiuzihan/image-gpt/src/gpt2.py", line 24, in forward
    x = self.ln_1(x)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/torch/nn/modules/normalization.py", line 171, in forward
    input, self.normalized_shape, self.weight, self.bias, self.eps)
  File "/qiuzihan/image-gpt/gpt2-image/lib/python3.6/site-packages/torch/nn/functional.py", line 2202, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: expected scalar type Float but found Half
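
The traceback ends in `torch.layer_norm`, which receives both the layer's own weight and bias and the incoming activation, so one of the two is Float and the other Half. One way to find where the dtypes first diverge is a forward pre-hook that compares each module's input dtype with its parameter dtype. The following is only a minimal sketch: `report_dtype_mismatches` is a hypothetical helper, not part of PyTorch, Lightning, or DeepSpeed.

```python
import torch
import torch.nn as nn


def report_dtype_mismatches(model):
    """Hypothetical helper: record modules whose floating-point inputs
    have a different dtype than the module's own parameters."""
    mismatches = []

    def make_hook(name):
        def hook(module, inputs):
            # Only look at parameters owned directly by this module.
            param = next(module.parameters(recurse=False), None)
            if param is None:
                return
            for t in inputs:
                if torch.is_tensor(t) and t.is_floating_point() and t.dtype != param.dtype:
                    mismatches.append((name, t.dtype, param.dtype))
        return hook

    handles = [
        module.register_forward_pre_hook(make_hook(name))
        for name, module in model.named_modules()
    ]
    return mismatches, handles
```

After one forward pass (even one that raises), `mismatches` lists `(module_name, input_dtype, parameter_dtype)` tuples in call order; call `handle.remove()` on each returned handle to detach the hooks again.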

Environment

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
  • PyTorch Version (e.g., 1.0): 1.8.0
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip3
  • Build command you used (if compiling from source):
  • Python version: 3.6
  • CUDA/cuDNN version: 11.1
  • GPU models and configuration: V100 -32GB
  • Any other relevant information:
 Package                Version
---------------------- ------------
absl-py                0.13.0
aiohttp                3.7.4.post0
async-timeout          3.0.1
attrs                  21.2.0
cachetools             4.2.2
certifi                2021.5.30
chardet                4.0.0
dataclasses            0.8
deepspeed              0.4.1
fairscale              0.3.7
fsspec                 2021.6.1
future                 0.18.2
google-auth            1.32.0
google-auth-oauthlib   0.4.4
grpcio                 1.38.1
idna                   2.10
idna-ssl               1.1.0
importlib-metadata     4.5.0
Markdown               3.3.4
multidict              5.1.0
ninja                  1.10.0.post2
numpy                  1.19.5
oauthlib               3.1.1
packaging              20.9
Pillow                 8.2.0
pip                    21.1.2
protobuf               3.17.3
psutil                 5.8.0
pyasn1                 0.4.8
pyasn1-modules         0.2.8
pyDeprecate            0.3.0
pyparsing              2.4.7
pytorch-lightning      1.3.7.post0
PyYAML                 5.4.1
requests               2.25.1
requests-oauthlib      1.3.0
rsa                    4.7.2
setuptools             57.0.0
six                    1.16.0
tensorboard            2.4.1
tensorboard-plugin-wit 1.8.0
tensorboardX           1.8
torch                  1.8.0+cu111
torchmetrics           0.3.2
torchvision            0.9.0+cu111
tqdm                   4.61.1
triton                 0.4.2
typing-extensions      3.10.0.0
urllib3                1.26.5
Werkzeug               2.0.1
wheel                  0.36.2
yarl                   1.6.3
zipp                   3.4.1

@tchaton
Contributor

tchaton commented Jun 28, 2021

Dear @GeorgeQ-Q,

Would it be possible for you to reproduce this behaviour with the BoringModel?

Best,
T.C

@GeorgeQ-Q
Author

boringModel

This is the best I can do, since Colab does not support DDP.

Best,
G

@griff4692

griff4692 commented Jul 1, 2021

following this as I am having the same issue :(

I didn't see any memory improvement with fairscale so am hoping deepspeed offers some

@SeanNaren
Contributor

Thanks for your reproducible sample, @GeorgeQ-Q.

I've made a fix in Lightning Bolts here: Lightning-Universe/lightning-bolts#694. With the latest DeepSpeed this works, as they've also fixed the underlying issue with GPT vision models :)

For anyone writing custom code: make sure that any tensors you create within the forward pass of your module have the correct dtype.
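
As a rough illustration of that advice, here is a minimal sketch of a module that builds a tensor inside `forward` and takes its dtype and device from the input rather than relying on the float32 default. The module `StartTokenPrepend` and its shapes are hypothetical and are not taken from image-gpt or from the Bolts fix.

```python
import torch
import torch.nn as nn


class StartTokenPrepend(nn.Module):
    """Hypothetical example: prepend a learned start token to a sequence."""

    def __init__(self, embed_dim):
        super().__init__()
        self.sos = nn.Parameter(torch.randn(embed_dim))

    def forward(self, h):
        # h has shape (seq_len, batch, embed_dim).
        _, batch, embed_dim = h.shape
        # new_ones inherits dtype and device from h, so the result stays
        # in half precision when the activations are in half precision.
        sos = h.new_ones(1, batch, embed_dim) * self.sos
        # Shift the sequence right by one position and put the start token first.
        return torch.cat([sos, h[:-1]], dim=0)
```

Writing `torch.ones(1, batch, embed_dim, device=h.device)` instead would create a float32 tensor; multiplying it by a half-precision parameter promotes the result to float32, which can then reach a half-precision layer further down the network.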
