
Running on CPU #46

Closed · Sieqfried opened this issue Oct 11, 2019 · 7 comments

@Sieqfried

Hi,

I am currently trying to run 'simplest_example.py' on the CPU inside a Docker container.

I have tried modifying the code to run on the CPU by passing:

  • "placement=DeviceType.CPU" to the factory, which produces an error regarding CUDA:

Traceback (most recent call last):
  File "simplest_example.py", line 27, in <module>
    optimizer="sgd")
  File "/opt/conda/lib/python3.6/site-packages/nemo_toolkit-0.8-py3.6.egg/nemo/core/neural_factory.py", line 526, in train
    stop_on_nan_loss=stop_on_nan_loss)
  File "/opt/conda/lib/python3.6/site-packages/nemo_toolkit-0.8-py3.6.egg/nemo/backends/pytorch/actions.py", line 1022, in train
    'amp_min_loss_scale', 1.0))
  File "/opt/conda/lib/python3.6/site-packages/nemo_toolkit-0.8-py3.6.egg/nemo/backends/pytorch/actions.py", line 359, in __initialize_amp
    opt_level=AmpOptimizations[optim_level],
  File "/opt/conda/lib/python3.6/site-packages/apex/amp/frontend.py", line 358, in initialize
    return _initialize(models, optimizers, _amp_state.opt_properties, num_losses, cast_model_outputs)
  File "/opt/conda/lib/python3.6/site-packages/apex/amp/_initialize.py", line 170, in _initialize
    check_params_fp32(models)
  File "/opt/conda/lib/python3.6/site-packages/apex/amp/_initialize.py", line 92, in check_params_fp32
    name, param.type()))
  File "/opt/conda/lib/python3.6/site-packages/apex/amp/_amp_state.py", line 32, in warn_or_err
    raise RuntimeError(msg)
RuntimeError: Found param fc1.weight with type torch.FloatTensor, expected torch.cuda.FloatTensor.
When using amp.initialize, you need to provide a model with parameters
located on a CUDA device before passing it no matter what optimization level
you chose. Use model.to('cuda') to use the default device.

To fix that issue I additionally passed:

  • 'optimization_level=1' to prevent APEX from being called, which returned:

2019-10-11 09:32:10,688 - WARNING - Data Layer does not have any weights to return. This get_weights call returns None.
Starting .....
Starting epoch 0
Traceback (most recent call last):
  File "simplest_example.py", line 27, in <module>
    optimizer="sgd")
  File "/opt/conda/lib/python3.6/site-packages/nemo_toolkit-0.8-py3.6.egg/nemo/core/neural_factory.py", line 526, in train
    stop_on_nan_loss=stop_on_nan_loss)
  File "/opt/conda/lib/python3.6/site-packages/nemo_toolkit-0.8-py3.6.egg/nemo/backends/pytorch/actions.py", line 1184, in train
    final_loss.get_device()))
RuntimeError: Device index must not be negative

How do I run the example on CPU? Thanks.

@okuchaiev (Member)

Please have a look at this PR: #48
In that PR I am able to run the simplest examples on a MacBook Pro (without a proper GPU) by setting:

from nemo.core import DeviceType
nf = nemo.core.NeuralModuleFactory(placement=DeviceType.CPU)
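As a rough sketch of how that factory line might slot into the rest of simplest_example.py: the data layer, TaylorNet, and loss module names below are assumptions based on NeMo 0.8's tutorial modules, not code taken from the PR; only the CPU placement line comes from the comment above.

```python
# Hedged sketch, not the verbatim example from the repo.
import nemo
from nemo.core import DeviceType

# Construct the factory with CPU placement so no CUDA device is required.
nf = nemo.core.NeuralModuleFactory(placement=DeviceType.CPU)

# Assumed tutorial modules (fc1.weight in the traceback above belongs to TaylorNet).
dl = nemo.tutorials.RealFunctionDataLayer(n=10000, batch_size=128)
fx = nemo.tutorials.TaylorNet(dim=4)
mse = nemo.tutorials.MSELoss()

# Describe the data flow.
x, y = dl()
p = fx(x=x)
loss = mse(predictions=p, target=y)

# Train on CPU; "sgd" matches the optimizer from the traceback above.
nf.train([loss], optimizer="sgd",
         optimization_params={"num_epochs": 3, "lr": 0.0003})
```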

okuchaiev self-assigned this on Oct 11, 2019
@Sieqfried (Author)

I tried out the PR but am still getting this error:

Traceback (most recent call last):
  File "simplest_example.py", line 30, in <module>
    optimizer="sgd")
  File "/opt/conda/lib/python3.6/site-packages/nemo_toolkit-0.8-py3.6.egg/nemo/core/neural_factory.py", line 536, in train
    stop_on_nan_loss=stop_on_nan_loss)
  File "/opt/conda/lib/python3.6/site-packages/nemo_toolkit-0.8-py3.6.egg/nemo/backends/pytorch/actions.py", line 1209, in train
    final_loss.get_device()))
RuntimeError: Device index must not be negative

@okuchaiev (Member)

Just pushed an update, could you please try again?

@okuchaiev (Member) commented on Oct 17, 2019

Pushed more updates. On my MacBook Pro (without a proper GPU), the following seems to work:
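A minimal sketch of what that setup presumably looked like, assuming it matches the CPU placement from the earlier comment; this is not the exact snippet posted:

```python
# Assumed setup: CPU placement so neither a CUDA device nor apex/AMP is needed.
import nemo
from nemo.core import DeviceType

nf = nemo.core.NeuralModuleFactory(placement=DeviceType.CPU)
```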

@Sieqfried (Author)

Sorry for the delay.

Yes, now I am able to run all the examples in the start_here folder. I am running them in a Docker container, so I have not run the IPython scripts.

@okuchaiev (Member)

Merged the PR to master. Closing this issue.

@JFFerraro5

I am trying to run a training job on the CPU in a Docker container. It seems 'DeviceType' no longer exists in nemo.core. The error is similar to the one in the original question:

  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 299, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

and, when importing DeviceType as suggested earlier in this thread:

    from nemo.core import DeviceType
ImportError: cannot import name 'DeviceType' from 'nemo.core'
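For reference, a hedged sketch rather than an answer from the thread: newer NeMo releases are built on PyTorch Lightning, where device selection typically goes through the Lightning Trainer instead of a NeuralModuleFactory placement. The model object is omitted below and assumed to be a NeMo/Lightning model instance.

```python
# Hedged sketch for Lightning-based NeMo releases; `model` is a placeholder
# for whatever NeMo model class the training job uses.
import pytorch_lightning as pl

# Run entirely on CPU, so no NVIDIA driver or CUDA device is required.
trainer = pl.Trainer(accelerator="cpu", devices=1, max_epochs=1)
# trainer.fit(model)
```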
