
Torch device error in example "plot_sliced_wass_grad_flow_pytorch.py" #371

Closed · eloitanguy opened this issue May 5, 2022 · 4 comments · Fixed by #373

Comments

@eloitanguy (Contributor)

Describe the bug

Running the "plot_sliced_wass_grad_flow_pytorch.py" example raises a torch device-related RuntimeError.

To Reproduce

Steps to reproduce the behavior:

  1. From the POT source folder, navigate to examples/backends
  2. Run python plot_sliced_wass_grad_flow_pytorch.py

Terminal output (with edited paths)

2022-05-05 11:08:24.082850: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "POT/examples/backends/plot_sliced_wass_grad_flow_pytorch.py", line 82, in <module>
    loss = ot.sliced_wasserstein_distance(x1_torch, x2_torch, n_projections=20, seed=gen)
  File "POT/ot/sliced.py", line 149, in sliced_wasserstein_distance
    projections = get_random_projections(d, n_projections, seed, backend=nx, type_as=X_s)
  File "POT/ot/sliced.py", line 58, in get_random_projections
    projections = nx.randn(d, n_projections, type_as=type_as)
  File "POT/ot/backend.py", line 1777, in randn
    return torch.randn(size=size, dtype=type_as.dtype, generator=self.rng_, device=type_as.device)
RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

Expected behavior

The script should run entirely on GPU and never expect CPU data, since torch.cuda.is_available() == True in this case.

Environment (please complete the following information):

  • OS (e.g. MacOS, Windows, Linux): Linux
  • Python version: 3.9.12
  • How was POT installed (source, pip, conda): source
  • Build command you used (if compiling from source): python setup.py build_ext --inplace
  • Only for GPU related bugs:
    • CUDA version: 10.1.24
    • GPU models and configuration: RTX 2070MQ
    • Any other relevant information: N/A

Output of the following code snippet:

import platform; print(platform.platform())

Linux-5.13.0-40-generic-x86_64-with-glibc2.31

import sys; print("Python", sys.version)

Python 3.9.12 (main, Apr 5 2022, 06:56:58)
[GCC 7.5.0]

import numpy; print("NumPy", numpy.__version__)

NumPy 1.22.3

import scipy; print("SciPy", scipy.__version__)

SciPy 1.8.0

import ot; print("POT", ot.__version__)

2022-05-05 11:25:05.637911: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
POT 0.8.3dev

import torch;print("torch", torch.__version__)

torch 1.11.0+cu102

(yes, my CUDA version is old as dirt, but this should be irrelevant)

Additional Context

I prepare my conda env as follows:

conda create -n ot_dev
conda activate ot_dev
conda install pip
pip install -r requirements.txt
cd docs/
pip install -r requirements.txt

@eloitanguy changed the title from Torch device error in example "lot_sliced_wass_grad_flow_pytorch.py" to Torch device error in example "plot_sliced_wass_grad_flow_pytorch.py" on May 5, 2022
@ncassereau-idris (Contributor)

ncassereau-idris commented May 5, 2022

Hi, thanks for your report. I can reproduce your issue.

It comes from the fact that torch.randn does not accept a device different from the one where the generator is located. It went unnoticed because, on GitHub, POT does not have access to a GPU, so the examples are computed on CPU. From what I understand of https://pytorch.org/docs/stable/generated/torch.randn.html, it seems to be a PyTorch bug; I don't think this behaviour is intended.
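
For reference, here is a minimal sketch of the failure mode (assuming a CUDA-capable machine), which mimics what the backend ends up doing when type_as lives on the GPU:

import torch

# A CPU generator combined with a CUDA device argument is rejected by torch.randn.
rng = torch.Generator(device="cpu")
rng.seed()
torch.randn(5, generator=rng, device="cuda")
# RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'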

I can see two ways of fixing it. We could either:

  • Replace, in POT/ot/backend.py (line 1777 at commit eccb138),

    return torch.randn(size=size, dtype=type_as.dtype, generator=self.rng_, device=type_as.device)

    with

    return torch.randn(size=size, dtype=type_as.dtype, generator=self.rng_).to(type_as.device)

    and do the same for torch.rand (line 1771 at the same commit):

    return torch.rand(size=size, generator=self.rng_, dtype=type_as.dtype, device=type_as.device)

  • Have one generator on CPU and another on GPU, and choose the right one each time we want a random tensor (see the sketch after the benchmark below).

The first solution is more straightforward, but it is noticeably slower at runtime. See the following benchmark (done on PyTorch 1.10 with a V100):

rng = torch.Generator("cpu")
rng.seed()
rng_gpu = torch.Generator("cuda")
rng_gpu.seed()
# Option 1: sample on CPU with the CPU generator, then move the result to GPU
%timeit torch.randn(100000, generator=rng).to("cuda")
# Option 2: sample directly on GPU with a GPU generator
%timeit torch.randn(100000, generator=rng_gpu, device="cuda")

returns

774 µs ± 135 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
10.5 µs ± 37.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So the second solution is more efficient, but requires more code.
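
For illustration, here is a minimal sketch of what the second option could look like (a hypothetical helper, not the actual POT implementation): keep one torch.Generator per device and pick the matching one for every draw.

import torch

class PerDeviceRandomState:
    """Hypothetical helper: maintain one torch.Generator per device."""

    def __init__(self):
        self._generators = {}

    def _generator_for(self, device):
        # Create the generator for this device lazily and cache it.
        key = str(device)
        if key not in self._generators:
            gen = torch.Generator(device=device)
            gen.seed()
            self._generators[key] = gen
        return self._generators[key]

    def randn(self, *size, type_as):
        # Sample directly on the device of `type_as`, with the matching generator,
        # so no CPU/GPU transfer is needed.
        gen = self._generator_for(type_as.device)
        return torch.randn(size, dtype=type_as.dtype, generator=gen, device=type_as.device)

With such a helper, a call like state.randn(d, n_projections, type_as=X_s) samples on whatever device X_s lives on, keeping the fast path measured above.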

What do you think @rflamary ?

@rflamary (Collaborator)

rflamary commented May 5, 2022

I think that we need fast generators in a lot of potential applications, so I'm sorry @ncassereau-idris, but I prefer the second ;). But in this case it means that you need one generator per device, because if you have two GPUs then it will still be a problem, no?

@ncassereau-idris (Contributor)

> I think that we need fast generators in a lot of potential applications, so I'm sorry @ncassereau-idris, but I prefer the second ;). But in this case it means that you need one generator per device, because if you have two GPUs then it will still be a problem, no?

No, actually; I just tested it with 2 V100s.
With two devices we can use the generator of one GPU and set the device argument to the second GPU, and it works just fine. This issue really appears with the CPU; maybe a TPU would be an issue as well, I don't know.
It might come from the fact that the type changes with the device (torch.Tensor vs torch.cuda.Tensor). Maybe the implementation is different as well.
I tried setting a CPU generator with the state of a GPU generator. It turns out that the shape of the internal state changes as well, so there are profound differences which apparently do not cope well with mixing device types.
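
For completeness, a sketch of the two-GPU check described above (assuming at least two CUDA devices are visible):

import torch

# A generator living on cuda:0 can sample directly onto cuda:1 ...
rng_gpu0 = torch.Generator(device="cuda:0")
rng_gpu0.seed()
x = torch.randn(5, generator=rng_gpu0, device="cuda:1")  # works
print(x.device)  # cuda:1

# ... whereas a CPU generator cannot sample onto a CUDA device (the original error).
rng_cpu = torch.Generator(device="cpu")
rng_cpu.seed()
torch.randn(5, generator=rng_cpu, device="cuda:0")  # RuntimeError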

As for fixing the bug, I will open a PR tomorrow, or this afternoon if I have time.

@rflamary (Collaborator)

rflamary commented May 5, 2022

Great, thanks for checking; it was not obvious on my side.
