
Fix GPU usage with default.qubit.torch for qml.qnn.TorchLayer #1705

Merged
26 commits merged into master from fix_torch_cuda on Oct 8, 2021

Conversation

@antalszava (Contributor) commented Sep 29, 2021

Context

In PennyLane v0.18.0, the default.qubit.torch device was added. This device is used under the hood by QNodes for backpropagation with Torch tensors. The QNode, however, has no logic for deciding what the torch_device keyword argument should be when creating a backprop device, and instantiates default.qubit.torch with its default arguments (torch_device='cpu'). That is, there is no opportunity to specify using the GPU.

Therefore, there may be issues when passing CUDA Torch tensors as QNode arguments (see #1688). What we would like in such cases is the equivalent of dev = qml.device('default.qubit.torch', wires=3, torch_device='cuda').
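
For concreteness, a minimal sketch of the failing pattern (the wire count, gate, and parameter values are illustrative only, and running it requires a CUDA-enabled Torch build):

    import torch
    import pennylane as qml

    # A QNode on default.qubit; with the Torch interface and backprop,
    # default.qubit.torch is created under the hood.
    dev = qml.device("default.qubit", wires=3)

    @qml.qnode(dev, interface="torch", diff_method="backprop")
    def circuit(x):
        qml.RX(x, wires=0)
        return qml.expval(qml.PauliZ(0))

    # Passing a CUDA tensor as the QNode argument triggers the issue,
    # since the internally created device defaults to torch_device='cpu'.
    x = torch.tensor(0.1, requires_grad=True, device="cuda")
    circuit(x)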

In addition to the above, the default.qubit.torch device uses the internal methods of DefaultQubit in many cases. These methods call the _asarray method. One detail of _asarray is that it doesn't take into account the Torch device being used, hence computations might shift from the GPU to the CPU.

Changes

  1. Changes the QNode to specify torch_device='cuda' internally when creating a dev = qml.device('default.qubit.torch', wires=3) device for backpropagation.
  2. Changes the _asarray method to always put the output on the Torch device specified internally (a rough sketch follows this list).
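
A rough sketch of what the second change amounts to, based on the discussion later in this thread (the real method also handles lists of tensors and other edge cases):

    import torch

    def _asarray(self, a, dtype=None):
        # Convert the input to a Torch tensor with the requested dtype...
        res = torch.as_tensor(a, dtype=dtype)
        # ...and place the result on the device's configured Torch device
        # (self._torch_device), so intermediate tensors created via the
        # DefaultQubit machinery don't silently fall back to the CPU.
        return torch.as_tensor(res, device=self._torch_device)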

Related issues

Closes #1688.

@github-actions (Contributor)

Hello. You may have forgotten to update the changelog!
Please edit doc/releases/changelog-dev.md with:

  • A one-to-two sentence description of the change. You may include a small working example for new features.
  • A link back to this PR.
  • Your name (or GitHub username) in the contributors section.

@antalszava changed the title from "Fix GPU usage with default.qubit.torch" to "Fix GPU usage with default.qubit.torch for qml.qnn.TorchLayer" on Sep 29, 2021
@antalszava (Contributor Author)

[ch9374]

@codecov (bot) commented Sep 29, 2021

Codecov Report

Merging #1705 (da1a7e0) into master (e0dc0dc) will decrease coverage by 0.00%.
The diff coverage is 90.90%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1705      +/-   ##
==========================================
- Coverage   99.22%   99.21%   -0.01%     
==========================================
  Files         204      204              
  Lines       15424    15429       +5     
==========================================
+ Hits        15304    15308       +4     
- Misses        120      121       +1     
Impacted Files Coverage Δ
pennylane/devices/default_qubit_torch.py 92.30% <90.90%> (-0.63%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e0dc0dc...da1a7e0. Read the comment docs.

Comment on lines 188 to 189
phi = torch.tensor([0.011, 0.012], requires_grad=True, device='cuda')
theta = torch.tensor(0.05, requires_grad=True, device='cuda')
Member

@antalszava I don't think we should use device='cuda' explicitly in the documentation, since this will fail for users that don't have CUDA installed or GPU access. For example, if I run this snippet locally, I get:

AssertionError: Torch not compiled with CUDA enabled

Contributor Author

Good point! Made the change, since the text states

with the classical nodes processed on a GPU

and yet the output was not put on CUDA devices. Maybe having

    dev = "cuda" if torch.cuda.is_available() else "cpu"
    phi = torch.tensor([0.011, 0.012], requires_grad=True, device=dev)
    theta = torch.tensor(0.05, requires_grad=True, device=dev)

could help keep the example executable even in such cases.

Member

🤔 This is a case where I would probably suggest removing all references to the GPU in this page altogether. Users can refer to PyTorch docs for using GPUs, but we shouldn't have code examples in our docs that not all users can run

Comment on lines 181 to 186
if torch_device is None:
    self._torch_device = "cpu"
    self._user_def_torch_device = False
else:
    self._torch_device = torch_device
    self._user_def_torch_device = True
Member

A nice feature about the Torch device argument is that it also treats None as the default device :)

>>> x = torch.tensor(0.2, device=None)
>>> x.device
device(type='cpu')

So you can probably remove this branching here, and simply store

self._torch_device = torch_device

Contributor Author

Oh didn't know that, thank you! 🙂 One piece of information that's also being stored is self._user_def_torch_device. We'd like to know if the user has explicitly set a Torch device because the QNode creates a default.qubit.torch device under the hood. So maybe it's good to use the branching logic for that. 🤔
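
For illustration, a hedged sketch of how that flag could be consumed on the QNode side (not the exact code in this PR; any_op_uses_cuda refers to the check quoted in the next snippet):

    # Only override the Torch device of the internally created backprop
    # device when the user did not explicitly choose one themselves.
    if any_op_uses_cuda and not self.device._user_def_torch_device:
        self.device._torch_device = "cuda"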

Comment on lines 667 to 673
# Check if we should be using CUDA
ops_and_obs = self.qtape.operations + self.qtape.observables
any_op_uses_cuda = any(
    data.is_cuda for op in ops_and_obs for data in op.data if hasattr(data, "is_cuda")
)

if any_op_uses_cuda and self.device._torch_device == "cpu":
Member

@antalszava I must admit, this feels like a check that shouldn't be in the QNode, but instead be part of the default.qubit.torch device.

Instead of passing torch_device=... to the device, could we not move this logic into the device itself? E.g.,

def apply(circuit,...):
    ops_and_obs = circuit.operations + circuit.observables
    any_op_uses_cuda = any(
        data.is_cuda for op in ops_and_obs for data in op.data if hasattr(data, "is_cuda")
    )
    ...

This way:

  • The QNode avoids having interface and device specific logic
  • The Device simply chooses the torch device to use by checking what torch devices the received circuit uses, which feels much cleaner :)

Member

Essentially: the default.qubit.torch device should check, based on the input circuit, whether it is a GPU circuit. If it is, all tensors the device creates should also be on the same device.
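
As a rough sketch of that idea (the if-body and the super().execute call here are assumptions; only the loop over op.data mirrors the excerpts in this thread):

    def execute(self, circuit, **kwargs):
        # Inspect the tensors carried by the circuit's operations and
        # observables; if any of them live on a CUDA device, make the
        # simulator place its own tensors there too.
        ops_and_obs = circuit.operations + circuit.observables
        if any(
            data.is_cuda for op in ops_and_obs for data in op.data if hasattr(data, "is_cuda")
        ):
            self._torch_device = "cuda"
        return super().execute(circuit, **kwargs)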

@antalszava (Contributor Author) commented Sep 30, 2021

@josh146 that's a fair idea; frankly, I had this logic in the execute method at first. The reason I ended up putting it in the QNode is that the need for a change came from the QNode itself. At the moment, the QNode "silently" creates a default.qubit.torch device with the default torch_device='cpu' option, while the circuit might be using the GPU. This only happens because the QNode swaps out default.qubit for a default.qubit.torch device. When the user explicitly instantiates a default.qubit.torch device, the responsibility is on them to choose the torch_device argument matching their circuit (GPU or CPU).

The QNode avoids having interface and device specific logic

We'll likely have to include some logic in the QNode sooner or later, because right now the QNode instantiates all backprop devices with default arguments. It feels like we're offloading the QNode's duty of picking the correct device options onto the devices, just to preserve the QNode's workflow of:

a) creating a device with default options, but then
b) having the device itself update those device options under the hood.

The Device simply chooses the torch device to use by checking what torch devices the received circuit uses, which feels much cleaner :)

While this sounds good to have, frankly I think we'd be compensating for the user's incorrect use of the device if we put logic like this in the device.

All in all, I'd argue for keeping the logic in the QNode, as that feels like the place where it belongs, simply because it's a QNode use case. Having said that, there are indeed merits to moving this to the device, so I'm keen to hear more on this (it's likely also a design decision that will apply to other cases in the future). 🙂

@mlxd (Member) left a comment

Nice one @antalszava
I can confirm the issue no longer appears for CUDA or CPU devices:

python ./test_cpu.py 
/home/mlxd/DELME/pyenv/lib/python3.9/site-packages/torch/autograd/__init__.py:147: UserWarning: Casting complex values to real discards the imaginary part (Triggered internally at  ../aten/src/ATen/native/Copy.cpp:240.)
  Variable._execution_engine.run_backward(
Average loss over epoch 1: 0.3317
Average loss over epoch 2: 0.2200
python ./test_gpu.py
[W Copy.cpp:240] Warning: Casting complex values to real discards the imaginary part (function operator())
Average loss over epoch 1: 0.3755
Average loss over epoch 2: 0.2203

Though I did notice the GPU version taking quite a long time compared to the CPU.

@albi3ro requested a review from @josh146 on October 7, 2021
@josh146 (Member) left a comment

Looks good on my end! Thanks everyone for fixing this so fast.

Only requirements from my side are:

  • Minor suggestions to the documentation
  • Fixing the tests, linting, and black 🙂

Comment on lines +206 to +211
>>> phi_final
tensor([0.7345, 0.0120], requires_grad=True)
>>> theta_final
tensor(0.8316, requires_grad=True)
>>> circuit4(phi_final, theta_final)
tensor(0.5000, dtype=torch.float64, grad_fn=<SqueezeBackward0>)
Member

👍

Comment on lines +248 to +249
>>> timeit.timeit("circuit_cuda(params)", globals=globals(), number=5)
2.297812332981266
Member

😍

@@ -176,17 +176,27 @@ def circuit(x):
_norm = staticmethod(torch.norm)
_flatten = staticmethod(torch.flatten)

-    def __init__(self, wires, *, shots=None, analytic=None, torch_device="cpu"):
-        self._torch_device = torch_device
+    def __init__(self, wires, *, shots=None, analytic=None, torch_device=None):
Member

Can the torch_device argument be removed now?

Contributor

oh yeah... probably.

def _asarray(a, dtype=None):
def execute(self, circuit, **kwargs):
    ops_and_obs = circuit.operations + circuit.observables
    if any(data.is_cuda for op in ops_and_obs for data in op.data if hasattr(data, "is_cuda")):
Member

Nice solution!

Member

I guess only one downside is that for a large circuit on the CPU, this for loop might take a while? Maybe something we could consider for the Operator overhaul (e.g., op.is_cuda)
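
Purely a hypothetical sketch of that suggestion (op.is_cuda does not exist at this point; it assumes parameters are stored in self.data, as in the snippets above):

    class Operator:
        # Sketch of a possible addition to the existing Operator class.

        @property
        def is_cuda(self):
            # True if any of this operator's parameters is a CUDA tensor.
            return any(getattr(d, "is_cuda", False) for d in self.data)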

Contributor

Maybe we can make this check on the tape. The torch interface tape mixin could override _update_par_info and the tape itself could have a _torch_device attribute.

return res

_cast = _asarray
Member

How come this was changed?

Contributor Author

I haven't found any reason to distinguish _cast and _asarray and followed the example we have for default.qubit.jax, where they are identical. The reason why the bug arose in the first place was that _asarray was not doing what _cast was doing already: calling torch.as_tensor by also passing device=self._torch_device. Now this conversion has been added to the end of _asarray, so it seemed like we could just make _cast be the same as _asarray.

Contributor Author

DefaultQubit is basically calling _asarray under the hood, hence we needed to add torch.as_tensor + device=self._torch_device there (instead of simply using _cast).

Comment on lines 18 to 20
torch = pytest.importorskip("torch")
if not torch.cuda.is_available():
    pytest.skip("cuda not available")
Member

👍

Member

Oh wait, this line is throwing an error:

Using pytest.skip outside of a test is not allowed.
To decorate a test function, use the @pytest.mark.skip or @pytest.mark.skipif decorators instead,
and to skip a module use `pytestmark = pytest.mark.{skip,skipif}.

@antalszava mentioned this pull request on Oct 7, 2021
@josh146 merged commit 05b1987 into master on Oct 8, 2021
@josh146 deleted the fix_torch_cuda branch on October 8, 2021 06:14
Development

Successfully merging this pull request may close these issues.

[BUG] qml.qnn.TorchLayer + GPU stopped working in v0.18.0
4 participants