
Fix backward pass on GPU in PyTorch interface #1426

Merged
glassnotes merged 4 commits into master from fix_pytorch_cuda_jacobian on Jun 28, 2021

Conversation

glassnotes (Contributor)

Context: A user has reported an error through the quantum transfer learning demo; when running on the GPU, the line

 vjp = dy.view(1, -1) @ ctx.jacobian.apply(ctx, *ctx.saved_tensors)

in torch.py yields:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking arugment for argument mat2 in method wrapper_mm)

The reason for the error was discovered to be that the result of ctx.jacobian.apply() was on the CPU, even when the tensors it acted on were on the GPU.
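For illustration, a minimal standalone sketch of the underlying failure (hypothetical shapes, assuming a CUDA device is available) - multiplying a CUDA tensor by a CPU tensor raises the same RuntimeError:

import torch

dy = torch.ones(4, device="cuda")  # like dy in backward(): stays on the GPU
jac = torch.ones(4, 2)             # like the result of ctx.jacobian.apply(): returned on the CPU

vjp = dy.view(1, -1) @ jac         # RuntimeError: Expected all tensors to be on the same device ...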

Description of the Change: Updated the backward method in the torch interface to split the above operation into two parts and check the device of each result. If at least one of them is on the GPU, the other is moved to the GPU if it is not already there.
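Condensed sketch of that logic (variable names follow the existing backward method; this mirrors the hunk quoted further down in the review, so it is illustrative rather than additional behaviour):

dyv = dy.view(1, -1)
jac_res = ctx.jacobian.apply(ctx, *ctx.saved_tensors)

# If either factor ended up on the GPU, move the other one there before taking the product.
if dyv.is_cuda or jac_res.is_cuda:
    if not dyv.is_cuda:
        dyv = torch.as_tensor(dyv, device=jac_res.get_device())
    if not jac_res.is_cuda:
        jac_res = torch.as_tensor(jac_res, device=dyv.get_device())

vjp = dyv @ jac_res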

Note: I needed to use torch 1.9 to test this, due to the issue described here.

Benefits: Running on the GPU with the torch interface should now work.
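As a rough usage sketch (hypothetical circuit and parameter names, assuming a CUDA device is available; not taken from the demo), a torch-interface QNode evaluated and differentiated with GPU parameters:

import torch
import pennylane as qml

dev = qml.device("default.qubit", wires=1)

@qml.qnode(dev, interface="torch")
def circuit(params):
    qml.RX(params[0], wires=0)
    qml.RY(params[1], wires=0)
    return qml.expval(qml.PauliZ(0))

params = torch.tensor([0.1, 0.2], requires_grad=True, device="cuda")
loss = circuit(params)
loss.backward()  # previously raised the cross-device RuntimeError in the vjp computation
print(params.grad)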

Possible Drawbacks: While this fix enables the demo to be run on the GPU, we cannot fully test it.

Related GitHub Issues: I believe this is the same issue reported in #1290, so merging this should enable us to close that.

@glassnotes glassnotes requested a review from albi3ro June 22, 2021 16:38
@github-actions (Contributor)

Hello. You may have forgotten to update the changelog!
Please edit .github/CHANGELOG.md with:

  • A one-to-two sentence description of the change. You may include a small working example for new features.
  • A link back to this PR.
  • Your name (or GitHub username) in the contributors section.

if not jac_res.is_cuda:
    jac_res = jac_res.cuda()

vjp = dyv @ jac_res
Member
🙌 thanks for fixing this @glassnotes!

Quick question, will this need to be fixed up on line 141 as well?

glassnotes (Contributor, Author)

Oh, that's a really good question. When I ran the example, this was the section that actually threw the error. I've tested both the transfer learning demo and the small example in the docs (#1225), and both work with the fix. Do you know of a code example that would lead to the earlier part being called?

Member

The other backward method is called for Hessian computations. That is, if you have

hess = torch.autograd.functional.hessian(cost_fn, params)
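For context, a minimal hedged example of such a cost function (assuming a torch-interface QNode with GPU parameters; the circuit below is made up for illustration):

import torch
import pennylane as qml

dev = qml.device("default.qubit", wires=1)

@qml.qnode(dev, interface="torch")
def cost_fn(params):
    qml.RX(params[0], wires=0)
    qml.RY(params[1], wires=0)
    return qml.expval(qml.PauliZ(0))

params = torch.tensor([0.1, 0.2], requires_grad=True, device="cuda")

# The Hessian computation exercises the other backward method mentioned above.
hess = torch.autograd.functional.hessian(cost_fn, params)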

@albi3ro albi3ro (Contributor) left a comment

Before I approve, I'd like to know if we get the same errors on calculating the Hessian, as Josh mentioned.

I know this is a fairly small change, but I assume we still need a changelog for it?

@codecov

codecov bot commented Jun 23, 2021

Codecov Report

Merging #1426 (8253336) into master (559b1bf) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #1426   +/-   ##
=======================================
  Coverage   98.23%   98.23%           
=======================================
  Files         160      160           
  Lines       11966    11966           
=======================================
  Hits        11755    11755           
  Misses        211      211           
Impacted Files Coverage Δ
pennylane/interfaces/torch.py 100.00% <ø> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@josh146 josh146 added the bug 🐛 Something isn't working label Jun 23, 2021
@glassnotes (Contributor, Author)

> Before I approve, I'd like to know if we get the same errors on calculating the Hessian, as Josh mentioned.
>
> I know this is a fairly small change, but I assume we still need a changelog for it?

Changelog updated!

Computing the Hessian went smoothly - I didn't have to update anything. It looks like there is already some logic in that section to keep things on the GPU. (I'm genuinely confused why this doesn't happen for the backward pass, though; I wonder if the actual issue is within the .jacobian.apply function.)

@josh146 josh146 (Member) left a comment

Thanks @glassnotes! Looks good on my end. Approving with the caveat that I am unable to test it locally, as I do not have a GPU, so I recommend getting approval from someone who does.

Comment on lines +191 to +199
# When using CUDA, dyv seems to remain on the GPU, while the result
# of jac_res is returned on CPU, even though the saved_tensors arguments are
# themselves on the GPU. Check whether this has happened, and move things
# back to the GPU if required.
if dyv.is_cuda or jac_res.is_cuda:
    if not dyv.is_cuda:
        dyv = torch.as_tensor(dyv, device=jac_res.get_device())
    if not jac_res.is_cuda:
        jac_res = torch.as_tensor(jac_res, device=dyv.get_device())
@josh146 josh146 (Member) Jun 24, 2021

It occurs to me that we could factor this out as a utility function, so that this logic can be re-used where needed in the future:

def match_gpu_device(tensors):
    """Ensure that tensors are placed on the same device.

    This function checks whether any tensor within the input list of
    tensors is on a GPU. If this is the case, all tensors are moved
    to the location of the *first CUDA tensor in the list*.

    If no tensors are on a GPU, the tensor locations remain unchanged.

    Args:
        tensors (sequence[torch.Tensor]): list of input tensors

    Returns:
        list[torch.Tensor]: the input tensors, all placed on a common device
    """
    device = None

    # Find the device index of the first CUDA tensor, if there is one.
    for t in tensors:
        if t.is_cuda:
            device = t.get_device()
            break

    if device is None:
        return list(tensors)

    # Move any tensor that is not already on that device; the result of
    # torch.as_tensor must be kept, since tensors cannot be relocated in place.
    return [
        t if (t.is_cuda and t.get_device() == device) else torch.as_tensor(t, device=device)
        for t in tensors
    ]
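If the helper returned the relocated tensors (as in the corrected sketch above), the backward method could then use it as follows (hypothetical usage; match_gpu_device is the suggested helper, not an existing PennyLane function):

dyv, jac_res = match_gpu_device([dyv, jac_res])
vjp = dyv @ jac_res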

@mlxd (Member)

mlxd commented Jun 24, 2021

Just a quick comment: I have successfully run the tutorial on my device (GTX 1060) running Arch Linux, CUDA 11.3, driver v465.31, and PyTorch 1.9.0+CUDA11.1 using this branch. PyTorch 1.8.1 fails with cuBLAS errors.

Output as below, along with the prediction windows:

% python ./tutorial_quantum_transfer_learning.py
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /home/mlxd/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100.0%
Training started:
/tmp/pl_gpu/pyenv/lib/python3.9/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Phase: train Epoch: 1/1 Loss: 0.6993 Acc: 0.5246        
Phase: validation   Epoch: 1/1 Loss: 0.6432 Acc: 0.6536        
Training completed in 0m 38s
Best test loss: 0.6432 | Best test accuracy: 0.6536
python ./tutorial_quantum_transfer_learning.py  45.20s user 3.36s system 63% cpu 1:16.84 total

I can also confirm the GPU was used during this demo, as the memory for the above process was allocated and freed. Happy to run more tests if needed.
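For reference, one generic way to check GPU usage programmatically from within the script (standard PyTorch calls, not part of the tutorial):

import torch

print(torch.cuda.is_available())                 # True if a CUDA device is visible
print(torch.cuda.memory_allocated() / 1e6)       # MB currently allocated on the GPU
print(torch.cuda.max_memory_allocated() / 1e6)   # peak MB allocated so far in this process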

@albi3ro albi3ro (Contributor) left a comment

Thanks @mlxd for checking that it executes :)

Thanks for figuring this out :)

@glassnotes glassnotes merged commit 614db57 into master Jun 28, 2021
@glassnotes glassnotes deleted the fix_pytorch_cuda_jacobian branch June 28, 2021 14:29
Successfully merging this pull request may close these issues.

Torchlayer error when running on GPU - Tensor is on CPU, but expected it to be on GPU