
Fix backward pass on GPU in PyTorch interface #1426

Merged
glassnotes merged 4 commits into master from fix_pytorch_cuda_jacobian on Jun 28, 2021

Conversation

glassnotes (Contributor)

Context: A user has reported an error through the quantum transfer learning demo; when running on the GPU, the line

 vjp = dy.view(1, -1) @ ctx.jacobian.apply(ctx, *ctx.saved_tensors)

in torch.py yields:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking arugment for argument mat2 in method wrapper_mm)

The reason for the error was discovered to be that the result of ctx.jacobian.apply() was on the CPU, even when the tensors it acted on were on the GPU.
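For illustration, a minimal standalone sketch of the underlying failure (hypothetical shapes, assuming a CUDA device is available) - multiplying a CUDA tensor by a CPU tensor raises the same RuntimeError:

import torch

dy = torch.ones(4, device="cuda")  # like dy in backward(): stays on the GPU
jac = torch.ones(4, 2)             # like the result of ctx.jacobian.apply(): returned on the CPU

vjp = dy.view(1, -1) @ jac         # RuntimeError: Expected all tensors to be on the same device ...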

Description of the Change: Updated the backward method in the torch interface to split the above operation into two parts and check the device of each result. If at least one of them is on the GPU, the other is moved to the GPU if it is not already there.
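Condensed sketch of that logic (variable names follow the existing backward method; this mirrors the hunk quoted further down in the review, so it is illustrative rather than additional behaviour):

dyv = dy.view(1, -1)
jac_res = ctx.jacobian.apply(ctx, *ctx.saved_tensors)

# If either factor ended up on the GPU, move the other one there before taking the product.
if dyv.is_cuda or jac_res.is_cuda:
    if not dyv.is_cuda:
        dyv = torch.as_tensor(dyv, device=jac_res.get_device())
    if not jac_res.is_cuda:
        jac_res = torch.as_tensor(jac_res, device=dyv.get_device())

vjp = dyv @ jac_res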

Note: I needed to use torch 1.9 to test this, due to the issue described here.

Benefits: Running on the GPU with the torch interface should now work.
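As a rough usage sketch (hypothetical circuit and parameter names, assuming a CUDA device is available; not taken from the demo), a torch-interface QNode evaluated and differentiated with GPU parameters:

import torch
import pennylane as qml

dev = qml.device("default.qubit", wires=1)

@qml.qnode(dev, interface="torch")
def circuit(params):
    qml.RX(params[0], wires=0)
    qml.RY(params[1], wires=0)
    return qml.expval(qml.PauliZ(0))

params = torch.tensor([0.1, 0.2], requires_grad=True, device="cuda")
loss = circuit(params)
loss.backward()  # previously raised the cross-device RuntimeError in the vjp computation
print(params.grad)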

Possible Drawbacks: While this fix enables the demo to be run on the GPU, we cannot fully test it.

Related GitHub Issues: I believe this is the same issue reported in #1290, so merging this should enable us to close that.

@glassnotes glassnotes requested a review from albi3ro June 22, 2021 16:38
@github-actions (Contributor)

Hello. You may have forgotten to update the changelog!
Please edit .github/CHANGELOG.md with:

  • A one-to-two sentence description of the change. You may include a small working example for new features.
  • A link back to this PR.
  • Your name (or GitHub username) in the contributors section.

if not jac_res.is_cuda:
    jac_res = jac_res.cuda()

vjp = dyv @ jac_res
Member
🙌 thanks for fixing this @glassnotes!

Quick question, will this need to be fixed up on line 141 as well?

glassnotes (Contributor, Author)

Oh, that's a really good question. When I ran the example, this was the section that actually threw the error. I've tested both the transfer learning demo and the small example in the docs (#1225), and both work with the fix. Do you know of a code example that would lead to the earlier part being called?

Member

The other backward method is called for Hessian computations. That is, if you have

hess = torch.autograd.functional.hessian(cost_fn, params)
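For context, a minimal hedged example of such a cost function (assuming a torch-interface QNode with GPU parameters; the circuit below is made up for illustration):

import torch
import pennylane as qml

dev = qml.device("default.qubit", wires=1)

@qml.qnode(dev, interface="torch")
def cost_fn(params):
    qml.RX(params[0], wires=0)
    qml.RY(params[1], wires=0)
    return qml.expval(qml.PauliZ(0))

params = torch.tensor([0.1, 0.2], requires_grad=True, device="cuda")

# The Hessian computation exercises the other backward method mentioned above.
hess = torch.autograd.functional.hessian(cost_fn, params)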

@albi3ro albi3ro (Contributor) left a comment

Before I approve, I'd like to know if we get the same errors on calculating the Hessian, as Josh mentioned.

I know this is a fairly small change, but I assume we still need a changelog for it?

@codecov

codecov bot commented Jun 23, 2021

Codecov Report

Merging #1426 (8253336) into master (559b1bf) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #1426   +/-   ##
=======================================
  Coverage   98.23%   98.23%           
=======================================
  Files         160      160           
  Lines       11966    11966           
=======================================
  Hits        11755    11755           
  Misses        211      211           
Impacted Files Coverage Δ
pennylane/interfaces/torch.py 100.00% <ø> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@josh146 josh146 added the bug 🐛 Something isn't working label Jun 23, 2021
@glassnotes (Contributor, Author)

> Before I approve, I'd like to know if we get the same errors on calculating the Hessian, as Josh mentioned.
>
> I know this is a fairly small change, but I assume we still need a changelog for it?

Changelog updated!

Computing the Hessian went smoothly - I didn't have to update anything. It looks like there is already some logic in that section to keep things on the GPU. (I'm genuinely confused why this doesn't happen for the backward pass, though; I wonder if the actual issue is within the .jacobian.apply function.)

@josh146 josh146 (Member) left a comment

Thanks @glassnotes! Looks good on my end. Approving with the caveat that I am unable to test it locally, as I do not have a GPU, so I recommend getting approval from someone who does.

Comment on lines +191 to +199
# When using CUDA, dyv seems to remain on the GPU, while the result
# of jac_res is returned on CPU, even though the saved_tensors arguments are
# themselves on the GPU. Check whether this has happened, and move things
# back to the GPU if required.
if dyv.is_cuda or jac_res.is_cuda:
    if not dyv.is_cuda:
        dyv = torch.as_tensor(dyv, device=jac_res.get_device())
    if not jac_res.is_cuda:
        jac_res = torch.as_tensor(jac_res, device=dyv.get_device())
@josh146 josh146 (Member) Jun 24, 2021

It occurs to me that we could factor this out as a utility function, so that this logic can be re-used where needed in the future:

def match_gpu_device(tensors):
    """Ensure that tensors are placed on the same device.

    This function checks whether any tensor within the input list of
    tensors is on a GPU. If this is the case, all tensors are moved
    to the location of the *first CUDA tensor in the list*.

    If no tensors are on a GPU, the tensor locations remain unchanged.

    Args:
        tensors (sequence[torch.Tensor]): list of input tensors

    Returns:
        list[torch.Tensor]: the input tensors, all placed on a common device
    """
    device = None

    # Find the device index of the first CUDA tensor, if there is one.
    for t in tensors:
        if t.is_cuda:
            device = t.get_device()
            break

    if device is None:
        return list(tensors)

    # Move any tensor that is not already on that device; the result of
    # torch.as_tensor must be kept, since tensors cannot be relocated in place.
    return [
        t if (t.is_cuda and t.get_device() == device) else torch.as_tensor(t, device=device)
        for t in tensors
    ]
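If the helper returned the relocated tensors (as in the corrected sketch above), the backward method could then use it as follows (hypothetical usage; match_gpu_device is the suggested helper, not an existing PennyLane function):

dyv, jac_res = match_gpu_device([dyv, jac_res])
vjp = dyv @ jac_res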

@mlxd (Member)

mlxd commented Jun 24, 2021

Just a quick comment: I have successfully run the tutorial on my device (GTX 1060) running Arch Linux, CUDA 11.3, driver v465.31, and PyTorch 1.9.0+CUDA11.1 using this branch. PyTorch 1.8.1 fails with cuBLAS errors.

Output as below, along with the prediction windows:

% python ./tutorial_quantum_transfer_learning.py
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /home/mlxd/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100.0%
Training started:
/tmp/pl_gpu/pyenv/lib/python3.9/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Phase: train Epoch: 1/1 Loss: 0.6993 Acc: 0.5246        
Phase: validation   Epoch: 1/1 Loss: 0.6432 Acc: 0.6536        
Training completed in 0m 38s
Best test loss: 0.6432 | Best test accuracy: 0.6536
python ./tutorial_quantum_transfer_learning.py  45.20s user 3.36s system 63% cpu 1:16.84 total

I can also confirm the GPU was used during this demo, as the memory for the above process was allocated and freed. Happy to run more tests if needed.
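For reference, one generic way to check GPU usage programmatically from within the script (standard PyTorch calls, not part of the tutorial):

import torch

print(torch.cuda.is_available())                 # True if a CUDA device is visible
print(torch.cuda.memory_allocated() / 1e6)       # MB currently allocated on the GPU
print(torch.cuda.max_memory_allocated() / 1e6)   # peak MB allocated so far in this process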

@albi3ro albi3ro (Contributor) left a comment

Thanks @mlxd for checking that it executes :)

Thanks for figuring this out :)

@glassnotes glassnotes merged commit 614db57 into master Jun 28, 2021
@glassnotes glassnotes deleted the fix_pytorch_cuda_jacobian branch June 28, 2021 14:29
Successfully merging this pull request may close these issues.

Torchlayer error when running on GPU - Tensor is on CPU, but expected it to be on GPU