
AnalogConv2d fails when using TT-v2 #642

Closed
Zhaoxian-Wu opened this issue Apr 8, 2024 · 5 comments
Assignees
Labels
bug Something isn't working

Comments

@Zhaoxian-Wu

Description

When I tried to use the TT-v2 algorithm to train a convolutional network, I got a CUDA error.

How to reproduce

After running the following main.py file, I got an error RuntimeError: CUDA_CALL Error 'an illegal memory access was encountered' at cuda_util.cu:653

# main.py
import torch
from aihwkit.nn import AnalogConv2d
from aihwkit.optim import AnalogSGD
from aihwkit.simulator.configs import build_config
from aihwkit.simulator.configs.devices import SoftBoundsReferenceDevice
DEVICE = 'cuda:0'

rpu_config = build_config('ttv2', device=SoftBoundsReferenceDevice())
model = AnalogConv2d(
    in_channels=1, out_channels=3, kernel_size=5, rpu_config=rpu_config
).to(DEVICE)

optimizer = AnalogSGD(model.parameters(), lr=0.1)
optimizer.regroup_param_groups(model)
# if I use images = torch.empty((128, 1, 32, 32)), even the forward fails
images = torch.ones((128, 1, 32, 32))
images = images.to(DEVICE)
output = model(images)
loss = output.norm()**2
loss.backward()
optimizer.step()

Besides, if I create the input with torch.empty() instead of torch.ones(), the forward call model(images) never returns; I suspect there is an endless loop somewhere.

Other information

  • PyTorch version: 2.1.2
  • Package version: aihwkit-gpu 0.9.0
  • OS: Linux
  • Python version: 3.10.13
  • Conda version (or N/A): 23.11.0
@Zhaoxian-Wu Zhaoxian-Wu added the bug Something isn't working label Apr 8, 2024
@kaoutar55
Collaborator

Hi @Zhaoxian-Wu, thanks for reporting this issue. What GPU were you using when you encountered it?

@kaoutar55 kaoutar55 self-assigned this May 8, 2024
@kkvtran
Collaborator

kkvtran commented May 8, 2024

Hi @Zhaoxian-Wu, for the CUDA memory problem, it looked like the issue had to do with how DEVICE is set. If I set it with DEVICE = torch.cuda.set_device(0) instead of DEVICE = torch.device('cuda:0'), I did not see the problem. I found the solution in this issue.
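
Below is a minimal sketch of the reproduction script with that workaround applied. One caveat: torch.cuda.set_device(0) returns None, so in this sketch the model and inputs are moved with .cuda() rather than .to(DEVICE); that detail is an assumption, since the comment only shows how DEVICE was assigned.

# main_workaround.py (hypothetical file name)
import torch
from aihwkit.nn import AnalogConv2d
from aihwkit.optim import AnalogSGD
from aihwkit.simulator.configs import build_config
from aihwkit.simulator.configs.devices import SoftBoundsReferenceDevice

# Select GPU 0 as the current CUDA device instead of passing 'cuda:0' around.
torch.cuda.set_device(0)

rpu_config = build_config('ttv2', device=SoftBoundsReferenceDevice())
model = AnalogConv2d(
    in_channels=1, out_channels=3, kernel_size=5, rpu_config=rpu_config
).cuda()  # move the analog tiles to the current CUDA device

optimizer = AnalogSGD(model.parameters(), lr=0.1)
optimizer.regroup_param_groups(model)

images = torch.ones((128, 1, 32, 32)).cuda()
output = model(images)
loss = output.norm() ** 2
loss.backward()
optimizer.step()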

@kkvtran
Copy link
Collaborator

kkvtran commented May 8, 2024

I tried the same technique with torch.empty and did not see the hanging (looping) issue either, so this looks like a PyTorch problem. Let us know if you have any questions.
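
For completeness, a sketch of the torch.empty variant under the same workaround (continuing from the sketch above, with the same assumptions):

torch.cuda.set_device(0)
# Uninitialized input that previously made the forward call hang:
images = torch.empty((128, 1, 32, 32)).cuda()
output = model(images)  # returns normally with the workaround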

@kaoutar55
Collaborator

@Zhaoxian-Wu, do you still have this issue? If not, we can close it. Please let us know.

@Zhaoxian-Wu
Author

I tried this solution and it worked! It seems to be an issue in PyTorch. Thank you for your help, @kkvtran! It is odd that the PyTorch community has left this issue unresolved for so long.
