CUDA error when I apply my own dataset. #4
Hi, edit: I am using an RTX 6000 with CUDA 11.8.
Hi! It seems the error happens in the CUDA part, but currently I don't have any idea about it. I tested the code on two machines with different GPUs (H800 and RTX 4080) but can't reproduce this error. I would appreciate it if you could provide more information. Thank you!
Hi!
Same error. I checked the render's inputs: scale, rotation, and opacity contain NaN values. Why?
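For anyone hitting the same NaN inputs: a quick way to catch this early is to assert finiteness on the Gaussian parameters right before they are handed to the rasterizer. This is a generic sketch; the tensor names below are illustrative stand-ins, not RaDe-GS's actual attribute names:

```python
import torch

def check_finite(name, tensor):
    # Raise as soon as a parameter tensor picks up NaN/Inf,
    # instead of letting the CUDA rasterizer crash later.
    bad = ~torch.isfinite(tensor)
    if bad.any():
        raise ValueError(f"{name} has {int(bad.sum())} non-finite values")

# Illustrative stand-ins for the scale / rotation / opacity tensors:
scales = torch.ones(100, 3)
rotations = torch.randn(100, 4)
opacities = torch.sigmoid(torch.randn(100, 1))

for name, t in [("scales", scales), ("rotations", rotations), ("opacities", opacities)]:
    check_finite(name, t)
```

Calling this at the top of `render()` turns a cryptic illegal-memory-access crash into a readable Python error naming the offending tensor.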
Thank you for the info. This issue seems related to the machines. Currently, an RTX 4080 with CUDA 12.1 works well. I'm looking for other computers to reproduce this error and fix it.
I found this issue still exists on CUDA 12.1 and RTX 3090. It occasionally happens during training.
Grad NaN after backward on custom data. Need help, thanks! A quick test: this grad error still exists after updating forward.cu in your PR.
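To locate which operation first produces the NaN gradient, PyTorch's built-in anomaly detection is usually the fastest tool: it makes `backward()` raise at the forward op whose gradient became NaN, at the cost of much slower training, so enable it only while debugging. A minimal sketch:

```python
import torch

# With anomaly detection enabled, backward() reports the forward
# operation that produced a NaN gradient instead of failing silently
# somewhere downstream.
with torch.autograd.detect_anomaly():
    x = torch.tensor([2.0], requires_grad=True)
    y = x ** 2
    y.backward()

print(x.grad)  # d(x**2)/dx at x=2 is 4
```

Wrapping the training loss's `backward()` call in the same context manager should point at the kernel or op producing the NaNs reported above.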
Same error.
I've encountered the same issue with CUDA 12.1. |
I have also encountered the same issue on an RTX 4090 with CUDA 11.8, PyTorch 2.1.2, Ubuntu 22.04. As mentioned by others earlier, this error occurs randomly during the training process.
One solution that might help is issues/41, but I haven't tried it...
Same error as well. Grad NaN on two different datasets.
Thank you for the important information. I have fixed the problem; please update the code.
Thanks, it seems to be fixed. However, the quality is similar to the image posted above. Any idea where this might come from?
Have you verified whether it is due to the distortion loss? An issue was reported in 2DGS, and they then changed the default value of the corresponding hyperparameter to 0.0.
I printed every 100th image. The images are very good, different from what I see in the viewer. Maybe there is a conversion issue while saving the ply file?
No, that might be the reason. What did you change? Maybe it's caused by mip?
Yes, I made some modifications for the 3D filters. You can use it in the same way as the original viewer. I think we've found the reason, and I'll update the README for the viewer. Looking forward to good news.
Obviously that was the issue. The rendering is actually quite nice and confirms the reported PSNR. Thanks for the help.
Please use the viewer.
I'm curious why the 3D filter has such a large influence on the rendering results. Could you please explain a bit more? Thanks.
I can't open the ply files with the original viewer, so I can't reproduce it. But I guess the ply files are being parsed incorrectly, because other code doesn't know my format (the meaning or order of the variables in the file).
I can confirm that the latest updates fixed the CUDA error for me. Anybody else in the same situation, don't forget to reinstall the module with:
Hi @Liu-SD @zhouilu, can I ask if you solved this issue or not? I also encountered this issue at the first iteration and couldn't work it out by updating the submodule. |
The resolution of my dataset is 5236x3909. I scale down the resolution by 4 and the actual render resolution is 1309x977.
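For reference, the reported render size is just the integer division of the source resolution by the downscale factor:

```python
# Source resolution downscaled by 4 gives the actual render resolution.
w, h = 5236, 3909
scale = 4
print(w // scale, h // scale)  # 1309 977
```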
Now I get the runtime error as follows:
cameras extent: 381.5180541992188 [19/06 15:31:45]
Loading Training Cameras: 10 . [19/06 15:56:00]
0it [00:00, ?it/s]
Loading Test Cameras: 0 . [19/06 15:56:00]
Number of points at initialisation : 23947 [19/06 15:56:00]
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/liu/nerf/RaDe-GS/train.py", line 312, in
training(dataset=lp.extract(args),
File "/home/liu/nerf/RaDe-GS/train.py", line 115, in training
render_pkg = render(viewpoint_cam, gaussians, pipe, background)
File "/home/liu/nerf/RaDe-GS/gaussian_renderer/__init__.py", line 87, in render
"visibility_filter" : radii > 0,
RuntimeError: CUDA error: an illegal memory access was encountered
What's the reason and how to solve it? Thanks a lot!
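One general debugging note: CUDA errors are reported asynchronously, so the line shown in this traceback (the `radii > 0` check) is usually not where the fault actually happened. Re-running with blocking kernel launches makes the traceback point at the kernel that really failed:

```python
import os

# Must be set before CUDA is initialized, i.e. before importing torch,
# or equivalently at the shell: CUDA_LAUNCH_BLOCKING=1 python train.py ...
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

With this set, the illegal-memory-access error surfaces at the launch of the faulting kernel instead of at a later, unrelated call, which narrows down whether the crash comes from the rasterizer or somewhere else.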