CUDA error: an illegal memory access was encountered #225

@Iron486

Hi!

I am running the following command to train from scratch:
python train_full_pipeline.py -s /home/farchid/research/project_1/south-building -r "dn_consistency" --high_poly True --export_obj True

where south-building is one of the COLMAP datasets. Training crashes partway through, and I get the errors shown in the output below:

Using high poly config.
Will export a UV-textured mesh as an .obj file.
Will export a ply file with the refined 3D Gaussians at the end of the training.
Optimizing output/vanilla_gs/south-building/
Output folder: output/vanilla_gs/south-building/ [03/11 18:34:56]
Tensorboard not available: not logging progress [03/11 18:34:56]
Reading camera 128/128 [03/11 18:34:57]
Loading Training Cameras [03/11 18:34:57]
[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1 [03/11 18:34:57]
Loading Test Cameras [03/11 18:35:15]
Number of points at initialisation :  61342 [03/11 18:35:15]
Training progress:  16%|███████████████████████████▋ | 1100/7000 [00:20<01:43, 57.04it/s, Loss=0.5989216]
Traceback (most recent call last):
  File "/home/farchid/research/project_1/SuGaR/./gaussian_splatting/train.py", line 220, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
  File "/home/farchid/research/project_1/SuGaR/./gaussian_splatting/train.py", line 87, in training
    render_pkg = render(viewpoint_cam, gaussians, pipe, bg)
  File "/home/farchid/research/project_1/SuGaR/gaussian_splatting/gaussian_renderer/__init__.py", line 99, in render
    "visibility_filter" : radii > 0,
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Training progress:  16%|███████████████████████████▋ | 1100/7000 [00:20<01:50, 53.19it/s, Loss=0.5989216]
Using original 3DGS rasterizer from Inria.
Using high poly config.
Will export a UV-textured mesh as an .obj file.
Will export a ply file with the refined 3D Gaussians at the end of the training.
Changing sh_levels to match the loaded model: 4
-----Parsed parameters-----
Source path: /home/farchid/research/project_1/south-building
   > Content: 6
Gaussian Splatting checkpoint path: output/vanilla_gs/south-building/
   > Content: 3
SUGAR checkpoint path: ./output/coarse/south-building/sugarcoarse_3Dgs7000_densityestim02_sdfnorm02/
Iteration to load: 7000
Output directory: ./output/coarse/south-building
Depth-Normal consistency factor: 0.05
SDF estimation factor: 0.2
SDF better normal factor: 0.2
Eval split: True
White background: False
---------------------------
Using device: 0
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Active memory         |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Requested memory      |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| GPU reserved memory   |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Non-releasable memory |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|


Loading config output/vanilla_gs/south-building/...
Performing train/eval split...
Found image extension .JPG
Traceback (most recent call last):
  File "/home/farchid/research/project_1/SuGaR/train.py", line 133, in <module>
    coarse_sugar_path = coarse_training_with_density_regularization_and_dn_consistency(coarse_args)
  File "/home/farchid/research/project_1/SuGaR/sugar_trainers/coarse_density_and_dn_consistency.py", line 377, in coarse_training_with_density_regularization_and_dn_consistency
    nerfmodel = GaussianSplattingWrapper(
  File "/home/farchid/research/project_1/SuGaR/sugar_scene/gs_model.py", line 162, in __init__
    self.gaussians.load_ply(
  File "/home/farchid/research/project_1/SuGaR/gaussian_splatting/scene/gaussian_model.py", line 216, in load_ply
    plydata = PlyData.read(path)
  File "/home/farchid/anaconda3/envs/sugar/lib/python3.9/site-packages/plyfile.py", line 401, in read
    (must_close, stream) = _open_stream(stream, 'read')
  File "/home/farchid/anaconda3/envs/sugar/lib/python3.9/site-packages/plyfile.py", line 481, in _open_stream
    return (True, open(stream, read_or_write[0] + 'b'))
FileNotFoundError: [Errno 2] No such file or directory: 'output/vanilla_gs/south-building/point_cloud/iteration_7000/point_cloud.ply'
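
A note on the second traceback: it looks like a consequence of the first one rather than a separate bug. The 3DGS stage stopped at iteration 1100, so the iteration 7000 point cloud was never written, and the SuGaR stage then failed while trying to load it. A minimal sketch of a check that could be run between the two stages is given below. The helper names are hypothetical and not part of SuGaR; the directory layout is copied from the `FileNotFoundError` message above.

```python
from pathlib import Path


def vanilla_checkpoint_path(model_dir: str, iteration: int) -> Path:
    """Build the path where the 3DGS stage is expected to save its point cloud.

    Hypothetical helper: the layout mirrors the path in the FileNotFoundError
    above, it is not read from the SuGaR code base.
    """
    return Path(model_dir) / "point_cloud" / f"iteration_{iteration}" / "point_cloud.ply"


def checkpoint_exists(model_dir: str, iteration: int = 7000) -> bool:
    """Return True only if the expected .ply file is present on disk."""
    return vanilla_checkpoint_path(model_dir, iteration).is_file()


# Paths taken from the log above.
if not checkpoint_exists("output/vanilla_gs/south-building/"):
    print("3DGS checkpoint missing: the first training stage did not finish.")
```

If the check fails, the illegal memory access in the first stage is the error to fix; the `FileNotFoundError` should disappear once that stage completes.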

How can I solve this issue? I have CUDA 11.8 installed, and I am using an RTX 4090 with 24 GB of VRAM. The error appears at iteration 1100 of 7000, when the training progress bar is at 16%.
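
One thing that may help narrow this down: CUDA kernels are launched asynchronously, so the Python line shown in the traceback (`radii > 0`) is not necessarily where the illegal access happened. Setting the `CUDA_LAUNCH_BLOCKING=1` environment variable makes launches synchronous, which usually gives a more accurate stack trace at the cost of slower training. The sketch below reruns the same command with that variable set. The function names are hypothetical, and it assumes the script is started from the SuGaR repository root.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun(cmd):
    """Run cmd with CUDA_LAUNCH_BLOCKING=1 and return its exit code."""
    return subprocess.run(cmd, env=blocking_env()).returncode


# Command copied from the report above; nothing is executed until rerun() is called.
TRAIN_CMD = [
    sys.executable, "train_full_pipeline.py",
    "-s", "/home/farchid/research/project_1/south-building",
    "-r", "dn_consistency",
    "--high_poly", "True",
    "--export_obj", "True",
]
```

Calling `rerun(TRAIN_CMD)` should reproduce the crash with a traceback that points closer to the failing kernel. This only helps locate the fault; it does not fix it.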

Thank you!
