Segmentation fault happened while calling DMTetGeometry #13

Open
mush881212 opened this issue Nov 13, 2022 · 7 comments

mush881212 commented Nov 13, 2022

Hi Team,

Thanks for your amazing work!
I tried to run the program, but I get a segmentation fault when calling DMTetGeometry(FLAGS.dmtet_grid, FLAGS.mesh_scale, FLAGS) in train.py.
I tracked the error down to the call to ou.OptiXContext() in dmtet.py.
I believe the crash happens inside _plugin.OptiXStateWrapper(os.path.dirname(__file__), torch.utils.cpp_extension.CUDA_HOME) in ops.py, but I don't know how to fix it.

I tried reducing the batch size from 8 to 1 and the training image resolution from 512x512 to 128x128, but the problem persists.
Can you give me some advice on how to solve this problem?
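
To isolate the failing call outside of train.py, I used a minimal script like the one below (a sketch: the import path assumes the repo's render/optixutils package and may need adjusting for your checkout):

```python
# Minimal repro sketch: trigger only the OptiX context creation that
# DMTetGeometry reaches internally, with nothing else from the trainer.
import torch
from render import optixutils as ou

if __name__ == "__main__":
    print("CUDA available:", torch.cuda.is_available())
    print("GPU:", torch.cuda.get_device_name(0))
    # If this line alone segfaults, the problem is in OptiX state
    # creation (OptiXStateWrapper), not in the training loop.
    ctx = ou.OptiXContext()
    print("OptiXContext created OK")
```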

GPU Hardware:
Nvidia A100 (32 GB) on a server
Console error:

Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu116/optixutils_plugin/build.ninja...
Building extension module optixutils_plugin...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module optixutils_plugin...
Config / Flags:
--
iter 5000
batch 8
spp 1
layers 1
train_res [512, 512]
display_res [512, 512]
texture_res [1024, 1024]
display_interval 0
save_interval 100
learning_rate [0.03, 0.005]
custom_mip False
background white
loss logl1
out_dir out/nerd_gold
config configs/nerd_gold.json
ref_mesh data/nerd/moldGoldCape_rescaled
base_mesh None
validate True
n_samples 12
bsdf pbr
denoiser bilateral
denoiser_demodulate True
mtl_override None
dmtet_grid 128
mesh_scale 2.5
envlight None
env_scale 1.0
probe_res 256
learn_lighting True
display [{'bsdf': 'kd'}, {'bsdf': 'ks'}, {'bsdf': 'normal'}]
transparency False
lock_light False
lock_pos False
sdf_regularizer 0.2
laplace relative
laplace_scale 3000.0
pre_load True
no_perturbed_nrm False
decorrelated False
kd_min [0.03, 0.03, 0.03]
kd_max [0.8, 0.8, 0.8]
ks_min [0, 0.08, 0.0]
ks_max [0, 1.0, 1.0]
nrm_min [-1.0, -1.0, 0.0]
nrm_max [1.0, 1.0, 1.0]
clip_max_norm 0.0
cam_near_far [0.1, 1000.0]
lambda_kd 0.1
lambda_ks 0.05
lambda_nrm 0.025
lambda_nrm2 0.25
lambda_chroma 0.025
lambda_diffuse 0.15
lambda_specular 0.0025
random_textures True
--
DatasetLLFF: 119 images with shape [512, 512]
DatasetLLFF: auto-centering at [-0.04492672 1.3252479 1.1068335 ]
DatasetLLFF: 119 images with shape [512, 512]
DatasetLLFF: auto-centering at [-0.04492672 1.3252479 1.1068335 ]
/opt/conda/lib/python3.8/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:2156.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Cuda path /usr/local/cuda
Segmentation fault (core dumped)

mush881212 changed the title from "Segmentation fault happened while setting DMTetGeometry" to "Segmentation fault happened while calling DMTetGeometry" on Nov 13, 2022
jmunkberg (Collaborator) commented

Thanks @mush881212 ,

I suspect OptiX, which is quite sensitive to the driver installed on the machine. Our code uses OptiX 7.3, which requires Nvidia display driver version 465 or higher. Perhaps verify that a standalone example from the OptiX 7.3 SDK runs fine on that machine.
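
As a quick sanity check, something like this verifies the driver version meets that requirement (a sketch; it shells out to nvidia-smi, which ships with the driver):

```python
# Sketch: check that the installed Nvidia driver is new enough for OptiX 7.3.
import subprocess

out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    text=True,
).strip()
major = int(out.splitlines()[0].split(".")[0])
print(f"Driver {out}: {'OK' if major >= 465 else 'too old'} for OptiX 7.3")
```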

One alternative may be to use our old code base, https://github.com/NVlabs/nvdiffrec, which is a similar reconstruction pipeline, but without the OptiX ray tracing step.

jmunkberg (Collaborator) commented

Note also that the A100 GPU does not have any RT Cores (for ray tracing acceleration), so the ray tracing performance will be lower than what we reported in the paper (we measured on an A6000 RTX GPU).

mush881212 (Author) commented

Hi @jmunkberg,

I think the problem is the driver version: mine is too old to support OptiX 7.3.
I will try it on another device and upgrade the driver.
Thanks for your help!

Sheldonmao commented

Hi, was this problem ever solved? I'm hitting the same segmentation fault with driver version 520.61.05 on a V100 and would like to know how to fix it.

mush881212 (Author) commented

Hi @Sheldonmao,

I solved this issue by updating the driver and switching to an RTX 3090.
Driver version: 465.19.01
CUDA version: 11.3
You could try using these settings.
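
If it helps, this is how I confirmed my environment matched those versions (a sketch using PyTorch's built-in version info):

```python
# Sketch: print the CUDA version PyTorch was built against and the active GPU.
import torch

print("PyTorch:", torch.__version__)               # installed PyTorch build
print("CUDA (torch build):", torch.version.cuda)   # e.g. 11.3
print("GPU:", torch.cuda.get_device_name(0))       # e.g. NVIDIA GeForce RTX 3090
```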

sungmin9939 commented

Could a relatively new GPU or driver be a problem? My GPU is an RTX 3090 with driver version 535.146.02, and I'm getting the same segmentation fault as the original author.

jmunkberg (Collaborator) commented

Newer GPUs and drivers shouldn't be an issue, I hope. It has been a while since we released this code, but I just tested on two setups without issues.

Setup 1: Windows Desktop
RTX 6000 Ada Gen w/ driver 545.84
PyTorch version: 2.0.0+cu117

Setup 2: Linux Server
V100 w/ Driver 515.86.01
Using the Dockerfile from the nvdiffrecmc repo https://github.com/NVlabs/nvdiffrecmc/blob/main/docker/Dockerfile
