Pending after computing MSE and PSNR #13
After the first pass, we run xatlas to create a UV parameterization on the triangle mesh. If the first pass failed to create a reasonable mesh, this step can take quite some time or even fail. How does the mesh look in your case at the end of the first pass? For memory consumption, you can log the usage.
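For instance, PyTorch's built-in CUDA memory counters can be queried from inside the training loop. A minimal sketch (the helper name is made up for illustration; it is not part of train.py):

```python
import torch

def log_gpu_memory(tag: str) -> None:
    # memory_allocated: bytes currently held by live tensors.
    # memory_reserved: bytes held by PyTorch's caching allocator,
    # which is roughly what nvidia-smi reports for the process.
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"[{tag}] allocated={alloc:.0f} MiB, "
          f"reserved={reserved:.0f} MiB, peak={peak:.0f} MiB")
```

Note that nvidia-smi shows the reserved pool plus CUDA context overhead, so it can sit well below the 11 GB limit while a single large allocation still fails.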
Thanks @jmunkberg! The reason I raised this issue is that a plain 'nvidia-smi' (no arguments) showed only 7.4 GB in use in total, and the GPU has 11 GB, so I suspected it was not a memory issue. Sorry, I did check the related output, including the mesh, but I forgot what it looked like, and that host has since been destroyed, so I can only continue tracking this after setting up a new environment. In addition, the GPU even became unresponsive (nvidia-smi was blocked, with no response at all) every time train.py failed. Does the team plan to optimize the memory allocation strategy for this feature? Regards
Hello @hzhshok , Looking at the error metrics (MSE 0.259, PSNR 6.48 dB): those are extremely large errors, so I assume the first pass did not train properly. What do the images look like in the output folder? I suspect something else is already wrong in the first pass. A few things can affect quality.
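As a side note on the numbers themselves: with image intensities normalized to $[0, 1]$, PSNR and MSE are related by

$$\mathrm{PSNR} = -10 \log_{10}(\mathrm{MSE}),$$

so an MSE of about 0.259 corresponds to roughly 5.9 dB, in the same ballpark as the reported 6.48 (the validation code may average per-image PSNR rather than converting the mean MSE, so the two need not match exactly). Values this low mean the rendering barely resembles the reference images.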
Also, just to verify: does the example from the readme work on your machine?
Hello, Thanks @jmunkberg! Is the lighting setup constant in all training data? Regards |
Hello, The question:
Hardware: 3080 (24 GB), Windows 11
Samples: (see the two attached pictures for the example)
The console log:
Regards
@hzhshok I ran into the same error about "texture2d_mipBackward returned an invalid gradient"; in my case it was "got [1,2,2,3] but expected [1,3,3,3]". @jmunkberg thanks for your great work, by the way. What would you suggest could cause this issue? Bad segmentation or something else?
That is an error from nvdiffrast. I would try using power-of-two resolutions for the textures and the training images, e.g., as in the fragment below.
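A hypothetical config fragment with power-of-two resolutions (the keys match the flags that train.py prints; the values are only an example):

```json
{
    "train_res": [1024, 1024],
    "display_res": [1024, 1024],
    "texture_res": [2048, 2048]
}
```

Power-of-two sizes let each mip level halve cleanly, which is presumably why the mismatched mip shapes (3 vs. 2) show up with other resolutions.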
@ZirongChan, I am not sure if it is caused by memory. I used images of nearly 2k x 2k, which cost more memory on my single 24 GB GPU, so I used the strategy of splitting images into small blocks, as @jmunkberg described in another issue; at least that error did not happen again. Regards
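For reference, a generic sketch of that kind of block splitting (plain NumPy; the function name and block size are made up for illustration, and this is not the code from the other issue):

```python
import numpy as np

def split_into_blocks(img: np.ndarray, block: int = 512):
    # Split an HxWxC image into a list of block x block tiles,
    # discarding any partial tiles at the right/bottom edges.
    h, w = img.shape[:2]
    return [img[y:y + block, x:x + block]
            for y in range(0, h - block + 1, block)
            for x in range(0, w - block + 1, block)]

# e.g. a 2048x2048 RGB image yields 16 tiles of 512x512:
tiles = split_into_blocks(np.zeros((2048, 2048, 3), dtype=np.float32))
assert len(tiles) == 16
```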
Hello,
Recently I made my own test data to check in practice which resolution and parameters suit my desktop. I used a resolution of 1264x1264, and in the end the run hung, even for 24 hours; it consumed approximately 7408 MiB of GPU memory the whole time. So is my data too rough (MSE 0.2591xx, last img_loss=1.109134), or is there another reason?
Could someone share a suggestion on how to handle my high-resolution data, e.g. the relationship between an 11 GB GPU and the maximum resolution and scale?
In addition: my images are actually very high resolution (3000x4000), but for my GPU (3080, 11 GB) I scaled them down to 1264x1264 after some tries.
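For what it's worth, a minimal sketch of downscaling a dataset to a power-of-two resolution with Pillow, per the suggestion earlier in this thread (the folder names are hypothetical):

```python
from pathlib import Path
from PIL import Image

SRC, DST = Path("images_raw"), Path("images_1024")  # hypothetical folders
DST.mkdir(exist_ok=True)

for p in SRC.glob("*.png"):
    img = Image.open(p)
    # LANCZOS resampling keeps quality under strong downscaling
    # (e.g. 3000x4000 -> 1024x1024).
    img.resize((1024, 1024), Image.LANCZOS).save(DST / p.name)
```

Note that squashing a 3000x4000 image into a square changes the aspect ratio, so the camera intrinsics in the dataset's transforms file would need to be adjusted to match.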
Hardware: desktop RTX 3080 (11 GB), CUDA 11.3
Software: Ubuntu 20.04
GPU memory usage while pending: always a fixed amount, approximately 7.x GB
Run output and strace log: see the following.
The console output:
python3 train.py --config ./configs/nerf_handong.json
Config / Flags:
config ./configs/nerf_handong.json
iter 2000
batch 1
spp 1
layers 4
train_res [1264, 1264]
display_res [1264, 1264]
texture_res [2048, 2048]
display_interval 0
save_interval 100
learning_rate [0.03, 0.01]
min_roughness 0.08
custom_mip False
random_textures True
background white
loss logl1
out_dir out/nerf_handong
ref_mesh data/nerf_synthetic/handong
base_mesh None
validate True
mtl_override None
dmtet_grid 128
mesh_scale 2.75
env_scale 1.0
envmap None
display [{'latlong': True}, {'bsdf': 'kd'}, {'bsdf': 'ks'}, {'bsdf': 'normal'}]
camera_space_light False
lock_light False
lock_pos False
sdf_regularizer 0.2
laplace relative
laplace_scale 3000
pre_load True
kd_min [0.0, 0.0, 0.0, 0.0]
kd_max [1.0, 1.0, 1.0, 1.0]
ks_min [0, 0.1, 0.0]
ks_max [1.0, 1.0, 1.0]
nrm_min [-1.0, -1.0, 0.0]
nrm_max [1.0, 1.0, 1.0]
cam_near_far [0.1, 1000.0]
learn_light True
local_rank 0
multi_gpu False
DatasetNERF: 28 images with shape [1264, 1264]
DatasetNERF: 28 images with shape [1264, 1264]
Encoder output: 32 dims
Using /home/xx/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/xx/.cache/torch_extensions/py38_cu113/renderutils_plugin/build.ninja...
Building extension module renderutils_plugin...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module renderutils_plugin...
iter= 0, img_loss=1.166577, reg_loss=0.334082, lr=0.02999, time=468.3 ms, rem=15.61 m
...
iter= 1990, img_loss=0.936701, reg_loss=0.014961, lr=0.01199, time=484.8 ms, rem=4.85 s
iter= 2000, img_loss=1.109134, reg_loss=0.014806, lr=0.01194, time=503.9 ms, rem=0.00 s
Running validation
MSE, PSNR
0.25911513, 6.480
^C^CKilled
strace log:
sched_yield() = 0
sched_yield() = 0
sched_yield() = 0
...
sched_yield() = 0
sched_yield() = 0
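A repeating sched_yield() loop like this usually means some native thread is spin-waiting. A generic way to see which Python call is blocked during such a hang is to register faulthandler before the run starts (a sketch only; this is not part of train.py, and it is Unix-only):

```python
import faulthandler
import signal

# Dump the Python stack of every thread when the process receives
# SIGUSR1, e.g. via `kill -USR1 <pid>` from another shell. The dump
# works even while the main thread is stuck inside a native call.
faulthandler.register(signal.SIGUSR1)
```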
Thanks for your contribution!
Regards