Pending after computing MSE and PSNR #13

Open
hzhshok opened this issue Apr 2, 2022 · 8 comments
@hzhshok

hzhshok commented Apr 2, 2022

Hello,
Recently I built my own test data to find out what resolution and parameters are suitable for my desktop. I trained at a resolution of 1264x1264, and in the end the run hung, even after 24 hours. It was using roughly 7408 MiB of GPU memory, so is my data too rough (MSE ≈ 0.2591, last img_loss = 1.109134), or is there another reason?

Could someone share a suggestion on how to handle my high-resolution data, e.g. the relationship between an 11 GB GPU and the maximum resolution and scale?

In addition: my images are actually very high resolution (3000x4000), but for my GPU (3080, 11 GB) I scaled them down to 1264x1264 after some tries.

Hardware: desktop RTX 3080 (11 GB), CUDA 11.3.
Software: Ubuntu 20.04
GPU memory usage while hanging: constant, approximately 7.x GB
Console output and strace log: see below.

The console output:
python3 train.py --config ./configs/nerf_handong.json
Config / Flags:

config ./configs/nerf_handong.json
iter 2000
batch 1
spp 1
layers 4
train_res [1264, 1264]
display_res [1264, 1264]
texture_res [2048, 2048]
display_interval 0
save_interval 100
learning_rate [0.03, 0.01]
min_roughness 0.08
custom_mip False
random_textures True
background white
loss logl1
out_dir out/nerf_handong
ref_mesh data/nerf_synthetic/handong
base_mesh None
validate True
mtl_override None
dmtet_grid 128
mesh_scale 2.75
env_scale 1.0
envmap None
display [{'latlong': True}, {'bsdf': 'kd'}, {'bsdf': 'ks'}, {'bsdf': 'normal'}]
camera_space_light False
lock_light False
lock_pos False
sdf_regularizer 0.2
laplace relative
laplace_scale 3000
pre_load True
kd_min [0.0, 0.0, 0.0, 0.0]
kd_max [1.0, 1.0, 1.0, 1.0]
ks_min [0, 0.1, 0.0]
ks_max [1.0, 1.0, 1.0]
nrm_min [-1.0, -1.0, 0.0]
nrm_max [1.0, 1.0, 1.0]
cam_near_far [0.1, 1000.0]
learn_light True
local_rank 0
multi_gpu False

DatasetNERF: 28 images with shape [1264, 1264]
DatasetNERF: 28 images with shape [1264, 1264]
Encoder output: 32 dims
Using /home/xx/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/xx/.cache/torch_extensions/py38_cu113/renderutils_plugin/build.ninja...
Building extension module renderutils_plugin...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module renderutils_plugin...
iter= 0, img_loss=1.166577, reg_loss=0.334082, lr=0.02999, time=468.3 ms, rem=15.61 m
...
iter= 1990, img_loss=0.936701, reg_loss=0.014961, lr=0.01199, time=484.8 ms, rem=4.85 s
iter= 2000, img_loss=1.109134, reg_loss=0.014806, lr=0.01194, time=503.9 ms, rem=0.00 s
Running validation
MSE, PSNR
0.25911513, 6.480

^C^CKilled

strace log:
sched_yield() = 0
sched_yield() = 0
sched_yield() = 0
...
sched_yield() = 0
sched_yield() = 0

Thanks for your contribution!

Regards

@jmunkberg
Collaborator

After the first pass, we run xatlas to create a UV parameterization on the triangle mesh. If the first pass failed to create a reasonable mesh, this step can take quite some time or even fail. How does the mesh look in your case at the end of the first pass?
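
If you want to see whether the xatlas step itself is what hangs, a standalone sketch like the one below can be run on the pass-1 mesh (this is only an illustrative snippet; the mesh path and the use of trimesh for loading are assumptions on my side):

    # Standalone check of the UV-parameterization step: run xatlas on a
    # triangle mesh and see whether parameterization completes or hangs.
    import numpy as np
    import trimesh   # assumed available; expects a single-mesh OBJ
    import xatlas

    mesh = trimesh.load("mesh_from_pass1.obj", process=False)   # hypothetical path
    positions = np.asarray(mesh.vertices, dtype=np.float32)
    faces = np.asarray(mesh.faces, dtype=np.uint32)

    # xatlas.parametrize returns a vertex remapping, new face indices and per-vertex UVs.
    vmapping, indices, uvs = xatlas.parametrize(positions, faces)
    print(f"{len(positions)} verts -> {len(vmapping)} UV verts, {len(indices)} faces")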

For memory consumption, you can log the usage with nvidia-smi --query-gpu=memory.used --format=csv -lms 100 while training runs to get a feel for the usage. Memory is a function of image resolution, batch size, and whether you have depth peeling enabled. We ran the results in the paper on GPUs with 32+ GB of memory, but it should run on lower-spec GPUs if you decrease the rendering resolution and/or batch size.
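
If it is more convenient than the raw nvidia-smi call, a small Python loop can poll the same counter while training runs (a minimal sketch, not part of the repo; adjust interval and duration as needed):

    # Poll used GPU memory via nvidia-smi and print it, similar to the
    # "-lms 100" logging command above but easy to stop from a script.
    import subprocess, time

    def log_gpu_memory(interval_s=0.5, duration_s=60):
        """Print used GPU memory every interval_s seconds for duration_s seconds."""
        end = time.time() + duration_s
        while time.time() < end:
            out = subprocess.check_output(
                ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"]
            )
            print(out.decode().strip())
            time.sleep(interval_s)

    log_gpu_memory()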

@hzhshok
Author

hzhshok commented Apr 10, 2022

Thanks @jmunkberg!

The reason I raised this issue is that the plain 'nvidia-smi' command (without parameters) showed only about 7.4 GB used in total, and the GPU has 11 GB, so I suspected it was not a memory issue.

Sorry, I did check the related output, including the mesh, at one point, but I forgot what it looked like, and that host has since been torn down, so I can only continue tracking this after setting up a new environment.

In addition, the GPU ended up hanging (nvidia-smi was blocked and gave no response) every time train.py failed.
I understand why this feature can starve while waiting on GPU memory allocation, but:

Does the team have a plan to optimize the memory allocation strategy of this feature?

Regards

@jmunkberg
Collaborator

jmunkberg commented Apr 13, 2022

Hello @hzhshok ,

Looking at the error metrics:

MSE, PSNR
0.25911513, 6.480

Those are extremely large errors, so I assume the first pass did not train properly. What do the images look like in the out/nerf_handong folder (or the folder for your current experiment)? If the reconstruction succeeded, I would expect a PSNR of 25 dB or higher. If the reconstruction fails, it is very hard to create a UV parameterization (it is hard to UV-map a triangle soup), and xatlas will fail or hang.
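
For reference, here is the standard MSE-to-PSNR conversion (a quick sketch assuming images normalized to [0, 1]; as far as I can tell the validation code averages MSE and PSNR per image separately, so the converted number will not match the logged average exactly):

    # PSNR = 10 * log10(MAX^2 / MSE); with MAX = 1 this is -10 * log10(MSE).
    import math

    def mse_to_psnr(mse, max_val=1.0):
        return 10.0 * math.log10(max_val ** 2 / mse)

    print(mse_to_psnr(0.25911513))  # ~5.9 dB -- far below a good reconstruction
    print(mse_to_psnr(0.001))       # ~30 dB -- the ballpark of a successful fit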

I suspect something else is wrong already in the first pass. A few things that can affect quality:

  • Is the lighting setup constant in all training data?
  • Do you have high quality foreground segmentation masks? (see the sketch after this list)
  • Are you sure that the poses are correct and that the pose indices and corresponding images match?
  • Do the training images contain substantial motion or defocus blur?

Also, just to verify, is the example from the readme python train.py --config configs/bob.json working without issues on your setup?
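
Regarding the segmentation masks, a minimal sanity check could look like the sketch below (an illustrative snippet, not from the repo; it assumes the training images are RGBA PNGs with the foreground mask in the alpha channel, as in the synthetic NeRF datasets, and the folder path is just an example):

    # Check that each training image has an alpha channel and report how much
    # of the frame the foreground mask covers.
    import glob
    import numpy as np
    from PIL import Image

    for path in sorted(glob.glob("data/nerf_synthetic/handong/train/*.png")):
        img = np.asarray(Image.open(path))
        if img.ndim != 3 or img.shape[2] != 4:
            print(f"{path}: no alpha channel -- no foreground mask")
            continue
        coverage = (img[..., 3] > 127).mean()
        print(f"{path}: foreground covers {coverage:.1%} of the image")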

@hzhshok
Author

hzhshok commented Apr 14, 2022

Hello, thanks @jmunkberg!
Yes, something must be wrong in the first pass, and I will check how training behaves with labeled (masked) versions of the current images as a baseline.

Is the lighting setup constant in all training data?
-- No, some of the images have strong light, because they were taken outdoors.
Do you have high quality foreground segmentation masks?
-- Do you mean labeling the foreground against a clean background, as in the examples (chair, etc.)?
This time I just used wild outdoor images to check whether this can work with non-constant lighting and relatively uniform colors, and I did not label the images; I will check the effect with high-quality foreground masks. :-)
In addition, I mainly want to check the effect of texturing a mesh obtained from a traditional SfM tool.
Are you sure that the poses are correct and that the pose indices and corresponding images match?
-- Yes, they should be; COLMAP produced a result, although it was not very good.
Do the training images contain substantial motion or defocus blur?
-- There is not much blur; the images were resized from the original 3003x4000 down to 1264x1264, which I think could have an impact.

Regards

@hzhshok
Author

hzhshok commented May 10, 2022

Hello,
I have now found another set of samples (of a building) to try this feature, and labeled them to remove the background. The first training/mesh pass seems to succeed, but the second pass fails, so please give me a suggestion, thanks!

The questions:
a. Why does it break off during the run (see the console log below)?
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Function texture2d_mipBackward returned an invalid gradient at index 0 - got [1, 4, 4, 3] but expected shape compatible with [1, 5, 5, 3]
Maybe the MSE/PSNR are not good; if so, could you give me suggestions on how to improve the training images, or are they not following the expected rules?
b. Why does the result have serious blur? See the intermediate pictures and the two original image samples.
c. How can the final result be improved: the input training images, or something else?

[Attached intermediate renders: img_dmtet_pass1_000000, img_dmtet_pass1_000041, img_dmtet_pass1_000045, img_dmtet_pass1_000079, img_dmtet_pass1_000080]

[Attached original image samples: IMG_4287, IMG_4332]

Hardware: 3080 (24 GB), Windows 11

Samples (see the two attached pictures for examples):
a. 50 images.
b. Resolution: 2456 (width) x 1638 (height)
c. Transforms generated with --aabb_scale 2 for COLMAP.

{
"ref_mesh": "data/nerf_synthetic/building",
"random_textures": true,
"iter": 8000,
"save_interval": 100,
"texture_res": [5120,5120],
"train_res": [1638, 2456],
"batch": 1,
"learning_rate": [0.03, 0.0001],
"ks_min" : [0, 0.08, 0.0],
"dmtet_grid" : 128,
"mesh_scale" : 5,
"laplace_scale" : 3000,
"display": [{"latlong" : true}, {"bsdf" : "kd"}, {"bsdf" : "ks"}, {"bsdf" : "normal"}],
"layers" : 4,
"background" : "white",
"out_dir": "nerf_building"
}

The console log:
iter= 8000, img_loss=0.061592, reg_loss=0.016066, lr=0.00075, time=499.0 ms, rem=0.00 s
Running validation
MSE, PSNR
0.02140407, 17.038
Base mesh has 214359 triangles and 105082 vertices.
Writing mesh: out/nerf_building\dmtet_mesh/mesh.obj
writing 105082 vertices
writing 224020 texcoords
writing 105082 normals
writing 214359 faces
Writing material: out/nerf_building\dmtet_mesh/mesh.mtl
Done exporting mesh
Traceback (most recent call last):
  File "D:\zhansheng\proj\windows\nvdiffrec\train.py", line 620, in <module>
    geometry, mat = optimize_mesh(glctx, geometry, base_mesh.material, lgt, dataset_train, dataset_validate, FLAGS,
  File "D:\zhansheng\proj\windows\nvdiffrec\train.py", line 428, in optimize_mesh
    total_loss.backward()
  File "C:\Users\jinshui\anaconda3\envs\dmodel\lib\site-packages\torch\_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "C:\Users\jinshui\anaconda3\envs\dmodel\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function texture2d_mipBackward returned an invalid gradient at index 0 - got [1, 4, 4, 3] but expected shape compatible with [1, 5, 5, 3]

Regards

@ZirongChan

ZirongChan commented Aug 3, 2022

@hzhshok I ran into the same "texture2d_mipBackward returned an invalid gradient" error; in my case it is "got [1,2,2,3] but expected [1,3,3,3]".
Did you solve this issue, or did you find the reason for it?

@jmunkberg thanks for your great work, by the way. What would you suggest could cause this issue? Bad segmentation or something else?

@jmunkberg
Collaborator

That is an error from nvdiffrast. I would try to use power-of-two resolutions on the textures and training, e.g.,

    "texture_res": [ 1024, 1024 ],
    "train_res": [1024, 1024],

This is in case texture2d_mipBackward is not stable for all (non-power-of-two) resolutions.
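
If it helps, here is a tiny helper for snapping a resolution to the nearest power of two before putting it into the config (just a sketch, not part of nvdiffrec):

    # Snap a resolution to the nearest power of two.
    def nearest_pow2(n: int) -> int:
        lower = 1 << (n.bit_length() - 1)
        upper = lower << 1
        return lower if n - lower <= upper - n else upper

    for res in (1264, 1638, 2456, 5120):
        print(res, "->", nearest_pow2(res))   # 1024, 2048, 2048, 4096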

@hzhshok
Author

hzhshok commented Aug 10, 2022

@ZirongChan, I am not sure whether it was caused by memory. I used images of close to 2K x 2K, which cost a lot of memory on my single 24 GB GPU, so I used the strategy of splitting the images into small blocks, as @jmunkberg described in another issue, and at least that error did not happen anymore.
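
For reference, one generic way to split an image into tiles could look like the sketch below (for illustration only, not exactly what I ran; note that cropping also requires adjusting the camera intrinsics per tile, which this does not do):

    # Crop an image into fixed-size tiles; the filename and tile size are examples.
    from PIL import Image

    def split_into_tiles(path, tile=1024):
        img = Image.open(path)
        w, h = img.size
        tiles = []
        for top in range(0, h, tile):
            for left in range(0, w, tile):
                tiles.append(img.crop((left, top, min(left + tile, w), min(top + tile, h))))
        return tiles

    tiles = split_into_tiles("IMG_4287.jpg", tile=1024)
    print(len(tiles), "tiles")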

Regards
