
Problem with Docker Container #38

Closed
iam-machine opened this issue Aug 18, 2023 · 20 comments
@iam-machine

Hi! Your project is so amazing that even a person who knows nothing about coding (yes, that's me) decided to try it :) As expected, I had some problems getting everything to work. I'm using WSL2 on Windows 11, and when I run:

docker run -it chenhsuanlin/colmap:3.8 /bin/bash

I get the following warning about the NVIDIA driver:

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .

But when I run the same command with --gpus all:
docker run --gpus all -it chenhsuanlin/colmap:3.8 /bin/bash

I get this:

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/1140653493db5ca0f2b71b42c2194624b1e9e50bd0f9f72121bf836058a77900/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: file exists: unknown.
ERRO[0000] error waiting for container:

What am I doing wrong?
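(For anyone who lands here with the same error: a quick way to sanity-check the WSL2 + Docker GPU path is to try a stock CUDA image first; the image tag below is just an example.)

```shell
# On the WSL2 host, confirm the driver is visible at all
nvidia-smi

# Then confirm Docker can pass the GPU through to a stock CUDA image;
# the tag is only an example, any recent nvidia/cuda image should do
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi
```

If the second command prints the same GPU table as the first, the NVIDIA Container Toolkit itself is fine and the failure is specific to the image being run.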

@chenhsuanlin
Contributor

This is likely a duplicate of #29, which we have not yet been able to reproduce.

@iam-machine
Author

This is so sad :( And also very strange, because I ran two commands in WSL, nvidia-smi and nvcc --version, and both show their info without any errors. So I really don't understand why docker run with --gpus all fails.

@iam-machine
Author

What's strange is that I tried another Docker image, from NVIDIA, and it worked! No warnings, no errors.

iammachine@User:~$ docker run --gpus all -it nvidia/cuda:12.2.0-devel-ubuntu20.04

==========
== CUDA ==
==========

CUDA Version 12.2.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

ChatGPT was kindly helping me with all of this, so I'll just copy its reply here:

"It looks like the chenhsuanlin/colmap:3.8 image still has an issue, as indicated by the error about the libnvidia-ml.so.1 file. This could be a compatibility problem with how the image was constructed, or it might have something to do with specific libraries it's trying to leverage that are clashing with the WSL setup.

However, the successful run of the CUDA 12.2.0 container is a promising sign that your Docker setup with NVIDIA GPUs is functioning correctly.

If you're mainly looking to utilize GPU power inside Docker, you might be able to proceed with other Docker images that are compatible with your setup."

@iam-machine
Author

WORKING!!


iammachine@User:~$ gh repo clone NVlabs/neuralangelo
Cloning into 'neuralangelo'...
remote: Enumerating objects: 243, done.
remote: Counting objects: 100% (87/87), done.
remote: Compressing objects: 100% (55/55), done.
remote: Total 243 (delta 40), reused 48 (delta 32), pack-reused 156
Receiving objects: 100% (243/243), 14.82 MiB | 3.57 MiB/s, done.
Resolving deltas: 100% (94/94), done.
iammachine@User:~$ cd neuralangelo
iammachine@User:~/neuralangelo$ docker build -t chenhsuanlin/colmap:3.8 -f docker/Dockerfile-colmap .
[+] Building 392.4s (12/12) FINISHED                                                                     docker:default
 => [internal] load build definition from Dockerfile-colmap                                                        0.1s
 => => transferring dockerfile: 1.28kB                                                                             0.0s
 => [internal] load .dockerignore                                                                                  0.1s
 => => transferring context: 2B                                                                                    0.0s
 => [internal] load metadata for nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu20.04                                      2.9s
 => [1/8] FROM nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu20.04@sha256:594e37669bf42ae55fa2b4a06af4dcf9becc86045ab26a  0.4s
 => => resolve nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu20.04@sha256:594e37669bf42ae55fa2b4a06af4dcf9becc86045ab26a  0.0s
 => => sha256:594e37669bf42ae55fa2b4a06af4dcf9becc86045ab26a92c39bb93eb5781d0b 743B / 743B                         0.0s
 => => sha256:36dbd102712c20d1553eba13e2ad6d691bbb869845355e03e54fd24b77738d7c 2.63kB / 2.63kB                     0.0s
 => => sha256:ff3049f3773d3606a7188e1ba3b678c23f8c4b821d0ef3443fad1ad9aac28000 18.41kB / 18.41kB                   0.0s
 => [2/8] RUN apt-get update && apt-get install -y     git     cmake     ninja-build     build-essential     li  159.9s
 => [3/8] RUN apt-get update && apt-get install -y     xvfb                                                        7.4s
 => [4/8] RUN git clone https://github.com/colmap/colmap.git && cd colmap && git checkout 3.8                      5.0s
 => [5/8] RUN cd colmap && mkdir build && cd build && cmake .. -DCUDA_ENABLED=ON -DCMAKE_CUDA_ARCHITECTURES="70;7  2.1s
 => [6/8] RUN cd colmap/build && ninja && ninja install                                                          147.6s
 => [7/8] RUN apt-get update && apt-get install -y     pip     ffmpeg                                             37.0s
 => [8/8] RUN pip install     addict     opencv-python-headless     pillow     pyyaml     trimesh                 25.5s
 => exporting to image                                                                                             4.4s
 => => exporting layers                                                                                            4.4s
 => => writing image sha256:73a26d407346e37486b00c0bf191dcd5e6cad5424ad2d1eb2e0009af4d1ecc25                       0.0s
 => => naming to docker.io/chenhsuanlin/colmap:3.8                                                                 0.0s
iammachine@User:~/neuralangelo$ docker run --gpus all --ipc=host -it chenhsuanlin/colmap:3.8 /bin/bash

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Basically, after properly installing everything related to CUDA, I cloned the git repo and, instead of pulling the Docker image, built it locally, and now it runs without errors. Windows 11, WSL2 Ubuntu. Later I'll check how video processing works, hoping for the best :)
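Distilled from the log above, the workaround amounts to building the COLMAP image locally instead of pulling it from Docker Hub:

```shell
# Clone the repo and build the COLMAP image locally under the expected tag
git clone https://github.com/NVlabs/neuralangelo
cd neuralangelo
docker build -t chenhsuanlin/colmap:3.8 -f docker/Dockerfile-colmap .

# The locally built image then starts cleanly with GPU access
docker run --gpus all --ipc=host -it chenhsuanlin/colmap:3.8 /bin/bash
```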

@chenhsuanlin
Contributor

Thanks @iam-machine for helping look into this! Could you elaborate on what you had to go through to get it working? That would help others with similar issues. Unfortunately we don't have Windows machines and cannot test with WSL, so your experience would be very helpful to us as well!

@chenhsuanlin
Contributor

Also for additional context, what GPU models and driver versions did you have?

@iam-machine
Author

@chenhsuanlin Of course, but first I want to test that the whole pipeline works properly; if it does, I'll describe what I did to make it work :) Right now I've already completed the Data Preparation instructions, installed Blender with the BlenderNeuralangelo add-on, and completed the steps there. Next, as I understand it, I need to run another Docker image (docker-neuralangelo), enter some commands there from the guide, and do the isosurface extraction.

@iam-machine
Author

@chenhsuanlin I started training and stopped it somewhere around epoch 1500, then checked whether any checkpoints had been saved and found nothing in the neuralangelo folder. Nothing was saved. I checked config.yaml, and it literally says that checkpoints should be created every 9999999999 epochs. Why is that the default? I changed it to 500 and saved, and the change really was saved, but after starting training something reverts config.yaml back. Hence, no matter how long I wait during training, I don't get checkpoint files. Strange.

@chenhsuanlin
Contributor

@iam-machine the checkpoints should be saved in the {root}/logs directory by default. We measure progress in iterations rather than epochs, so iterations are what actually matter (see also #6).
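For illustration, the interval the reporter edited lives in the generated experiment config. A minimal sketch, assuming key names like the following (the exact names here are my guess from this thread, so check your own config.yaml):

```yaml
# Hypothetical excerpt of an experiment config.yaml; key names are assumptions.
checkpoint:
    save_epoch: 9999999999   # effectively "never save by epoch"
    save_iter: 20000         # saving is driven by iteration count instead
logdir: logs/                # checkpoints land under {root}/logs by default
```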

@iam-machine
Author

@chenhsuanlin Thank you, it helped! Almost a week of hell, but I made it work on Windows 11. I tested the whole pipeline, up to the point where I exported a .ply file from MeshLab to Maya as .obj. I wasn't able to export a textured mesh, but I noticed I was using a slightly older version of the repo, so I updated it yesterday; the new training is still running. Windows + WSL2 + Ubuntu 22 (don't remember the exact version) + Docker. The number of errors I solved to make it work is insane. I'm not even a programmer, I'm a 3D artist, so troubleshooting all of this was really hell. I can't describe my steps in detail because I was working on it for about a week, from morning to night. A huge amount of work, it's just crazy.

@chenhsuanlin
Contributor

Thanks @iam-machine, great to know you finally got it working! Sorry you had to go through this trouble; I don't have a Windows + WSL machine to develop with, so I couldn't really put together a formal doc.
Quick note though -- you should also be able to load the old checkpoint and extract textured meshes with the newly pulled code. Let me know if it doesn't work!

@iam-machine
Author

@chenhsuanlin By the way, when I exported the mesh I noticed it has too many polygons; with half as many there would be no visible difference in detail at all. Is there some parameter that controls this? I don't mind doing auto-retopo in Maya, but I thought that lowering the polygon count might speed up some processes. Then again, mesh extraction is quick and training is slow... and just this second I realized that polygon count has nothing to do with training speed 😂

@chenhsuanlin
Contributor

Yes, the triangle count is controlled by the marching cubes resolution (the --resolution argument in the mesh extraction script) and is not directly related to training. You can try 1024 or 512 for faster and more lightweight extraction.
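As a sketch, a lower-resolution extraction call might look like this; only --resolution is confirmed in this thread, while the script path and remaining flags are my assumptions about the repo layout:

```shell
# Hypothetical invocation: the script path and the --config/--checkpoint/
# --output_file flags are assumptions; only --resolution comes from the
# comment above (1024 or 512 for a lighter mesh).
torchrun --nproc_per_node=1 projects/neuralangelo/scripts/extract_mesh.py \
    --config=logs/example_group/example_run/config.yaml \
    --checkpoint=logs/example_group/example_run/latest_checkpoint.pt \
    --output_file=mesh.ply \
    --resolution=1024
```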

@iam-machine
Author

@chenhsuanlin Got it, thanks :) What about the video resolution, are there any rules regarding that? I thought that if I shoot video in at least 2K, I might get better mesh detail, and get it at earlier checkpoint stages than when training on a lower-resolution video. No? Or is it exactly the opposite: the higher the video resolution, the slower the training and the slower the detail reconstruction?

@smandava98

Hi @iam-machine and @chenhsuanlin. I noticed this thread was open and I had a similar question, so I decided to post it here.
I'm also having some issues with the training process being slow (I'm on an A10 GPU and it's taking a while). I'm reconstructing indoor scenes and a few outdoor scenes. They are small scenes, e.g. a video of my table with some books on it. Is there a recommended approach to reduce the training time to 10-20 minutes or less?

Thank you :)
This repo is by far one of the best I have personally experienced in terms of ease of running (on Ubuntu at least), and I have worked with a lot of repos in the past where this was not the case.

@chenhsuanlin
Contributor

chenhsuanlin commented Aug 22, 2023

@iam-machine in general it would be the latter -- the larger the video resolution, the more time it would take to recover the fine details. If you want to get more details at earlier stages, you can try increasing the hyperparameter model.object.sdf.encoding.coarse2fine.init_active_level.
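Spelled out as a config fragment (the key path is given above; the value is only an illustrative assumption):

```yaml
model:
    object:
        sdf:
            encoding:
                coarse2fine:
                    init_active_level: 8   # illustrative value; higher = more detail earlier
```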

@chenhsuanlin
Contributor

@smandava98 thanks for the kind words! You can adjust the hyperparameters below correspondingly to optimize with fewer iterations:

  • max_iter (max iterations, obviously)
  • model.object.sdf.encoding.coarse2fine.step (increment one progressive level every N iterations)
  • optim.sched.warm_up_end (learning rate warmup iterations)
  • optim.sched.two_steps (decrease the learning rate at these iteration numbers)

Please also see #4 and the FAQ for details.
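The four knobs above map onto the config like this; the key paths come from the list, while the values are illustrative assumptions for a shorter run:

```yaml
max_iter: 25000                     # total iterations (illustrative)
model:
    object:
        sdf:
            encoding:
                coarse2fine:
                    step: 1000      # grow one progressive level every N iterations
optim:
    sched:
        warm_up_end: 2500           # learning rate warmup iterations
        two_steps: [15000, 20000]   # decay the learning rate at these iterations
```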

@smandava98

Thanks @chenhsuanlin! Just to clarify: no matter what hyperparameter ranges I set (since I want to optimize for speed in my case), will the results at the very minimum match Instant NGP quality? From the paper it seems Instant NGP is the backbone being built on (please correct me if I'm wrong).

@chenhsuanlin
Contributor

@smandava98 this is not guaranteed (and we did not do a systematic study on this). Despite the backbone being the same, there is still a difference between the 3D representations (NeRF vs. neural SDF). Also, the instant-ngp library is very well-engineered and the implementation is very optimized (including sampling for volume rendering); for Neuralangelo, we only borrow the same backbone architecture for its representation power.

@chenhsuanlin
Contributor

Closing due to inactivity, please feel free to reopen if the issue persists.
