
Issues installing torch with the GPU on top of synerbi/sirf:service-gpu #213

Open
Imraj-Singh opened this issue Mar 6, 2024 · 13 comments

@Imraj-Singh
Contributor

Imraj-Singh commented Mar 6, 2024

I ran into some issues installing torch within the synerbi/sirf:service-gpu container.

I cloned the SIRF-Exercises locally, then opened the folder in VSCode, changing the .devcontainer base image from devel-service to service-gpu. I then chose "Reopen in devcontainer", which takes a while to pull the image and update the environment.

In the new container the command nvidia-smi is not found, but nvcc is. I tried installing torch, but can't seem to get the GPU working: torch.cuda.is_available() returns False.

It gave a more sinister error when I tried a couple of weeks ago (something about not finding some CUDA libraries), but I can't seem to recreate it... I'm fairly sure mamba is just choosing the CPU version of torch because it can't find the drivers. Perhaps all that's needed is an apt update, but I run into permission issues...
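For reference, a quick diagnostic sketch to tell a CPU-only torch wheel apart from a missing NVIDIA driver (nothing SIRF-specific is assumed here, just standard tooling inside the container):

```python
# Diagnostic sketch: distinguish "CPU-only torch wheel installed"
# from "no NVIDIA driver visible in the container".
import shutil


def gpu_diagnostics() -> dict:
    report = {}
    # nvidia-smi comes from the host driver via the container runtime, so its
    # absence usually means the container was started without GPU access.
    report["nvidia-smi"] = shutil.which("nvidia-smi") is not None
    # nvcc is part of the CUDA toolkit and can be present even with no driver.
    report["nvcc"] = shutil.which("nvcc") is not None
    try:
        import torch
        # torch.version.cuda is None for CPU-only wheels.
        report["torch_cuda_build"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        report["torch_cuda_build"] = None
        report["cuda_available"] = False
    return report


if __name__ == "__main__":
    print(gpu_diagnostics())
```

If `torch_cuda_build` is None, mamba/pip picked a CPU-only build; if it is set but `nvidia-smi` is missing, the container itself has no GPU access.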

@casperdcl apologies if this isn't the right place to put this issue!

@casperdcl
Member

casperdcl commented Apr 12, 2024

changing the .devcontainer base image from devel-service to service-gpu

but the current config uses ghcr.io/synerbi/sirf:latest, not :devel-service

"image": "ghcr.io/synerbi/sirf:latest",

Also it doesn't have any GPU-specific options like --gpus all etc., and I don't know whether VSCode locally works with the underlying docker-stacks image (SyneRBI/SIRF-SuperBuild#865)

Does this not work for you?

$ docker run --rm --gpus all -it ghcr.io/synerbi/sirf:latest-gpu /bin/bash
sirf$ pip install torch
sirf$ python -c 'import torch; print(torch.cuda.is_available())'

@Imraj-Singh
Contributor Author

I must've used the older config.

I've just tried with the new image and it still does not recognise nvidia-smi, and when I run sirf$ python -c 'import torch; print(torch.cuda.is_available())' it returns False. I pulled the image, then ran pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 to install PyTorch.

@casperdcl
Member

casperdcl commented Apr 15, 2024

hmm... on my machine I have:

$ docker run --rm --gpus all -it ghcr.io/synerbi/sirf:latest-gpu nvidia-smi
Mon Apr 15 09:40:26 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
...

(So cuda version 12.4)

@Imraj-Singh
Contributor Author

It works! I just needed to launch the devcontainer with slight updates:

"image": "ghcr.io/synerbi/sirf:latest-gpu",
"runArgs": ["--gpus=all"]

This is in line with what you wrote here; my apologies for not seeing it earlier. I guess it could be worth adding a note in the DocForParticipants, although this may be too niche, and you do link to the docker README there.
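For anyone landing here later, the full .devcontainer/devcontainer.json would then look roughly like this (a sketch: only image and runArgs come from this thread; the name field is a placeholder):

```json
{
  "name": "SIRF-Exercises (GPU)",
  "image": "ghcr.io/synerbi/sirf:latest-gpu",
  "runArgs": ["--gpus=all"]
}
```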

@KrisThielemans
Member

Great. Is there a way to have two devcontainer specs?

@casperdcl
Member

casperdcl commented Apr 16, 2024

Apart from templates (see SyneRBI/SIRF-SuperBuild#865), not really, AFAIK

@casperdcl
Member

casperdcl commented May 13, 2024

@paskino just made this for @gschramm

SIRF-SuperBuild/docker/compose.sh -bg -- \
  --build-arg EXTRA_BUILD_FLAGS="-DSIRF_TAG=2c66faff3bc0f12c864cfc2a2931eba5ade60ba0"

then, with this Dockerfile:

FROM synerbi/sirf:latest-gpu
RUN pip install torch

docker build -t ghcr.io/synerbi/sirf:training2024_0.1 .

and pushed it to ghcr.io/synerbi/sirf:training2024_0.1

Though I think it should be:

FROM synerbi/sirf:latest-gpu
RUN mamba install jupytext parallelproj \
  && pip install --no-cache-dir torch \
  && mamba clean -a -y -f && fix-permissions "${CONDA_DIR}" /home/${NB_USER}

@KrisThielemans
Member

Not 100% sure about adding parallelproj here. We already build one ourselves. Could lead to interesting conflicts (as our build is independent of pip). We could avoid that by following the instructions at https://github.com/SyneRBI/SIRF/wiki/Building-SIRF-and-CIL-with-conda, but it's too late for that now.

So... do we need parallelproj Python for this image?

(Obviously, all the fix-permissions stuff is a bit ugly, but as long as you manage this...)

@gschramm
Contributor

parallelproj (and also jupytext) are indeed not needed. Sorry for the confusion (I asked for parallelproj a while ago). The only extra package I need for the DL exercises is PyTorch

@KrisThielemans
Member

@samdporter @NicoleJurjew do you need jupytext?

If so, it should be added to the environment.yml

@NicoleJurjew
Contributor

NicoleJurjew commented May 14, 2024 via email

@samdporter
Contributor

I'm not using it either (although having a read about it, it looks like I probably should?)

@casperdcl
Member

casperdcl commented May 14, 2024

Ok. --no-cache-dir would also make the images 2GB smaller (!).

I'll just add jupytext and remove caches in the current image.
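To see what --no-cache-dir saves, here is a small sketch that measures the size of the local pip cache, i.e. the data that `pip install --no-cache-dir` keeps out of an image layer (assumption: pip >= 20.1, which provides the `pip cache dir` subcommand):

```python
# Sketch: measure how much disk the pip wheel/HTTP cache occupies.
import os
import subprocess
import sys


def dir_size_bytes(path: str) -> int:
    """Total size of all regular files under `path`, in bytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished or unreadable; skip it
    return total


if __name__ == "__main__":
    # `pip cache dir` (pip >= 20.1) prints the cache location.
    proc = subprocess.run(
        [sys.executable, "-m", "pip", "cache", "dir"],
        capture_output=True, text=True,
    )
    cache = proc.stdout.strip()
    if proc.returncode == 0 and os.path.isdir(cache):
        print(f"{cache}: {dir_size_bytes(cache) / 1e6:.1f} MB")
    else:
        print("pip cache not found")
```

Run inside the image before and after adding --no-cache-dir to compare layer sizes.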

6 participants