
Error building torch on clean docker compose --profile auto up --build #420

Closed
2 tasks done
dmarx opened this issue Apr 24, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@dmarx

dmarx commented Apr 24, 2023

Has this issue been opened before?

  • It is not in the FAQ, I checked.
  • It is not in the issues, I searched.

Describe the bug

First attempt at building. The docker compose --profile download up --build step worked fine; attempting to run docker compose --profile auto up --build resulted in the following error:

=> => extracting sha256:3fd92eeca8f54976c24de929011349e191dc349bf932629b  0.0s
 => [xformers 2/3] RUN apk add --no-cache aria2                            3.0s
 => [xformers 3/3] RUN aria2c -x 5 --dir / --out wheel.whl 'https://gith  24.0s
 => [download 2/8] COPY clone.sh /clone.sh                                 0.1s
 => [download 3/8] RUN . /clone.sh taming-transformers https://github.co  16.0s
 => [download 4/8] RUN . /clone.sh stable-diffusion-stability-ai https:/  11.4s
 => [download 5/8] RUN . /clone.sh CodeFormer https://github.com/sczhou/C  2.0s
 => [download 6/8] RUN . /clone.sh BLIP https://github.com/salesforce/BLI  1.8s
 => [download 7/8] RUN . /clone.sh k-diffusion https://github.com/crowson  0.8s
 => [download 8/8] RUN . /clone.sh clip-interrogator https://github.com/p  0.9s
 => ERROR [stage-2  2/15] RUN --mount=type=cache,target=/root/.cache/pip  64.7s
------
 > [stage-2  2/15] RUN --mount=type=cache,target=/root/.cache/pip   pip install torch==1.13.1+cu117 torchvision --extra-index-url https://download.pytorch.org/whl/cu117:
#0 1.315 Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu117
#0 2.060 Collecting torch==1.13.1+cu117
#0 2.077   Downloading https://download.pytorch.org/whl/cu117/torch-1.13.1%2Bcu117-cp310-cp310-linux_x86_64.whl (1801.8 MB)
#0 63.72      ━━━━━━━━━━━                              0.5/1.8 GB 11.7 MB/s eta 0:01:52
#0 63.72 ERROR: Exception:
#0 63.72 Traceback (most recent call last):
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 437, in _error_catcher
#0 63.72     yield
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 560, in read
#0 63.72     data = self._fp_read(amt) if not fp_closed else b""
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 526, in _fp_read
#0 63.72     return self._fp.read(amt) if amt is not None else self._fp.read()
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/cachecontrol/filewrapper.py", line 90, in read
#0 63.72     data = self.__fp.read(amt)
#0 63.72   File "/usr/local/lib/python3.10/http/client.py", line 465, in read
#0 63.72     s = self.fp.read(amt)
#0 63.72   File "/usr/local/lib/python3.10/socket.py", line 705, in readinto
#0 63.72     return self._sock.recv_into(b)
#0 63.72   File "/usr/local/lib/python3.10/ssl.py", line 1274, in recv_into
#0 63.72     return self.read(nbytes, buffer)
#0 63.72   File "/usr/local/lib/python3.10/ssl.py", line 1130, in read
#0 63.72     return self._sslobj.read(len, buffer)
#0 63.72 TimeoutError: The read operation timed out
#0 63.72 
#0 63.72 During handling of the above exception, another exception occurred:
#0 63.72 
#0 63.72 Traceback (most recent call last):
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/cli/base_command.py", line 160, in exc_logging_wrapper
#0 63.72     status = run_func(*args)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/cli/req_command.py", line 247, in wrapper
#0 63.72     return func(self, options, args)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/commands/install.py", line 400, in run
#0 63.72     requirement_set = resolver.resolve(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 92, in resolve
#0 63.72     result = self._result = resolver.resolve(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 481, in resolve
#0 63.72     state = resolution.resolve(requirements, max_rounds=max_rounds)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 348, in resolve
#0 63.72     self._add_to_criteria(self.state.criteria, r, parent=None)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 172, in _add_to_criteria
#0 63.72     if not criterion.candidates:
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/structs.py", line 151, in __bool__
#0 63.72     return bool(self._sequence)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 155, in __bool__
#0 63.72     return any(self)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 143, in <genexpr>
#0 63.72     return (c for c in iterator if id(c) not in self._incompatible_ids)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 47, in _iter_built
#0 63.72     candidate = func()
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 206, in _make_candidate_from_link
#0 63.72     self._link_candidate_cache[link] = LinkCandidate(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 297, in __init__
#0 63.72     super().__init__(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 162, in __init__
#0 63.72     self.dist = self._prepare()
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 231, in _prepare
#0 63.72     dist = self._prepare_distribution()
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 308, in _prepare_distribution
#0 63.72     return preparer.prepare_linked_requirement(self._ireq, parallel_builds=True)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 491, in prepare_linked_requirement
#0 63.72     return self._prepare_linked_requirement(req, parallel_builds)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 536, in _prepare_linked_requirement
#0 63.72     local_file = unpack_url(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 166, in unpack_url
#0 63.72     file = get_http_url(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 107, in get_http_url
#0 63.72     from_path, content_type = download(link, temp_dir.path)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/network/download.py", line 147, in __call__
#0 63.72     for chunk in chunks:
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/cli/progress_bars.py", line 53, in _rich_progress_bar
#0 63.72     for chunk in iterable:
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/network/utils.py", line 63, in response_chunks
#0 63.72     for chunk in response.raw.stream(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 621, in stream
#0 63.72     data = self.read(amt=amt, decode_content=decode_content)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 559, in read
#0 63.72     with self._error_catcher():
#0 63.72   File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
#0 63.72     self.gen.throw(typ, value, traceback)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 442, in _error_catcher
#0 63.72     raise ReadTimeoutError(self._pool, None, "Read timed out.")
#0 63.72 pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='download.pytorch.org', port=443): Read timed out.
#0 64.51 
#0 64.51 [notice] A new release of pip available: 22.3.1 -> 23.1.1
#0 64.51 [notice] To update, run: pip install --upgrade pip
------
failed to solve: executor failed running [/bin/sh -c pip install torch==1.13.1+cu117 torchvision --extra-index-url https://download.pytorch.org/whl/cu117]: exit code: 2

Which UI

auto

Hardware / Software

  • OS: Ubuntu 22.04
  • OS version: Ubuntu 22.04.2 LTS
  • Docker Version:
Client: Docker Engine - Community
 Cloud integration: v1.0.31
 Version:           23.0.4
 API version:       1.41 (downgraded from 1.42)
 Go version:        go1.19.8
 Git commit:        f480fb1
 Built:             Fri Apr 14 10:32:03 2023
 OS/Arch:           linux/amd64
 Context:           desktop-linux

Server: Docker Desktop 4.18.0 (104112)
 Engine:
  Version:          20.10.24
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.19.7
  Git commit:       5d6db84
  Built:            Tue Apr  4 18:18:42 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.18
  GitCommit:        2456e983eb9e37e47538f59ea18f2043c9a73640
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
  • Docker compose version: v2.17.2
  • Repo version: 2a0de02
  • RAM: plenty
  • GPU/VRAM: 3090
@dmarx dmarx added the bug Something isn't working label Apr 24, 2023
@dmarx
Author

dmarx commented Apr 24, 2023

Let's see if this works.

EDIT: yeah... don't do this. That torch version is pinned for a reason.

@dmarx
Author

dmarx commented Apr 24, 2023

New error now after unpinning torch:

 => [stage-2 15/15] WORKDIR /stable-diffusion-webui                                            0.0s 
 => exporting to image                                                                        21.1s 
 => => exporting layers                                                                       21.1s 
 => => writing image sha256:500eb74eac4bb4c9d06516f9f971fdbee75013b509c002666788f73fbe08b742   0.0s 
 => => naming to docker.io/library/sd-auto:51                                                  0.0s
[+] Running 1/1
 ✔ Container webui-docker-auto-1  Created                                                      0.2s 
Attaching to webui-docker-auto-1
Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown

I think this means I'm missing my CUDA drivers?

@dmarx
Author

dmarx commented Apr 24, 2023

Deleted and rebuilt the containers and images; still no luck.

=> => exporting layers                                                                                                        0.0s
 => => writing image sha256:500eb74eac4bb4c9d06516f9f971fdbee75013b509c002666788f73fbe08b742                                   0.0s
 => => naming to docker.io/library/sd-auto:51                                                                                  0.0s
[+] Running 1/0
 ✔ Container webui-docker-auto-1  Created                                                                                      0.0s 
Attaching to webui-docker-auto-1
Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown

@dmarx
Author

dmarx commented Apr 24, 2023

Tried sudo-ing the command; that seems to have at least gotten past the previous error. I believe the root of the problem is discussed here: NVIDIA/nvidia-container-toolkit#154
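For reference, the usual fix in situations like the one discussed in that linked issue is re-registering the NVIDIA runtime with Docker and restarting the daemon. This is a hedged sketch (assumes the toolkit's nvidia-ctk helper is installed; it was not part of this thread):

```shell
# Re-register the NVIDIA runtime in Docker's daemon config,
# then restart Docker so the change takes effect.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```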

@dmarx
Author

dmarx commented Apr 24, 2023

Services build, but I'm getting an error when trying to run a test prompt with everything else set to defaults...

webui-docker-auto-1  | Running on local URL:  http://0.0.0.0:7860
webui-docker-auto-1  | 
webui-docker-auto-1  | To create a public link, set `share=True` in `launch()`.
webui-docker-auto-1  | Startup time: 13.9s (import gradio: 0.8s, import ldm: 0.4s, other imports: 1.2s, load scripts: 0.2s, load SD checkpoint: 10.9s, create ui: 0.1s).
webui-docker-auto-1  | Error completing request
webui-docker-auto-1  | Arguments: ('task(td9v3amy7jrkdya)', 'a delicious cheeseburger', '', [], 20, 0, False, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 512, 512, False, 0.7, 2, 'Latent', 0, 0, 0, [], 0, '', False, False, 'positive', 'comma', 0, False, False, '', 1, '', 0, '', 0, '', True, False, False, False, 0) {}
webui-docker-auto-1  | Traceback (most recent call last):
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/call_queue.py", line 56, in f
webui-docker-auto-1  |     res = list(func(*args, **kwargs))
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/call_queue.py", line 37, in f
webui-docker-auto-1  |     res = func(*args, **kwargs)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/txt2img.py", line 56, in txt2img
webui-docker-auto-1  |     processed = process_images(p)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/processing.py", line 486, in process_images
webui-docker-auto-1  |     res = process_images_inner(p)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/processing.py", line 625, in process_images_inner
webui-docker-auto-1  |     uc = get_conds_with_caching(prompt_parser.get_learned_conditioning, negative_prompts, p.steps, cached_uc)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/processing.py", line 570, in get_conds_with_caching
webui-docker-auto-1  |     cache[1] = function(shared.sd_model, required_prompts, steps)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/prompt_parser.py", line 140, in get_learned_conditioning
webui-docker-auto-1  |     conds = model.get_learned_conditioning(texts)
webui-docker-auto-1  |   File "/stable-diffusion-webui/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py", line 669, in get_learned_conditioning
webui-docker-auto-1  |     c = self.cond_stage_model(c)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
webui-docker-auto-1  |     return forward_call(*input, **kwargs)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/sd_hijack_clip.py", line 229, in forward
webui-docker-auto-1  |     z = self.process_tokens(tokens, multipliers)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/sd_hijack_clip.py", line 254, in process_tokens
webui-docker-auto-1  |     z = self.encode_with_transformers(tokens)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/sd_hijack_clip.py", line 302, in encode_with_transformers
webui-docker-auto-1  |     outputs = self.wrapped.transformer(input_ids=tokens, output_hidden_states=-opts.CLIP_stop_at_last_layers)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
webui-docker-auto-1  |     result = hook(self, input)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/lowvram.py", line 35, in send_me_to_gpu
webui-docker-auto-1  |     module.to(devices.device)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 989, in to
webui-docker-auto-1  |     return self._apply(convert)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
webui-docker-auto-1  |     module._apply(fn)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
webui-docker-auto-1  |     module._apply(fn)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
webui-docker-auto-1  |     module._apply(fn)
webui-docker-auto-1  |   [Previous line repeated 2 more times]
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 664, in _apply
webui-docker-auto-1  |     param_applied = fn(param)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in convert
webui-docker-auto-1  |     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
webui-docker-auto-1  | RuntimeError: CUDA error: unspecified launch failure
webui-docker-auto-1  | CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
webui-docker-auto-1  | For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

@AbdBarho
Owner

The first error you got was just a timeout caused by a wonky internet connection; if you try building again it should be fixed (hopefully). Please keep pytorch pinned, otherwise you will get a lot of unexpected errors.
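As an aside, flaky downloads of the 1.8 GB torch wheel can usually be made more robust by giving pip more generous retry and timeout settings. A sketch of what the install step from the build log might look like with those flags added (the exact line in this repo's Dockerfile may differ):

```shell
# Raise pip's per-read network timeout and retry count so a dropped
# connection retries instead of failing the whole build.
pip install --retries 10 --timeout 120 \
    torch==1.13.1+cu117 torchvision \
    --extra-index-url https://download.pytorch.org/whl/cu117
```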

The second error seems weird. What is the output of this command?

docker run --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

If you get the same error, then it is probably a problem with Docker not being able to see your GPU.

Make sure you have the NVIDIA Container Toolkit installed and working.
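A few sanity checks for the toolkit installation that narrow down which layer is broken (command names per the standard NVIDIA/Docker tooling; adjust for your distro):

```shell
nvidia-smi                            # host driver can see the GPU
nvidia-ctk --version                  # toolkit CLI is installed
docker info --format '{{.Runtimes}}'  # should include an "nvidia" runtime
# End-to-end check: GPU visible from inside a container
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```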

@dmarx
Author

dmarx commented Apr 24, 2023

I think the issue might have been that I had nvidia-container-toolkit-base installed as well. I uninstalled both, reinstalled nvidia-container-toolkit, restarted, and I've got the test image generating successfully now. Not sure if the issue was that package or whether I just needed to restart. I'm only able to get Docker to see my GPU when I run with sudo, though, which I'm not a huge fan of... Anyway, it looks like the issue was me not realizing I'd skipped the prerequisites on a too-fresh Ubuntu re-install.
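On the sudo point: the standard way to run Docker without sudo (not something discussed further in this thread) is adding your user to the docker group. A sketch, with the caveat that docker-group membership is effectively root-equivalent access to the daemon:

```shell
# Add the current user to the docker group, then refresh group membership
# (or log out and back in) so the change applies to the shell.
sudo usermod -aG docker "$USER"
newgrp docker
docker run --rm hello-world   # should now work without sudo
```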

@dmarx dmarx closed this as completed Apr 24, 2023
@AviVarma

I've just had this issue too on Ubuntu 23.04. I fixed it by re-installing nvidia-container-toolkit!
