
Error building torch on clean docker compose --profile auto up --build #420

Closed
2 tasks done
dmarx opened this issue Apr 24, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@dmarx

dmarx commented Apr 24, 2023

Has this issue been opened before?

  • It is not in the FAQ, I checked.
  • It is not in the issues, I searched.

Describe the bug

First attempt at building. The docker compose --profile download up --build step worked fine; attempting to run docker compose --profile auto up --build resulted in the following error:

=> => extracting sha256:3fd92eeca8f54976c24de929011349e191dc349bf932629b  0.0s
 => [xformers 2/3] RUN apk add --no-cache aria2                            3.0s
 => [xformers 3/3] RUN aria2c -x 5 --dir / --out wheel.whl 'https://gith  24.0s
 => [download 2/8] COPY clone.sh /clone.sh                                 0.1s
 => [download 3/8] RUN . /clone.sh taming-transformers https://github.co  16.0s
 => [download 4/8] RUN . /clone.sh stable-diffusion-stability-ai https:/  11.4s
 => [download 5/8] RUN . /clone.sh CodeFormer https://github.com/sczhou/C  2.0s
 => [download 6/8] RUN . /clone.sh BLIP https://github.com/salesforce/BLI  1.8s
 => [download 7/8] RUN . /clone.sh k-diffusion https://github.com/crowson  0.8s
 => [download 8/8] RUN . /clone.sh clip-interrogator https://github.com/p  0.9s
 => ERROR [stage-2  2/15] RUN --mount=type=cache,target=/root/.cache/pip  64.7s
------
 > [stage-2  2/15] RUN --mount=type=cache,target=/root/.cache/pip   pip install torch==1.13.1+cu117 torchvision --extra-index-url https://download.pytorch.org/whl/cu117:
#0 1.315 Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu117
#0 2.060 Collecting torch==1.13.1+cu117
#0 2.077   Downloading https://download.pytorch.org/whl/cu117/torch-1.13.1%2Bcu117-cp310-cp310-linux_x86_64.whl (1801.8 MB)
#0 63.72      ━━━━━━━━━━━                              0.5/1.8 GB 11.7 MB/s eta 0:01:52
#0 63.72 ERROR: Exception:
#0 63.72 Traceback (most recent call last):
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 437, in _error_catcher
#0 63.72     yield
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 560, in read
#0 63.72     data = self._fp_read(amt) if not fp_closed else b""
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 526, in _fp_read
#0 63.72     return self._fp.read(amt) if amt is not None else self._fp.read()
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/cachecontrol/filewrapper.py", line 90, in read
#0 63.72     data = self.__fp.read(amt)
#0 63.72   File "/usr/local/lib/python3.10/http/client.py", line 465, in read
#0 63.72     s = self.fp.read(amt)
#0 63.72   File "/usr/local/lib/python3.10/socket.py", line 705, in readinto
#0 63.72     return self._sock.recv_into(b)
#0 63.72   File "/usr/local/lib/python3.10/ssl.py", line 1274, in recv_into
#0 63.72     return self.read(nbytes, buffer)
#0 63.72   File "/usr/local/lib/python3.10/ssl.py", line 1130, in read
#0 63.72     return self._sslobj.read(len, buffer)
#0 63.72 TimeoutError: The read operation timed out
#0 63.72 
#0 63.72 During handling of the above exception, another exception occurred:
#0 63.72 
#0 63.72 Traceback (most recent call last):
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/cli/base_command.py", line 160, in exc_logging_wrapper
#0 63.72     status = run_func(*args)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/cli/req_command.py", line 247, in wrapper
#0 63.72     return func(self, options, args)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/commands/install.py", line 400, in run
#0 63.72     requirement_set = resolver.resolve(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 92, in resolve
#0 63.72     result = self._result = resolver.resolve(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 481, in resolve
#0 63.72     state = resolution.resolve(requirements, max_rounds=max_rounds)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 348, in resolve
#0 63.72     self._add_to_criteria(self.state.criteria, r, parent=None)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 172, in _add_to_criteria
#0 63.72     if not criterion.candidates:
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/structs.py", line 151, in __bool__
#0 63.72     return bool(self._sequence)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 155, in __bool__
#0 63.72     return any(self)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 143, in <genexpr>
#0 63.72     return (c for c in iterator if id(c) not in self._incompatible_ids)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 47, in _iter_built
#0 63.72     candidate = func()
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 206, in _make_candidate_from_link
#0 63.72     self._link_candidate_cache[link] = LinkCandidate(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 297, in __init__
#0 63.72     super().__init__(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 162, in __init__
#0 63.72     self.dist = self._prepare()
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 231, in _prepare
#0 63.72     dist = self._prepare_distribution()
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 308, in _prepare_distribution
#0 63.72     return preparer.prepare_linked_requirement(self._ireq, parallel_builds=True)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 491, in prepare_linked_requirement
#0 63.72     return self._prepare_linked_requirement(req, parallel_builds)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 536, in _prepare_linked_requirement
#0 63.72     local_file = unpack_url(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 166, in unpack_url
#0 63.72     file = get_http_url(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 107, in get_http_url
#0 63.72     from_path, content_type = download(link, temp_dir.path)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/network/download.py", line 147, in __call__
#0 63.72     for chunk in chunks:
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/cli/progress_bars.py", line 53, in _rich_progress_bar
#0 63.72     for chunk in iterable:
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/network/utils.py", line 63, in response_chunks
#0 63.72     for chunk in response.raw.stream(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 621, in stream
#0 63.72     data = self.read(amt=amt, decode_content=decode_content)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 559, in read
#0 63.72     with self._error_catcher():
#0 63.72   File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
#0 63.72     self.gen.throw(typ, value, traceback)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 442, in _error_catcher
#0 63.72     raise ReadTimeoutError(self._pool, None, "Read timed out.")
#0 63.72 pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='download.pytorch.org', port=443): Read timed out.
#0 64.51 
#0 64.51 [notice] A new release of pip available: 22.3.1 -> 23.1.1
#0 64.51 [notice] To update, run: pip install --upgrade pip
------
failed to solve: executor failed running [/bin/sh -c pip install torch==1.13.1+cu117 torchvision --extra-index-url https://download.pytorch.org/whl/cu117]: exit code: 2

Which UI

auto

Hardware / Software

  • OS: Ubuntu 22.04
  • OS version: Ubuntu 22.04.2 LTS
  • Docker Version:
Client: Docker Engine - Community
 Cloud integration: v1.0.31
 Version:           23.0.4
 API version:       1.41 (downgraded from 1.42)
 Go version:        go1.19.8
 Git commit:        f480fb1
 Built:             Fri Apr 14 10:32:03 2023
 OS/Arch:           linux/amd64
 Context:           desktop-linux

Server: Docker Desktop 4.18.0 (104112)
 Engine:
  Version:          20.10.24
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.19.7
  Git commit:       5d6db84
  Built:            Tue Apr  4 18:18:42 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.18
  GitCommit:        2456e983eb9e37e47538f59ea18f2043c9a73640
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
  • Docker compose version: v2.17.2
  • Repo version: 2a0de02
  • RAM: plenty
  • GPU/VRAM: 3090
@dmarx dmarx added the bug Something isn't working label Apr 24, 2023
@dmarx
Author

dmarx commented Apr 24, 2023

Let's see if this works.

EDIT: yeah... don't do this. That torch version is pinned for a reason.

@dmarx
Author

dmarx commented Apr 24, 2023

New error now after unpinning torch:

 => [stage-2 15/15] WORKDIR /stable-diffusion-webui                                            0.0s 
 => exporting to image                                                                        21.1s 
 => => exporting layers                                                                       21.1s 
 => => writing image sha256:500eb74eac4bb4c9d06516f9f971fdbee75013b509c002666788f73fbe08b742   0.0s 
 => => naming to docker.io/library/sd-auto:51                                                  0.0s
[+] Running 1/1
 ✔ Container webui-docker-auto-1  Created                                                      0.2s 
Attaching to webui-docker-auto-1
Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown

I think this means I'm missing my CUDA drivers?

@dmarx
Author

dmarx commented Apr 24, 2023

Deleted and rebuilt the containers and images; still no luck.

=> => exporting layers                                                                                                        0.0s
 => => writing image sha256:500eb74eac4bb4c9d06516f9f971fdbee75013b509c002666788f73fbe08b742                                   0.0s
 => => naming to docker.io/library/sd-auto:51                                                                                  0.0s
[+] Running 1/0
 ✔ Container webui-docker-auto-1  Created                                                                                      0.0s 
Attaching to webui-docker-auto-1
Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown

@dmarx
Author

dmarx commented Apr 24, 2023

Tried sudo-ing the command; that seems to have at least gotten past the previous error. I believe the root of the problem is discussed here: NVIDIA/nvidia-container-toolkit#154
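For reference, the usual fix in situations like the one discussed in that linked issue is re-registering the NVIDIA runtime with Docker and restarting the daemon. This is a hedged sketch (assumes the toolkit's nvidia-ctk helper is installed; it was not part of this thread):

```shell
# Re-register the NVIDIA runtime in Docker's daemon config,
# then restart Docker so the change takes effect.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```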

@dmarx
Author

dmarx commented Apr 24, 2023

Services build, but I'm getting an error when trying to run a test prompt with everything else set to defaults...

webui-docker-auto-1  | Running on local URL:  http://0.0.0.0:7860
webui-docker-auto-1  | 
webui-docker-auto-1  | To create a public link, set `share=True` in `launch()`.
webui-docker-auto-1  | Startup time: 13.9s (import gradio: 0.8s, import ldm: 0.4s, other imports: 1.2s, load scripts: 0.2s, load SD checkpoint: 10.9s, create ui: 0.1s).
webui-docker-auto-1  | Error completing request
webui-docker-auto-1  | Arguments: ('task(td9v3amy7jrkdya)', 'a delicious cheeseburger', '', [], 20, 0, False, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 512, 512, False, 0.7, 2, 'Latent', 0, 0, 0, [], 0, '', False, False, 'positive', 'comma', 0, False, False, '', 1, '', 0, '', 0, '', True, False, False, False, 0) {}
webui-docker-auto-1  | Traceback (most recent call last):
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/call_queue.py", line 56, in f
webui-docker-auto-1  |     res = list(func(*args, **kwargs))
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/call_queue.py", line 37, in f
webui-docker-auto-1  |     res = func(*args, **kwargs)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/txt2img.py", line 56, in txt2img
webui-docker-auto-1  |     processed = process_images(p)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/processing.py", line 486, in process_images
webui-docker-auto-1  |     res = process_images_inner(p)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/processing.py", line 625, in process_images_inner
webui-docker-auto-1  |     uc = get_conds_with_caching(prompt_parser.get_learned_conditioning, negative_prompts, p.steps, cached_uc)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/processing.py", line 570, in get_conds_with_caching
webui-docker-auto-1  |     cache[1] = function(shared.sd_model, required_prompts, steps)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/prompt_parser.py", line 140, in get_learned_conditioning
webui-docker-auto-1  |     conds = model.get_learned_conditioning(texts)
webui-docker-auto-1  |   File "/stable-diffusion-webui/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py", line 669, in get_learned_conditioning
webui-docker-auto-1  |     c = self.cond_stage_model(c)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
webui-docker-auto-1  |     return forward_call(*input, **kwargs)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/sd_hijack_clip.py", line 229, in forward
webui-docker-auto-1  |     z = self.process_tokens(tokens, multipliers)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/sd_hijack_clip.py", line 254, in process_tokens
webui-docker-auto-1  |     z = self.encode_with_transformers(tokens)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/sd_hijack_clip.py", line 302, in encode_with_transformers
webui-docker-auto-1  |     outputs = self.wrapped.transformer(input_ids=tokens, output_hidden_states=-opts.CLIP_stop_at_last_layers)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
webui-docker-auto-1  |     result = hook(self, input)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/lowvram.py", line 35, in send_me_to_gpu
webui-docker-auto-1  |     module.to(devices.device)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 989, in to
webui-docker-auto-1  |     return self._apply(convert)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
webui-docker-auto-1  |     module._apply(fn)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
webui-docker-auto-1  |     module._apply(fn)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
webui-docker-auto-1  |     module._apply(fn)
webui-docker-auto-1  |   [Previous line repeated 2 more times]
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 664, in _apply
webui-docker-auto-1  |     param_applied = fn(param)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in convert
webui-docker-auto-1  |     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
webui-docker-auto-1  | RuntimeError: CUDA error: unspecified launch failure
webui-docker-auto-1  | CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
webui-docker-auto-1  | For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

@AbdBarho
Owner

The first error you got was just a timeout caused by a wonky internet connection; if you try building again it should be fixed (hopefully). Please keep pytorch pinned, otherwise you will get a lot of unexpected errors.
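As an aside, flaky downloads of the 1.8 GB torch wheel can usually be made more robust by giving pip more generous retry and timeout settings. A sketch of what the install step from the build log might look like with those flags added (the exact line in this repo's Dockerfile may differ):

```shell
# Raise pip's per-read network timeout and retry count so a dropped
# connection retries instead of failing the whole build.
pip install --retries 10 --timeout 120 \
    torch==1.13.1+cu117 torchvision \
    --extra-index-url https://download.pytorch.org/whl/cu117
```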

The second error seems weird. What is the output of this command?

docker run --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

If you get the same error, then it is probably a problem with Docker not being able to see your GPU.

Make sure you have the NVIDIA Container Toolkit installed and working.
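A few sanity checks for the toolkit installation that narrow down which layer is broken (command names per the standard NVIDIA/Docker tooling; adjust for your distro):

```shell
nvidia-smi                            # host driver can see the GPU
nvidia-ctk --version                  # toolkit CLI is installed
docker info --format '{{.Runtimes}}'  # should include an "nvidia" runtime
# End-to-end check: GPU visible from inside a container
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```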

@dmarx
Author

dmarx commented Apr 24, 2023

I think the issue might have been that I had nvidia-container-toolkit-base installed as well. I uninstalled both, reinstalled nvidia-container-toolkit, restarted, and I've got the test image generating successfully now. Not sure if the issue was that package or whether I just needed to restart. I'm only able to get Docker to see my GPU when I run with sudo, though, which I'm not a huge fan of... Anyway, it looks like the issue was me not realizing I'd skipped the prerequisites on a too-fresh Ubuntu re-install.
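On the sudo point: the standard way to run Docker without sudo (not something discussed further in this thread) is adding your user to the docker group. A sketch, with the caveat that docker-group membership is effectively root-equivalent access to the daemon:

```shell
# Add the current user to the docker group, then refresh group membership
# (or log out and back in) so the change applies to the shell.
sudo usermod -aG docker "$USER"
newgrp docker
docker run --rm hello-world   # should now work without sudo
```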

@dmarx dmarx closed this as completed Apr 24, 2023
@AviVarma

I've just had this issue too on Ubuntu 23.04. I fixed it by re-installing nvidia-container-toolkit!
