Search engine: docker deployment issues #415

Open
sbesson opened this issue Feb 2, 2024 · 5 comments

Comments

@sbesson
Member

sbesson commented Feb 2, 2024

Possibly affects the IDR monitoring stack as well

Initially reported by @dominikl in the context of a pilot VM. Deploying with

- role: ome.docker
  docker_use_ipv4_nic_mtu: True

currently fails with

RUNNING HANDLER [ome.docker : restart docker] *****************************************************************************************************************************************************************
fatal: [test120-searchengine]: FAILED! => {"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python"}, "changed": false, "msg": "Unable to restart service docker: Job for docker.service failed because the control process exited with error code. See \"systemctl status docker.service\" and \"journalctl -xe\" for details.\n"}

Looking at the logs

Feb 02 13:42:21 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:21.663010221Z" level=info msg="Starting up"
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:22.450727005Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:22.451427414Z" level=info msg="Loading containers: start."
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:22.529980377Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP 
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:22.530549152Z" level=error msg="Failed to set bridge MTU docker0 via netlink" error="invalid argument"
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:22.532190944Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: failed to start daemon: Error initializing network controller: error creating default "bridge" network: invalid argument
Feb 02 13:42:22 test120-searchengine.novalocal systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Feb 02 13:42:22 test120-searchengine.novalocal systemd[1]: Failed to start Docker Application Container Engine.

Removing /etc/docker/daemon.json, or simply dropping the mtu setting (docker_use_ipv4_nic_mtu: false), is enough for the Docker service to restart. However, docker ps still fails with

[sbesson@test120-searchengine ~]$ sudo docker ps
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
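To check what the daemon is being asked to apply before removing the file, the mtu entry can be separated from the rest of /etc/docker/daemon.json. This is a minimal sketch: it assumes the role writes a top-level "mtu" key, and the sample content (including the value 1450) is hypothetical rather than taken from the failing VM.

```python
import json


def split_mtu(daemon_json_text):
    """Return (mtu, remaining_config) from the text of a daemon.json file.

    mtu is None when the key is absent; remaining_config is the parsed
    configuration without the "mtu" key.
    """
    # An empty file is treated as an empty configuration.
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    remaining = dict(config)
    mtu = remaining.pop("mtu", None)
    return mtu, remaining


# Hypothetical file content; the real value depends on the NIC of the VM.
sample = '{"mtu": 1450, "log-driver": "json-file"}'
mtu, remaining = split_mtu(sample)
print("mtu:", mtu)
print(json.dumps(remaining, indent=2))
```

On a real host the text would come from reading /etc/docker/daemon.json, and the remaining configuration is what the daemon would see once the mtu setting is dropped.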

The version of Docker is

[sbesson@test120-searchengine ~]$ docker -v
Docker version 25.0.2, build 29cf629

while on a recent successful environment, it is

[sbesson@prod120-searchengine ~]$ docker -v
Docker version 24.0.7, build afdd53b
@sbesson
Member Author

sbesson commented Feb 2, 2024

Forcing the Docker version to 24.0.7

diff --git a/ansible/idr-docker.yml b/ansible/idr-docker.yml
index 2a53643..e87fc6a 100644
--- a/ansible/idr-docker.yml
+++ b/ansible/idr-docker.yml
@@ -6,7 +6,7 @@
   roles:
     - role: ome.docker
       docker_use_ipv4_nic_mtu: True
-
+      docker_version: 24.0.7
   tasks:
   - name: install docker-python
     become: yes

seems to be sufficient to make progress with the playbook, so I suspect an upstream change is incompatible with the way we deploy Docker using ome.docker.

@sbesson
Member Author

sbesson commented Feb 3, 2024

moby/moby#47308 looks related and is expected to be resolved with Docker 25.0.3 (or the migration to Rocky Linux 9)
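To decide whether a given host still needs the version pin, the output of docker -v can be compared against the versions discussed in this thread. This is a sketch under a stated assumption: only 25.0.2 is confirmed as failing here, so treating every 25.0.x release before 25.0.3 as affected is a guess, and the names below are illustrative.

```python
import re

# Only 25.0.2 is confirmed as failing in this thread; treating every 25.0.x
# release before 25.0.3 as affected is an assumption.
FIRST_AFFECTED = (25, 0, 0)
FIRST_EXPECTED_FIX = (25, 0, 3)


def parse_docker_version(output):
    """Extract (major, minor, patch) from the output of `docker -v`."""
    match = re.search(r"Docker version (\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"unrecognised version string: {output!r}")
    return tuple(int(part) for part in match.groups())


def needs_pin(output):
    """Return True when the reported version falls in the assumed affected range."""
    version = parse_docker_version(output)
    return FIRST_AFFECTED <= version < FIRST_EXPECTED_FIX


# The two version strings reported earlier in this issue.
for line in ("Docker version 25.0.2, build 29cf629",
             "Docker version 24.0.7, build afdd53b"):
    print(line, "->", "pin" if needs_pin(line) else "ok")
```

Tuples compare element by element, which avoids the pitfalls of comparing version strings lexically.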

@jburel
Member

jburel commented Feb 5, 2024

When testing devspace using the testing RHEL 9 VM, I had to edit the dockerd file.
What is currently in it is
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
while it is expecting
ExecStart=/usr/bin/dockerd -H unix:///var/run/docker.sock --containerd=/run/containerd/containerd.sock
Note that I did not have the issue on the physical RHEL 9 machine.
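As an alternative to editing the packaged file in place, the same ExecStart change could be kept in a systemd drop-in. The sketch below only renders the drop-in text. The location /etc/systemd/system/docker.service.d/override.conf is the usual convention rather than something taken from this thread, and the approach has not been tested on the VM in question.

```python
def render_dropin(exec_start):
    """Return the text of a systemd drop-in that replaces ExecStart."""
    # The empty ExecStart= line clears the packaged command; systemd would
    # otherwise reject a second ExecStart for this kind of service.
    return "[Service]\nExecStart=\n" + f"ExecStart={exec_start}\n"


# The command line reported as expected in the comment above.
EXPECTED = ("/usr/bin/dockerd -H unix:///var/run/docker.sock "
            "--containerd=/run/containerd/containerd.sock")
print(render_dropin(EXPECTED))
```

After writing such a file, systemctl daemon-reload followed by a restart of the docker service would be needed for it to take effect.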

@jburel
Member

jburel commented Feb 5, 2024

Downgrading to a 24.x version might also solve the problem I have when running devspace (omero-server takes a long time to start). I am currently running

docker --version
Docker version 25.0.2, build 29cf629

sbesson added a commit that referenced this issue Feb 5, 2024
See #415 for more context
This should be re-evaluated with Docker 25.0.3 or the migration to
Rocky Linux 9
@sbesson
Member Author

sbesson commented Feb 5, 2024

I was able to spin up test120 on Friday by downgrading Docker to the last 24.x version. I pushed 825c70b accordingly to unblock the creation of production and pilot environments. Once Docker 25.0.3 is released or we migrate to Rocky Linux 9, we can evaluate dropping the version pin.
