
Stuck in **starting** state #61

Closed
mousetwentytwo opened this issue Jul 17, 2023 · 14 comments · Fixed by #88

Comments

@mousetwentytwo

mousetwentytwo commented Jul 17, 2023

The add-on may appear stuck in the **starting** state.
Turning the watchdog off is advised in this case.

It looks like the healthcheck was introduced hardcoded to port 8888 with an HTTP curl call.
However, if the HTTP service is enabled it starts on 8889, while by default there is a TCP service on 8888.

Related:
Originally posted by @mousetwentytwo in #60 (comment)

Healthcheck code:

CMD curl --fail http://127.0.0.1:8888 || exit 1
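For reference, the full instruction in the Dockerfile presumably looks something like this (a reconstruction; the interval and timeout values are inferred from the discussion below):

HEALTHCHECK --interval=5m --timeout=3s \
   CMD curl --fail http://127.0.0.1:8888 || exit 1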

Not sure of the cause; it may be unrelated to HTTP.

@mousetwentytwo changed the title from Stuck in **starting** state if HTTP enabled due to container HEALTHCHECK to Stuck in **starting** state on Jul 17, 2023
@mainmind83

Same problem here, after upgrading to 23.2.1.

@ech0-py

ech0-py commented Jul 25, 2023

The same here.

Update: it seems that after the --interval=5m elapses, the container goes to the unhealthy state, and HA then treats it as running (watchdog is off).
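You can watch the health state transition directly with docker inspect (container name illustrative):

docker inspect -f '{{.State.Health.Status}}' addon_12341234_ebusd
# reports "starting", then "healthy" or "unhealthy" once the checks succeed or exhaust their retries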

@Danit2

Danit2 commented Jul 26, 2023

Same problem here.
And when you have the watchdog on, you get a reboot every 15 minutes.

@LukasGrebe
Owner

Unfortunately I cannot work on the code until about mid-August. That said, two thoughts:

  1. Regarding @mousetwentytwo's suggestion referenced above, would it be a good idea to check whether the daemon is up and running? Maybe by checking for a known result of an ebusctl call?
  2. Feel free to submit a pull request. I'm new to this too and need to read the docs and learn how this works…

Thank you for raising this issue!

@cociweb
Collaborator

cociweb commented Aug 11, 2023

Hello,
Some words about the current health check:
The healthcheck was introduced with #54, as seen here.

Docker containers have no explicit "starting" state; they have 'created' and 'running' states. In our case the container is in the running state:

$ docker inspect -f '{{.State.Status}}' addon_12341234_ebusd
running

The problem first appears when the container starts and there is no proper response to the curl command on http://127.0.0.1:8888 after 5 minutes, as described here:
https://github.com/LukasGrebe/ha-addons/blob/5dd56311f043f9238f1a3895d40f9365dd0eed21/ebusd/Dockerfile#L19C1-L21C50

I assume that ebusd is running on port 8888 and accepts only HTTP/0.9 requests (because other versions fail).

So, after entering the container with docker exec -it addon_12341234_ebusd /bin/bash, you can easily check the curl command:

$ curl --fail http://127.0.0.1:8888
curl: (1) Received HTTP/0.9 when not allowed

After narrowing down the HTTP request version, you get another error and curl hangs:

$ curl --http0.9 --fail-with-body http://127.0.0.1:8888
ERR: command not found

(Additionally, you can eliminate the hang with the '--max-time 1' parameter, but it does not solve the problem.)

Anyway, the ultimate goal should be a non-error (200 OK) response from ebusd via HTTP. I'm stuck here: I cannot get any prompt info from the daemon on either the TCP client (8888) or the HTTP client (8889) after authentication. So I think this (otherwise correct) direction is a dead end; moreover, these two ports are user-configurable... I assume we are not able to check the health of the ebusd service via HTTP requests.
As a workaround, we can check the status/availability of the container by using another service. I would recommend an additional lightweight HTTP service (Lighttpd or nginx) where we can curl/wget a dummy HTTP 200 answer on localhost on another port, or, more simply, a dummy shell script which always returns 0 (https://docs.docker.com/engine/reference/builder/#healthcheck)...
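A minimal sketch of that fallback, with a hypothetical healthcheck.sh shipped in the image:

#!/bin/sh
# healthcheck.sh (hypothetical file name) - dummy check that always reports healthy
exit 0

and in the Dockerfile:

COPY healthcheck.sh /usr/local/bin/healthcheck.sh
RUN chmod +x /usr/local/bin/healthcheck.sh
HEALTHCHECK --interval=5m --timeout=3s CMD /usr/local/bin/healthcheck.sh

(Note this only proves the container itself is up, not that ebusd is working.)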

Additionally, don't forget that the current image contains curl 8.1.2, which has several CVEs, so it should be updated to at least version 8.2.1 as soon as possible...

@ech0-py

ech0-py commented Aug 13, 2023

> I cannot get any prompt info from the daemon neither on TCP client (8888) nor on http client (8889) after authentication

For TCP, try echo "INFO" | nc localhost 8888:

version: ebusd 23.2.p20230716
update check: revision 23.2 available
device: 192.168.88.112:9999
signal: acquired
symbol rate: 23
max symbol rate: 96
min arbitration micros: 2
max arbitration micros: 49
min symbol latency: 5
max symbol latency: 57
scan: finished
... <cropped>...

For HTTP, it's curl http://localhost:8889/datatypes:

  {"type": "BCD", "isbits": false, "isadjustable": false, "isignored": false, "isreverse": false, "length": 1, "result": "number"},
  {"type": "BCD:2", "isbits": false, "isadjustable": false, "isignored": false, "isreverse": false, "length": 2, "result": "number"}
... <cropped>...

I believe all we need is to change the HEALTHCHECK to curl --fail http://127.0.0.1:8889/datatypes || exit 1 to prove that ebusd is still alive, but in that case --httpport=8889 is mandatory; it is present by default, but the user is able to remove it and thus break the healthcheck.

The other way is to check over TCP, but I'm not sure what should indicate the daemon's healthiness (the "signal" status? see the sketch below).
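For example, an untested sketch that greps the INFO output above for an acquired signal:

echo "INFO" | nc -w 2 localhost 8888 | grep -q "signal: acquired" || exit 1
# -w 2 gives nc a 2-second timeout so the check cannot hang indefinitely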

Unfortunately I'm not familiar with HA add-ons, so I don't know how to test either approach.

@cociweb
Collaborator

cociweb commented Aug 14, 2023

Well,
following @ech0-py's suggestion, the healthcheck can be done with nc as well (instead of curl). My proposal, based on that suggestion, is:

HEALTHCHECK --interval=5m --timeout=3s \
   CMD nc -z localhost 8888 || exit 1

I haven't tried it, but it should work. In this case port 8889 is not necessary.
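To verify by hand inside a running add-on container (container name illustrative, matching the inspect example above):

docker exec addon_12341234_ebusd nc -z localhost 8888; echo $?
# exit code 0 means the TCP port accepted a connection, i.e. ebusd is listening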

@LukasGrebe
Owner

@mousetwentytwo could you check whether the problem persists after the merge of @cociweb's fix?

@tjorim
Collaborator

tjorim commented Sep 24, 2023

It's still there: the fix does not change anything, as port 8888 is only enabled when the option to expose the HTTP server is set.

23-09-24 21:16:13 WARNING (MainThread) [supervisor.addons.addon] Timeout while waiting for addon eBUSd to start, took more then 120 seconds

@cociweb
Collaborator

cociweb commented Sep 24, 2023

@tjorim, have you tried restarting the Supervisor?
The fix solved it for me, and the container has been healthy for hours now.
Since the healthcheck runs inside the Docker container, there is no need to expose any ports.
My add-on also seems healthy from HA as well. It's worth restarting the Supervisor and HA Core.

If the Supervisor restart does not resolve your problem, maybe your Supervisor is trying to reach a dead/renamed Docker container. In that case, please try to reinstall your add-on; maybe something got messed up.
(As mentioned above, by default 8888 is used for the TCP service; the HTTP service is optional and by default uses 8889. As the TCP service always runs, the container netcats its own localhost, so no network config beyond the defaults is needed.)

@Danit2

Danit2 commented Sep 25, 2023

For me it works, but you must restart your system or the Supervisor.
Thanks for the work.

@ech0-py

ech0-py commented Sep 25, 2023

Yep, the fix works, but note that you have to wait 5 minutes for the container to become healthy, per HEALTHCHECK --interval=5m; until then you'll see the "starting" status and a spinner in the UI.

@LukasGrebe
Owner

@ech0-py should we reduce the interval to, say, 10s, or close this ticket as resolved?

@cociweb
Collaborator

cociweb commented Dec 6, 2023

Well, I've also hit this 5-minute delay today.
In the next PR we can add a function where the first query is issued only after the first 90 seconds. (In my opinion, at least 1 minute is required for startup on slower environments, at least after a fresh install...) My recommendation is to keep 5 minutes as the default interval.
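Docker's built-in --start-period flag is one way to get that behavior: checks still run from the start, but failures during the grace period don't count toward the unhealthy threshold. A sketch combining it with the current nc check:

HEALTHCHECK --interval=5m --timeout=3s --start-period=90s \
   CMD nc -z localhost 8888 || exit 1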
