
Stuck in **starting** state #61

Closed
mousetwentytwo opened this issue Jul 17, 2023 · 14 comments · Fixed by #88

Comments

@mousetwentytwo

mousetwentytwo commented Jul 17, 2023

The add-on may appear stuck in the **starting** state.
Turning the watchdog off is advised in this case.

It looks like the healthcheck was introduced hardcoded to port 8888 with an HTTP curl call.
However, if the HTTP service is enabled it starts on 8889, while by default there is a TCP service on 8888.

Related:
Originally posted by @mousetwentytwo in #60 (comment)

Healthcheck code:

CMD curl --fail http://127.0.0.1:8888 || exit 1
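For reference, the full instruction in the Dockerfile presumably looks something like this (a reconstruction; the interval and timeout values are inferred from the discussion below):

HEALTHCHECK --interval=5m --timeout=3s \
   CMD curl --fail http://127.0.0.1:8888 || exit 1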

Not sure of the cause; it may be unrelated to HTTP.

@mousetwentytwo changed the title from Stuck in **starting** state if HTTP enabled due to container HEALTHCHECK to Stuck in **starting** state on Jul 17, 2023
@mainmind83

Same problem here, after upgrading to 23.2.1.

@ech0-py

ech0-py commented Jul 25, 2023

The same here.

Update: it seems that after the --interval=5m elapses, the container goes to the unhealthy state, and HA then treats it as running (watchdog is off).
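You can watch the health state transition directly with docker inspect (container name illustrative):

docker inspect -f '{{.State.Health.Status}}' addon_12341234_ebusd
# reports "starting", then "healthy" or "unhealthy" once the checks succeed or exhaust their retries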

@Danit2

Danit2 commented Jul 26, 2023

Same problem here.
And when you have the watchdog on, you get a reboot every 15 minutes.

@LukasGrebe
Owner

Unfortunately I cannot work on the code until about mid-August. That said, two thoughts:

  1. Regarding @mousetwentytwo's suggestion referenced above, would it be a good idea to check whether the daemon is up and running? Maybe by checking for a known result of an ebusctl call?
  2. Feel free to submit a pull request. I'm new to this too and need to read the docs and learn how this works…

Thank you for raising this issue!

@cociweb
Collaborator

cociweb commented Aug 11, 2023

Hello,
Some words about the current health check:
The healthcheck was introduced with #54, as seen here.

Docker containers have no explicit "starting" state; they have 'created' and 'running' states. In our case the container is in the running state:

$ docker inspect -f '{{.State.Status}}' addon_12341234_ebusd
running

The problem first appears when the container starts and there is no proper response to the curl command on http://127.0.0.1:8888 after 5 minutes, as described here:
https://github.com/LukasGrebe/ha-addons/blob/5dd56311f043f9238f1a3895d40f9365dd0eed21/ebusd/Dockerfile#L19C1-L21C50

I assume that ebusd is running on port 8888 and accepts only HTTP/0.9 requests (because other versions fail).

So, after entering the container with docker exec -it addon_12341234_ebusd /bin/bash, you can easily check the curl command:

$ curl --fail http://127.0.0.1:8888
curl: (1) Received HTTP/0.9 when not allowed

After narrowing down the HTTP request version, you get another error and curl hangs:

$ curl --http0.9 --fail-with-body http://127.0.0.1:8888
ERR: command not found

(Additionally, you can eliminate the hang with the '--max-time 1' parameter, but it does not solve the problem.)

Anyway, the ultimate goal should be a non-error (200 OK) response from ebusd via HTTP. I'm stuck here: I cannot get any prompt info from the daemon on either the TCP client (8888) or the HTTP client (8889) after authentication. So I think this (otherwise correct) direction is a dead end; moreover, these two ports are user-configurable... I assume we are not able to check the health of the ebusd service via HTTP requests.
As a workaround, we can check the status/availability of the container by using another service. I would recommend an additional lightweight HTTP service (Lighttpd or nginx) where we can curl/wget a dummy HTTP 200 answer on localhost on another port, or, more simply, a dummy shell script which always returns 0 (https://docs.docker.com/engine/reference/builder/#healthcheck)...
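A minimal sketch of that fallback, with a hypothetical healthcheck.sh shipped in the image:

#!/bin/sh
# healthcheck.sh (hypothetical file name) - dummy check that always reports healthy
exit 0

and in the Dockerfile:

COPY healthcheck.sh /usr/local/bin/healthcheck.sh
RUN chmod +x /usr/local/bin/healthcheck.sh
HEALTHCHECK --interval=5m --timeout=3s CMD /usr/local/bin/healthcheck.sh

(Note this only proves the container itself is up, not that ebusd is working.)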

Additionally, don't forget that the current image contains curl 8.1.2, which has several CVEs, so it should be updated to at least version 8.2.1 as soon as possible...

@ech0-py

ech0-py commented Aug 13, 2023

> I cannot get any prompt info from the daemon neither on TCP client (8888) nor on http client (8889) after authentication

For TCP, try echo "INFO" | nc localhost 8888:

version: ebusd 23.2.p20230716
update check: revision 23.2 available
device: 192.168.88.112:9999
signal: acquired
symbol rate: 23
max symbol rate: 96
min arbitration micros: 2
max arbitration micros: 49
min symbol latency: 5
max symbol latency: 57
scan: finished
... <cropped>...

For HTTP, it's curl http://localhost:8889/datatypes:

  {"type": "BCD", "isbits": false, "isadjustable": false, "isignored": false, "isreverse": false, "length": 1, "result": "number"},
  {"type": "BCD:2", "isbits": false, "isadjustable": false, "isignored": false, "isreverse": false, "length": 2, "result": "number"}
... <cropped>...

I believe all we need is to change the HEALTHCHECK to curl --fail http://127.0.0.1:8889/datatypes || exit 1 to prove that ebusd is still alive, but in that case --httpport=8889 is mandatory; it is present by default, but the user is able to remove it and thus break the healthcheck.

The other way is to check over TCP, but I'm not sure what should indicate the daemon's healthiness (the "signal" status? see the sketch below).
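For example, an untested sketch that greps the INFO output above for an acquired signal:

echo "INFO" | nc -w 2 localhost 8888 | grep -q "signal: acquired" || exit 1
# -w 2 gives nc a 2-second timeout so the check cannot hang indefinitely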

Unfortunately I'm not familiar with HA add-ons, so I don't know how to test either approach.

@cociweb
Collaborator

cociweb commented Aug 14, 2023

Well,
following @ech0-py's suggestion, the healthcheck can be done with nc as well (instead of curl). My proposal, based on that suggestion, is:

HEALTHCHECK --interval=5m --timeout=3s \
   CMD nc -z localhost 8888 || exit 1

I haven't tried it, but it should work. In this case port 8889 is not necessary.
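To verify by hand inside a running add-on container (container name illustrative, matching the inspect example above):

docker exec addon_12341234_ebusd nc -z localhost 8888; echo $?
# exit code 0 means the TCP port accepted a connection, i.e. ebusd is listening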

@LukasGrebe
Owner

@mousetwentytwo could you check whether the problem persists after the merge of @cociweb's fix?

@tjorim
Collaborator

tjorim commented Sep 24, 2023

It's still there: the fix does not change anything, as port 8888 is only enabled when the option to expose the HTTP server is set.

23-09-24 21:16:13 WARNING (MainThread) [supervisor.addons.addon] Timeout while waiting for addon eBUSd to start, took more then 120 seconds

@cociweb
Collaborator

cociweb commented Sep 24, 2023

@tjorim, have you tried restarting the Supervisor?
The fix solved it for me, and the container has been healthy for hours now.
Since the healthcheck runs inside the Docker container, there is no need to expose any ports.
My add-on also seems healthy from HA as well. It's worth restarting the Supervisor and HA Core.

If the Supervisor restart does not resolve your problem, maybe your Supervisor is trying to reach a dead/renamed Docker container. In that case, please try to reinstall your add-on; maybe something got messed up.
(As mentioned above, by default 8888 is used for the TCP service; the HTTP service is optional and by default uses 8889. As the TCP service always runs, the container netcats its own localhost, so no network config beyond the defaults is needed.)

@Danit2

Danit2 commented Sep 25, 2023

For me it works, but you must restart your system or the Supervisor.
Thanks for the work.

@ech0-py

ech0-py commented Sep 25, 2023

Yep, the fix works, but note that you have to wait 5 minutes for the container to become healthy, per HEALTHCHECK --interval=5m; until then you'll see the "starting" status and a spinner in the UI.

@LukasGrebe
Owner

@ech0-py should we reduce the interval to, say, 10s, or close this ticket as resolved?

@cociweb
Collaborator

cociweb commented Dec 6, 2023

Well, I've also hit this 5-minute delay today.
In the next PR we can add a function where the first query is issued only after the first 90 seconds. (In my opinion, at least 1 minute is required for startup on slower environments, at least after a fresh install...) My recommendation is to keep 5 minutes as the default interval.
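Docker's built-in --start-period flag is one way to get that behavior: checks still run from the start, but failures during the grace period don't count toward the unhealthy threshold. A sketch combining it with the current nc check:

HEALTHCHECK --interval=5m --timeout=3s --start-period=90s \
   CMD nc -z localhost 8888 || exit 1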
