Add docker healthcheck to all containers #415

Closed
tablatronix opened this issue Sep 28, 2021 · 8 comments · Fixed by #563

Comments

@tablatronix

It would be nice to have healthchecks for all containers.

Most can be pretty trivial, and examples probably already exist for most services.

https://docs.docker.com/engine/reference/builder/#healthcheck

@Paraphraser

Agree on both the nice-to-have and that it is simple enough to implement. I just did some experiments with the core MING components. Node-RED already has a health check but the others don't. This is what I came up with:

# mosquitto
    healthcheck:
      test: ["CMD", "nc", "-w", "1", "localhost", "1883"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s

# influxdb
    healthcheck:
      test: ["CMD", "curl", "http://localhost:8086"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s

# grafana
    healthcheck:
      test: ["CMD", "wget", "-O", "/dev/null", "http://localhost:3000"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s

The mixture of curl and wget is down to the order in which I did things. The grafana container doesn't have curl while the influxdb container has both.

The result on my test Pi (lots of experimental containers running):

$ DPS
NAMES            CREATED          STATUS
grafana          32 seconds ago   Up 31 seconds (healthy)
mosquitto        8 minutes ago    Up 8 minutes (healthy)
influxdb         15 minutes ago   Up 13 minutes (healthy)
pihole           56 minutes ago   Up 56 minutes (healthy)
prometheus       58 minutes ago   Up 58 minutes
nextcloud        58 minutes ago   Up 58 minutes
nodeexporter     58 minutes ago   Up 58 minutes
nodered          58 minutes ago   Up 58 minutes (healthy)
home_assistant   58 minutes ago   Up 58 minutes
homebridge       58 minutes ago   Up 58 minutes
traefik          58 minutes ago   Up 58 minutes
cadvisor         58 minutes ago   Up 58 minutes (healthy)
nextcloud_db     58 minutes ago   Up 58 minutes
portainer-ce     58 minutes ago   Up 58 minutes
homer            58 minutes ago   Up 58 minutes (healthy)
whoami           58 minutes ago   Up 58 minutes
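
DPS is not a Docker command and the thread never defines it. One plausible reading, offered purely as an assumption, is a small shell wrapper around docker ps with a custom table format. The function body and the name filter below are guesses for illustration, not the author's definition:

```shell
# Hypothetical stand-in for the DPS helper used in this thread.
# With no argument it lists all running containers; with one
# argument it filters the list by container name.
DPS() {
    if [ -n "${1:-}" ]; then
        docker ps --format 'table {{.Names}}\t{{.RunningFor}}\t{{.Status}}' --filter "name=$1"
    else
        docker ps --format 'table {{.Names}}\t{{.RunningFor}}\t{{.Status}}'
    fi
}
```

Containers that define a health check show their state in parentheses in the STATUS column. Containers without one show only the uptime.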

In the case of Mosquitto, we're building that from a Dockerfile so I'd lean towards adding it there. Ditto for MariaDB (which would also take care of nextcloud's database). The others (the ones we're not building from Dockerfiles) would need augmented service definitions.

But, taking a step back, I'm not sure what would happen if an upstream image started to supply its own health check. I assume (without checking) that the last one in the chain (ie IOTstack's) would prevail. What I'm more concerned about is if an upstream container started providing a better test.
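
One way to notice that an upstream image has started shipping its own check is to look at the image's recorded configuration. A minimal sketch, assuming the Docker CLI is on the path; the function name image_healthcheck is made up for illustration:

```shell
# Print the HEALTHCHECK recorded in an image's configuration as JSON.
# The argument is an image name or ID.
image_healthcheck() {
    docker inspect --format '{{json .Config.Healthcheck}}' "$1"
}
```

For example, `image_healthcheck eclipse-mosquitto:latest` could be run against the base image before and after a pull to see whether anything changed upstream.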

For example, Mosquitto issue 10 was opened in 2016 with very little movement but a recent post proposes running mosquitto_sub against a wildcard # topic:

test: ["CMD-SHELL", "mosquitto_sub -h $MQTT_HOST -p $MQTT_PORT -t '#' -u $MQTT_USER -P $MQTT_PASSWORD -C 1 | grep -v Error || exit 1"]

Aside from one or two cosmetic issues, that's quite promising. For example, not every IOTstack user will have gone to the trouble of setting up credentials, so those variables would need to be quoted so that they expand to null strings when unset.

I was tempted to use it but a bit of testing on my own system revealed a few wrinkles. The success of the mosquitto_sub command seems to depend on either having a "retained" message lying about or "getting lucky" with a message being published while the test is running. The command hangs if neither condition is met. I have no idea how a hung "health check" affects things so I added a timeout parameter.

Also, at least on my system, the mosquitto_sub command seems to behave differently depending on whether it is run inside the container or outside. Keep in mind that this is with no retained messages and an idle instance of the container that isn't receiving any messages from upstream publishers - which would be the starting point for any newly-spun-up IOTstack (ie the last thing a newbie IOTstacker needs is Mosquitto saying "unhealthy" when it's perfectly fine and just waiting for a message).

  • Run from outside:

     $ MQTT_PORT=1883
     $ unset MQTT_USER MQTT_PASSWORD
     $ mosquitto_sub -h localhost -p "$MQTT_PORT" -t "#" -u "$MQTT_USER" -P "$MQTT_PASSWORD" -W 2 -C 1
     $ echo $?
     0
    
  • Run from inside:

     $ docker exec -it mosquitto ash
     # MQTT_PORT=1883
     # unset MQTT_USER MQTT_PASSWORD
     # mosquitto_sub -h localhost -p "$MQTT_PORT" -t "#" -u "$MQTT_USER" -P "$MQTT_PASSWORD" -W 2 -C 1
     Timed out
     # echo $?
     27
     # exit
     $ 
    

The unset commands are making the point about null credentials, and that's before we start to worry about hiding credentials in compose files.

If I set up a retained message:

$ mosquitto_pub -h localhost -r -t 'test' -m 'data'

and repeat the tests, both return "0" immediately.

Anyway, outside the container always gets what I think of as the correct answer while inside the container the mileage varies depending on the situation and is, accordingly, prone to returning false "unhealthy" messages.
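
To make the distinction concrete, here is a small sketch that turns a mosquitto_sub exit status into a word. It relies only on the two statuses that appear in the transcripts above (0, and 27 alongside "Timed out"). Treating 27 as "idle" rather than "failed" is an assumption, and the function name is made up for illustration:

```shell
# Classify a mosquitto_sub exit status.
#   0     -> a message arrived
#   27    -> the -W timeout expired with no message (assumed to mean idle)
#   other -> treated as a failure
classify_sub_status() {
    case "$1" in
        0)  echo "received" ;;
        27) echo "idle" ;;
        *)  echo "failed" ;;
    esac
}
```

A health check built this way would still need some other evidence that the broker is accepting connections, because "idle" on its own says nothing about why no message arrived.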

An improved scheme might start with a retained mosquitto_pub to a known topic like "docker/healthcheck", embedding the current time in the message. The mosquitto_sub that follows would then check that it actually receives that same message.

However, I think you can probably see my point. Assuming all these issues could be addressed, this would actually be a better health check and I wouldn't want to risk getting in its way if it was adopted by the Eclipse people.

Thoughts?

@Paraphraser

Paraphraser commented Sep 29, 2021

How about this for Mosquitto?

Dockerfile:

  • add these lines:

     # copy the health-check script into place
     ENV HEALTHCHECK_SCRIPT="iotstack_healthcheck"
     COPY ${HEALTHCHECK_SCRIPT} /usr/local/bin/${HEALTHCHECK_SCRIPT}
     
     # define the health check
     HEALTHCHECK \
        --start-period=30s \
        --interval=30s \
        --timeout=10s \
        --retries=3 \
        CMD ${HEALTHCHECK_SCRIPT} || exit 1
  • completed result (for context):

     # Download base image
     FROM eclipse-mosquitto:latest
     
     # see https://github.com/alpinelinux/docker-alpine/issues/98
     RUN sed -i 's/https/http/' /etc/apk/repositories
     
     # Add support tools
     RUN apk update && apk add --no-cache rsync tzdata
     
     # where IOTstack template files are stored
     ENV IOTSTACK_DEFAULTS_DIR="iotstack_defaults"
     
     # copy template files to image
     COPY --chown=mosquitto:mosquitto ${IOTSTACK_DEFAULTS_DIR} /${IOTSTACK_DEFAULTS_DIR}
     
     # copy the health-check script into place
     ENV HEALTHCHECK_SCRIPT="iotstack_healthcheck"
     COPY ${HEALTHCHECK_SCRIPT} /usr/local/bin/${HEALTHCHECK_SCRIPT}
     
     # define the health check
     HEALTHCHECK \
        --start-period=30s \
        --interval=30s \
        --timeout=10s \
        --retries=3 \
        CMD ${HEALTHCHECK_SCRIPT} || exit 1
     
     # replace the docker entry-point script
     ENV IOTSTACK_ENTRY_POINT="docker-entrypoint.sh"
     COPY ${IOTSTACK_ENTRY_POINT} /${IOTSTACK_ENTRY_POINT}
     RUN chmod 755 /${IOTSTACK_ENTRY_POINT}
     ENV IOTSTACK_ENTRY_POINT=
     
     # IOTstack also declares these paths
     VOLUME ["/mosquitto/config", "/mosquitto/pwfile"]
     
     # EOF

Healthcheck script:

  • installed path (mode 755):

     ~/IOTstack/.templates/mosquitto/iotstack_healthcheck
    
  • script content:

     #!/usr/bin/env sh
     
     # assume the following environment variables, all of which may be null
     #    HEALTHCHECK_PORT
     #    HEALTHCHECK_USER
     #    HEALTHCHECK_PASSWORD
     #    HEALTHCHECK_TOPIC
     
     # set a default for the port
     HEALTHCHECK_PORT="${HEALTHCHECK_PORT:-1883}"
     
     # strip any quotes from username and password
     HEALTHCHECK_USER="$(eval echo $HEALTHCHECK_USER)"
     HEALTHCHECK_PASSWORD="$(eval echo $HEALTHCHECK_PASSWORD)"
     
     # set a default for the topic
     HEALTHCHECK_TOPIC="${HEALTHCHECK_TOPIC:-iotstack/mosquitto/healthcheck}"
     HEALTHCHECK_TOPIC="$(eval echo $HEALTHCHECK_TOPIC)"
     
     # record the current date and time for the test payload
     PUBLISH=$(date)
     
     # publish a retained message containing the timestamp
     mosquitto_pub \
        -h localhost \
        -p "$HEALTHCHECK_PORT" \
        -t "$HEALTHCHECK_TOPIC" \
        -m "$PUBLISH" \
        -u "$HEALTHCHECK_USER" \
        -P "$HEALTHCHECK_PASSWORD" \
        -r
     
     # did that succeed?
     if [ $? -eq 0 ] ; then
     
        # yes! now, subscribe to that same topic with a 2-second timeout
        # plus returning on the first message
        SUBSCRIBE=$(mosquitto_sub \
                     -h localhost \
                     -p "$HEALTHCHECK_PORT" \
                     -t "$HEALTHCHECK_TOPIC" \
                     -u "$HEALTHCHECK_USER" \
                     -P "$HEALTHCHECK_PASSWORD" \
                     -W 2 \
                     -C 1 \
                   )
     
        # did the subscribe succeed?
        if [ $? -eq 0 ] ; then
     
           # yes! do the publish and subscribe payloads compare equal?
           if [ "$PUBLISH" = "$SUBSCRIBE" ] ; then
     
              # yes! return success
              exit 0
     
           fi
     
        fi
        
     fi
     
     # otherwise, return failure
     exit 1

Basic operation

Credentials

Should work out-of-the-box on systems that do not have password schemes. For those that do, the following will need to be added to the service definition in docker-compose.yml:

    environment:
      - HEALTHCHECK_USER=someusername
      - HEALTHCHECK_PASSWORD=somepassword

In the original version of this, I wrote:

I haven't checked what happens if the right hand sides are quoted. Docker tends to pass everything after the "=" verbatim. That's why you can't quote TZ= because the right hand side is used to construct a path and the surrounding quotes get in the way. They might do harm here too.

I have since checked what happens and, indeed, all the quote marks do get passed through verbatim so I have updated the script to handle that problem for the username, password and topic variables. I also explicitly checked the return code from the subscribe.
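
The script strips quotes with eval echo, which also asks the shell to expand anything else found in the value. As an alternative sketch (not what the script in this thread does), surrounding quotes can be removed with POSIX parameter expansion alone. The function name strip_quotes is made up for illustration:

```shell
# Remove one pair of matching surrounding quotes (double or single)
# from a value, leaving everything else untouched.
strip_quotes() {
    value="$1"
    case "$value" in
        \"*\")
            value=${value#\"}
            value=${value%\"}
            ;;
        \'*\')
            value=${value#\'}
            value=${value%\'}
            ;;
    esac
    printf '%s\n' "$value"
}
```

In the health-check script this could be used as `HEALTHCHECK_USER="$(strip_quotes "$HEALTHCHECK_USER")"`. Unlike eval, it leaves characters such as `$` or spaces inside a password as they are.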

Listener port

In the reasonably unlikely event someone is using something other than internal port 1883, there's:

    environment:
      - HEALTHCHECK_PORT=12345

Test topic

The test topic defaults to iotstack/mosquitto/healthcheck. In the also somewhat unlikely event of that producing a collision, there's:

    environment:
      - HEALTHCHECK_TOPIC=some/other/topic

Basic test

The Dockerfile runs the test every 30 seconds so an external subscriber should be able to see the timestamps appearing with that frequency:

$ mosquitto_sub -v -h localhost -t "iotstack/mosquitto/healthcheck" -F "%I %t %p"
2021-09-29T16:46:28+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:46:28 AEST 2021
2021-09-29T16:46:59+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:46:59 AEST 2021
2021-09-29T16:47:29+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:47:29 AEST 2021
2021-09-29T16:47:59+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:47:59 AEST 2021
2021-09-29T16:48:29+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:48:29 AEST 2021
2021-09-29T16:49:00+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:49:00 AEST 2021
2021-09-29T16:49:30+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:49:30 AEST 2021
2021-09-29T16:50:00+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:50:00 AEST 2021
2021-09-29T16:50:31+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:50:31 AEST 2021
2021-09-29T16:51:01+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:51:01 AEST 2021
2021-09-29T16:51:31+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:51:31 AEST 2021
2021-09-29T16:52:02+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:52:02 AEST 2021
2021-09-29T16:52:32+1000 iotstack/mosquitto/healthcheck Wed Sep 29 16:52:32 AEST 2021
$ DPS mosquitto
NAMES       CREATED          STATUS
mosquitto   14 minutes ago   Up 14 minutes (healthy)

What do you think?

@Paraphraser

Oh, if I deliberately force a bad port by adding:

    environment:
      - HEALTHCHECK_PORT=12345

the result is:

NAMES            CREATED              STATUS
mosquitto        About a minute ago   Up About a minute (unhealthy)
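
When a container reports unhealthy, the health state and the output of recent probes can be read back with docker inspect. A minimal sketch, assuming the Docker CLI is on the path; the two function names are made up for illustration:

```shell
# Print the current health state of a container, as recorded by Docker
# for containers that define a health check.
health_status() {
    docker inspect --format '{{.State.Health.Status}}' "$1"
}

# Print Docker's record of recent health-check probes (exit codes and
# captured output) as JSON.
health_log() {
    docker inspect --format '{{json .State.Health.Log}}' "$1"
}
```

For the forced-failure case above, `health_log mosquitto` should help show what the probe script printed when it could not reach the broker on the bad port.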

@Paraphraser

I had some pull requests for Mosquitto open already. I did a lot more testing and I'm pretty happy with it so I've pushed the changes into the existing PRs:

  • PR406 - master branch
  • PR407 - old-menu branch
  • PR408 - experimental branch

I added a chunk of words about the topic to the IOTstack Mosquitto documentation on the master branch PR. The easiest way to see it in advance of the PR being accepted/rejected is via the PR branch at:

@tablatronix

Nice, I was going to look into this a little, but it looks like you jumped right on it. This also shows up nicely in portainer and lets you pull it in reports like cadvisor and nodeexporter.

Not sure about the precedence and overriding of built-in checks. Have you tested it with the nodered one? Either way it's good, and someone can expand on it later if they want to add anything advanced like real consistency checks for influxdb, or actual file system stuff.

Paraphraser added a commit to Paraphraser/IOTstack that referenced this issue Oct 2, 2021
Follows on from suggestion in [Issue 415](SensorsIot#415)
to add health-check to more containers. See also
[PR 406](SensorsIot@dbb6217).

Changes:

* Adds `iotstack_healthcheck.sh` script to template.
* Adds commands to Dockerfile to copy that script into the local image
and activate health-checking on launch.
* Describes health-check functionality in the MariaDB documentation.
* References MariaDB health-check documentation in NextCloud
documentation.
Paraphraser added a commit to Paraphraser/IOTstack that referenced this issue Oct 2, 2021
Follows on from suggestion in [Issue 415](SensorsIot#415)
to add health-check to more containers. See also
[PR 406](SensorsIot@dbb6217).

Changes:

* Adds `iotstack_healthcheck.sh` script to template.
* Adds commands to Dockerfile to copy that script into the local image
and activate health-checking on launch.
* Reduces old-menu MariaDB documentation to a stub pointing to new-menu
documentation (this is already the situation for old-menu NextCloud
documentation).
Paraphraser added a commit to Paraphraser/IOTstack that referenced this issue Oct 2, 2021
Follows on from suggestion in [Issue 415](SensorsIot#415)
to add health-check to more containers. See also
[PR 406](SensorsIot@dbb6217).

Changes:

* Adds `iotstack_healthcheck.sh` script to template.
* Moves Dockerfile into `buildFiles` directory, and adds commands to
copy the health-check script into the local image and activate
health-checking on launch.

Does not change any documentation on experimental branch.
@Paraphraser

@tablatronix I know that I can disable the built-in check that comes with the base Node-RED image, and that I can do it in either ~/IOTstack/docker-compose.yml or ~/IOTstack/services/nodered/Dockerfile, but I have not tried replacing it.

I think the Mosquitto script will turn out to be pretty robust and, if anyone does go to the trouble of proposing a similar PR for Eclipse-Mosquitto, having "ours" pre-empt "theirs" probably won't amount to a hill of beans.

For the benefit of anyone reading this issue who would like to take the Mosquitto iotstack_healthcheck.sh script and either use it as-is or improve it and then propose a PR for Eclipse-Mosquitto, please go right ahead and do that. The only reason I haven't done it myself is because of the need to register with the Eclipse Foundation and sign the Eclipse Contributor Agreement. I simply can't be bothered jumping through hoops like that but, at the same time, I have no intention of standing in the way of someone who is happy to jump through those hoops.

I've just submitted PRs for adding a similar script to MariaDB (which will be inherited by Nextcloud_DB):

  • PR416 - master branch
  • PR417 - old-menu branch
  • PR418 - experimental branch

In this case, the script runs mysqladmin ping which, supposedly, is a reasonably good test but can return false positives if the daemon isn't listening on port 3306, so I followed up with an "is something listening on port 3306?" test.
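
The two-step idea can be sketched as below. This is not the script from the PRs. It assumes mysqladmin and nc are available inside the container, borrows the HEALTHCHECK_PORT variable name from the Mosquitto script as a hypothetical override, and reuses the nc invocation style from the Mosquitto example earlier in this thread:

```shell
# Sketch of a two-step MariaDB health check.
# Returns 0 only if the daemon answers a ping AND something accepts
# TCP connections on the expected port.
mariadb_healthy() {
    port="${HEALTHCHECK_PORT:-3306}"

    # step 1: is the daemon answering pings?
    mysqladmin ping >/dev/null 2>&1 || return 1

    # step 2: is something listening on the port?
    nc -w 1 localhost "$port" </dev/null >/dev/null 2>&1 || return 1

    return 0
}
```

A script wrapping this would end with something like `mariadb_healthy && exit 0 ; exit 1` so that Docker sees a plain success or failure code.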

If I knew a bit more about MySQL/MariaDB I'd probably try to fashion something that went further. It's this scenario that worries me more because there's much greater potential for a true MySQL guru to come up with a really good health-check script, and I wouldn't want my "I suppose it's a bit better than having no health-checking at all" solution to block something better coming to us from upstream.

@simonmcnair

Sorry, I have no experience at all with Git or Docker but I thought I'd try and help get a health check merged in to Mosquitto. Please be gentle with your criticism ;-)

@simonmcnair

just noticed the topic may be incorrect in healthcheck.sh. I'm sure they'll change that

Paraphraser added a commit to Paraphraser/IOTstack that referenced this issue May 17, 2022
Adds health-check functionality to Grafana and InfluxDB 1.8, as
discussed in SensorsIot#415.

Health-check functionality already added to Mosquitto via SensorsIot#406.

Closes SensorsIot#415

Signed-off-by: Phill Kelley <34226495+Paraphraser@users.noreply.github.com>
Paraphraser added a commit to Paraphraser/IOTstack that referenced this issue May 17, 2022
Adds health-check functionality to Grafana and InfluxDB 1.8, as
discussed in SensorsIot#415.

Health-check functionality already added to Mosquitto via SensorsIot#409.

Closes SensorsIot#415

Signed-off-by: Phill Kelley <34226495+Paraphraser@users.noreply.github.com>
Paraphraser added a commit to Paraphraser/IOTstack that referenced this issue May 17, 2022
Adds health-check functionality to Grafana and InfluxDB 1.8, as
discussed in SensorsIot#415.

Health-check functionality already added to Mosquitto via SensorsIot#410.

Closes SensorsIot#415

Signed-off-by: Phill Kelley <34226495+Paraphraser@users.noreply.github.com>

3 participants