Healthcheck not restarting unhealthy module #6358

gri6507 · 2022-05-17T18:57:18Z

Expected Behavior

When a custom module's healthcheck fails, thus putting the module into "unhealthy" status, edgeAgent should restart the module

Current Behavior

edgeAgent does not restart the module

Steps to Reproduce

Create a custom module with the following at the end of it's Dockerfile

HEALTHCHECK --interval=15s --timeout=3s --retries=2 \
  CMD exit 1

Deploy this module with a deployment.json which has modulesContent.$edgeAgent.properties.desired

"modules": {
  "MyModule": {
    "version": "34136",
    "type": "docker",
    "status": "running",
    "restartPolicy": "on-failure",
    "settings": {
      "image": "...",
      "createOptions": "{\"Healthcheck\":{\"Test\":[],\"Interval\":0,\"Timeout\":0,\"Retries\":0,\"StartPeriod\":0}, \"HostConfig\":{\"CapAdd\":[\"SYS_PTRACE\"], \"ExtraHosts\":[\"dockerhost:172.16.99.99\"], \"Binds\":[\"/etc/daikin-software-gateway/:/etc/storage/\"]}}"
    },
    "env": {
      "storageFolder": {
        "value": "/etc/storage/"
      }
    }
  }
}

According to documentation, those Test settings should just inherit the healthcheck definition from the container.

Let the deployment happen. Then run sudo iotedge list && sudo docker container ls and observe that iotedge lists the module as running and docker shows the state of the container as Up 10 seconds (starting)
Wait until the container has run for at least 30 seconds, which comes from the Dockerfile --interval=15s --retries=2 settings. Observe that iotedge lists the module as running and docker shows the state of the container as Up 1 minute (unhealthy)
Observe that regardless of how long you let edgeAgent run, it does not restart the custom module. Furthermore, sudo docker logs -f edgeAgent | grep "scheduled to restart after" does not show any attempts to restart the custom module.

Context (Environment)

Output of `iotedge check`

Click to show

Configuration checks
--------------------
√ config.yaml is well-formed - OK
√ config.yaml has well-formed connection string - OK
√ container engine is installed and functional - OK
√ config.yaml has correct hostname - OK
√ config.yaml has correct URIs for daemon mgmt endpoint - OK
√ latest security daemon - OK
√ host time is close to real time - OK
√ container time is close to host time - OK
√ DNS server - OK
√ production readiness: identity certificates expiry - OK
√ production readiness: certificates - OK
√ production readiness: logs policy - OK
‼ production readiness: Edge Agent's storage directory is persisted on the host filesystem - Warning
    The edgeAgent module is not configured to persist its /tmp/edgeAgent directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.
‼ production readiness: Edge Hub's storage directory is persisted on the host filesystem - Warning
    The edgeHub module is not configured to persist its /tmp/edgeHub directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.

Connectivity checks
-------------------
√ host can connect to and perform TLS handshake with DPS endpoint - OK
√ host can connect to and perform TLS handshake with IoT Hub AMQP port - OK
√ host can connect to and perform TLS handshake with IoT Hub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with IoT Hub MQTT port - OK
√ container on the default network can connect to IoT Hub AMQP port - OK
√ container on the default network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the default network can connect to IoT Hub MQTT port - OK
√ container on the IoT Edge module network can connect to IoT Hub AMQP port - OK
√ container on the IoT Edge module network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to IoT Hub MQTT port - OK

22 check(s) succeeded.
2 check(s) raised warnings. Re-run with --verbose for more details.

Device Information

Host OS Ubuntu 18.04]:
Architecture [amd64,]:
Container OS Linux containers]:

Runtime Versions

iotedge 1.1.13
mcr.microsoft.com/azureiotedge-agent 1.1 d1bbbd315877
mcr.microsoft.com/azureiotedge-hub 1.1 c25873c1704a
Docker/Moby [run docker version]: 20.10.8+azure

Additional Information

I have also tried a deployment with an explicit Healthcheck specification, but that also did not work

    "modules": {
      "MyModule": {
        "version": "34136",
        "type": "docker",
        "status": "running",
        "restartPolicy": "on-failure",
        "settings": {
          "image": "...",
          "createOptions": "{\"Healthcheck\":{\"Test\",[\"CMD\":\"exit 1\"],\"Interval\":15000000000,\"Timeout\":3000000000,\"Retries\":2,\"StartPeriod\":1000000000}, \"HostConfig\":{\"CapAdd\":[\"SYS_PTRACE\"], \"ExtraHosts\":[\"dockerhost:172.16.99.99\"], \"Binds\":[\"/etc/daikin-software-gateway/:/etc/storage/\"]}}"
        },  
        "env": {
          "storageFolder": {
            "value": "/etc/storage/"
          }   
        }   
      }   
    }

The text was updated successfully, but these errors were encountered:

gri6507 · 2022-05-17T19:00:15Z

After doing more searching in this Git repo, I found #5746. That ticket talks about an on-unhealthy restartPolicy. According to docker documentation (and that ticket), that is an unsupported restartPolicy. In my case, I am using on-failure restartPolicy. Is that also not supported for failed health checks?

gauravIoTEdge · 2022-05-24T23:17:41Z

@gri6507 - Yes, I'm afraid that policy is also not supported.

Currently, our recommendation is that you use the docker health checks. Like the issue you linked suggested, you can post it as a suggestion and vote - https://aka.ms/iotedge-suggestion.

I'll work with someone to make sure the documentation reflects this too. Sorry about that.

veyalla · 2022-05-25T03:57:34Z

As workaround, you can achieve the desired result without relying on IoT Edge runtime. For example for the health check below:

HEALTHCHECK \
 --interval=30s \
 --timeout=30s \
 --start-period=60s \
 --retries=3 \
 CMD curl -f http://localhost:9070/health || exit 1

...use these createOptions:

"createOptions": "{\"Healthcheck\":{\"Test\":[\"CMD-SHELL\",\"curl -f http://localhost:9070/health || exit 1\"],\"Interval\":30000000000,\"Timeout\":30000000000,\"Retries\":1,\"StartPeriod\":60000000000},\"HostConfig\":{\"PortBindings\":{\"9070/tcp\":[{\"HostPort\":\"9070\"}]},\"Binds\":[\"/data/misc:/data/misc\"]}}"

If you use this approach, ensure curl is available in your container image.

gri6507 · 2022-05-25T14:36:59Z

@veyalla - thank you for that workaround idea. One thing that isn't clear is what is the HostPort 9070 or why /data/misc bind mount is needed?

veyalla · 2022-05-25T21:56:57Z

9070 is just an example. /data/misc is also just a copy/paste artifact and not needed.

The uber point is that only the module knows whether it is healthy or not. So it exposes a binary health signal on a local endpoint of its choosing (9070 in this example).

Docker engine is configured to periodically call the health endpoint (per the Healthcheck parameters) and crashes the module if unhealthy status is returned. EdgeAgent will subsequently restart the module, and that hopefully unstucks the module from a "running but not working" state.

gri6507 · 2022-05-26T19:00:32Z

@veyalla - I agree with your statement that

Docker engine is configured to periodically call the health endpoint (per the Healthcheck parameters)

However, I disagree with your continuation

and crashes the module if unhealthy status is returned

Docker does not stop the container. It simply marks it as unhealthy. If docker actually did stop the container, then I agree that EdgeAgent would have picked that up and restarted the Edge module.

veyalla · 2022-05-26T20:00:10Z

My expectation is that the module exits if the health endpoint indicates unhealthy, which is equivalent to crashing itself.

Example from heath check above:

CMD curl -f http://localhost:9070/health || exit 1

bqstony · 2022-05-31T08:32:47Z

Hi @veyalla

I am digging arount in the issues to find the right way for a working health check in iotedge v1.2.

When i set the the restart policy, in the Manifest, like: modules.{moduleId}.restartPolicy = always, i get always the following result on the edgeHost, no matter what value i use:

$ docker inspect --format '{{json .HostConfig.RestartPolicy}}' mycustommodule| jq
{
  "Name": "",
  "MaximumRetryCount": 0
}
# This is wrong for my opinion, the name Should have the value always, i dont know how to change this value, whenn i do it in the manifest, the container does not startup.

$ docker inspect --format '{{json .Config.Healthcheck}}' mycustommodule | jq
{
  "Test": [
    "CMD-SHELL",
    "curl --fail http://localhost:80/health || exit 1"
  ],
  "Interval": 30000000000,
  "Timeout": 10000000000,
  "Retries": 3
}
# the Healtcheck locks good to me and the calls are made by docker, i see the calls in the logs

For information, the configured of the health is in the dockerfile:

FROM mcr.microsoft.com/dotnet/aspnet:6.0-bullseye-slim AS base
WORKDIR /app
EXPOSE 80

HEALTHCHECK --interval=30s --timeout=10s --retries=3 CMD curl --fail http://localhost:80/health || exit 1

# Install prerequisites for healthcheck
RUN apt-get update && apt-get install -y \
    curl && \
    rm -rf /var/lib/apt/lists/*

...

My Question is: What else i have to do for a working restart after my container goes to unhealthy state?

veyalla · 2022-05-31T16:34:33Z

I'm not sure what the expectation here is.

The health-check mechanism is transparent to IoT Edge and controlled exclusively by the Docker engine. There will be nothing reflected in Edge Agent's twin. You should just confirm the container is indeed restarted by Docker when health endpoint reports unhealthy.

bqstony · 2022-05-31T18:44:12Z

that is also what i am expecting

bqstony · 2022-06-01T06:25:31Z

After found a few things, i think your solution doesn't work out of the box. Because moby-engine does not support the restart behavior on health probe, see moby/moby#22719 . The restart is triggered by orchestrators like docker swarm (docker-compose) or other tools.

I Think it should be the job of the edgeAgent. Actually the RestartPolicyManager does only look at the state of the Container and not the Healt. See the remark in the code:

iotedge/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/RestartPolicyManager.cs

Line 24 in 719bffd

    
           // TODO: If the module state is "running", when we have health-probes implemented,

veyalla · 2022-06-01T19:02:08Z

Thanks for investigation. I believe you are correct, Docker Swarm-like restart-on-unhealthy behavior is not implemented in IoT Edge. For feature asks, please file or upvote an idea at https://aka.ms/iotedge-suggestion.

However, I was able to test a workaround that achieved the core requirement. The createOptions snippet below will call the health command every minute (controlled by the Interval parameter) and restart the container if the health check fails or the health endpoint cannot be reached; edgeAgent will subsequently restart the module.

{
    "Healthcheck": {
        "Test": [
            "CMD-SHELL",
            "curl -f http://localhost:9070/health || bash -c 'kill -s 15 -1 && (sleep 10; kill -s 9 -1)'"
        ],
        "Interval": 60000000000,
        "Timeout": 60000000000,
        "Retries": 1,
        "StartPeriod": 120000000000
    },
    "HostConfig": {
        "Init": true
    }
}

Ensure curl and bash commands are available in the container image or replace with equivalent

rafaelschlatter · 2023-02-08T09:48:25Z

I created an issue to ask for the implementation of restart "on-unhealthy" here: https://feedback.azure.com/d365community/idea/32f1a962-70a0-ed11-a81b-6045bd79fc6e

Please upvote this if you think it is useful.

github-actions bot added customer-reported iotedge labels May 17, 2022

gauravIoTEdge self-assigned this May 21, 2022

gauravIoTEdge closed this as completed May 24, 2022

gauravIoTEdge mentioned this issue May 25, 2022

Update restart policies docs MicrosoftDocs/azure-docs#93412

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Healthcheck not restarting unhealthy module #6358

Healthcheck not restarting unhealthy module #6358

gri6507 commented May 17, 2022

gri6507 commented May 17, 2022 •

edited

gauravIoTEdge commented May 24, 2022

veyalla commented May 25, 2022

gri6507 commented May 25, 2022 •

edited

veyalla commented May 25, 2022

gri6507 commented May 26, 2022

veyalla commented May 26, 2022

bqstony commented May 31, 2022 •

edited

veyalla commented May 31, 2022 •

edited

bqstony commented May 31, 2022

bqstony commented Jun 1, 2022

veyalla commented Jun 1, 2022

rafaelschlatter commented Feb 8, 2023

Healthcheck not restarting unhealthy module #6358

Healthcheck not restarting unhealthy module #6358

Comments

gri6507 commented May 17, 2022

Expected Behavior

Current Behavior

Steps to Reproduce

Context (Environment)

Output of iotedge check

Device Information

Runtime Versions

Additional Information

gri6507 commented May 17, 2022 • edited

gauravIoTEdge commented May 24, 2022

veyalla commented May 25, 2022

gri6507 commented May 25, 2022 • edited

veyalla commented May 25, 2022

gri6507 commented May 26, 2022

veyalla commented May 26, 2022

bqstony commented May 31, 2022 • edited

veyalla commented May 31, 2022 • edited

bqstony commented May 31, 2022

bqstony commented Jun 1, 2022

veyalla commented Jun 1, 2022

rafaelschlatter commented Feb 8, 2023

Output of `iotedge check`

gri6507 commented May 17, 2022 •

edited

gri6507 commented May 25, 2022 •

edited

bqstony commented May 31, 2022 •

edited

veyalla commented May 31, 2022 •

edited