Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Healthcheck not restarting unhealthy module #6358

Closed
gri6507 opened this issue May 17, 2022 · 13 comments
Closed

Healthcheck not restarting unhealthy module #6358

gri6507 opened this issue May 17, 2022 · 13 comments

Comments

@gri6507
Copy link
Contributor

gri6507 commented May 17, 2022

Expected Behavior

When a custom module's healthcheck fails, thus putting the module into "unhealthy" status, edgeAgent should restart the module

Current Behavior

edgeAgent does not restart the module

Steps to Reproduce

  1. Create a custom module with the following at the end of it's Dockerfile
    HEALTHCHECK --interval=15s --timeout=3s --retries=2 \
      CMD exit 1
  2. Deploy this module with a deployment.json which has modulesContent.$edgeAgent.properties.desired
    "modules": {
      "MyModule": {
        "version": "34136",
        "type": "docker",
        "status": "running",
        "restartPolicy": "on-failure",
        "settings": {
          "image": "...",
          "createOptions": "{\"Healthcheck\":{\"Test\":[],\"Interval\":0,\"Timeout\":0,\"Retries\":0,\"StartPeriod\":0}, \"HostConfig\":{\"CapAdd\":[\"SYS_PTRACE\"], \"ExtraHosts\":[\"dockerhost:172.16.99.99\"], \"Binds\":[\"/etc/daikin-software-gateway/:/etc/storage/\"]}}"
        },
        "env": {
          "storageFolder": {
            "value": "/etc/storage/"
          }
        }
      }
    }
    According to documentation, those Test settings should just inherit the healthcheck definition from the container.
  3. Let the deployment happen. Then run sudo iotedge list && sudo docker container ls and observe that iotedge lists the module as running and docker shows the state of the container as Up 10 seconds (starting)
  4. Wait until the container has run for at least 30 seconds, which comes from the Dockerfile --interval=15s --retries=2 settings. Observe that iotedge lists the module as running and docker shows the state of the container as Up 1 minute (unhealthy)
  5. Observe that regardless of how long you let edgeAgent run, it does not restart the custom module. Furthermore, sudo docker logs -f edgeAgent | grep "scheduled to restart after" does not show any attempts to restart the custom module.

Context (Environment)

Output of iotedge check

Click to show
Configuration checks
--------------------
√ config.yaml is well-formed - OK
√ config.yaml has well-formed connection string - OK
√ container engine is installed and functional - OK
√ config.yaml has correct hostname - OK
√ config.yaml has correct URIs for daemon mgmt endpoint - OK
√ latest security daemon - OK
√ host time is close to real time - OK
√ container time is close to host time - OK
√ DNS server - OK
√ production readiness: identity certificates expiry - OK
√ production readiness: certificates - OK
√ production readiness: logs policy - OK
‼ production readiness: Edge Agent's storage directory is persisted on the host filesystem - Warning
    The edgeAgent module is not configured to persist its /tmp/edgeAgent directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.
‼ production readiness: Edge Hub's storage directory is persisted on the host filesystem - Warning
    The edgeHub module is not configured to persist its /tmp/edgeHub directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.

Connectivity checks
-------------------
√ host can connect to and perform TLS handshake with DPS endpoint - OK
√ host can connect to and perform TLS handshake with IoT Hub AMQP port - OK
√ host can connect to and perform TLS handshake with IoT Hub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with IoT Hub MQTT port - OK
√ container on the default network can connect to IoT Hub AMQP port - OK
√ container on the default network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the default network can connect to IoT Hub MQTT port - OK
√ container on the IoT Edge module network can connect to IoT Hub AMQP port - OK
√ container on the IoT Edge module network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to IoT Hub MQTT port - OK

22 check(s) succeeded.
2 check(s) raised warnings. Re-run with --verbose for more details.

Device Information

  • Host OS Ubuntu 18.04]:
  • Architecture [amd64,]:
  • Container OS Linux containers]:

Runtime Versions

  • iotedge 1.1.13
  • mcr.microsoft.com/azureiotedge-agent 1.1 d1bbbd315877
  • mcr.microsoft.com/azureiotedge-hub 1.1 c25873c1704a
  • Docker/Moby [run docker version]: 20.10.8+azure

Additional Information

I have also tried a deployment with an explicit Healthcheck specification, but that also did not work

    "modules": {
      "MyModule": {
        "version": "34136",
        "type": "docker",
        "status": "running",
        "restartPolicy": "on-failure",
        "settings": {
          "image": "...",
          "createOptions": "{\"Healthcheck\":{\"Test\",[\"CMD\":\"exit 1\"],\"Interval\":15000000000,\"Timeout\":3000000000,\"Retries\":2,\"StartPeriod\":1000000000}, \"HostConfig\":{\"CapAdd\":[\"SYS_PTRACE\"], \"ExtraHosts\":[\"dockerhost:172.16.99.99\"], \"Binds\":[\"/etc/daikin-software-gateway/:/etc/storage/\"]}}"
        },  
        "env": {
          "storageFolder": {
            "value": "/etc/storage/"
          }   
        }   
      }   
    }  
@gri6507
Copy link
Contributor Author

gri6507 commented May 17, 2022

After doing more searching in this Git repo, I found #5746. That ticket talks about an on-unhealthy restartPolicy. According to docker documentation (and that ticket), that is an unsupported restartPolicy. In my case, I am using on-failure restartPolicy. Is that also not supported for failed health checks?

@gauravIoTEdge gauravIoTEdge self-assigned this May 21, 2022
@gauravIoTEdge
Copy link
Contributor

@gri6507 - Yes, I'm afraid that policy is also not supported.

Currently, our recommendation is that you use the docker health checks. Like the issue you linked suggested, you can post it as a suggestion and vote - https://aka.ms/iotedge-suggestion.

I'll work with someone to make sure the documentation reflects this too. Sorry about that.

@veyalla
Copy link
Contributor

veyalla commented May 25, 2022

As workaround, you can achieve the desired result without relying on IoT Edge runtime. For example for the health check below:

HEALTHCHECK \
 --interval=30s \
 --timeout=30s \
 --start-period=60s \
 --retries=3 \
 CMD curl -f http://localhost:9070/health || exit 1

...use these createOptions:

"createOptions": "{\"Healthcheck\":{\"Test\":[\"CMD-SHELL\",\"curl -f http://localhost:9070/health || exit 1\"],\"Interval\":30000000000,\"Timeout\":30000000000,\"Retries\":1,\"StartPeriod\":60000000000},\"HostConfig\":{\"PortBindings\":{\"9070/tcp\":[{\"HostPort\":\"9070\"}]},\"Binds\":[\"/data/misc:/data/misc\"]}}"

If you use this approach, ensure curl is available in your container image.

@gri6507
Copy link
Contributor Author

gri6507 commented May 25, 2022

@veyalla - thank you for that workaround idea. One thing that isn't clear is what is the HostPort 9070 or why /data/misc bind mount is needed?

@veyalla
Copy link
Contributor

veyalla commented May 25, 2022

9070 is just an example. /data/misc is also just a copy/paste artifact and not needed.

The uber point is that only the module knows whether it is healthy or not. So it exposes a binary health signal on a local endpoint of its choosing (9070 in this example).

Docker engine is configured to periodically call the health endpoint (per the Healthcheck parameters) and crashes the module if unhealthy status is returned. EdgeAgent will subsequently restart the module, and that hopefully unstucks the module from a "running but not working" state.

@gri6507
Copy link
Contributor Author

gri6507 commented May 26, 2022

@veyalla - I agree with your statement that

Docker engine is configured to periodically call the health endpoint (per the Healthcheck parameters)

However, I disagree with your continuation

and crashes the module if unhealthy status is returned

Docker does not stop the container. It simply marks it as unhealthy. If docker actually did stop the container, then I agree that EdgeAgent would have picked that up and restarted the Edge module.

@veyalla
Copy link
Contributor

veyalla commented May 26, 2022

My expectation is that the module exits if the health endpoint indicates unhealthy, which is equivalent to crashing itself.

Example from heath check above:

CMD curl -f http://localhost:9070/health || exit 1

@bqstony
Copy link

bqstony commented May 31, 2022

Hi @veyalla

I am digging arount in the issues to find the right way for a working health check in iotedge v1.2.

When i set the the restart policy, in the Manifest, like: modules.{moduleId}.restartPolicy = always, i get always the following result on the edgeHost, no matter what value i use:

$ docker inspect --format '{{json .HostConfig.RestartPolicy}}' mycustommodule| jq
{
  "Name": "",
  "MaximumRetryCount": 0
}
# This is wrong for my opinion, the name Should have the value always, i dont know how to change this value, whenn i do it in the manifest, the container does not startup.

$ docker inspect --format '{{json .Config.Healthcheck}}' mycustommodule | jq
{
  "Test": [
    "CMD-SHELL",
    "curl --fail http://localhost:80/health || exit 1"
  ],
  "Interval": 30000000000,
  "Timeout": 10000000000,
  "Retries": 3
}
# the Healtcheck locks good to me and the calls are made by docker, i see the calls in the logs

For information, the configured of the health is in the dockerfile:

FROM mcr.microsoft.com/dotnet/aspnet:6.0-bullseye-slim AS base
WORKDIR /app
EXPOSE 80

HEALTHCHECK --interval=30s --timeout=10s --retries=3 CMD curl --fail http://localhost:80/health || exit 1

# Install prerequisites for healthcheck
RUN apt-get update && apt-get install -y \
    curl && \
    rm -rf /var/lib/apt/lists/*

...

My Question is: What else i have to do for a working restart after my container goes to unhealthy state?

@veyalla
Copy link
Contributor

veyalla commented May 31, 2022

I'm not sure what the expectation here is.

The health-check mechanism is transparent to IoT Edge and controlled exclusively by the Docker engine. There will be nothing reflected in Edge Agent's twin. You should just confirm the container is indeed restarted by Docker when health endpoint reports unhealthy.

@bqstony
Copy link

bqstony commented May 31, 2022

that is also what i am expecting

@bqstony
Copy link

bqstony commented Jun 1, 2022

After found a few things, i think your solution doesn't work out of the box. Because moby-engine does not support the restart behavior on health probe, see moby/moby#22719 . The restart is triggered by orchestrators like docker swarm (docker-compose) or other tools.

I Think it should be the job of the edgeAgent. Actually the RestartPolicyManager does only look at the state of the Container and not the Healt. See the remark in the code:

// TODO: If the module state is "running", when we have health-probes implemented,

@veyalla
Copy link
Contributor

veyalla commented Jun 1, 2022

Thanks for investigation. I believe you are correct, Docker Swarm-like restart-on-unhealthy behavior is not implemented in IoT Edge. For feature asks, please file or upvote an idea at https://aka.ms/iotedge-suggestion.

However, I was able to test a workaround that achieved the core requirement. The createOptions snippet below will call the health command every minute (controlled by the Interval parameter) and restart the container if the health check fails or the health endpoint cannot be reached; edgeAgent will subsequently restart the module.

{
    "Healthcheck": {
        "Test": [
            "CMD-SHELL",
            "curl -f http://localhost:9070/health || bash -c 'kill -s 15 -1 && (sleep 10; kill -s 9 -1)'"
        ],
        "Interval": 60000000000,
        "Timeout": 60000000000,
        "Retries": 1,
        "StartPeriod": 120000000000
    },
    "HostConfig": {
        "Init": true
    }
}

Ensure curl and bash commands are available in the container image or replace with equivalent

@rafaelschlatter
Copy link

I created an issue to ask for the implementation of restart "on-unhealthy" here: https://feedback.azure.com/d365community/idea/32f1a962-70a0-ed11-a81b-6045bd79fc6e

Please upvote this if you think it is useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants