Provide visibility into Bootstrap Container behaviors - exit status, time, etc? #3811

Open
diranged opened this issue Mar 7, 2024 · 2 comments
Labels
area/core Issues core to the OS (variant independent) area/kubernetes K8s including EKS, EKS-A, and including VMW type/enhancement New feature or request

Comments

@diranged

diranged commented Mar 7, 2024

What I'd like:
We use a few bootstrap containers on startup - some of them label hosts, others handle custom max-pods calculations, etc. Because booting new hosts is more important to us than the occasional host that might boot "incorrectly configured", we choose to mark these as essential=false to ensure that we are never blocked in booting new capacity. (This decision has saved us many outages).

The thing is ... once your host is booted, you have no idea whether or not the Bootstrap scripts worked. You can scroll through the journald logs, but that's it. You don't know how long a host waited to execute a script, how long it took to pull down an image, or what the exit codes were.

We want to keep track of the number of Bootstrap Containers that start up and fail so that we can alert on that, but not block the booting process. In an ideal world, we would also have some method for getting metrics on how long it took these containers to run, which would help us optimize our new-host boot time (but that's really for extra credit).
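To make the ask concrete, here is a minimal sketch of the kind of per-container summary we'd like to end up with (exit status, result, run duration). It parses `KEY=VALUE` text in the shape that `systemctl show` prints and reads the standard systemd properties `ExecMainStatus`, `Result`, `ExecMainStartTimestampMonotonic` and `ExecMainExitTimestampMonotonic`. Whether the units that run bootstrap containers populate these properties in a useful way is an assumption that has not been verified, and the function names are hypothetical.

```python
from typing import Dict, Optional


def parse_systemctl_show(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, in the shape printed by `systemctl show`."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that carry no '='
            props[key.strip()] = value.strip()
    return props


def summarize_unit(props: Dict[str, str]) -> Dict[str, Optional[object]]:
    """Summarize exit status and run time for one unit (hypothetical helper).

    The monotonic timestamps are in microseconds; a value of 0 means
    "not set", in which case no duration is reported.
    """
    start = int(props.get("ExecMainStartTimestampMonotonic") or 0)
    end = int(props.get("ExecMainExitTimestampMonotonic") or 0)
    duration = (end - start) / 1_000_000 if start and end >= start else None
    status = props.get("ExecMainStatus")
    return {
        "exit_status": int(status) if status else None,
        "result": props.get("Result"),
        "duration_seconds": duration,
    }


if __name__ == "__main__":
    sample = (
        "ExecMainStartTimestampMonotonic=12000000\n"
        "ExecMainExitTimestampMonotonic=15500000\n"
        "ExecMainStatus=1\n"
        "Result=exit-code\n"
    )
    print(summarize_unit(parse_systemctl_show(sample)))
```

A summary like this only covers the run itself. Time spent waiting to start or pulling the image would need separate timestamps that this sketch does not attempt to model.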

Preferred Behavior

When I think about how to approach this - I feel like the most natural thing is for each Bootstrap Container to become a "condition" on the node - so that a simple kubectl describe node ... will get you information on it. From there, metrics can be collected about which nodes have which conditions on them, and teams can develop any alerting or behaviors they need.
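As a rough illustration of the condition-per-container idea, the sketch below builds one entry in the shape of a Kubernetes `NodeCondition` (`type`, `status`, `reason`, `message`, `lastHeartbeatTime`, `lastTransitionTime`) from a container name and its exit status. The `BootstrapContainer<Name>Failed` naming scheme, the reason strings, and the helper names are all hypothetical, and the sketch says nothing about which component would actually write the condition to the node.

```python
import re
from datetime import datetime, timezone
from typing import Dict, Optional


def condition_type(container_name: str) -> str:
    """Map e.g. 'max-pods' to 'BootstrapContainerMaxPodsFailed' (hypothetical scheme)."""
    parts = re.split(r"[^A-Za-z0-9]+", container_name)
    camel = "".join(p[:1].upper() + p[1:] for p in parts if p)
    return f"BootstrapContainer{camel}Failed"


def build_condition(
    container_name: str, exit_status: int, now: Optional[datetime] = None
) -> Dict[str, str]:
    """Build a NodeCondition-shaped dict for one bootstrap container."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    failed = exit_status != 0
    return {
        "type": condition_type(container_name),
        # Condition status is a string: "True" means the failure is present.
        "status": "True" if failed else "False",
        "reason": "ContainerExitedNonZero" if failed else "NoFailure",
        "message": f"bootstrap container {container_name!r} exited with status {exit_status}",
        "lastHeartbeatTime": stamp,
        "lastTransitionTime": stamp,
    }


if __name__ == "__main__":
    print(build_condition("max-pods", 1))
```

With conditions shaped like this, counting nodes where any `BootstrapContainer*Failed` condition is `"True"` would be enough to drive an alert without blocking boot.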

Any alternatives you've considered:

We first went down the path of trying to use the Node Problem Detector with this configuration (below) - but discovered that it only tails logs from the moment it starts up, so it cannot react to logs written before it comes up. It therefore has no visibility into the Bootstrap Containers.

      bottlerocket-bootstrap-containers.json: |
        {
          "plugin": "custom",
          "pluginConfig": {
            "invoke_interval": "5m",
            "timeout": "1m",
            "max_output_length": 80,
            "concurrency": 1
          },
          "source": "bottlerocket-bootstrap-containers",
          "conditions": [
            {
              "type": "BootstrapContainerFail",
              "reason": "NoFailure",
              "message": "Bootstrap Containers started successfully"
            }
          ],
          "rules": [
            {
              "type": "permanent",
              "condition": "BootstrapContainerFail",
              "reason": "ContainerStartFailure",
              "path": "/home/kubernetes/bin/log-counter",
              "timeout": "3m",
              "args": [
                "--journald-source=systemd",
                "--log-path=/var/log/journal",
                "--lookback=20m",
                "--delay=5m",
                "--count=5",
                "--pattern=Failed to start bootstrap container.*"
              ]
            }
          ]
        }
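For anyone trying a similar setup: a monitor config like this is easy to get subtly wrong, since strict JSON parsers reject trailing commas and missing separators. Below is a minimal sketch of a pre-deploy check, assuming the file is parsed as strict JSON. The cross-check that every `permanent` rule names a declared condition is a sanity check of my own, not a statement of what Node Problem Detector itself validates, and `check_monitor_config` is a hypothetical helper.

```python
import json
from typing import List


def check_monitor_config(text: str) -> List[str]:
    """Return a list of problems found in a monitor config; empty means none found."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"]

    problems: List[str] = []
    declared = {c.get("type") for c in config.get("conditions", [])}
    for index, rule in enumerate(config.get("rules", [])):
        # A permanent rule should refer to a condition declared above it.
        if rule.get("type") == "permanent" and rule.get("condition") not in declared:
            problems.append(
                f"rule {index}: condition {rule.get('condition')!r} is not declared"
            )
    return problems


if __name__ == "__main__":
    broken = '{"conditions": [{"type": "BootstrapContainerFail",}], "rules": []}'
    for problem in check_monitor_config(broken):
        print(problem)
```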
@diranged diranged added status/needs-triage Pending triage or re-evaluation type/enhancement New feature or request labels Mar 7, 2024
@yeazelm
Contributor

yeazelm commented Mar 8, 2024

Thanks for cutting this issue @diranged. I think there are some useful features that could be added to bootstrap containers. I haven't looked deeply at conditions but is the expectation there would be one for each bootstrap container, regardless of if it is marked essential or not? I think metrics about success, time, and logging output all seem like reasonable things as well. We'll take that as a feature request to enhance the observability of bootstrap containers.

@yeazelm yeazelm added area/kubernetes K8s including EKS, EKS-A, and including VMW area/core Issues core to the OS (variant independent) and removed status/needs-triage Pending triage or re-evaluation labels Mar 8, 2024
@diranged
Author

diranged commented Mar 8, 2024

> Thanks for cutting this issue @diranged. I think there are some useful features that could be added to bootstrap containers. I haven't looked deeply at conditions but is the expectation there would be one for each bootstrap container, regardless of if it is marked essential or not? I think metrics about success, time, and logging output all seem like reasonable things as well. We'll take that as a feature request to enhance the observability of bootstrap containers.

Just off the top of my head - I'd like to see a condition per Bootstrap Container. I could be convinced otherwise though ... but that seems cleanest to me.
