Intermittent pod attach issues with bottlerocket-aws-k8s-1.24-x86_64-v1.19.3-f097c617 AMI #3866

Open
aaronborden-rivian opened this issue Apr 3, 2024 · 9 comments
Labels
type/bug Something isn't working

@aaronborden-rivian

aaronborden-rivian commented Apr 3, 2024

Image I'm using:
AWS AMI bottlerocket-aws-k8s-1.24-x86_64-v1.19.3-f097c617 ami-08ab333430f1465ce
kubelet version reported: v1.24.17-eks-bd4e8bf
Karpenter 0.30.0

We're using Bottlerocket to run our GitLab Runner CI jobs. After the 1.19.3 AMI was released, we started seeing jobs hanging after pod start and eventually timing out on nodes deployed with the latest Bottlerocket image.

What I expected to happen:
Nodes running Bottlerocket 1.19.3 to perform similarly to Bottlerocket 1.19.2 with respect to kubectl attach stability.

What actually happened:
We saw an increase in the number of jobs failing with the following error message:

WARNING: prepare_script could not run to completion because the timeout was exceeded. For more control over job and script timeouts see: https://docs.gitlab.com/ee/ci/runners/configure_runners.html#set-script-and-after_script-timeouts
ERROR: Job failed: execution took longer than 1h0m0s seconds

Note that this did not affect all jobs. The error was rare but frequent enough to notice: roughly 0.7% of jobs were affected, and about 175 jobs failed with this error during an 8-hour period.
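
For a sense of scale, here is a rough back-of-the-envelope sketch based only on the two figures above (about 175 failures, about 0.7% of jobs, over 8 hours). Taken together they imply roughly 25,000 jobs in that window. The second part estimates how many attach attempts a repro loop might need before a hang is likely to show up. It assumes each attempt fails independently at the same rate, which may well not hold in practice, so treat the result as a ballpark only.

```python
import math

# Figures quoted in the report; both are approximate.
failed_jobs = 175
failure_rate = 0.007   # ~0.7% of jobs
window_hours = 8

# Total job volume implied by the two reported figures.
implied_total_jobs = failed_jobs / failure_rate
implied_jobs_per_hour = implied_total_jobs / window_hours


def attempts_for_detection(p_fail, confidence):
    """Smallest n such that 1 - (1 - p_fail)**n >= confidence.

    Assumption: attempts are independent and fail with the same probability.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


print("implied jobs in window:", round(implied_total_jobs))
print("implied jobs per hour:", round(implied_jobs_per_hour))
print("attempts for a 95% chance of one failure:",
      attempts_for_detection(failure_rate, 0.95))
```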

After reverting to Bottlerocket 1.19.2, the error went away.

How to reproduce the problem:
Unfortunately, I don't have a great method for reproduction. The gitlab-runner uses attach to run scripts within the build container, so perhaps you could simplify the repro by doing kubectl attach in a loop.

  1. Specify Bottlerocket v1.19.3
  2. Run many CI jobs
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  amiFamily: Bottlerocket
  amiSelector:
    aws::name: bottlerocket-aws-k8s-1.24-x86_64-v1.19.3-f097c617
    # ...
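
The "kubectl attach in a loop" idea could be sketched roughly as below. This is not a confirmed reproduction, just a minimal harness that runs a command repeatedly and counts how many runs exit cleanly, fail, or hang past a timeout. The function names, the timeout, and the attempt count are all placeholders chosen for illustration, and the command to run is passed in by the caller.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd once and classify the outcome.

    Returns "ok" (exit code 0), "failed" (non-zero exit), or
    "hung" (the process did not exit within timeout_s seconds).
    """
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "hung"
    return "ok" if result.returncode == 0 else "failed"


def attach_loop(cmd, attempts, timeout_s):
    """Run cmd `attempts` times and tally the outcomes."""
    counts = {"ok": 0, "failed": 0, "hung": 0}
    for _ in range(attempts):
        counts[run_with_timeout(cmd, timeout_s)] += 1
    return counts
```

One way to use it would be `attach_loop(["kubectl", "attach", "attach-test", "-c", "build", "-i"], attempts=500, timeout_s=60)`, where the pod name `attach-test` and container name `build` are hypothetical. Whether a healthy attach exits on its own depends on what the container is running, so the timeout needs to be tuned to the workload before a "hung" count means anything.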
aaronborden-rivian added the status/needs-triage and type/bug labels Apr 3, 2024
@aaronborden-rivian
Author

FYI I'll also be looping in GitLab support in case we can come up with more details.

@rpkelly
Contributor

rpkelly commented Apr 3, 2024

Hi @aaronborden-rivian, thanks for reporting this. I'll work on reproducing it. In the meantime, any additional logs or error messages you could provide would be helpful.

Thanks!

@aaronborden-rivian
Author

A number of folks are reporting this in the gitlab-runner issue https://gitlab.com/gitlab-org/gitlab-runner/-/issues/37446 and not all of them are using Bottlerocket. The cause might be a component that's included in recent versions.

@alex-berger

alex-berger commented Apr 5, 2024

Maybe this is caused by containerd/containerd#10036. There is already an upstream containerd fix with backports for 1.6.* and 1.7.* in preparation, see containerd/containerd#10036 (comment).

rpkelly removed the status/needs-triage label Apr 5, 2024
@alex-berger

Many thanks to @vyaghras, @bcressey and @rpkelly for #3869. Now we need a new Amazon EKS optimized Bottlerocket AMI release. Will this be triggered automatically?

@bcressey
Contributor

bcressey commented Apr 5, 2024

@alex-berger it is not automatic but the release process for 1.19.4 with the fix is getting under way now.

@alex-berger

alex-berger commented Apr 5, 2024

the release process for 1.19.4 with the fix is getting under way now

@bcressey Out of curiosity, can this be tracked somewhere (publicly, e.g. on GitHub or wherever) or is this an AWS internal process?

Update: ah, my bad, as I wrote those lines #3872 appeared :-)

@rpkelly
Contributor

rpkelly commented Apr 8, 2024

With the release of Bottlerocket v1.19.4 the containerd issue should be addressed!

@dmitrii-bdv

@alex-berger
Looks like the releases are tracked here: https://github.com/bottlerocket-os/bottlerocket/releases
