
eks aws_auth configmap management may cause race conditions #84

Open
HartS opened this issue Aug 13, 2020 · 4 comments

Comments

@HartS
Contributor

HartS commented Aug 13, 2020

I've recently encountered two race conditions that I believe are caused by Terraform's management of the aws-auth ConfigMap here.

In a recent CI-initiated terraform destroy, the following error was encountered:

...
module.eks.aws_security_group.aws-node: Still destroying... [id=sg-069b01ce27e2bffd3, 30m1s elapsed]
Error: Error deleting security group: DependencyViolation: resource sg-069b01ce27e2bffd3 has a dependent object
	status code: 400, request id: 11278f08-4336-4f13-b335-bf0dc38e1ed8

During debugging, I noticed that the cluster couldn't be accessed with the kubeconfig generated by terraform output, which led to the discovery that this ConfigMap had already been deleted. The cluster is now in a state where Terraform can't destroy it, because dependent objects still exist in that security group. @colstrom has speculated that cleanup of an ENI, expected to be handled by the worker node, failed because the worker node could no longer manage cluster resources after the ConfigMap was deleted.

This ConfigMap has also led to race conditions during deployment: terraform apply has been failing with alarming frequency, with the following error:

module.services.kubernetes_config_map.aws_auth: Still creating... [30s elapsed]

Error: Post https://6A7BFC843CD4C7578DCB503446548A17.gr7.us-west-2.eks.amazonaws.com/api/v1/namespaces/kube-system/configmaps: dial tcp 34.218.122.126:443: i/o timeout

  on modules/services/aws_auth_cm.tf line 7, in resource "kubernetes_config_map" "aws_auth":
   7: resource "kubernetes_config_map" "aws_auth" {

I believe this is due to attempting to create/patch the ConfigMap before the Kubernetes API is up.
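For reference, a common mitigation is to order the ConfigMap behind the cluster and authenticate the kubernetes provider with a fresh token. This is only a sketch, not code from this repo; the resource names (aws_eks_cluster.this, aws_iam_role.worker) are assumptions:

```hcl
# Sketch only: aws_eks_cluster.this and aws_iam_role.worker are assumed names.
provider "kubernetes" {
  host                   = aws_eks_cluster.this.endpoint
  cluster_ca_certificate = base64decode(aws_eks_cluster.this.certificate_authority[0].data)

  # Fetch a fresh token so the provider authenticates against the new cluster.
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", aws_eks_cluster.this.name]
  }
}

resource "kubernetes_config_map" "aws_auth" {
  # Explicit ordering: don't touch the ConfigMap until the control plane exists.
  # Note this only orders the steps; it does not fully solve the case where the
  # endpoint reports ready before the API server actually accepts connections.
  depends_on = [aws_eks_cluster.this]

  metadata {
    name      = "aws-auth"
    namespace = "kube-system"
  }

  data = {
    mapRoles = yamlencode([{
      rolearn  = aws_iam_role.worker.arn # assumed name
      username = "system:node:{{EC2PrivateDNSName}}"
      groups   = ["system:bootstrappers", "system:nodes"]
    }])
  }
}
```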

@satadruroy
Contributor

Security groups not getting deleted during terraform destroy is a known issue:

hashicorp/terraform-provider-aws#2445

Usually, another destroy after a failed one cleans things up.

hashicorp/terraform-provider-aws#2445 (comment)

@HartS
Contributor Author

HartS commented Aug 14, 2020

Just to follow up: running terraform destroy again didn't fix the issue; I kept seeing the same error. After cleaning up the ENI that had been associated with a deleted worker node, I was able to finish the destroy. Since this isn't a regular occurrence, it does point to a race condition, whether in cap-terraform or the Terraform EKS provider.
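For anyone hitting the same state, the manual cleanup can be sketched with the AWS CLI. The security group ID below is taken from the error above; the ENI IDs are placeholders you would substitute from the first command's output:

```shell
# List ENIs still attached to the security group that blocks deletion:
aws ec2 describe-network-interfaces \
  --filters Name=group-id,Values=sg-069b01ce27e2bffd3 \
  --query 'NetworkInterfaces[].{Id:NetworkInterfaceId,Status:Status,Desc:Description}'

# For each leaked interface, detach it (if still attached), then delete it:
aws ec2 detach-network-interface --attachment-id eni-attach-0123456789abcdef0  # placeholder
aws ec2 delete-network-interface --network-interface-id eni-0123456789abcdef0  # placeholder

# Then re-run the destroy:
terraform destroy
```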

@satadruroy
Contributor

The aws_auth ConfigMap timeout is a core EKS issue: the API server is not actually ready even when it reports that it is.

terraform-aws-modules/terraform-aws-eks#621

aws/containers-roadmap#654

@satadruroy
Contributor

satadruroy commented Aug 17, 2020

@HartS can you disentangle the two issues, please? They both stem from race conditions, but they are not related.
