
eks aws_auth configmap management may cause race conditions #84

Open
HartS opened this issue Aug 13, 2020 · 4 comments

Comments

@HartS
Contributor

HartS commented Aug 13, 2020

I've recently encountered two race conditions that I believe are caused by Terraform's management of the aws-auth ConfigMap here.

In a recent CI-initiated terraform destroy, the following error was encountered:

...
module.eks.aws_security_group.aws-node: Still destroying... [id=sg-069b01ce27e2bffd3, 30m1s elapsed]
Error: Error deleting security group: DependencyViolation: resource sg-069b01ce27e2bffd3 has a dependent object
	status code: 400, request id: 11278f08-4336-4f13-b335-bf0dc38e1ed8

During debugging, I noticed that the cluster couldn't be accessed with the kubeconfig generated by terraform output, which led to the discovery that this ConfigMap had already been deleted. The cluster is now in a state where Terraform can't destroy it, because dependent objects still exist in that security group. @colstrom has speculated that cleanup of an ENI, expected to be handled by the worker node, failed because the worker node could no longer manage cluster resources after the ConfigMap was deleted.

This ConfigMap has also led to race conditions during deployment: terraform apply has been failing with alarming frequency, with the following error:

module.services.kubernetes_config_map.aws_auth: Still creating... [30s elapsed]

Error: Post https://6A7BFC843CD4C7578DCB503446548A17.gr7.us-west-2.eks.amazonaws.com/api/v1/namespaces/kube-system/configmaps: dial tcp 34.218.122.126:443: i/o timeout

  on modules/services/aws_auth_cm.tf line 7, in resource "kubernetes_config_map" "aws_auth":
   7: resource "kubernetes_config_map" "aws_auth" {

I believe this is due to attempting to create/patch the ConfigMap before the Kubernetes API is up.
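For reference, a common mitigation is to order the ConfigMap behind the cluster and authenticate the kubernetes provider with a fresh token. This is only a sketch, not code from this repo; the resource names (aws_eks_cluster.this, aws_iam_role.worker) are assumptions:

```hcl
# Sketch only: aws_eks_cluster.this and aws_iam_role.worker are assumed names.
provider "kubernetes" {
  host                   = aws_eks_cluster.this.endpoint
  cluster_ca_certificate = base64decode(aws_eks_cluster.this.certificate_authority[0].data)

  # Fetch a fresh token so the provider authenticates against the new cluster.
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", aws_eks_cluster.this.name]
  }
}

resource "kubernetes_config_map" "aws_auth" {
  # Explicit ordering: don't touch the ConfigMap until the control plane exists.
  # Note this only orders the steps; it does not fully solve the case where the
  # endpoint reports ready before the API server actually accepts connections.
  depends_on = [aws_eks_cluster.this]

  metadata {
    name      = "aws-auth"
    namespace = "kube-system"
  }

  data = {
    mapRoles = yamlencode([{
      rolearn  = aws_iam_role.worker.arn # assumed name
      username = "system:node:{{EC2PrivateDNSName}}"
      groups   = ["system:bootstrappers", "system:nodes"]
    }])
  }
}
```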

@satadruroy
Contributor

Security groups not getting deleted during terraform destroy is a known issue:

hashicorp/terraform-provider-aws#2445

Usually, another destroy after a failed one cleans things up.

hashicorp/terraform-provider-aws#2445 (comment)

@HartS
Contributor Author

HartS commented Aug 14, 2020

Just to follow up: running terraform destroy again didn't fix the issue; I kept seeing the same error. After cleaning up the ENI that had been associated with a deleted worker node, I was able to finish the destroy. Since this isn't a regular occurrence, it does point to a race condition, whether in cap-terraform or the Terraform EKS provider.
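For anyone hitting the same state, the manual cleanup can be sketched with the AWS CLI. The security group ID below is taken from the error above; the ENI IDs are placeholders you would substitute from the first command's output:

```shell
# List ENIs still attached to the security group that blocks deletion:
aws ec2 describe-network-interfaces \
  --filters Name=group-id,Values=sg-069b01ce27e2bffd3 \
  --query 'NetworkInterfaces[].{Id:NetworkInterfaceId,Status:Status,Desc:Description}'

# For each leaked interface, detach it (if still attached), then delete it:
aws ec2 detach-network-interface --attachment-id eni-attach-0123456789abcdef0  # placeholder
aws ec2 delete-network-interface --network-interface-id eni-0123456789abcdef0  # placeholder

# Then re-run the destroy:
terraform destroy
```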

@satadruroy
Contributor

The aws_auth ConfigMap timeout is a core EKS issue: the API server is not actually ready even when it reports that it is.

terraform-aws-modules/terraform-aws-eks#621

aws/containers-roadmap#654

@satadruroy
Contributor

satadruroy commented Aug 17, 2020

@HartS can you disentangle the two issues, please? They both stem from race conditions, but they are not related.
