terraform epoll_wait high cpu utilization #3276

Closed
jgrammen-agilitypr opened this issue Jun 4, 2018 · 23 comments

@jgrammen-agilitypr

  • Your Windows build number: (Type ver at a Windows Command Prompt)

Microsoft Windows [Version 10.0.16299.431]

  • What you're doing and what's happening: (Copy&paste specific commands and their output, or include screen shots)
    I am using Terraform to start virtual infrastructure in AWS. While Terraform is waiting for input, it shows very high CPU utilization. This appears to be related to epoll_wait.
    The WSL version of Terraform experiences this high CPU usage, whereas the Windows version does not.

I am quite aware that the developers are not necessarily familiar with Terraform, but this no longer appears to be a Terraform issue; it looks like an issue with how WSL handles the epoll_wait calls.

terraform apply

Screenshots:
WSL version showing very high CPU while waiting for user input
terraform-high-cpu-2
Windows version of Terraform showing no high CPU
terraform-windows

terraform -v

Terraform v0.11.3
+ provider.aws v1.21.0

WSL / ubuntu for windows (on windows 10 build 1709)

Distributor ID: Ubuntu
Description:    Ubuntu 16.04.3 LTS
Release:        16.04
Codename:       xenial

terraform manifest

# ca04test01
resource "aws_instance" "ca04test01" {
	key_name = "${var.keypair}"	
	ami = "${var.ami_xenial}"
	instance_type = "${var.server_type["ca04test01"]}"
	subnet_id = "${var.subnet_production["ent_app_preprod"]}"
	private_ip = "${var.server_ip["ca04test01"]}"
	# vpc_security_group_ids has to be a list, so enclose in [] to make it a 1 item list
	vpc_security_group_ids = ["${var.sg_production["ent_app_preprod"]}"]
	
	lifecycle {
		ignore_changes = ["tags"]
	}
	count = "1"
	user_data = "#cloud-config\nhostname: ca04test01.agilitypr.internal\nfqdn: ca04test01.agilitypr.internal"

	tags {
		type = "test"
		role = "test"
		Name = "ca04test01"
	}
	
	provisioner "remote-exec" {
		inline = [
			"printf '\n${var.server_ip["ca04test01"]} ca04test01.agilitypr.internal' | sudo tee -a /etc/hosts",
			"sleep 1" # trick to get terraform to finish above command before closing ssh conn
			# command || true # force command to exit and allow terraform to continue
		]
		connection {
			type = "ssh"
			user = "ubuntu"
		}
	}
}
  • What's wrong / what should be happening instead:

While waiting for user input, the application should not be using very high CPU.

  • Strace of the failing command, if applicable: (If some_command is failing, then run strace -o some_command.strace -f some_command some_args, and link the contents of some_command.strace in a gist here)

straces gist:
https://gist.github.com/jgrammen-agilitypr/47bf7baa6d81b1bbccd6e9189c047aff


@therealkenc
Collaborator

Microsoft Windows [Version 10.0.16299.431]

Try 17134 and see how that fares. But see #3191, which reportedly is okay in 17134 through 17661 but not okay in 17666 - 17???.

@jgrammen-agilitypr
Author

ok, I will attempt to get an environment setup on the insider release of windows to test.

@therealkenc
Collaborator

I will attempt to get an environment setup on the insider release of windows to test.

This one is awkward because WSL improved between FCU 16299 and April Update 17134, but #3191 is still under investigation. If it were me I would try 17134 stable first, not Insiders. Or try both if you are feeling highly motivated. Your problem could be fixed in 17134 but broken on the most recent Insiders build (17682 as of this writing). Or it could be an unrelated problem and spin in both. [Or, it could even be, albeit less likely, an unrelated problem and work fine in both 17134 and 17682, but not 16299.]

@jgrammen-agilitypr
Author

the issue is not fixed in 17134.

17134-high-cpu

my next step will be to test insiders / latest preview build

@jgrammen-agilitypr
Author

Testing on Insider Preview 17682: same problem, issue not resolved.

17682-high-cpu

@jgrammen-agilitypr
Author

what are the next steps in debugging this issue so that it can be resolved? debug logs? straces?

@Brian-Perkins

I also suspect this is the same as #3191 which is understood. There is an issue with large writes to non-blocking unix sockets that will cause them to always return EAGAIN.
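
For readers unfamiliar with EAGAIN on non-blocking sockets, here is a minimal Python sketch of the normal Linux behaviour: a non-blocking Unix stream socket accepts data until its buffer fills, and only then does `send` fail with EAGAIN (surfaced in Python as `BlockingIOError`). The function name `fill_until_eagain` and its parameters are made up for illustration; this shows the expected behaviour, not the WSL bug from #3191 itself.

```python
import socket


def fill_until_eagain(chunk_size=64 * 1024, max_chunks=10_000):
    """Write to a non-blocking Unix socket until the kernel reports EAGAIN.

    Returns the number of bytes accepted before the buffer filled up.
    (Hypothetical helper for illustration only.)
    """
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        a.setblocking(False)
        chunk = b"\0" * chunk_size
        sent_total = 0
        for _ in range(max_chunks):
            try:
                # send() may accept only part of the chunk; it returns the count.
                sent_total += a.send(chunk)
            except BlockingIOError:
                # EAGAIN / EWOULDBLOCK: the buffer is full because nobody reads from b.
                return sent_total
        raise RuntimeError("socket buffer never filled")
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    print("bytes accepted before EAGAIN:", fill_until_eagain())
```

On a correctly behaving kernel the function should return a positive byte count, since at least some data is accepted before EAGAIN. A kernel that returned EAGAIN for every large write would, under this sketch's assumptions, return 0.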

@therealkenc
Collaborator

It was almost duped right out of the gate. The ambiguity was caused by this (from #3191):

Tested and succeeded in 17134 and 17661.

The OP here is 16299.

@Brian-Perkins

@therealkenc - indeed, that is quite an oversight on my part; likely unrelated issues.

@Brian-Perkins

@jgrammen-agilitypr I downloaded a terraform binary and tried the example and did not see any CPU usage while it was prompting for the first value. Does this work for you as well?

@jgrammen-agilitypr
Author

@Brian-Perkins the example you used does not get far enough into the Terraform process to actually cause problems.

You need something closer to the example posted at the beginning of this thread.

I am linking to a gist which contains a more complete project example:
https://gist.github.com/jgrammen-agilitypr/57b00d549c73a276fa5b0e0d448aa3f7

All of those files go in a folder, in my case named "test". Inside the folder, run 'terraform init' to set up the AWS provider.
You need to fill in the masked variables (e.g. server_ip, sg_production, subnet_production, access_key, secret_key, etc.).

The example just creates an amazon ec2 nano instance, running ubuntu 16.04 xenial.

@Brian-Perkins

Unfortunately it looks like that sample actually tries to connect to AWS, as I am getting an error that the security token is invalid. So a repro from my side is seeming less likely. @jgrammen-agilitypr can you update your strace output with a "-f"? The original doesn't appear to contain the problematic loop. For starters I am mostly interested in what file type is causing the epoll_wait to always be ready so that I can at least narrow down the problem-space.
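
One way to find a spinning loop in a large `strace -f -o` log is to count syscalls per PID and see which one dominates. The Python sketch below assumes the usual `-f -o` line layout, where each line starts with the PID followed by the call. The helper name `count_syscalls` and the sample lines are invented for illustration and are not taken from the attached trace.

```python
import re
from collections import Counter

# Matches lines such as "101 epoll_wait(4, ...". Lines of the form
# "<... read resumed>" are deliberately not matched, so a call that was
# split into unfinished/resumed halves is counted once.
_CALL = re.compile(r"^(\d+)\s+([a-z_0-9]+)\(")


def count_syscalls(lines):
    """Return a Counter keyed by (pid, syscall_name)."""
    counts = Counter()
    for line in lines:
        match = _CALL.match(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


# Made-up sample lines, only to show the shape of the input.
SAMPLE = [
    "101 epoll_wait(4, [{EPOLLIN, {u32=7, u64=7}}], 128, -1) = 1",
    "101 epoll_wait(4, [{EPOLLIN, {u32=7, u64=7}}], 128, -1) = 1",
    "102 read(0,  <unfinished ...>",
    "101 epoll_wait(4, [{EPOLLIN, {u32=7, u64=7}}], 128, -1) = 1",
    '102 <... read resumed> "y\\n", 4096) = 2',
]

if __name__ == "__main__":
    for (pid, name), n in count_syscalls(SAMPLE).most_common(3):
        print(pid, name, n)
```

To run it on a real log, pass an open file instead of `SAMPLE`, for example `count_syscalls(open("terraform_test_project.strace"))`. A PID with a very large `epoll_wait` count is a candidate for the busy loop, and the file descriptor reported in its event list can then be looked up in the same trace.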

@therealkenc
Collaborator

Doing this need-repro so it doesn't float indefinitely. Needs something self-contained (which I think is possible with terraform).

@jgrammen-agilitypr
Author

I apologize for the delay, I must have missed the notification of your comment.
I have attached a strace with -f. The file was too large for gist, so I am uploading a zipped copy here
terraform_test_project.zip

the command run was

strace -o terraform_test_project.strace -f terraform apply

@berney

berney commented Jul 4, 2018

You can reproduce it with this:

  1. mkdir repo
  2. cd repo
  3. Create main.tf with the following contents:

main.tf

resource "local_file" "foo" {
  content = "foo"
  filename = "${path.module}/foo.txt"
}
  4. terraform init
  5. terraform apply, wait at the prompt and observe high CPU utilisation

This will create a file called foo.txt with the contents "foo".

The first time you will need to run terraform init.

If the file exists remove it, run terraform apply and wait at the prompt.

I'm on Windows 1803 (OS Build 17134.112), in WSL, with Terraform v0.11.7, and I get 25% CPU utilisation in kernel mode. I have 2 cores with hyperthreading, so 4 logical processors, and the utilisation is balanced across all 4. In effect, at any point in time one CPU is at 100% utilisation, but there is no affinity to one CPU, so over time the load is spread across all CPUs and looks like 1/N utilisation.

With more complex projects I get 100% CPU utilisation (all cores pegged at 100% in kernel). I suspect this is because with more terraform plugins there are more processes and they are all spinning creating more load on more cores.
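
The 1/N arithmetic above can be written out as a small Python sketch. The function name `aggregate_cpu_percent` is made up for illustration. It assumes each spinning thread fully occupies one logical processor and that nothing else is running.

```python
def aggregate_cpu_percent(spinning_threads, logical_cpus):
    """Overall CPU utilisation if each spinning thread saturates one logical CPU.

    Assumes an otherwise idle machine; the result is capped at 100%.
    """
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    busy = min(spinning_threads, logical_cpus)
    return 100.0 * busy / logical_cpus


if __name__ == "__main__":
    # One spinning process on 4 logical processors.
    print(aggregate_cpu_percent(1, 4))
    # Four or more spinning plugin processes on the same machine.
    print(aggregate_cpu_percent(4, 4))
```

With one spinner on 4 logical processors this gives 25%, and with four or more it gives 100%, which matches the two observations above.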

@therealkenc
Collaborator

Much better, thanks @berney. You didn't say what version of Windows 10 you are on. If it still spins on 17704 (which you'll inevitably be asked to try) then that is a good repro.

@berney

berney commented Jul 5, 2018

Sorry, I updated my comment above to state I'm on Windows 1803 (OS Build 17134.112) and using Terraform v0.11.7 in WSL.

@jgrammen-agilitypr
Author

I have tested @berney 's reproduction steps in insiders build 17704 and get very high cpu usage.

17704-high-cpu-local

terraform -v

Terraform v0.11.7
+ provider.local v1.1.0

windows version

C:\Users\test>ver

Microsoft Windows [Version 10.0.17704.1000]

WSL

test@DESKTOP-668DBNN:/mnt/c/Users/test/Desktop/local$ lsb_release -a
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.4 LTS
Release:        16.04
Codename:       xenial

@therealkenc
Collaborator

I have tested @berney 's reproduction steps in insiders build 17704

Thanks, much appreciated. That's kind of what I figured, because this issue didn't seem to be dupe #3191 from the get-go. The devs will have to take a look from here.

Pro-tip: In general you can increase the chances of that happening by providing CLI repro steps from clean install, the first step often being either wget or apt. Think something a six year old can cut and paste into a terminal. Brian seemed to have Terraform at least limping along so that might not be necessary here. But in general, counting on someone to google a given topic specialisation is at best a crap shoot.

@jgrammen-agilitypr
Author

Ok, thanks for the pro tip.

If the devs need more information, debug logs, etc., ask away and I will do my best to provide what I can.

@Brian-Perkins

This is an issue with pipes and the EPOLLET epoll event flag, where the EPOLLET part is not being respected so the epoll_wait always returns immediately.
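
To illustrate what "respecting EPOLLET" means, here is a minimal Python sketch for Linux. It is not the WSL fix and not Terraform's own code. It registers the read end of a pipe with epoll, writes one byte, never reads it, and counts how many polls report the descriptor as ready. The helper name `count_wakeups` is made up for this example.

```python
import os
import select


def count_wakeups(flags, attempts=5):
    """Count how many of `attempts` polls report a pipe as readable.

    One byte is written once and never read, so the pipe stays readable
    for the whole test. (Hypothetical helper for illustration only.)
    """
    read_fd, write_fd = os.pipe()
    ep = select.epoll()
    try:
        ep.register(read_fd, flags)
        os.write(write_fd, b"x")  # a single readiness edge
        wakeups = 0
        for _ in range(attempts):
            # Short timeout so the edge-triggered case does not block forever.
            if ep.poll(0.05):
                wakeups += 1
        return wakeups
    finally:
        ep.close()
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print("level-triggered:", count_wakeups(select.EPOLLIN))
    print("edge-triggered: ", count_wakeups(select.EPOLLIN | select.EPOLLET))
```

On a Linux kernel that honours EPOLLET, the level-triggered case should report the pipe as ready on every poll (5 of 5), while the edge-triggered case should report it only once. If EPOLLET is ignored, as described above, the edge-triggered case would presumably behave like the level-triggered one, and a program that expects to sleep until the next edge would spin in epoll_wait. That last point is inferred from the description in this thread and not verified here.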

@benhillis
Member

Fixed in build 17723

@jgrammen-agilitypr
Author

Testing on build 17728 (somehow it skipped 17723) confirms that the issue with high load is fixed.

17728-no-high-cpu
