This repository has been archived by the owner on Jan 4, 2023. It is now read-only.

Move test agents to Linux+wptagent #98

Closed
pmeenan opened this issue Apr 26, 2017 · 30 comments

pmeenan commented Apr 26, 2017

Move the test agents to the new linux-based agent. This will open up the possibility of running lighthouse testing in the future.


pmeenan commented Apr 26, 2017

The plan is to have the 5/1 crawl run on the new agents, so hang on: the metrics may move or things may break (hopefully not, but I know better than to assume it will be smooth). Everything SHOULD continue to work, including response bodies.

I'm starting the migration now since the 4/15 crawl is complete. While I'm deploying the new VMs I'll also take the opportunity to upgrade VMware to the latest version on each of the 7 hosts.

The plan:

  1. Secure-erase the SSDs in the 7 VM servers (the earlier VMs used most of the disk space, and SSDs don't like running with few free blocks)
  2. Install the latest ESXi on the VM servers
  3. Partition each SSD to use only 80% of the free space (leave 20% for block management for performance)
  4. Set up a master VM using Ubuntu 16.04.2 (configured to auto-update the OS and agent code daily)
  5. Clone the master VM to all of the hosts for the same number of agents that were previously running (134 total: 14 on each of the 3 older servers and 23 on each of the 4 newer servers with more cores)
  6. When the 5/1 crawl starts, watch the host CPU utilization to see if more VMs can be deployed (the Linux VMs use much less CPU than the Windows ones on the public WPT instance)
  7. Watch the batch status to see the test throughput. The Windows VMs were running ~950 tests per 30-minute interval, and we need to keep at least that pace to complete the crawl in time (hopefully much faster with more agents, making room for Lighthouse testing). A rough pace-to-duration sketch follows the list.
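
For reference, here is a rough pace-to-duration sketch. The total test count is an assumption (roughly 500K), chosen to be consistent with the "~900 tests per 30 minutes ≈ 12 days" figure that comes up later in this thread:

```python
# Rough crawl-duration estimate from test throughput.
# TOTAL_TESTS is an assumption (~500K), roughly consistent with the
# "~900 tests per 30 minutes ~= 12 days" figure later in this thread.
TOTAL_TESTS = 500_000

def crawl_days(tests_per_30_min):
    """Days needed to finish the crawl at a given per-interval pace."""
    tests_per_day = tests_per_30_min * 48  # 48 half-hour intervals per day
    return TOTAL_TESTS / tests_per_day

for pace in (500, 950, 1050):
    print("%d tests/30min -> %.1f days" % (pace, crawl_days(pace)))
```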


rviscomi commented Apr 26, 2017

the metrics may move

What, if any, are the Windows/Linux differences in Chrome that we can expect to manifest in the metrics?


pmeenan commented Apr 26, 2017

There shouldn't be any. I use the Linux agent for a few of the public WPT locations and no issues have been reported.

The timings will actually get a bit better because the video frames are no longer limited to 10fps, but we generally don't expose the visual metrics (hopefully we will be able to after the migration, though that may take a few more cycles of tuning and addressing issues).


pmeenan commented Apr 27, 2017

And of course, 6 of the servers went nice and smooth, but the seventh went belly-up and even the IPMI interface went offline. I'll see if I can find a way to revive it tomorrow, and if not I'll ping the local hands-and-eyes. Worst case, I may need to ask one of you guys to road-trip to the data center in Redwood City with a laptop, a USB drive, and a few utilities, and I'll try walking you through reflashing it remotely.

As for the VMs that are running, they look to be working well. Here are sample desktop and mobile runs. The metrics, bodies, and Blink feature usage information all look good (to me anyway) in the HARs.


pmeenan commented Apr 27, 2017

Sent a note to the local hands-and-eyes (onsite tech) to see if they can take a look. I have the uneasy feeling that the motherboard toasted itself during the power cycles for the re-install. The server was ordered 4/2015 and has a 5-year hardware warranty with parts cross-shipment, so getting replacement parts shouldn't be an issue (actually replacing them may be).

igrigorik commented:

Keeping my fingers crossed... Worst case, I guess we can run with reduced capacity until the problem is resolved. In theory, if the perf gains from the new agents are there, perhaps we have just enough to squeeze in all the runs on the working hardware.


pmeenan commented Apr 27, 2017

Absolute worst case, I can spin up some VMs in Dulles to help fill the gap and make sure we complete on time. Pretty big unknown, since this is the pass where we're tuning scaling and perf :(


pmeenan commented May 1, 2017

I'm tuning a few things to see if I can get the throughput up, but it looks like, while CPU usage during the test is a lot lower, the overall CPU usage and throughput are actually worse. In particular, getting the video frames and bodies after the test completes, instead of recording it all during testing, appears to add a fair bit of time to each run.

The ~1000 tests/30min throughput is running closer to 500 right now (on reduced hardware, but the hardware is only responsible for a tiny fraction of that).

igrigorik commented:

Yikes, OK, what are our options here then? We can reduce the number of sites we crawl, or we can spin up more VM instances in Dulles... these are temporary solutions though.


pmeenan commented May 1, 2017

Separating out the 2 issues for now. Long-term, assuming the 7th server comes back:

  1. Switch back to the Windows C++ agent

    • Would effectively get us back to where we were before
    • Need to add Lighthouse support for future testing (at which point capacity becomes a concern/question)
    • Would need to keep supporting the agent (I was hoping to sunset it in favor of the cross-platform one)
  2. Improve the Python agent (running Linux testing). Basically a bunch of iterating on finding what the slow points of the analysis are and optimizing them. I had already done a lot of this, so there isn't a lot of "low-hanging fruit". By far the slowest part is getting the trace events from Chrome over the remote DevTools websocket. There are a few other ways to get the events, but improving this may require upstream changes in Chrome (with the added benefit of helping other tooling).

My preference is option 2, but I'm not sure there's a 100% speedup on the table. Neither option looks like it buys us enough headroom to add Lighthouse without other tradeoffs (fewer URLs, lower frequency, or more capacity).
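
For context on where the time goes in option 2, here is roughly what pulling trace events over the remote DevTools websocket looks like. This is a minimal sketch, not wptagent's actual code; it assumes Chrome was launched with --remote-debugging-port=9222 and that the websocket-client package is available:

```python
# Minimal sketch (not wptagent's actual code) of pulling trace events from
# Chrome over the remote DevTools websocket.
import json
import urllib.request

import websocket  # websocket-client package

# Find the debugger websocket URL for the first open page target.
targets = json.load(urllib.request.urlopen("http://localhost:9222/json"))
page = next(t for t in targets if t.get("type") == "page")
ws = websocket.create_connection(page["webSocketDebuggerUrl"])

ws.send(json.dumps({"id": 1, "method": "Tracing.start",
                    "params": {"transferMode": "ReportEvents"}}))
ws.recv()  # ack for Tracing.start; the page navigation would happen here
ws.send(json.dumps({"id": 2, "method": "Tracing.end"}))

# Chrome streams the buffered events back in Tracing.dataCollected chunks;
# this is the step that gets slow once a page racks up hundreds of MB of trace.
trace_events = []
while True:
    msg = json.loads(ws.recv())
    if msg.get("method") == "Tracing.dataCollected":
        trace_events.extend(msg["params"]["value"])
    elif msg.get("method") == "Tracing.tracingComplete":
        break

print("collected %d trace events" % len(trace_events))
```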


pmeenan commented May 1, 2017

ISC pulled the module and reset it, and it seems to have come back to life, so I'll get the 7th VM server configured now and then see where things stand.


pmeenan commented May 1, 2017

The 7th VM server is back, so we have all of our physical machines. I made one pass today at optimizing the trace processing (avoiding reading the traces back from disk just to process them).

I'm also giving the VMs more RAM since we're not going to be deploying as many as I had hoped (going from 2GB to 4GB). I'll finish rolling out the new VMs tonight and let the new code run overnight so we can see what the throughput looks like and how much tuning is left to do.

The next opportunity for optimization is going to be around the video frame processing. The new agents get frames at up to 60fps and do a lot more processing to fix them up. I could add logic to sample them at no higher a rate than the old desktop agents did, but it would be kind of a shame to give up the increased accuracy.

There is also a lot of work around fixing up the PNGs that come from DevTools (converting them to JPEG, cropping, etc.), which is probably where I'll put most of the initial optimization focus (tens of seconds of processing in each test, so a pretty good area to target).
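
For a sense of the per-frame work being described, here is a minimal sketch of the crop-and-convert step using Pillow. The crop box, JPEG quality, and directory name are illustrative placeholders, not the agent's actual settings:

```python
# Minimal sketch of the per-frame fix-up described above: crop the DevTools
# screenshot and convert PNG -> JPEG.  Assumes Pillow; the crop box and JPEG
# quality are illustrative placeholders, not wptagent's actual settings.
import glob
import os

from PIL import Image

def convert_frames(frame_dir, viewport=(1024, 768), quality=75):
    for png_path in sorted(glob.glob(os.path.join(frame_dir, "*.png"))):
        with Image.open(png_path) as img:
            # Crop to the viewport and flatten to RGB (JPEG has no alpha).
            cropped = img.crop((0, 0, viewport[0], viewport[1])).convert("RGB")
            cropped.save(png_path[:-4] + ".jpg", "JPEG", quality=quality)
        os.remove(png_path)  # the PNG is only an intermediate artifact

convert_frames("video_frames")  # hypothetical directory of captured frames
```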

igrigorik commented:

Thanks for the updates, Pat; curious to see tonight's results.

Re: accuracy: true, but perhaps for HA specifically it's not a huge deal? As in, perhaps we should intentionally run at lower accuracy for frame metrics, since we don't necessarily trust latency-based metrics anyway in the context of HA?


pmeenan commented May 2, 2017

The 7th VM server and the trace optimizations brought us up to ~625.

I added support for seeing which test and agent is currently running to debug the pathological case where an agent can get stuck running a test for over an hour. The first finding was that we were running the in-page JavaScript for extracting metrics and bodies even on failed tests, but each step would time out with no response and take a really long time to complete. JavaScript modal popups were causing similar issues.

I added some code to automatically dismiss modal alerts and to skip that processing on failed tests, and that fixed all of the current fail cases I had. From the looks of the status page, 10+ agents at any point in time were stuck in this state, so hopefully this buys us another hundred or so runs per hour. I just pushed the update and will watch the agents and pick off any more fail cases as I see them pop up.
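
For reference, the dialog handling amounts to something like the following over the DevTools protocol. This is a sketch, not the actual agent code; Page.handleJavaScriptDialog is the real CDP method, and the connection setup mirrors the tracing sketch earlier in this thread:

```python
# Sketch of auto-dismissing JavaScript dialogs over the DevTools protocol
# (not the actual agent code).
import json
import urllib.request

import websocket  # websocket-client package

targets = json.load(urllib.request.urlopen("http://localhost:9222/json"))
page = next(t for t in targets if t.get("type") == "page")
ws = websocket.create_connection(page["webSocketDebuggerUrl"])
ws.send(json.dumps({"id": 1, "method": "Page.enable"}))

while True:  # in the real agent this runs as part of the main message pump
    msg = json.loads(ws.recv())
    if msg.get("method") == "Page.javascriptDialogOpening":
        # Accept (dismiss) alert()/confirm()/prompt() so a test never hangs.
        ws.send(json.dumps({"id": 2, "method": "Page.handleJavaScriptDialog",
                            "params": {"accept": True}}))
```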

In the meantime, I also have a 260MB trace from a page that takes ~5 minutes to process that I'm using to fix some N^2 issues with the trace parsing.
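
The usual shape of those N^2 fixes is replacing repeated scans of the full event list with a one-pass index; an illustrative sketch (not the agent's actual data model):

```python
# Illustrative only: replace a per-lookup scan of the full trace with a
# one-pass index keyed on (pid, tid), turning O(N^2) work into O(N).
from collections import defaultdict

def index_events(trace_events):
    """Group trace events by process/thread in a single pass."""
    by_thread = defaultdict(list)
    for event in trace_events:
        by_thread[(event.get("pid"), event.get("tid"))].append(event)
    return by_thread
```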

igrigorik commented:

In the meantime, I also have a 260MB trace from a page that takes ~5 minutes to process that I'm using to fix some N^2 issues with the trace parsing.

Ah, the wonders of the modern web... Dare I ask which site that is?


pmeenan commented May 2, 2017

Not sure about that one right now, but here is a test from the set I was looking at that timed out loading the mobile site (2 minutes) but managed to rack up 300MB of trace data up to that point.

At some point I may just turn off timeline capture since we aren't actually using it for anything, but the data could be valuable, so I don't want to completely give up on it (and I am making good progress fixing the issues that the extreme cases bring up).


pmeenan commented May 2, 2017

The recent round of fixes brought us up into the 700 range. Still not a sure thing, but I'm cautiously optimistic that continuing to focus on the more extreme cases (mostly around trace and video processing) will get us back to where we need to be.


pmeenan commented May 3, 2017

OK, now we're cooking with gas. The latest round of optimizations brought it up to ~900, which puts us right around 12 days of run time (pretty much where we were before). There's still some room left, and I'll keep plugging away to see how far I can push it, but we're out of crisis mode now.


rviscomi commented May 3, 2017

That's great news. Thanks Pat!

igrigorik commented:

@pmeenan you, sir, are awesome. But I guess we already knew that... :-)


pmeenan commented May 4, 2017

ok, I think I'm done tuning and optimizing. It's running in the 1000-1050 range which puts us around 10% faster than the Windows agents were running.

I think the timing data is finally going to be useful (maybe for the next run, since this one started out pretty hot). The server is running hot but not oversubscribed (90-95% CPU), and the few tests I looked at appeared to have reasonable request timings and matched some spot tests I ran against the public WPT instance. They won't necessarily be useful for down-to-the-millisecond timings, but directionally and in the ballpark they look good.

I'll leave this open until the crawl completes and we validate that the HAR processing and custom metrics all worked with the new agents.

igrigorik commented:

ok, I think I'm done tuning and optimizing. It's running in the 1000-1050 range which puts us around 10% faster than the Windows agents were running.

That's awesome. Great job Pat - as usual.

Re: timing: what throttling logic do we have in place for these runs? I presume we don't CPU-throttle the tests?


pmeenan commented May 5, 2017

We traffic-shape the connections (netem) but don't CPU-throttle yet. I made one pass at getting CPU throttling working, but it didn't seem to actually change the JS execution times, so I need to figure out how I'm holding it wrong. There is code in place to benchmark the test machine so the throttling can be applied in such a way that different environments all throttle to the same effective speed.
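
For reference, a minimal sketch of driving netem-style shaping from Python (not the agent's exact tc invocation; the interface name and the 3G-ish numbers are placeholders, and it needs root):

```python
# Sketch of netem-based traffic shaping for outbound traffic (not the agent's
# exact tc commands).  Interface name and numbers are placeholders; needs root.
import subprocess

def shape(interface="eth0", delay_ms=75, rate_kbps=1600):
    """Apply latency and a bandwidth cap to a network interface with netem."""
    subprocess.check_call(["tc", "qdisc", "replace", "dev", interface, "root",
                           "netem", "delay", "%dms" % delay_ms,
                           "rate", "%dkbit" % rate_kbps])

def unshape(interface="eth0"):
    """Remove the shaping qdisc and return to the default."""
    subprocess.check_call(["tc", "qdisc", "del", "dev", interface, "root"])
```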

The VMs only have a single CPU allocated, so they are pretty low-end on the desktop scale of things, but still significantly faster than, say, a Moto G4.

We're also currently emulating a Nexus 5 (screen and UA string). It's probably worth re-visiting that when the CPU throttling is working.

igrigorik commented:

We're also currently emulating a Nexus 5 (screen and UA string). It's probably worth re-visiting that when the CPU throttling is working.

Agreed, it seems like lots of folks are converging on a ~Moto G4 profile.


pmeenan commented May 5, 2017

Rather than use the built-in CPU emulation, I think I'm going to look at kernel-level throttling with cgroups/CFS/etc. (at least for Linux). That will give much more accurate throttling, and it looks like it might not be that hard.
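
A minimal sketch of what CFS bandwidth throttling through cgroups (v1) could look like; the group name and the 0.5x cap are hypothetical placeholders, not the agent's implementation, and it needs root:

```python
# Sketch of CFS bandwidth throttling via cgroups v1 (not the agent's actual
# implementation).  The group name and cpu_fraction are placeholders; needs root.
import os

CGROUP = "/sys/fs/cgroup/cpu/wptagent_throttle"  # hypothetical group name

def throttle(pid, cpu_fraction=0.5, period_us=100000):
    """Cap a process to cpu_fraction of one CPU using the CFS quota/period."""
    os.makedirs(CGROUP, exist_ok=True)  # creating the directory creates the cgroup
    with open(os.path.join(CGROUP, "cpu.cfs_period_us"), "w") as f:
        f.write(str(period_us))
    with open(os.path.join(CGROUP, "cpu.cfs_quota_us"), "w") as f:
        f.write(str(int(period_us * cpu_fraction)))
    with open(os.path.join(CGROUP, "cgroup.procs"), "w") as f:
        f.write(str(pid))  # move the browser process into the throttled group
```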


pmeenan commented May 9, 2017

CPU throttling is ready to go as soon as this crawl completes (I need to re-deploy the VMs with cgroup support added). I also switched the default mobile profile to the Moto G4.


pmeenan commented May 15, 2017

Throttling kicked in for the 5/15 crawl, and I verified that it is working as expected. That said, the perf data from 5/1 will be about the same. The throttling code benchmarks the test system and determines the actual multiplier to use from the calibrated profiles, and it looks like the VMs aren't all that much faster than a Moto G when they are under testing load; the effective multiplier ends up being 1.024x. That does mean the desktop results will skew towards slower desktop systems, but the unthrottled mobile tests are about the same as throttled.
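
The multiplier is essentially a ratio of benchmark scores; a hedged illustration of how a figure like 1.024x could fall out (the scores below are made-up placeholders, and the benchmark itself is a stand-in for the agent's calibration routine):

```python
# Illustrative only: how a throttle multiplier like 1.024x falls out of a
# benchmark-score ratio.  The scores are made-up placeholders; only the 1.024
# result is quoted from this thread.
def throttle_multiplier(host_score, target_device_score):
    """How many times faster the host is than the target device; the CPU is
    then slowed by this factor so tests run at target-device speed."""
    return host_score / target_device_score

# e.g. a VM under load that benchmarks barely faster than the Moto G profile:
print(round(throttle_multiplier(1024.0, 1000.0), 3))  # -> 1.024
```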

igrigorik commented:

Reminded of... "You get a Moto G4, and you get a Moto G4, everyone..." :-)

@pmeenan any other outstanding TODOs on the agent migration?


pmeenan commented May 16, 2017

The only TODOs are sorting out the drop in HTML/CSS/JS bytes (or verifying that it really is fixed in the 5/15 crawl).

We should also make sure the HARs look good and that the Blink Feature Usage data made it in there.

Otherwise I think we're good.
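
For the HAR check, a quick sanity-check sketch along those lines (the underscore-prefixed field name it looks for is hypothetical; the actual field names in the HTTP Archive HARs may differ):

```python
# Sketch of a HAR sanity check: counts requests, response bodies, and pages
# that carry Blink feature data.  The "_blink" key prefix is hypothetical;
# the real custom field names in the HARs may differ.
import json

def check_har(path):
    with open(path) as f:
        har = json.load(f)["log"]
    pages, entries = har["pages"], har["entries"]
    assert pages and entries, "empty HAR"
    # Response bodies ride along in the standard HAR content.text field.
    bodies = sum(1 for e in entries if e["response"]["content"].get("text"))
    # Custom agent data is carried in underscore-prefixed extension fields.
    feature_pages = [p for p in pages
                     if any(k.lower().startswith("_blink") for k in p)]
    print("%d requests, %d bodies, %d pages with Blink feature data"
          % (len(entries), bodies, len(feature_pages)))
```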


pmeenan commented Aug 19, 2017

Trends are now reporting HTML, JS, and CSS sizes in line with what we'd expect, which resolves the last known issue with the Linux agent migration. Closing this as complete.

pmeenan closed this as completed Aug 19, 2017