
Web loading is extremely slow #38

Open
acatxnamedvirtue opened this issue Aug 26, 2019 · 6 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@acatxnamedvirtue

Hello there!

We use control-tower to deploy our concourse instance to AWS and we absolutely love it.
However, as we add more jobs and more pipelines, we are experiencing super slow page load times.

We're currently deploying with these flags:
--iaas aws
--region us-east-1
--workers 4
--worker-type m5
--worker-size 4xlarge
--web-size 2xlarge
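(For reference, a sketch of what the full invocation with those flags would look like — the deployment name `my-deployment` is a placeholder, and the exact command shape should be checked against the control-tower README:)

```shell
# Sketch of the deploy command described above; "my-deployment" is a
# placeholder for the actual deployment name.
control-tower deploy \
  --iaas aws \
  --region us-east-1 \
  --workers 4 \
  --worker-type m5 \
  --worker-size 4xlarge \
  --web-size 2xlarge \
  my-deployment
```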

And even though the web-size is 2xlarge, it's still very slow (3-6s page load times). From looking in the network tab, this is mostly coming from the "pipelines" and "jobs" calls. We could split pipelines out to separate teams, but since we're a fairly small company (100ish engineers) we appreciate the pipeline visibility, especially during on-call rotations where quickly redeploying a last known version is helpful. We could also start spinning up new control-tower concourse deployments to various sub-domains, but that's a little annoying from a management perspective.

We're wondering if you have any insight into this, or if you are planning on bumping up the options for maximum web node size (t3's would be particularly nice, but beefier instances would be great, too), or maybe it's just time for us to figure out the BOSH deployment on our own :)

Thanks for your help!

Tyler Beebe
Software Engineer
Meetup

@DanielJonesEB
Contributor

Hi @acatxnamedvirtue, thanks for reaching out.

Right now we don't have any plans to add new instance types, although it doesn't sound like a massive change (famous last words; it gets tricky when instance types aren't supported in all zones).

The latest release of Control Tower has an improved metrics dashboard courtesy of @gerhard. It doesn't yet show database metrics, but it might help you dig deeper into where the issues are arising.

@gerhard

gerhard commented Sep 1, 2019

@acatxnamedvirtue it's most likely network latency / network throughput / disk IOPS on the db. It may be CPU contention on the web instances, but this is less likely. Without metrics, it's just a guess.

This is a real-world example of the Concourse Dashboard that @DanielJonesEB mentioned. In RabbitMQ's Concourse case, the db is constantly averaging ~115Mbps of outgoing network traffic (bursts of up to 225Mbps). My suspicion is that this is your bottleneck, especially if you are using a managed DB instance.

[image: dashboard panel showing the db's outgoing network traffic]

API Response Duration is also worth looking at:

[image: API Response Duration panel]

@crsimmons
Contributor

It's worth noting that our implementation of Gerhard's dashboard doesn't have the DB metrics graphs because we use RDS/CloudSQL rather than a BOSH deployed vm. You may be able to figure out the DB network metrics via the IaaS though.
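(One way to pull those numbers on AWS is CloudWatch — an untested sketch, assuming a standard RDS instance; `my-concourse-db` is a placeholder identifier, and `NetworkTransmitThroughput` is the stock CloudWatch metric for outgoing DB network traffic in bytes/second:)

```shell
# Average outgoing network throughput for the RDS instance over the
# last hour, in 5-minute buckets. Replace the placeholder identifier.
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name NetworkTransmitThroughput \
  --dimensions Name=DBInstanceIdentifier,Value=my-concourse-db \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average
```

(`ReadIOPS` and `WriteIOPS` in the same namespace would cover the disk-IOPS guess.)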

@crsimmons crsimmons added enhancement New feature or request question Further information is requested labels Sep 4, 2019
@acatxnamedvirtue
Author

Ah thanks for the pointers! I was able to take this and try a couple things out. Things I learned:

  1. Throwing bigger machines at the web did nothing to help, like you said.
  2. I redeployed with the newest control-tower and got access to that sweet new dashboard (thanks @gerhard, this is the dashboard I've been wanting for ages!). API Duration was wild, oscillating between 10ms and 10s for all calls.
  3. I reached out in concourse's discord for some help, and saw a similar message about someone looking for ideas on how to optimize performance, so I decided to give bumping up the DB machine a shot. I figured I'd do it manually first via the AWS console, so I went in and saw a message that said something like this:
    "Provisioning less than 100GiB may cause poor IOPS performance"
    So I did two things, I bumped the db storage size up to 100GiB and also bumped the machine type up to a m5.large.
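(The same console change can also be scripted — a sketch, with `my-concourse-db` as a placeholder identifier; `--apply-immediately` makes the resize happen right away instead of in the next maintenance window:)

```shell
# Bump the RDS instance to 100 GiB of storage and an m5.large class,
# mirroring the manual console change described above.
aws rds modify-db-instance \
  --db-instance-identifier my-concourse-db \
  --allocated-storage 100 \
  --db-instance-class db.m5.large \
  --apply-immediately
```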

Here's the performance increase I saw in the API duration almost immediately afterwards:
[image: API duration graph after the change]

The web ui is SO FAST NOW, which is super exciting!!

I then decided to scale the machine back down to see if it was the machine size or the storage size.

I'm back down to running the default db.t2.medium, but still with 100 GiB storage, and still seeing zippy API durations.
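(The current class and storage can be confirmed from the CLI — a sketch, identifier is again a placeholder:)

```shell
# Show the instance class and allocated storage (GiB) for the DB.
aws rds describe-db-instances \
  --db-instance-identifier my-concourse-db \
  --query 'DBInstances[0].[DBInstanceClass,AllocatedStorage]'
```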

My recommendation to y'all might be to make the storage size of the DB a setting you can change via a deploy flag, or to make the default 100GiB. I'm pretty sure the next time I do a control-tower deploy, Terraform will scale it back down (or maybe even delete it? I'm new-ish to Terraform).

Anyway, for now I've solved our main problem, and I'm super excited. Thanks for all of your help! Feel free to close this Issue at your convenience.

@acatxnamedvirtue
Author

acatxnamedvirtue commented Sep 6, 2019

Here's something from the Concourse Discord, wish I had known haha:
[screenshot from the Concourse Discord]

@gerhard

gerhard commented Sep 6, 2019

@acatxnamedvirtue glad that you managed to pinpoint the root cause of your slow page loads!

I plan on rolling out node-exporter in our infra and replacing this Concourse dashboard with a Prometheus-based one, especially after concourse/concourse#4247 (comment) gets merged. With node-exporter, it would be easy to show host metrics such as disk IOPS alongside Concourse metrics, which would have made the issue you hit trivial to spot. Eventually, I would really like to see IaaS thresholds (e.g. disk IOPS limits) integrated in this new dashboard, so that we can have something like this (replace Memory available before publishers blocked with Disk IOPS available):

[image: example dashboard panel with a threshold line]
