Slow performance with large task cache #5298

Akhalaka · 2020-03-12T14:56:29Z

Bug Report

A Concourse job has no steps running for an extended period (2m+) following a task step with a large (few GB) cache. From the UI it appears that nothing is running, but based on the logs it looks like it's waiting on the creation of the cache volume. Initially, I considered old bug reports like #1404, but this can be reproduced using btrfs and non-privileged containers.

This was initially discovered using the oci-build-task (https://github.com/vito/oci-build-task), but it took a while to manifest because the cache grows slowly and each worker has its own cache. In this instance, the pause is long enough to negate the benefit of using the Concourse cache as a Docker build layer cache, as rebuilding the layers ends up faster.

Steps to Reproduce

I was able to reproduce on a single web node/single worker instance of Concourse using the following pipeline: https://gist.github.com/Akhalaka/160705fefc2cece853e4ff3f56f9a7fc. It consists of a dumb job that creates 1024 1 MB files in the cache, so the cache grows by 1 GB on every run. This is probably larger than a normal build cache will grow with a well-written Dockerfile, but it is meant to demonstrate the problem quickly. Also included in the gist are the relevant debug logs. Nothing stands out as to why the volume creation is taking so long.
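For readers who cannot open the gist, here is a rough Python sketch of what the cache-filling step does. It is not the script from the gist: the function names, file naming scheme, and the use of random bytes are all assumptions made for illustration. In the real pipeline the target directory would be the task's cache path.

```python
import os
import tempfile
import uuid


def fill_cache(cache_dir, files=1024, file_size=1024 * 1024):
    """Add `files` new files of `file_size` bytes each to cache_dir.

    Hypothetical stand-in for the 'Fill Cache' step: with the defaults,
    each call grows the directory by 1 GB (1024 files of 1 MB).
    """
    os.makedirs(cache_dir, exist_ok=True)
    run_id = uuid.uuid4().hex  # unique prefix so each run adds new files
    for i in range(files):
        path = os.path.join(cache_dir, f"{run_id}-{i:04d}.bin")
        with open(path, "wb") as fh:
            fh.write(os.urandom(file_size))
    return files * file_size


def cache_size_bytes(cache_dir):
    """Total size of all regular files under cache_dir."""
    total = 0
    for root, _dirs, names in os.walk(cache_dir):
        for name in names:
            total += os.path.getsize(os.path.join(root, name))
    return total


if __name__ == "__main__":
    # Small demo sizes so the sketch runs quickly; the report uses the defaults.
    demo_dir = tempfile.mkdtemp()
    fill_cache(demo_dir, files=4, file_size=1024)
    fill_cache(demo_dir, files=4, file_size=1024)
    print(cache_size_bytes(demo_dir))
```

Calling `fill_cache` once per build with the defaults mirrors the 1 GB-per-run growth described above.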

Other notes:

The 'next' task kicks off immediately following the atc.tracker.track.task-step.find-or-create-volume-for-container.release.released log message.
In the first build after flying the pipeline (or on a worker without an already established cache), the 'next' task kicks off immediately following the completion of the 'Fill Cache' step. I believe this is because the worker gets to skip the volume creation here: https://github.com/concourse/concourse/blob/master/atc/worker/volume.go#L128
It's easiest to reproduce in a setup with 1 worker, as each run will grow the same cache by 1 GB. With multiple workers the job can bounce around between them and grow separate caches, but you should still get there.

Expected Results

Subsequent steps should start quickly (within reason)

Actual Results

There is a pause between completing the step and starting the next step on the order of 30s * <size of cache in GB>.
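As a back-of-the-envelope illustration of that relationship, the sketch below applies the rough 30 s/GB figure to a few cache sizes. The constant is only the approximate rate observed in this report, not a measured or guaranteed value, and the function name is made up for the example.

```python
# Approximate rate observed in this report; an assumption, not a benchmark.
SECONDS_PER_GB = 30


def estimated_pause_seconds(cache_gb):
    """Estimated gap between steps for a cache of `cache_gb` gigabytes."""
    return SECONDS_PER_GB * cache_gb


if __name__ == "__main__":
    for gb in (1, 2, 4):
        print(f"{gb} GB cache -> ~{estimated_pause_seconds(gb)} s pause")
```

By this estimate a 4 GB cache gives a pause of about 120 s, which matches the 2m+ gaps described at the top of the report.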

Additional Context

The worker data dir is a btrfs image on xfs, mounted via a loopback device. The actual physical storage is a local SSD.

Version Info

Concourse version: v5.7.2
Deployment type (BOSH/Docker/binary): Docker
Infrastructure/IaaS: Bare Metal Kubernetes
Browser (if applicable): N/A
Did this used to work? Unsure

jamieklassen · 2020-03-23T13:24:47Z

This issue is more specific, but I can still imagine a scenario like this benefiting from #4337

Akhalaka added the bug label Mar 12, 2020
