Slow performance with large task cache #5298

Open
Akhalaka opened this issue Mar 12, 2020 · 1 comment

@Akhalaka

Bug Report

A Concourse job has no steps running for an extended period (2m+) following a task step with a large (few GB) cache. From the UI it appears that nothing is running, but based on the logs it looks like it's waiting on the creation of the cache volume. Initially I considered old bug reports like #1404, but this can be reproduced using btrfs and non-privileged containers.

This was initially discovered using the oci-build-task (https://github.com/vito/oci-build-task), but it took a while to manifest, as the cache grows slowly and each worker has its own cache. In this instance, the pause is long enough to negate the benefit of using the Concourse cache as a Docker build layer cache, since rebuilding the layers ends up faster.

Steps to Reproduce

I was able to reproduce this on a single web node/single worker instance of Concourse using the following pipeline: https://gist.github.com/Akhalaka/160705fefc2cece853e4ff3f56f9a7fc. It consists of a dumb job that creates 1024 1 MB files in the cache, so the cache grows by 1 GB on every run. This is probably larger than a normal build cache would grow with a well-written Dockerfile, but it is meant to demonstrate the problem quickly. The gist also includes the relevant debug logs. Nothing in them stands out as to why the volume creation takes so long.
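For readers who don't want to open the gist, here is a minimal sketch of what a "fill cache" step like the one described above could do. It is not a copy of the script in the gist: the function names (`fill_cache`, `cache_size_bytes`) and the file naming scheme are made up for illustration, and it assumes the task's cache directory is passed in as an ordinary path.

```python
import os
import tempfile
import uuid
from pathlib import Path

MIB = 1024 * 1024  # size of each file written per run


def fill_cache(cache_dir, file_count=1024, file_size=MIB):
    """Add file_count new files of file_size bytes to cache_dir.

    Returns the number of bytes written. File names get a per-run
    prefix so repeated runs grow the cache instead of overwriting it.
    (Hypothetical helper, not part of Concourse or the gist.)
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    run_id = uuid.uuid4().hex
    written = 0
    for i in range(file_count):
        (cache / f"{run_id}-{i:04d}.bin").write_bytes(os.urandom(file_size))
        written += file_size
    return written


def cache_size_bytes(cache_dir):
    """Total size of all regular files under cache_dir."""
    return sum(p.stat().st_size for p in Path(cache_dir).rglob("*") if p.is_file())


if __name__ == "__main__":
    # Small demo in a throwaway directory: two "runs" of 4 x 1 KiB files.
    with tempfile.TemporaryDirectory() as tmp:
        fill_cache(tmp, file_count=4, file_size=1024)
        fill_cache(tmp, file_count=4, file_size=1024)
        print(cache_size_bytes(tmp))
```

With the defaults (1024 files of 1 MiB), each call adds about 1 GB, which matches the growth per build described above.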

Other notes:

  • The 'next' task kicks off immediately following the atc.tracker.track.task-step.find-or-create-volume-for-container.release.released log message.
  • In the first build after flying the pipeline (or on a worker without an already established cache), the 'next' task kicks off immediately following the completion of the 'Fill Cache' step. I believe this is because the worker gets to skip the volume creation here: https://github.com/concourse/concourse/blob/master/atc/worker/volume.go#L128
  • It's easiest to reproduce in a setup with 1 worker, as each run will grow the same cache by 1 GB. With multiple workers the job can bounce around between them and grow separate caches, but you should still get there.

Expected Results

Subsequent steps should start quickly (within reason)

Actual Results

There is a pause between completing the step and starting the next step on the order of 30s * <size of cache in GB>.
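To make the scaling concrete, here is a back-of-the-envelope sketch of that estimate. The 30 s/GB figure is just the rough rate observed in this report, not a measured constant, and `estimated_pause_seconds` is a hypothetical helper name.

```python
SECONDS_PER_GB = 30  # rough rate observed in this report; an assumption, not a benchmark


def estimated_pause_seconds(cache_size_gb, seconds_per_gb=SECONDS_PER_GB):
    """Estimate the pause before the next step, assuming it scales linearly with cache size."""
    if cache_size_gb < 0:
        raise ValueError("cache size must be non-negative")
    return cache_size_gb * seconds_per_gb


# With the repro pipeline the cache grows by about 1 GB per build,
# so the cache size in GB is roughly the number of builds run so far.
for builds in (1, 2, 4, 8):
    print(f"{builds} GB cache: ~{estimated_pause_seconds(builds)} s pause")
```

At this rate a 4 GB cache already gives a pause of about 120 s, consistent with the 2m+ gaps mentioned at the top.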

Additional Context

The worker data dir is a btrfs image on xfs, mounted via a loopback device. The physical storage is a local SSD.

Version Info

  • Concourse version: v5.7.2
  • Deployment type (BOSH/Docker/binary): Docker
  • Infrastructure/IaaS: Bare Metal Kubernetes
  • Browser (if applicable): N/A
  • Did this used to work? Unsure
Akhalaka added the bug label Mar 12, 2020
@jamieklassen
Member

This issue is more specific, but I can still imagine a scenario like this benefiting from #4337
