Bug Report
A Concourse job has no steps running for an extended period of time (2m+) following a task step with a large (few GB) cache. From the UI it appears that nothing is running, but based on the logs it looks like it's waiting on the creation of the cache volume. Initially I considered old bug reports like #1404, but this can be reproduced using btrfs with non-privileged containers.
This was initially discovered using the oci-build-task (https://github.com/vito/oci-build-task), but it took a while to manifest since the cache grows slowly and each worker keeps its own copy. In this instance, the pause is long enough to negate the benefit of using the Concourse cache for the Docker build layer cache, as rebuilding the layers ends up being faster.
Steps to Reproduce
I was able to reproduce on a single-web-node/single-worker instance of Concourse using the following pipeline: https://gist.github.com/Akhalaka/160705fefc2cece853e4ff3f56f9a7fc. It consists of a dumb job that creates 1024 1 MB files in the cache, so the cache grows by 1 GB on every run. This is probably faster growth than a normal build cache with a well-written Dockerfile, but it is meant to demonstrate the problem quickly. The relevant debug logs are also included in the gist; nothing in them stands out to explain why the volume creation takes so long.
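For reference, the cache-filling task in the gist amounts to something like the sketch below. This is a paraphrase, not the exact gist contents; the function name, `cache` directory, and run-ID prefix are illustrative assumptions.

```shell
#!/bin/sh
# Sketch of a 'Fill Cache'-style task: add N one-megabyte files to the
# task cache on every run, so the cache grows by ~1 GB per build when N=1024.
# fill_cache and the run-id prefix are illustrative names, not from the gist.
set -e

fill_cache() {
  cache_dir="$1"
  file_count="$2"
  mkdir -p "$cache_dir"
  # Unique prefix per run so each build adds new files rather than
  # overwriting the previous run's files.
  run_id="$(date +%s)"
  i=1
  while [ "$i" -le "$file_count" ]; do
    dd if=/dev/zero of="$cache_dir/$run_id-$i" bs=1M count=1 2>/dev/null
    i=$((i + 1))
  done
}

# In the pipeline's task script this would be invoked as:
# fill_cache cache 1024
```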
Other notes:
The 'next' task kicks off immediately following the atc.tracker.track.task-step.find-or-create-volume-for-container.release.released log message.
In the first build after flying the pipeline (or on a worker without an already established cache), the 'next' task kicks off immediately after the 'Fill Cache' step completes. I believe this is because the worker gets to skip the volume creation here: https://github.com/concourse/concourse/blob/master/atc/worker/volume.go#L128
It's easiest to reproduce in a setup with a single worker, since each run grows the same cache by 1 GB. With multiple workers the job can bounce around between them and grow separate caches, but you should still get there eventually.
Expected Results
Subsequent steps should start quickly (within reason)
Actual Results
There is a pause between completing the step and starting the next step on the order of 30 s × (cache size in GB).
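As a quick sanity check on that figure: the ~30 s/GB rate comes from the observations above, and the 4 GB cache size below is an arbitrary example, not a measured value.

```shell
# Back-of-the-envelope estimate of the pause, assuming the observed
# ~30 seconds of delay per GB of cache. cache_gb=4 is an example value.
cache_gb=4
pause_s=$((cache_gb * 30))
echo "expected pause for ${cache_gb} GB cache: ~${pause_s}s"
# A 4 GB cache would give a ~120 s pause, consistent with the 2m+ gap reported.
```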
Additional Context
The worker data dir is a btrfs image on XFS, mounted via a loopback device. The underlying physical storage is a local SSD.
Version Info
Concourse version: v5.7.2
Deployment type (BOSH/Docker/binary): Docker
Infrastructure/IaaS: Bare Metal Kubernetes
Browser (if applicable): N/A
Did this used to work? Unsure