Task fails with "config file not found" after restarting Docker service #1796

Closed
mbjelac opened this issue Nov 8, 2017 · 6 comments

mbjelac commented Nov 8, 2017

Disclaimer: There are a couple of issues which seem to be the same as this one. However, I decided to create a new one because:

Bug Report

  • Concourse version: 3.5.0
  • Deployment type (BOSH/Docker/binary): Docker 17.06.0-ce, build 02c1d87
  • Infrastructure/IaaS: Linux (privately hosted virtual machine)
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.2 LTS"
NAME="Ubuntu"
VERSION="16.04.2 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.2 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
  • Browser (if applicable): Google Chrome 62.0.3202.89 (Official Build) (64-bit)
  • Did this used to work? No, it happens consistently with older versions too.

TL;DR

When the Docker service that hosts Concourse restarts (manually, or because the virtual or physical machine reboots), every build fails on its first task that loads its config from a YAML file, with the message:

task config '<yml-config-file-path>' not found

Steps to reproduce

  • install Docker on a machine (see version above)
  • start Docker service
  • start the Concourse containers with docker-compose up -d, using this docker-compose definition:
version: '3.3'

services:
  concourse-db:
    image: postgres:9.5
    volumes: ["/srv/docker/containers/concourse-db/data:/database"]
    restart: unless-stopped
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "5"
    environment:
      POSTGRES_DB: concourse
      POSTGRES_USER: concourse
      POSTGRES_PASSWORD: changeme
      PGDATA: /database

  concourse-web:
    image: concourse/concourse
    links: [concourse-db]
    command: web
    depends_on: [concourse-db]
    ports: ["7070:8080"]
    volumes: ["/etc/docker/container/concourse/keys/web:/concourse-keys"]
    restart: unless-stopped # required so that it retries until concourse-db comes up
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "10"
    environment:
      CONCOURSE_BASIC_AUTH_USERNAME: concourse
      CONCOURSE_BASIC_AUTH_PASSWORD: changeme
      CONCOURSE_EXTERNAL_URL: http://<ci-host>:7070
      CONCOURSE_POSTGRES_HOST: concourse-db
      CONCOURSE_POSTGRES_USER: concourse
      CONCOURSE_POSTGRES_PASSWORD: changeme
      CONCOURSE_POSTGRES_DATABASE: concourse

  concourse-worker:
    image: concourse/concourse
    privileged: true
    links: [concourse-web]
    depends_on: [concourse-web]
    command: worker
    volumes: 
      - "/etc/docker/container/concourse/keys/worker:/concourse-keys"
      - "/srv/docker/containers/concourse-worker/work-dir:/work-dir"
    restart: unless-stopped
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "5"
    environment:
      CONCOURSE_TSA_HOST: concourse-web
      CONCOURSE_WORK_DIR: /work-dir
  • create a pipeline
    • be sure to have jobs with tasks that require external YAML config files, and tasks which have inlined configuration
  • start the pipeline and verify that its jobs work as expected
  • on the machine where Docker is installed, restart the Docker service:
service docker stop
service docker start

What happens?

  • start a build of a job which has tasks defined in YAML config files
    • the task errors with message: task config '<yml-config-file-path>' not found
  • start a build of a job which has tasks with inlined configuration
    • they pass
  • start a job which has tasks with inlined configuration, but also includes output resources, such as e-mail
    • the PUT step errors with message: file not found
  • some inputs error with: unknown handle: <uuid>
    • this is the probable cause of the task config not found failure ... but what is the root cause?
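
To tell the three failure messages above apart when scanning build output, a small filter like the following may help. It is only a sketch: the function name `classify_symptom` and its labels are made up for illustration, and the patterns are copied from the messages quoted in this report.

```shell
#!/bin/sh
# classify_symptom (hypothetical helper): read build output on stdin and
# print one label for each line that matches a message from this report.
classify_symptom() {
    while IFS= read -r line; do
        case "$line" in
            *"task config '"*"' not found"*) echo "task-config-not-found" ;;
            *"unknown handle: "*)            echo "unknown-handle" ;;
            *"file not found"*)              echo "file-not-found" ;;
        esac
    done
}

# Possible usage (pipeline and job names are placeholders):
#   fly -t ci watch -j <pipeline>/<job> | classify_symptom
```

Lines that match none of the patterns are dropped, so an empty result means none of these specific messages appeared.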

Check workers

The single worker appears to be fine:

fly -t ci workers
name          containers  platform  tags  team  state    version
b640b1adefed  10          linux     none  none  running  1.2
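
For scripting around this check, the table can be reduced to name and state pairs. This is a minimal sketch that assumes the seven-column layout shown above; `worker_states` is a hypothetical helper, not part of fly. As this report shows, a worker can be listed as running while builds still fail, so this only reflects what fly reports.

```shell
#!/bin/sh
# worker_states (hypothetical helper): read `fly workers` output on stdin
# and print "<name> <state>" per worker. Assumes the 7-column layout shown
# above, with the state in column 6; a header row is skipped if present.
worker_states() {
    awk 'NF >= 7 && $1 != "name" { print $1, $6 }'
}

# Usage with the `ci` target from this report:
#   fly -t ci workers | worker_states
```

Fed the two lines of output above, it prints `b640b1adefed running`.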

After a while...

  • start a job - it hangs in the preparing build stage:
---------------
preparing build
---------------
(checkmark) checking pipeline is not paused
(checkmark) checking job is not paused
(checkmark) discovering any new versions of ci-tools
(checkmark) discovering any new versions of source
(checkmark) waiting for a suitable set of input versions
(checkmark) checking max-in-flight is not reached
<waits forever>

Fly still reports the single worker as running.

Quickfix

I always fix this in the same way. Maybe it's useful information.

  • stop Concourse containers: docker-compose down
  • delete worker folder: rm -rf <worker folder path>
  • start Concourse containers: docker-compose up -d
  • wait for worker to stall (or rather, wait for Fly to report it stalled), then prune it
  • everything goes back to normal
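
The steps above could be scripted roughly as follows. This is a sketch under assumptions, not a supported procedure: `WORK_DIR` must be set by the operator to the worker folder path (the report does not name it), `FLY_TARGET` defaults to the `ci` target used earlier, and the helper names are made up. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of the quickfix above. Assumptions (adjust before use):
#   WORK_DIR    worker folder path on the host; deliberately has no default
#   FLY_TARGET  fly target name ("ci" in this report)
#   DRY_RUN=1   print the commands instead of running them (the default)
WORK_DIR="${WORK_DIR:-}"
FLY_TARGET="${FLY_TARGET:-ci}"
DRY_RUN="${DRY_RUN:-1}"

# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# stalled_workers: read `fly workers` output on stdin and print the names
# of workers whose state column (column 6) is "stalled".
stalled_workers() {
    awk '$6 == "stalled" { print $1 }'
}

# reset_worker: stop Concourse, delete the worker folder, start it again.
reset_worker() {
    # Refuse to run rm -rf on an empty or root path.
    case "$WORK_DIR" in
        ""|"/") echo "refusing to delete '$WORK_DIR'" >&2; return 1 ;;
    esac
    run docker-compose down
    run rm -rf "$WORK_DIR"
    run docker-compose up -d
}

# prune_stalled: prune every worker that fly currently reports as stalled.
# Note: this queries fly even in dry-run mode; only the prune is skipped.
prune_stalled() {
    fly -t "$FLY_TARGET" workers | stalled_workers | while read -r name; do
        run fly -t "$FLY_TARGET" prune-worker -w "$name"
    done
}
```

Run `reset_worker` from the directory containing the compose file, wait until fly reports the old worker as stalled, then run `prune_stalled`. Set `DRY_RUN=0` only after checking the printed commands.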

vito commented Jan 24, 2018

This should end up being fixed by #1959, where volumes/containers that were hard-removed from a worker will be reaped from the ATC.

vito added this to Icebox in Runtime via automation Jan 24, 2018
Kehrlann added a commit to Kehrlann/concourse-demo that referenced this issue Jan 26, 2018
- This way, we can delete it because docker-compose
  seems to screw things up :
  concourse/concourse#1796
vito added a commit that referenced this issue Apr 11, 2018

topherbullock commented

Since #1959 is done, we should verify that this is no longer an issue with docker compose.
We should be able to follow each of the steps outlined by @mbjelac under "What happens?" and see that these failures no longer occur.

topherbullock moved this from Icebox to Backlog in Runtime Sep 13, 2018

vito commented Sep 13, 2018

@topherbullock iirc we ended up punting on this aspect of that issue, so this is probably not actually solved yet

topherbullock commented Sep 13, 2018

@vito
Oh right... I confirmed it's def still an issue.
Drafting an issue for GC-ing records in the DB which aren't on the worker anymore.

jama22 moved this from Backlog to In Flight in Runtime Oct 12, 2018
cirocosta moved this from In Flight to Backlog in Runtime Oct 15, 2018
jama22 assigned jama22 and topherbullock and unassigned jama22 Oct 15, 2018

kcmannem commented Nov 13, 2018

Looks like bind mounts don't persist in the mount table across Docker stops and starts. Don't really know how we get around this.
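
One way to observe this is to compare the mount table before and after restarting Docker. The sketch below is an assumption about how one might check, not necessarily how this was diagnosed: `is_mount_point` and `mounts_under` are hypothetical helpers that read `/proc/mounts`, where the second field of each line is the mount point.

```shell
#!/bin/sh
# is_mount_point PATH [TABLE] (hypothetical helper): succeed if PATH is
# listed as a mount point in TABLE. TABLE defaults to /proc/mounts.
# Caveat: spaces in mount points appear as \040 in /proc/mounts.
is_mount_point() {
    awk -v p="$1" '$2 == p { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# mounts_under PREFIX [TABLE] (hypothetical helper): print, sorted, every
# mount point in TABLE that starts with PREFIX.
mounts_under() {
    awk -v p="$1" 'index($2, p) == 1 { print $2 }' "${2:-/proc/mounts}" | sort
}

# Possible usage inside the worker container, before and after the restart
# (/work-dir is the CONCOURSE_WORK_DIR from the compose file in this report):
#   mounts_under /work-dir/ > /tmp/mounts.before
#   ... restart Docker ...
#   mounts_under /work-dir/ > /tmp/mounts.after
#   diff /tmp/mounts.before /tmp/mounts.after
```

Both helpers take an optional second argument so they can be pointed at a saved copy of the table instead of the live one.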

vito commented Nov 19, 2018

This should be somewhat mitigated by #2588, though it will take time for Concourse to confirm that the volume is gone.

For this to result in the errors you saw, it seems like Docker would have had to shut down the containers etc. and then bring them back under the same name, which is super weird. I've opened #2829 to suggest auto-deleting Docker-deployed workers if they're being shut down without draining.

Given the age of this issue, the fact that it's at least somewhat mitigated already, and we've got another issue proposing the next steps (which I've prioritized in the Operations backlog), I'm gonna close this one out. Thanks for the detailed report, @mbjelac!

vito closed this as completed Nov 19, 2018
Runtime automation moved this from Backlog to Done Nov 19, 2018
topherbullock removed this from Done in Runtime Mar 12, 2019