
Failure message: wrapping host binaries: pulling image: retry budget exhausted (10 attempts): running ["docker" "pull" "bash"]: exit status 1 (standard error: "Error response from daemon: Get https://registry-1.docker.io/v2/ #218

Closed
carbocation opened this issue Dec 17, 2020 · 12 comments

carbocation commented Dec 17, 2020

This is a new error message for me, and I checked the GCP status page in case it was transient, but I don't see any active issues.

I am launching a bunch of tasks across a bunch of zones (within a single job) using the google-cls-v2 API with dsub 0.4.3. I am getting the following error, but only on some tasks within this job (while others within the same job are launching and succeeding):

Failure message: wrapping host binaries: pulling image: retry budget exhausted (10 attempts): running ["docker" "pull" "bash"]: exit status 1 (standard error: "Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\n")

This strikes me as an odd message, because (1) I'm running these machines without a public IP, and so (2) I'm only using gcr.io Docker images with them. I don't know why they'd be pointing to registry-1.docker.io.
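
If I understand the networking correctly, a VM without a public IP reaches gcr.io via Private Google Access, and that path does not cover registry-1.docker.io. A minimal sketch of how that access gets enabled (the subnet and region names here are illustrative, not my actual setup):

    # Enable Private Google Access on a subnet; this covers Google endpoints
    # like gcr.io but does not provide a route to Dockerhub.
    gcloud compute networks subnets update default \
        --region=us-central1 \
        --enable-private-ip-google-access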

If this isn't a dsub issue, I can take this to the mailing list. (And if there is a quick way for me to get a flavor for "dsub issue vs. not a dsub issue", just let me know so I can self-triage.)

Thanks!

@carbocation changed the title on Dec 17, 2020, truncating the quoted error message to the current form.
mbookman (Contributor) commented

Can you double-check the task, perhaps pull up the underlying API operation?

Unless I am misunderstanding the error, the fact that it is running ["docker" "pull" "bash"] indicates that it is trying to pull the bash image from Dockerhub.

Note that I just saw this same problem for the first time earlier this week, for a workflow run by Cromwell under Terra. In that case, Dockerhub was throttling the connections used to pull down the ubuntu:latest image.
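
For reference, Docker resolves an image name with no registry host against Docker Hub, which is why ["docker" "pull" "bash"] ends up at registry-1.docker.io. A quick illustration (the gcr.io path below is a placeholder, not a real image):

    # An unqualified name resolves to Docker Hub's library namespace:
    docker pull bash                          # -> docker.io/library/bash:latest
    # A fully qualified gcr.io name never touches Dockerhub:
    docker pull gcr.io/my-project/my-image    # hypothetical image path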

carbocation (Author) commented

So, a bit interesting. Here is the full output from gcloud alpha lifesciences operations describe 4679186047174961621. It strikes me as... odd that it seems to try (and fail) to connect to registry-1.docker.io when all of the images are from gcr.io, not dockerhub (at least the ones that are explicitly mentioned in this output below).

gcloud alpha lifesciences operations describe 4679186047174961621
done: true
error:
  code: 14
  message: 'Execution failed: generic::unavailable: wrapping host binaries: pulling
    image: retry budget exhausted (10 attempts): running ["docker" "pull" "bash"]:
    exit status 1 (standard error: "Error response from daemon: Get https://registry-1.docker.io/v2/:
    net/http: request canceled while waiting for connection (Client.Timeout exceeded
    while awaiting headers)\n")'
metadata:
  '@type': type.googleapis.com/google.cloud.lifesciences.v2beta.Metadata
  createTime: '2021-03-31T19:13:53.704845Z'
  endTime: '2021-03-31T19:30:38.561799866Z'
  events:
  - description: Worker released
    timestamp: '2021-03-31T19:30:38.561799866Z'
    workerReleased:
      instance: google-pipelines-worker-157f9eda0bbba01e17a66d3fd85b15fb
      zone: us-central1-f
  - description: 'Execution failed: generic::unavailable: wrapping host binaries:
      pulling image: retry budget exhausted (10 attempts): running ["docker" "pull"
      "bash"]: exit status 1 (standard error: "Error response from daemon: Get https://registry-1.docker.io/v2/:
      net/http: request canceled while waiting for connection (Client.Timeout exceeded
      while awaiting headers)\n")'
    failed:
      cause: 'Execution failed: generic::unavailable: wrapping host binaries: pulling
        image: retry budget exhausted (10 attempts): running ["docker" "pull" "bash"]:
        exit status 1 (standard error: "Error response from daemon: Get https://registry-1.docker.io/v2/:
        net/http: request canceled while waiting for connection (Client.Timeout exceeded
        while awaiting headers)\n")'
      code: UNAVAILABLE
    timestamp: '2021-03-31T19:30:37.949807010Z'
  - description: Worker "google-pipelines-worker-157f9eda0bbba01e17a66d3fd85b15fb"
      assigned in "us-central1-f" on a "custom-4-8192" machine
    timestamp: '2021-03-31T19:14:03.234988572Z'
    workerAssigned:
      instance: google-pipelines-worker-157f9eda0bbba01e17a66d3fd85b15fb
      machineType: custom-4-8192
      zone: us-central1-f
  labels:
    dsub-version: v0-4-4
    job-id: maf-filter--james--210331-151350-57
    job-name: maf-filter
    task-id: '6'
    user-id: james
  pipeline:
    actions:
    - commands:
      - -c
      - |
        set -o errexit
        set -o nounset
        set -o pipefail

        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_warning() {
          1>&2 echo "$(get_datestamp) WARNING: $@"
        }

        function log_error() {
          1>&2 echo "$(get_datestamp) ERROR: $@"
        }

        function gsutil_cp() {
          local src="${1}"
          local dst="${2}"
          local content_type="${3}"
          local user_project_name="${4}"

          local headers=""
          if [[ -n "${content_type}" ]]; then
            headers="-h Content-Type:${content_type}"
          fi

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local attempt
          for ((attempt = 0; attempt < 4; attempt++)); do
            log_info "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
            if gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}"; then
              return
            fi
            if (( attempt < 3 )); then
              log_warning "Sleeping 10s before the next attempt of failed gsutil command"
              log_warning "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
              sleep 10s
            fi
          done

          log_error "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
          exit 1
        }

        function log_cp() {
          local src="${1}"
          local dst="${2}"
          local tmp="${3}"
          local check_src="${4}"
          local user_project_name="${5}"

          if [[ "${check_src}" == "true" ]] && [[ ! -e "${src}" ]]; then
            return
          fi

          # Copy the log files to a local temporary location so that our "gsutil cp" is never
          # executed on a file that is changing.

          local tmp_path="${tmp}/$(basename ${src})"
          cp "${src}" "${tmp_path}"

          gsutil_cp "${tmp_path}" "${dst}" "text/plain" "${user_project_name}"
        }


        while [[ ! -d /google/logs/action/6 ]]; do
          mkdir -p /tmp/continuous_logging_action
          log_cp /google/logs/action/4/stdout "${STDOUT_PATH}" /tmp/continuous_logging_action "true" "${USER_PROJECT}" &
          STDOUT_PID=$!
          log_cp /google/logs/action/4/stderr "${STDERR_PATH}" /tmp/continuous_logging_action "true" "${USER_PROJECT}" &
          STDERR_PID=$!

          log_cp /google/logs/output "${LOGGING_PATH}" /tmp/continuous_logging_action "false" "${USER_PROJECT}" &
          LOG_PID=$!

          wait "${STDOUT_PID}"
          wait "${STDERR_PID}"
          wait "${LOG_PID}"

          sleep 60s
        done
      containerName: logging
      entrypoint: /bin/bash
      environment:
        LOGGING_PATH: gs://bucket/path/maf-filter/p-1-maf-1E-2/dsub-logs/maf-filter--james--210331-151350-57.6.log
        STDERR_PATH: gs://bucket/path/maf-filter/p-1-maf-1E-2/dsub-logs/maf-filter--james--210331-151350-57.6-stderr.log
        STDOUT_PATH: gs://bucket/path/maf-filter/p-1-maf-1E-2/dsub-logs/maf-filter--james--210331-151350-57.6-stdout.log
        USER_PROJECT: ''
      imageUri: gcr.io/google.com/cloudsdktool/cloud-sdk:294.0.0-slim
      runInBackground: true
    - commands:
      - -c
      - |+
        #!/bin/bash

        set -o errexit
        set -o nounset
        set -o pipefail

        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_warning() {
          1>&2 echo "$(get_datestamp) WARNING: $@"
        }

        function log_error() {
          1>&2 echo "$(get_datestamp) ERROR: $@"
        }

        mkdir -m 777 -p "/mnt/data/script" 
        mkdir -m 777 -p "/mnt/data/tmp" 
        mkdir -m 777 -p "/mnt/data/workingdir" 

        echo "${_SCRIPT_REPR}"     | python -c 'import ast
        import sys

        sys.stdout.write(ast.literal_eval(sys.stdin.read()))
        '     > "/mnt/data/script/command.sh"
        chmod a+x "/mnt/data/script/command.sh"

        for ((i=0; i < DIR_COUNT; i++)); do
          DIR_VAR="DIR_${i}"

          log_info "mkdir -m 777 -p "${!DIR_VAR}""
          mkdir -m 777 -p "${!DIR_VAR}"
        done

      containerName: prepare
      entrypoint: /bin/bash
      environment:
        DIR_0: /mnt/data/input/gs/bucket/path/prevalent_disease_af_or_nonnsr
        DIR_1: /mnt/data/output/gs/bucket/path/maf-filter/p-1-maf-1E-2
        DIR_2: /mnt/data/output/gs/bucket/path/maf-filter/p-1-maf-1E-2
        DIR_3: /mnt/data/output/gs/bucket/path/maf-filter/p-1-maf-1E-2
        DIR_COUNT: '4'
        _META_YAML_REPR: "\"create-time: 2021-03-31 15:13:50.570367-04:00\\ndsub-version:\
          \ v0-4-4\\nenvs:\\n  MAF: 1E-2\\n  PVAL: '1'\\njob-id: maf-filter--james--210331-151350-57\\\
          njob-name: maf-filter\\nlogging: gs://bucket/path/maf-filter/p-1-maf-1E-2/dsub-logs/\\\
          nscript-name: command.sh\\ntask-ids: 1-6\\ntasks:\\n- inputs:\\n    file:\
          \ gs://bucket/path/prevalent_disease_af_or_nonnsr/bolt.imputed.tsv.gz\\\
          n  logging-path: gs://bucket/path/maf-filter/p-1-maf-1E-2/dsub-logs/maf-filter--james--210331-151350-57.6.log\\\
          n  outputs:\\n    count_file: gs://bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.n_snps\\\
          n    filtered: gs://bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.tsv.gz\\\
          n    filtered_tbi: gs://bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.tsv.gz.tbi\\\
          n  task-id: '6'\\nuser-id: james\\n\""
        _SCRIPT_REPR: "\"# Must be run on an image that has bgzip and tabix installed\\\
          n\\n# Screen out erroneous SE and ChiSq values ($12>0 && $15>0)\\n# And\
          \ require that MAF and P be within your specified range\\ngunzip -c ${file}\
          \ | awk -v maf=${MAF} -v pval=${PVAL} -F $'\\\\t' 'NR == 1{print $0}; NR\
          \ > 1 && $12>0 && $15>0 && ($16 <= pval) && (($7 > maf) && (1-$7 > maf))\
          \ {print $0}' > output.file\\n\\n# For X, skip the header and append to\
          \ what we already produced\\n# Disabled for now since not produced by REGENIE\\\
          n# gunzip -c ${file_x} | awk -v maf=${MAF} -v pval=${PVAL} -F $'\\\\t' 'NR\
          \ > 1 && $12>0 && $15>0 && ($16 <= pval) && (($7 > maf) && (1-$7 > maf))\
          \ {print $0}' >> output.file\\n\\n# Count the SNPs\\necho $(( $(wc -l <\
          \ output.file) - 1)) > ${count_file}\\n\\n# BGZip\\nbgzip -c output.file\
          \ > ${filtered}\\n\\n# Tabix. This is BOLT-specific formatting\\ntabix \\\
          \\\\n    -s 2 \\\\\\n    -b 3 \\\\\\n    -e 3 \\\\\\n    -S 1 \\\\\\n  \
          \  ${filtered}\\n\""
      imageUri: gcr.io/google.com/cloudsdktool/cloud-sdk:294.0.0-slim
      mounts:
      - disk: datadisk
        path: /mnt/data
    - commands:
      - -c
      - |+
        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_warning() {
          1>&2 echo "$(get_datestamp) WARNING: $@"
        }

        function log_error() {
          1>&2 echo "$(get_datestamp) ERROR: $@"
        }

        function gsutil_rsync() {
          local src="${1}"
          local dst="${2}"
          local user_project_name="${3}"

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local attempt
          for ((attempt = 0; attempt < 4; attempt++)); do
            log_info "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
            if gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}"; then
              return
            fi
            if (( attempt < 3 )); then
              log_warning "Sleeping 10s before the next attempt of failed gsutil command"
              log_warning "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
              sleep 10s
            fi
          done

          log_error "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
          exit 1
        }

        function gsutil_cp() {
          local src="${1}"
          local dst="${2}"
          local content_type="${3}"
          local user_project_name="${4}"

          local headers=""
          if [[ -n "${content_type}" ]]; then
            headers="-h Content-Type:${content_type}"
          fi

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local attempt
          for ((attempt = 0; attempt < 4; attempt++)); do
            log_info "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
            if gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}"; then
              return
            fi
            if (( attempt < 3 )); then
              log_warning "Sleeping 10s before the next attempt of failed gsutil command"
              log_warning "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
              sleep 10s
            fi
          done

          log_error "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
          exit 1
        }

        function log_cp() {
          local src="${1}"
          local dst="${2}"
          local tmp="${3}"
          local check_src="${4}"
          local user_project_name="${5}"

          if [[ "${check_src}" == "true" ]] && [[ ! -e "${src}" ]]; then
            return
          fi

          # Copy the log files to a local temporary location so that our "gsutil cp" is never
          # executed on a file that is changing.

          local tmp_path="${tmp}/$(basename ${src})"
          cp "${src}" "${tmp_path}"

          gsutil_cp "${tmp_path}" "${dst}" "text/plain" "${user_project_name}"
        }


        set -o errexit
        set -o nounset
        set -o pipefail

        for ((i=0; i < INPUT_COUNT; i++)); do
          INPUT_VAR="INPUT_${i}"
          INPUT_RECURSIVE="INPUT_RECURSIVE_${i}"
          INPUT_SRC="INPUT_SRC_${i}"
          INPUT_DST="INPUT_DST_${i}"

          log_info "Localizing ${!INPUT_VAR}"
          if [[ "${!INPUT_RECURSIVE}" -eq "1" ]]; then
            gsutil_rsync "${!INPUT_SRC}" "${!INPUT_DST}" "${USER_PROJECT}"
          else
            gsutil_cp "${!INPUT_SRC}" "${!INPUT_DST}" "" "${USER_PROJECT}"
          fi
        done

      containerName: localization
      entrypoint: /bin/bash
      environment:
        INPUT_0: file
        INPUT_COUNT: '1'
        INPUT_DST_0: /mnt/data/input/gs/bucket/path/prevalent_disease_af_or_nonnsr/bolt.imputed.tsv.gz
        INPUT_RECURSIVE_0: '0'
        INPUT_SRC_0: gs://bucket/path/prevalent_disease_af_or_nonnsr/bolt.imputed.tsv.gz
        USER_PROJECT: ''
      imageUri: gcr.io/google.com/cloudsdktool/cloud-sdk:294.0.0-slim
      mounts:
      - disk: datadisk
        path: /mnt/data
    - commands:
      - bash
      - -c
      - |
        export TMPDIR="/mnt/data/tmp"
        cd /mnt/data/workingdir

        "/mnt/data/script/command.sh"
      containerName: user-command
      entrypoint: /usr/bin/env
      environment:
        MAF: '1E-2'
        PVAL: '1'
        count_file: /mnt/data/output/gs/bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.n_snps
        file: /mnt/data/input/gs/bucket/path/prevalent_disease_af_or_nonnsr/bolt.imputed.tsv.gz
        filtered: /mnt/data/output/gs/bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.tsv.gz
        filtered_tbi: /mnt/data/output/gs/bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.tsv.gz.tbi
      imageUri: gcr.io/ukbb-analyses/ubuntu:2020-12-22
      mounts:
      - disk: datadisk
        path: /mnt/data
    - commands:
      - -c
      - |+
        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_warning() {
          1>&2 echo "$(get_datestamp) WARNING: $@"
        }

        function log_error() {
          1>&2 echo "$(get_datestamp) ERROR: $@"
        }

        function gsutil_rsync() {
          local src="${1}"
          local dst="${2}"
          local user_project_name="${3}"

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local attempt
          for ((attempt = 0; attempt < 4; attempt++)); do
            log_info "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
            if gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}"; then
              return
            fi
            if (( attempt < 3 )); then
              log_warning "Sleeping 10s before the next attempt of failed gsutil command"
              log_warning "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
              sleep 10s
            fi
          done

          log_error "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
          exit 1
        }

        function gsutil_cp() {
          local src="${1}"
          local dst="${2}"
          local content_type="${3}"
          local user_project_name="${4}"

          local headers=""
          if [[ -n "${content_type}" ]]; then
            headers="-h Content-Type:${content_type}"
          fi

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local attempt
          for ((attempt = 0; attempt < 4; attempt++)); do
            log_info "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
            if gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}"; then
              return
            fi
            if (( attempt < 3 )); then
              log_warning "Sleeping 10s before the next attempt of failed gsutil command"
              log_warning "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
              sleep 10s
            fi
          done

          log_error "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
          exit 1
        }

        function log_cp() {
          local src="${1}"
          local dst="${2}"
          local tmp="${3}"
          local check_src="${4}"
          local user_project_name="${5}"

          if [[ "${check_src}" == "true" ]] && [[ ! -e "${src}" ]]; then
            return
          fi

          # Copy the log files to a local temporary location so that our "gsutil cp" is never
          # executed on a file that is changing.

          local tmp_path="${tmp}/$(basename ${src})"
          cp "${src}" "${tmp_path}"

          gsutil_cp "${tmp_path}" "${dst}" "text/plain" "${user_project_name}"
        }


        set -o errexit
        set -o nounset
        set -o pipefail

        for ((i=0; i < OUTPUT_COUNT; i++)); do
          OUTPUT_VAR="OUTPUT_${i}"
          OUTPUT_RECURSIVE="OUTPUT_RECURSIVE_${i}"
          OUTPUT_SRC="OUTPUT_SRC_${i}"
          OUTPUT_DST="OUTPUT_DST_${i}"

          log_info "Delocalizing ${!OUTPUT_VAR}"
          if [[ "${!OUTPUT_RECURSIVE}" -eq "1" ]]; then
            gsutil_rsync "${!OUTPUT_SRC}" "${!OUTPUT_DST}" "${USER_PROJECT}"
          else
            gsutil_cp "${!OUTPUT_SRC}" "${!OUTPUT_DST}" "" "${USER_PROJECT}"
          fi
        done

      containerName: delocalization
      entrypoint: /bin/bash
      environment:
        OUTPUT_0: filtered
        OUTPUT_1: filtered_tbi
        OUTPUT_2: count_file
        OUTPUT_COUNT: '3'
        OUTPUT_DST_0: gs://bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.tsv.gz
        OUTPUT_DST_1: gs://bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.tsv.gz.tbi
        OUTPUT_DST_2: gs://bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.n_snps
        OUTPUT_RECURSIVE_0: '0'
        OUTPUT_RECURSIVE_1: '0'
        OUTPUT_RECURSIVE_2: '0'
        OUTPUT_SRC_0: /mnt/data/output/gs/bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.tsv.gz
        OUTPUT_SRC_1: /mnt/data/output/gs/bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.tsv.gz.tbi
        OUTPUT_SRC_2: /mnt/data/output/gs/bucket/path/maf-filter/p-1-maf-1E-2/outputprevalent_disease_af_or_nonnsr.n_snps
        USER_PROJECT: ''
      imageUri: gcr.io/google.com/cloudsdktool/cloud-sdk:294.0.0-slim
      mounts:
      - disk: datadisk
        path: /mnt/data
    - alwaysRun: true
      commands:
      - -c
      - |+
        set -o errexit
        set -o nounset
        set -o pipefail

        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_warning() {
          1>&2 echo "$(get_datestamp) WARNING: $@"
        }

        function log_error() {
          1>&2 echo "$(get_datestamp) ERROR: $@"
        }

        function gsutil_cp() {
          local src="${1}"
          local dst="${2}"
          local content_type="${3}"
          local user_project_name="${4}"

          local headers=""
          if [[ -n "${content_type}" ]]; then
            headers="-h Content-Type:${content_type}"
          fi

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local attempt
          for ((attempt = 0; attempt < 4; attempt++)); do
            log_info "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
            if gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}"; then
              return
            fi
            if (( attempt < 3 )); then
              log_warning "Sleeping 10s before the next attempt of failed gsutil command"
              log_warning "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
              sleep 10s
            fi
          done

          log_error "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
          exit 1
        }

        function log_cp() {
          local src="${1}"
          local dst="${2}"
          local tmp="${3}"
          local check_src="${4}"
          local user_project_name="${5}"

          if [[ "${check_src}" == "true" ]] && [[ ! -e "${src}" ]]; then
            return
          fi

          # Copy the log files to a local temporary location so that our "gsutil cp" is never
          # executed on a file that is changing.

          local tmp_path="${tmp}/$(basename ${src})"
          cp "${src}" "${tmp_path}"

          gsutil_cp "${tmp_path}" "${dst}" "text/plain" "${user_project_name}"
        }

        mkdir -p /tmp/logging_action
        log_cp /google/logs/action/4/stdout "${STDOUT_PATH}" /tmp/logging_action "true" "${USER_PROJECT}" &
        STDOUT_PID=$!
        log_cp /google/logs/action/4/stderr "${STDERR_PATH}" /tmp/logging_action "true" "${USER_PROJECT}" &
        STDERR_PID=$!

        log_cp /google/logs/output "${LOGGING_PATH}" /tmp/logging_action "false" "${USER_PROJECT}" &
        LOG_PID=$!

        wait "${STDOUT_PID}"
        wait "${STDERR_PID}"
        wait "${LOG_PID}"

      containerName: final_logging
      entrypoint: /bin/bash
      environment:
        LOGGING_PATH: gs://bucket/path/maf-filter/p-1-maf-1E-2/dsub-logs/maf-filter--james--210331-151350-57.6.log
        STDERR_PATH: gs://bucket/path/maf-filter/p-1-maf-1E-2/dsub-logs/maf-filter--james--210331-151350-57.6-stderr.log
        STDOUT_PATH: gs://bucket/path/maf-filter/p-1-maf-1E-2/dsub-logs/maf-filter--james--210331-151350-57.6-stdout.log
        USER_PROJECT: ''
      imageUri: gcr.io/google.com/cloudsdktool/cloud-sdk:294.0.0-slim
    resources:
      regions:
      - us-central1
      - us-east1
      - us-west1
      virtualMachine:
        bootDiskSizeGb: 10
        bootImage: projects/cloud-lifesciences/global/images/w20210322-0300-rc0060586ad3-0000-2431-9437-089e0828bff8stable
        disks:
        - name: datadisk
          sizeGb: 50
          type: pd-ssd
        labels:
          dsub-version: v0-4-4
          goog-pipelines-worker: 'true'
          job-id: maf-filter--james--210331-151350-57
          job-name: maf-filter
          task-id: '6'
          user-id: james
        machineType: custom-4-8192
        network:
          usePrivateAddress: true
        nvidiaDriverVersion: 450.51.06
        serviceAccount:
          email: default
          scopes:
          - https://www.googleapis.com/auth/bigquery
          - https://www.googleapis.com/auth/compute
          - https://www.googleapis.com/auth/devstorage.full_control
          - https://www.googleapis.com/auth/genomics
          - https://www.googleapis.com/auth/logging.write
          - https://www.googleapis.com/auth/monitoring.write
          - https://www.googleapis.com/auth/cloud-platform
        volumes:
        - persistentDisk:
            sizeGb: 50
            type: pd-ssd
          volume: datadisk
    timeout: 604800s
  startTime: '2021-03-31T19:14:03.234988572Z'
name: projects/783282864357/locations/us-central1/operations/4679186047174961621
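
As a quick way to self-triage, one can confirm from the same operation that every action's imageUri is on gcr.io; a sketch using the operation ID above:

    gcloud alpha lifesciences operations describe 4679186047174961621 \
        --format=yaml | grep imageUri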

mbookman (Contributor) commented Apr 1, 2021

Hi James,

I agree. The error message does not make sense in the context of that operation.
All actions use a gcr.io imageUri, either:

  • imageUri: gcr.io/google.com/cloudsdktool/cloud-sdk:294.0.0-slim
  • imageUri: gcr.io/ukbb-analyses/ubuntu:2020-12-22

And yet it would seem that the Pipelines API is trying to pull the bash Docker image from Dockerhub.
I have filed a bug with the HCLS team.

carbocation (Author) commented

Thanks!

mbookman (Contributor) commented Apr 1, 2021

I heard back from the Pipelines team that they are aware of the problem and are looking to remove the dependency.
No specific timelines.

Do you have a sense of how frequently you hit this?

carbocation (Author) commented

Hard to say, though it seems to be becoming more frequent. For example, I'm hitting it now (well, not waiting for all the retries, but noting that machines seem to be stuck in the state "VM starting (awaiting worker checkin)" while trying to run a 'hello world' program on a 4-core machine). "VM starting (awaiting worker checkin)" seems to be the state that precedes the retry failure.
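
For reference, the 'hello world' attempt looks roughly like this (a sketch; the project, bucket, and machine type are placeholders, and the flags are dsub's documented ones):

    dsub \
      --provider google-cls-v2 \
      --project my-project \
      --regions us-central1 \
      --logging gs://my-bucket/logs/ \
      --machine-type n1-standard-4 \
      --use-private-address \
      --command 'echo "hello world"' \
      --wait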

carbocation (Author) commented

For what it's worth, I am not seeing a difference between the google-v2 and google-cls-v2 providers. I also don't see any errors on the GCP status page to indicate some global problem. I've tried E2 and N1 machine types with similar results. Currently I don't seem to be able to launch jobs unless I remove the --use-private-address flag, though a couple of days ago I eventually could. These aren't systematic observations, just anecdotes, but I'm sending them along in case they make something 'click' for someone else.

pbilling added a commit to StanfordBioinformatics/trellis-mvp-functions that referenced this issue Apr 8, 2021
pbilling commented Apr 8, 2021

I am also consistently experiencing this issue running jobs with internal IPs, pulling from gcr.io. I just launched a batch of 20 jobs and all failed with this same error. Thanks to @carbocation for raising the issue.

wnojopra (Contributor) commented Apr 9, 2021

The latest message from the Pipelines team is that they've heard about this very same issue from a few people, and are working to have it fixed as soon as possible.

wnojopra (Contributor) commented Apr 9, 2021

From the Pipelines team: a fix is in the process of rolling out. us-central1 and europe-west2 are fixed, while rollouts to the remaining locations are in progress.

cblacknygc commented Apr 12, 2021

We were hitting this issue through Cromwell in us-central1 and can confirm we haven't seen it since Google told us a fix was in place on April 9th.

mbookman (Contributor) commented Jun 8, 2021

Going to close this as the Google team has fixed the issue.

mbookman closed this as completed Jun 8, 2021