Description
Today we debugged a never-terminating workflow step and ended up here. It turns out that if you configure a container workflow, a single container step will hang forever if it does not print anything for a while. We were able to reproduce this easily with the following workflow:
```yaml
name: Debug issue
on: [workflow_dispatch]
jobs:
  debug:
    container:
      image: **some-image-path**
    runs-on: [**some-runner**]
    steps:
      - name: step1
        shell: bash
        run: |
          echo "Hello"
      - name: Wait for 5m
        shell: bash
        run: |
          sleep 5m
      - name: step3
        shell: bash
        run: |
          echo "Ciao"
```
If we instead change the second step to print continuously, we do not run into the problem:
```yaml
- name: Wait for 5m
  shell: bash
  run: |
    for ((i=0; i<300; i++)); do
      echo "."
      sleep 1
    done
    echo "done"
```
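As a generic workaround while we wait for a fix, we currently wrap long-running commands in a small heartbeat helper so the log stream never goes silent. This is only a sketch under the assumption that the hang is tied to prolonged stdout silence, as the reproduction above suggests; the function name `with_heartbeat` is our own invention:

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command in the background and print a
# periodic heartbeat to stdout until it exits, then propagate its
# exit code via `wait`.
with_heartbeat() {
  local interval="$1"; shift
  "$@" &                                  # run the real command in the background
  local cmd_pid=$!
  while kill -0 "$cmd_pid" 2>/dev/null; do
    echo "[heartbeat] still running: $*"  # keep stdout alive
    sleep "$interval"
  done
  wait "$cmd_pid"                         # return the command's exit code
}
```

A step would then use, for example, `with_heartbeat 30 sleep 5m` instead of a bare `sleep 5m`.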
Re-running in debug mode gives us no errors or other valuable information. We suspect the issue lies in `execPodStep`. The version of `@kubernetes/client-node` is quite old (0.18.1 vs. 0.22.0) - but to be fair, the release notes and the changes do not look like a related bug was fixed.
Have you seen similar behavior before? Do you have any recommendation for us other than simply being more verbose and writing progress to stdout?
Thanks!