Skip to content

Pipes appear to not pass on EOF on macOS runners #884

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
kousu opened this issue Dec 25, 2020 · 13 comments
Open

Pipes appear to not pass on EOF on macOS runners #884

kousu opened this issue Dec 25, 2020 · 13 comments
Labels
bug Something isn't working Runner Bug Bug fix scope to the runner

Comments

@kousu
Copy link

kousu commented Dec 25, 2020

Describe the bug

(first reported at actions/runner-images#2352)

Using yes | some_script anywhere in a workflow on macOS hangs forever. some_script will terminate but yes will keep going, blocking the rest of the workflow.

To Reproduce

I've made a minimal test case here: https://github.com/kousu/hanging-actions/. It basically just tests:

yes | head -n 7

plus collecting some context for debugging.

To reproduce, just fork my repo and watch its Actions tab. Linux (and Travis!) will pass easily, but Actions on macOS hangs until cancelled.

Expected behavior
All platforms should succeed in approximately the same time and way.

Runner Version and Platform

From my logs:

Request a runner to run this job

  • Current runner version: '2.275.1'

Operating System

  • Mac OS X
  • 10.15.7
  • 19H15

Virtual Environment

Thanks to @maxim-lobanov we know that it only happens under actions/runner, and not under https://github.com/microsoft/azure-pipelines-agent (even though azure pipelines runs the same macOS images) or when connected over VNC or ssh.

What's not working?

The symptom is that the Linux builds finish immediately while the macOS build hangs in yellow forever:

2020-12-25-043804_590x265_scrot

In fact even Travis on nearly (but not exactly the same) version of macOS works fine.

For example, consider this Actions Workflow and this this Travis script for comparison, everything finishes in about a minute except for Actions-macOS.

This should only take a moment to finish, and on Travis it does:

2020-12-25-044911_991x287_scrot

but on Actions it's at 3 minutes and counting. I've had jobs hung much longer too -- up to their 6 hour limit -- before I noticed what was going on:

2020-12-25-043814_969x287_scrot

The only way to stop the job is to cancel it. It never undeadlocks.

Job Log Output

Runner and Worker's Diagnostic Logs

I don't have access to these! If I install a runner locally and reproduce there I'll update this.

@kousu
Copy link
Author

kousu commented Dec 25, 2020

Debugging

I've figured out enough to guess that yes is refusing to quit even though head has closed its stdin. But why?

diffing the logs

Travis and Actions use slightly different log formats -- and they set $PS4 differently and have different ideas about when to use \rs -- but I dealt with some of that in my test case and can clean up the rest with a swipe of awk:

# tr kills DOS line endings
# awk filters for only lines after a given keyword
# cut kills off Github's timestamp prefixes
diff -u \
  <(cat travis-macos.txt | 
         tr -d '\r' |
         awk 'BEGIN { P=0 } /\+ echo/ { P = 1 } P==1 { print }' |
         tee travis-macos.cleaned.txt
    ) \
  <(cat actions-macos.txt |
        tr -d '\r' |
        cut -f 2- -d ' ' |
        awk 'BEGIN { P=0 } /\+ echo/ { P = 1 } P==1 { print }' |
        tee actions-macos.cleaned.txt
    )

(
Cleaned up intermediate logs, for orienting yourself. There's a lot of output!:

)

full diff

(- is Travis, + is Actions here)

--- /dev/fd/63	2020-12-25 04:53:46.691750457 -0500
+++ /dev/fd/62	2020-12-25 04:53:46.691750457 -0500
@@ -5,15 +5,15 @@
 + echo --------
 --------
 + uname -a
-Darwin Traviss-Mac.local 19.6.0 Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64 x86_64
+Darwin Mac-1608887842659.local 19.6.0 Darwin Kernel Version 19.6.0: Thu Oct 29 22:56:45 PDT 2020; root:xnu-6153.141.2.2~1/RELEASE_X86_64 x86_64
 + whoami
-travis
+runner
 + set
-ANSI_CLEAR='\033[0K'
-ANSI_GREEN='\033[32;1m'
-ANSI_RED='\033[31;1m'
-ANSI_RESET='\033[0m'
-ANSI_YELLOW='\033[33;1m'
+AGENT_TOOLSDIRECTORY=/Users/runner/hostedtoolcache
+ANDROID_HOME=/Users/runner/Library/Android/sdk
+ANDROID_NDK_18R_PATH=/Users/runner/Library/Android/sdk/ndk/18.1.5063045
+ANDROID_NDK_HOME=/Users/runner/Library/Android/sdk/ndk-bundle
+ANDROID_SDK_ROOT=/Users/runner/Library/Android/sdk
 BASH=/bin/bash
 BASH_ARGC=()
 BASH_ARGV=()
@@ -21,106 +21,100 @@
 BASH_SOURCE=([0]="./test.sh")
 BASH_VERSINFO=([0]="3" [1]="2" [2]="57" [3]="1" [4]="release" [5]="x86_64-apple-darwin19")
 BASH_VERSION='3.2.57(1)-release'
+BOOTSTRAP_HASKELL_NONINTERACTIVE=1
+CHROMEWEBDRIVER=/usr/local/Caskroom/chromedriver/87.0.4280.20
 CI=true
-CONTINUOUS_INTEGRATION=true
-DEBIAN_FRONTEND=noninteractive
+CONDA=/usr/local/miniconda
 DIRSTACK=()
-DISPLAY=/private/tmp/com.apple.launchd.22HnXAfEhM/org.macosforge.xquartz:0
+DOTNET_MULTILEVEL_LOOKUP=0
+DOTNET_ROOT=/Users/runner/.dotnet
+EDGEWEBDRIVER=/usr/local/share/edge_driver
 EUID=501
-GEM_HOME=/Users/travis/.rvm/gems/ruby-2.6.6
-GEM_PATH=/Users/travis/.rvm/gems/ruby-2.6.6:/Users/travis/.rvm/gems/ruby-2.6.6@global
-GIT_ASKPASS=echo
+GECKOWEBDRIVER=/usr/local/opt/geckodriver/bin
+GITHUB_ACTION=run
+GITHUB_ACTIONS=true
+GITHUB_ACTION_REF=
+GITHUB_ACTION_REPOSITORY=
+GITHUB_ACTOR=kousu
+GITHUB_API_URL=https://api.github.com
+GITHUB_BASE_REF=
+GITHUB_ENV=/Users/runner/work/_temp/_runner_file_commands/set_env_6ae3b7b2-45b8-4c53-8a92-21771c0328f6
+GITHUB_EVENT_NAME=push
+GITHUB_EVENT_PATH=/Users/runner/work/_temp/_github_workflow/event.json
+GITHUB_GRAPHQL_URL=https://api.github.com/graphql
+GITHUB_HEAD_REF=
+GITHUB_JOB=test
+GITHUB_PATH=/Users/runner/work/_temp/_runner_file_commands/add_path_6ae3b7b2-45b8-4c53-8a92-21771c0328f6
+GITHUB_REF=refs/heads/trunk
+GITHUB_REPOSITORY=kousu/hanging-actions
+GITHUB_REPOSITORY_OWNER=kousu
+GITHUB_RETENTION_DAYS=90
+GITHUB_RUN_ID=443835833
+GITHUB_RUN_NUMBER=18
+GITHUB_SERVER_URL=https://github.com
+GITHUB_SHA=9357e35e8d52329e61ba761c2f62ea6359217d21
+GITHUB_WORKFLOW=CI
+GITHUB_WORKSPACE=/Users/runner/work/hanging-actions/hanging-actions
 GROUPS=()
-HAS_JOSH_K_SEAL_OF_APPROVAL=true
-HOME=/Users/travis
-HOMEBREW_NO_INSTALL_CLEANUP=1
-HOSTNAME=Traviss-Mac.local
+HOME=/Users/runner
+HOMEBREW_CASK_OPTS=--no-quarantine
+HOMEBREW_NO_AUTO_UPDATE=1
+HOSTNAME=Mac-1608887842659.local
 HOSTTYPE=x86_64
 IFS=$' \t\n'
-IRBRC=/Users/travis/.rvm/rubies/ruby-2.6.6/.irbrc
+ImageOS=macos1015
+ImageVersion=20201212.1
+JAVA_HOME=/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home
+JAVA_HOME_11_X64=/Library/Java/JavaVirtualMachines/adoptopenjdk-11.jdk/Contents/Home
+JAVA_HOME_12_X64=/Library/Java/JavaVirtualMachines/adoptopenjdk-12.jdk/Contents/Home
+JAVA_HOME_13_X64=/Library/Java/JavaVirtualMachines/adoptopenjdk-13.jdk/Contents/Home
+JAVA_HOME_14_X64=/Library/Java/JavaVirtualMachines/adoptopenjdk-14.jdk/Contents/Home
+JAVA_HOME_7_X64=/Library/Java/JavaVirtualMachines/zulu-7.jdk/Contents/Home
+JAVA_HOME_8_X64=/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home
 LANG=en_US.UTF-8
 LC_ALL=en_US.UTF-8
-LOGNAME=travis
+LC_CTYPE=en_US.UTF-8
+LOGNAME=runner
 MACHTYPE=x86_64-apple-darwin19
-MY_RUBY_HOME=/Users/travis/.rvm/rubies/ruby-2.6.6
-NVM_BIN=/Users/travis/.nvm/versions/node/v15.1.0/bin
+NUNIT3_PATH=/Library/Developer/nunit/3.6.0
+NUNIT_BASE_PATH=/Library/Developer/nunit
 NVM_CD_FLAGS=
-NVM_DIR=/Users/travis/.nvm
-NVM_INC=/Users/travis/.nvm/versions/node/v15.1.0/include/node
+NVM_DIR=/Users/runner/.nvm
 OPTERR=1
 OPTIND=1
 OSTYPE=darwin19
-PAGER=cat
-PATH=/Users/travis/.rvm/gems/ruby-2.6.6/bin:/Users/travis/.rvm/gems/ruby-2.6.6@global/bin:/Users/travis/.rvm/rubies/ruby-2.6.6/bin:/Users/travis/.rvm/bin:/Users/travis/bin:/Users/travis/.local/bin:/Users/travis/.nvm/versions/node/v15.1.0/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Library/Apple/usr/bin
+PATH=/usr/local/opt/pipx_bin:/Users/runner/.cargo/bin:/usr/local/lib/ruby/gems/2.7.0/bin:/usr/local/opt/ruby/bin:/usr/local/opt/curl/bin:/usr/local/bin:/usr/local/sbin:/Users/runner/bin:/Users/runner/.yarn/bin:/usr/local/go/bin:/Users/runner/Library/Android/sdk/tools:/Users/runner/Library/Android/sdk/platform-tools:/Users/runner/Library/Android/sdk/ndk-bundle:/Library/Frameworks/Mono.framework/Versions/Current/Commands:/usr/bin:/bin:/usr/sbin:/sbin:/Users/runner/.dotnet/tools:/Users/runner/.ghcup/bin:/Users/runner/hostedtoolcache/stack/2.5.1/x64
+PERFLOG_LOCATION_SETTING=RUNNER_PERFLOG
 PIPESTATUS=([0]="0")
-PPID=1186
+PIPX_BIN_DIR=/usr/local/opt/pipx_bin
+PIPX_HOME=/usr/local/opt/pipx
+POWERSHELL_DISTRIBUTION_CHANNEL=GitHub-Actions-macos1015
+PPID=1125
 PS4='+ '
-PWD=/Users/travis/build/kousu/hanging-actions
-RUBY_VERSION=ruby-2.6.6
+***
+RCT_NO_LAUNCH_PACKAGER=1
+RUNNER_OS=macOS
+RUNNER_PERFLOG=/usr/local/opt/runner/perflog
+RUNNER_TEMP=/Users/runner/work/_temp
+RUNNER_TOOL_CACHE=/Users/runner/hostedtoolcache
+RUNNER_TRACKING_ID=github_577b5a7a-3c20-4a28-87fa-761ca1468d79
+RUNNER_WORKSPACE=/Users/runner/work/hanging-actions
 SHELL=/bin/bash
 SHELLOPTS=braceexpand:hashall:interactive-comments:xtrace
-SHLVL=2
-SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.OrS36HPkaE/Listeners
-TERM=xterm
-TMPDIR=/var/folders/z3/_825pg0s3jvf0hb_q8kzmg5h0000gn/T/
-TRAVIS=true
-TRAVIS_ALLOW_FAILURE=
-TRAVIS_APP_HOST=build.travis-ci.com
-TRAVIS_APT_PROXY=http://build-cache.travisci.net
-TRAVIS_ARCH=amd64
-TRAVIS_BRANCH=trunk
-TRAVIS_BUILD_DIR=/Users/travis/build/kousu/hanging-actions
-TRAVIS_BUILD_ID=210564868
-TRAVIS_BUILD_NUMBER=16
-TRAVIS_BUILD_STAGE_NAME=
-TRAVIS_BUILD_WEB_URL=https://travis-ci.com/kousu/hanging-actions/builds/210564868
-TRAVIS_CMD=./test.sh
-TRAVIS_COMMIT=9357e35e8d52329e61ba761c2f62ea6359217d21
-TRAVIS_COMMIT_MESSAGE='Demonstrate hanging on Github'\''s macOS VMs'
-TRAVIS_COMMIT_RANGE=5846aac52ee9...9357e35e8d52
-TRAVIS_CPU_ARCH=amd64
-TRAVIS_DIST=notset
-TRAVIS_ENABLE_INFRA_DETECTION=true
-TRAVIS_EVENT_TYPE=push
-TRAVIS_HOME=/Users/travis
-TRAVIS_INFRA=macstadium
-TRAVIS_INIT=notset
-TRAVIS_INTERNAL_RUBY_REGEX='^ruby-(2\.[0-4]\.[0-9]|1\.9\.3)'
-TRAVIS_JOB_ID=464777667
-TRAVIS_JOB_NAME='OSX 10.15 (Catalina)'
-TRAVIS_JOB_NUMBER=16.2
-TRAVIS_JOB_WEB_URL=https://travis-ci.com/kousu/hanging-actions/jobs/464777667
-TRAVIS_LANGUAGE=ruby
-TRAVIS_OSX_IMAGE=xcode12.2
-TRAVIS_OS_NAME=osx
-TRAVIS_PULL_REQUEST=false
-TRAVIS_PULL_REQUEST_BRANCH=
-TRAVIS_PULL_REQUEST_SHA=
-TRAVIS_PULL_REQUEST_SLUG=
-TRAVIS_REPO_SLUG=kousu/hanging-actions
-TRAVIS_ROOT=/
-TRAVIS_RUBY_VERSION=default
-TRAVIS_SECURE_ENV_VARS=false
-TRAVIS_SUDO=true
-TRAVIS_TAG=
-TRAVIS_TEST_RESULT=
-TRAVIS_TIMER_ID=072d0238
-TRAVIS_TIMER_START_TIME=1608888989287773000
-TRAVIS_TMPDIR=/var/folders/z3/_825pg0s3jvf0hb_q8kzmg5h0000gn/T/tmp.qhZht3j7
+SHLVL=3
+SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.r8ce0Aj0d6/Listeners
+TERM=dumb
+TMPDIR=/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/
 UID=501
-USER=travis
+USER=runner
+VCPKG_INSTALLATION_ROOT=/usr/local/share/vcpkg
+XCODE_10_DEVELOPER_DIR=/Applications/Xcode_10.3.app/Contents/Developer
+XCODE_11_DEVELOPER_DIR=/Applications/Xcode_11.7.app/Contents/Developer
+XCODE_12_DEVELOPER_DIR=/Applications/Xcode_12.3.app/Contents/Developer
 XPC_FLAGS=0x0
 XPC_SERVICE_NAME=0
 _=whoami
-__CF_USER_TEXT_ENCODING=0x1F5:0x0:0x0
-rvm_bin_path=/Users/travis/.rvm/bin
-rvm_path=/Users/travis/.rvm
-rvm_prefix=/Users/travis
-rvm_version='1.29.9 (latest)'
-shell_session_update () 
-{ 
-    :
-}
+__CF_USER_TEXT_ENCODING=0x1F5:0:0
 + echo
 
 + echo Test:
@@ -138,9 +132,15 @@
 y
 y
 y
-+ echo 'Great, I see you'\''re finished.'
-Great, I see you're finished.
-travis_time:end:072d0238:start=1608888989287773000,finish=1608888989436442000,duration=148669000,event=script�[0K�[32;1mThe command "./test.sh" exited with 0.�[0m
-
-
-Done. Your build exited with 0.
+##[error]The operation was canceled.
+Post job cleanup.
+[command]/usr/local/bin/git version
+git version 2.29.2
+[command]/usr/local/bin/git config --local --name-only --get-regexp core\.sshCommand
+[command]/usr/local/bin/git submodule foreach --recursive git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :
+[command]/usr/local/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
+http.https://github.com/.extraheader
+[command]/usr/local/bin/git config --local --unset-all http.https://github.com/.extraheader
+[command]/usr/local/bin/git submodule foreach --recursive git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :
+Cleaning up orphan processes
+Terminate orphan process: pid (1129) (yes)

Points in this diff that stand out to me:

  1. The VMs are separated by 2 months of updates.
    -Darwin Traviss-Mac.local 19.6.0 Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64 x86_64
    +Darwin Mac-1608887842659.local 19.6.0 Darwin Kernel Version 19.6.0: Thu Oct 29 22:56:45 PDT 2020; root:xnu-6153.141.2.2~1/RELEASE_X86_64 x86_64
    
  2. Indeed, yes is refusing to die like I thought:
    +Terminate orphan process: pid (1129) (yes)
    
  3. Github has TERM=dumb set? As in, not TERM=xterm like Travis. TERM often messes up interactive scripts, but I don't see why that would affect yes, it's too simple.
  4. Github has LC_CTYPE=en_US.UTF-8 set? But again, why would that matter?

@kousu
Copy link
Author

kousu commented Dec 25, 2020

In https://github.com/kousu/hanging-actions/tree/14572ba1b8b6554599a4b64f64a80f002a68922b/ I'm testing if the same bug shows up with seq 99999999999999999 | instead of yes |:

https://github.com/kousu/hanging-actions/blob/14572ba1b8b6554599a4b64f64a80f002a68922b/test.sh#L22-L27

As I expected, this hung too: https://github.com/kousu/hanging-actions/runs/1609762569 -- but seq is not infinite so I expect this to eventually terminate.


I tried writing my own infinite sequence and got two different results:

And in all cases, Travis is still fine!


Does the runner do something with dup()ing the fds? Does it proxy output between subcommands somehow?

@kousu
Copy link
Author

kousu commented Dec 31, 2020

We're wondering if maybe it has to do with some stray trap SIG_IGN PIPE? There's some discussion about the symptoms we saw at https://stackoverflow.com/questions/20573282/hudson-yes-standard-output-broken-pipe/20624315#20624315

@TingluoHuang
Copy link
Member

Does this repro on any self hosted runner?

@maxim-lobanov
Copy link

maxim-lobanov commented Jun 21, 2021

@TingluoHuang , yes, I can reproduce this issue on macOS self-hosted runner.

EDIT:
If I connect to machine via SSH or login into machine, command yes | head -n5 is finished after printing 5 lines
If I install runner on machine and run build, this command hangs and product output infinitely. The same behavior like on Hosted runner

@kousu
Copy link
Author

kousu commented Jun 21, 2021

Thanks for investigating this @maxim-lobanov :) :)

@Temtaime
Copy link
Contributor

Temtaime commented Nov 17, 2021

Any fixes ?
@TingluoHuang it is not unique to self hosted runner. Trying it on macos-latest github runner hangs too

@rouson
Copy link

rouson commented Jan 20, 2022

I just encountered this issue. It would be great if this issue could be resolved to save debugging time for those who encounter this issue.

@tavianator
Copy link

I am running into this here: https://github.com/tavianator/homebrew-tap/runs/4874104560?check_suite_focus=true. I think I found the issue: macOS uses node to spawn the shell to work around an SIP issue:

#if OS_OSX
if (Environment.ContainsKey("DYLD_INSERT_LIBRARIES")) // We don't check `isContainerStepHost` because we don't support container on macOS
{
// launch `node macOSRunInvoker.js shell args` instead of `shell args` to avoid macOS SIP remove `DYLD_INSERT_LIBRARIES` when launch process
string node12 = Path.Combine(HostContext.GetDirectory(WellKnownDirectory.Externals), "node12", "bin", $"node{IOUtil.ExeExtension}");
string macOSRunInvoker = Path.Combine(HostContext.GetDirectory(WellKnownDirectory.Bin), "macos-run-invoker.js");
arguments = $"\"{macOSRunInvoker.Replace("\"", "\\\"")}\" \"{fileName.Replace("\"", "\\\"")}\" {arguments}";
fileName = node12;
}
#endif

macos-run-invoker.js calls the shell with spawn():

const { spawn } = require('child_process');
// argv[0] = node
// argv[1] = macos-run-invoker.js
var shell = process.argv[2];
var args = process.argv.slice(3);
console.log(`::debug::macos-run-invoker: ${shell}`);
console.log(`::debug::macos-run-invoker: ${JSON.stringify(args)}`);
var launch = spawn(shell, args, { stdio: 'inherit' });
launch.on('exit', function (code) {
if (code !== 0) {
process.exit(code);
}
});

Old versions of node have a bug in spawn() that runs the child with SIGPIPE ignored: nodejs/node#13662. Apparently macOS yes doesn't check for EPIPE, it just expects to die from SIGPIPE, so it hangs forever writing y to a broken pipe.

I suspect using Node 16 instead of Node 12 would fix it, but I can't confirm easily because I don't have hands-on access to macOS.

@tavianator
Copy link

I suspect using Node 16 instead of Node 12 would fix it, but I can't confirm easily because I don't have hands-on access to macOS.

Looks like #1621 does this. Can anyone check whether it fixes this issue?

tavianator added a commit to tavianator/homebrew-tap that referenced this issue Jan 21, 2022
GH actions macOS runners have a bug that makes scripts run with SIGPIPE
ignored, ultimately causing `yes` to loop forever instead of exiting
when a pipe breaks.  Work around it by explicitly resetting the handler
using `env` from coreutils.
@tavianator
Copy link

genv --default-signal=PIPE yes | head -n7 is the only workaround I could figure out. Needs GNU coreutils installed from e.g. Homebrew.

@skeet70
Copy link

skeet70 commented May 27, 2022

We've run into this same issue testing the new alpha M1 OSX self-hosted runner. Our runner locally has node 18 installed, and I think part of the work to release the new M1 runner was to convert github first party actions to node 16.

@philip-peterson
Copy link

philip-peterson commented May 26, 2024

Unclear that this has to do with Node. It seems to be that ProcessInvoker neither catches the final output line from @kousu 's repro case (Great, I see you're finished.), nor does it detect that the process has finished (ProcessExitedHandler never fires). This is, AFAICT, when the runner is invoking the shell script directly (i.e., not through node).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Runner Bug Bug fix scope to the runner
Projects
None yet
Development

No branches or pull requests

9 participants