nixos-rebuild with --target-host hangs in github action environment #262686

Closed
duament opened this issue Oct 22, 2023 · 19 comments · Fixed by #263360

Comments

@duament
Contributor

duament commented Oct 22, 2023

Describe the bug

I've set up a GitHub Action to deploy NixOS to remote hosts with the command:

nix run nixpkgs#nixos-rebuild -- --flake .#"$host" --target-host deploy@"$host".rvf6.com --use-remote-sudo switch

But after #258571, it stopped working.

The failed job https://github.com/duament/flakes/actions/runs/6598694622/job/17926924711 shows that it hangs for about 2 hours and then exits abnormally. On the target host, I found that nixos-rebuild-switch-to-configuration.service exited successfully, but the systemd-run process hangs.

I can also reproduce it locally with systemd-run --user --same-dir nix run nixpkgs#nixos-rebuild -- --flake .#az --target-host deploy@az.rvf6.com --use-remote-sudo switch.

It might be related to the tty.

Steps To Reproduce

Steps to reproduce the behavior:

  1. Run nixos-rebuild --target-host either in a GitHub Actions environment or locally with systemd-run

Expected behavior

Exit successfully

Screenshots

❯ systemctl status nixos-rebuild-switch-to-configuration.service session-16.scope
○ nixos-rebuild-switch-to-configuration.service - /nix/store/hy54mnahnssgf95wfk890gr6kgamnizq-nixos-system-az-23.11.20231021.fcb40e7/bin/switch-to-configuration switch
     Loaded: loaded (/run/systemd/transient/nixos-rebuild-switch-to-configuration.service; transient)
  Transient: yes
     Active: inactive (dead) since Sun 2023-10-22 14:45:28 HKT; 26min ago
   Duration: 4.190s
    Process: 5365 ExecStart=/nix/store/hy54mnahnssgf95wfk890gr6kgamnizq-nixos-system-az-23.11.20231021.fcb40e7/bin/switch-to-configuration switch (code=exited, status=0/SUCCESS)
   Main PID: 5365 (code=exited, status=0/SUCCESS)
         IP: 0B in, 0B out
         IO: 0B read, 0B written
        CPU: 1.147s

Oct 22 14:45:25 az nixos[5365]: switching to system configuration /nix/store/hy54mnahnssgf95wfk890gr6kgamnizq-nixos-system-az-23.11.20231021.fcb40e7
Oct 22 14:45:26 az su[5616]: Successful su for deploy by root
Oct 22 14:45:26 az su[5616]: pam_unix(su:session): session opened for user deploy(uid=993) by (uid=0)
Oct 22 14:45:26 az su[5616]: pam_unix(su:session): session closed for user deploy
Oct 22 14:45:26 az su[5623]: Successful su for rvfg by root
Oct 22 14:45:26 az su[5623]: pam_unix(su:session): session opened for user rvfg(uid=1000) by (uid=0)
Oct 22 14:45:27 az su[5623]: pam_unix(su:session): session closed for user rvfg
Oct 22 14:45:28 az nixos[5365]: finished switching to system configuration /nix/store/hy54mnahnssgf95wfk890gr6kgamnizq-nixos-system-az-23.11.20231021.fcb40e7
Oct 22 14:45:28 az systemd[1]: nixos-rebuild-switch-to-configuration.service: Deactivated successfully.
Oct 22 14:45:28 az systemd[1]: nixos-rebuild-switch-to-configuration.service: Consumed 1.147s CPU time, no IO, no IP traffic.

● session-16.scope - Session 16 of User deploy
     Loaded: loaded (/run/systemd/transient/session-16.scope; transient)
  Transient: yes
     Active: active (running) since Sun 2023-10-22 14:45:22 HKT; 26min ago
         IP: 0B in, 0B out
         IO: 228.0K read, 3.8M written
      Tasks: 5
     Memory: 704.0K
        CPU: 378ms
     CGroup: /user.slice/user-993.slice/session-16.scope
             ├─5305 "sshd: deploy [priv]"
             ├─5319 "sshd: deploy@notty"
             ├─5355 fish -c "sudo --preserve-env=NIXOS_INSTALL_BOOTLOADER -- systemd-run -E LOCALE_ARCHIVE --collect --no-ask-password --pty --quiet --same-dir --service-type=exec --unit=nixos-re>
             ├─5363 sudo --preserve-env=NIXOS_INSTALL_BOOTLOADER -- systemd-run -E LOCALE_ARCHIVE --collect --no-ask-password --pty --quiet --same-dir --service-type=exec --unit=nixos-rebuild-swi>
             └─5364 systemd-run -E LOCALE_ARCHIVE --collect --no-ask-password --pty --quiet --same-dir --service-type=exec --unit=nixos-rebuild-switch-to-configuration --wait /nix/store/hy54mnahn>

Oct 22 14:45:22 az systemd[1]: Started Session 16 of User deploy.
Oct 22 14:45:23 az sudo[5340]:   deploy : PWD=/var/empty ; USER=root ; COMMAND=/run/current-system/sw/bin/nix-env -p /nix/var/nix/profiles/system --set /nix/store/hy54mnahnssgf95wfk890gr6kgamnizq>
Oct 22 14:45:23 az sudo[5340]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=993)
Oct 22 14:45:23 az sudo[5340]: pam_unix(sudo:session): session closed for user root
Oct 22 14:45:24 az sudo[5351]:   deploy : PWD=/var/empty ; USER=root ; COMMAND=/run/current-system/sw/bin/systemd-run -E LOCALE_ARCHIVE --collect --no-ask-password --pty --quiet --same-dir --serv>
Oct 22 14:45:24 az sudo[5351]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=993)
Oct 22 14:45:24 az sudo[5351]: pam_unix(sudo:session): session closed for user root
Oct 22 14:45:24 az sudo[5363]:   deploy : PWD=/var/empty ; USER=root ; COMMAND=/run/current-system/sw/bin/systemd-run -E LOCALE_ARCHIVE --collect --no-ask-password --pty --quiet --same-dir --serv>
Oct 22 14:45:24 az sudo[5363]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=993)

Additional context

Notify maintainers

@thiagokokada

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

[user@system:~]$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 6.5.5, NixOS, 23.11 (Tapir), 23.11.20231021.fcb40e7`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.17.1`
 - nixpkgs: `/nix/store/xvwsw3qcgzw1xs4i9yypl9qipq08maig-source`
@thiagokokada
Contributor

This may be related to #83392, which @Ma27 mentioned on the original PR but which I couldn't find a way to reproduce.

I imagine this problem may only happen when systemd itself is updated. @duament, could you confirm whether this issue happens on every update or just on big ones?

@thiagokokada
Contributor

BTW, looking at your logs, it seems I forgot to preserve the bootloader updates inside the systemd-run call, so this also needs to be fixed.

@thiagokokada
Contributor

thiagokokada commented Oct 22, 2023

This works fine for me:

$ systemd-run -E PATH --user --same-dir --wait --collect --tty nix run nixpkgs#nixos-rebuild -- --flake .#zachune-nixos --target-host root@zachune-nixos-uk switch
Running as unit: run-u875.service
Press ^] three times within 1s to disconnect TTY.
building the system configuration...
copying 0 paths...
updating GRUB 2 menu...
activating the configuration...
showing changes compared to /run/current-system...
setting up /etc...
reloading user units for root...
setting up tmpfiles
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 7.118s
CPU time consumed: 360ms

@duament Can you pass the --wait --collect --tty flags and post the TTY output so we can check the logs? Without -E PATH it doesn't work for me, because my configuration is Flake-based and it complains about the lack of git.

@duament
Contributor Author

duament commented Oct 22, 2023

Can you pass the --wait --collect --tty flags and post the TTY output so we can check the logs? Without -E PATH it doesn't work for me, because my configuration is Flake-based and it complains about the lack of git.

I can only reproduce it without the --tty flag. Here are the logs:

❯ systemd-run --user --wait --collect --tty --same-dir nixos-rebuild --flake .#az --target-host deploy@az.rvf6.com --use-remote-sudo switch
Running as unit: run-u114.service
Press ^] three times within 1s to disconnect TTY.
building the system configuration...
error: can not save history
warning-path: Unable to locate data directory derived from $HOME: '/var/empty/.local/share/fish'.
warning-path: The error was 'Operation not permitted'.
warning-path: Please set $HOME to a directory where you have write access.

error: can not save universal variables or functions
warning-path: Unable to locate config directory derived from $HOME: '/var/empty/.config/fish'.
warning-path: The error was 'Operation not permitted'.
warning-path: Please set $HOME to a directory where you have write access.

copying 0 paths...
error: can not save history
warning-path: Unable to locate data directory derived from $HOME: '/var/empty/.local/share/fish'.
warning-path: The error was 'Operation not permitted'.
warning-path: Please set $HOME to a directory where you have write access.

error: can not save universal variables or functions
warning-path: Unable to locate config directory derived from $HOME: '/var/empty/.config/fish'.
warning-path: The error was 'Operation not permitted'.
warning-path: Please set $HOME to a directory where you have write access.

error: can not save history
warning-path: Unable to locate data directory derived from $HOME: '/var/empty/.local/share/fish'.
warning-path: The error was 'Operation not permitted'.
warning-path: Please set $HOME to a directory where you have write access.

error: can not save universal variables or functions
warning-path: Unable to locate config directory derived from $HOME: '/var/empty/.config/fish'.
warning-path: The error was 'Operation not permitted'.
warning-path: Please set $HOME to a directory where you have write access.

activating the configuration...
sops-install-secrets: Imported /persist/etc/ssh/ssh_host_ed25519_key as age key with fingerprint age1tpln8534w0ttdp7sd7tf3zeyr3m4w707dakt8kgm8j8c9r0vhyjqhae023
setting up /etc...
sops-install-secrets: Imported /persist/etc/ssh/ssh_host_ed25519_key as age key with fingerprint age1tpln8534w0ttdp7sd7tf3zeyr3m4w707dakt8kgm8j8c9r0vhyjqhae023
reloading user units for deploy...
reloading user units for rvfg...
setting up tmpfiles
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 11.466s
CPU time consumed: 328ms
❯ systemd-run --user --wait --collect --same-dir nixos-rebuild --flake .#az --target-host deploy@az.rvf6.com --use-remote-sudo switch
Running as unit: run-u115.service

❯ journalctl --user -u run-u115.service -n 1000 -f
Oct 22 19:34:57 xiaoxin systemd[1659]: Started /run/current-system/sw/bin/nixos-rebuild --flake .#az --target-host deploy@az.rvf6.com --use-remote-sudo switch.
Oct 22 19:34:57 xiaoxin nixos-rebuild[405923]: building the system configuration...
Oct 22 19:35:01 xiaoxin nixos-rebuild[405969]: error: can not save history
Oct 22 19:35:01 xiaoxin nixos-rebuild[405969]: warning-path: Unable to locate data directory derived from $HOME: '/var/empty/.local/share/fish'.
Oct 22 19:35:01 xiaoxin nixos-rebuild[405969]: warning-path: The error was 'Operation not permitted'.
Oct 22 19:35:01 xiaoxin nixos-rebuild[405969]: warning-path: Please set $HOME to a directory where you have write access.
Oct 22 19:35:01 xiaoxin nixos-rebuild[405969]: error: can not save universal variables or functions
Oct 22 19:35:01 xiaoxin nixos-rebuild[405969]: warning-path: Unable to locate config directory derived from $HOME: '/var/empty/.config/fish'.
Oct 22 19:35:01 xiaoxin nixos-rebuild[405969]: warning-path: The error was 'Operation not permitted'.
Oct 22 19:35:01 xiaoxin nixos-rebuild[405969]: warning-path: Please set $HOME to a directory where you have write access.
Oct 22 19:35:01 xiaoxin nixos-rebuild[405953]: copying 0 paths...
Oct 22 19:35:02 xiaoxin nixos-rebuild[405969]: error: can not save history
Oct 22 19:35:02 xiaoxin nixos-rebuild[405969]: warning-path: Unable to locate data directory derived from $HOME: '/var/empty/.local/share/fish'.
Oct 22 19:35:02 xiaoxin nixos-rebuild[405969]: warning-path: The error was 'Operation not permitted'.
Oct 22 19:35:02 xiaoxin nixos-rebuild[405969]: warning-path: Please set $HOME to a directory where you have write access.
Oct 22 19:35:02 xiaoxin nixos-rebuild[405969]: error: can not save universal variables or functions
Oct 22 19:35:02 xiaoxin nixos-rebuild[405969]: warning-path: Unable to locate config directory derived from $HOME: '/var/empty/.config/fish'.
Oct 22 19:35:02 xiaoxin nixos-rebuild[405969]: warning-path: The error was 'Operation not permitted'.
Oct 22 19:35:02 xiaoxin nixos-rebuild[405969]: warning-path: Please set $HOME to a directory where you have write access.
Oct 22 19:35:02 xiaoxin nixos-rebuild[405969]: error: can not save history
Oct 22 19:35:02 xiaoxin nixos-rebuild[405969]: warning-path: Unable to locate data directory derived from $HOME: '/var/empty/.local/share/fish'.
Oct 22 19:35:02 xiaoxin nixos-rebuild[405969]: warning-path: The error was 'Operation not permitted'.
Oct 22 19:35:02 xiaoxin nixos-rebuild[405969]: warning-path: Please set $HOME to a directory where you have write access.
Oct 22 19:35:02 xiaoxin nixos-rebuild[405969]: error: can not save universal variables or functions
Oct 22 19:35:02 xiaoxin nixos-rebuild[405969]: warning-path: Unable to locate config directory derived from $HOME: '/var/empty/.config/fish'.
Oct 22 19:35:02 xiaoxin nixos-rebuild[405969]: warning-path: The error was 'Operation not permitted'.
Oct 22 19:35:02 xiaoxin nixos-rebuild[405969]: warning-path: Please set $HOME to a directory where you have write access.

The second command, without the --tty flag, hangs. On the remote machine, however, switch-to-configuration runs successfully and only the systemd-run process hangs.

@duament
Contributor Author

duament commented Oct 22, 2023

This may be related to #83392, which @Ma27 mentioned on the original PR but which I couldn't find a way to reproduce.

I imagine this problem may only happen when systemd itself is updated. @duament, could you confirm whether this issue happens on every update or just on big ones?

I believe it's not related to a systemd update since I can reproduce it by running the same command again without an actual system upgrade.

@thiagokokada
Contributor

thiagokokada commented Oct 22, 2023

warning-path: Unable to locate data directory derived from $HOME: '/var/empty/.local/share/fish'.
warning-path: The error was 'Operation not permitted'.
warning-path: Please set $HOME to a directory where you have write access.

Your configuration seems to depend on the $HOME environment variable being set, which needs to be fixed because it is impure. This is one of the things that #258571 changes, and it is expected (so no bug here).

See the reasoning here:

As a workaround for now, you can pass NIXOS_SWITCH_USE_DIRTY_ENV=1 to nixos-rebuild, for example:

$ NIXOS_SWITCH_USE_DIRTY_ENV=1 sudo --preserve-env=NIXOS_SWITCH_USE_DIRTY_ENV nixos-rebuild switch --install-bootloader

Sadly this is probably broken for remote installations until this is fixed in #262724. Edit: no it is not.

Keep in mind that NIXOS_SWITCH_USE_DIRTY_ENV is a temporary workaround until all activation scripts are fixed to not depend on the environment, and it will be removed in the future.

@duament
Contributor Author

duament commented Oct 22, 2023

Your configuration seems to depend on the "$HOME" environment variable being set, which needs to be fixed because it is impure. This is one of the things that #258571 changes, and it is expected (so no bug here).

That $HOME-related warning/error comes from fish. It just complains that my $HOME (/var/empty) is not writable. It does not exit with an error.

@duament
Contributor Author

duament commented Oct 22, 2023

[Screenshot: 20231022_20h18m40s_grim]

It seems systemd-run --pty will hang if no tty is available. Not sure if it's a feature or a bug.

@duament
Contributor Author

duament commented Oct 22, 2023

[Screenshot: 20231022_20h24m16s_grim]

Adding a --pipe flag could prevent it from hanging.

I guess we should add a --pipe flag in nixos-rebuild.sh?

@thiagokokada
Contributor

[Screenshot: 20231022_20h24m16s_grim]

Adding a --pipe flag could prevent it from hanging.

I guess we should add a --pipe flag in nixos-rebuild.sh?

I guess it is fair:

--pipe, -P
           If specified, standard input, output, and error of the
           transient service are inherited from the systemd-run command
           itself. This allows systemd-run to be used within shell
           pipelines. Note that this mode is not suitable for
           interactive command shells and similar, as the service
           process will not become a TTY controller when invoked on a
           terminal. Use --pty instead in that case.

           When both --pipe and --pty are used in combination the more
           appropriate option is automatically determined and used.
           Specifically, when invoked with standard input, output and
           error connected to a TTY --pty is used, and otherwise --pipe.

           When this option is used the original file descriptors
           systemd-run receives are passed to the service processes
           as-is. If the service runs with different privileges than
           systemd-run, this means the service might not be able to
           re-open the passed file descriptors, due to normal file
           descriptor access restrictions. If the invoked process is a
           shell script that uses the echo "hello" >/dev/stderr
           construct for writing messages to stderr, this might cause
           problems, as this only works if stderr can be re-opened. To
           mitigate this use the construct echo "hello" >&2 instead,
           which is mostly equivalent and avoids this pitfall.

Especially because it seems --pipe and --pty can be used together, and systemd-run will do the appropriate thing.

@thiagokokada
Contributor

I will test --pipe and add it to #262724 if it works.

@thiagokokada
Contributor

[Screenshot: 20231022_20h18m40s_grim]

It seems systemd-run --pty will hang if no tty is available. Not sure if it's a feature or a bug.

BTW, from the systemd-run manual it seems that this is a feature. --pty will connect to the terminal it is invoked on, and there is no fallback unless --pipe is passed.

So adding both flags seems correct in this case.
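
To make the selection rule from the quoted manual text concrete, here is a minimal shell sketch of the same decision: pick --pty only when standard input, output, and error are all terminals, and fall back to --pipe otherwise. The helper name choose_io_flag and the variable io_flag are made up for illustration. This is not how systemd-run or nixos-rebuild.sh implements it, only a rough equivalent of the documented behaviour.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of nixos-rebuild.sh): mimic the documented
# rule for combining --pipe and --pty.
# It sets io_flag instead of printing the result, because calling it inside
# $(...) would turn stdout into a pipe and skew the check.
choose_io_flag() {
    if [ -t 0 ] && [ -t 1 ] && [ -t 2 ]; then
        io_flag="--pty"    # all three standard streams are terminals
    else
        io_flag="--pipe"   # e.g. a CI runner or ssh without a tty
    fi
}

choose_io_flag
echo "this invocation would use: $io_flag"
```

Run from an interactive terminal this should report --pty; run with any of the three streams redirected (for example with stdin from /dev/null) it reports --pipe, which matches the situation in the GitHub Actions job.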

@thiagokokada
Contributor

I tried to add the flag --pipe to the switch-to-configuration call, see patch:

diff --git a/pkgs/os-specific/linux/nixos-rebuild/nixos-rebuild.sh b/pkgs/os-specific/linux/nixos-rebuild/nixos-rebuild.sh
index 9e75db6d27b5..cb616d3cf709 100755
--- a/pkgs/os-specific/linux/nixos-rebuild/nixos-rebuild.sh
+++ b/pkgs/os-specific/linux/nixos-rebuild/nixos-rebuild.sh
@@ -663,6 +663,7 @@ if [[ "$action" = switch || "$action" = boot || "$action" = test || "$action" =
         "--collect"
         "--no-ask-password"
         "--pty"
+        "--pipe"
         "--quiet"
         "--same-dir"
         "--service-type=exec"

But I got stuck on some strange errors during the tests:

$ nix-build -A nixosTests.nixos-rebuild-install-bootloader
...
machine # qemu-kvm: terminating on signal 15 from pid 6 (/nix/store/ffll6glz3gwx342z0ch8wx30p5cnqz1z-python3-3.11.5/bin/python3.11)
(finished: cleanup, in 0.72 seconds)
Traceback (most recent call last):
  File "/nix/store/n6zqh28rh2r2h42npr5drz616khdrjj2-nixos-test-driver-1.1/bin/.nixos-test-driver-wrapped", line 9, in <module>
    sys.exit(main())
             ^^^^^^
  File "/nix/store/n6zqh28rh2r2h42npr5drz616khdrjj2-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/__init__.py", line 117, in main
    driver.run_tests()
  File "/nix/store/n6zqh28rh2r2h42npr5drz616khdrjj2-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/driver.py", line 147, in run_tests
    self.test_script()
  File "/nix/store/n6zqh28rh2r2h42npr5drz616khdrjj2-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/driver.py", line 143, in test_script
    exec(self.tests, symbols, None)
  File "<string>", line 10, in <module>
  File "/nix/store/n6zqh28rh2r2h42npr5drz616khdrjj2-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/machine.py", line 675, in succeed
    (status, out) = self.execute(command, timeout=timeout)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/n6zqh28rh2r2h42npr5drz616khdrjj2-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/machine.py", line 614, in execute
    rc = int(self._next_newline_closed_block_from_shell().strip())
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: 'echo ${PIPESTATUS[0]}'
kill vlan (pid 7)
error: builder for '/nix/store/z7a152s3jc20mm9i998cck779cbhbm2s-vm-test-run-nixos-rebuild-install-bootloader.drv' failed with exit code 1;
       last 10 log lines:
       >     exec(self.tests, symbols, None)
       >   File "<string>", line 10, in <module>
       >   File "/nix/store/n6zqh28rh2r2h42npr5drz616khdrjj2-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/machine.py", line 675, in succeed
       >     (status, out) = self.execute(command, timeout=timeout)
       >                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       >   File "/nix/store/n6zqh28rh2r2h42npr5drz616khdrjj2-nixos-test-driver-1.1/lib/python3.11/site-packages/test_driver/machine.py", line 614, in execute
       >     rc = int(self._next_newline_closed_block_from_shell().strip())
       >          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       > ValueError: invalid literal for int() with base 10: 'echo ${PIPESTATUS[0]}'
       > kill vlan (pid 7)
       For full logs, run 'nix log /nix/store/z7a152s3jc20mm9i998cck779cbhbm2s-vm-test-run-nixos-rebuild-install-bootloader.drv'.

Not sure what is happening here, maybe @roberth can help?


For now I think one workaround can be to wrap the call to nixos-rebuild in a bash call, something like:

$ bash -c "nix run nixpkgs#nixos-rebuild -- --flake .#\"$host\" --target-host deploy@\"$host\".rvf6.com --use-remote-sudo switch"

@thiagokokada
Contributor

BTW, I think we can substitute --pty with --pipe, because switch-to-configuration is not an interactive process anyway. I tried this and it works with my configuration; however, I still get the same error during the tests.

@thiagokokada
Contributor

BTW, I think we can substitute --pty with --pipe, because switch-to-configuration is not an interactive process anyway. I tried this and it works with my configuration; however, I still get the same error during the tests.

I think I found a way to fix the issue, opened a PR: #262985

@thiagokokada
Contributor

I think I found a way to fix the issue, opened a PR: #262985

It didn't work (--scope and --pipe don't work together). I will probably need further help with this issue.

@duament
Contributor Author

duament commented Oct 25, 2023

ValueError: invalid literal for int() with base 10: 'echo ${PIPESTATUS[0]}'

Investigating the test error:

diff --git a/nixos/tests/nixos-rebuild-install-bootloader.nix b/nixos/tests/nixos-rebuild-install-bootloader.nix
index 3ade90ea24a..d1a01887f20 100644
--- a/nixos/tests/nixos-rebuild-install-bootloader.nix
+++ b/nixos/tests/nixos-rebuild-install-bootloader.nix
@@ -55,12 +55,12 @@ import ./make-test-python.nix ({ pkgs, ... }: {
           "${configFile}",
           "/etc/nixos/configuration.nix",
       )
-      machine.succeed("nixos-rebuild switch")
+      machine.succeed("stty -a 1>&2; nixos-rebuild switch; stty -a 1>&2; stty -echo")

       # Need to run `nixos-rebuild` twice because the first run will install
       # GRUB anyway
       with subtest("Switch system again and install bootloader"):
-          result = machine.succeed("nixos-rebuild switch --install-bootloader")
+          result = machine.succeed("stty -a 1>&2; nixos-rebuild switch --install-bootloader; stty -a 1>&2; stty -echo")
           # install-grub2.pl messages
           assert "updating GRUB 2 menu..." in result
           assert "installing the GRUB 2 boot loader on /dev/vda..." in result
diff --git a/pkgs/os-specific/linux/nixos-rebuild/nixos-rebuild.sh b/pkgs/os-specific/linux/nixos-rebuild/nixos-rebuild.sh
index 9e75db6d27b..cb616d3cf70 100755
--- a/pkgs/os-specific/linux/nixos-rebuild/nixos-rebuild.sh
+++ b/pkgs/os-specific/linux/nixos-rebuild/nixos-rebuild.sh
@@ -663,6 +663,7 @@ if [[ "$action" = switch || "$action" = boot || "$action" = test || "$action" =
         "--collect"
         "--no-ask-password"
         "--pty"
+        "--pipe"
         "--quiet"
         "--same-dir"
         "--service-type=exec"

I found that the tty settings are being changed.

Logs:

machine: must succeed: stty -a 1>&2; nixos-rebuild switch; stty -a 1>&2; stty -echo
machine # speed 38400 baud; rows 0; columns 0; line = 0;
machine # intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>;
machine # eol2 = <undef>; swtch = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R;
machine # werase = ^W; lnext = ^V; discard = ^O; min = 1; time = 0;
machine # -parenb -parodd -cmspar cs8 hupcl -cstopb cread -clocal -crtscts
machine # -ignbrk -brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr -icrnl -ixon -ixoff
machine # -iuclc -ixany -imaxbel -iutf8
machine # -opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
machine # -isig -icanon iexten -echo echoe echok -echonl -noflsh -xcase -tostop -echoprt
machine # echoctl echoke -flusho -extproc
machine # building Nix...
machine # building the system configuration...
...
machine # [  207.571327] systemd[1]: nixos-rebuild-switch-to-configuration.service: Deactivated successfully.
machine # speed 38400 baud; rows 0; columns 0; line = 0;
machine # intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>;
machine # eol2 = <undef>; swtch = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R;
machine # werase = ^W; lnext = ^V; discard = ^O; min = 1; time = 0;
machine # -parenb -parodd -cmspar cs8 hupcl -cstopb cread -clocal -crtscts
machine # -ignbrk -brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr icrnl -ixon -ixoff
machine # -iuclc -ixany imaxbel iutf8
machine # opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
machine # isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop echoprt
machine # echoctl echoke -flusho -extproc
(finished: must succeed: stty -a 1>&2; nixos-rebuild switch --install-bootloader; stty -a 1>&2; stty -echo, in 12.53 seconds)
Test "Switch system again and install bootloader" failed with error: ""
cleanup
kill machine (pid 8)
machine # qemu-kvm: terminating on signal 15 from pid 6 (/nix/store/ffll6glz3gwx342z0ch8wx30p5cnqz1z-python3-3.11.5/bin/python3.>
(finished: cleanup, in 0.17 seconds)
kill vlan (pid 7)

Logs diff:

--- old 2023-10-25 08:59:07.693220929 +0800
+++ new 2023-10-25 08:59:05.455250752 +0800
@@ -3,8 +3,8 @@
 machine # eol2 = <undef>; swtch = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R;
 machine # werase = ^W; lnext = ^V; discard = ^O; min = 1; time = 0;
 machine # -parenb -parodd -cmspar cs8 hupcl -cstopb cread -clocal -crtscts
-machine # -ignbrk -brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr -icrnl -ixon -ixoff
-machine # -iuclc -ixany -imaxbel -iutf8
-machine # -opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
-machine # -isig -icanon iexten -echo echoe echok -echonl -noflsh -xcase -tostop -echoprt
+machine # -ignbrk -brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr icrnl -ixon -ixoff
+machine # -iuclc -ixany imaxbel iutf8
+machine # opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
+machine # isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop echoprt
 machine # echoctl echoke -flusho -extproc

I'll try to restore the settings and see whether the test passes.
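
One generic way to restore terminal settings around a command is `stty -g`, which prints the current settings in a form that `stty` can read back later. The sketch below only illustrates that idea. The wrapper name run_preserving_tty is hypothetical, and the thread does not say whether this approach made the test pass or whether it is what the eventual fix (#263360) does.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: save the tty settings of stdin, run a command,
# then put the settings back and return the command's exit status.
run_preserving_tty() {
    local saved=""
    # Only try to save settings when stdin really is a terminal.
    if [ -t 0 ]; then
        saved=$(stty -g)
    fi

    "$@"
    local rc=$?

    # Restore whatever the command (or its children) changed.
    if [ -n "$saved" ]; then
        stty "$saved"
    fi
    return "$rc"
}

# Example use (commented out so the sketch has no side effects):
# run_preserving_tty nixos-rebuild switch
```

When stdin is not a terminal the wrapper simply runs the command, so it is harmless in pipelines and CI jobs.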

@greed42

greed42 commented Nov 30, 2023

I'm getting an identical hang when using Packer to produce EC2 EBS images, with a Packer HCL fragment containing:

  provisioner "shell" {
    inline = [
      "set -x",
      "nix-channel --add \"https://nixos.org/channels/nixos-23.11\" nixos",
      "nix-channel --update nixos",
      "nixos-rebuild boot",
    ]
  }

I'm starting from the NixOS-23.05.555.52869451b83-* AMIs. I can confirm the NIXOS_SWITCH_USE_DIRTY_ENV=1 workaround works with Packer.

@bmillwood

bmillwood commented Dec 11, 2023

Since upgrading to nixos-23.11 I'm having issues that sound similar: if I run just

nixos-rebuild switch -I nixpkgs=/home/ben/code/nix/noether/nixpkgs

it works fine, but if I run

echo nixos-rebuild switch -I nixpkgs=/home/ben/code/nix/noether/nixpkgs | sudo bash

then it hangs after this:

$ sudo bash <<< "nixos-rebuild switch -I nixpkgs=/home/ben/code/nix/noether/nixpkgs"
building Nix...
building the system configuration...

until I ctrl-C it. Looking at ps while it is hanging, it does indeed seem to be systemd-run:

root       27378   27377  0 22:51 pts/4    00:00:00 /nix/store/q1c2flcykgr4wwg5a6h450hxbk4ch589-bash-5.2-p15/bin/bash /nix/store/fpr71zwhdbnqmax2c898skbj8b6am5j4-nixos-rebuild/bin/nixos-rebuild switch -I nixpkgs=/home/ben/code/nix/noether/nixpkgs
root       27504   27378  0 22:51 pts/4    00:00:00 sudo --preserve-env=NIXOS_INSTALL_BOOTLOADER -- systemd-run -E LOCALE_ARCHIVE -E NIXOS_INSTALL_BOOTLOADER --collect --no-ask-password --pty --quiet --same-dir --service-type=exec --unit=nixos-rebuild-switch-to-configuration --wait /nix/store/rygglgvmsy3j98wkas1iyg7r0m789gi4-nixos-system-noether-23.11.git.781e2a9797ec/bin/switch-to-configuration switch
root       27505   27504  0 22:51 pts/7    00:00:00 sudo --preserve-env=NIXOS_INSTALL_BOOTLOADER -- systemd-run -E LOCALE_ARCHIVE -E NIXOS_INSTALL_BOOTLOADER --collect --no-ask-password --pty --quiet --same-dir --service-type=exec --unit=nixos-rebuild-switch-to-configuration --wait /nix/store/rygglgvmsy3j98wkas1iyg7r0m789gi4-nixos-system-noether-23.11.git.781e2a9797ec/bin/switch-to-configuration switch
root       27506   27505  0 22:51 pts/7    00:00:00 systemd-run -E LOCALE_ARCHIVE -E NIXOS_INSTALL_BOOTLOADER --collect --no-ask-password --pty --quiet --same-dir --service-type=exec --unit=nixos-rebuild-switch-to-configuration --wait /nix/store/rygglgvmsy3j98wkas1iyg7r0m789gi4-nixos-system-noether-23.11.git.781e2a9797ec/bin/switch-to-configuration switch

and indeed echo "NIXOS_SWITCH_USE_DIRTY_ENV=true nixos-rebuild switch -I nixpkgs=/home/ben/code/nix/noether/nixpkgs" | sudo bash completes just fine.

(I assume the fact that I'm passing in a different nixpkgs with -I is irrelevant, but I'm including it just in case; in my real use case the stdin I feed to sudo bash is more interesting, this is just a minimal reproduction)

I have an effective workaround for my case (just running commands in sudo one-by-one, instead of feeding a script on stdin to sudo bash, which I guess means that there is a pty available) but happy to help debug if necessary.
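
The difference between the two invocations can be seen with a small probe: when the script text is fed to bash through a pipe, the shell's standard input is that pipe rather than a terminal, even though the session still has a controlling terminal. The sketch below leaves out sudo and uses made-up variable names (probe, piped_result). It only demonstrates the stdin check and does not claim to reproduce the systemd-run hang itself.

```shell
#!/usr/bin/env bash
# Probe script: reports whether its standard input is a terminal.
probe='if [ -t 0 ]; then echo "stdin is a tty"; else echo "stdin is not a tty"; fi'

# Script text arrives on a pipe, as in `echo ... | sudo bash`
# (sudo omitted here so the sketch runs without privileges).
piped_result=$(printf '%s\n' "$probe" | bash)
echo "piped script: $piped_result"
```

Running the same probe directly at an interactive prompt would be expected to report a tty, which is consistent with the one-by-one workaround described above.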

Enzime added a commit to Enzime/nixpkgs that referenced this issue Dec 30, 2023
Enzime added a commit to Enzime/nixpkgs that referenced this issue Jan 3, 2024
brokenpylons pushed a commit to UM-LPM/server that referenced this issue Mar 30, 2024
brokenpylons pushed a commit to UM-LPM/server that referenced this issue Jul 12, 2024