Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New NixOS services not starting until a second switch in EC2 image #23121

Closed
copumpkin opened this issue Feb 23, 2017 · 4 comments
Closed

New NixOS services not starting until a second switch in EC2 image #23121

copumpkin opened this issue Feb 23, 2017 · 4 comments
Labels
0.kind: bug 0.kind: regression Something that worked before working no longer 6.topic: nixos

Comments

@copumpkin
Copy link
Member

copumpkin commented Feb 23, 2017

I just built an EC2 image from 15c05ad. I boot it up with userdata (i.e., configuration) that enables a trivial httpd server, as follows:

{ pkgs, lib, ... }:

{
  imports = [ <nixpkgs/nixos/modules/virtualisation/amazon-image.nix> ];
  ec2.hvm = true;

  services.httpd = {
    enable = true;
    adminAddr = "test@example.org";
    documentRoot = "${pkgs.valgrind.doc}/share/doc/valgrind/html";
  };
  networking.firewall.allowedTCPPorts = [ 80 ];

  # My user goes here
}

If I journalctl -u amazon-init.service, I see the following log:

Feb 23 19:51:59 hostname systemd[1]: Starting Reconfigure the system from EC2 userdata on startup...
Feb 23 19:51:59 hostname amazon-init-start[723]: attempting to fetch configuration from EC2 user data...
Feb 23 19:52:00 hostname amazon-init-start[723]: setting configuration from EC2 user data
Feb 23 19:52:20 hostname amazon-init-start[723]: building Nix...
Feb 23 19:52:20 hostname amazon-init-start[723]: building the system configuration...
Feb 23 19:52:48 hostname amazon-init-start[723]: download-from-binary-cache.pl: still waiting for ‘https://cache.nixos.org/sbn7d3dnc3xi5zy4bdy7ss04xljkvz14.narinfo’ after 5 seconds...
Feb 23 19:52:54 hostname amazon-init-start[723]: these derivations will be built:
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/0xvsvyrinmqq0nf4f8lqdsgs6bb7gp7g-unit-script.drv
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/1yff0asc6m1ab7cai9mc0xd2z3ckxq5i-httpd.conf.drv
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/nsa09ikwpza490p4p315gf4lq9sm3qrz-nixos-version.drv
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/p4nr2rsdd3qjh5xvmgbl7wwhqdanhips-system-path.drv
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/vv1b36irfpazzjhqm02bmqswdn5vmyk4-dbus-catalog.xml.drv
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/6jscbxm7kmxpmhaqa9vap8wv08dicw1c-dbus-1.drv
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/3pspz56sakhwl0yiy1nbfpjlg5cpcgqb-unit-dbus.service.drv
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/7ymiyjjw5333392nmn8zh86m3jpizzqz-firewall-start.drv
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/n263vhshbl9c9ywc8zfcmbwpggdvdd10-unit-httpd.service.drv
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/psd2l5i4cmv1ph50rch39i195hx4armh-unit-polkit.service.drv
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/qsvcrk9733m7m35ss84w9smipb6c6rm3-unit-autovt-.service-disabled.drv
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/sy55v9r8yqqfw4a7k3l6np4ccvij83vy-unit-console-getty.service-disabled.drv
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/xpn79dllixhv1y0c0wnc949mf4dfmy6f-firewall-reload.drv
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/zvxjr0fr32a1i4wh3k40mn42gbv5vbd7-unit-firewall.service.drv
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/8982cagj24zal6g918w44p39shis37kx-system-units.drv
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/dq92xcsqiakmh19jfjqx7if14inldl2j-users-groups.json.drv
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/qpasrarcggsy59r0hkssf80fcmpg147i-user-units.drv
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/wnr4zg7x7qnk364lidfbyga186iwb26c-copumpkin-authorized_keys.drv
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/ndk893qib2099wvgls5g037d5q91kwcj-etc.drv
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/k9rdwvkqppp9r6dh39wf6wbryqc285iv-nixos-system-unnamed-17.03.git.15c05ad.drv
Feb 23 19:52:54 hostname amazon-init-start[723]: these paths will be fetched (43.31 MiB download, 187.54 MiB unpacked):
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/08g18hjzsqln0i2q9f3n614dvamcqswc-linux-headers-4.4.10
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/41p951m1ykjdr3w20w5wq1y49vsigjj3-libxslt-1.1.29-bin
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/43cpb15i8jjing1gwn8rmbg80lhl8kqv-binutils-2.27
Feb 23 19:52:54 hostname amazon-init-start[723]:   /nix/store/7vmyzhrzgpkjg0hdw1rpx9wkklygq1f2-cyrus-sasl-2.5.10
[..., you get the idea]
Feb 23 19:53:16 hostname amazon-init-start[723]: building path(s) ‘/nix/store/8bcslnycas9aqi8avy2crca8z83pirnj-system-units’
Feb 23 19:53:17 hostname amazon-init-start[723]: building path(s) ‘/nix/store/qz08d1iaf8dkyw6jbvghnl2qrb0iz6g5-etc’
Feb 23 19:53:17 hostname amazon-init-start[723]: building path(s) ‘/nix/store/bhwcr85q1nhqac746df2ixb92ki75r3w-nixos-system-unnamed-17.03.git.15c05ad’
Feb 23 19:53:18 hostname amazon-init-start[723]: updating GRUB 2 menu...
Feb 23 19:53:35 hostname nixos[2687]: switching to system configuration /nix/store/bhwcr85q1nhqac746df2ixb92ki75r3w-nixos-system-unnamed-17.03.git.15c05ad
Feb 23 19:53:35 hostname amazon-init-start[723]: activating the configuration...
Feb 23 19:53:35 hostname amazon-init-start[723]: setting up /etc...
Feb 23 19:53:36 hostname amazon-init-start[723]: setting up tmpfiles
Feb 23 19:53:36 hostname amazon-init-start[723]: reloading the following units: dbus.service, firewall.service
Feb 23 19:53:36 hostname amazon-init-start[723]: the following new units were started: systemd-vconsole-setup.service
Feb 23 19:53:36 hostname nixos[2687]: finished switching to system configuration /nix/store/bhwcr85q1nhqac746df2ixb92ki75r3w-nixos-system-unnamed-17.03.git.15c05ad
Feb 23 19:53:36 hostname systemd[1]: Started Reconfigure the system from EC2 userdata on startup.

Note that it starts systemd-vconsole-setup.service (see #23118) but does not start httpd.service.

I log into the target machine (I can ask it to create users in the new config and those work, so it's clearly getting activated properly) and run sytemctl status httpd.service:

● httpd.service - Apache HTTPD
   Loaded: loaded (/nix/store/sz2lc1s8c0s4w86a6s7iaqc2hl6vc3p9-unit-httpd.service/httpd.service; enabled; vendor preset: enabled)
   Active: inactive (dead)

So the service was added, but nobody bothered starting it. Finally, I run nixos-rebuild switch:

# nixos-rebuild switch
building Nix...
building the system configuration...
updating GRUB 2 menu...
activating the configuration...
setting up /etc...
setting up tmpfiles
the following new units were started: httpd.service, systemd-vconsole-setup.service

So it does no work to build a new configuration, but finally notices that httpd.service needs to be started.

This seems pretty mysterious to me. I'm trying to walk through the switch-to-configuration.pl logic to figure out how it decides what to start but I haven't grokked it yet.

cc @edolstra

Edit: I'm unsure if this is specific to EC2, but testing on non-EC2 is harder for me.

@copumpkin copumpkin added 0.kind: bug 6.topic: nixos 0.kind: regression Something that worked before working no longer labels Feb 23, 2017
@copumpkin
Copy link
Member Author

I'm struggling to understand how switch-to-configuration.pl is supposed to start brand new services, in general. I must be missing something because we all know it works fine most of the time, but here's my understanding of the relevant parts of switch-to-configuration.pl:

  1. List units in current system (so actual /etc)
  2. Iterate over units in step 1, compare them against the system we're about to activate ($out/etc) and apply logic to figure out if they should be started, restarted, reloaded, skipped, stopped, etc.
  3. Activate the new system
  4. Run systemctl daemon-reload to tell it to reload its unit files
  5. Start/restart/reload/etc. the units we computed in step 2
  6. List units in current system again
  7. Check that they started, possibly call them "new", etc.

What has me stumped is how brand new services are ever started. The iteration in step 2 runs over pre-existing units, so when I enable httpd.service for the first time, it won't show up in that loop. As a result, it won't ever get added to unitsToStart, and thus doesn't get started.

What am I missing? How do services get started for the first time ever? The @new variable looked promising until I noticed it's only computed after we actually start the new units we've decided are worthy of starting. All the reload/restart logic makes sense to me, but the only place unitsToStart gets manipulated is inside the loop that never covers the brand new unit.

@copumpkin
Copy link
Member Author

So I have no clue why, but reverting #22869 fixes this for me. I'll look at it some more tomorrow but would appreciate any suggestions anyone might have.

@copumpkin copumpkin changed the title New NixOS services not starting until a second switch (in EC2 image at least) New NixOS services not starting until a second switch in EC2 image Feb 24, 2017
@copumpkin
Copy link
Member Author

Okay, I investigated further.

New services typically get started up by switch-to-configuration.pl by firing up targets that want or require them. Those targets don't show up in the units being started messages because we filter them out to minimize noise.

In the case of httpd, its service definition has wantedBy = [ "multi-user.target" ];, so for switch-to-configuration.pl to start it up, we need that target to be in %unitsToStart. Unfortunately, the amazon-init service is starting early in the boot, before multi-user.target has activated. If we look at the top-level loop we use to determine which units to do what to, the whole piece of logic is wrapped in this:

    if (-e $prevUnitFile && ($state->{state} eq "active" || $state->{state} eq "activating")) {

Which means that multi-user.target doesn't get triggered by the nixos-rebuild switch invocation we make in amazon-init. Furthermore, it probably doesn't make sense to trigger multi-user.target from an earlier unit, but I don't know how to get systemd to start the service otherwise.

copumpkin added a commit that referenced this issue Feb 27, 2017
We now make it happen later in the boot process so that multi-user
has already activated, so as to not run afoul of the logic in
switch-to-configuration.pl. It's not my favorite solution, but at
least it works. Also added a check to the VM test to catch the failure
so we don't break in future.

Fixes #23121
@zimbatm
Copy link
Member

zimbatm commented Nov 7, 2018

For future visitors: if you encounter this error the issue might be elsewhere.

Run systemctl status and make sure that the state is "running". If you see a "starting" state with a number of jobs in the queue then it means that the system hasn't properly started yet. This can happen when a normal service is set to "oneshot" type for example. Run systemctl list-jobs to find out which jobs are holding the boot back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.kind: bug 0.kind: regression Something that worked before working no longer 6.topic: nixos
Projects
None yet
Development

No branches or pull requests

2 participants