Configure NixOS from EC2 user-data #7370

Merged (1 commit) on Jun 12, 2015


copumpkin
Member

[See #6662 for more discussion of how I got here and why I'm doing it this way]

This is my initial attempt at a working configure-from-user-data NixOS image. The basic idea is to create an "unstable" NixOS image: its /etc/nixos/configuration.nix doesn't actually specify the way the machine is configured, but rather assumes that an /etc/nixos/amazon-init.nix exists, which is not bundled inside the image. Instead of bundling amazon-init.nix, the image bundles a postBootCommands script that downloads the EC2 user-data and writes it into /etc/nixos/amazon-init.nix.

The "unstable" NixOS image thus only configures itself from user-data on first boot, since when it calls nixos-rebuild switch on your personalized configuration, the custom postBootCommands will go away.

User-data should look something like:

### http://nixos.org/channels/nixos-unstable nixos

{
  # Insert your config here, and you'll probably not want me to log into your box so
  # change the user below, or don't put one there at all!

  security.sudo.wheelNeedsPassword = false;

  users.extraUsers.copumpkin = 
    { createHome      = true;
      home            = "/home/copumpkin";
      description     = "Dan P";
      extraGroups     = [ "wheel" ];
      useDefaultShell = true;
      openssh.authorizedKeys.keys = [
        "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCkoril5uKjJohHvqz9Ys9R2rBH95MUb4Rxo5kcuRvEIwMranQ7xP5eU7rZqfv7elE1DLfMs19via+btUX3w8o4juYxzXjafnH6Mck5hYdvxNnErW6gsp0vGDQ0ruRCQx3UmOuC5Ld/wXY7iMQqOlxeLZF2dVCKP1+BSs37wLC7scXYu0U+wODprVpAsZIOwLP85w/uCNlC8wbvNDWG+Hx+XD/ml2ezQiNBRnh7Qo3QKgpUvVBO0d9z84g92D2H9IA+pEpJiWFcYKGEowKSVQVFCi5LoWRiz8XLKL+JeBt5mmmqjmJua6o8lXV7+nba//KCIkG+IWS4nwKQlpzZXc4H"
      ];
    };
}

One thing to note is the ### section at the top, which specifies the channels (I just strip off the ### and redirect the rest into ~/.nix-channels).
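
The channel handling described above can be sketched in shell. This is a minimal illustration rather than the script from this PR: the function names and the file-path arguments are hypothetical, and it assumes every channel line starts with exactly `### `.

```shell
# Print the channel lines of a user-data file with the leading "### " removed.
# Each resulting line has the "<url> <name>" shape used in ~/.nix-channels.
extract_channels() {
  sed -n 's/^### //p' "$1"
}

# Hypothetical helper: write the channels file and keep the whole user-data
# as the Nix configuration. Lines starting with "#" are comments in Nix, so
# the "###" header can stay in the configuration file.
split_user_data() {
  local user_data=$1 channels_file=$2 config_file=$3
  extract_channels "$user_data" > "$channels_file"
  cp "$user_data" "$config_file"
}

# Example call (paths are illustrative only):
#   split_user_data ./user-data /root/.nix-channels /etc/nixos/amazon-init.nix
```

Because the header is an ordinary Nix comment, the same file can be used as the configuration without any further preprocessing.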

If you trust me and want to try it out, I'm hosting a public AMI (until I get sick of paying for storage) of the above with ID ami-1c477874.

Questions for anyone still reading:

  • Is this the best way to do it? The postBootCommands with periodic timed check feels kind of hacky, but I also can't use a systemd service and activation scripts didn't work either.
  • Is this the best format for user-data? I like it because it's simple, specifies the full machine (with channels, although I'm not necessarily sold on the ### convention), and is very clearly just nixos configuration. Unfortunately it's not compatible with the existing format nixops uses for its temporary host keys, but I'd rather make it use a clean nix configuration file than force this into that format and have to deal with escaping and other ugliness. I haven't used NixOps much so perhaps it'll be too painful to transition it, but if possible I'd like the user-data format to be clean.

Note that it's still a WIP, so I'll obviously take out the useless echo calls and such before merging 😄

cc @edolstra @shlevy @rbvermaa

@copumpkin copumpkin added 9.needs: reporter feedback This issue needs the person who filed it to respond in progress labels Apr 14, 2015
@copumpkin copumpkin self-assigned this Apr 14, 2015
@shlevy
Member

shlevy commented Apr 14, 2015

Actually looks pretty good and simple. Without hooking into the uevent system I'm not sure if there's a better way than polling to tell when networking is available, and unfortunately we can't use systemd's management.
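
For readers unfamiliar with the polling approach being discussed, a generic retry loop looks roughly like the sketch below. It is not the code from this PR; the `retry` function, its argument order, and the attempt and delay values in the usage comment are assumptions made for illustration.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Hypothetical use against the metadata endpoint mentioned in this thread:
#   retry 30 1 curl -sf http://169.254.169.254/2011-01-01/user-data \
#     -o /etc/nixos/amazon-init.nix
```

The loop only observes whether the command succeeded, so it works without any notification from the network stack, which is the trade-off raised above.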

@edolstra
Member

Why can't we use systemd's management? If it's about running nixos-rebuild from a service, I'm sure that can be accomplished.


echo "Success"

curl -s http://169.254.169.254/2011-01-01/user-data > /etc/nixos/amazon-init.nix
Member


Why not write this to configuration.nix? That's more standard.

Member Author


I wanted a simple place that would make it clear what came from user-data and what didn't. That way, configuration.nix comes pre-wired to import from amazon-init.nix but if you want to jump in and change it, you can tell it to stop importing amazon-init.nix, or use the existing file differently and so on. It also feels weird to auto-write to a location that people typically treat as human-managed. If someone were generating their own AMI, they wouldn't run the risk of getting their handmade configuration clobbered by my startup script.

@edolstra
Member

Yeah, I think it would be good to make this co-exist with the user data format used by NixOps (e.g., encode the configuration.nix data in some way), and only perform the nixos-rebuild if the user data contains a configuration.nix. That way, we don't need to create multiple AMIs.

It could be something like:

ec2-run-instances --user-data "NIXOS_CONFIGURATION: `base64 -w0 ./my-config.nix`"
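
On the instance side, the matching decode step for such a line might look like the sketch below. The `NIXOS_CONFIGURATION: ` prefix comes from the suggestion above; the function names are hypothetical, and `base64 -d` assumes GNU coreutils or a compatible implementation.

```shell
# Encode a Nix file as a single "NIXOS_CONFIGURATION: <base64>" line.
encode_config() {
  printf 'NIXOS_CONFIGURATION: %s\n' "$(base64 < "$1" | tr -d '\n')"
}

# Read user-data on stdin. If it contains a NIXOS_CONFIGURATION line, print
# the decoded configuration; otherwise return non-zero and print nothing.
decode_config() {
  local payload
  payload=$(sed -n '/^NIXOS_CONFIGURATION: /{s///p;q;}')
  [ -n "$payload" ] || return 1
  printf '%s' "$payload" | base64 -d
}

# Example (paths are illustrative only):
#   decode_config < ./user-data > /etc/nixos/configuration.nix
```

Other key-value lines in the same user-data, such as the host-key lines NixOps writes, are simply ignored by `decode_config`.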

@copumpkin
Member Author

@edolstra I pretty much documented all my attempts at using systemd on the other ticket and failed. If someone can get it working, that'd be nice, but @rbvermaa also said it felt wrong to have a systemd service doing this and said he preferred the current way.

About the format, I'm not super-opposed to that, but it feels like a historical accident would be making this less usable. Now I can't just go to the console and look at my NixOS configuration because it's base64-encoded because NixOps (which I don't even use) can't deal with it otherwise? It just seems like such a pity that we have this super-nice configuration format (nix itself) and are dropping it in favor of a hard-to-use simple key-value mapping that forces us to preprocess the input and not be able to read it later. This feels like a "let's rip off the bandaid and deal with a little bit of transition pain" kind of situation to me. For example, the configuration I included in the original ticket would not be human-readable right now if I followed that scheme.

If you insist on the NixOps format, how should I represent the channels?

@copumpkin
Member Author

@edolstra what if I devised a real transition plan that let NixOps continue to work unchanged for now while also supporting the format I want (I haven't yet decided how to achieve that)? Then there would be no real transition pain, and NixOps could switch over to my proposed format at some point without breaking anyone.

@edolstra
Member

Yeah, my only concern is that we can have a single AMI that supports both NixOps and configuration from user data. Note that nixos/modules/virtualisation/ec2-data.nix won't be bothered by a Nix expression in the metadata since it greps for some specific lines (e.g. starting with SSH_HOST_DSA_KEY:). But amazon-hvm-userdata-config.nix will barf on NixOps' metadata, I guess.

(It would probably be good to move amazon-hvm-userdata-config.nix to nixos/modules/virtualisation/ec2-data.nix so that we have only one place that deals with user data.)

@edolstra
Member

How about this:

  • The script reads the user data, filters out any line starting with '#' or SSH_HOST_*:, and if the result is non-empty, writes the user data to /etc/nixos/configuration.nix (or wherever) and runs nixos-rebuild. If the result is empty, it does nothing.
  • In the future, NixOps prefixes any user data with '#' (e.g. # SSH_HOST_DSA_KEY: ...) so it doesn't get confused with a Nix expression.
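
The filtering rule in the two bullets above can be sketched as a single check. This is an illustration of the proposal, not the merged implementation: the function name is hypothetical, and treating whitespace-only lines as empty is an assumption the proposal does not spell out.

```shell
# Succeed when user-data on stdin contains at least one line that is not a
# "#" comment, not an SSH_HOST_*: line, and not blank.
# Assumption: whitespace-only lines count as empty.
has_nix_config() {
  grep -qEv '^(#|SSH_HOST_[A-Z0-9_]+:|[[:space:]]*$)'
}

# Hypothetical wiring; the target path and the rebuild step are left open
# in the discussion above:
#   if has_nix_config < ./user-data; then
#     cp ./user-data /etc/nixos/configuration.nix && nixos-rebuild switch
#   fi
```

Under this rule, user-data consisting only of NixOps host-key lines, commented or not, leaves the machine untouched.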

@copumpkin
Member Author

@edolstra I think that makes sense, thanks! Will update the PR in the next couple of days and report back if I run into any issues 😄

Did my reasoning for using a separate amazon-init.nix make sense? I don't feel nearly as strongly about that, so if you think it makes more sense to go straight to configuration.nix I'm happy to shove it in there.

@edolstra
Member

I have no strong feelings about amazon-init.nix either, but if the concern is clobbering the user's configuration file, you could just write configuration.nix only if it doesn't exist. So it would be created only on first boot, after which the user is free to edit it.
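
The "write only if it doesn't exist" idea is small enough to sketch directly. The helper name and its return convention are assumptions for illustration.

```shell
# Copy stdin to the target path only when the target does not exist yet.
# Returns 0 if the file was written, 1 if an existing file was left alone.
write_if_absent() {
  local target=$1
  if [ -e "$target" ]; then
    return 1
  fi
  cat > "$target"
}

# Example (path is illustrative only):
#   write_if_absent /etc/nixos/configuration.nix < ./user-data
```

On a second boot the file already exists, so any edits the user has made since the first boot are preserved.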

@cstrahan
Contributor

@copumpkin Are you still working on this? It sounds pretty slick. 😄

@copumpkin
Member Author

Yep, just haven't had a chance to finish it off yet!

@cstrahan
Contributor

@copumpkin I'm working on this right now.

@cstrahan
Contributor

A couple questions:

When rebooting and/or rerunning the fetch-ec2-data service, what should the behavior be?

  • We probably don't want to clobber the user's /root/.nix-channels. I think we should probably check if the file exists, and only write to it once.
  • I suppose overwriting /etc/nixos/amazon-init.nix shouldn't be a problem, though.
  • Do we want to ensure that the nixos-rebuild switch only runs once? If so, how should we implement that - should we just check the existence of a file? (e.g. /root/.ec2-init-finished)

@cstrahan
Contributor

Here's a thought: we could ensure the whole thing happens only once by putting a file in the AMI (e.g. /root/.need-ec2-init), and then we just need to check that the file exists before doing a one-time rebuild from user-data, deleting the file thereafter. Any downsides to that?
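
A marker-file guard of the kind suggested here could look like the following sketch. The `run_once` name is hypothetical, and deleting the marker only after the command succeeds (so a failed rebuild is retried on the next boot) is a design assumption, not something stated in the comment.

```shell
# Run a command only if the marker file exists, and delete the marker once
# the command has succeeded. Without the marker, do nothing and succeed.
# Usage: run_once <marker_file> <command> [args...]
run_once() {
  local marker=$1
  shift
  [ -e "$marker" ] || return 0
  "$@" && rm -f "$marker"
}

# Example (marker path taken from the comment above):
#   run_once /root/.need-ec2-init nixos-rebuild switch
```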

@copumpkin
Member Author

It already does only happen once! When it reconfigures itself from userdata, it removes the reconfiguration script from the current configuration (unless you explicitly add it back, of course).

@copumpkin
Member Author

Re: clobbering /root/.nix-channels, I figured it would be fine because at that point no user will have ever touched the system. Once they touch the system, it won't get clobbered again due to the behavior I just mentioned.

@cstrahan
Contributor

> It already does only happen once! When it reconfigures itself from userdata, it removes the reconfiguration script from the current configuration (unless you explicitly add it back, of course).

@copumpkin Ah, that makes sense - you could just overwrite the amazon-init.nix, of course :).

I see where you're copying over the configuration.nix, but I don't see where you're copying over the amazon-init.nix (which the former refers to), so I'm not really sure how this branch works, unless I'm missing something.

Here's what I have: master...cstrahan:nixos-userdata

@copumpkin
Member Author

The point is that the /etc/nixos/configuration.nix that I put inside the image isn't the configuration.nix that built the image. That's what I meant by "unstable". If you took the image and did nothing but nixos-rebuild switch on it, it would change.

So instead, I build the image from an "external" configuration.nix that specifies that I want the machine to reconfigure itself on startup. Then after it does that, it reconfigures itself to not reconfigure itself anymore (because presumably the user doesn't ask for it to keep reconfiguring itself, although that's certainly a possibility too)

It's a little subtle and there's a phasing thing going on that makes it tricky to think about, I think.

@cstrahan
Contributor

Gotcha, just clicked a moment ago :).

@cstrahan
Contributor

@copumpkin Ok, I think my branch should work - I'll give it a test tomorrow.

@cstrahan
Contributor

@copumpkin How do you build an AMI?

@copumpkin
Member Author

@cstrahan there's a script somewhere in the nixpkgs source tree, but I've been using https://gist.github.com/copumpkin/6df9c50630ed5fc5abb5 with a self-signed cert. There are a few different ways to build an AMI, but I'm using a fairly simple S3-based instance store one.

@cstrahan
Contributor

@copumpkin Thanks!

I don't know what else needs to change in our maintainer scripts and such to support building EBS-backed AMIs (which must be created from an existing instance), so I'll have to leave that to someone else.

/cc @edolstra

@copumpkin
Member Author

There are already scripts in nixpkgs to automate the creation of EBS-backed AMIs as well, near the one I mentioned earlier. All the stuff is in here.

@copumpkin
Member Author

So what's the status on this? Is there anything left to do? I'd like to have three VM tests for it (to make sure NixOps-style userdata works, to make sure new-style userdata works, and that no userdata works), but would also like to get it into the 15.06 release. @domenkozar what's the cutoff date for getting things in there?

@copumpkin
Member Author

@cstrahan any progress? you have push access to my repo so if you want to update this PR, please do so 😄 assuming @edolstra approves of the new changes, I think the last thing left to do is a set of VM tests, which I can implement once you push your changes.

@vcunat vcunat added 2.status: work-in-progress This PR isn't done and removed in progress labels May 5, 2015
@copumpkin copumpkin added this to the 15.06 milestone May 27, 2015
@copumpkin copumpkin force-pushed the nixos-userdata branch 2 times, most recently from 8f65a2d to bb25461 Compare June 7, 2015 23:33
@copumpkin
Member Author

This is almost ready to be used. It depends on #8204 and #8013 (which also depends on #8204).

I now have two VM tests, built on top of the 169.254.169.254 user-data simulation functionality I put together in #8013. One tests that we still work properly with the NixOps user-data format, and the other ensures that we process the new-style configuration properly.

The main remaining thing I'd like to do is not force the user-data to contain a channel marker. To do that, I need to pre-populate the channel from the "host" building the image, so that it matches the actual content of the image. I expect not specifying a channel to be the common case, because it means minimal downloads to rebuild the system. Not having this means that one of the two tests is fairly slow, since it has to download a ton of NixOS stuff from the internet to reconfigure itself.

There's still a lot of stuff I'd clean up but we're getting close. Looking forward to feedback from anyone interested.

@copumpkin copumpkin changed the title Configure NixOS from EC2 user-data [don't merge] Configure NixOS from EC2 user-data Jun 12, 2015
@copumpkin
Member Author

Since I have a VM test in place that makes sure that the NixOps-style userdata still works, I'm going to go ahead and merge this, treating the "configure from userdata" portion as a "beta". Beta in the sense that I'm still not sure what the best way to do it is, but I want to iterate on it a bit and it'll be easier to do once I have something concrete that's building and multiple people can experiment with.

copumpkin added a commit that referenced this pull request Jun 12, 2015
Configure NixOS from EC2 user-data (beta)
@copumpkin copumpkin merged commit 5e13ee7 into NixOS:master Jun 12, 2015
@nyarly
Contributor

nyarly commented Oct 9, 2015

@copumpkin - it looks like as of 15.09, this is available in <nixpkgs/nixos/modules/virtualisation/amazon-init.nix>, but the default AMIs don't have that in their stock configuration.nix. Am I reading correctly?

@paul-e-cooley

Any word on when we might get a 15.xx AMI with this fix? Or alternatively, is it safe to just copy the commit results to the module and then perform a rebuild to initialize the fix?

@copumpkin
Member Author

I'll probably finish fixing the VM test for it this weekend, and then I think Rob said he'd be willing to regenerate the official AMIs. The test is breaking because as part of the rebuild-from-config, it tries to pull from the Internet and fails on Hydra because it's in a sandbox. Open to ideas on fixing it!

On Fri, Jan 8, 2016 at 10:11, Paul E Cooley wrote:

> Any word on when we might get a 15.xx AMI with this fix? Or alternatively, is it safe to just copy the commit results to the module and then perform a rebuild to initialize the fix?

@paul-e-cooley

Thanks, @copumpkin. I haven't even had a chance to look at the Hydra stuff yet. But plan to soon.

@aycanirican
Member

I wonder if we have network connectivity in postBoot. I got this in my log (in reverse order) and my user-data is not written to /etc/nixos/configuration.nix

Mar 09 14:45:40 localhost stage-2-init: /nix/store/g9idnp7fzf06ir918wbhw6sz580i6gbw-nix-1.11.2/bin/nix-channel: unable to check ‘http://nixos.org/channels/nixos-unstable’
Mar 09 14:45:40 localhost stage-2-init: created 1 symlinks in user environment
Mar 09 14:45:40 localhost stage-2-init: attempting to fetch configuration from EC2 user data...

@nicknovitski
Contributor

I've observed the behavior @aycanirican describes, in 16.03. Would the script fit better in boot.initrd.network.postCommands?

@edolstra
Member

edolstra commented Oct 3, 2016

Yes, it's supposed to have network at that stage. In fact we even access the network in stage 1.

@nicknovitski
Contributor

Is it possible that the script needs something which gets set in $HOME/.nix-profile/etc/profile.d/nix.sh, but not in stage-2-init.sh? I've found year-old discussions saying that can cause a similar error.

endocrimes added a commit to endocrimes/nixpkgs that referenced this pull request Apr 11, 2020
This commit migrates the Nomad package from the 0.10.x line of releases
to 0.11.X.

This allows us to also bump the version of Go that is used to 1.14.x.
NOTE: 1.14.x will be needed for the rest of the 0.11.x releases as Nomad
only bumps patch versions of Go within a release series.

CHANGELOG:

FEATURES:

    Container Storage Interface [beta]: Nomad has expanded support
    of stateful workloads through support for CSI plugins.
    Exec UI: an in-browser terminal for connecting to running allocations.
    Audit Logging (Enterprise): Audit logging support for Nomad
    Enterprise.
    Scaling APIs: new scaling policy API and job scaling APIs to support external autoscalers
    Task Dependencies: introduces lifecycle stanza with prestart and sidecar hooks for tasks within a task group

BACKWARDS INCOMPATIBILITIES:

    driver/rkt: The Rkt driver is no longer packaged with Nomad and is instead
    distributed separately as a driver plugin. Further, the Rkt driver codebase
    is now in a separate
    repository.

IMPROVEMENTS:

    core: Optimized streaming RPCs made between Nomad agents [NixOSGH-7044]
    build: Updated to Go 1.14.1 [NixOSGH-7431]
    consul: Added support for configuring enable_tag_override on service stanzas. [NixOSGH-2057]
    client: Updated consul-template library to v0.24.1 - added support for working with consul connect. Deprecated vault_grace [NixOSGH-7170]
    driver/exec: Added no_pivot_root option for ramdisk use [NixOSGH-7149]
    jobspec: Added task environment interpolation to volume_mount [NixOSGH-7364]
    jobspec: Added support for a per-task restart policy [NixOSGH-7288]
    server: Added minimum quorum check to Autopilot with minQuorum option [NixOSGH-7171]
    connect: Added support for specifying Envoy expose path configurations [NixOSGH-7323] [NixOSGH-7396]
    connect: Added support for using Connect with TLS enabled Consul agents [NixOSGH-7602]

BUG FIXES:

    core: Fixed a bug where group network mode changes were not honored [NixOSGH-7414]
    core: Optimized and fixed a few bugs in underlying RPC handling [NixOSGH-7044] [NixOSGH-7045]
    api: Fixed a panic when canonicalizing a jobspec with an incorrect job type [NixOSGH-7207]
    api: Fixed a bug where calling the node GC or GcAlloc endpoints resulted in an error EOF return on successful requests [NixOSGH-5970]
    api: Fixed a bug where /client/allocations/... (e.g. allocation stats) requests may hang in special cases after a leader election [NixOSGH-7370]
    cli: Fixed a bug where nomad agent -dev fails on Windows [NixOSGH-7534]
    cli: Fixed a panic when displaying device plugins without stats [NixOSGH-7231]
    cli: Fixed a bug where alloc exec command in TLS environments may fail [NixOSGH-7274]
    client: Fixed a panic when running on Debian with an empty /etc/debian_version [NixOSGH-7350]
    client: Fixed a bug affecting network detection in environments that mimic the EC2 Metadata API [NixOSGH-7509]
    client: Fixed a bug where a multi-task allocation may be considered healthy despite a task restarting [NixOSGH-7383]
    consul: Fixed a bug where modified Consul service definitions would not be updated [NixOSGH-6459]
    connect: Fixed a bug where Connect enabled allocation would not stop after promotion [NixOSGH-7540]
    connect: Fixed a bug where restarting a client would prevent Connect enabled allocations from cleaning up properly [NixOSGH-7643]
    driver/docker: Fixed handling of seccomp security_opts option [NixOSGH-7554]
    driver/docker: Fixed a bug causing docker containers to use swap memory unexpectedly [NixOSGH-7550]
    scheduler: Fixed a bug where changes to task group shutdown_delay were not persisted or displayed in plan output [NixOSGH-7618]
    ui: Fixed handling of multi-byte unicode characters in allocation log view [NixOSGH-7470] [NixOSGH-7551]