Rework access management & deployment for the NixOS core infra #324

Closed
delroth opened this issue Jan 6, 2024 · 4 comments

@delroth
Contributor

delroth commented Jan 6, 2024

Please take this as an RFC: feel free to yell at anything that seems like a bad idea and/or suggest improvements.

Goals

  • Unify access management, currently very fragmented.
    • Some hosts are only accessible via nixops ssh from bastion, others only via SSH over Wireguard, and login usernames vary by host.
  • Unify deployment mechanisms.
    • Currently a mix of nixops and manual on-host nixos-rebuild.
  • Introduce proper secret management.
    • Currently a mix of unversioned files in a directory on bastion deployed via nixops, and manually deployed secret files on rhea and bastion.
  • Only change as much as needed to meet those goals. No scope creep.

Plan

  • More common Nix configuration for users and access management on core infra machines.
    • Get rid of the automagic NixOps SSH keys and grant root SSH access to all core infra SSH keys.
    • Potentially: switch to separate users with sudo access for SSH administrative access - it doesn't grant us much in terms of security, but it has some auditability benefits.
  • Manage secrets with sops-nix.
    • Out of caution, the most critical secrets will be kept out of scope for this (for now). Identified so far: the Hydra signing key. A full inventory will be necessary to identify any other critical secrets. Rule of thumb: if it can't be rotated, let's keep managing it "manually" for now.
    • This aligns with the infra choice already made for non-critical-infra.
  • Deprecate nixops completely.
    • We already don't use any of its advanced features, it's really just deploying configs from a flake.
    • We'll switch to a flake with one outputs.nixosConfigurations entry per machine, as well as outputs.colmena for colmena compatibility for remote deployment.
    • There are only 2 places where we list all machines in the nixops network, both for monitoring, and that is already somewhat broken because it doesn't include rhea. So until new use cases appear, let's stop relying on any cross-machine configuration and keep every nixosConfiguration independently evaluable.

Future improvements

  • Move channel scripts out of bastion and get rid of it.
    • $120/month savings, and once nixops is dead + secrets versioned, the use case for bastion kind of goes away (no more unversioned state required for a deployment tool).
    • It doesn't seem to me that channel scripts currently need to be colocated with the cache, but I haven't looked into it in much detail yet.
  • Figure out what we want to do with Wireguard.
    • Currently SSH is exposed outside of the Wireguard network for the core infra. I think it would make sense to require Wireguard access as an extra defense in depth layer, however this introduces an extra point of failure which could be annoying to recover from (requiring rebooting in single-user mode if some change breaks Wireguard).
@zimbatm
Member

zimbatm commented Jan 7, 2024

Sounds good overall.

I would even remove the bastion and wireguard. Make things simple. Take a step back. And then once you're comfortable, re-introduce appropriate security measures. If you have the NixOS firewall enabled, and password auth disabled on OpenSSH, things are already pretty secure.

@AmineChikhaoui
Member

Might be worth introducing Tailscale for access management and ssh. It would probably make it cleaner to handle ACLs and access control in general.

@delroth
Contributor Author

delroth commented Jan 28, 2024

I won't be making much progress on this for the next ~7 days, so current progress on nixops removal is dumped at master...delroth:nixos-org-configurations:remove-nixops if someone wants to move things forward in the meantime.

@delroth
Contributor Author

delroth commented Feb 9, 2024

I think we can call this fixed:

Unify access management, currently very fragmented.

infra-build can ssh root@{eris,haumea,rhea}.nixos.org

Unify deployment mechanisms.

Everything can now be done with a nixos-rebuild --flake. In the future we can add colmena support for convenience.

Introduce proper secret management.

delft/* now uses agenix. TBD: move non-critical-infra to agenix too, to align.
