Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nixos/google-compute-config: fix module #144761

Closed
wants to merge 3 commits into from
Closed

Conversation

picnoir
Copy link
Member

@picnoir picnoir commented Nov 5, 2021

Motivation for this change

We were relying on some google-provided python script to perform various actions on the GCE guests. These scripts have been decommissioned in favor of a Golang daemon: https://github.com/GoogleCloudPlatform/guest-agent. The #131761 bump broke the google-compute-config NixOS module: the Python script are not longer present in the $out/bin directory (more info about that in the upstream commit), hence breaking the NixOS module that was relying on those scripts.

Migrating to the new Golang daemon to perform what the scripts used to do. See more details in the individual commit messages.

This module is tightly coupled to GCE, it's sadly hard to write any VM test for it and prevent future similar breakages :( . We tested this PR manually on a GCE test deployment containing some oslogin and metadata scripts.

 google-guest-agent: init at 20211019.00

The guest agent is depending in some bash scripts contained in the
google-compute-guest-configs packages to setup the virtio multiqueue
support at runtime. We inject those through a wrapper.

The oslogin implementation of the guest agent won't work on NixOS as
it currently is. On top of that, it's enabled/disabled status
exclusively depends on the remote GCE deployment metadata, we can't
disable that through the conf file. We add a patch allowing us to
disable this feature via the conf file.
 nixos/google-compute-config: use the gce guest-agent

The python scripts we're using have been deprecated in favor of the
guest agent, see more infos at:
GoogleCloudPlatform/compute-image-packages@276e520#diff-267a2788071ee63df1443363c2ab882e7d321adc77ee53462a920229a962eabbL40

The guest agent onboard all the old python scripts as part of two new
go binaries: google_guest_agent and google_metadata_script_runner.
Both are using the same /etc/default-instance_configs.cfg
configuration file.

The configuration file we embed in this module mimicks the setup we
had with the python scripts. We'll have two services:

- The metadata script service: in charge of running the
  google compute startup/showdown scripts.
- The guest agent: in charge of syncing the guest clock with the host
  and setup the network interfaces on boot.
Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 21.11 Release Notes (or backporting 21.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

The guest agent is depending in some bash scripts contained in the
google-compute-guest-configs packages to setup the virtio multiqueue
support at runtime. We inject those through a wrapper.

The oslogin implementation of the guest agent won't work on NixOS as
it currently is. On top of that, it's enabled/disabled status
exclusively depends on the remote GCE deployment metadata, we can't
disable that through the conf file. We add a patch allowing us to
disable this feature via the conf file.

Co-authored-by: Mark Karpov <markkarpov92@gmail.com>
picnoir and others added 2 commits November 5, 2021 18:54
The python scripts we're using have been deprecated in favor of the
guest agent, see more infos at:
GoogleCloudPlatform/compute-image-packages@276e520#diff-267a2788071ee63df1443363c2ab882e7d321adc77ee53462a920229a962eabbL40

The guest agent onboard all the old python scripts as part of two new
go binaries: google_guest_agent and google_metadata_script_runner.
Both are using the same /etc/default-instance_configs.cfg
configuration file.

The configuration file we embed in this module mimicks the setup we
had with the python scripts. We'll have two services:

- The metadata script service: in charge of running the
  google compute startup/showdown scripts.
- The guest agent: in charge of syncing the guest clock with the host
  and setup the network interfaces on boot.

Co-authored-by: Mark Karpov <markkarpov92@gmail.com>
@picnoir
Copy link
Member Author

picnoir commented Nov 5, 2021

. This is the default recommended style which should be used in most cases

Could you please point me to this Nixpkgs "recommended style" guidelines?

I don't see any reason to not use it in this package.

https://discourse.nixos.org/t/rec-env-allocation/15926 . Now you do.

Quoting literal paths does nothing, so it can also be removed

Meaning I'd have to edit that file again and rebase the PR. Meaning spend time on stuff you admit yourself would not lead to any qualitative improvement.


I don't understand what you're trying to do here by nitpicking details. This PR makes some technical changes to this module, we did not even discussed yet whether or not this is the way to go for the google-compute-config NixOS module.

What's the point of nitpicking style details when the core of the PR has not been discussed yet? What are you trying to achieve here?

I think reading https://sage.thesharps.us/2014/09/01/the-gentle-art-of-patch-review/ might help.

@picnoir
Copy link
Member Author

picnoir commented Nov 5, 2021

Oh, btw, for the record: the quoting part you're not happy with is actually not coming from me, it's already in Nixpkgs, in one of the derivation packaging the now partly legacy google-compute-engine :

Copy link
Contributor

@flokli flokli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Thanks for doing that digging!

Especially considering how hard this is to test automatically, and to avoid introducing regressions without noticing, we should try to send this upstream.

From f318f525149b06365e9f0c7f642f09f401c3e1c1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?F=C3=A9lix=20Baylac-Jacqu=C3=A9?= <felix@alternativebit.fr>
Date: Fri, 5 Nov 2021 15:40:13 +0100
Subject: [PATCH] oslogin: add disable toggle in conf file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please reword this patch - we should send this upstream, so it should be reworded a bit (to explain this is about NixOS, which doesn't want to use the imperative install/uninstall bits) - and maybe add some more comments to their code, so it's clear why the config boolean is read.

In another context, upstream seems to have been open to accept patches, as long as it doesn't impair other users - so a config option disabling the "installation" phases in their daemon might be accepted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NinjaTrappeur are you fine with me trying to upstream this patch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 sure!

@@ -76,7 +76,8 @@ func (o *osloginMgr) timeout() bool {
}

func (o *osloginMgr) disabled(os string) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As explained above, needs to have a comment on why we disable this.

sha256 = "sha256-DHpv6RIXV3Rn/8fuGUWS6nzsEqgyY5+zozsundX9ByM=";
};

patches = [ ./0001-oslogin-add-disable-toggle-in-conf-file.patch ];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be sent upstream and fetchpatch'ed from there.

@@ -76,7 +76,8 @@ func (o *osloginMgr) timeout() bool {
}

func (o *osloginMgr) disabled(os string) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As explained above, needs to have a comment on why we disable this.

sha256 = "sha256-DHpv6RIXV3Rn/8fuGUWS6nzsEqgyY5+zozsundX9ByM=";
};

patches = [ ./0001-oslogin-add-disable-toggle-in-conf-file.patch ];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be sent upstream and fetchpatch'ed from there.

@Artturin
Copy link
Member

Artturin commented Dec 1, 2021

should we downgrade the package for now?

@PaulGrandperrin
Copy link
Contributor

I have a small question/suggestion related to this work:

On this line:

<nixpkgs/nixos/modules/virtualisation/google-compute-image.nix>
, shouldn't it be google-compute-config.nix instead of google-compute-image.nix ?

I'm new to Nix, so maybe I don't understand something :)

@PaulGrandperrin
Copy link
Contributor

PaulGrandperrin commented Dec 1, 2021

Also, on GCE, I installed Nixos manually with a ZFS root FS. This prevents me from importing directly google-compute-config.nix as the root FS is hard-coded to be ext4.

Could there be a simple solution to this issue?

For now, I just copied the file in my /etc/nixos to adapt it (also to use systemd-boot with EFI).

@picnoir
Copy link
Member Author

picnoir commented Dec 2, 2021

should we downgrade the package for now?

We have two options for a quick fix:

  1. Downgrade google-compute-guest-configs. Pro: easy, Con: the downgrade could backfire and break some downstream deployments. Ping @cpcloud since you bumped that one.
  2. Merge this PR as is. Pro: fixes the module, Con: procrastination being procrastination, the patch may end up never be sent upstream.

I'm unlikely to have the required bandwidth to properly push the patch upstream for the next 2 months.

@flokli
Copy link
Contributor

flokli commented Dec 2, 2021

Then let's downgrade. I don't want to discredit the work being done in this PR, but we shouldn't merge it in halfway-done - I'd then rather revert the bump to fix things (and backport it to 21.11, too).

@cpcloud
Copy link
Contributor

cpcloud commented Dec 2, 2021

Reverting is fine by me.

talyz added a commit to talyz/nixpkgs that referenced this pull request Dec 3, 2021
This reverts commit 1748291.

This upgrade broke the google-compute-config module. See
NixOS#144761 for more info.
github-actions bot pushed a commit that referenced this pull request Dec 6, 2021
This reverts commit 1748291.

This upgrade broke the google-compute-config module. See
#144761 for more info.

(cherry picked from commit 5a4ea49)
@flokli flokli mentioned this pull request Jan 8, 2022
13 tasks
environment.etc."sysctl.d/11-gce-network-security.conf".source = "${gce}/sysctl.d/11-gce-network-security.conf";

environment.etc."default/instance_configs.cfg".text = ''
[Accounts]
Copy link
Member

@abbradar abbradar Jan 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NinjaTrappeur can you compare your configuration here to the one from my PR and tell which other options are a good idea to use? Or to paraphase: which parts of default configuration did you change?

@picnoir
Copy link
Member Author

picnoir commented Jan 13, 2022

Closing in favor of #150837

@picnoir picnoir closed this Jan 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants