nixos/slurm: rewrite module to RFC 0042 style settings #161815

Open
wants to merge 4 commits into
base: master

markuskowa
Member

@markuskowa markuskowa commented Feb 25, 2022

Motivation for this change

Use free-form settings following RFC 0042 to generate the config files. This should make the module more flexible, but it is also a breaking change that requires users to adapt their configuration. Either way, it should make the module easier to maintain.
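
As a rough illustration of the RFC 0042 pattern mentioned above, a module can expose one free-form `settings` option and render it with a format from `pkgs.formats`. This is a minimal sketch, not the code in this PR: the option namespace `services.exampleSlurm` is hypothetical, and `pkgs.formats.keyValue` is an assumed stand-in for whatever generator the module actually uses.

```nix
{ config, lib, pkgs, ... }:

let
  # Hypothetical namespace, chosen so the sketch does not clash with the real module.
  cfg = config.services.exampleSlurm;

  # Assumption: a plain "Key=Value" generator is good enough for illustration.
  settingsFormat = pkgs.formats.keyValue { };
in
{
  options.services.exampleSlurm.settings = lib.mkOption {
    type = lib.types.submodule {
      # Accept arbitrary keys, so new Slurm parameters need no module change.
      freeformType = settingsFormat.type;
    };
    default = { };
    description = "Free-form settings rendered into slurm.conf.";
  };

  # Render the attribute set into the config file.
  config.environment.etc."slurm/slurm.conf".source =
    settingsFormat.generate "slurm.conf" cfg.settings;
}
```

With this shape, a user would write `services.exampleSlurm.settings.SlurmctldHost = "control";` and the key would be emitted as `SlurmctldHost=control`.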

See also #144575

Things done
  • Added a note to the release notes
  • Rewrote the module to use settings for slurm.conf, slurmdbd.conf, and cgroup.conf
  • Adapted the slurm test.
  • Removed an older "removed options" message
  • Removed the Slurm tools wrapper, which fed SLURM_CONF to all binaries. It is now exported directly as a system-wide environment variable.
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 22.05 Release Notes (or backporting 21.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
    • (Release notes changes) Ran nixos/doc/manual/md-to-db.sh to update generated release notes
  • Fits CONTRIBUTING.md.
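
The wrapper removal listed under "Things done" can be pictured as follows. This is a sketch under assumptions: `configFile` is a placeholder written with `pkgs.writeText`, and `environment.variables` is one standard NixOS way to export a system-wide variable; the PR may use a different mechanism.

```nix
{ pkgs, ... }:
let
  # Placeholder config file; the real module generates this from its settings.
  configFile = pkgs.writeText "slurm.conf" ''
    SlurmctldHost=control
  '';
in
{
  # Slurm tools read SLURM_CONF to locate their configuration,
  # so no per-binary wrapper is needed.
  environment.variables.SLURM_CONF = "${configFile}";
}
```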

@markuskowa
Member Author

markuskowa commented Feb 25, 2022

@GrahamcOfBorg test slurm

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/breaking-changes-announcement-for-unstable/17574/8

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/prs-ready-for-review/3032/758

@markuskowa
Member Author

Rebased for 22.11 release.

@markuskowa markuskowa marked this pull request as draft August 24, 2022 10:55
@RaitoBezarius
Member

So sorry this did not get any attention. Can you fix the conflicts? I'm interested in getting this merged, though I do not have extensive experience with Slurm.

@markuskowa
Member Author

@RaitoBezarius There seems to be a bug that causes an infinite recursion, which still needs to be fixed. However, due to the lack of feedback, I have not prioritized working on it lately. It would also be good to get some feedback on the "settings" conversion.
I will pick this up again, and hopefully we can include it in the 23.05 release.

Member

@RaitoBezarius RaitoBezarius left a comment

I would advise keeping the legacy options for a release cycle, while deprecating them.
Then, we can proceed to remove them.

I don't know how many NixOS users of Slurm are out there, but a hard break would be annoying for those who do not read the release notes.

nixos/doc/manual/release-notes/rl-2211.section.md (outdated, resolved)
'';
};

controlMachine = mkOption {
Member

IMHO all the legacy options should be kept for at least one release cycle, with a deprecation warning.
After that, we can mkRemovedOptionModule all of them.

Member Author

@markuskowa markuskowa Dec 25, 2022

I would prefer to make a hard change here, for the following reason: it may lead to confusion when both legacy options and settings are used. Which one should take effect if both affect one and the same setting? Trying to handle both would make the module messier. Converting an existing config file should be rather straightforward.

We could use mkRenamedOptionModule where applicable and mkRemovedOptionModule for the rest?
EDIT: I added this version now to the module to ease the transition.
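
A minimal sketch of what such a transition layer might look like, using the `lib.mkRenamedOptionModule` and `lib.mkRemovedOptionModule` helpers. The option names `controlMachine`, `procTrackType`, `SlurmctldHost`, and `ProctrackType` appear in this PR's diff context, but the specific mappings below are assumptions made for illustration, not the PR's actual list.

```nix
{ lib, ... }:
{
  imports = [
    # Assumed mapping: forward the old option to the matching free-form key.
    (lib.mkRenamedOptionModule
      [ "services" "slurm" "controlMachine" ]
      [ "services" "slurm" "settings" "SlurmctldHost" ])

    # Options handled this way fail evaluation with a hint when they are set.
    (lib.mkRemovedOptionModule
      [ "services" "slurm" "procTrackType" ]
      "Set services.slurm.settings.ProctrackType instead.")
  ];
}
```

A renamed option keeps old configurations evaluating (with a warning), while a removed option stops evaluation and prints the given message, which covers the case where no one-to-one replacement exists.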

nixos/modules/services/computing/slurm/slurm.nix (three outdated, resolved threads)
@markuskowa
Member Author

@posch Are you using slurm on NixOS via the module?

@markuskowa markuskowa marked this pull request as ready for review December 25, 2022 15:44
@posch
Contributor

posch commented Dec 25, 2022

> @posch Are you using slurm on NixOS via the module?

No, I'm not. I'm using a stand-alone slurm.conf. My NixOS config basically contains only the systemd services:

{ pkgs, ... }:
let
        # Paths are elided in the original comment.
        mungeKeyFile = "/...";
        slurmConfigFile = "/.../slurm.conf";
        slurmStateSaveLocation = "/var/spool/slurm/ctld";
in
{
        users.groups.munge = {};
        users.users.munge = {
                isSystemUser = true;
                group = "munge";
        };

        systemd.services.munge = {
                wantedBy = ["multi-user.target"];
                after = ["network.target" "time-sync.target"];
                serviceConfig = {
                        Type = "simple";
                        ExecStart = "${pkgs.munge}/bin/munged -F --key-file ${mungeKeyFile}";
                        # The "+" prefix runs chown with full privileges before munged starts.
                        ExecStartPre = [
                                "+${pkgs.coreutils}/bin/chown munge:munge ${mungeKeyFile}"
                        ];
                        PIDFile = "/run/munged.pid";
                        User = "munge";
                        Group = "munge";
                        Restart = "on-abort";
                        StateDirectory = "munge";
                        StateDirectoryMode = "0711";
                        RuntimeDirectory = "munge";
                };
        };

        systemd.services.slurmctld = {
                wantedBy = [ "multi-user.target" ];
                after = [ "network.target" "munge.service" "nslcd.service" ];
                requires = [ "munge.service" ];
                serviceConfig = {
                        Type = "simple";
                        ExecStart = "/ngt/slurm/bin/slurmctld -D -f ${slurmConfigFile}";
                        PIDFile = "/run/slurmctld.pid";
                        ExecReload = "${pkgs.coreutils}/bin/kill -HUP $MAINPID";
                        LimitNOFILE = 65536;
                };
                preStart = ''
                        mkdir -p ${slurmStateSaveLocation}
                '';

        };

        systemd.services.nslcd.after = [ "network.target" ];
}

example = literalExpression ''
settings = {
SlurmctldHost = "control";
nodeName = {
Contributor

I understand that Nix attribute sets seem to be the obvious choice for Node lines, but it looks a bit overengineered to me. I usually copy+paste Node lines from the output of slurmd -C:

# /.../slurmd -C
NodeName=... CPUs=192 Boards=1 SocketsPerBoard=4 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=772645

Is there a concrete advantage of having this split up and rewritten into an attribute set?

Member Author

The advantages I see are: (1) You can now access the parameters of a node or a partition directly and use them elsewhere in your config (e.g. CPUs or memory). (2) The attributes can be merged when they are defined in separate locations. This is otherwise only possible if you define nodeName and partitionName separately (nodeSet now exists as well). (3) It makes the module easier to maintain, and the module can handle new features automatically without being modified.

It may look overengineered at first glance, but I think it is more consistent to go all the way here.
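
To make points (1) and (2) concrete, here is a hedged sketch of a user configuration. It assumes the option layout proposed in this PR, with nodes keyed by name below `settings.nodeName`; the node name `compute` and the reuse in `nix.settings.cores` are hypothetical choices for illustration, and the numbers are taken from the `slurmd -C` output quoted above.

```nix
{ config, ... }:
let
  # Assumption: nodes are keyed by name below `settings.nodeName`.
  node = config.services.slurm.settings.nodeName.compute;
in
{
  services.slurm.settings = {
    SlurmctldHost = "control";
    nodeName.compute = {
      CPUs = 192;
      RealMemory = 772645;
    };
  };

  # (1) Reuse a node parameter elsewhere in the system configuration.
  nix.settings.cores = node.CPUs;
}
```

For (2), a second module could add `services.slurm.settings.nodeName.compute.ThreadsPerCore = 2;` and the module system would merge it into the same node definition, which a pasted `NodeName=...` string cannot offer.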

dbdserver.storagePass and .configFile were already removed in NixOS-21.03.
Add moved and removed options.
@markuskowa
Member Author

@infinisil @RaitoBezarius Are we good to go, or do you have other comments?

ProctrackType=${cfg.procTrackType}
${cfg.extraConfig}
'';
configFile = settingsFormat.generate "slurm.conf" cfg.settings;
Member

To ease the transition and offer an escape hatch (I know of one complicated NixOS module where this would have been great to have), I wish this were an actual option that the user can override. They could then bypass settings and write the configuration file themselves, with the file generated from settings as the default.

In addition, an assertion could prevent both settings and a config file from being specified at the same time.

Once this is done, I think it would be fine to merge it. :)
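
One possible shape for this escape hatch, sketched under assumptions: the namespace `services.exampleSlurm` is hypothetical, `pkgs.formats.keyValue` stands in for the module's real generator, and the override is modelled as a nullable `configFile` option rather than as an option whose default is the generated file.

```nix
{ config, lib, pkgs, ... }:
let
  cfg = config.services.exampleSlurm;
  settingsFormat = pkgs.formats.keyValue { };

  # Prefer the user-supplied file; otherwise render `settings`.
  effectiveConfigFile =
    if cfg.configFile != null
    then cfg.configFile
    else settingsFormat.generate "slurm.conf" cfg.settings;
in
{
  options.services.exampleSlurm = {
    settings = lib.mkOption {
      type = settingsFormat.type;
      default = { };
      description = "Free-form settings rendered into slurm.conf.";
    };

    configFile = lib.mkOption {
      type = lib.types.nullOr lib.types.path;
      default = null;
      description = "Complete slurm.conf that bypasses `settings`.";
    };
  };

  config = {
    # Refuse configurations that specify both sources at once.
    assertions = [
      {
        assertion = cfg.configFile == null || cfg.settings == { };
        message = "Set either services.exampleSlurm.settings or configFile, not both.";
      }
    ];

    environment.etc."slurm/slurm.conf".source = effectiveConfigFile;
  };
}
```

The nullable option keeps the assertion simple, at the cost of not exposing the generated file as the option's default value.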

Member

I appreciate the change, but it is not what I suggested: adding an extraConfig is not the same as adding an escape hatch that replaces the whole configFile. IMHO the latter is an acceptable alternative to RFC 42, unlike a stringly-typed extraConfig, because the user can plug in their own generator.

Member Author

Alright. I put the extraConfig back in. Its content now gets appended. You can now override settings completely by just setting it to an empty set.

Member

Thank you a lot for your patience!

Member Author

OK, I think I see what you mean now. Maybe the current state is a good solution? Using an alternative configFile adds more complexity, as we would need to define StateSaveLocation and SlurmdSpoolDir separately, since they are needed by the service.

Member

I think it's more or less fine. The only problem with the current solution is that it's not clear that all the settings can be sidestepped completely, but I think it's fine for now.

@RaitoBezarius
Member

@ofborg test slurm

@wegank wegank added 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md 2.status: merge conflict labels Mar 19, 2024
@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Mar 20, 2024
@wegank wegank added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jul 4, 2024

6 participants