
[WIP] HPC improvements #79771

Closed · wants to merge 13 commits
Conversation

@ikervagyok (Contributor) commented Feb 10, 2020

Motivation for this change

I want to convince my colleagues that NixOS/NixOps is suitable for a test-cluster setup consisting of a few AArch64 nodes. LAMMPS is the package I chose to give my test users, which is why I converted it to CMake and added a few extra LAMMPS packages.

Things done

Updated the relevant packages and added libfabric, pmix, and ucx. Tested with 2 nodes; I will go up to 4 tomorrow and maybe add some x86 nodes in the next few days.

@markuskowa, since you seem to be at the center of NixOS-HPC, what do you think about these changes? What should be made optional or reworked? I know it's not a polished PR, but I think it can be made into something usable.

  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.
@costrouc (Contributor) commented Feb 11, 2020

Very happy to see that lammps finally has a standard build procedure with cmake!

@markuskowa (Member) left a comment

Hi @ikervagyok. Thank you for the contribution. It's nice to see that there is more and more HPC activity with Nix. As a general comment: this is a big PR that addresses several different issues.
It may be better to split it up into several PRs (e.g. lammps, openmpi, slurm). It would also simplify the review if you could briefly comment on how you have tested the additional features.
In my experience, it is one thing to get things to compile but a completely separate thing to make them work. There is usually some fine tuning involved. I test drive new packages in my own overlay before I upstream them (feel free to copy from the overlay what you need).

@@ -355,6 +355,7 @@ in
ExecStart = "${wrappedSlurm}/bin/slurmd";
PIDFile = "/run/slurmd.pid";
ExecReload = "${pkgs.coreutils}/bin/kill -HUP $MAINPID";
LimitMEMLOCK = "infinity";

@markuskowa (Member) · Feb 11, 2020

Why is this necessary?

@ikervagyok (Author, Contributor) · Feb 11, 2020

This fixed the issue of slurm not being able to use mpirun, while it worked in a shell. I'll try again to see if I've messed up my testing.
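If the module change turns out to be unnecessary for everyone, the same effect can be obtained per-deployment instead of patching the NixOS module. A minimal sketch, assuming the standard `slurmd` systemd service name from the NixOS Slurm module (this override is an illustration, not part of the PR's diff):

```nix
{
  # Raise the locked-memory limit for slurmd only on this machine,
  # without modifying nixpkgs. LimitMEMLOCK maps to systemd's
  # serviceConfig option of the same name.
  systemd.services.slurmd.serviceConfig.LimitMEMLOCK = "infinity";
}
```

A low locked-memory limit is a common cause of RDMA/Infiniband registration failures under a daemon that work fine in an interactive shell (where `ulimit -l` is often already unlimited), which would be consistent with the symptom described above.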

pkgs/development/libraries/openmpi/default.nix (two review threads, outdated; resolved)
configureFlags = with stdenv; [
"--with-cma"
"--with-hwloc=${hwloc.dev}"
"--with-ofi=${libfabric}"

@markuskowa (Member) · Feb 11, 2020

What benefit does libfabric bring (I am not familiar with libfabric)?

@ikervagyok (Author, Contributor) · Feb 11, 2020

I think OmniPath needs libfabric. I can't try it, though, since I have no OmniPath machine in my test setup.

@markuskowa (Member) · Feb 11, 2020

I would prefer to remove it then.

@ikervagyok (Author, Contributor) · Feb 11, 2020

As a dependency, or the whole package?

@markuskowa (Member) · Feb 12, 2020

I would leave the whole package out for now, since libfabric is not even built with OmniPath support in the present version. We can add it later, when it becomes relevant.

"--with-cma"
"--with-hwloc=${hwloc.dev}"
"--with-ofi=${libfabric}"
"--with-pmi=${pmix}"

@markuskowa (Member) · Feb 11, 2020

I have tried to build with pmix in the past but could not get it to work properly with slurm. Did you check whether you can run openmpi binaries with srun instead of mpirun?
In general I would say pmix is not strictly necessary, as openmpi already works nicely with slurm and detects the resource allocations properly. From how I understand pmix, it is designed for large MPI jobs (>100 nodes).

@ikervagyok (Author, Contributor) · Feb 11, 2020

You're right. I thought it worked, but I probably mistakenly tested mpirun multiple times. I will look into this.
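A quick way to make the mpirun-vs-srun distinction explicit is to launch the same binary through both paths. A hedged sketch, assuming a Slurm cluster whose OpenMPI was built `--with-pmix` and a compiled MPI test binary at the hypothetical path `./mpi_hello` (these commands only make sense on a live cluster, so they are not runnable standalone):

```
# Launch via OpenMPI's own launcher (does not exercise Slurm's PMIx plugin):
mpirun -np 4 ./mpi_hello

# Launch via srun with the PMIx plugin; this only succeeds if both Slurm
# and OpenMPI were built against a compatible PMIx:
srun -n 4 --mpi=pmix ./mpi_hello

# Show which MPI plugin types this Slurm build actually provides:
srun --mpi=list
```

If `pmix` does not appear in the `--mpi=list` output, the `srun` launch path was never being tested, which would explain mistakenly re-testing `mpirun`.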

"--with-pmi=${pmix}"
"--with-pmix=${pmix} --enable-install-libpmix"
"--with-verbs=${rdma-core}"
"--with-ucx=${ucx}"

@markuskowa (Member) · Feb 11, 2020

We (@mekeetsa) are testing UCX with infiniband at the moment (see also https://github.com/markuskowa/NixOS-QChem/blob/ucx/ucx/default.nix). Did you test it with an Infiniband configuration?

@ikervagyok (Author, Contributor) · Feb 11, 2020

Yes, I've tested it and it seems to work (setting up the environment with OMPI_MCA_btl=^openib,uct,tcp UCX_TLS=ib,sm,self).
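Spelled out, the environment described above looks roughly like the following sketch. The variable values are taken verbatim from the comment; `./mpi_hello` is a hypothetical test binary, and the whole thing assumes an Infiniband node with a UCX-enabled OpenMPI (so it is cluster-specific, not a runnable standalone script):

```
# Exclude the legacy openib BTL plus the uct and tcp BTLs, forcing
# traffic through the UCX PML instead ('^' negates the list in OMPI's
# MCA syntax):
export OMPI_MCA_btl='^openib,uct,tcp'

# Restrict UCX to Infiniband transports, shared memory, and loopback:
export UCX_TLS='ib,sm,self'

mpirun -np 2 ./mpi_hello
```

Pinning `UCX_TLS` like this is also a useful debugging aid: if the job aborts with these settings, UCX could not bring up an Infiniband transport at all, rather than silently falling back to TCP.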

pkgs/development/libraries/openmpi/default.nix (two review threads, outdated; resolved)
pkgs/servers/computing/slurm/default.nix (review thread outdated; resolved)
@ofborg ofborg bot requested a review from markuskowa Feb 11, 2020
@@ -29,6 +30,12 @@ in stdenv.mkDerivation rec {
sha256 = "0ms0zvyxyy3pnx9qwib6zaljyp2b3ixny64xvq3czv3jpr8zf2wh";
};

# TODO: Remove these, when a new release gets out
patches = [
(fetchpatch { name = "ucx-1.8-compat-1"; url = "https://github.com/open-mpi/ompi/commit/526775dfd7ad75c308532784de4fb3ffed25458f.patch"; sha256 = "1zfr7ym4303dsnnjqfc1qbnfs0izd8bazxfl57p3dhqsiznd734r"; })

@markuskowa (Member) · Feb 12, 2020

I am still confused about why there are patches for UCX 1.8 when we are building against UCX 1.7.

@ikervagyok (Author, Contributor) · Feb 12, 2020

Take a look at the two commit messages:

open-mpi/ompi@526775d
open-mpi/ompi@a3026c0

The first one adds support and the second relaxes the version bound from version > 1.7 to version >= 1.7. They are named that way to make it clear that they can be removed together.

@markuskowa (Member) · Feb 15, 2020

The way I understand the patches, the first one fixes a problem with ucx/master (correct me if I am wrong). Since we are not building against ucx 1.8 (or current master), that patch is not needed right now. We should remove it for now.

@markuskowa (Member) commented Jul 2, 2020

Can we close this PR? Most of the improvements have now been addressed in other PRs.

@ikervagyok ikervagyok closed this Jul 4, 2020
3 participants