A Nix configuration of a tiny SLURM-based HPC cluster.
Originally, this was a draft for an HPC cluster shared between a few researchers. Due to some decisions, it was never used, so I stripped the identifying information and released it publicly as a source of inspiration. I'll likely continue to work on it as a testing playground.
Note that the intended hardware is only listed for reference, as it influences the SLURM configuration.
- `server01`, `server02`
  - Intended for CPU-heavy computations
- `server03`
  - Intended for CUDA computations
- `server04`
  - Lower-end administration node
  - Authentication source
  - SLURM controller
  - SSH entrypoint
  - Physical location for `~`
All servers are reachable via an external interface/address, while additionally sharing a static network on a separate, internal interface.
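As a rough illustration, such an internal network could look like this in NixOS; the interface name `enp2s0`, the subnet `10.0.0.0/24`, and the per-host addresses are placeholders, not values from the original setup:

```nix
{ ... }:
{
  # Static address on the internal interface (one such block per server,
  # with the address adjusted).
  networking.interfaces.enp2s0.ipv4.addresses = [
    { address = "10.0.0.4"; prefixLength = 24; }
  ];

  # Make the cluster nodes resolvable on the internal network.
  networking.hosts = {
    "10.0.0.1" = [ "server01" ];
    "10.0.0.2" = [ "server02" ];
    "10.0.0.3" = [ "server03" ];
    "10.0.0.4" = [ "server04" ];
  };
}
```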
An overview of the components follows; minimal configuration sketches for the major ones come after the list.

- Deployment with `nixinate`
- Secret management with `sops` and `sops-nix`
- Automated testing
- Cluster-wide user management with `kanidm`
  - Testing
- SLURM for workload management
  - `munged`
    - Testing mutual authentication
  - Controller
    - `slurmdbd`
    - `slurmctld`
    - `mariadb`
  - Nodes
    - `slurmd`
  - Syncing of SLURM accounts with Unix groups
  - Testing the execution of simple commands
- Shared file system
  - NFS
    - Testing mutual visibility of files
  - Ceph (maybe experiment with it)
- Testing the assumed network setup
- Monitoring and alerting
  - Grafana
  - Prometheus
    - Node Exporter
    - SLURM Exporter
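Configuration sketches for the components above follow. Deployment first: `nixinate` exposes every `nixosConfiguration` of a flake as a deployment app. A minimal sketch, with host name and SSH user as placeholders:

```nix
{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
  inputs.nixinate.url = "github:matthewcroughan/nixinate";

  outputs = { self, nixpkgs, nixinate }: {
    apps = nixinate.nixinate.x86_64-linux self;

    nixosConfigurations.server04 = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [
        ./hosts/server04/configuration.nix
        {
          # Deployment metadata read by nixinate.
          _module.args.nixinate = {
            host = "server04.example.org"; # external address (placeholder)
            sshUser = "admin";             # placeholder user
            buildOn = "remote";            # build on the target itself
          };
        }
      ];
    };
  };
}
```

`nix run .#apps.nixinate.server04` then builds and activates the configuration on the target.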
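Secret management: `sops` encrypts the secrets inside the repository, and `sops-nix` decrypts them on the target at activation time. A sketch, assuming the sops-nix NixOS module is imported via the flake input (`sops-nix.nixosModules.sops`) and using an invented secret name:

```nix
{ ... }:
{
  # Encrypted store of all secrets, committed to the repository.
  sops.defaultSopsFile = ./secrets/secrets.yaml;

  # Per-host age key used for decryption on the target.
  sops.age.keyFile = "/var/lib/sops-nix/key.txt";

  # Decrypted at activation to a file readable only by the munge user;
  # the resulting path is config.sops.secrets."munge-key".path.
  sops.secrets."munge-key" = {
    owner = "munge";
    group = "munge";
  };
}
```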
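User management: NixOS ships a `kanidm` module covering both the server side (on `server04`, the authentication source) and the PAM/NSS client side needed on every node. Domain, addresses, and the login group below are placeholders:

```nix
{ ... }:
{
  # server04 only: the kanidm server.
  services.kanidm = {
    enableServer = true;
    serverSettings = {
      domain = "idm.example.org";
      origin = "https://idm.example.org";
      bindaddress = "10.0.0.4:8443";
      tls_chain = "/var/lib/kanidm/chain.pem";
      tls_key = "/var/lib/kanidm/key.pem";
    };
  };
}
```

```nix
{ ... }:
{
  # All nodes: resolve users/groups via kanidm and allow PAM logins
  # for members of an (assumed) cluster-users group.
  services.kanidm = {
    enableClient = true;
    enablePam = true;
    clientSettings.uri = "https://idm.example.org";
    unixSettings.pam_allowed_login_groups = [ "cluster-users" ];
  };
}
```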
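SLURM's daemons mutually authenticate via `munged`, which requires the same key on every node; this is where the sops-managed secret from the sketch above comes in (the secret name is the same assumption as before):

```nix
{ config, ... }:
{
  # Runs munged on every node with the cluster-wide shared key,
  # delivered by sops-nix.
  services.munge.enable = true;
  services.munge.password = config.sops.secrets."munge-key".path;
}
```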
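The SLURM layer itself: `server04` runs `slurmctld`, `slurmdbd`, and the MariaDB instance backing the accounting database, while the compute nodes only run `slurmd`. CPU counts and memory sizes below are invented; the real values would mirror the hardware list above:

```nix
{ pkgs, ... }:
{
  # server04: controller and accounting.
  services.slurm = {
    server.enable = true;      # slurmctld
    dbdserver.enable = true;   # slurmdbd, backed by the MariaDB below
    controlMachine = "server04";
    clusterName = "cluster";
    nodeName = [
      "server01 CPUs=32 RealMemory=64000 State=UNKNOWN"
      "server02 CPUs=32 RealMemory=64000 State=UNKNOWN"
      "server03 CPUs=16 RealMemory=64000 State=UNKNOWN"
    ];
    partitionName = [
      "cpu Nodes=server01,server02 Default=YES MaxTime=INFINITE State=UP"
      "gpu Nodes=server03 Default=NO MaxTime=INFINITE State=UP"
    ];
  };

  # Accounting storage for slurmdbd.
  services.mysql = {
    enable = true;
    package = pkgs.mariadb;
  };
}
```

```nix
{ ... }:
{
  # Compute nodes: slurmd only. The controlMachine/nodeName/partitionName
  # settings must match the controller's, so they belong in a shared module.
  services.slurm.client.enable = true;
}
```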
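Shared file system: `server04` exports the home directories (the physical location of `~` from the hardware list) over NFS to the internal network, and the other nodes mount them, so jobs see the same files everywhere. The export path and subnet are assumptions:

```nix
{ ... }:
{
  # server04: export /home to the internal network.
  services.nfs.server.enable = true;
  services.nfs.server.exports = ''
    /home 10.0.0.0/24(rw,no_subtree_check)
  '';
}
```

```nix
{ ... }:
{
  # Other nodes: mount the shared home directories.
  fileSystems."/home" = {
    device = "server04:/home";
    fsType = "nfs";
  };
}
```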
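Monitoring: Prometheus on `server04` scrapes the node exporter on every machine, with Grafana on top for dashboards. Ports are the exporter defaults; a SLURM exporter is not among the exporters bundled with NixOS, so it would have to be packaged and scraped separately:

```nix
{ ... }:
{
  # Every node: expose host metrics on the default port 9100.
  services.prometheus.exporters.node.enable = true;

  # server04 only: scrape all nodes.
  services.prometheus = {
    enable = true;
    scrapeConfigs = [{
      job_name = "node";
      static_configs = [{
        targets = [
          "server01:9100" "server02:9100" "server03:9100" "server04:9100"
        ];
      }];
    }];
  };

  # server04 only: dashboards.
  services.grafana.enable = true;
}
```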