
HPC Deployment

LiranOG edited this page May 9, 2026 · 8 revisions


GRANITE v0.6.8 | ← Gravitational Wave Extraction | Initial Data →


1. Pre-Flight Checklist

# Always run before any simulation
python3 scripts/health_check.py

Verifies: Release build flags (-O3 -march=native), OMP thread count, available RAM, HDF5 parallel mode.
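As an illustration of the kind of checks listed above, here is a minimal Python sketch. It is not the contents of `scripts/health_check.py`; the helper names `total_ram_bytes` and `preflight_problems` are hypothetical, and only the thread-count and RAM checks are covered.

```python
import os


def total_ram_bytes() -> int:
    """Total physical RAM via POSIX sysconf (available on Linux)."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


def preflight_problems(env, ram_bytes, required_bytes):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration only, not part of GRANITE.
    Nested OpenMP settings such as "4,2" are not handled here.
    """
    problems = []
    threads = env.get("OMP_NUM_THREADS", "")
    if not threads.isdigit() or int(threads) < 1:
        problems.append("OMP_NUM_THREADS is unset or not a positive integer")
    if ram_bytes < required_bytes:
        problems.append(
            f"need {required_bytes / 1e9:.1f} GB RAM, "
            f"found {ram_bytes / 1e9:.1f} GB"
        )
    return problems
```

A caller might run `preflight_problems(os.environ, total_ram_bytes(), 16e9)` and abort if the returned list is non-empty; the 16 GB threshold is only an example value.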


2. Memory Requirements

| Configuration | AMR Levels | RAM (estimated) |
|---|---|---|
| 64³, 4 levels (desktop) | 4 | ~0.5 GB |
| 128³, 4 levels (desktop) | 4 | ~4-6 GB |
| 256³, 6 levels (workstation) | 6 | ~16 GB |
| 512³, 8 levels (cluster) | 8 | ~128 GB |
| B5_star, 12 levels | 12 | ~2 TB |

Formula:

RAM ≈ nvar_total × nx³ × AMR_factor × 8 bytes
nvar_total = (22 (CCZ4) + 9 (GRMHD)) × 3 (RK3 buffers) = 31 × 3 = 93
AMR_factor ≈ 1.14 in total, summed over all levels (geometric series: Σ_ℓ (1/8)^ℓ → 8/7)
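The formula can be evaluated directly. The sketch below is a plain transcription of it into Python; the function names and default arguments are illustrative assumptions, and the geometric series is summed over a finite number of levels rather than taken at its 8/7 limit.

```python
def amr_factor(levels: int, ratio: float = 1.0 / 8.0) -> float:
    """Finite geometric sum: sum of ratio**l for l = 0 .. levels-1."""
    return (1.0 - ratio**levels) / (1.0 - ratio)


def estimate_ram_bytes(nx: int, levels: int, n_evolved: int = 31,
                       n_buffers: int = 3, bytes_per_value: int = 8) -> float:
    """RAM ~ nvar_total * nx^3 * AMR_factor * 8 bytes (formula above).

    n_evolved = 22 (CCZ4) + 9 (GRMHD); n_buffers = 3 (RK3 buffers).
    """
    nvar_total = n_evolved * n_buffers  # 31 * 3 = 93
    return nvar_total * nx**3 * amr_factor(levels) * bytes_per_value


if __name__ == "__main__":
    for nx, levels in [(64, 4), (128, 4), (256, 6), (512, 8)]:
        gb = estimate_ram_bytes(nx, levels) / 1e9
        print(f"{nx}^3, {levels} levels: {gb:.2f} GB")
```

For the smaller grids the bare formula yields less than the table's estimates (about 0.22 GB for 64³ with 4 levels), so it is best read as a floor for field storage rather than a full process footprint.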

3. OpenMP Configuration

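# Recommended: use all physical cores (not hyperthreads).
# Note: `nproc --all` counts logical CPUs, so it includes hyperthreads.
export OMP_NUM_THREADS=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)    # Linux: physical cores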
export OMP_PROC_BIND=close               # Bind threads to nearby cores
export OMP_PLACES=cores                  # Core-level granularity
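Logical CPU counts include hyperthreads, so a physical-core count has to deduplicate by socket and core ID. The following is a minimal sketch of that idea, assuming the x86 Linux `/proc/cpuinfo` layout with `physical id` and `core id` fields; the function name is hypothetical.

```python
def count_physical_cores(cpuinfo_text: str) -> int:
    """Count unique (socket, core) pairs in /proc/cpuinfo-style text.

    Returns 0 if no 'core id' fields are present (for example on some
    non-x86 platforms), in which case the caller needs a fallback.
    """
    cores = set()
    # Each logical CPU is described by a block separated by a blank line.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if "core id" in fields:
            cores.add((fields.get("physical id", "0"), fields["core id"]))
    return len(cores)
```

On a Linux node the input would come from `open("/proc/cpuinfo").read()`, and the result could be exported as `OMP_NUM_THREADS` before launching.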

For NUMA systems (multi-socket nodes):

numactl --interleave=all python3 scripts/run_granite_hpc.py ...

4. HPC Launch Command

python3 scripts/run_granite_hpc.py \
    build/bin/granite_main \
    benchmarks/B2_eq/params.yaml \
    --omp-threads 32 \
    --mpi-ranks 128 \
    --disable-numa-bind \
    --amr-telemetry-file /scratch/$USER/amr_B2eq.jsonl

SLURM auto-generation (produces jobs/submit_granite.sbatch):

python3 scripts/run_granite_hpc.py \
    build/bin/granite_main \
    benchmarks/B2_eq/params.yaml \
    --slurm \
    --mpi-ranks 128 \
    --omp-threads 8

5. SLURM Job Template

#!/bin/bash
#SBATCH --job-name=granite_B2eq
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=4G
#SBATCH --time=24:00:00
#SBATCH --partition=compute

module load gcc/11 openmpi/4.1 hdf5/1.12-parallel

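export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=close
# Slurm 22.05+: srun no longer inherits --cpus-per-task from sbatch
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK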

srun python3 scripts/run_granite_hpc.py \
    build/bin/granite_main \
    benchmarks/B2_eq/params.yaml \
    --omp-threads $SLURM_CPUS_PER_TASK \
    --mpi-ranks $SLURM_NTASKS
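Before submitting, it can help to work out what the `#SBATCH` directives add up to. The sketch below does that arithmetic for the template's values; the `SlurmLayout` class is a hypothetical helper, not part of GRANITE or Slurm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlurmLayout:
    """Resource totals implied by a set of #SBATCH directives."""
    nodes: int
    ntasks_per_node: int
    cpus_per_task: int
    mem_per_cpu_gb: float

    @property
    def mpi_ranks(self) -> int:
        # Corresponds to $SLURM_NTASKS
        return self.nodes * self.ntasks_per_node

    @property
    def cores_per_node(self) -> int:
        return self.ntasks_per_node * self.cpus_per_task

    @property
    def total_cores(self) -> int:
        return self.nodes * self.cores_per_node

    @property
    def total_mem_gb(self) -> float:
        return self.total_cores * self.mem_per_cpu_gb


if __name__ == "__main__":
    layout = SlurmLayout(nodes=4, ntasks_per_node=8,
                         cpus_per_task=8, mem_per_cpu_gb=4)
    print(layout.mpi_ranks, layout.cores_per_node,
          layout.total_cores, layout.total_mem_gb)
```

With the template's values this gives 32 MPI ranks of 8 threads each, 64 cores per node, 256 cores in total, and 1024 GB of requested memory. The per-node core count should not exceed what the target partition's nodes provide.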

6. Lustre / Parallel Filesystem I/O Tuning

io:
  hdf5_stripe_count:  16       # Match to number of storage targets
  hdf5_stripe_size:   4194304  # 4 MB (optimal for large field arrays)
  collective_io:      true     # MPI-IO collective mode

Set Lustre striping on the output directory:

lfs setstripe -c 16 -S 4M /scratch/$USER/granite_output/
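The YAML values and the `lfs setstripe` arguments should agree. One way to keep them in sync is to derive the command from the same numbers, as in this sketch; the function name is hypothetical, and it assumes the stripe size is a whole number of MiB.

```python
def lfs_setstripe_cmd(directory: str, stripe_count: int,
                      stripe_size_bytes: int) -> list:
    """Build an `lfs setstripe` argument list from config-style values."""
    mib = 1024 * 1024
    if stripe_count < 1:
        raise ValueError("stripe_count must be at least 1")
    if stripe_size_bytes <= 0 or stripe_size_bytes % mib != 0:
        # Simplifying assumption of this sketch: whole MiB only.
        raise ValueError("stripe_size_bytes must be a positive multiple of 1 MiB")
    size_arg = f"{stripe_size_bytes // mib}M"
    return ["lfs", "setstripe", "-c", str(stripe_count), "-S", size_arg, directory]
```

On a Lustre client the returned list can be passed to `subprocess.run(cmd, check=True)`. With `stripe_count=16` and `stripe_size_bytes=4194304` it reproduces the command shown above.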

7. Container Deployment

Docker

docker build -f containers/Dockerfile -t granite:v0.6.8 .
docker run --rm -it -v "$(pwd)/output":/output granite:v0.6.8 \
    build/bin/granite_main benchmarks/B2_eq/params.yaml

Singularity/Apptainer (HPC clusters)

singularity build granite.sif containers/granite.def
singularity run --bind /scratch/$USER:/output granite.sif \
    build/bin/granite_main benchmarks/B2_eq/params.yaml

8. GPU Roadmap (Post-v0.7)

| Phase | Hardware | Configuration | Projected Throughput |
|---|---|---|---|
| v0.6.8 (current) | i5-8400, GTX 1050 Ti | 64³ CPU | 0.084 M/s |
| v0.7 GPU kernels | vast.ai H100 SXM | 256³ GPU | ~50 M/s (projected) |
| v0.8 production | Cluster H100 × 8 | 512³ GPU | ~400 M/s (projected) |
| v1.0 B5_star | Tier-0 cluster | 12 AMR levels | Exascale-class |

Note: The GTX 1050 Ti (development desktop) is NOT viable for FP64 GPU compute. GPU production runs target H100 SXM instances via vast.ai after GPU kernel porting is complete in v0.7.


See also: Benchmarks & Validation | Developer Guide

