Context
Training jobs are submitted to HPC clusters via SLURM scripts in `scripts/` and `research/`. There is currently no way to test these job scripts locally before scheduling real jobs, which makes iteration slow and wastes cluster resources on configuration errors.
Proposed Changes
Add a Docker Compose setup that simulates a minimal SLURM environment locally:
- Review the DRAC / Compute Canada SLURM documentation to match their configuration as closely as possible: https://docs.alliancecan.ca/wiki/Running_jobs. See the bash scripts in the `research/` folder of this repo for real examples of how the pipeline is scheduled on DRAC's SLURM, e.g. `research/order_level_classifier/job_train_classifier.sh`
- A container with a SLURM controller and a single compute node (e.g., using `giovtorres/slurm-docker-cluster` or similar)
- The project mounted as a volume so job scripts can be submitted with `sbatch`
- Optional GPU passthrough (for CPU-only smoke tests, training can run for 1-2 epochs)
- A README explaining how to start the environment, submit jobs, and check output
- A GitHub workflow to test SLURM jobs in CI
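A rough sketch of what the Compose file could look like. Service names, the image tag, and the mount path are assumptions for illustration, not the actual `giovtorres/slurm-docker-cluster` configuration:

```yaml
# Sketch only -- service names, image tag, and paths are illustrative,
# not copied from giovtorres/slurm-docker-cluster.
services:
  slurmctld:
    image: giovtorres/slurm-docker-cluster   # SLURM controller
    hostname: slurmctld
    volumes:
      - .:/workspace                         # mount the repo so jobs can reach scripts/ and research/
  c1:
    image: giovtorres/slurm-docker-cluster   # single compute node (slurmd)
    hostname: c1
    volumes:
      - .:/workspace
```

Jobs could then be submitted from the host with something along the lines of `docker compose exec slurmctld sbatch /workspace/research/order_level_classifier/job_train_classifier.sh`, with output checked via `squeue` and the generated `slurm-*.out` files.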
This would allow developers to validate SLURM scripts, environment setup, and pipeline orchestration before submitting to the real cluster.
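As a cheap complement to the containerized cluster, part of the validation could be done without SLURM at all. Below is a minimal sketch of a hypothetical pre-submit check (`check_sbatch` is not an existing helper in this repo) that verifies a job script declares the `#SBATCH` directives DRAC expects, such as `--time`, `--mem`, and `--account`:

```shell
# check_sbatch: hypothetical helper (not part of the repo) that lints a
# SLURM job script for required #SBATCH directives before submission.
check_sbatch() {
  script="$1"
  rc=0
  # DRAC jobs are rejected or mis-scheduled without these; adjust as needed.
  for directive in --time --mem --account; do
    if ! grep -q "^#SBATCH.*${directive}" "$script"; then
      echo "missing: ${directive}"
      rc=1
    fi
  done
  [ "$rc" -eq 0 ] && echo "ok: ${script}"
  return "$rc"
}
```

Such a check could run in the proposed GitHub workflow as a fast first stage, with the Docker Compose SLURM environment reserved for full end-to-end smoke tests.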
Related
- `research/order_level_classifier/job_*.sh` — existing SLURM job scripts
- `scripts/train_species_classifier.sh` — local equivalent (PR feat: add species classifier training pipeline #69)