feat: add Docker Compose environment to simulate SLURM for local testing #71

@mihow

Description

Context

Training jobs are submitted to HPC clusters via SLURM scripts in scripts/ and research/. There is no way to test these job scripts locally before scheduling real jobs, which makes iteration slow and wastes cluster resources on configuration errors.

Proposed Changes

Add a Docker Compose setup that simulates a minimal SLURM environment locally:

  • Review the DRAC / Compute Canada SLURM documentation and match their config as closely as possible: https://docs.alliancecan.ca/wiki/Running_jobs. See the bash scripts in the research/ folder of this repo for real examples of how the pipeline is scheduled on DRAC's SLURM, for example research/order_level_classifier/job_train_classifier.sh
  • A container with a SLURM controller and single compute node (e.g., using giovtorres/slurm-docker-cluster or similar)
  • The project mounted as a volume so job scripts can be submitted with sbatch
  • Optional GPU passthrough (CPU-only smoke tests can run training for 1-2 epochs)
  • A README explaining how to start the environment, submit jobs, and check output
  • Create a GitHub workflow to test SLURM jobs
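
The submit-and-check step that the README and the GitHub workflow would wrap could look roughly like the sketch below. It assumes `sbatch` is on the PATH inside the controller container and relies on its `--parsable` and `--wait` options, so the call blocks until the job terminates and returns the job's exit code. The helper names `parse_job_id` and `run_job` are hypothetical and not part of this repo.

```python
import re
import subprocess


def parse_job_id(sbatch_output: str) -> int:
    """Extract the job ID from sbatch output.

    Accepts the default form ("Submitted batch job 42") and the
    --parsable form ("42" or "42;clustername").
    """
    text = sbatch_output.strip()
    match = re.search(r"Submitted batch job (\d+)", text)
    if match is None:
        match = re.fullmatch(r"(\d+)(?:;\S+)?", text)
    if match is None:
        raise ValueError(f"Unrecognised sbatch output: {sbatch_output!r}")
    return int(match.group(1))


def run_job(script: str, run=subprocess.run):
    """Submit a job script, block until it finishes, return (job_id, exit_code).

    `run` is injectable so the function can be exercised without a
    scheduler; by default it is subprocess.run.
    """
    result = run(
        ["sbatch", "--parsable", "--wait", script],
        capture_output=True,
        text=True,
    )
    # If submission itself failed, stdout is empty and parse_job_id raises.
    job_id = parse_job_id(result.stdout)
    return job_id, result.returncode
```

A smoke test could call `run_job` on one of the job scripts and fail when the exit code is non-zero. Unless the script sets `--output`, SLURM writes the job's output to `slurm-<jobid>.out` in the submission directory, which is where a test would look for logs.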

This would allow developers to validate SLURM scripts, environment setup, and pipeline orchestration before submitting to the real cluster.
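
Part of that validation needs no scheduler at all: the `#SBATCH` directives can be read straight out of a job script and checked against what a single-node simulated cluster can offer. The sketch below is one possible approach, not an existing tool in this repo. The function names are hypothetical, and the GPU check covers only the `--gres`, `--gpus`, and `--gpus-per-node` options.

```python
import re

# Matches lines such as "#SBATCH --time=01:00:00" or "#SBATCH -c 4".
_DIRECTIVE = re.compile(r"^#SBATCH\s+(--?[A-Za-z][\w-]*)(?:[=\s]+(\S.*?))?\s*$")


def parse_sbatch_directives(script_text: str) -> dict:
    """Return a mapping of sbatch option -> value for a job script.

    Options without a value (flags) map to an empty string. Trailing
    comments on a directive line are not stripped in this sketch.
    """
    directives = {}
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#SBATCH"):
            match = _DIRECTIVE.match(stripped)
            if match:
                directives[match.group(1)] = match.group(2) or ""
            continue
        # sbatch stops reading directives at the first line that is
        # neither blank nor a comment, so stop there as well.
        if stripped and not stripped.startswith("#"):
            break
    return directives


def gpu_requested(directives: dict) -> bool:
    """Report whether the script asks for GPUs via common sbatch options."""
    gres = directives.get("--gres", "")
    return (
        gres.startswith("gpu")
        or "--gpus" in directives
        or "--gpus-per-node" in directives
    )
```

A CPU-only smoke test could use `gpu_requested` to decide whether a script needs its GPU request overridden before it is submitted to the local cluster.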

Related
