Compute platform

This tool is freely inspired from the concepts used in Ansible and coded in python3 using the paramiko library for SSH commands.

It relies on 3 types of yaml files:

hosts files, describing the hosts to connect to
configuration files, describing the configuration for each job type hosts
playbooks, describing a list of tasks to run on specific hosts

Hosts

Hosts are remote servers to SSH to connect to to run tasks. Each host should be accessible through SSH without a passphrase.

raspi:
  user: username
  host: 192.168.123.1
  cpus_per_node: 8
  platform: linux
  gpu: False

To account for slurm clusters complexity, you can add a path to save logs to, and the init field allows to load modules and activate python environments.

calmip:
  user: username
  host: olympe.calmip.univ-toulouse.fr
  cpus_per_node: 36
  platform: slurm
  wandb: offline
  gpu: True
  save_path: "/tmpdir/username/slurm_logs"
  init: |
    export OMP_NUM_THREADS=1
    module purge
    module load intel/18.2
    module load intelmpi/18.2
    module load python/3.6.8
    source activate /tmpdir/username/envs/torchenv
    cd

Localhost

Localhost can be defined as a host to run tasks locally, commands will then be run directly in local CLI without ssh.

Configs

Each configuration can describe how many CPUs to allocate for the job depending on the platform, how long the job can be, what partition to use and if a GPU is required. If these parameters are not specified, they will be set to the default values.

Variables

Variables can be set in a playbook, and the string used as key will be replaced by the value everywhere at run time. This allows to easily run a whole playbook on a specific set of hosts, without changing each task, or to reuse a parameter in multiple tasks.

variables:
  "£hosts": 
    - localhost
    - cluster

tasks:
  - name: "Run echo hostname"
    hosts: "£hosts"
    cmd: 
      - "echo hello $HOSTNAME"

Playbooks

A playbook is a set of tasks to run on specific hosts. It starts with an inventory field to specify the paths of hosts and configuration files. Then all tasks are run sequentially as described.

`cmd`

A cmd task simply runs the commands on the remote server through ssh.

  - name: "Run echo hostname"
    hosts: 
      - cluster
    cmd: 
      - "echo hello $HOSTNAME"
      - "pwd"

`pip`

The pip field is a list of packages to install on the hosts with pip.

  - name: "Test pip"
    hosts: 
      - localhost
    pip:
      action: install
      packages:
        - numpy
        - pandas

`git`

The git field allows to push or pull from a directory on the remote host. Its action subfield can be set to "push" or "pull", default is "pull". If github is not accessible, the code will hang and try 25 times to connect to github with a 5s delay. This was implemented to bypass some proxy settings that were annoying the author and about which the IT service never answered, forcing a more robust solution.

  - name: "Git pull"
    hosts: 
      - localhost
    git:
      action: pull
      path: "~/Documents/Doctorat/Dev/BERL"

It can also be used to clone a repository.

  - name: "Git clone"
    hosts: 
      - localhost
    git:
      action: clone
      path: "~/Documents/Doctorat/Dev/test_compute"
      repo: git@github.com:TemplierPaul/BERL.git
      install: False

`slurm`

The slurm field allows to run a job on a specific partition on the remote host with slurm, and to do hyperparameter sweeps easily.

It can take the following parameters:

cmd: the command to run on the remote host
config: the configuration file to use
seeds: the list of seeds to use, each job will run one experiment per seed sequentially
sweep: dictionary {parameter: list of values} to sweep over
args: list of arguments to pass to the command, one job will be started for each argument in the list
scatter: if True, scatters the configurations to run between the hosts, else runs all configs on each host.

All slurm files generated are saved to the local archive directory and named after their slurm job ID, and commands are also saved in the job.csv file for archiving.

  - name: "Test slurm"
    hosts: 
      - cluster
    slurm:
      job_name: "Testing"
      scatter: True
      parallel: True
      cmd: "python myscript.py"
      config: test
      seeds: 0, 1, 2, 3
      sweep:
        "--preset": ["canonical", "openai"]
        "--n_evaluations": [1, 5, 8]
      args:
        - "run_1"
        - "run_2"
        - "run_3"

`tmux`

The tmux field allows to run a job on a host that does not use Slurm by using tmux. It can be used to run multiple jobs in parallel by setting the parallel field to True, else it will run one job at a time. Similarly to srun, it can take multiple args. The cd field can be used to start a tmux instance from a specific directory.

  - name: "Test tmux"
    hosts: 
      - cluster
    tmux:
      job_name: "Testing"
      scatter: True
      parallel: True
      cd: "~/path/to/cd"
      cmd: "python myscript.py"
      seeds: 0
      args:
        - "run_1"
        - "run_2"
        - "run_3"

`copy`

The copy field allows to copy a file or directory between the remote host and localhost with scp. To copy from localhost to remote host, the mode field should be set to "push", else set it to "pull" to get the file from remote host.

  - name: "Push file to a remote machine"
    hosts:
      - cluster
    copy:
      mode: "push"
      src: "/path/to/local/file/filename.txt" # File
      dest: "/path/to/save/to/on/cluster" # Directory

  - name: "Copy a file from a remote machine"
    hosts:
      - cluster
    copy:
      mode: "pull"
      src: "/path/to/remote/file/filename.txt" # File
      dest: "/path/to/save/to/locally" # Directory

Setup

To set up this tool for you:

Clone the repository: `git clone
Install the needed libraries (namely paramiko) with pip
Setup a hosts.yaml file with the hosts you want to connect to
Make sure all hosts are accessible through SSH without a passphrase
Setup a configs.yaml file with the configurations you can use
Create your playbooks in the playbooks directory

Running playbooks

To start a playbook, just run:

python3 compute.py playbooks/<playbook-name>.yaml

Global command

To make the command global and simply type run playbook from anywhere, add this to your .bashrc:

run () {
    cd <path-to-Compute>
    python compute.py playbooks/$1.yaml
    cd -
}

Contributing

To add features, you can create new classes in python:

@register("cmd")
class CMD(Task):
    def __init__(self, name, cfg, servers):
        super().__init__(name, cfg, servers)

    def run(self):
        # Code to run the parsed task
        cmd = self.cfg
        if isinstance(cmd, str):
            cmd = [cmd]
        for server in self.servers:
            for c in cmd:
                print(f"> {c}")
                out, err = server.run(c)
                print(out, err)

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
archive/old		archive/old
old		old
playbooks		playbooks
setup		setup
slurm		slurm
src		src
.gitignore		.gitignore
compute.py		compute.py
configs.yaml		configs.yaml
custom.slurm		custom.slurm
hosts.yaml		hosts.yaml
joblist.txt		joblist.txt
jobs.csv		jobs.csv
p.templier@pando		p.templier@pando
readme.md		readme.md
run.sh		run.sh
script.py		script.py
script.slurm		script.slurm

TemplierPaul/Compute

Folders and files

Latest commit

History

Repository files navigation

Compute platform

Hosts

Localhost

Configs

Variables

Playbooks

cmd

pip

git

slurm

tmux

copy

Setup

Running playbooks

Global command

Contributing

About

Resources

Stars

Watchers

Forks

Languages