Parallel Hyperparameter Search #84

lowlypalace · 2023-12-16T21:29:29Z

PR Description

This PR makes hyperparameter search more scalable by running the training loop for each agent in parallel. This is especially useful when running the search on a cluster (e.g. Slurm).

Assign workers to devices such as GPU / CPU. The list of devices can now be provided as an argument.
Fix an issue when we were not setting number of threads for each process. This basically made the CPU parallelization even slower when a sequential run. Now this is fixed. Link to the issue.
Add num_workers parameter to define the size of the pool of processes. Still not sure about this one though as we still need to wait for all of the workers from the same iteration so that we can get the hypervolume. So maybe we can just spawn the same number of processes as num_seeds.
There was a bug in one of the initializers of the Envelope and PCN algorithms, where we were passing device as an id making it essentialy always default to the same device.

TODO:

True GPU parallelization of policy evaluation / model update step.
Add examples of runs with different configurations

Example Configs on a Slurm Cluster

Using 4 GPUs + 4 workers

#SBATCH --nodes=1 # node count
#SBATCH --ntasks=1 # total number of tasks across all nodes

#SBATCH --cpus-per-task 4 # number of processes
#SBATCH -G 4

python experiments/hyperparameter_search/launch_sweep.py \
--algo envelope \
--env-id minecart-v0 \
--sweep-count 100 \
--seed 10 \
--num-seeds 4 \
--num-workers 4 \
--devices cuda:0 cuda:1 cuda:2 cuda:3

Using 4 CPUs + 4 workers

#SBATCH --nodes=1 # node count
#SBATCH --ntasks=1 # total number of tasks across all nodes

#SBATCH --cpus-per-task 4 # number of processes

python experiments/hyperparameter_search/launch_sweep.py \
--algo envelope \
--env-id minecart-v0 \
--sweep-count 100 \
--seed 10 \
--num-seeds 4 \
--num-workers 4

Each worker will use auto and then each algo instance will default to cpu as CUDA is not available.

Example Runs on a Slurm Cluster

Example Runs:

Workers	CPUs	GPUs	CPU Usage	GPU Usage	Sweeps
4	4	0	94.88%	N/A	18
4	1	0	95.55%	N/A	15
1	1	0	25.03%	N/A	15
4	4	1	18.78%	99%	5
4	4	4	31.07%	9% 11% 12% 10%	5
4	1	4	98.85%	4% 5% 5% 5%	13

Workers corresponds to num_workers, CPUs corresponds to --cpus-per-task, GPUs corresponds to -G
Number of seeds set to 4 (i.e. training 4 agents).
GPU Usage measured through srun -s --jobid <job-id> --pty nvidia-smi command while running the job.
CPU Usage measured via seffcommand after finishing the job.
CPU Config: Intel Broadwell or Skylake processors.
GPU Config: Tesla V100.
Each run lasted 10 hours.

lowlypalace · 2023-12-16T22:22:06Z

Any idea why black is complaining as I've run the linter on my end?

lowlypalace added 15 commits November 9, 2023 23:45

Set max_workers to the number of CPU cores

3aaad05

Allocate GPUs to workers

d65cb4b

Update num cpus

f919f64

Use num_workers

2ef27d9

Fix

1ff5ee9

Fix

57f7fd0

Set num threads

84e7b81

Move GPU assignment to main loop

e67927f

Pass device instead of id

af64f1b

Remove print

507c8b1

Fix device in PCN

d3d7ece

Lint

aeac893

Add devices argument

505755c

Move

c257ec2

Fix

02b614e

lowlypalace added 5 commits December 17, 2023 00:11

Make argument desc clear

ab7d1d4

Make shorter

8cd9c4c

Clarify

d458762

Edit default

f44d5df

Edit

ad26c02

ffelten closed this Aug 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parallel Hyperparameter Search #84

Parallel Hyperparameter Search #84

lowlypalace commented Dec 16, 2023 •

edited

Loading

lowlypalace commented Dec 16, 2023 •

edited

Loading

Parallel Hyperparameter Search #84

Parallel Hyperparameter Search #84

Conversation

lowlypalace commented Dec 16, 2023 • edited Loading

PR Description

TODO:

Example Configs on a Slurm Cluster

Example Runs on a Slurm Cluster

lowlypalace commented Dec 16, 2023 • edited Loading

lowlypalace commented Dec 16, 2023 •

edited

Loading

lowlypalace commented Dec 16, 2023 •

edited

Loading