@vict0rsch (Collaborator) commented Jun 1, 2022

Enables distributed training:

  • 1 node
  • N GPUs
  • N SLURM tasks

(DistributedDataParallel is more efficient than DataParallel, which is why there are N 1-GPU tasks instead of one N-GPU task.)
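As a minimal sketch of what "N 1-GPU tasks" implies, each SLURM task can work out its own role from the environment variables SLURM sets per task. The helper name `slurm_process_info` is hypothetical and is not taken from this PR; only the `SLURM_PROCID`, `SLURM_NTASKS`, and `SLURM_LOCALID` variables are standard.

```python
import os


def slurm_process_info(env=None):
    """Derive per-process distributed settings from SLURM task variables.

    Hypothetical helper for illustration; not the code merged in this PR.
    """
    env = os.environ if env is None else env
    return {
        # Global index of this task across the whole job.
        "rank": int(env.get("SLURM_PROCID", 0)),
        # Total number of tasks, i.e. one process per GPU here.
        "world_size": int(env.get("SLURM_NTASKS", 1)),
        # Index of this task on its node.
        "local_rank": int(env.get("SLURM_LOCALID", 0)),
    }


if __name__ == "__main__":
    # Outside SLURM this falls back to a single-process setup.
    print(slurm_process_info())
```

With 1 node and N tasks, `rank` and `local_rank` coincide. The resulting `rank` and `world_size` are the kind of values one would pass to `torch.distributed.init_process_group` before wrapping the model in `DistributedDataParallel`; which GPU index each task should select depends on how the cluster binds GPUs to tasks.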

@vict0rsch vict0rsch requested a review from AlexDuvalinho June 1, 2022 14:03
@vict0rsch vict0rsch changed the title from "setup distributed config without --submit (--distributed is enough)" to "Distributed training" Jun 2, 2022

@vict0rsch (Collaborator, Author) commented Jun 7, 2022

2022-06-06 18:08:06 (WARNING): No metadata file found at '/network/projects/_groups/ocp/oc20/is2re/all/train/metadata.npz'. Batches will not be balanced, which can incur significant overhead!

Following the instructions in TRAIN.md, I ran the following commands to create those files:

python scripts/make_lmdb_sizes.py --num-workers 6 --data-path /network/projects/ocp/oc20/is2re/all/test_id/data.lmdb
python scripts/make_lmdb_sizes.py --num-workers 6 --data-path /network/projects/ocp/oc20/is2re/all/test_ood_both/data.lmdb
python scripts/make_lmdb_sizes.py --num-workers 6 --data-path /network/projects/ocp/oc20/is2re/all/train/data.lmdb
python scripts/make_lmdb_sizes.py --num-workers 6 --data-path /network/projects/ocp/oc20/is2re/all/val_ood_ads/data.lmdb
python scripts/make_lmdb_sizes.py --num-workers 6 --data-path /network/projects/ocp/oc20/is2re/all/val_ood_cat/data.lmdb
python scripts/make_lmdb_sizes.py --num-workers 6 --data-path /network/projects/ocp/oc20/is2re/all/test_ood_ads/data.lmdb
python scripts/make_lmdb_sizes.py --num-workers 6 --data-path /network/projects/ocp/oc20/is2re/all/test_ood_cat/data.lmdb
python scripts/make_lmdb_sizes.py --num-workers 6 --data-path /network/projects/ocp/oc20/is2re/all/val_id/data.lmdb
python scripts/make_lmdb_sizes.py --num-workers 6 --data-path /network/projects/ocp/oc20/is2re/all/val_ood_both/data.lmdb
python scripts/make_lmdb_sizes.py --num-workers 6 --data-path /network/projects/ocp/oc20/is2re/100k/train/data.lmdb
python scripts/make_lmdb_sizes.py --num-workers 6 --data-path /network/projects/ocp/oc20/is2re/10k/train/data.lmdb
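The commands above differ only in the split directory, so they could be generated in a loop. The following is a rough sketch under that assumption: the names `ROOT`, `SPLITS`, and `build_commands` are hypothetical, and the script path and flags are copied from the commands above rather than from the script's documentation.

```python
import shlex
from pathlib import PurePosixPath

# Dataset root and split directories, copied from the commands above.
ROOT = PurePosixPath("/network/projects/ocp/oc20/is2re")
SPLITS = [
    "all/test_id", "all/test_ood_both", "all/train",
    "all/val_ood_ads", "all/val_ood_cat", "all/test_ood_ads",
    "all/test_ood_cat", "all/val_id", "all/val_ood_both",
    "100k/train", "10k/train",
]


def build_commands(root=ROOT, splits=SPLITS, num_workers=6):
    """Return one make_lmdb_sizes.py argument list per split (hypothetical helper)."""
    commands = []
    for split in splits:
        lmdb_path = PurePosixPath(root) / split / "data.lmdb"
        commands.append([
            "python", "scripts/make_lmdb_sizes.py",
            "--num-workers", str(num_workers),
            "--data-path", str(lmdb_path),
        ])
    return commands


if __name__ == "__main__":
    # Print the commands for inspection instead of running them.
    for cmd in build_commands():
        print(shlex.join(cmd))
```

To actually run them, each argument list could be passed to `subprocess.run(cmd, check=True)` from the repository root.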

@vict0rsch vict0rsch merged commit 4218071 into main Jun 7, 2022
@vict0rsch vict0rsch deleted the distribute-sbatch-py branch June 7, 2022 10:05
@vict0rsch vict0rsch mentioned this pull request Jun 7, 2022