
Conversation

@vict0rsch (Collaborator) commented Jun 14, 2022

  • call distutils.synchronize() after trainer.save() to prevent checkpoint corruption (see the first sketch after this list)
  • add a git_checkout command-line arg to sbatch.py to ensure enqueued jobs run the appropriate code state (see the second sketch below)
    • defaults to None, meaning the training uses the code as it is when the job starts (not as it was when the job was queued)
    • writes git checkout {git_checkout} to the sbatch file, so use it as:
      • python sbatch.py git_checkout=your-branch
      • python sbatch.py git_checkout=somecommithash
  • removes the error.txt files; instead, each task writes to its own output file output-%t.txt, where %t is the task id, which suits our single-job (%j) multi-task (%t) setting
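
A minimal sketch of the barrier-after-save pattern the first bullet describes, assuming distutils.synchronize() wraps a collective like torch.distributed.barrier(); the function and the trainer API here are illustrative, not the repo's actual code:

```python
import torch
import torch.distributed as dist

def save_and_sync(trainer, path="checkpoint.pt"):
    # Only rank 0 writes the checkpoint, so ranks never write concurrently.
    if not dist.is_initialized() or dist.get_rank() == 0:
        torch.save(trainer.state_dict(), path)
    # Every rank blocks here until the write is done. Without this barrier,
    # other ranks can finish final eval and tear the job down while rank 0
    # is still mid-write, leaving a truncated (corrupt) checkpoint.
    if dist.is_initialized():
        dist.barrier()
```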

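And a hypothetical sketch of the sbatch.py changes from the second and third bullets: inject a `git checkout` line into the generated batch script, and route each task's logs to its own output-%t.txt via srun. The function name, arguments, and main.py entry point are illustrative, not sbatch.py's actual internals:

```python
def write_sbatch_file(py_args="", git_checkout=None, path="job.sbatch"):
    lines = ["#!/bin/bash"]
    if git_checkout is not None:
        # Pin the enqueued job to a branch or commit so it runs that code
        # state when it starts, regardless of later edits to the repo.
        lines.append(f"git checkout {git_checkout}")
    # %t is the SLURM task id: each task in the (single) job gets its own
    # log file, replacing the shared error.txt.
    lines.append(f"srun --output=output-%t.txt python main.py {py_args}")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

For example, write_sbatch_file(py_args="--mode train", git_checkout="fix-ddp") would produce a script that checks out the fix-ddp branch before launching the tasks.
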
@vict0rsch changed the title from "synchronize() after save() to prevent checkpoint corruption" to "Fix DDP final eval" Jun 14, 2022
@vict0rsch marked this pull request as draft June 14, 2022 10:38
@vict0rsch (Collaborator, Author) commented Jun 14, 2022

Waiting for a full training run to complete before merging:

python sbatch.py py_args="--mode train --config-yml configs/is2re/10k/schnet/new_schnet.yml" note="Distributed training test" git_checkout=fix-ddp mem=96GB

@vict0rsch marked this pull request as ready for review June 14, 2022 15:20
@vict0rsch merged commit f89f4ce into main Jun 14, 2022
@vict0rsch deleted the fix-ddp branch June 14, 2022 15:21