
Conversation

@vict0rsch (Collaborator) commented Jun 14, 2022

  • call distutils.synchronize() after trainer.save() to prevent checkpoint corruption (see the first sketch after this list)
  • add a git_checkout command-line arg to sbatch.py to ensure enqueued jobs run the appropriate code state (see the second sketch below)
    • defaults to None, meaning the training uses the code as it is when the job starts (not as it was when the job was queued)
    • writes git checkout {git_checkout} to the sbatch file, so use it as:
      • python sbatch.py git_checkout=your-branch
      • python sbatch.py git_checkout=somecommithash
  • removes the error.txt files; instead, each task writes to its own output file output-%t.txt, where %t is the task id, which suits our single-job (%j) multi-task (%t) setting
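
A minimal sketch of the barrier-after-save pattern the first bullet describes, assuming distutils.synchronize() wraps a collective like torch.distributed.barrier(); the function and the trainer API here are illustrative, not the repo's actual code:

```python
import torch
import torch.distributed as dist

def save_and_sync(trainer, path="checkpoint.pt"):
    # Only rank 0 writes the checkpoint, so ranks never write concurrently.
    if not dist.is_initialized() or dist.get_rank() == 0:
        torch.save(trainer.state_dict(), path)
    # Every rank blocks here until the write is done. Without this barrier,
    # other ranks can finish final eval and tear the job down while rank 0
    # is still mid-write, leaving a truncated (corrupt) checkpoint.
    if dist.is_initialized():
        dist.barrier()
```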

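And a hypothetical sketch of the sbatch.py changes from the second and third bullets: inject a `git checkout` line into the generated batch script, and route each task's logs to its own output-%t.txt via srun. The function name, arguments, and main.py entry point are illustrative, not sbatch.py's actual internals:

```python
def write_sbatch_file(py_args="", git_checkout=None, path="job.sbatch"):
    lines = ["#!/bin/bash"]
    if git_checkout is not None:
        # Pin the enqueued job to a branch or commit so it runs that code
        # state when it starts, regardless of later edits to the repo.
        lines.append(f"git checkout {git_checkout}")
    # %t is the SLURM task id: each task in the (single) job gets its own
    # log file, replacing the shared error.txt.
    lines.append(f"srun --output=output-%t.txt python main.py {py_args}")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

For example, write_sbatch_file(py_args="--mode train", git_checkout="fix-ddp") would produce a script that checks out the fix-ddp branch before launching the tasks.
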
@vict0rsch changed the title from "synchronize() after save() to prevent checkpoint corruption" to "Fix DDP final eval" Jun 14, 2022
@vict0rsch marked this pull request as draft June 14, 2022 10:38
@vict0rsch (Collaborator, Author) commented Jun 14, 2022

Waiting for a full training run to complete before merging:

python sbatch.py py_args="--mode train --config-yml configs/is2re/10k/schnet/new_schnet.yml" note="Distributed training test" git_checkout=fix-ddp mem=96GB

@vict0rsch marked this pull request as ready for review June 14, 2022 15:20
@vict0rsch merged commit f89f4ce into main Jun 14, 2022
@vict0rsch deleted the fix-ddp branch June 14, 2022 15:21