
fix large checkpoints in pipe parallel #33

Merged
sdtblck merged 1 commit into main from fix_large_checkpoints
Jul 29, 2021
Conversation

@sweinbach

Some checkpoints seem to be saved with extensive bloat / temporary state from torch modules. Copying each tensor before saving seems to fix it. To avoid OOM on the GPU, tensors are also moved to CPU before saving.
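A minimal sketch of what the description above suggests (the function name `clean_state_dict` is illustrative, not taken from the PR diff): each tensor-like value is detached, moved to CPU, and cloned before being handed to `torch.save`, so the checkpoint carries only the tensor data rather than autograd references or oversized shared storage.

```python
def clean_state_dict(state_dict):
    """Return a copy of state_dict whose tensor values are detached,
    moved to CPU, and cloned before saving.

    Illustrative sketch of the fix described in this PR, not the actual
    diff: detach() drops autograd graph references, cpu() moves data off
    the GPU so saving cannot OOM there, and clone() copies the data out
    of any shared or oversized storage.
    """
    cleaned = {}
    for name, value in state_dict.items():
        # Duck-typed check so non-tensor entries (ints, strings, ...)
        # pass through unchanged.
        if hasattr(value, "detach"):
            value = value.detach().cpu().clone()
        cleaned[name] = value
    return cleaned

# Typical use, assuming torch is available:
# torch.save(clean_state_dict(model.state_dict()), "checkpoint.pt")
```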

@sweinbach sweinbach requested a review from sdtblck as a code owner July 29, 2021 16:38
@sdtblck sdtblck merged commit 87fbb8f into main Jul 29, 2021
