ModelCheckpointing behaviour changed from previous versions (self.best_model_path holds rank 1 values) #14302
Hi @nithinraok
This is because in the exp manager, NeMo reloads the checkpoint like so:

```python
trainer._checkpoint_connector.restore(self.best_model_path)
```

This code executes on each rank, but only rank 0 saved the checkpoint. This can be fixed by broadcasting the value from rank 0 to all ranks:

```python
best_model_path = trainer.strategy.broadcast(self.best_model_path)
trainer._checkpoint_connector.restore(best_model_path)
```

This small fix can be added in NeMo to unblock you. By running `git bisect` between 1.6.5 and master, I found that the change was introduced by #13364. cc @carmocca This means that a second way to solve this would be to set …
I also found that NeMo uses deprecated APIs from Lightning, for example, …
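To make the effect of the suggested fix concrete, here is a minimal, self-contained sketch of the broadcast pattern. The `broadcast` helper below is a stand-in for `trainer.strategy.broadcast` (which in real multi-GPU training uses `torch.distributed` collectives); the checkpoint filenames are hypothetical.

```python
def broadcast(values_per_rank, src=0):
    """Simulated collective: every rank receives the value held by rank `src`.

    Stand-in for trainer.strategy.broadcast, which performs the same
    synchronization across processes in distributed training.
    """
    return [values_per_rank[src] for _ in values_per_rank]


# Before the fix: rank 0 saved the checkpoint, but rank 1 recorded a
# different best_model_path (the divergence reported in this issue).
per_rank_paths = ["epoch=3-step=400.ckpt", "epoch=2-step=300.ckpt"]

# After broadcasting from rank 0, every rank restores the same checkpoint.
synced_paths = broadcast(per_rank_paths, src=0)
```

The point is that `restore()` runs on every rank, so the path it receives must already be synchronized; broadcasting once before the restore call guarantees that.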
Thank you @awaelchli, the suggested solution works. Yes, we need to change `plugin` to `strategy` as well before the next release.
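For context on the `plugin` to `strategy` rename mentioned above: a rough sketch of the change, shown as `Trainer` keyword-argument dicts rather than live `Trainer` construction (the exact argument values here are illustrative assumptions, not taken from NeMo's code).

```python
# Deprecated style: training-type plugins passed via `plugins=`.
deprecated_kwargs = {"gpus": 2, "plugins": "ddp"}

# Replacement style: the distribution mode goes through `strategy=`,
# with the device selection expressed via accelerator/devices.
replacement_kwargs = {"accelerator": "gpu", "devices": 2, "strategy": "ddp"}
```

The migration is mostly mechanical: anywhere a training-type plugin is passed through `plugins=`, it moves to `strategy=`.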
🐛 Bug
`self.best_k_models` stores checkpoints from two devices with different values, and `self.best_model_path` points to a checkpoint from `cuda:1`, while the checkpoints actually saved were from `cuda:0`.
This behavior changed between 1.6.5 and 1.7.x.
To Reproduce
Expected behavior
Run code without issues
Environment
1.7.x
a.txt
cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @otaj