Updating NVIDIA A100 GPU machine for pytorch2.0 #1610
Change-wise, this looks good.
Please make sure model training is fast enough (similar to what we had for pytorch 1.x) to complete the 80 epochs in 7 days.
Manually tested: each epoch completes in 1 hour 48 minutes, so 80 epochs will complete within 7 days.
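A quick sanity check of that estimate, as a rough sketch: it assumes every epoch takes the measured 1 hour 48 minutes and ignores any startup, checkpointing, or evaluation overhead. The variable names are illustrative and not taken from the test code.

```python
from datetime import timedelta

# Figures quoted in the review thread (assumed constant across epochs).
epoch_duration = timedelta(hours=1, minutes=48)
total_epochs = 80
time_budget = timedelta(days=7)

# Projected wall-clock time for the full run and the slack left in the budget.
total_time = epoch_duration * total_epochs
headroom = time_budget - total_time

print(f"projected total: {total_time}")
print(f"headroom:        {headroom}")
```

Under those assumptions the run takes 144 hours (6 days), leaving roughly one day of slack within the 7-day window.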
* adding nvidia a100 gpu machine * adding nvidia a100 gpu machine * testing changes * testing changes * undo testing changes * changing bucket location
Description
We reduced the epochs because the pytorch2 long-haul tests were running on NVIDIA L4 machines, whose GPU is less powerful than the NVIDIA A100.
asia-northeast1-a
region, so we changed the machine from NVIDIA L4 to NVIDIA A100.
Link to the issue in case of a bug fix.
NA
Testing details