Updating NVIDIA A100 GPU machine for pytorch2.0 #1610

Merged
merged 6 commits into master on Jan 24, 2024

Conversation

Collaborator

@Tulsishah Tulsishah commented Jan 12, 2024

Description

We previously reduced the number of epochs because the PyTorch 2 long-haul tests were running on NVIDIA L4 machines, whose GPU is less powerful than the NVIDIA A100.

  1. An NVIDIA A100 GPU machine is now available in the asia-northeast1-a zone, so the machine was changed from NVIDIA L4 to NVIDIA A100.
  2. Increased the number of epochs back to 80.
  3. Updated the driver to a version that is compatible with the NVIDIA A100 machine.

Link to the issue in case of a bug fix.

NA

Testing details

  1. Manual - Tested manually
  2. Unit tests - NA
  3. Integration tests - NA

Collaborator

@raj-prince raj-prince left a comment

Change-wise, this looks good.

Please make sure that model training is fast enough (similar to what we had for PyTorch 1.x) to complete the 80 epochs in 7 days.

@Tulsishah
Collaborator Author

> Change-wise, this looks good.
>
> Please make sure that model training is fast enough (similar to what we had for PyTorch 1.x) to complete the 80 epochs in 7 days.

Manually tested: each epoch completes in 1 hour 48 minutes, so 80 epochs will complete within 7 days.
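
As a quick sanity check on that estimate, here is a minimal sketch of the arithmetic. The names are illustrative and not taken from the test scripts. It assumes every epoch takes the same 1 hour 48 minutes measured in the manual run, and it ignores any setup or checkpointing overhead.

```python
from datetime import timedelta

# Per-epoch time reported from the manual test run.
EPOCH_DURATION = timedelta(hours=1, minutes=48)
# Number of epochs restored by this PR.
EPOCHS = 80
# Time budget mentioned in the review comment.
BUDGET = timedelta(days=7)


def total_training_time(epochs: int, epoch_duration: timedelta) -> timedelta:
    """Total wall-clock time, assuming every epoch takes the same time."""
    return epochs * epoch_duration


total = total_training_time(EPOCHS, EPOCH_DURATION)
print(f"Total: {total} ({total / timedelta(days=1):.1f} days)")
# prints "Total: 6 days, 0:00:00 (6.0 days)"
print(f"Fits in budget: {total <= BUDGET}")
# prints "Fits in budget: True"
```

Under these assumptions the run takes 144 hours, or 6 days, which leaves roughly one day of headroom within the 7-day window.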

@Tulsishah Tulsishah merged commit c28b8b8 into master Jan 24, 2024
8 checks passed
ashmeenkaur pushed a commit that referenced this pull request Jan 29, 2024
* adding nvidia a100 gpu machine

* adding nvidia a100 gpu machine

* testing changes

* testing changes

* undo testing changes

* changing bucket location
ashmeenkaur pushed a commit that referenced this pull request Feb 1, 2024
ashmeenkaur pushed a commit that referenced this pull request Feb 5, 2024