Updating NVIDIA A100 GPU machine for pytorch2.0 #1610

Tulsishah · 2024-01-12T19:55:22Z

Description

We had previously reduced the number of epochs because the PyTorch 2.0 long-haul tests were running on NVIDIA L4 machines, whose GPU is less powerful than the NVIDIA A100.

An NVIDIA A100 GPU machine is now available in the asia-northeast1-a zone, so the machine was changed from NVIDIA L4 to NVIDIA A100.
Increased the epochs back to 80.
Updated the driver version to one that is compatible with the NVIDIA A100 machine.

Link to the issue in case of a bug fix.

NA

Testing details

Manual - Tested manually
Unit tests - NA
Integration tests - NA

raj-prince

The changes look good.

Please make sure that model training is fast enough (similar to what we had for PyTorch 1.x) to complete the 80 epochs in 7 days.

Tulsishah · 2024-01-24T11:01:33Z

> Change wise looks good.
>
> Please make sure, model training time is fast enough (similar to what we had for pytorch 1.x) to complete the 80 epochs in 7 days.

Manually tested: each epoch completes in 1 hour 48 minutes, so 80 epochs take about 144 hours (6 days), which is within the 7-day limit.
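As a rough check of the numbers in this comment, the time budget can be worked out as below. This is only a sketch: it assumes every epoch takes the same 1 hour 48 minutes measured in the manual test, and the constant and function names are hypothetical, not taken from the test scripts.

```python
from datetime import timedelta

# Figures quoted in this thread; the names are illustrative only.
EPOCH_DURATION = timedelta(hours=1, minutes=48)  # measured per-epoch time
EPOCHS = 80                                      # epoch count restored by this PR
TIME_BUDGET = timedelta(days=7)                  # limit mentioned in the review


def total_training_time(epochs: int, per_epoch: timedelta) -> timedelta:
    """Projected wall-clock time, assuming a constant per-epoch duration."""
    return epochs * per_epoch


def max_epoch_duration(epochs: int, budget: timedelta) -> timedelta:
    """Longest per-epoch duration that still fits the budget."""
    return budget / epochs


if __name__ == "__main__":
    total = total_training_time(EPOCHS, EPOCH_DURATION)
    print("projected total:", total)
    print("fits in budget:", total <= TIME_BUDGET)
    print("max time per epoch:", max_epoch_duration(EPOCHS, TIME_BUDGET))
```

Under that assumption the run leaves about one day of slack, and an epoch could slow down to roughly 2 hours 6 minutes before the 7-day limit is exceeded.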

* adding nvidia a100 gpu machine
* adding nvidia a100 gpu machine
* testing changes
* testing changes
* undo testing changes
* chainging bucket location

Tulsishah added 5 commits January 12, 2024 15:52

adding nvidia a100 gpu machine

d774242

adding nvidia a100 gpu machine

250f0fe

testing changes

541ca13

testing changes

39c5829

undo testing changes

d9d17e6

Tulsishah requested review from sethiay, vadlakondaswetha and raj-prince as code owners January 12, 2024 19:55

chainging bucket location

65d2147

raj-prince approved these changes Jan 24, 2024

Tulsishah merged commit c28b8b8 into master Jan 24, 2024
8 checks passed

ashmeenkaur pushed a commit that referenced this pull request Jan 29, 2024

Updating NVIDIA A100 GPU machine for pytorch2.0 (#1610)

6664f49

Tulsishah mentioned this pull request Jan 29, 2024

Back merge master to read cache release branch #1664

Merged

ashmeenkaur pushed a commit that referenced this pull request Feb 1, 2024

Updating NVIDIA A100 GPU machine for pytorch2.0 (#1610)

664a62b

ashmeenkaur pushed a commit that referenced this pull request Feb 5, 2024

Updating NVIDIA A100 GPU machine for pytorch2.0 (#1610)

d904d54

chenrui333 mentioned this pull request Mar 2, 2024

gcsfuse 1.4.2 Homebrew/homebrew-core#164838

Merged
