Add longer timeout to v1 trainer #348

Merged
kmontemayor2-sc merged 2 commits into main from kmonte/add_v1_trainer_timeout
Sep 30, 2025

@kmontemayor2-sc
Collaborator

Scope of work done

Vertex AI may start the training jobs at staggered times, e.g. rank 0 starts at time 0 and rank 10 starts at time 11, so rank 0 times out while waiting for the other ranks and crashes.

We should probably parameterize this in the future, but in practice the 45-minute timeout is sufficient here.
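As a rough illustration of what parameterizing the timeout could look like, here is a minimal sketch. It is not the code in this PR. The environment variable name `TRAINER_PROCESS_GROUP_TIMEOUT_MINUTES` and the helper `get_process_group_timeout` are hypothetical, and the commented usage assumes the trainer sets up its process group through `torch.distributed.init_process_group`, which accepts a `timeout` of type `datetime.timedelta`.

```python
import os
from datetime import timedelta
from typing import Mapping, Optional

# Hypothetical environment variable name; not defined by GiGL.
TIMEOUT_ENV_VAR = "TRAINER_PROCESS_GROUP_TIMEOUT_MINUTES"
# Default taken from the 45 minute value mentioned in this PR.
DEFAULT_TIMEOUT_MINUTES = 45


def get_process_group_timeout(env: Optional[Mapping[str, str]] = None) -> timedelta:
    """Return the process group timeout, falling back to 45 minutes."""
    env = os.environ if env is None else env
    raw = env.get(TIMEOUT_ENV_VAR)
    if raw is None:
        return timedelta(minutes=DEFAULT_TIMEOUT_MINUTES)
    minutes = float(raw)  # raises ValueError on non-numeric input
    if minutes <= 0:
        raise ValueError(f"{TIMEOUT_ENV_VAR} must be positive, got {raw!r}")
    return timedelta(minutes=minutes)


# Usage sketch (needs torch and a configured rendezvous, so it is left commented out):
# import torch.distributed as dist
# dist.init_process_group(backend="gloo", timeout=get_process_group_timeout())
```

Keeping 45 minutes as the default would preserve the behavior described above while letting a job with slower or faster worker startup override it.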

Where is the documentation for this feature?: N/A

Did you add automated tests or write a test plan?

Tested internally successfully.

Updated Changelog.md? NO

Ready for code review?: NO

@kmontemayor2-sc
Collaborator Author

/unit_test

@github-actions
Contributor

github-actions Bot commented Sep 30, 2025

GiGL Automation

@ 18:48:14UTC : 🔄 Unit Test started.

@ 19:37:37UTC : ✅ Workflow completed successfully.

@kmontemayor2-sc
Collaborator Author

/integration_test

@kmontemayor2-sc
Collaborator Author

/e2e_test

@github-actions
Contributor

github-actions Bot commented Sep 30, 2025

GiGL Automation

@ 18:48:32UTC : 🔄 Integration Test started.

@ 19:38:41UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions Bot commented Sep 30, 2025

GiGL Automation

@ 18:48:44UTC : 🔄 E2E Test started.

@ 20:07:09UTC : ✅ Workflow completed successfully.

Comment thread python/gigl/src/training/v1/lib/training_process.py Outdated
@kmontemayor2-sc kmontemayor2-sc marked this pull request as ready for review September 30, 2025 22:18
@kmontemayor2-sc kmontemayor2-sc added this pull request to the merge queue Sep 30, 2025
Merged via the queue into main with commit 1447e21 Sep 30, 2025
4 checks passed
@kmontemayor2-sc kmontemayor2-sc deleted the kmonte/add_v1_trainer_timeout branch September 30, 2025 23:58

4 participants