
gsutil retry #7

Merged
gobbleturk merged 2 commits into main from gsutil-retry
Apr 21, 2023

Conversation

@gobbleturk
Collaborator

Add a retry to gsutil.

We have found that gsutil does not always finish installing in time, and a short retry fixes this. We have only seen the issue in large runs (e.g. 2xv4-384); previously the code worked for runs up to 2xv4-128.
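The retry idea described above can be sketched roughly as follows. This is an illustration only, not the change made in this PR (the diff is not shown here): the helper names `retry` and `gsutil_available`, the attempt count, and the delay are all assumptions, and "gsutil is installed" is approximated by checking whether it is on `PATH`.

```python
import shutil
import time


def retry(action, attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Call `action` until it returns True or `attempts` are used up.

    Returns the 1-based number of the attempt that succeeded, or 0 if
    every attempt failed. (Hypothetical helper, not from the PR.)
    """
    for attempt in range(1, attempts + 1):
        if action():
            return attempt
        if attempt < attempts:
            # Wait briefly so a slow install has a chance to finish.
            sleep(delay_seconds)
    return 0


def gsutil_available():
    # Assumption: treat gsutil as installed once it is found on PATH.
    return shutil.which("gsutil") is not None
```

A caller might then write `if not retry(gsutil_available, attempts=5, delay_seconds=2.0): raise RuntimeError("gsutil not found")` before the first step that needs gsutil. A short fixed delay is used because the description only reports that a short retry was enough; a backoff schedule would be a further assumption.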

@gobbleturk gobbleturk requested a review from rwitten April 20, 2023 21:32
@gobbleturk gobbleturk merged commit c3c59a4 into main Apr 21, 2023
@gobbleturk gobbleturk deleted the gsutil-retry branch April 21, 2023 21:07
A9isha pushed a commit that referenced this pull request Apr 11, 2024
jberchtold-nvidia added a commit to jberchtold-nvidia/maxtext that referenced this pull request Aug 15, 2025
…ion_v2

Add test scripts to v2 integration
geeningwang pushed a commit to geeningwang/maxtext that referenced this pull request Apr 17, 2026
- Benchmark history: add row AI-Hypercomputer#7 (scan=true, 71.1 ms / 450 tok/s, +28% vs AI-Hypercomputer#6)
- Add 'scan_layers=true Analysis' section explaining the 15.4 ms overhead:
  root cause is loss of XLA inter-layer weight-prefetch pipelining (lax.while_loop
  prevents cross-iteration scheduling); ~0.32 ms/layer consistent with
  memory-bandwidth-bound workload losing prefetch overlap
- Quantify sparse dispatch break-even bar: must recover >22% to beat 55.7 ms
  dense baseline; rough estimate ~34 ms achievable with ragged_all_to_all
- Update 'Most Impactful Next Fixes' section: scan fix done, ragged_all_to_all
  is now AI-Hypercomputer#1 priority; update HEAD ref to 539cc04
- Re-rank optimisation table: ragged_all_to_all moved to rank 2 (highest unrun),
  scan_layers=true added as rank 3 (done, prerequisite)

2 participants