Skip to content

Add multiple retries to load data in multihost_dataloading#13

Merged
gobbleturk merged 3 commits into
mainfrom
more-retries
Apr 24, 2023
Merged

Add multiple retries to load data in multihost_dataloading#13
gobbleturk merged 3 commits into
mainfrom
more-retries

Conversation

@gobbleturk
Copy link
Copy Markdown
Collaborator

@gobbleturk gobbleturk commented Apr 24, 2023

Tested via:

  • Normal run (doesn't hit the except clause because error is rare)
  • Dummy crashes via raise tf.errors.FailedPreconditionError(None, None, "Dummy TF Error")

@gobbleturk gobbleturk requested a review from rwitten April 24, 2023 18:09
Copy link
Copy Markdown
Collaborator

@rwitten rwitten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, some minor nits

Copy link
Copy Markdown
Collaborator

@rwitten rwitten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits actually included this time.

Comment thread MaxText/multihost_dataloading.py Outdated
max_logging.log("Failed to get next data batch, retrying")
time.sleep(10)
loaded_data_success = False
data_load_attempts, max_data_load_attempts = 0, 30
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: putthing these constants into uppercase (and including a the time.sleep constant) would be better

Comment thread MaxText/multihost_dataloading.py Outdated
loaded_data_success = False
data_load_attempts, max_data_load_attempts = 0, 30
while not loaded_data_success and data_load_attempts < max_data_load_attempts:
data_load_attempts = data_load_attempts + 1
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data_load_attempts += 1

@gobbleturk gobbleturk merged commit 38a6a29 into main Apr 24, 2023
@gobbleturk gobbleturk deleted the more-retries branch October 25, 2023 17:37
A9isha pushed a commit that referenced this pull request Apr 11, 2024
phu0ngng pushed a commit to phu0ngng/maxtext that referenced this pull request Aug 18, 2025
…rchtold/error-when-dropout-used-in-mlp

Raise error when dropout used in TE mlp
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants