
Optimize batch loading and metrics writing, replace PositionalSharding with NamedSharding#186

Merged
coolkp merged 8 commits into main from sdxl-gpu on Jun 18, 2025

Conversation

Collaborator

@coolkp coolkp commented Jun 17, 2025

  • Asynchronously prepare batches and write metrics (improves step time and MFU)
  • Add loss blocking, same as MaxText
  • Remove PositionalSharding; it is deprecated
  • Add a script to convert latents to TFRecord
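The "loss blocking" item mirrors MaxText's pattern: explicitly wait for the dispatched loss computation to finish before reading its value for logging, so the metric reflects a completed step rather than a pending device future. A minimal sketch (the `train_step` body here is a stand-in, not the PR's actual code):

```python
import jax
import jax.numpy as jnp

@jax.jit
def train_step(params, batch):
    # Stand-in loss; the real step would run the model and optimizer.
    return jnp.mean((params * batch) ** 2)

loss = train_step(jnp.float32(2.0), jnp.ones((4,)))
# Block until the device computation finishes before reading the value,
# so metric writing measures a finished step.
loss = jax.block_until_ready(loss)
print(float(loss))  # 4.0
```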

Comparison with main (improvement in step time when using tf):
[screenshot: step-time comparison against main]
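`PositionalSharding` has been deprecated in JAX in favor of `NamedSharding`, which pairs a device mesh that has named axes with a `PartitionSpec` mapping array dimensions onto those axes. A minimal sketch of the replacement (the `"data"` axis name is illustrative, not necessarily the one used in this repo):

```python
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D device mesh; the "data" axis name is illustrative.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

# Old (deprecated): PositionalSharding(jax.devices()).reshape(-1, 1)
# New: name the mesh axis and say which array dim maps onto it.
sharding = NamedSharding(mesh, P("data"))  # shard dim 0 across "data"

batch = jax.device_put(np.zeros((8, 4), np.float32), sharding)
print(batch.shape)  # (8, 4)
```

Named axes make the intended parallelism explicit, which is why JAX steers users toward them.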

coolkp added 5 commits June 12, 2025 07:01
…to tfrecord, batch iterator for tfrecord cached, namedsharding instead of positional sharding

Signed-off-by: Kunjan <kunjanp@google.com>
@coolkp coolkp requested a review from entrpn June 17, 2025 13:45
Signed-off-by: Kunjan <kunjanp@google.com>
entrpn
entrpn previously approved these changes Jun 17, 2025
Collaborator

@entrpn entrpn left a comment


I'll approve it, but would really like to find the root cause as to why we need an executor to load next batch.
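The executor in question lets host-side batch preparation overlap with device computation: while the accelerator runs the current step, a background thread readies the next batch. A hypothetical sketch of that pattern (not the PR's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor

_SENTINEL = object()

class PrefetchIterator:
    """Prepare the next batch on a background thread while the
    current step runs on the accelerator (hypothetical sketch)."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._future = self._pool.submit(next, self._it, _SENTINEL)

    def __iter__(self):
        return self

    def __next__(self):
        batch = self._future.result()  # wait for the prepared batch
        if batch is _SENTINEL:
            self._pool.shutdown(wait=False)
            raise StopIteration
        # Immediately start preparing the following batch.
        self._future = self._pool.submit(next, self._it, _SENTINEL)
        return batch

print(list(PrefetchIterator(range(4))))  # [0, 1, 2, 3]
```

If the underlying iterator is cheap, the executor should be a no-op; needing it suggests per-batch host work (decoding, host-to-device copies) is long enough to stall the step, which is the root cause worth profiling.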

@coolkp coolkp force-pushed the sdxl-gpu branch 5 times, most recently from bdded09 to 4b051fa (June 18, 2025 16:17)
entrpn
entrpn previously approved these changes Jun 18, 2025
…ocessed features

Signed-off-by: Kunjan <kunjanp@google.com>
@coolkp coolkp force-pushed the sdxl-gpu branch 5 times, most recently from 6b6881d to 388ade1 (June 18, 2025 18:51)
@coolkp coolkp merged commit f344ab0 into main Jun 18, 2025
2 of 3 checks passed
hx89 pushed a commit to hx89/maxdiffusion that referenced this pull request Jul 14, 2025
…g with NamedSharding (AI-Hypercomputer#186)

* fix profiling

* Use torch cpu, async write to tensorboard, script to convert latents to tfrecord, batch iterator for tfrecord cached, namedsharding instead of positional sharding

Signed-off-by: Kunjan <kunjanp@google.com>

* Replace positional sharding with named sharding

Signed-off-by: Kunjan <kunjanp@google.com>

* Formatting

Signed-off-by: Kunjan <kunjanp@google.com>

* Formatting

Signed-off-by: Kunjan <kunjanp@google.com>

* Fallback to regular tfrecord iterator for datasets without all the processed features

Signed-off-by: Kunjan <kunjanp@google.com>

* README update

---------

Signed-off-by: Kunjan <kunjanp@google.com>
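The fallback commit gates the cached fast path on feature availability: only datasets that contain every preprocessed feature use the cached TFRecord iterator; anything else takes the regular path. A sketch of that dispatch in plain Python over dict-like records (feature names are illustrative):

```python
REQUIRED_FEATURES = ("latents", "text_embeds")  # illustrative names

def choose_iterator(records, required=REQUIRED_FEATURES):
    """Use the cached fast path only when every preprocessed feature
    is present; otherwise fall back to the regular iterator."""
    sample = records[0]
    if all(key in sample for key in required):
        return "cached", iter(records)
    # The real code would route to the regular TFRecord iterator here.
    return "regular", iter(records)

path, _ = choose_iterator([{"latents": 1, "text_embeds": 2}])
print(path)  # cached
```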