
Optimize batch loading and metrics writing, replace PositionalSharding with NamedSharding#186

Merged
coolkp merged 8 commits into main from sdxl-gpu on Jun 18, 2025

Conversation

Collaborator

@coolkp coolkp commented Jun 17, 2025

  • Asynchronously prepare batches and write metrics (improves step time and MFU)
  • Add loss blocking, same as MaxText
  • Remove PositionalSharding; it is deprecated
  • Add a script to convert latents to TFRecord
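The "loss blocking" item mirrors MaxText's pattern: explicitly wait for the dispatched loss computation to finish before reading its value for logging, so the metric reflects a completed step rather than a pending device future. A minimal sketch (the `train_step` body here is a stand-in, not the PR's actual code):

```python
import jax
import jax.numpy as jnp

@jax.jit
def train_step(params, batch):
    # Stand-in loss; the real step would run the model and optimizer.
    return jnp.mean((params * batch) ** 2)

loss = train_step(jnp.float32(2.0), jnp.ones((4,)))
# Block until the device computation finishes before reading the value,
# so metric writing measures a finished step.
loss = jax.block_until_ready(loss)
print(float(loss))  # 4.0
```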

Comparison with main (improvement in step time when using tf):
[screenshot: step-time comparison against main]
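`PositionalSharding` has been deprecated in JAX in favor of `NamedSharding`, which pairs a device mesh that has named axes with a `PartitionSpec` mapping array dimensions onto those axes. A minimal sketch of the replacement (the `"data"` axis name is illustrative, not necessarily the one used in this repo):

```python
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D device mesh; the "data" axis name is illustrative.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

# Old (deprecated): PositionalSharding(jax.devices()).reshape(-1, 1)
# New: name the mesh axis and say which array dim maps onto it.
sharding = NamedSharding(mesh, P("data"))  # shard dim 0 across "data"

batch = jax.device_put(np.zeros((8, 4), np.float32), sharding)
print(batch.shape)  # (8, 4)
```

Named axes make the intended parallelism explicit, which is why JAX steers users toward them.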

coolkp added 5 commits June 12, 2025 07:01
…to tfrecord, batch iterator for tfrecord cached, namedsharding instead of positional sharding

Signed-off-by: Kunjan <kunjanp@google.com>
@coolkp coolkp requested a review from entrpn June 17, 2025 13:45
Signed-off-by: Kunjan <kunjanp@google.com>
entrpn
entrpn previously approved these changes Jun 17, 2025
Collaborator

@entrpn entrpn left a comment


I'll approve it, but would really like to find the root cause as to why we need an executor to load next batch.
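The executor in question lets host-side batch preparation overlap with device computation: while the accelerator runs the current step, a background thread readies the next batch. A hypothetical sketch of that pattern (not the PR's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor

_SENTINEL = object()

class PrefetchIterator:
    """Prepare the next batch on a background thread while the
    current step runs on the accelerator (hypothetical sketch)."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._future = self._pool.submit(next, self._it, _SENTINEL)

    def __iter__(self):
        return self

    def __next__(self):
        batch = self._future.result()  # wait for the prepared batch
        if batch is _SENTINEL:
            self._pool.shutdown(wait=False)
            raise StopIteration
        # Immediately start preparing the following batch.
        self._future = self._pool.submit(next, self._it, _SENTINEL)
        return batch

print(list(PrefetchIterator(range(4))))  # [0, 1, 2, 3]
```

If the underlying iterator is cheap, the executor should be a no-op; needing it suggests per-batch host work (decoding, host-to-device copies) is long enough to stall the step, which is the root cause worth profiling.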

@coolkp coolkp force-pushed the sdxl-gpu branch 5 times, most recently from bdded09 to 4b051fa (June 18, 2025 16:17)
entrpn
entrpn previously approved these changes Jun 18, 2025
…ocessed features

Signed-off-by: Kunjan <kunjanp@google.com>
@coolkp coolkp force-pushed the sdxl-gpu branch 5 times, most recently from 6b6881d to 388ade1 (June 18, 2025 18:51)
@coolkp coolkp merged commit f344ab0 into main Jun 18, 2025
2 of 3 checks passed
hx89 pushed a commit to hx89/maxdiffusion that referenced this pull request Jul 14, 2025
…g with NamedSharding (AI-Hypercomputer#186)

* fix profiling

* Use torch cpu, async write to tensorboard, script to convert latents to tfrecord, batch iterator for tfrecord cached, namedsharding instead of positional sharding

Signed-off-by: Kunjan <kunjanp@google.com>

* Replace positional sharding with named sharding

Signed-off-by: Kunjan <kunjanp@google.com>

* Formatting

Signed-off-by: Kunjan <kunjanp@google.com>

* Formatting

Signed-off-by: Kunjan <kunjanp@google.com>

* Fallback to regular tfrecord iterator for datasets without all the processed features

Signed-off-by: Kunjan <kunjanp@google.com>

* README update

---------

Signed-off-by: Kunjan <kunjanp@google.com>
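The fallback commit gates the cached fast path on feature availability: only datasets that contain every preprocessed feature use the cached TFRecord iterator; anything else takes the regular path. A sketch of that dispatch in plain Python over dict-like records (feature names are illustrative):

```python
REQUIRED_FEATURES = ("latents", "text_embeds")  # illustrative names

def choose_iterator(records, required=REQUIRED_FEATURES):
    """Use the cached fast path only when every preprocessed feature
    is present; otherwise fall back to the regular iterator."""
    sample = records[0]
    if all(key in sample for key in required):
        return "cached", iter(records)
    # The real code would route to the regular TFRecord iterator here.
    return "regular", iter(records)

path, _ = choose_iterator([{"latents": 1, "text_embeds": 2}])
print(path)  # cached
```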