
fix: enable ulysses sharding for custom kernels and improve scaling precision#396

Merged
copybara-service[bot] merged 1 commit into main from fix-ulysses-custom on May 4, 2026

Conversation

Collaborator

@Perseus14 Perseus14 commented May 3, 2026

Description

This PR introduces three small but important fixes to the Ulysses attention implementation (illustrative sketches follow the list):

  1. Enable Ulysses sharding for custom kernels: In src/maxdiffusion/pyconfig.py, the check for Ulysses attention is changed from attention == "ulysses" to "ulysses" in attention. This ensures that custom attention implementations that include "ulysses" in their identifier (e.g., custom_ulysses) will correctly trigger Ulysses sequence sharding instead of falling back to default sharding strategies.
  2. Ensure correct padding: In src/maxdiffusion/models/attention_flax.py, padding was applied to the query variable, but the unpadded query_scaled was then used in the attention calculation. This fix ensures the padded variable is the one passed to the attention calculation.
  3. Improve scaling precision: In src/maxdiffusion/models/attention_flax.py, the hardcoded constant 1.44269504 used to scale queries for base-2 exponentiation is replaced with math.log2(math.e). This provides better precision and makes the intent of the code clearer.

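For illustration only, here is a minimal Python sketch of the shape of all three changes. It is not the actual maxdiffusion code; the helper pad_to_multiple, the block_size argument, and the variable names are assumptions made for this example.

```python
import math

import jax.numpy as jnp


# (1) Substring check, so custom kernels whose identifier contains "ulysses"
#     (e.g. "custom_ulysses") also trigger Ulysses sequence sharding.
#     Previously the config check required attention == "ulysses".
def uses_ulysses_sharding(attention: str) -> bool:
    return "ulysses" in attention


# (3) Base-2 scaling constant expressed exactly instead of the hardcoded
#     1.44269504 literal: exp(x) == 2 ** (x * log2(e)).
LOG2_E = math.log2(math.e)


# Hypothetical padding helper, assumed for this sketch.
def pad_to_multiple(x: jnp.ndarray, multiple: int, axis: int) -> jnp.ndarray:
    """Zero-pad `x` along `axis` up to the next multiple of `multiple`."""
    pad = (-x.shape[axis]) % multiple
    if pad == 0:
        return x
    widths = [(0, 0)] * x.ndim
    widths[axis] = (0, pad)
    return jnp.pad(x, widths)


# (2) Scale the query, then pad it, and return the *padded* tensor, so the
#     attention kernel never receives the unpadded query_scaled by mistake.
def scale_and_pad_query(query: jnp.ndarray, head_dim: int, block_size: int) -> jnp.ndarray:
    query_scaled = query * (LOG2_E / math.sqrt(head_dim))
    return pad_to_multiple(query_scaled, block_size, axis=-2)
```

The log2(e) factor is the usual base-2 exponentiation trick: folding it into the query scale lets the kernel compute softmax with exp2 instead of exp, which is cheaper on most accelerators.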
Generation Time

  • Main: ~420s
  • Branch: ~140s

@Perseus14 Perseus14 requested a review from entrpn as a code owner May 3, 2026 20:27
@Perseus14 Perseus14 changed the title from "Fix: Ulysses custom attention now uses ulysses sharding" to "fix: enable ulysses sharding for custom kernels and improve scaling precision" May 4, 2026
@Perseus14 Perseus14 requested a review from eltsai May 4, 2026 06:56
Collaborator

eltsai commented May 4, 2026

Thanks for fixing this @Perseus14! I tested on v6e (I don't have v7x for now due to capacity) and the generation time is 204 sec:

==================================================
  TIMING SUMMARY
==================================================
  Load (checkpoint):      97.3s
  Compile:               229.8s
  ────────────────────────────────────────
  Inference:             204.0s
  Conditioning:           11.8s
  Denoise Total:         183.6s
  VAE Decode:              8.6s
==================================================

I think the generation time is expected, because on v7x we are seeing a 28% speed boost (140 vs 194.4). From go/wan-dashboard the e2e gen time is 322 sec, so the speed gain is about 36%.
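For reference, a quick back-of-the-envelope check of those percentages (simple arithmetic on the numbers quoted above; this assumes the 204s v6e inference time is what is being compared against the 322s dashboard figure):

```python
# v7x denoise/generation: 194.4s on main vs 140s on this branch.
v7x_gain = (194.4 - 140) / 194.4   # ~0.28 -> ~28%

# e2e vs the go/wan-dashboard baseline of 322s, using the 204s measured above.
e2e_gain = (322 - 204) / 322       # ~0.37 -> ~36-37%

print(f"v7x: {v7x_gain:.1%}, e2e vs dashboard: {e2e_gain:.1%}")
```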

copybara-service[bot] merged commit 71b4138 into main May 4, 2026
13 checks passed
