Add loops for ATen compiler MHA asm gen to reduce instruction count #45

Merged: booth-algo merged 1 commit into main from feat/codegen-addr-reg-init on May 18, 2026

booth-algo (Collaborator) commented May 18, 2026

Summary

Emits the ATen per-head MHA attention helpers as hardware loops instead of Python-unrolled code. This reduces instruction memory pressure while preserving the old Python-unrolled path.

Changed helpers:

  • _online_softmax_asm
  • _scale_o_asm
  • _final_scaling_asm
  • _pv_multiply_asm
  • _reset_fpsram_asm
  • _reset_vram_asm
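
As a rough illustration of why looping shrinks the static instruction count, here is a minimal sketch that contrasts a Python-unrolled emitter with a looped one. It is not the code from this PR: `emit_unrolled`, `emit_looped`, the `PLACEHOLDER_OP_*` mnemonics, the `C_LOOP_END` mnemonic, the `gp1` register name, and the operand syntax are all assumptions made for the example. Only `C_LOOP_START` is a name taken from this PR.

```python
# Hypothetical sketch: everything except the C_LOOP_START name is assumed.

def emit_unrolled(body, trip_count):
    """Python-unrolled path: repeat the body once per iteration."""
    lines = []
    for i in range(trip_count):
        lines.append(f"; iteration {i}")  # comment line, not an instruction
        lines.extend(body)
    return lines


def emit_looped(body, trip_count, counter_reg="gp1"):
    """Looped path: emit the body once, wrapped in loop-control lines."""
    return [
        f"C_LOOP_START {counter_reg}, {trip_count}",  # operand syntax assumed
        *body,
        f"C_LOOP_END {counter_reg}",  # assumed closing mnemonic
    ]


def static_instruction_count(lines):
    """Count lines that are instructions rather than ';' comments."""
    return sum(1 for line in lines if not line.lstrip().startswith(";"))


body = ["PLACEHOLDER_OP_A", "PLACEHOLDER_OP_B", "PLACEHOLDER_OP_C"]
unrolled = emit_unrolled(body, trip_count=64)
looped = emit_looped(body, trip_count=64)

print(static_instruction_count(unrolled))  # 3 body ops * 64 iterations
print(static_instruction_count(looped))    # 3 body ops + 2 loop-control lines
```

In this toy model the looped form pays for two extra loop-control lines per loop, which is the kind of overhead that would show up as a small rise in dynamic instructions alongside a large drop in static ones.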

Default emission now uses hardware loops. The existing public unroll path is preserved: ATEN_UNROLL=1 / unroll_loops=True now also sets unroll_attention=True, so the attention helpers use the Python-unrolled path. Tests and harnesses can still override prog.unroll_attention directly after construction for A/B comparisons.
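
The flag behaviour described above can be sketched as follows. `CompilerFlagsSketch` and `flags_from_env` are hypothetical stand-ins, not the real `PlenaCompiler` API, and the way `ATEN_UNROLL` is parsed here (string `"1"` means enabled) is an assumption. Only the names `ATEN_UNROLL`, `unroll_loops`, and `unroll_attention` come from this PR.

```python
import os
from dataclasses import dataclass


@dataclass
class CompilerFlagsSketch:
    """Hypothetical stand-in for the compiler's unroll flags."""
    unroll_loops: bool = False
    unroll_attention: bool = False

    def __post_init__(self):
        # Per the PR description: unroll_loops=True also sets
        # unroll_attention=True.
        if self.unroll_loops:
            self.unroll_attention = True


def flags_from_env(env=None):
    """Build flags from ATEN_UNROLL (parsing rule assumed: '1' enables)."""
    env = os.environ if env is None else env
    return CompilerFlagsSketch(unroll_loops=env.get("ATEN_UNROLL", "0") == "1")


unrolled = flags_from_env({"ATEN_UNROLL": "1"})
looped = flags_from_env({"ATEN_UNROLL": "0"})

# A/B comparison: override the attention flag alone after construction.
ab = flags_from_env({"ATEN_UNROLL": "0"})
ab.unroll_attention = True
```

Keeping `unroll_attention` as a separate attribute is what lets a harness flip only the attention helpers while leaving the rest of the emission on the default path.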

CLM-60M Native Layer 0 Counts

Rerun locally from the simulator workspace with compile_hf_model(model, seq_len=64, hidden_size=None, inter_dim=None, num_layers=1). Native dims: hidden=384, inter=1408, heads=6, kv_heads=2, head_dim=64.

| Metric | Previous | Current | Change |
| --- | --- | --- | --- |
| Total ASM source lines | 35,479 | 15,367 | -20,112 (-56.7%) |
| Actual static instruction lines | 34,041 | 14,403 | -19,638 (-57.7%, 2.36x smaller) |
| Comment / metadata lines | 1,438 | 964 | -474 (-33.0%) |
| Loop-expanded dynamic instructions | 645,334 | 649,762 | +4,428 (+0.69%) |
| Estimated cycles | 8,915,366 | 8,919,794 | +4,428 (+0.05%) |
| Estimated ms @ 1GHz | 8.915366 | 8.919794 | +0.004428 (+0.05%) |
| C_LOOP_START static lines | 248 | 296 | +48 |
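
For anyone re-deriving the Change column, this small sketch recomputes the percentages and the cycles-to-milliseconds conversion from the counts reported above. The helper names `pct_change` and `cycles_to_ms` are made up for the example; the input numbers are copied from the table.

```python
# Counts copied from the table above.
previous = {"static": 34_041, "dynamic": 645_334, "cycles": 8_915_366}
current = {"static": 14_403, "dynamic": 649_762, "cycles": 8_919_794}


def pct_change(old, new):
    """Signed percentage change from old to new."""
    return (new - old) / old * 100.0


def cycles_to_ms(cycles, clock_hz=1e9):
    """Convert a cycle count to milliseconds at the given clock rate."""
    return cycles / clock_hz * 1e3


static_pct = pct_change(previous["static"], current["static"])
static_ratio = previous["static"] / current["static"]
dynamic_pct = pct_change(previous["dynamic"], current["dynamic"])
cycles_pct = pct_change(previous["cycles"], current["cycles"])

# Dynamic instructions and estimated cycles grow by the same absolute amount.
dynamic_delta = current["dynamic"] - previous["dynamic"]
cycles_delta = current["cycles"] - previous["cycles"]

print(round(static_pct, 1), round(static_ratio, 2))
print(round(dynamic_pct, 2), round(cycles_pct, 2))
print(cycles_to_ms(current["cycles"]))
```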

Verification

Companion simulator branch asm-count-verification adds the harnesses and report. Results from that branch:

  • ATen MHA seq=64, head_dim=64: static instructions drop from 2,960 to 128; estimated cycles increase about 3%.
  • ATen MHA seq=64, head_dim=128: static instructions drop from 4,399 to 169; estimated cycles increase about 3%.
  • Transactional emulator golden check for looped ATen MHA passed against PyTorch SDPA with 100% allclose pass rate under repo thresholds.
  • Additional flag check: ATEN_UNROLL=1 constructs PlenaCompiler with unroll_loops=True and unroll_attention=True; ATEN_UNROLL=0 leaves both false.
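
To make "allclose pass rate" concrete, here is a minimal pure-Python sketch of an element-wise tolerance check using the usual `|actual - expected| <= atol + rtol * |expected|` rule. The function name `allclose_pass_rate` and the `rtol`/`atol` defaults are assumptions for illustration. The repo's actual thresholds and harness are not shown in this PR.

```python
def allclose_pass_rate(actual, expected, rtol=1e-2, atol=1e-2):
    """Fraction of elements within tolerance (thresholds are placeholders)."""
    if len(actual) != len(expected):
        raise ValueError("actual and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute a pass rate for empty inputs")
    passed = sum(
        1
        for a, e in zip(actual, expected)
        if abs(a - e) <= atol + rtol * abs(e)
    )
    return passed / len(expected)


# Toy values only; these are not emulator or SDPA outputs.
all_pass = allclose_pass_rate([1.0, 2.0], [1.001, 2.0])
half_pass = allclose_pass_rate([1.0, 5.0], [1.0, 2.0])
print(all_pass, half_pass)
```

A 100% pass rate in this sense means every compared element fell within tolerance of the reference.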

  • booth-algo force-pushed the feat/codegen-addr-reg-init branch from b0c6806 to 55caddf (May 18, 2026 23:03)
  • booth-algo force-pushed the feat/codegen-addr-reg-init branch from 55caddf to 6b8ae98 (May 18, 2026 23:11)
  • booth-algo marked this pull request as ready for review (May 18, 2026 23:13)
  • booth-algo changed the title from "NOT READY FOR MERGE: loop ATen MHA attention helpers" to "Add loops for ATen compiler MHA asm gen to reduce instruction count" (May 18, 2026)
  • booth-algo merged commit d2817df into main (May 18, 2026); 3 checks passed
  • booth-algo deleted the feat/codegen-addr-reg-init branch (May 18, 2026 23:14)