Faster DISCO sparsity-pattern setup; OpenMP forward/backward kernels with up to ~55x speedup in some configurations
Cross-attention (key != value != query) in AttentionS2, NeighborhoodAttentionS2, and DistributedNeighborhoodAttentionS2
Serial attention upsampling when nlon_out % nlon_in == 0: CPU/CUDA/torch upsample kernels and matching reference
DistributedNeighborhoodAttentionS2 for self-attention and downsampling (distributed upsample not yet implemented)
Optional per-head QK RMS norm (use_qknorm) for AttentionS2 and NeighborhoodAttentionS2; shape checks across attention layers
Fixed Q/K/V projection gain when input dim != embedding dim
Breaking: default NeighborhoodAttentionS2 scale changed from 1/sqrt(k_channels) to 1/sqrt(k_channels // num_heads) to match standard MHA head-dim scaling; only configurations with num_heads > 1 are affected
Faster Legendre coefficient precomputation for SHT layers
Differentiable polar_halo_exchange and get_group_neighbors for distributed attention
More robust distributed transpose; _reduce clones before all_reduce for torch.compile compatibility
Fixed NaNs caused by overflow in the Galewsky initial condition; added a convolution adapter for mismatched residual channel counts
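
To gauge the impact of the breaking NeighborhoodAttentionS2 scale change on an existing model, the old and new defaults can be compared directly. The following is a minimal sketch in plain Python that uses only the two formulas quoted in the note; the helper names `old_scale` and `new_scale` are hypothetical and are not part of the library.

```python
import math


def old_scale(k_channels: int) -> float:
    """Previous default: scale by the full key channel count (hypothetical helper)."""
    return 1.0 / math.sqrt(k_channels)


def new_scale(k_channels: int, num_heads: int) -> float:
    """New default: scale by the per-head dimension (hypothetical helper)."""
    head_dim = k_channels // num_heads
    return 1.0 / math.sqrt(head_dim)


k_channels, num_heads = 256, 8

before = old_scale(k_channels)
after = new_scale(k_channels, num_heads)

print(f"old default scale: {before}")
print(f"new default scale: {after}")
# When k_channels is divisible by num_heads, the ratio equals sqrt(num_heads).
print(f"ratio new/old:     {after / before}")
```

With a single head the two formulas coincide, so only multi-head configurations see different attention logits. Checkpoints trained under the old default may need retraining or an explicitly chosen scale to reproduce earlier results.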
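
For readers unfamiliar with per-head QK RMS normalization (the `use_qknorm` option), the idea can be sketched as follows: the channel vector of a query or key is split into heads, and each head's slice is divided by its own root mean square. This is an illustration under stated assumptions, not the library's implementation; the function names, the `eps` value, and the absence of a learnable gain are all assumptions.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Divide a vector by its root mean square (eps is an assumed stabilizer)."""
    mean_sq = sum(v * v for v in vec) / len(vec)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms for v in vec]


def per_head_rms_norm(vec, num_heads, eps=1e-6):
    """Normalize each head's slice of a channel vector independently."""
    head_dim = len(vec) // num_heads
    if head_dim * num_heads != len(vec):
        raise ValueError("channel count must be divisible by num_heads")
    out = []
    for h in range(num_heads):
        out.extend(rms_norm(vec[h * head_dim:(h + 1) * head_dim], eps))
    return out


# Two heads with very different magnitudes end up on a comparable scale.
q = [3.0, 4.0, 0.1, 0.2]
q_normed = per_head_rms_norm(q, num_heads=2)
print(q_normed)
```

Normalizing queries and keys per head bounds the magnitude of the attention logits regardless of the input scale, which is the usual motivation for this kind of option.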