feat: add ASVD (activation-aware SVD) to FC_Decomposer (#38)
Merged
Phase 1 of misc module revamp:

- Remove unused `import F` from bn_folding and fc_decomposer
- Replace cpu_optimizer with optimize_for_cpu (torch.compile backend)
- Deprecate old accelerate_model_for_cpu with a shim
- Fix bug: torch.jit.script doesn't use example_input (was a dead param)
- Remove dependency on deprecated optimize_for_mobile
- Add tests (was skip_exec with zero coverage)
- Add conv_decomposer.ipynb and cpu_optimizer.ipynb to _quarto.yml sidebar
- Add cpu_optimizer to misc/all.py exports
- Fix rank_ratio → percent_removed doc bug in fc_decomposer tutorial
Phase 2 of misc module revamp.

FC_Decomposer + Conv_Decomposer:

- Add energy_threshold: auto rank selection via singular value energy retention (e.g., 0.99 keeps 99% of energy). Mutually exclusive with percent_removed.
- Add layers/exclude: per-layer control using exact layer names (matching Sparsifier's dict-based pattern, not regex)
- Shared helpers: _rank_from_energy, _should_decompose

Conv_Decomposer:

- Expose n_iter (default 10, was hardcoded 5) and tol (1e-4) for HOOI
- Early stopping: HOOI exits when factor matrices converge within tol

Traversal refactored from recursive _modules to named_modules() + parent replacement (cleaner, handles nested modules correctly). All backward compatible — new params have defaults matching old behavior.
Pass calibration data to get better decomposition — channels with higher activations are prioritized during SVD truncation.

Algorithm (from Yuan et al., 2024):

1. Collect per-channel activation RMS via forward hooks
2. Scale weight columns: W_scaled = W * diag(rms)
3. SVD on W_scaled → truncate to rank k
4. Undo scaling: W2 = W2 / diag(rms)

The scaling cancels out exactly — only the truncation decision changes. Backward compatible: data=None gives standard SVD.

Usage: FC_Decomposer().decompose(model, 0.5, data=[calibration_batch])
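The four steps above can be sketched in plain PyTorch. This is a minimal illustration, not the repo's code; the `eps` stabilizer and the exact factor layout are assumptions:

```python
import torch

def asvd_factors(W: torch.Tensor, rms: torch.Tensor, k: int):
    """Activation-aware SVD truncation (sketch, not the repo's exact code).
    W is (out_features, in_features); rms is the per-input-channel
    activation RMS, length in_features."""
    eps = 1e-8                            # assumed stabilizer for tiny RMS values
    s = rms + eps
    # 2. Scale weight columns by activation RMS (broadcasts over columns)
    W_scaled = W * s
    # 3. SVD on the scaled weight, truncate to rank k
    U, S, Vh = torch.linalg.svd(W_scaled, full_matrices=False)
    W1 = U[:, :k] * S[:k]                 # (out, k)
    W2 = Vh[:k, :]                        # (k, in), still in scaled space
    # 4. Undo the scaling on the input side
    W2 = W2 / s
    return W1, W2                         # W1 @ W2 approximates W
```

At full rank the scaling cancels exactly and `W1 @ W2` reproduces `W`; only when `k` is below full rank does the activation weighting change which subspace is kept.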
Conv_Decomposer now supports two methods:

- method='tucker' (default): 3 layers — pointwise compress + spatial + pointwise expand
- method='svd' (new): 2 layers — spatial at reduced rank + pointwise expand

SVD reshapes the 4D weight to (C_out, C_in*K*K), applies standard SVD, then splits into a spatial conv (C_in → R) and pointwise conv (R → C_out). Simpler, less overhead, better when moderate compression is enough.

Usage:

Conv_Decomposer().decompose(model, 0.5, method='svd')     # 2 layers
Conv_Decomposer().decompose(model, 0.5, method='tucker')  # 3 layers (default)
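The reshape-then-split step can be sketched as below. This is an illustration under simplifying assumptions (no dilation or groups handling; bias kept on the pointwise layer), not the repo's implementation:

```python
import torch
import torch.nn as nn

def conv_svd_decompose(conv: nn.Conv2d, rank: int) -> nn.Sequential:
    """Sketch of the 2-layer 'svd' method: reshape the 4D weight to
    (C_out, C_in*K*K), SVD-truncate, then split into a spatial conv
    (C_in -> rank) and a pointwise conv (rank -> C_out)."""
    C_out, C_in, Kh, Kw = conv.weight.shape
    W2d = conv.weight.reshape(C_out, C_in * Kh * Kw)
    U, S, Vh = torch.linalg.svd(W2d, full_matrices=False)
    # Spatial conv carries Vh (rank x C_in*Kh*Kw), reshaped back to 4D;
    # it inherits the original stride and padding
    spatial = nn.Conv2d(C_in, rank, (Kh, Kw), stride=conv.stride,
                        padding=conv.padding, bias=False)
    spatial.weight.data = Vh[:rank].reshape(rank, C_in, Kh, Kw)
    # Pointwise conv carries U*S (C_out x rank) and the original bias
    pointwise = nn.Conv2d(rank, C_out, 1, bias=conv.bias is not None)
    pointwise.weight.data = (U[:, :rank] * S[:rank]).reshape(C_out, rank, 1, 1)
    if conv.bias is not None:
        pointwise.bias.data = conv.bias.data
    return nn.Sequential(spatial, pointwise)
```

At full rank (R = min(C_out, C_in*K*K)) the two-layer stack is numerically equivalent to the original conv, which makes the routine easy to sanity-check before compressing.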
Conv_Decomposer now supports 4 decomposition methods:

| Method    | Layers | Decomposes           | Structure                      |
|-----------|--------|----------------------|--------------------------------|
| 'tucker'  | 3      | Both channels        | 1×1 + K×K + 1×1                |
| 'svd'     | 2      | Output channels      | K×K + 1×1                      |
| 'spatial' | 2      | Kernel (K×K→K×1+1×K) | K×1 + 1×K (grouped)            |
| 'cp'      | 4      | Everything           | 1×1 + K×1(dw) + 1×K(dw) + 1×1  |

Usage:

Conv_Decomposer().decompose(model, 0.5, method='spatial')  # K×K → K×1 + 1×K
Conv_Decomposer().decompose(model, 0.5, method='cp')       # max compression
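The 'spatial' idea, applied to a single K×K kernel, can be sketched with an SVD of the kernel itself. The repo applies this per channel via grouped convs; those details are assumed here:

```python
import torch

def separate_kernel(kernel: torch.Tensor, rank: int = 1):
    """Approximate one K×K kernel as a sum of `rank` separable
    K×1 and 1×K filters via truncated SVD (illustrative sketch)."""
    U, S, Vh = torch.linalg.svd(kernel)                     # kernel is (K, K)
    # Split sqrt(S) between the two factors so neither blows up
    vertical = U[:, :rank] * S[:rank].sqrt()                # (K, rank): K×1 filters
    horizontal = Vh[:rank] * S[:rank].sqrt().unsqueeze(1)   # (rank, K): 1×K filters
    return vertical, horizontal                             # vertical @ horizontal ≈ kernel
```

A genuinely separable kernel (e.g. a Sobel filter, which is an outer product) is recovered exactly at rank 1; generic kernels lose only the energy in the discarded singular values.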
- Replace _unfold with _mode_unfold using rearrange (clearer intent)
- Spatial: vectorize with batched SVD (no more O(C_out×C_in) loop)
- CP: vectorize spatial decomposition with batched SVD
- Tucker/SVD: use rearrange for weight reshaping (replaces unsqueeze chains)
- All methods: cleaner, faster on GPU, same results
…work docs

- Tucker: pass activation RMS as input_scale to HOOI — weights the mode-1 unfolding by activation statistics (distribution-aware Tucker)
- SVD: scale input channels by activation RMS before SVD, undo after (same ASVD pattern as FC_Decomposer)
- Usage: Conv_Decomposer().decompose(model, 0.5, data=[batch])
- Backward compatible: data=None = standard decomposition
- Document future work: LayerNorm_Folder, NuclearNormCallback, latency-aware rank selection
Activation-aware Tucker/SVD increases raw reconstruction error on small CNNs (15% accuracy drop on Pets/ResNet-18). The 4D tensor structure makes exact scale/unscale (which works for FC's 2D SVD) incorrect — the weighted HOOI optimizes a different objective than standard HOOI, and projecting the original weight onto the scaled factors introduces error. Keep ASVD only in FC_Decomposer (2D SVD, exact scale/unscale). Document Conv activation-aware as future work pending reference impl.
This was referenced Apr 13, 2026
Strip execution metadata to pass CI's clean-checkout check.
In CI (Python 3.12), torch.jit.trace may emit additional warnings. Filter for DeprecationWarning specifically instead of asserting exact count.
Summary
Phase 3 of misc module revamp. Adds activation-aware SVD (ASVD) from Yuan et al. 2024.
What it does
Standard SVD treats all input channels equally. ASVD weights channels by their actual activation magnitude before decomposing — channels the model uses heavily are harder to truncate.
Usage

FC_Decomposer().decompose(model, 0.5, data=[calibration_batch])
Algorithm
1. Scale weight columns: W_scaled = W * diag(rms + eps)
2. SVD on W_scaled, truncate to rank k
3. Unscale: W2 = W2 / diag(rms + eps)

The scaling cancels out exactly — only the truncation decision changes.
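A tiny numeric illustration of that last claim, with toy sizes and a made-up RMS vector (channel 0 standing in for a heavily-activated channel):

```python
import torch

torch.manual_seed(0)
W = torch.randn(8, 8)
rms = torch.tensor([10., 1., 1., 1., 1., 1., 1., 1.])  # channel 0 dominates

def truncate(W, k):
    """Standard rank-k SVD truncation."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * S[:k]) @ Vh[:k]

plain = truncate(W, 4)                 # plain SVD at rank 4
scaled = truncate(W * rms, 4) / rms    # scale, truncate, unscale (ASVD)

# ASVD reconstructs the heavily-activated column much more faithfully,
# at the cost of the lightly-activated ones
err_plain = (plain[:, 0] - W[:, 0]).norm()
err_asvd = (scaled[:, 0] - W[:, 0]).norm()
```

At full rank (`k = 8`) the scale/unscale round-trip returns `W` exactly, so passing `data` changes nothing unless truncation actually occurs.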
Also includes (from PR #37)
- energy_threshold for automatic rank selection
- layers/exclude for per-layer control
- _collect_activation_rms helper (reusable by Conv_Decomposer)

Test plan
- data=None matches standard SVD behavior
- nbdev testsuite passes

Reference: ASVD4LLM