[OPT] x86: optimize PixelShuffle with SIMD block transpose #6690
Merged
nihui merged 4 commits into Tencent:master on Apr 22, 2026
Conversation
Summary: Add an x86 SIMD fast path for PixelShuffle (r=2, r=4) to replace the scalar stride store in the base implementation. The kernel performs an r-by-vector in-block transpose so the output write is contiguous, yielding a ~3x speedup on single-layer benchmarks (loop=100, min metric) for both 1T and 8T on Zen5. Uncommon upscale factors fall back to the base layer implementation to preserve correctness.

Changes:
1. Add src/layer/x86/pixelshuffle_x86.h declaring PixelShuffle_x86
2. Implement the r=2 path using AVX unpacklo/unpackhi + permute2f128 (8 cols/step) and SSE unpack (4 cols/step)
3. Implement the r=4 path using SSE _MM_TRANSPOSE4_PS (4 cols/step), covering both CRD (mode=0) and DCR (mode=1)
4. Fall back to PixelShuffle::forward for r=1/r=3/other shapes and non-pack1 / non-fp32 inputs
Codecov Report
❌ Patch coverage is

```
@@           Coverage Diff            @@
##           master    #6690   +/-   ##
========================================
  Coverage   93.96%   93.96%
========================================
  Files         932      933       +1
  Lines      299059   299668     +609
========================================
+ Hits       280998   281597     +599
- Misses      18061    18071      +10
```
Summary: Remove the non-fp32 and non-pack1 fallback checks from PixelShuffle_x86 because the net forward path already normalizes input layout and storage according to layer capabilities before calling forward(). This keeps the fast path focused on the only cases that can reach it in normal framework execution.

Changes:
1. Remove the elembits guard that redundantly fell back to the base PixelShuffle implementation
2. Remove the elempack guard that is made unreachable by convert_layout() for layers without packing support
3. Keep the upscale-factor fallback so unsupported r values still use the base implementation
Summary: Extend the PixelShuffle test set with shapes that specifically exercise the new x86 r=2 and r=4 SIMD kernels, including vector-width and scalar-tail paths. This improves regression coverage for the x86 fast path without changing the generic test harness.

Changes:
1. Add r=2 cases that cover AVX-width, SSE-width, and remainder execution in both PixelShuffle modes
2. Add r=4 cases that cover the 4-lane transpose path and scalar tail in both PixelShuffle modes
3. Run the expanded test_pixelshuffle target in the existing build-tests configuration
Summary: Rename the additional PixelShuffle coverage helper to match the existing numbered test naming convention used in this file. This keeps the testcase layout consistent while preserving the same x86-oriented shape coverage.

Changes:
1. Rename test_pixelshuffle_x86 to test_pixelshuffle_2
2. Update main() to call the renamed helper
3. Drop stale x86-specific inline comments from the helper body
nihui
approved these changes
Apr 22, 2026
Member
Thanks for your contribution!
Summary
Adds an x86-specific implementation of PixelShuffle (src/layer/x86/pixelshuffle_x86.{h,cpp}) that replaces the original scalar stride-store loop with an SSE/AVX block transpose, yielding a ~3× speedup on pure-PixelShuffle workloads for both single- and multi-thread paths, with no regressions on any other layer or model.

Motivation
The generic PixelShuffle::forward in src/layer/pixelshuffle.cpp uses a stride store inside the inner-most loop. Each write jumps upscale_factor floats, which defeats auto-vectorization and causes repeated dirty cache-line traffic. For common super-resolution networks (ESPCN, RealSR, waifu2x, etc.) the upscale heads are r=2 or r=4, so an SSE/AVX block transpose followed by contiguous stores is a natural fit.
Algorithm
For each (output channel p, sub-row sh) we gather r source channel pointers and walk them in lockstep. Each tile of vec_width × r is transposed and written contiguously:

- r == 2: AVX unpacklo_ps / unpackhi_ps + permute2f128 → 8 columns per step; SSE unpacklo_ps / unpackhi_ps → 4 columns per step.
- r == 4: SSE _MM_TRANSPOSE4_PS → 4 columns per step (writes 16 contiguous floats).
- r == 3 or r > 4: fall back to the base scalar implementation (correct but slow; rarely used in practice).
Both mode=0 (CRD) and mode=1 (DCR) are supported; only the source-channel index formula differs.
The new layer keeps support_packing = false (inherited from the base), so ncnn's packing machinery automatically unpacks to pack1 before forward runs; the fast path stays simple and only handles pack1 + fp32.

Correctness
Covers the existing test cases (r=1/2/3/4, mode=0/1, assorted channel counts).

Performance
Environment: Linux / WSL2, AMD Ryzen 7 9800X3D (Zen5, full AVX-512 family), g++, -O3 -DNDEBUG, AVX/AVX2/FMA/AVX-512 all enabled. Measurements from commit 900ca77 using benchncnn with loop=100 and taskset -c 0-7; reported as the min metric (most stable at sub-millisecond workloads).

Pure-PixelShuffle workloads
[Benchmark table: pixelshuffle_demo and pixelshuffle_stacked_demo, 1T/8T before/after timings]

End-to-end super-resolution network (ESPCN topology)
Synthetic ESPCN benchmark/espcn_demo.param: Input 256×256×3 → Conv5×5 (64) → ReLU → Conv3×3 (32) → ReLU → Conv3×3 (48) → PixelShuffle r=4 → 1024×1024×3.
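For orientation, that topology could be expressed as an ncnn .param file roughly like the following. This is a hypothetical reconstruction from the shapes above, not the PR's actual benchmark/espcn_demo.param; layer names and parameter ids are my best guess at ncnn's conventions:

```
7767517
7 7
Input            data    0 1 data 0=256 1=256 2=3
Convolution      conv1   1 1 data conv1 0=64 1=5 4=2 5=1 6=4800
ReLU             relu1   1 1 conv1 relu1
Convolution      conv2   1 1 relu1 conv2 0=32 1=3 4=1 5=1 6=18432
ReLU             relu2   1 1 conv2 relu2
Convolution      conv3   1 1 relu2 conv3 0=48 1=3 4=1 5=1 6=13824
PixelShuffle     shuffle 1 1 conv3 shuffle 0=4 1=0
```

The final 48 channels divide by r² = 16 to give the 3-channel 1024×1024 output.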
Conv work dominates this topology (≈4.8 GMACs vs. a pure-data-movement
PixelShuffle). The layer-level speedup is real but its absolute magnitude
(~0.7 ms) is small compared to the conv total. The optimization matters most
in networks where PixelShuffle is a larger fraction of the runtime
(upsampling heads, lightweight SR / style transfer, embedded / single-thread
deployments).
No regressions
All 35+ built-in benchmark models (squeezenet, mobilenet v1/v2/v3,
shufflenet v1/v2, googlenet, resnet18/50, vgg16, yolo v3/v4-tiny, nanodet,
efficientnet b0 / v2-b0, vision_transformer, etc.) were measured before/after
at both threads=1 and threads=8. All timings are within noise (±3%). None of these networks use PixelShuffle.
Files
- src/layer/x86/pixelshuffle_x86.h — new header
- src/layer/x86/pixelshuffle_x86.cpp — new implementation