
[OPT] x86: optimize PixelShuffle with SIMD block transpose #6690

Merged
nihui merged 4 commits into Tencent:master from crafcat7:feat/x86-pixelshuffle
Apr 22, 2026

Conversation

@crafcat7
Contributor

Summary

Adds an x86-specific implementation of PixelShuffle (src/layer/x86/pixelshuffle_x86.{h,cpp}) that replaces the original scalar stride-store loop with an SSE/AVX block transpose. This yields a ~3× speedup on pure-PixelShuffle workloads in both single- and multi-thread paths, with no regressions on any other layer or model.

Motivation

The generic PixelShuffle::forward in src/layer/pixelshuffle.cpp uses a stride store
inside the innermost loop:

for (int j = 0; j < w; j++)
{
    outptr[0] = sptr[0];
    sptr++;
    outptr += upscale_factor; // stride store
}

Each write jumps upscale_factor floats, which defeats auto-vectorization and causes
repeated dirty cache-line traffic. For common super-resolution networks (ESPCN,
RealSR, waifu2x, etc.) the upscale heads are r=2 or r=4, so an SSE/AVX block
transpose followed by contiguous stores is a natural fit.

Algorithm

For each (output_channel p, sub-row sh) we gather r source channel pointers and
walk them in lockstep. Each tile of vec_width × r is transposed and written
contiguously:

  • r == 2: AVX unpacklo_ps / unpackhi_ps + permute2f128 → 8 columns per step;
    SSE unpacklo_ps / unpackhi_ps → 4 columns per step.
  • r == 4: SSE _MM_TRANSPOSE4_PS → 4 columns per step (writes 16 contiguous floats).
  • r == 3 or r > 4: fall back to the base scalar implementation (correct but slow;
    rarely used in practice).
  • Both mode=0 (CRD) and mode=1 (DCR) are supported; only the source channel
    index formula differs.

The new layer keeps support_packing = false (inherited from the base), so ncnn's
packing machinery automatically unpacks to pack1 before forward runs; the fast
path stays simple and only handles pack1 + fp32.

Correctness

ctest -R test_pixelshuffle --output-on-failure
Passed

Covers the existing test cases (r=1/2/3/4, mode=0/1, assorted channel counts).

Performance

Environment: Linux / WSL2, AMD Ryzen 7 9800X3D (Zen5, full AVX-512 family), g++,
-O3 -DNDEBUG, AVX/AVX2/FMA/AVX-512 all enabled. Measurements from commit
900ca77 using benchncnn with loop=100 and taskset -c 0-7; reported as
the min metric (most stable at sub-millisecond workloads).

Pure-PixelShuffle workloads

| Network | Shape | Threads | Baseline (ms) | Optimized (ms) | Speedup |
| --- | --- | --- | --- | --- | --- |
| pixelshuffle_demo | 256×256×64 → 1024×1024×4 (r=4) | 1 | 0.85 | 0.29 | 2.93× |
| pixelshuffle_demo | same | 8 | 0.28 | 0.09 | 3.11× |
| pixelshuffle_stacked_demo | two stacked r=4 | 1 | 0.97 | 0.32 | 3.03× |
| pixelshuffle_stacked_demo | same | 8 | 0.34 | 0.10 | 3.40× |

End-to-end super-resolution network (ESPCN topology)

Synthetic ESPCN benchmark/espcn_demo.param:
Input 256×256×3 → Conv5×5 (64) → ReLU → Conv3×3 (32) → ReLU → Conv3×3 (48) → PixelShuffle r=4 → 1024×1024×3.

| Threads | Baseline (ms) | Optimized (ms) | Speedup |
| --- | --- | --- | --- |
| 1 | 18.66 | 18.77 | ≈1× (within noise) |
| 8 | 10.96 | 10.12 | 1.08× |

Conv work dominates this topology (≈4.8 GMACs vs. a pure-data-movement
PixelShuffle). The layer-level speedup is real but its absolute magnitude
(~0.7 ms) is small compared to the conv total. The optimization matters most
in networks where PixelShuffle is a larger fraction of the runtime
(upsampling heads, lightweight SR / style transfer, embedded / single-thread
deployments).

No regressions

All 35+ built-in benchmark models (squeezenet, mobilenet v1/v2/v3,
shufflenet v1/v2, googlenet, resnet18/50, vgg16, yolo v3/v4-tiny, nanodet,
efficientnet b0 / v2-b0, vision_transformer, etc.) were measured before/after
at both threads=1 and threads=8. All timings are within noise (±3%). None
of these networks use PixelShuffle.

Files

  • src/layer/x86/pixelshuffle_x86.h — new header
  • src/layer/x86/pixelshuffle_x86.cpp — new implementation

Summary:
  Add x86 SIMD fast-path for PixelShuffle (r=2, r=4) to replace the
  scalar stride-store in the base implementation. The kernel performs an
  r-by-vector in-block transpose so the output write is contiguous,
  yielding ~3x speedup on single-layer benchmarks (loop=100, min) for
  both 1T and 8T on Zen5. Uncommon upscale factors fall back to the
  base layer implementation to preserve correctness.

Changes:
  1. Add src/layer/x86/pixelshuffle_x86.h declaring PixelShuffle_x86
  2. Implement r=2 path using AVX unpacklo/unpackhi + permute2f128 (8 cols/step) and SSE unpack (4 cols/step)
  3. Implement r=4 path using SSE _MM_TRANSPOSE4_PS (4 cols/step), covering both CRD (mode=0) and DCR (mode=1)
  4. Fall back to PixelShuffle::forward for r=1/r=3/other shapes and non-pack1 / non-fp32 inputs
@github-actions github-actions Bot added the x86 label Apr 21, 2026
@codecov-commenter

codecov-commenter commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 98.73418% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 93.96%. Comparing base (71b1a61) to head (266825e).
⚠️ Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
src/layer/x86/pixelshuffle_x86.cpp 98.73% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master    #6690    +/-   ##
========================================
  Coverage   93.96%   93.96%            
========================================
  Files         932      933     +1     
  Lines      299059   299668   +609     
========================================
+ Hits       280998   281597   +599     
- Misses      18061    18071    +10     


Summary:
  Remove the non-fp32 and non-pack1 fallback checks from PixelShuffle_x86 because the net forward path already normalizes input layout and storage according to layer capabilities before calling forward(). This keeps the fast-path focused on the only cases that can reach it in normal framework execution.

Changes:
  1. Remove the elembits guard that redundantly fell back to the base PixelShuffle implementation
  2. Remove the elempack guard that is made unreachable by convert_layout() for layers without packing support
  3. Keep the upscale-factor fallback so unsupported r values still use the base implementation

Summary:
  Extend the PixelShuffle test set with shapes that specifically exercise the new x86 r=2 and r=4 SIMD kernels, including vector-width and scalar-tail paths. This improves regression coverage for the x86 fast-path without changing the generic test harness.

Changes:
  1. Add r=2 cases that cover AVX-width, SSE-width, and remainder execution in both PixelShuffle modes
  2. Add r=4 cases that cover the 4-lane transpose path and scalar tail in both PixelShuffle modes
  3. Run the expanded test_pixelshuffle target in the existing build-tests configuration
@github-actions github-actions Bot added the test label Apr 22, 2026
Summary:
  Rename the additional PixelShuffle coverage helper to match the existing numbered test naming convention used in this file. This keeps the testcase layout consistent while preserving the same x86-oriented shape coverage.

Changes:
  1. Rename test_pixelshuffle_x86 to test_pixelshuffle_2
  2. Update main() to call the renamed helper
  3. Drop stale x86-specific inline comments from the helper body
@nihui nihui merged commit d0d5063 into Tencent:master Apr 22, 2026
106 of 109 checks passed
@nihui
Member

nihui commented Apr 22, 2026

Thanks for your contribution!

