
[OPT] x86: optimize PixelShuffle with SIMD block transpose #6690

Merged
nihui merged 4 commits into Tencent:master from crafcat7:feat/x86-pixelshuffle
Apr 22, 2026

Conversation

@crafcat7
Contributor

Summary

Adds an x86-specific implementation of PixelShuffle (src/layer/x86/pixelshuffle_x86.{h,cpp}) that replaces the original scalar stride-store loop with an SSE/AVX block transpose. This yields a ~3× speedup on pure-PixelShuffle workloads in both single- and multi-thread paths, with no regressions on any other layer or model.

Motivation

The generic PixelShuffle::forward in src/layer/pixelshuffle.cpp uses a stride store
inside the innermost loop:

for (int j = 0; j < w; j++)
{
    outptr[0] = sptr[0];
    sptr++;
    outptr += upscale_factor; // stride store
}

Each write jumps upscale_factor floats, which defeats auto-vectorization and causes
repeated dirty cache-line traffic. For common super-resolution networks (ESPCN,
RealSR, waifu2x, etc.) the upscale heads are r=2 or r=4, so an SSE/AVX block
transpose followed by contiguous stores is a natural fit.

Algorithm

For each (output_channel p, sub-row sh) we gather r source channel pointers and
walk them in lockstep. Each tile of vec_width × r is transposed and written
contiguously:

  • r == 2: AVX unpacklo_ps / unpackhi_ps + permute2f128 → 8 columns per step;
    SSE unpacklo_ps / unpackhi_ps → 4 columns per step.
  • r == 4: SSE _MM_TRANSPOSE4_PS → 4 columns per step (writes 16 contiguous floats).
  • r == 3 or r > 4: fall back to the base scalar implementation (correct but slow;
    rarely used in practice).
  • Both mode=0 (CRD) and mode=1 (DCR) are supported; only the source channel
    index formula differs.

The new layer keeps support_packing = false (inherited from the base), so ncnn's
packing machinery automatically unpacks to pack1 before forward runs; the fast
path stays simple and only handles pack1 + fp32.

Correctness

ctest -R test_pixelshuffle --output-on-failure
Passed

Covers the existing test cases (r=1/2/3/4, mode=0/1, assorted channel counts).

Performance

Environment: Linux / WSL2, AMD Ryzen 7 9800X3D (Zen5, full AVX-512 family), g++,
-O3 -DNDEBUG, AVX/AVX2/FMA/AVX-512 all enabled. Measurements from commit
900ca77 using benchncnn with loop=100 and taskset -c 0-7; reported as
the min metric (most stable at sub-millisecond workloads).

Pure-PixelShuffle workloads

| Network | Shape | Threads | Baseline (ms) | Optimized (ms) | Speedup |
| --- | --- | --- | --- | --- | --- |
| pixelshuffle_demo | 256×256×64 → 1024×1024×4 (r=4) | 1 | 0.85 | 0.29 | 2.93× |
| pixelshuffle_demo | same | 8 | 0.28 | 0.09 | 3.11× |
| pixelshuffle_stacked_demo | two stacked r=4 | 1 | 0.97 | 0.32 | 3.03× |
| pixelshuffle_stacked_demo | same | 8 | 0.34 | 0.10 | 3.40× |

End-to-end super-resolution network (ESPCN topology)

Synthetic ESPCN benchmark/espcn_demo.param:
Input 256×256×3 → Conv5×5 (64) → ReLU → Conv3×3 (32) → ReLU → Conv3×3 (48) → PixelShuffle r=4 → 1024×1024×3.

| Threads | Baseline (ms) | Optimized (ms) | Speedup |
| --- | --- | --- | --- |
| 1 | 18.66 | 18.77 | ≈1× (within noise) |
| 8 | 10.96 | 10.12 | 1.08× |

Conv work dominates this topology (≈4.8 GMACs vs. a pure-data-movement
PixelShuffle). The layer-level speedup is real but its absolute magnitude
(~0.7 ms) is small compared to the conv total. The optimization matters most
in networks where PixelShuffle is a larger fraction of the runtime
(upsampling heads, lightweight SR / style transfer, embedded / single-thread
deployments).

No regressions

All 35+ built-in benchmark models (squeezenet, mobilenet v1/v2/v3,
shufflenet v1/v2, googlenet, resnet18/50, vgg16, yolo v3/v4-tiny, nanodet,
efficientnet b0 / v2-b0, vision_transformer, etc.) were measured before/after
at both threads=1 and threads=8. All timings are within noise (±3%). None
of these networks use PixelShuffle.

Files

  • src/layer/x86/pixelshuffle_x86.h — new header
  • src/layer/x86/pixelshuffle_x86.cpp — new implementation

Summary:
  Add x86 SIMD fast-path for PixelShuffle (r=2, r=4) to replace the
  scalar stride-store in the base implementation. The kernel performs an
  r-by-vector in-block transpose so the output write is contiguous,
  yielding ~3x speedup on single-layer benchmarks (loop=100, min) for
  both 1T and 8T on Zen5. Uncommon upscale factors fall back to the
  base layer implementation to preserve correctness.

Changes:
  1. Add src/layer/x86/pixelshuffle_x86.h declaring PixelShuffle_x86
  2. Implement r=2 path using AVX unpacklo/unpackhi + permute2f128 (8 cols/step) and SSE unpack (4 cols/step)
  3. Implement r=4 path using SSE _MM_TRANSPOSE4_PS (4 cols/step), covering both CRD (mode=0) and DCR (mode=1)
  4. Fall back to PixelShuffle::forward for r=1/r=3/other shapes and non-pack1 / non-fp32 inputs
@github-actions github-actions Bot added the x86 label Apr 21, 2026
@codecov-commenter

codecov-commenter commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 98.73418% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 93.96%. Comparing base (71b1a61) to head (266825e).
⚠️ Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
src/layer/x86/pixelshuffle_x86.cpp 98.73% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master    #6690    +/-   ##
========================================
  Coverage   93.96%   93.96%            
========================================
  Files         932      933     +1     
  Lines      299059   299668   +609     
========================================
+ Hits       280998   281597   +599     
- Misses      18061    18071    +10     


Summary:
  Remove the non-fp32 and non-pack1 fallback checks from PixelShuffle_x86 because the net forward path already normalizes input layout and storage according to layer capabilities before calling forward(). This keeps the fast-path focused on the only cases that can reach it in normal framework execution.

Changes:
  1. Remove the elembits guard that redundantly fell back to the base PixelShuffle implementation
  2. Remove the elempack guard that is made unreachable by convert_layout() for layers without packing support
  3. Keep the upscale-factor fallback so unsupported r values still use the base implementation

Summary:
  Extend the PixelShuffle test set with shapes that specifically exercise the new x86 r=2 and r=4 SIMD kernels, including vector-width and scalar-tail paths. This improves regression coverage for the x86 fast-path without changing the generic test harness.

Changes:
  1. Add r=2 cases that cover AVX-width, SSE-width, and remainder execution in both PixelShuffle modes
  2. Add r=4 cases that cover the 4-lane transpose path and scalar tail in both PixelShuffle modes
  3. Run the expanded test_pixelshuffle target in the existing build-tests configuration
@github-actions github-actions Bot added the test label Apr 22, 2026
Summary:
  Rename the additional PixelShuffle coverage helper to match the existing numbered test naming convention used in this file. This keeps the testcase layout consistent while preserving the same x86-oriented shape coverage.

Changes:
  1. Rename test_pixelshuffle_x86 to test_pixelshuffle_2
  2. Update main() to call the renamed helper
  3. Drop stale x86-specific inline comments from the helper body
@nihui nihui merged commit d0d5063 into Tencent:master Apr 22, 2026
106 of 109 checks passed
@nihui
Member

nihui commented Apr 22, 2026

Thanks for your contribution!

