feat: add TopK and Gather ncnn operators for YOLOv10 deployment by vlordier · Pull Request #6668 · Tencent/ncnn

vlordier · 2026-04-11T07:31:53Z

Summary

This PR adds the missing TopK and Gather ncnn operators and fixes the PNNX conversion passes so that YOLOv10 models convert end-to-end without any ignored core operators.

Problem

YOLOv10 integrates NMS directly into the model graph, including TopK and torch.gather operators. The upstream PNNX pipeline marks both as ignore, making YOLOv10 unusable on ncnn.

Solution

1. Fix TopK conversion (pass_level2 → pass_ncnn)

pass_level2/torch_topk.cpp: Capture k, dim, largest, sorted as prim::Constant parameters instead of tensor inputs, enabling the ncnn pass to access these values
pass_ncnn/TopK.cpp: Updated match pattern to consume the parameterized torch.topk operator and produce ncnn TopK layer with correct axis/largest/sorted params

2. Add Gather ncnn operator

src/layer/gather.h — Gather layer header
src/layer/gather.cpp — Implementation supporting 1D/2D/3D tensors with arbitrary axis
src/CMakeLists.txt — Registered as ncnn_add_layer(Gather)
pass_ncnn/torch_gather.cpp (NEW) — Match torch.gather with captured dim parameter and convert to ncnn Gather

3. PNNX build system fixes

Per-target Torch include directories to avoid protobuf header conflicts (Torch bundles protobuf v3, system has v34.x)
Abseil linking for Homebrew protobuf 34.x (macOS/Linux)
Directory-level INCLUDE_DIRECTORIES_BEFORE for protobuf headers

Verified

YOLOv10n converts with 2 TopK + 2 Gather layers, only cosmetic ops ignored:

TopK                     topk_162   1 2 296 297 298 0=-1 1=1 2=1
TopK                     topk_163   1 2 304 305 306 0=-1 1=1 2=1
Gather                   gather_165 2 1 294 302 303 0=1
Gather                   gather_166 2 1 292 317 318 0=1

Previously:

ignore torch.topk torch.topk_11 param dim=-1
ignore torch.topk torch.topk_11 param k=300
ignore torch.gather torch.gather_69 param dim=1

- Generate TopK class definition in pnnx.py output with forward() method - Instantiate TopK modules in Model.__init__() with proper parameters - Update forward() method to call self.topk_name() instead of direct TopK() calls - Fixes pnnx inference to properly execute TopK operations using torch.topk() - Test confirms TopK ONNX→pnnx conversion and inference working correctly

- Fix IR pattern syntax to use explicit parameter names (axis=%, largest=%, sorted=%) - Replace incorrect parameter lookup from 'op_0.axis' to 'axis' to match captured names - TopK pass now properly fires during ONNX→pnnx→ncnn conversion - All TopK parameters (axis, largest, sorted) correctly captured and set in ncnn layers - End-to-end test confirms ONNX→pnnx→ncnn conversion with TopK working correctly

use c++03-style topk comparator and keep deterministic nan/inf ordering remove redundant constructor param initialization fix tests cmakelists alphabetical order (Tile before TopK) expand torch_topk onnx tests (k=0/k=1, negative dim, sorted=false cases) drop generated topk onnx/pnnx/ncnn sidecar artifacts from repo

- Guard <algorithm>/<vector> behind #if NCNN_SIMPLESTL, include simplestl.h - Use std::partial_sort in simplestl mode (no std::nth_element available) - Guard <math.h> in tests behind #if !NCNN_SIMPLESTL to avoid simplemath.h conflict; define INFINITY/NAN as float expressions in simplestl mode - Fix cstep-unaware indexing for 3D/4D output tensors: use actual cstep for channel offset instead of assuming contiguous w*h layout - Convert #pragma omp parallel + inner #pragma omp for to #pragma omp parallel for to avoid __kmpc_barrier in simpleomp mode - Fix copyright year 2026->2025 - Apply code-format whitespace cleanup

Topk ci tests

- pass_level2/torch_topk.cpp: capture k/dim/largest/sorted as parameters (prim::Constant) instead of tensor inputs, enabling ncnn pass matching - pass_level2/torch_gather.cpp: restore original pattern (dim as tensor) - pass_ncnn/TopK.cpp: match torch.topk with captured parameters and convert to ncnn TopK layer (axis, largest, sorted) - pass_ncnn/torch_gather.cpp (NEW): match torch.gather with 2 inputs (input, index) and captured dim parameter, convert to ncnn Gather layer - src/layer/gather.{h,cpp} (NEW): implement Gather ncnn operator supporting 1D/2D/3D tensors with arbitrary axis - PNNX CMakeLists fixes: - per-target Torch include dirs to avoid protobuf header conflicts - Abseil linking for Homebrew protobuf 34.x - disable onnxruntime auto-detection (protobuf conflict) - directory-level INCLUDE_DIRECTORIES_BEFORE for protobuf headers Verified: YOLOv10n converts with 2 TopK + 2 Gather layers, only cosmetic ops (Tensor.to, pnnx.Expression) ignored. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

Copilot

Pull request overview

This PR aims to enable end-to-end YOLOv10 deployment on ncnn by adding missing TopK and Gather operators, updating PNNX conversion passes to no longer ignore these ops, and adjusting the PNNX build system to avoid protobuf header/link conflicts.

Changes:

Add ncnn runtime layers: TopK and Gather, plus a test_topk unit test and CMake registrations.
Update PNNX conversion: capture torch.topk constants in level2 and add ncnn graph rewrite passes for torch.topk → TopK and torch.gather → Gather.
Adjust PNNX build/includes and add an ONNX TopK conversion test + a dedicated GitHub Actions workflow.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 12 comments.

Show a summary per file

File	Description
tools/pnnx/tests/onnx/test_torch_topk.py	Adds ONNX roundtrip test coverage for torch.topk conversion.
tools/pnnx/tests/onnx/CMakeLists.txt	Registers the new ONNX TopK test.
tools/pnnx/src/pass_onnx/shape_inference.cpp	Makes onnxruntime C API include path more robust across installs.
tools/pnnx/src/pass_onnx/fold_constants.cpp	Same onnxruntime C API include path robustness for constant folding.
tools/pnnx/src/pass_ncnn/torch_gather.cpp	Introduces torch.gather → ncnn Gather rewrite pass.
tools/pnnx/src/pass_ncnn/TopK.cpp	Introduces torch.topk → ncnn TopK rewrite pass (values+indices and values-only patterns).
tools/pnnx/src/pass_level2/torch_topk.cpp	Captures topk args as prim::Constant nodes so later passes can read k/dim/largest/sorted.
tools/pnnx/src/load_onnx.cpp	Adds multi-path onnxruntime C API header inclusion.
tools/pnnx/src/ir.cpp	Adds Python codegen support for TopK custom module emission.
tools/pnnx/src/CMakeLists.txt	Registers new ncnn passes and adjusts protobuf/abseil include/link handling.
tools/pnnx/CMakeLists.txt	Changes how protobuf is discovered and modifies onnxruntime detection behavior.
tests/test_topk.cpp	Adds CPU correctness tests for TopK (including NaN/Inf behavior and fastpaths).
tests/CMakeLists.txt	Registers TopK as a layer test target.
src/layer/topk.h / src/layer/topk.cpp	Implements the ncnn TopK layer.
src/layer/gather.h / src/layer/gather.cpp	Implements the ncnn Gather layer.
src/CMakeLists.txt	Registers TopK and Gather layers for build.
.github/workflows/topk-linux-test.yml	Adds a CI workflow intended to validate TopK (and a PNNX ONNX TopK test).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-11T07:38:40Z

+# Disable onnxruntime auto-detection — we only need torch2pnnx for YOLOv10
+set(onnxruntime_FOUND FALSE)


This unconditionally disables onnxruntime support (onnxruntime_FOUND is forced to FALSE). That disables building onnx2pnnx and breaks ONNX->PNNX conversion and the existing ONNX test suite. Please make this behavior opt-in (e.g., a CMake option) or keep auto-detection so existing workflows/users that rely on onnxruntime continue to work.

Suggested change

# Disable onnxruntime auto-detection — we only need torch2pnnx for YOLOv10

set(onnxruntime_FOUND FALSE)

# Allow specialized builds to disable onnxruntime explicitly, while preserving

# the default auto-detection behavior for existing ONNX->PNNX workflows.

option(PNNX_DISABLE_ONNXRUNTIME "disable onnxruntime support" OFF)

if(PNNX_DISABLE_ONNXRUNTIME)

message(STATUS "Building without onnxruntime support")

set(onnxruntime_FOUND FALSE)

endif()

Copilot · 2026-04-11T07:38:40Z

+    // index_blob should contain int64 or int32 indices
+    // For simplicity we treat it as float and cast
+    const int index_size = (int)index_blob.total();
+


Gather currently treats index_blob memory as int32 via a raw cast, but upstream indices are commonly int64 (and sometimes int32) depending on exporter. Reading an int64 tensor as int32 will corrupt indices and can segfault/out-of-bounds. Please branch on index_blob.elemsize (4 vs 8) and load indices accordingly (or convert into a temporary int32 buffer).

Copilot · 2026-04-11T07:38:41Z

+        op->params["0"] = new_axis;
+        op->params["1"] = largest;
+        op->params["2"] = sorted;
+    }


The captured k parameter from torch.topk is never written into the generated TopK layer params. As-is, the TopK layer will fall back to its default k, producing incorrect results for any k != 1. Extract k from captured_params and write it to the TopK param id used by TopK::load_param (pd.get(3, ...)), or explicitly wire k as a second input blob.

Copilot · 2026-04-11T07:38:41Z

+            {
+                fprintf(pyfp, "%s=", it.first.c_str());
+
+                const Parameter& param = it.second;
+                if (param.type == 2)


TopK Python codegen emits keyword arguments using op->params keys directly. After ncnn rewrite these keys are numeric strings ("0", "1", ...), which produces invalid Python syntax like "0=...". Map param ids to valid names (axis/largest/sorted/k) when emitting the TopK(...) constructor call.

Copilot · 2026-04-11T07:38:41Z

+            fprintf(pyfp, "    def __init__(self, axis=1, largest=1, sorted=1):\n");
+            fprintf(pyfp, "        super(TopK, self).__init__()\n");
+            fprintf(pyfp, "        self.axis = axis\n");
+            fprintf(pyfp, "        self.largest = largest\n");
+            fprintf(pyfp, "        self.sorted = sorted\n");
+            fprintf(pyfp, "    def forward(self, x, k):\n");
+            fprintf(pyfp, "        # Torch topk returns (values, indices)\n");
+            fprintf(pyfp, "        return torch.topk(x, k.item() if hasattr(k, 'item') else k, dim=self.axis, largest=bool(self.largest), sorted=bool(self.sorted))\n");


The generated TopK module signature is forward(self, x, k), but the ncnn rewrite pattern for TopK has only one input. This will cause the generated model forward to call the module with the wrong number of arguments. Align codegen with the rewritten graph by either (1) making forward(self, x) use a stored k parameter, or (2) ensuring the rewritten TopK op actually has a k input operand.

Suggested change

fprintf(pyfp, " def __init__(self, axis=1, largest=1, sorted=1):\n");

fprintf(pyfp, " super(TopK, self).__init__()\n");

fprintf(pyfp, " self.axis = axis\n");

fprintf(pyfp, " self.largest = largest\n");

fprintf(pyfp, " self.sorted = sorted\n");

fprintf(pyfp, " def forward(self, x, k):\n");

fprintf(pyfp, " # Torch topk returns (values, indices)\n");

fprintf(pyfp, " return torch.topk(x, k.item() if hasattr(k, 'item') else k, dim=self.axis, largest=bool(self.largest), sorted=bool(self.sorted))\n");

fprintf(pyfp, " def __init__(self, k=1, axis=1, largest=1, sorted=1):\n");

fprintf(pyfp, " super(TopK, self).__init__()\n");

fprintf(pyfp, " self.k = k\n");

fprintf(pyfp, " self.axis = axis\n");

fprintf(pyfp, " self.largest = largest\n");

fprintf(pyfp, " self.sorted = sorted\n");

fprintf(pyfp, " def forward(self, x):\n");

fprintf(pyfp, " # Torch topk returns (values, indices)\n");

fprintf(pyfp, " k = self.k.item() if hasattr(self.k, 'item') else self.k\n");

fprintf(pyfp, " return torch.topk(x, k, dim=self.axis, largest=bool(self.largest), sorted=bool(self.sorted))\n");

Copilot · 2026-04-11T07:38:42Z

+    // index_blob should contain int64 or int32 indices
+    // For simplicity we treat it as float and cast
+    const int index_size = (int)index_blob.total();
+


The Gather implementation comments say indices are treated as float and cast, but the code later reinterprets index_blob memory as int32. Indices are commonly int64 in exported graphs, so this will corrupt reads. Please load indices according to index_blob.elemsize (4 vs 8) or convert to a temporary int32 buffer.

Copilot · 2026-04-11T07:38:42Z

+
+    // Allocate output (same dtype as input, shape matches index)
+    Mat& top_blob = top_blobs[0];
+    top_blob.create(out_shape.w, out_shape.h, out_shape.c, input_blob.elemsize, input_blob.elempack, opt.blob_allocator);


Output allocation always uses Mat::create(w,h,c,...) which forces a 3D output, even when index_blob is 1D/2D (and potentially 4D). This breaks Gather's contract that output shape matches index shape. Allocate top_blob with the same dims as index_blob using the appropriate create overload.

Suggested change

top_blob.create(out_shape.w, out_shape.h, out_shape.c, input_blob.elemsize, input_blob.elempack, opt.blob_allocator);

if (out_shape.dims == 1)

top_blob.create(out_shape.w, input_blob.elemsize, input_blob.elempack, opt.blob_allocator);

else if (out_shape.dims == 2)

top_blob.create(out_shape.w, out_shape.h, input_blob.elemsize, input_blob.elempack, opt.blob_allocator);

else if (out_shape.dims == 3)

top_blob.create(out_shape.w, out_shape.h, out_shape.c, input_blob.elemsize, input_blob.elempack, opt.blob_allocator);

else if (out_shape.dims == 4)

top_blob.create(out_shape.w, out_shape.h, out_shape.d, out_shape.c, input_blob.elemsize, input_blob.elempack, opt.blob_allocator);

else

return -1;

Copilot · 2026-04-11T07:38:42Z

+    const float* inp = input_blob;
+    const int* idx = (const int*)index_blob;
+    float* out = top_blob;


Gather treats input and output tensors as float* (and writes via float*), but top_blob is allocated using input_blob.elemsize/elempack which may be fp16/int8/etc. This will produce incorrect results or memory corruption when elemsize != 4. Either restrict Gather to float32 with a runtime check or add proper type handling.

Copilot · 2026-04-11T07:38:42Z

+        else if (dims == 3)
+        {
+            // ncnn 3D layout: w * h * c, with cstride padding
+            size_t cstep = input_blob.cstep;
+            flat_in = coord_in[0] + coord_in[1] * input_blob.w + coord_in[2] * (int)cstep;


Only dims 1/2/3 are handled in flat index computation. For dims==4, flat_in remains incorrect and Gather will return wrong results silently. Either implement 4D indexing (w,h,d,c with cstep) or explicitly reject dims > 3 early with a clear error code.

Copilot · 2026-04-11T07:38:42Z

+    branches:
+    - topk-ci-tests


This workflow only runs on pushes to the 'topk-ci-tests' branch, so it will not run for pull_request events or for the repository's normal branches after merge. If it is meant to provide ongoing CI coverage, add appropriate pull_request/push triggers; otherwise consider not adding it to the mainline PR.

Suggested change

branches:

- topk-ci-tests

pull_request:

- src/layer/cast.{h,cpp}: extend Cast layer with int64 (type 5) and int32 (type 6) support, adding conversions int64↔float32 and int32↔float32 - pass_ncnn/tensor_to.cpp (NEW): convert Tensor.to (dtype cast) to ncnn Cast layer, mapping torch dtype strings to ncnn type codes - CMakeLists.txt: register tensor_to.cpp in pass_ncnn sources Verified: YOLOv10n Tensor.to (i64→f32) now converts to Cast layer instead of being ignored. Only cosmetic ops (pnnx.Expression) remain. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

- ir.cpp: store k in TopK.__init__, use forward(self, x) — k was a ctor param but forward() expected it as an input arg, causing runtime error - ir.cpp: pass k= in TopK instantiation (k_val from params["3"]) - gather.cpp: reject non-float32 data (elemsize != 4) and dims > 3 explicitly - pnnx/src/CMakeLists: replace invalid set_property(INCLUDE_DIRECTORIES_BEFORE) with include_directories(BEFORE ...) to correctly force protobuf header order - pnnx/tests/onnx: add test_torch_gather.py roundtrip test (1D/2D/3D, multiple axes, negative axis) and register it in CMakeLists

vlordier · 2026-04-16T14:57:59Z

Superseded by #6669 which consolidates all operators (TopK, Gather, GatherElements, Expand, Tile, Mod) with all review issues addressed.

vlordier and others added 28 commits April 10, 2026 12:24

Add TopK layer and pnnx ONNX TopK lowering

8bd7d30

Add ONNX torch_topk pnnx regression test

b2c445a

tests: add sorted=0 coverage for topk

4b4b87a

tests: remove generated topk onnx artifacts

c9e856e

pnnx: drop unrelated cmake and symlink changes

4d5b35f

topk: reuse per-thread scratch buffer in forward

5c11058

topk: optimize sorted path and k=0 fast return

226bd88

topk: add k=1 fast path for embedded runtime

6c5978b

topk: avoid pair temporaries in k=1 hot loop

e16514b

topk: reduce writeback branching in hot loop

00be7f8

topk: fast path unsorted full-k copy

1fe4463

topk: add small-k hot path for embedded runtime

6ea29eb

topk: add guarded neon fast path for k=1

7befff6

topk: fix neon k=1 inf initialization edge case

5ba7fbc

topk: make neon mask check arm-portable

e4b4073

topk: optimize small-k unsorted selection path

49dbc7b

tests: add values-only topk coverage in cpp and onnx

9d31f3b

apply code-format

2ea44dd

apply code-format changes

5674b1c

ci: add topk test coverage and pnnx onnx test

caa9de3

ci: fix pnnx test invocation — use ctest

4e39cb6

apply code-format changes

ca55f8a

Merge pull request #2 from vlordier/topk-ci-tests

2b5fa16

Topk ci tests

Copilot AI review requested due to automatic review settings April 11, 2026 07:31

github-actions Bot added the core label Apr 11, 2026

github-actions Bot added tool pnnx test layer labels Apr 11, 2026

Copilot started reviewing on behalf of vlordier April 11, 2026 07:32 View session

Copilot AI reviewed Apr 11, 2026

View reviewed changes

vlordier and others added 2 commits April 11, 2026 07:43

apply code-format changes

d68852d

vlordier force-pushed the fix-pnnx-gather-support branch from 33e04a2 to 93bd423 Compare April 11, 2026 08:59

Merge branch 'master' into fix-pnnx-gather-support

168cdea

vlordier closed this Apr 16, 2026

vlordier mentioned this pull request Apr 16, 2026

feat: add TopK, Gather, GatherElements, Expand, Tile, and Mod operators #6669

Open

vlordier deleted the fix-pnnx-gather-support branch April 16, 2026 16:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add TopK and Gather ncnn operators for YOLOv10 deployment#6668

feat: add TopK and Gather ncnn operators for YOLOv10 deployment#6668
vlordier wants to merge 31 commits intoTencent:masterfrom
vlordier:fix-pnnx-gather-support

vlordier commented Apr 11, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 11, 2026

Uh oh!

Copilot AI Apr 11, 2026

Uh oh!

Copilot AI Apr 11, 2026

Uh oh!

Copilot AI Apr 11, 2026

Uh oh!

Copilot AI Apr 11, 2026

Uh oh!

Copilot AI Apr 11, 2026

Uh oh!

Copilot AI Apr 11, 2026

Uh oh!

Copilot AI Apr 11, 2026

Uh oh!

Copilot AI Apr 11, 2026

Uh oh!

Copilot AI Apr 11, 2026

Uh oh!

vlordier commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		# Disable onnxruntime auto-detection — we only need torch2pnnx for YOLOv10
		set(onnxruntime_FOUND FALSE)

-# Disable onnxruntime auto-detection — we only need torch2pnnx for YOLOv10
-set(onnxruntime_FOUND FALSE)
+# Allow specialized builds to disable onnxruntime explicitly, while preserving
+# the default auto-detection behavior for existing ONNX->PNNX workflows.
+option(PNNX_DISABLE_ONNXRUNTIME "disable onnxruntime support" OFF)
+if(PNNX_DISABLE_ONNXRUNTIME)
+    message(STATUS "Building without onnxruntime support")
+    set(onnxruntime_FOUND FALSE)
+endif()

-    top_blob.create(out_shape.w, out_shape.h, out_shape.c, input_blob.elemsize, input_blob.elempack, opt.blob_allocator);
+    if (out_shape.dims == 1)
+        top_blob.create(out_shape.w, input_blob.elemsize, input_blob.elempack, opt.blob_allocator);
+    else if (out_shape.dims == 2)
+        top_blob.create(out_shape.w, out_shape.h, input_blob.elemsize, input_blob.elempack, opt.blob_allocator);
+    else if (out_shape.dims == 3)
+        top_blob.create(out_shape.w, out_shape.h, out_shape.c, input_blob.elemsize, input_blob.elempack, opt.blob_allocator);
+    else if (out_shape.dims == 4)
+        top_blob.create(out_shape.w, out_shape.h, out_shape.d, out_shape.c, input_blob.elemsize, input_blob.elempack, opt.blob_allocator);
+    else
+        return -1;

Conversation

vlordier commented Apr 11, 2026

Summary

Problem

Solution

1. Fix TopK conversion (pass_level2 → pass_ncnn)

2. Add Gather ncnn operator

3. PNNX build system fixes

Verified

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

vlordier commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants