## Overview
Add SIMD optimization for `np.where(condition, x, y)` using `ILKernelGenerator` to improve performance for contiguous arrays.
## Problem
The current `np.where(condition, x, y)` implementation uses NDIterator-based sequential access for all cases. For large contiguous arrays this is significantly slower than SIMD-optimized code; NumPy itself uses vectorized operations internally.
## Proposal
Add a SIMD fast path using `Vector256.ConditionalSelect` while keeping the iterator fallback for non-contiguous arrays.
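To make the mechanism concrete, here is a minimal standalone sketch (not NumSharp code) of how `Vector256.ConditionalSelect` blends two vectors lane-wise: a lane mask of all ones (`-1` for `int`) selects from `x`, all zeros selects from `y`.

```csharp
using System;
using System.Runtime.Intrinsics;

class ConditionalSelectDemo
{
    static void Main()
    {
        // Per-bit select: result = (mask & x) | (~mask & y).
        // An int lane of -1 (all bits set) picks x; 0 picks y.
        var mask = Vector256.Create(-1, 0, -1, 0, -1, 0, -1, 0);
        var x    = Vector256.Create(10, 11, 12, 13, 14, 15, 16, 17);
        var y    = Vector256.Create(20, 21, 22, 23, 24, 25, 26, 27);

        var r = Vector256.ConditionalSelect(mask, x, y);
        Console.WriteLine(r); // lanes: 10, 21, 12, 23, 14, 25, 16, 27
    }
}
```

This is the primitive the fast path builds on; the remaining work is producing the lane mask from the `bool[]` condition, covered below.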
## Implementation
- `ILKernelGenerator.Where.cs` with SIMD helpers
- `np.where.cs` to dispatch to the SIMD path when eligible
### Dtype Support
All 12 NumSharp types are supported:
| Type | Path | Reason |
|------|------|--------|
| Boolean, Byte, Int16, UInt16, Int32, UInt32, Int64, UInt64, Char, Single, Double | SIMD | 1-8 byte types, vectorizable |
| Decimal | Iterator | 16 bytes, not vectorizable |
### SIMD Eligibility Criteria

```csharp
bool canSimd = ILKernelGenerator.Enabled &&
               outType != NPTypeCode.Decimal &&
               cond.typecode == NPTypeCode.Boolean &&
               cond.Shape.IsContiguous &&
               xArr.Shape.IsContiguous &&
               yArr.Shape.IsContiguous;
```
### Bool Mask Expansion Challenge
The condition array is `bool[]` (1 byte per element), but x/y can be any dtype (1-8 bytes):
| Type | Element Size | V256 Elements | Bools to Load |
|------|--------------|---------------|---------------|
| byte | 1 | 32 | 32 |
| int/float | 4 | 8 | 8 |
| long/double | 8 | 4 | 4 |
Solution: load N bools, expand them to an N-element lane mask, then `ConditionalSelect`.
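A hedged sketch of that expansion for 4-byte elements (the `Expand8` helper name and the scalar widening loop are illustrative; a generated kernel would likely use widening and compare intrinsics instead). It loads 8 condition bytes, widens each to `int`, then compares against zero so every true lane becomes all ones, the form `ConditionalSelect` expects:

```csharp
using System;
using System.Runtime.Intrinsics;

static class MaskExpandSketch
{
    // Expand 8 one-byte bools (as bytes: 0 or 1) into an 8-lane int32 mask
    // of 0 / -1, usable as the selector for Vector256.ConditionalSelect.
    public static Vector256<int> Expand8(ReadOnlySpan<byte> condBytes, int offset)
    {
        Span<int> widened = stackalloc int[8];
        for (int j = 0; j < 8; j++)
            widened[j] = condBytes[offset + j];       // widen byte -> int32

        var v = Vector256.Create<int>(widened);        // requires .NET 8+
        // lane == 0 -> 0 after Equals gives -1, so invert: nonzero -> -1
        return ~Vector256.Equals(v, Vector256<int>.Zero);
    }
}
```

For `byte` elements the same idea loads 32 bools directly; for `long`/`double` it loads 4 and widens to 8-byte lanes, matching the table above.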
## Evidence
Implemented in commit 3162df0c. All 83 tests pass:
- 36 existing `np.where` tests
- 21 battle tests
- 26 new SIMD correctness tests
## Scope / Non-goals
- Broadcast arrays: use the iterator path (stride=0 is not contiguous)
- Non-bool conditions: use the iterator path (they need truthiness conversion)
## Related Files
- `src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Where.cs`
- `src/NumSharp.Core/APIs/np.where.cs`
- `test/NumSharp.UnitTest/Backends/Kernels/WhereSimdTests.cs`