-
Notifications
You must be signed in to change notification settings - Fork 5k
Improve System.Collections.BitArray #115069
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Tagging subscribers to this area: @dotnet/area-system-collections |
src/libraries/System.Collections/src/System/Collections/BitArray.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Collections/src/System/Collections/BitArray.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Collections/src/System/Collections/BitArray.cs
Outdated
Show resolved
Hide resolved
Vector256<byte> isFalse = Vector256.Equals(vector, Vector256<byte>.Zero); | ||
|
||
uint result = isFalse.ExtractMostSignificantBits(); | ||
m_array[i / 32u] = (int)(~result); | ||
m_array[i / 32] = (int)~(isFalse.ExtractMostSignificantBits()); | ||
} | ||
} | ||
else if (Vector128.IsHardwareAccelerated) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the other similar code dropped the else
for this branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, the loop conditions for Vector256
and Vector128
paths are basically same (i <= values.Length - 32
), so there is no point executing the Vector128
path if the Vector256
path is already executed.
case 3: | ||
last = byteSpan[2] << 16; | ||
goto case 2; | ||
// fall through |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know these are being caried over from the previous, but the comments are superfluous. The goto case #
makes it explicit where the control flow goes and so it isn't "technically" fallthrough and more-so just adds noise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I am reading it right, this block (lines 87 - 118) can be
bytes.CopyTo(MemoryMarshal.AsBytes(m_array.AsSpan()));
if (!BitConverter.IsLittleEndian)
{
BinaryPrimitives.ReverseEndianness<int>(m_array, m_array);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, I think that 'last 4 bytes' case doesn't need to be handled explicitly.
BinaryPrimitives.ReverseEndianness<int>(m_array, m_array);
minus the <int>
part as ReverseEndianness
is a non-generic method.
bytes.CopyTo(MemoryMarshal.AsBytes(m_array.AsSpan()));
if (!BitConverter.IsLittleEndian)
{
BinaryPrimitives.ReverseEndianness(m_array, m_array);
}
ulong result = isFalse.ExtractMostSignificantBits(); | ||
m_array[i / 32u] = (int)(~result & 0x00000000FFFFFFFF); | ||
m_array[(i / 32u) + 1] = (int)((~result >> 32) & 0x00000000FFFFFFFF); | ||
ulong result = ~(isFalse.ExtractMostSignificantBits()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is less efficient, you should rather have this be ulong result = (~Vector512.IsZero(vector)).ExtractMostSignificantBits()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's almost always better to do the negations and similar on the comparison or vector directly, as it allows it to become a direct cmpneq
instead of staying a cmpeq
followed by not
or similar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- There is no
pcmpneq
, onlypcmpeq
. - AVX512 does have
vptestmb
orvpcmpub
, but neither(~Vector512.IsZero(value)).ExtractMostSignificantBits()
nor~(Vector512.IsZero(value).ExtractMostSignificantBits())
are optimised to use them (latest daily build).
(~Vector512.IsZero(value)).ExtractMostSignificantBits()
:
vmovups zmm0, zmmword ptr [rcx]
vptestnmb k1, zmm0, zmm0
vpmovm2b zmm0, k1
vpternlogd zmm0, zmm0, zmm0, 85
vpmovb2m k1, zmm0
kmovq rax, k1
vzeroupper
ret
~(Vector512.IsZero(value).ExtractMostSignificantBits())
:
vmovups zmm0, zmmword ptr [rcx]
vptestnmb k1, zmm0, zmm0
kmovq rax, k1
not rax
vzeroupper
ret
Currently, ~(Vector512.IsZero(value).ExtractMostSignificantBits())
looks better.
It's the same with Vector512.Equals(value, Vector512<byte>.Zero)
in the place of Vector512.IsZero(value)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is xplat code and we strictly care more about the readability, legibility, and maintainability of it then micro tuning the exact instructions.
Longer term this code will likely be shared with the vector128/256 paths using generics and the ISimdVector interface, so that we only need to maintain 1 rather than 3 paths. We may even centralize the looping logic in a similar way as is done for TensorPrimitives
As such, the pattern I indicated is the best long term pattern across all potential hardware. If there is a case of suboptimal codegen then we should separately log an issue to track that and get it improved because we want developers to use and follow the more idiomatic patterns and use the relevant dedicated APIs where possible
Notably vpcmpneq does exist and is an alias for vpcmp with the correct immediate control byte (I believe 4). The same “should” be an optimization we do over vptest, it just looks to be missing right now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't go to the details of the exact instructions, the intuitive reasoning should rather be that inverting a 64-bit ulong
looks less expensive than inverting a 512-bit Vector512<byte>
, so it prefers ~(Vector512.IsZero(value).ExtractMostSignificantBits())
.
use the relevant dedicated APIs where possible
What dedicated API? There is no VectorXXX.NotEquals()
or VectorXXX.IsNonZero()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What dedicated API? There is no VectorXXX.NotEquals() or VectorXXX.IsNonZero().
Using ~
directly on the comparison/query, much as you would use !
on something returning bool
the intuitive reasoning should rather be that inverting a 64-bit ulong looks less expensive than inverting a 512-bit Vector512, so it prefers ~(Vector512.IsZero(value).ExtractMostSignificantBits()).
That isn't how vectors work at all nor how someone using vectors should expect them to work. Vectors are "primitives" that are expected to be able to process Count
items in the same general time as 1 scalar item. Any kind of reduction operation, such as Sum
or ExtractMostSignificantBits
which has to create a value combined from the contained elements is expected to be more expensive, and in many cases it will be.
Use xplat SIMD intrinsics in
BitArray.CopyTo
. Closes #116079Other minor improvements.