Skip to content

Improve System.Collections.BitArray #115069

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Conversation

tfenise
Copy link
Contributor

@tfenise tfenise commented Apr 25, 2025

Use xplat SIMD intrinsics in BitArray.CopyTo. Closes #116079

Other minor improvements.

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Apr 25, 2025
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

@tfenise tfenise marked this pull request as ready for review May 1, 2025 17:00
Vector256<byte> isFalse = Vector256.Equals(vector, Vector256<byte>.Zero);

uint result = isFalse.ExtractMostSignificantBits();
m_array[i / 32u] = (int)(~result);
m_array[i / 32] = (int)~(isFalse.ExtractMostSignificantBits());
}
}
else if (Vector128.IsHardwareAccelerated)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the other similar code dropped the else for this branch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, the loop conditions for Vector256 and Vector128 paths are basically same (i <= values.Length - 32), so there is no point executing the Vector128 path if the Vector256 path is already executed.

case 3:
last = byteSpan[2] << 16;
goto case 2;
// fall through
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know these are being caried over from the previous, but the comments are superfluous. The goto case # makes it explicit where the control flow goes and so it isn't "technically" fallthrough and more-so just adds noise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am reading it right, this block (lines 87 - 118) can be

bytes.CopyTo(MemoryMarshal.AsBytes(m_array.AsSpan()));

if (!BitConverter.IsLittleEndian)
{
    BinaryPrimitives.ReverseEndianness<int>(m_array, m_array);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, I think that 'last 4 bytes' case doesn't need to be handled explicitly.

BinaryPrimitives.ReverseEndianness<int>(m_array, m_array);

minus the <int> part as ReverseEndianness is a non-generic method.

bytes.CopyTo(MemoryMarshal.AsBytes(m_array.AsSpan()));

if (!BitConverter.IsLittleEndian)
{
    BinaryPrimitives.ReverseEndianness(m_array, m_array);
}

ulong result = isFalse.ExtractMostSignificantBits();
m_array[i / 32u] = (int)(~result & 0x00000000FFFFFFFF);
m_array[(i / 32u) + 1] = (int)((~result >> 32) & 0x00000000FFFFFFFF);
ulong result = ~(isFalse.ExtractMostSignificantBits());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is less efficient, you should rather have this be ulong result = (~Vector512.IsZero(vector)).ExtractMostSignificantBits()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's almost always better to do the negations and similar on the comparison or vector directly, as it allows it to become a direct cmpneq instead of staying a cmpeq followed by not or similar

Copy link
Contributor Author

@tfenise tfenise May 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. There is no pcmpneq, only pcmpeq.
  2. AVX512 does have vptestmb or vpcmpub, but neither (~Vector512.IsZero(value)).ExtractMostSignificantBits() nor ~(Vector512.IsZero(value).ExtractMostSignificantBits()) are optimised to use them (latest daily build).

(~Vector512.IsZero(value)).ExtractMostSignificantBits():

vmovups  zmm0, zmmword ptr [rcx]
vptestnmb k1, zmm0, zmm0
vpmovm2b zmm0, k1
vpternlogd zmm0, zmm0, zmm0, 85
vpmovb2m k1, zmm0
kmovq    rax, k1
vzeroupper
ret

~(Vector512.IsZero(value).ExtractMostSignificantBits()):

vmovups  zmm0, zmmword ptr [rcx]
vptestnmb k1, zmm0, zmm0
kmovq    rax, k1
not      rax
vzeroupper
ret

Currently, ~(Vector512.IsZero(value).ExtractMostSignificantBits()) looks better.

It's the same with Vector512.Equals(value, Vector512<byte>.Zero) in the place of Vector512.IsZero(value).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is xplat code and we strictly care more about the readability, legibility, and maintainability of it then micro tuning the exact instructions.

Longer term this code will likely be shared with the vector128/256 paths using generics and the ISimdVector interface, so that we only need to maintain 1 rather than 3 paths. We may even centralize the looping logic in a similar way as is done for TensorPrimitives

As such, the pattern I indicated is the best long term pattern across all potential hardware. If there is a case of suboptimal codegen then we should separately log an issue to track that and get it improved because we want developers to use and follow the more idiomatic patterns and use the relevant dedicated APIs where possible

Notably vpcmpneq does exist and is an alias for vpcmp with the correct immediate control byte (I believe 4). The same “should” be an optimization we do over vptest, it just looks to be missing right now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't go to the details of the exact instructions, the intuitive reasoning should rather be that inverting a 64-bit ulong looks less expensive than inverting a 512-bit Vector512<byte>, so it prefers ~(Vector512.IsZero(value).ExtractMostSignificantBits()).

use the relevant dedicated APIs where possible

What dedicated API? There is no VectorXXX.NotEquals() or VectorXXX.IsNonZero().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What dedicated API? There is no VectorXXX.NotEquals() or VectorXXX.IsNonZero().

Using ~ directly on the comparison/query, much as you would use ! on something returning bool

the intuitive reasoning should rather be that inverting a 64-bit ulong looks less expensive than inverting a 512-bit Vector512, so it prefers ~(Vector512.IsZero(value).ExtractMostSignificantBits()).

That isn't how vectors work at all nor how someone using vectors should expect them to work. Vectors are "primitives" that are expected to be able to process Count items in the same general time as 1 scalar item. Any kind of reduction operation, such as Sum or ExtractMostSignificantBits which has to create a value combined from the contained elements is expected to be more expensive, and in many cases it will be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Collections community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update BitArray.CopyTo to use the xplat intrinsics
7 participants