Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lzct SSE instruction when available #42

Open
LeeCampbell opened this issue Jan 15, 2017 · 4 comments
Open

Lzct SSE instruction when available #42

LeeCampbell opened this issue Jan 15, 2017 · 4 comments

Comments

@LeeCampbell
Copy link
Collaborator

We currently have at the heart of the hot path a manual method for find the leading zero count.
This is used to identify the correct bucket to assign a recorded value.
Frustratingly, this is supported as an intrinsic instruction on most modern CPU architectures.

The code is found in Bitwise.NumberOfLeadingZeros, and has been isolated with the intent that it can be a single place to refactor/optimize if the opportunity arises.

Follow this .NET core issue for progress/resolution

@bbarry
Copy link

bbarry commented Mar 10, 2017

Couldn't that class be written like this:

internal static class Bitwise {

    private static readonly int[] Lookup;

    static Bitwise()
    {
        Lookup = new int[256];
        for (int i = 1; i < 256; ++i)
        {
            Lookup[i] = (int)(Math.Log(i) / Math.Log(2));
        }
    }
    public static int NumberOfLeadingZeros(ulong value)
    {
        if (value < uint.MaxValue)
            return 63 - Log2((uint)value);
        return 31 - Log2((uint)(value>>32));
    }

    private static int Log2(uint i)
    {
        if (i >= 0x1000000) { return Lookup[i >> 24] + 24; }
        if (i >= 0x10000) { return Lookup[i >> 16] + 16; }
        if (i >= 0x100) { return Lookup[i >> 8] + 8; }
        return Lookup[i];
    }
}

or inlined:

internal static class Bitwise
{
    private static readonly int[] Lookup;

    static Bitwise()
    {
        Lookup = new int[256];
        for (int i = 1; i < 256; ++i)
        {
            Lookup[i] = (int)(Math.Log(i) / Math.Log(2));
        }
    }

    public static int NumberOfLeadingZeros(ulong value)
    {
        if (value >= 0x100000000000000) { return 7 - Lookup[value >> 56]; }
        if (value >= 0x1000000000000) { return 15 - Lookup[value >> 48]; }
        if (value >= 0x10000000000) { return 23 - Lookup[value >> 40]; }
        if (value >= 0x100000000) { return 31 - Lookup[value >> 32]; }
        if (value >= 0x1000000) { return 39 - Lookup[value >> 24]; }
        if (value >= 0x10000) { return 47 - Lookup[value >> 16]; }
        if (value >= 0x100) { return 56 - Lookup[value >> 8]; }
        return 63 - Lookup[value];
    }
}

I don't have any profiling (or for that matter any stake in this project at all, I was searching for this particular function) but it seems odd that these functions are operating on signed values and that the impl for int is fundamentally different than long when the high half of a long is an int; surely one way is faster than the other.

@LeeCampbell
Copy link
Collaborator Author

Thanks for the interest and contribution @bbarry . Personally I am not a bit-fiddler magician, I am just leaning on the contributions from Gil and people like yourself.
I will add your ideas to my benchmarks and see what bubbles up.

@LeeCampbell
Copy link
Collaborator Author

Some quick feedback.
This code base uses long not ulong.
Negative values mean something to us.

Updating the proposed code to use long and work correctly (that hard coded 56 should be a 55), I added it to my benchmark suite.

For 64bit values:

 |                   Method |       Jit | Runtime |         Mean | Scaled |
 |------------------------- |---------- |-------- |-------------:|-------:|
 |    CurrentImplementation | LegacyJit |     Clr |     842.9 ns |   1.00 |
 |                   BBarry | LegacyJit |     Clr |     805.3 ns |   0.96 |
 |    CurrentImplementation |    RyuJit |     Clr |     712.3 ns |   1.00 |
 |                   BBarry |    RyuJit |     Clr |     853.6 ns |   1.20 |
 |    CurrentImplementation |    RyuJit |    Core |     750.9 ns |   1.00 |
 |                   BBarry |    RyuJit |    Core |   1,224.5 ns |   1.63 |

Note that on LegacyJit it sneaks ahead of the current implementation (but still comes 3rd to some other implementations).
However on RyuJIT it is significantly slower.

For predominantly 32 bit values (which in this libraries case are the more performance sensitive values):

 |                   Method |       Jit | Runtime |        Mean | Scaled |
 |------------------------- |---------- |-------- |------------:|-------:|
 |    CurrentImplementation | LegacyJit |     Clr |    442.8 ns |   1.00 |
 |              BBarry_imp2 | LegacyJit |     Clr |    464.9 ns |   1.05 |
 |    CurrentImplementation |    RyuJit |     Clr |    359.5 ns |   1.00 |
 |              BBarry_imp2 |    RyuJit |     Clr |    497.3 ns |   1.38 |
 |    CurrentImplementation |    RyuJit |    Core |    348.4 ns |   1.00 |
 |              BBarry_imp2 |    RyuJit |    Core |    661.8 ns |   1.90 |

Again we see that it is 5%, 38% and 90% slower than the current implementation.

I hope you find this valuable.

@LeeCampbell
Copy link
Collaborator Author

It appears that this functionality has been merged to dotnet/coreclr master on via dotnet/coreclr#14456.
I am not sure when this will be released or how it will be made available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants