Lzct SSE instruction when available #42

LeeCampbell · 2017-01-15T14:25:43Z

We currently have at the heart of the hot path a manual method for find the leading zero count.
This is used to identify the correct bucket to assign a recorded value.
Frustratingly, this is supported as an intrinsic instruction on most modern CPU architectures.

The code is found in Bitwise.NumberOfLeadingZeros, and has been isolated with the intent that it can be a single place to refactor/optimize if the opportunity arises.

Follow this .NET core issue for progress/resolution

dotnet/corefx#2209
possibly moved from coreFx to CoreCLR https://github.com/dotnet/coreclr/issues/8089

The text was updated successfully, but these errors were encountered:

bbarry · 2017-03-10T16:49:12Z

Couldn't that class be written like this:

internal static class Bitwise {

    private static readonly int[] Lookup;

    static Bitwise()
    {
        Lookup = new int[256];
        for (int i = 1; i < 256; ++i)
        {
            Lookup[i] = (int)(Math.Log(i) / Math.Log(2));
        }
    }
    public static int NumberOfLeadingZeros(ulong value)
    {
        if (value < uint.MaxValue)
            return 63 - Log2((uint)value);
        return 31 - Log2((uint)(value>>32));
    }

    private static int Log2(uint i)
    {
        if (i >= 0x1000000) { return Lookup[i >> 24] + 24; }
        if (i >= 0x10000) { return Lookup[i >> 16] + 16; }
        if (i >= 0x100) { return Lookup[i >> 8] + 8; }
        return Lookup[i];
    }
}

or inlined:

internal static class Bitwise
{
    private static readonly int[] Lookup;

    static Bitwise()
    {
        Lookup = new int[256];
        for (int i = 1; i < 256; ++i)
        {
            Lookup[i] = (int)(Math.Log(i) / Math.Log(2));
        }
    }

    public static int NumberOfLeadingZeros(ulong value)
    {
        if (value >= 0x100000000000000) { return 7 - Lookup[value >> 56]; }
        if (value >= 0x1000000000000) { return 15 - Lookup[value >> 48]; }
        if (value >= 0x10000000000) { return 23 - Lookup[value >> 40]; }
        if (value >= 0x100000000) { return 31 - Lookup[value >> 32]; }
        if (value >= 0x1000000) { return 39 - Lookup[value >> 24]; }
        if (value >= 0x10000) { return 47 - Lookup[value >> 16]; }
        if (value >= 0x100) { return 56 - Lookup[value >> 8]; }
        return 63 - Lookup[value];
    }
}

I don't have any profiling (or for that matter any stake in this project at all, I was searching for this particular function) but it seems odd that these functions are operating on signed values and that the impl for int is fundamentally different than long when the high half of a long is an int; surely one way is faster than the other.

LeeCampbell · 2017-07-14T07:49:32Z

Thanks for the interest and contribution @bbarry . Personally I am not a bit-fiddler magician, I am just leaning on the contributions from Gil and people like yourself.
I will add your ideas to my benchmarks and see what bubbles up.

LeeCampbell · 2017-07-14T13:18:21Z

Some quick feedback.
This code base uses long not ulong.
Negative values mean something to us.

Updating the proposed code to use long and work correctly (that hard coded 56 should be a 55), I added it to my benchmark suite.

For 64bit values:

 |                   Method |       Jit | Runtime |         Mean | Scaled |
 |------------------------- |---------- |-------- |-------------:|-------:|
 |    CurrentImplementation | LegacyJit |     Clr |     842.9 ns |   1.00 |
 |                   BBarry | LegacyJit |     Clr |     805.3 ns |   0.96 |
 |    CurrentImplementation |    RyuJit |     Clr |     712.3 ns |   1.00 |
 |                   BBarry |    RyuJit |     Clr |     853.6 ns |   1.20 |
 |    CurrentImplementation |    RyuJit |    Core |     750.9 ns |   1.00 |
 |                   BBarry |    RyuJit |    Core |   1,224.5 ns |   1.63 |

Note that on LegacyJit it sneaks ahead of the current implementation (but still comes 3rd to some other implementations).
However on RyuJIT it is significantly slower.

For predominantly 32 bit values (which in this libraries case are the more performance sensitive values):

 |                   Method |       Jit | Runtime |        Mean | Scaled |
 |------------------------- |---------- |-------- |------------:|-------:|
 |    CurrentImplementation | LegacyJit |     Clr |    442.8 ns |   1.00 |
 |              BBarry_imp2 | LegacyJit |     Clr |    464.9 ns |   1.05 |
 |    CurrentImplementation |    RyuJit |     Clr |    359.5 ns |   1.00 |
 |              BBarry_imp2 |    RyuJit |     Clr |    497.3 ns |   1.38 |
 |    CurrentImplementation |    RyuJit |    Core |    348.4 ns |   1.00 |
 |              BBarry_imp2 |    RyuJit |    Core |    661.8 ns |   1.90 |

Again we see that it is 5%, 38% and 90% slower than the current implementation.

I hope you find this valuable.

LeeCampbell · 2017-11-17T02:54:20Z

It appears that this functionality has been merged to dotnet/coreclr master on via dotnet/coreclr#14456.
I am not sure when this will be released or how it will be made available.

LeeCampbell added enhancement help wanted question labels Jan 15, 2017

LeeCampbell mentioned this issue Jan 15, 2017

Lzct SSE instruction when available LeeCampbell/HdrHistogram.NET#66

Closed

LeeCampbell mentioned this issue Jul 17, 2017

Leading zero count benchmark #60

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lzct SSE instruction when available #42

Lzct SSE instruction when available #42

LeeCampbell commented Jan 15, 2017

bbarry commented Mar 10, 2017

LeeCampbell commented Jul 14, 2017

LeeCampbell commented Jul 14, 2017

LeeCampbell commented Nov 17, 2017

Lzct SSE instruction when available #42

Lzct SSE instruction when available #42

Comments

LeeCampbell commented Jan 15, 2017

bbarry commented Mar 10, 2017

LeeCampbell commented Jul 14, 2017

LeeCampbell commented Jul 14, 2017

LeeCampbell commented Nov 17, 2017