Add Binary Parsing Support & Refactor TryGetNumberValue & ScanNumberHelper #7993

Merged
merged 54 commits on Apr 3, 2019

Changes from all commits (54 commits)
2ef76c9
Add generic utility functions for safely cross-converting numeric types
vexx32 Sep 29, 2018
d9e8e09
Add CharTraits flag for binary digit
vexx32 Sep 29, 2018
76c1812
Add NumberSuffixFlags for BigInt
vexx32 Sep 29, 2018
8628806
Add extended hex and binary parsing tests
vexx32 Sep 29, 2018
fc9e5cd
Fix decimal parsing issue where 0_00 would fail to parse
vexx32 Sep 29, 2018
2b00b64
Style & spacing fixes
vexx32 Sep 29, 2018
c7c129f
Trim complex logic & fix issues
vexx32 Sep 29, 2018
f737d30
Further parse optimizations
vexx32 Oct 4, 2018
3e92c0d
Introduce local for clarity in ScanNumberHelper
vexx32 Oct 4, 2018
dd558a3
Update parser tests for underscore formats
vexx32 Oct 4, 2018
5bbe98f
Optimize ParseBinary & tokenizer parsing cases
vexx32 Oct 4, 2018
bbf76c5
Minor Fixes post-rebase
vexx32 Oct 11, 2018
738c33e
Whitespace / style fixes for clarity & StyleCop
vexx32 Oct 12, 2018
4f68e43
Address review comments
vexx32 Oct 12, 2018
52f7739
Refactor for efficiency per @iSazonov's recommendation
vexx32 Oct 12, 2018
c537855
Update comment
vexx32 Oct 12, 2018
26248e6
Update AsBigInt extension method
vexx32 Oct 12, 2018
dca6c6d
Pre-insert leading 0 to hex literals in ScanNumberHelper
vexx32 Oct 13, 2018
d59fd89
Update ParseBinary to use spans throughout
vexx32 Oct 14, 2018
5eb6bf6
Use stackalloc for small binary numerals
vexx32 Oct 15, 2018
63ffd97
Add comments
vexx32 Oct 15, 2018
abc3e0b
Fix glitch with `0b1` in binary parser
vexx32 Oct 15, 2018
5d34f9e
Rename variables for clarity
vexx32 Oct 16, 2018
f358087
Tidy and consolidate comments for binary parser
vexx32 Oct 16, 2018
f66218d
Add comments and format code for readability
vexx32 Oct 17, 2018
1dd414e
Fix decimal literal parsing into double
vexx32 Oct 22, 2018
8223a6a
Remove underscore parsing per Committee decision
vexx32 Nov 29, 2018
e74c4a1
Use `n`/`N` suffix for bigint instead of `I`
vexx32 Nov 29, 2018
1c16a15
Restore test
vexx32 Nov 29, 2018
d03a4b9
Prevent hex and binary automatically parsing as decimal
vexx32 Nov 29, 2018
8bbe0e3
Swap `i`->`n` bigint suffix in tests
vexx32 Nov 29, 2018
b1b246c
Alphabetize using statements in Utils.cs
vexx32 Nov 29, 2018
cccd610
Update CharTraits enum with proper documentation headers
vexx32 Nov 29, 2018
f555a2f
Re-add missing using statement
vexx32 Dec 3, 2018
6b72b94
Refactor to address review comments
vexx32 Dec 4, 2018
917af3d
Define start index based on allocated bytes
vexx32 Dec 5, 2018
169d687
Revert refactor for readability
vexx32 Dec 5, 2018
1b52625
Small refactor for readability and logical cohesion.
vexx32 Dec 7, 2018
68f0a87
Replace variable name for readability
vexx32 Dec 8, 2018
f7c16e5
Merge branch 'master' into Tokenizer/Refactor
vexx32 Dec 20, 2018
6e1711b
Merge branch 'master' into Tokenizer/Refactor
vexx32 Dec 28, 2018
5ecdf52
Merge branch 'master' into Tokenizer/Refactor
vexx32 Jan 1, 2019
e08ad0b
Merge branch 'master' into Tokenizer/Refactor
vexx32 Jan 12, 2019
270f6a0
Merge branch 'master' into Tokenizer/Refactor
vexx32 Jan 18, 2019
ecc37bd
:white_check_mark: Update tests to use errorID
vexx32 Jan 30, 2019
959e52c
:pencil: Use correct ErrorID for should -throw
vexx32 Jan 30, 2019
99ec56d
:pencil: Fix last failing test
vexx32 Jan 30, 2019
091df16
:sparkles: Use suggested pattern for binary/decimal digits
vexx32 Jan 31, 2019
684cc31
:white_check_mark: Add more tests for hex parsing
vexx32 Jan 31, 2019
8822139
:recycle: Address Rob's feedback
vexx32 Feb 12, 2019
ec2e105
:memo: Add comment to clarify why less readable method is used.
vexx32 Feb 12, 2019
07649fb
Empty commit to restart CI
vexx32 Feb 12, 2019
c2c5e64
Merge branch 'Tokenizer/Refactor' of https://github.com/vexx32/PowerS…
vexx32 Feb 12, 2019
4f4cf22
Merge branch 'master' into Tokenizer/Refactor
vexx32 Mar 4, 2019

@@ -5,17 +5,18 @@
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.ComponentModel;
using System.Diagnostics;
using System.Diagnostics.CodeAnalysis;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Management.Automation.Configuration;
using System.Management.Automation.Internal;
using System.Management.Automation.Language;
using System.Management.Automation.Runspaces;
using System.Management.Automation.Security;
using System.Numerics;
using System.Reflection;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
@@ -38,6 +39,229 @@ namespace System.Management.Automation
/// </summary>
internal static class Utils
{
/// <summary>
/// Converts a given double value to BigInteger via Math.Round().
/// </summary>
/// <param name="d">The value to convert.</param>
/// <returns>Returns a BigInteger value equivalent to the input value rounded to nearest integer.</returns>
internal static BigInteger AsBigInt(this double d) => new BigInteger(Math.Round(d));

internal static bool TryCast(BigInteger value, out byte b)
{
if (value < byte.MinValue || byte.MaxValue < value)
{
b = 0;
return false;
}

b = (byte)value;
return true;
}

internal static bool TryCast(BigInteger value, out sbyte sb)
{
if (value < sbyte.MinValue || sbyte.MaxValue < value)
{
sb = 0;
return false;
}

sb = (sbyte)value;
return true;
}

internal static bool TryCast(BigInteger value, out short s)
{
if (value < short.MinValue || short.MaxValue < value)
{
s = 0;
return false;
}

s = (short)value;
return true;
}

internal static bool TryCast(BigInteger value, out ushort us)
{
if (value < ushort.MinValue || ushort.MaxValue < value)
{
us = 0;
return false;
}

us = (ushort)value;
return true;
}

internal static bool TryCast(BigInteger value, out int i)
{
if (value < int.MinValue || int.MaxValue < value)
{
i = 0;
return false;
}

i = (int)value;
return true;
}

internal static bool TryCast(BigInteger value, out uint u)
{
if (value < uint.MinValue || uint.MaxValue < value)
{
u = 0;
return false;
}

u = (uint)value;
return true;
}

internal static bool TryCast(BigInteger value, out long l)
{
if (value < long.MinValue || long.MaxValue < value)
{
l = 0;
return false;
}

l = (long)value;
return true;
}

internal static bool TryCast(BigInteger value, out ulong ul)
{
if (value < ulong.MinValue || ulong.MaxValue < value)
{
ul = 0;
return false;
}

ul = (ulong)value;
return true;
}

internal static bool TryCast(BigInteger value, out decimal dm)
{
if (value < (BigInteger)decimal.MinValue || (BigInteger)decimal.MaxValue < value)
{
dm = 0;
return false;
}

dm = (decimal)value;
return true;
}

internal static bool TryCast(BigInteger value, out double db)
{
if (value < (BigInteger)double.MinValue || (BigInteger)double.MaxValue < value)
{
db = 0;
return false;
}

db = (double)value;
return true;
}
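
Taken together, these overloads let a caller narrow a parsed BigInteger to the smallest suitable type. A minimal sketch of that pattern (the helper name and the exact type ordering are illustrative, not the tokenizer's actual selection logic):

internal static object NarrowToSmallestType(BigInteger value)
{
    // Try progressively wider types; fall back to BigInteger if none fits.
    if (TryCast(value, out int i)) { return i; }
    if (TryCast(value, out long l)) { return l; }
    if (TryCast(value, out decimal dm)) { return dm; }
    if (TryCast(value, out double db)) { return db; }

    return value;
}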

/// <summary>
/// Parses a given string or ReadOnlySpan&lt;char&gt; to calculate its value as a binary number.
/// Assumes input has already been sanitized and only contains zeroes (0) or ones (1).
This conversation was marked as resolved by iSazonov

@iSazonov (Collaborator), Oct 14, 2018

For MSFT experts: should we add Debug.Assert() for this?

@SteveL-MSFT (Member), Feb 1, 2019

Don't think there's much value with an assert for this

/// </summary>
/// <param name="digits">Span or string of binary digits. Assumes all digits are either 1 or 0.</param>
/// <param name="unsigned">
/// Whether to treat the number as unsigned. When false, respects established conventions
/// with sign bits for certain input string lengths.
/// </param>
/// <returns>Returns the value of the binary string as a BigInteger.</returns>
internal static BigInteger ParseBinary(ReadOnlySpan<char> digits, bool unsigned)
{
if (!unsigned)
{
if (digits[0] == '0')
{
unsigned = true;
}
else
{
switch (digits.Length)
{
// Only accept sign bits at these lengths:
case 8: // byte
case 16: // short
case 32: // int
case 64: // long
case 96: // decimal

@iSazonov (Collaborator), Oct 17, 2018

I wonder: are these constants the same for all hardware platforms? Could we hit a problem on some ARMs? Or, a more general question: are we sure that PowerShell signed constants (binary and hex) are portable?

@mklement0, do you have any thoughts?

@vexx32 (Author), Oct 17, 2018

To my knowledge, there aren't any supported platforms where these numbers will be different. However, if there is a programmatic way to get expected sign bit lengths, I would be more than open to checking against this instead.

@iSazonov (Collaborator), Oct 17, 2018

@vexx32 It is not a low-level problem, because we are talking about digits - it could be a PowerShell language problem if ARM users expect the sign bit in another position.

@vexx32 (Author), Oct 17, 2018

Aye, that could definitely be a concern. I could always refactor this to account for it, but I'd need to know what to incorporate here, and I'm not sure where to find this information at the moment. 🙁

@iSazonov (Collaborator), Oct 17, 2018

I found nothing useful. I hope @mklement0 can help.

@mklement0 (Contributor), Oct 22, 2018

As for CPU architectures: I don't think we need to worry about those, as long as the behavior is well-documented in terms of what bit patterns result in what .NET primitive types.

Note that we already abstract endianness away by always interpreting a byte-spanning bit / hex-digit pattern from left to right (significant bits first).

However, note that this proposed implementation does not match the existing parsing of hex constants:

PS> 0xff; 0b11111111; 0b011111111 
255 # unsigned
-1  # !! signed
255 # unsigned again, simply due to prepending 0

We should at least remain consistent with hex parsing, which - based on my experiments (I haven't looked at the source) - is not based on fixed bit positions (leading zeros make no difference), but simply on the smallest signed integer type >= [int32] that can accommodate the value as a bit pattern, irrespective of its sign; this currently (before this PR) tops out at [int64]:

PS> 0xffffffff; 0x0ffffffff  # 32 effective bits
-1  # !! bit pattern fits into [int32], and because the high bit happens to be set, the result is *negative*
-1  # behavior is not dependent on specific number of digits

PS> 0xffffffffffffffff; 0x0ffffffffffffffff # 64 effective bits
-1  # !! bit pattern fits into [int64], and because the high bit happens to be set, the result is *negative*
-1  # behavior is not dependent on specific number of digits

Therefore:

  • leading zeros should be ignored.

  • the boundaries should be just 32 and 64 for consistency with current hex-literal parsing, and 96 for the new widening to [decimal].

  • parsing of decimal literals > [decimal] is seemingly broken by this PR; e.g., the equivalent of 0x1000000000000000000000000 ([decimal]::MaxValue + 1) is decimal 79228162514264337593543950336, which in v6.1.0 yields 7.92281625142643E+28, i.e. a [double](!)

PS>  79228162514264337593543950336 # in 6.1.0: 7.92281625142643E+28
...
The numeric constant 79228162514264337593543950336 is not valid. # !! ERROR, in this PR
...

Note that there's a current inconsistency in that hex literals do not widen beyond [int64], whereas decimal ones do, as shown above, but ultimately to [double] (-> [decimal] -> [double])

While ultimately widening to [bigint] instead of [double] makes sense for decimal literals too, it would technically be a breaking change we should discuss; at the very least the existing behavior with decimal literals shouldn't be broken.

@vexx32 (Author), Oct 22, 2018

Thank you very much for taking the time to set this out! 😄

So... yes, I agree. Mostly.

The current hexadecimal parsing behaviour has been maintained as-is for compatibility reasons; I don't want folks who've been using 0xFF for 255 to suddenly find their literals coming back as -1 and so forth.

However, given that binary strings are typically most useful at the byte level, for dealing with and helping to conceptualize bitwise operations, I don't think it's at all useful to copy the present hex behaviour of signed bits. Very few people are going to want to write out 32 digits just to get a sign bit. And, yes, you can use a negative-sign prefix to bypass that and just flip your numbers. But given that users working with binary literals will largely want them to mimic expected bit-pattern behaviours, and that binary strings become unwieldy at int32/int64 lengths and beyond, I cannot imagine that it is meaningful to restrict sign bits to 32, 64, and 96 digit lengths.

After all, if they need the numeral to be unsigned, they need only append the u suffix. (Same for hex, incidentally, but I doubt that will be changing anytime soon.)

Really, the only reason hex literals don't widen beyond int64 is simply due to the fact that there is no native TryParse() method that allows it for double and other types, which is rendered irrelevant with the methods I chose to use here.

Regarding your point about leading zeroes... currently they are ignored, yes, and I am unsure whether that ought to be retained. Since they are currently ignored, I'm not sure I'd expect many scripts to bother including leading zeros, but I'll leave that choice up to folks cleverer than I.

Let me check that decimal parsing case... I might have let a code path slip me by, it's no matter. Just gotta make sure the integer widens to double as it needs to. I'll have that fixed soon -- nice catch! 😄

@vexx32 (Author), Oct 22, 2018

@mklement0 yep, I made an error in the TryCast overload for double casting. Thanks for catching that!

@mklement0 (Contributor), Oct 22, 2018

My pleasure; thank you for all the work on typed number literals - great additions to the language - and your thoughtful contributions to this community in general.


Generally, yes, the use of u avoids the whole sign-bit problem - but you have to remember to use it.

By default, I would think that you would usually want positive, i.e., unsigned numbers, and I wish that's how it had always worked with hex. literals - sadly, that ship has sailed.

Given that binary / hex literals are typically used for bit fields / flag enumeration values, there's rarely a reason for negative numbers - or am I missing important use cases?

Personally, I don't think of binary literals as byte-oriented, because 16-bit bit fields aren't unusual, for instance, though, in general, you'll be dealing with smaller numbers than in hex. notation, I presume.

Tying the signed-bit use to the resulting type, which is how hex. literals currently do it, is a simple rule that could apply to binary literals equally.

In fact, from a bit-field perspective, the current implementation yields unexpected and ambiguous results, because the result type's -1 bit pattern is substituted for the input's:

E.g., 0b11111111 is interpreted as a signed byte and therefore becomes -1, but it doesn't become -1 as a (signed) byte, but as the Int32 instance it is converted to (as the smallest return type chosen), whose bit pattern is then obviously different: 0b11111111111111111111111111111111; to put it differently:
currently, 0b11111111, 0b1111111111111111, and 0b11111111111111111111111111111111 all turn into the very same number: -1 as an [int].

I suppose the inferred result type could be made more granular for binary literals, and start with [sbyte], then widen to [short], but I'm not sure that's the way to go, both for reasons of diverging from hex.-literal behavior and for the questionable utility of the interpretation as a signed type in general.

Staying consistent with hex. literals additionally has the up-side that you're less likely to run into the potentially surprising sign-bit behavior, given that binary literals will typically be smaller in value; i.e., you're less likely to get a negative number when you didn't expect one.


Re leading zeros: Yes, it would be odd for users to include them to begin with.

I think simply ignoring them is an easy and easily documentable solution to this problem, especially given that they currently (pre-PR) serve no useful purpose in hex. literals - they do not widen the type.

And neither do they widen the type of the binary literal type in this PR; all they do is cause the - obscure - change in signed/unsigned behavior.

@vexx32 (Author), Oct 22, 2018

Yeah, I hear you. I actually did experiment with having binary (and hex as well) return sub-int32 types, all the way down to byte and sbyte.

Given that any actual arithmetic automatically upscales them appropriately in a majority of cases, this isn't a terrible solution. Actually, due to its backing in C#, any arithmetic performed on lower types automatically converts values up to int32 in many cases. However, it would be a bit of an oddity, and ultimately I agree that it would probably do more harm than good.

I suppose that given the fact that one can ultimately flip the value with a prefixed negative sign, it's much of a muchness which sign bit lengths the parser respects, as it really is entirely up to us to figure out the most effective and useful behaviour in this instance.

Is it worth looking at how C# implements binary literals and attempting to match that relatively closely, do you think?

@mklement0 (Contributor), Oct 22, 2018

C# exhibits consistent behavior across hex and binary literals, and implicitly promotes from signed to unsigned types as needed; in other words: you always get positive numbers - it's how I wish it worked in PowerShell too.

As in PowerShell, the smallest type chosen is Int32.

// Hex
0x1 // -> Int32
0x7fffffff // -> Int32 - int.MaxValue
0x80000000 // -> UInt32 - int.MaxValue + 1

// Binary equivalents
0b1 // -> Int32  
0b01111111111111111111111111111111 // -> Int32
0b10000000000000000000000000000000 // -> UInt32

@vexx32 (Author), Oct 22, 2018

That would be nice to do, and well within the realm of possibility. I suppose that depends on whether or not it is considered an allowable change at this point in time :s

@mklement0 (Contributor), Oct 22, 2018

Good question, @vexx32.

Perhaps, if we're lucky, the following changes will fall into Bucket 3: Unlikely Grey Area:

  • Treat all number bases in literals the same - as C# does (decimal, hex, binary)

  • Use [int32] as the smallest type, and promote as needed from signed to unsigned, as shown above - resulting in positive numbers always.

    • This is a deviation from current hex.-literal parsing.
  • (PowerShell-specific) For values > [int64]::MaxValue (which C# doesn't support), promote to [decimal] first, and, if > [decimal]::MaxValue, to [bigint]

    • This is a deviation from current decimal-literal parsing, which promotes to [double] for values > [decimal]::MaxValue

What do you think, @SteveL-MSFT?

@iSazonov (Collaborator), Oct 23, 2018

From docs https://docs.microsoft.com/en-us/dotnet/api/system.numerics.biginteger?view=netframework-4.7.2

When parsing a hexadecimal string, the BigInteger.Parse(String, NumberStyles) and BigInteger.Parse(String, NumberStyles, IFormatProvider) methods assume that if the most significant bit of the first byte in the string is set, or if the first hexadecimal digit of the string represents the lower four bits of a byte value, the value is represented by using two's complement representation. For example, both "FF01" and "F01" represent the decimal value -255. To differentiate positive from negative values, positive values should include a leading zero. The relevant overloads of the ToString method, when they are passed the "X" format string, add a leading zero to the returned hexadecimal string for positive values. This makes it possible to round-trip BigInteger values by using the ToString and Parse methods, as the following example shows.

@vexx32 (Author), Oct 23, 2018

Aye, that's the BigInteger parse method. C#'s own parsing rules for hex / binary literals in code are very different. I understand why there are so many varying ways that these can be parsed in C#, but it's a bit baffling to me that they don't implement any kind of direct standard parsing for hex/binary that mimics their native literal parsing. Very odd!

@mklement0 (Contributor), Oct 24, 2018

Good points, @vexx32.

@iSazonov, remember that [bigint] here is just a helper type behind the scenes for implementing arbitrarily large binary and hex source-code literals (and possible decimal ones, based on the suggestions above) - we are not trying to emulate [bigint]'s runtime parsing of strings.

As shown above, C# - sensibly - interprets all hex and binary source-code literals as positive numbers, choosing the smallest signed or unsigned type >= Int32 to accommodate that number, up to UInt64.

Regrettably, PowerShell doesn't currently do that for hex numbers, hence my Bucket 3 suggestion above.

If we get the green light to fix this, it means applying the same logic to the > UInt64 ranges we support:

  • <= 0x7fffffffffffffffffffffff (and its binary equivalent) -> [decimal]
  • >= 0x800000000000000000000000 (and its binary equivalent) -> [bigint]

@mklement0 (Contributor), Oct 24, 2018

@vexx32: Another thing about the current PR: a hex literal with >= 40 f digits reverts to an Int32:

PS> (0xffffffffffffffffffffffffffffffffffffffff).GetType().FullName
System.Int32

While the seemingly equivalent binary literal composed of 160 1 digits actually remains a [bigint], it still turns into a negative number:

PS> 0b1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
-1

@vexx32 (Author), Oct 24, 2018

Yep! I'm already working around how BigInt wants to parse things and special-casing to mimic present behaviour. If we can opt to emulate C#'s interpretation of binary/hexadecimal literals, it would only simplify the special casing I need to perform.

@vexx32 (Author), Oct 24, 2018

Yep, the special casing is still a bit tricky, having types match up properly with values and whatnot. Once we get the final word on exactly how we want the literals to behave I can clean up and have the return types and casing behave to that specification.

BigInt is a bit tricky to work with here and requires some additional handling, but nothing too crazy. It certainly doesn't work as one initially expects with high values, either; every string 8n digits long is treated as having its leading bit represent the sign unless we tell it that it's an unsigned number. Similarly with hex. In both cases the output type has to be manually detected from the input string's length due to sign bit shenanigans.

Simplifying it as suggested would cut down quite a bit on necessary parsing logic to determine these things, but ultimately it's not a huge deal (for me, at least) either way. I just need a concrete spec to build to before I go changing it again. 😄

@mklement0 (Contributor), Oct 24, 2018

@vexx32: The always-positive logic is fairly easy to implement, if I understand things correctly:

From a hex string: prepend a 0:

// 0xff as 255; without the leading 0, the result would be -1
BigInteger.Parse("0ff", NumberStyles.AllowHexSpecifier)

From a byte array - prepend a 0x0 byte (with big-endian ordering):

new BigInteger(new byte[] { 0x0, 0xff }, isBigEndian: true)

@vexx32 (Author), Oct 24, 2018

Even easier, actually. Hex strings are currently given a leading zero by default with my current code. I am then stripping it out in cases where it appears appropriate to allow a sign bit. I would simply remove the additional logic.

And with binary, it would simply always mean invoking the BigInteger constructor with a value of true for the isUnsigned parameter.

@vexx32 (Author), Dec 5, 2018

Thinking further, I could even trim out the extra leading 0 if we eventually opt for parsing as unsigned, as it will be possible to simply instruct the BigInteger constructor to always read it as unsigned.

@SteveL-MSFT (Member), Feb 2, 2019

Sorry for the delay. On the topic of the breaking change: I know it's desirable to change it, but it seems a bit risky to make such a change.

@vexx32 (Author), Feb 2, 2019

I figured that would probably be the case. It is a fairly straightforward follow-up if we find reason to change it later.

I could even put it in as an experimental feature later on, I think.

case int n when n >= 128: // BigInteger
break;
default:
// If we do not flag these as unsigned, bigint assumes a sign bit for any (8 * n) string length
unsigned = true;
break;
}
}
}

// Only use heap allocation for very large numbers
const int MaxStackAllocation = 512;

// Calculate number of 8-bit bytes needed to hold the input, rounded up to next whole number.
int outputByteCount = (digits.Length + 7) / 8;
Span<byte> outputBytes = outputByteCount <= MaxStackAllocation ? stackalloc byte[outputByteCount] : new byte[outputByteCount];

@iSazonov (Collaborator), Dec 4, 2018

The CoreFX team recommends using a const for stack allocations to get faster code.

Suggested change
Span<byte> outputBytes = outputByteCount <= MaxStackAllocation ? stackalloc byte[outputByteCount] : new byte[outputByteCount];
Span<byte> outputBytes = outputByteCount <= MaxStackAllocation ? stackalloc byte[MaxStackAllocation] : new byte[outputByteCount];

@vexx32 (Author), Dec 4, 2018

I did give this a look, but this completely breaks parsing due to it typically filling the bytes from first to last, and the actual BigInteger constructor being used as big-endian (it ends up parsing every number as at least a 512-byte number, and with the method of filling bytes it currently uses, this means it fills the high byte instead of the low byte).

This could be worked in, I think, but it would require additional allocations that would likely be more effective to avoid.

@vexx32 (Author), Dec 5, 2018

@iSazonov once I got the logic fixed, which wasn't nearly as hard as I first thought, here are the results of two different methods of doing it vs the current method.

The first alternate method slices the over-allocated span down to size before passing it to the BigInteger constructor, and the second simply ensures that all bytes are positioned in the span to be correctly parsed, which just meant redefining the starting point to be the end of the span, instead of counting the bytes needed there.

https://gist.github.com/vexx32/375d5553dfe3edee60cd9616228dd3a5

In this instance, it seems that even allocating according to a const value is slower thanks to the extra allocations and/or operations needed somewhere along the line. These benchmarks actually have MaxStackAlloc set to 128 and not 512, because setting it at 512 was ludicrously slow for the new methods.

@iSazonov (Collaborator), Dec 5, 2018

Using a const speeds up calling the method, not the method itself. Are you sure that your test measures this correctly?

@vexx32 (Author), Dec 5, 2018

Even if it speeds up calling the method, the excess allocation is taking far more time than it could possibly be saving on the method call, unless method calls are an order of magnitude more expensive than I am generally led to believe.

But be that as it may, I'm not really sure how one would measure that appropriately.

@iSazonov (Collaborator), Dec 6, 2018

Sorry, I was not precise. The compiler can perform the stackalloc optimization only if the const-sized stackalloc is not under a condition.
I mean code like:

Suggested change
Span<byte> outputBytes = outputByteCount <= MaxStackAllocation ? stackalloc byte[outputByteCount] : new byte[outputByteCount];
Span<byte> outputBytes = stackalloc byte[MaxStackAllocation];
if (outputByteCount > MaxStackAllocation)
{
outputBytes = new byte[outputByteCount];
}

I'm OK with the current code too.

@vexx32 (Author), Dec 6, 2018

Ah, interesting, thank you! I'll see if that makes an appreciable difference. 😄

@vexx32 (Author), Dec 6, 2018

@iSazonov I'm not sure what kind of perf benefit this should give, but all my attempts to include this suggestion haven't seemed to produce the results you're describing.

https://gist.github.com/vexx32/d2b7fff82f93635ea974bdbc0c26fa89

This is the benchmarking code I'm using, with BenchmarkDotNet targeting netcoreapp2.1; if you're able to take a look and see what I might be doing wrong here that'd be awesome.

Thank you for all your fantastic help! 😄
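
For readers who want to reproduce this kind of measurement, a stripped-down harness along the lines discussed might look like the following (a sketch only: the class and method names are invented, the byte-packing loop is elided, and the actual benchmark code is in the gists linked above):

using System;
using System.Numerics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class BinaryAllocBenchmarks
{
    private const int MaxStackAllocation = 128;

    [Params(8, 64, 1024)]
    public int DigitCount;

    [Benchmark(Baseline = true)]
    public BigInteger ExactSizedAlloc()
    {
        int byteCount = (DigitCount + 7) / 8;
        Span<byte> bytes = byteCount <= MaxStackAllocation
            ? stackalloc byte[byteCount]
            : new byte[byteCount];

        // (byte-packing loop elided)
        return new BigInteger(bytes, isUnsigned: true, isBigEndian: true);
    }

    [Benchmark]
    public BigInteger ConstSizedAlloc()
    {
        int byteCount = (DigitCount + 7) / 8;
        Span<byte> bytes = stackalloc byte[MaxStackAllocation];
        if (byteCount > MaxStackAllocation)
        {
            bytes = new byte[byteCount];
        }

        // Pass only the low-order end of the buffer to the constructor.
        return new BigInteger(bytes.Slice(bytes.Length - byteCount), isUnsigned: true, isBigEndian: true);
    }

    public static void Main() => BenchmarkRunner.Run<BinaryAllocBenchmarks>();
}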

int outputByteIndex = outputBytes.Length - 1;

// We need to be prepared for any partial leading bytes (e.g., 010|00000011|00101100), or cases
// where we have fewer than 8 bits to work with from the beginning.
//
// Walk bytes right to left, stepping one whole byte at a time (if there are any whole bytes).
int byteWalker;
for (byteWalker = digits.Length - 1; byteWalker >= 7; byteWalker -= 8)
{
// Use bit shifts and binary-or to sum the values in each byte. These calculations will
// create values higher than a single byte, but the higher bits will be stripped out when cast
// to byte.
//
// The low bits are added in separately to allow us to strip the higher 'noise' bits before we
// sum the values using binary-or.
//
// Simplified representation of logic: (byte)( (7)|(6)|(5)|(4) ) | ( ( (3)|(2)|(1)|(0) ) & 0b1111 )
//
// N.B.: This code has been tested against a straight for loop iterating through the byte, and in no
// circumstance was it faster or more effective than this unrolled version.
outputBytes[outputByteIndex--] =
(byte)(
( (digits[byteWalker - 7] << 7)
| (digits[byteWalker - 6] << 6)
| (digits[byteWalker - 5] << 5)
| (digits[byteWalker - 4] << 4)
)
| (
( (digits[byteWalker - 3] << 3)
| (digits[byteWalker - 2] << 2)
| (digits[byteWalker - 1] << 1)
| (digits[byteWalker])
) & 0b1111
)
);
}

// With complete bytes parsed, byteWalker is either at the partial byte start index, or at -1
if (byteWalker >= 0)
{
int currentByteValue = 0;
for (int i = 0; i <= byteWalker; i++)
{
currentByteValue = (currentByteValue << 1) | (digits[i] - '0');
}

outputBytes[outputByteIndex] = (byte)currentByteValue;
}

return new BigInteger(outputBytes, isUnsigned: unsigned, isBigEndian: true);
}
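
As a quick illustration of the sign-bit convention the method implements (hypothetical calls, since the helper is internal to the engine):

// 8 digits with a leading 1: sign bit honored, two's complement => -1
BigInteger a = Utils.ParseBinary("11111111".AsSpan(), unsigned: false);

// Same digits, forced unsigned => 255
BigInteger b = Utils.ParseBinary("11111111".AsSpan(), unsigned: true);

// A leading 0 switches the parse to unsigned => 255
BigInteger c = Utils.ParseBinary("011111111".AsSpan(), unsigned: false);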

// From System.Web.Util.HashCodeCombiner
internal static int CombineHashCodes(int h1, int h2)
{
@@ -27,39 +27,64 @@ internal static class SpecialChars
[Flags]
internal enum CharTraits
{
/// <summary>
/// No specific character traits.
/// </summary>
None = 0x0000,

// For identifiers, is the character a letter?
/// <summary>
/// For identifiers, the first character must be a letter or underscore.
/// </summary>
IdentifierStart = 0x0002,

// The character is a valid first character of a multiplier
/// <summary>
/// The character is a valid first character of a multiplier.
/// </summary>
MultiplierStart = 0x0004,

// The character is a valid type suffix for numeric literals
/// <summary>
/// The character is a valid type suffix for numeric literals.
/// </summary>
TypeSuffix = 0x0008,

// The character is a whitespace character
/// <summary>
/// The character is a whitespace character.
/// </summary>
Whitespace = 0x0010,

// The character terminates a line.
/// <summary>
/// The character terminates a line.
/// </summary>
Newline = 0x0020,

// The character is a hexadecimal digit.
/// <summary>
/// The character is a hexadecimal digit.
/// </summary>
HexDigit = 0x0040,

// The character is a decimal digit.
/// <summary>
/// The character is a decimal digit.
/// </summary>
Digit = 0x0080,

// The character is allowed as the first character in an unbraced variable name.
/// <summary>
/// The character is allowed as the first character in an unbraced variable name.
/// </summary>
VarNameFirst = 0x0100,

// The character is not part of the token being scanned.
/// <summary>
/// The character is not part of the token being scanned.
/// </summary>
ForceStartNewToken = 0x0200,

// The character is not part of the token being scanned, when the token is known to be part of an assembly name.
/// <summary>
/// The character is not part of the token being scanned, when the token is known to be part of an assembly name.
/// </summary>
ForceStartNewAssemblyNameSpecToken = 0x0400,

// The character is the first character of some operator (and hence is not part of a token that starts a number)
/// <summary>
/// The character is the first character of some operator (and hence is not part of a token that starts a number).
/// </summary>
ForceStartNewTokenAfterNumber = 0x0800,
}

@@ -150,7 +175,7 @@ static CharExtensions()
/* K */ CharTraits.IdentifierStart | CharTraits.VarNameFirst | CharTraits.MultiplierStart,
/* L */ CharTraits.IdentifierStart | CharTraits.VarNameFirst | CharTraits.TypeSuffix,
/* M */ CharTraits.IdentifierStart | CharTraits.VarNameFirst | CharTraits.MultiplierStart,
/* N */ CharTraits.IdentifierStart | CharTraits.VarNameFirst,
/* N */ CharTraits.IdentifierStart | CharTraits.VarNameFirst | CharTraits.TypeSuffix,
/* O */ CharTraits.IdentifierStart | CharTraits.VarNameFirst,
/* P */ CharTraits.IdentifierStart | CharTraits.VarNameFirst | CharTraits.MultiplierStart,
/* Q */ CharTraits.IdentifierStart | CharTraits.VarNameFirst,
@@ -182,7 +207,7 @@ static CharExtensions()
/* k */ CharTraits.IdentifierStart | CharTraits.VarNameFirst | CharTraits.MultiplierStart,
/* l */ CharTraits.IdentifierStart | CharTraits.VarNameFirst | CharTraits.TypeSuffix,
/* m */ CharTraits.IdentifierStart | CharTraits.VarNameFirst | CharTraits.MultiplierStart,
/* n */ CharTraits.IdentifierStart | CharTraits.VarNameFirst,
/* n */ CharTraits.IdentifierStart | CharTraits.VarNameFirst | CharTraits.TypeSuffix,
/* o */ CharTraits.IdentifierStart | CharTraits.VarNameFirst,
/* p */ CharTraits.IdentifierStart | CharTraits.VarNameFirst | CharTraits.MultiplierStart,
/* q */ CharTraits.IdentifierStart | CharTraits.VarNameFirst,
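
These two table changes are what enable the new BigInteger suffix: with CharTraits.TypeSuffix set for 'n'/'N', the tokenizer treats a trailing n as a numeric type suffix, so a script literal such as 100n can parse as a BigInteger. An illustrative check (assuming engine-internal access to these extension methods):

bool nowSuffix = 'n'.IsTypeSuffix(); // true with this change; previously false
bool unchanged = 'l'.IsTypeSuffix(); // true before and after; 'l' was already a suffix
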
@@ -298,18 +323,17 @@ internal static bool IsHexDigit(this char c)
return false;
}

// Return true if the character is a decimal digit.
internal static bool IsDecimalDigit(this char c)
{
if (c < 128)
{
return (s_traits[c] & CharTraits.Digit) != 0;
}

return false;
}
// Returns true if the character is a decimal digit.
internal static bool IsDecimalDigit(this char c) => (uint)(c - '0') <= 9;

// These decimal/binary checking methods are more performant than the alternatives because they
// require fewer operations overall than a more readable check such as (this char c) => c == '0' || c == '1',
// especially in the case of IsDecimalDigit().

// Returns true if the character is a binary digit.
internal static bool IsBinaryDigit(this char c) => (uint)(c - '0') <= 1;

@rjmholt (Member), Feb 11, 2019

While this is true, it seems more straightforward to just have => c == '0' || c == '1';. It took me a minute or so to realise why this isn't a problem.

@rjmholt (Member), Feb 11, 2019

(Perhaps the coercion + comparison has a performance advantage?)

@vexx32 (Author), Feb 12, 2019

@rjmholt I was following @iSazonov's lead on that one; he apparently pulled it from the CoreFX code somewhere or other. Knowing nothing about how these checks are performed at the hardware level, I can only speculate on ways this might be better than the ultimately far more readable version you propose.

Should I change it?

@iSazonov (Collaborator), Feb 12, 2019

@rjmholt The micro-optimization comes from CoreFX - 2 vs. 3 operations - which gives a huge perf win.

@vexx32 (Author), Feb 12, 2019

Should I add a comment to this effect?

@iSazonov (Collaborator), Feb 12, 2019

Since we got the question, a comment can help code readers.
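
For anyone puzzling over the same question: the subtraction c - '0' is evaluated as an int, and the cast to uint makes any negative result (characters below '0') wrap around to a very large value, so a single comparison rejects both out-of-range directions. A small sketch, not part of the PR itself:

// One unsigned comparison covers both bounds.
internal static bool IsBinaryDigitFast(char c) => (uint)(c - '0') <= 1;

// The more readable equivalent needs two comparisons.
internal static bool IsBinaryDigitReadable(char c) => c == '0' || c == '1';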


// Return true if the character is a type suffix character.
// Returns true if the character is a type suffix character.
internal static bool IsTypeSuffix(this char c)
{
if (c < 128)