Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Binary Parsing Support & Refactor TryGetNumberValue & ScanNumberHelper #7993

Merged
merged 54 commits into from Apr 3, 2019

Conversation

@vexx32
Copy link
Contributor

commented Oct 11, 2018

PR Summary

Fixes #7557

  • Adds support for binary parsing in format echoing hex: 0b11010110
    • Works with all existing type suffixes and multipliers.
    • Supports arbitrary length parsing with n suffix using BigInteger; details below.
  • Adds NumberFormat enum to specify hex/binary/base 10 for the tokenizer, replacing old bool hex.
  • Adds n suffix for all numeric literals to support returning value as a BigInteger if requested. This bypasses the issue of large literals losing accuracy when they cast through double.
  • Adds tests for all new behaviours.

Binary / Hex Parsing Implementation

  • Mimics old sign bit behaviour for int and long types. Sign bits accepted for 8 or 16-bit Hex parsing, and 8, 16, 32, 64 for binary.
    • i.e., 0xFFFFFFFF -eq ([int]-1) and 0xFFFFFFFFFFFFFFFF -eq ([long]-1), but suffixing u creates int.MaxValue and long.MaxValue, respectively, instead.
  • Sign bits higher than this are accepted for bigint-suffixed numerals:
    • Hex: Bigint-suffixed hex treats the high bit of any literal with a length multiple of 8 as the sign bit
    • Binary: Bigint-suffixed binary accepts sign bits at 96 and 128 chars, and from there on every 8 characters.
    • Prefixing the literal with a 0 will bypass this and be treated as unsigned, e.g. 0b011111111
  • Specifying an unsigned suffix (or combination suffix that includes u) ignores sign bits, similar to how parsing a hex string using [Convert]::ToUint32() would do so.
  • Supports negating literals using - prefix. This can result in positive numbers due to sign bits being permitted, just like hex literals.

Refactored numeric tokenizer parsing

New flow:

  1. Check for real (.01, 0.0, or 0e0 syntaxes)
    1. If the decimal suffix is present, TryParse directly into decimal. If the TryParse fails, TryGetNumberValue returns false.
    2. TryParse as Double, and apply multiplier to value. If the TryParse fails, TryGetNumberValue returns false.
      1. Check type suffixes and attempt to cast into appropriate type. This will return false if the value exceeds the specified type's bounds.
      2. Default to parsing as double where no suffix has been applied.
  2. Check number format.
    • If binary, manually parse into BigInteger using optimized helper function to directly construct the BigInteger bytes from the string.
    • If hex, TryParse into BigInteger using some special casing to retain original behaviours in int/long ranges.
    • If neither binary nor hex, TryParse normally as a BigInteger.
  3. Apply multiplier value before attempting any casts to ensure type bounds can be appropriately checked without overflows.
  4. Check type suffixes.
    • If a specific type suffix is used, check type bounds and attempt to parse into that type.
      • If the value exceeds the type's available values, the parse fails. Otherwise, a straight cast is performed.
  5. If no suffix is used, the following types are bounds-checked, in order, resulting in the first successful test determining the type of the number.
    • int
    • long
    • decimal (base-10 literals only)
    • double (base-10 literals only)
    • BigInteger for binary or hex literals. If the value is outside long range (for hex and binary) or double range (for base 10), the parse will fail; higher values must be explicitly requested using the n/N BigInteger suffix.

This is a breaking change as binary literals are now read as numbers instead of generic tokens which could potentially have been used as function / cmdlet names or file names.

Notes:

  • Binary literal support was approved by the committee in #7557
  • The same issue is still under further discussion for underscore support in numeric literals and whether BigInteger parsing ought to be exposed to the user at all.
    • Supporting underscore literals is a further breaking change causing some generic tokens like 1_000_000 to be read as numerals instead. Per @SteveL-MSFT's comment this proposal was rejected.
    • Removing underscore support or preventing standard parsing from accepting BigInteger ranges is a relatively trivial matter. It is my personal opinion that there is no particular reason not to hand the user a BigInteger when they enter a sufficiently large literal, but I will defer to the PowerShell Committee's judgement on this.

PR Checklist

/cc @iSazonov , @SteveL-MSFT as it pertains to the tagged-for-further-discussion #7557

@vexx32 vexx32 requested review from BrucePay and daxian-dbw as code owners Oct 11, 2018

@vexx32

This comment has been minimized.

Copy link
Contributor Author

commented Oct 11, 2018

These are the benchmarks @HumanEquivalentUnit and I performed whilst working on this binary parser. Top notch work from him, frankly astonishing. The two intermediary methods were some of the better solutions of the 10 or 20 iterations of this code that we examined.

https://gist.github.com/vexx32/852cd1d0ddc3e1a93f77b90ab23cfe3d

Note that all cases were tested within ranges that Convert.ToIntX() methods are available, but the current method (labelled BigEndianBinaryOr in this table) generalises seamlessly to arbitrary-length binary strings.

@iSazonov

This comment has been minimized.

Copy link
Collaborator

commented Oct 11, 2018

@vexx32 I'd prefer to see small PR - with 0b only. We could more fast review.

@vexx32

This comment has been minimized.

Copy link
Contributor Author

commented Oct 11, 2018

I understand your viewpoint, but in order to permit 0b parsing a majority of these changes are required -- or a significant amount of redundant code would need to be in place, because TryParse() methods are not capable of parsing binary.

I think the alternative would leave you guys with a lot more to review, only to possibly pull it out later. Also, it would make any further PRs significantly more complex. I felt it best to simplify the code paths as much as possible, in order to prevent the tokenizer becoming a complex morass of unmaintainable code.

The only parts I could really strip out for you guys to simplify them is the underscore literal support, and that is... not a lot. :(

@iSazonov

This comment has been minimized.

Copy link
Collaborator

commented Oct 11, 2018

Underscore and I suffix is not approved in related issue. So we have to wait.

@vexx32 vexx32 referenced this pull request Oct 11, 2018
8 of 11 tasks complete

@vexx32 vexx32 force-pushed the vexx32:Tokenizer/Refactor branch Oct 12, 2018

@vexx32

This comment has been minimized.

Copy link
Contributor Author

commented Oct 12, 2018

Rebased to pick up CI fixes. 😄

@iSazonov
Copy link
Collaborator

left a comment

I'll continue after my computer rebooted :-)

src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved
src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved
src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved
src/System.Management.Automation/engine/parser/tokenizer.cs Outdated Show resolved Hide resolved
src/System.Management.Automation/engine/Utils.cs Outdated
case 32: // int
case 64: // long
case 96: // decimal
case int n when n >= 128: // BigInteger

This comment has been minimized.

Copy link
@iSazonov

iSazonov Oct 12, 2018

Collaborator

It is clear logic for sign but I think we need more feedback whether this is a good UX.

This comment has been minimized.

Copy link
@vexx32

vexx32 Oct 12, 2018

Author Contributor

This is consistent with the existing patterns for things like hexadecimal, but I would be inclined to agree that there could be improvements made here, should we want to.

src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved
src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved
src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved
src/System.Management.Automation/engine/parser/CharTraits.cs Outdated
{
if (c < 128)
{
return (s_traits[c] & CharTraits.BinaryDigit) != 0;

This comment has been minimized.

Copy link
@iSazonov

iSazonov Oct 14, 2018

Collaborator

What about:

return c & 0x30 || c & 0x31;

This comment has been minimized.

Copy link
@vexx32

vexx32 Oct 14, 2018

Author Contributor

There's no reason we can't do this, but I don't see anything else in the file that behaves this way and felt it best to continue the prevalent pattern here?

This comment has been minimized.

Copy link
@iSazonov

iSazonov Oct 15, 2018

Collaborator

+30% performance :-)

This comment has been minimized.

Copy link
@iSazonov

iSazonov Oct 15, 2018

Collaborator

Based on the experiment below my suggestion should be faster :-)

This comment has been minimized.

Copy link
@vexx32

vexx32 Oct 15, 2018

Author Contributor

Do you meant c == 0x30 || c == 0x31? & does not return a straight true/false or 0 / 1
(we can only do this in the binary parse because we also & 0b1111 or strip the high bits by cast to (byte))

This comment has been minimized.

Copy link
@iSazonov

iSazonov Oct 17, 2018

Collaborator
Suggested change
return (s_traits[c] & CharTraits.BinaryDigit) != 0;
internal static bool IsBinaryDigit(this char c) => c == '0' || c == '1';

This comment has been minimized.

Copy link
@vexx32

vexx32 Oct 17, 2018

Author Contributor

@iSazonov honestly a lot of these could also be reformatted like this regardless. For example:

internal static bool IsDecimalDigit(this char c)
{
    if (c < 128)
    {
        return (s_traits[c] & CharTraits.Digit) != 0;
    }

    return false;
}

becomes

internal static bool IsDecimalDigit(this char c) => c < 128 && (s_traits[c] & CharTraits.Digit) != 0;

Is this a worthwhile change to make to the common pattern of these methods?

This comment has been minimized.

Copy link
@iSazonov

iSazonov Jan 30, 2019

Collaborator

Yes, I think we can do this.
Pattern from CoreFX repo:

internal static bool IsBinaryDigit(this char c) => (uint)(c - '0') <= 1;
internal static bool IsDecimalDigit(this char c) => (uint)(c - '0') <= 9;

This comment has been minimized.

Copy link
@vexx32

vexx32 Jan 30, 2019

Author Contributor

Done! :)

test/powershell/Modules/Microsoft.PowerShell.Utility/Get-Random.Tests.ps1 Outdated Show resolved Hide resolved
src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved
src/System.Management.Automation/engine/Utils.cs Outdated
// Calculate number of 8-bit bytes needed to hold the input, rounded up to next whole number.
// This value will also be used as the indexer for our array.
int outputByteWalker = (digits.Length + 7) / 8;
Span<byte> outputBytes = new byte[outputByteWalker--];

This comment has been minimized.

Copy link
@iSazonov

iSazonov Oct 15, 2018

Collaborator

Why we use auto-decrement here? I'd prefer do "-1" in previous init line.

This comment has been minimized.

Copy link
@HumanEquivalentUnit

HumanEquivalentUnit Oct 15, 2018

It just felt clever to use autodecrement to change the variable from "size of array" to "last index of array".
It is not very clear, and could be outputByteWalker -= 1 in the following line, if that's more readable. Or like this?

            Span<byte> outputBytes = new byte[(digits.Length + 7) / 8];
            int outputByteWalker = outputBytes.Length - 1;

This comment has been minimized.

Copy link
@iSazonov

iSazonov Oct 15, 2018

Collaborator

We could use more informational name like outputByteWalkerLastIndex.

src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved
src/System.Management.Automation/engine/Utils.cs Outdated
@@ -202,7 +202,7 @@ internal static BigInteger ParseBinary(ReadOnlySpan<char> digits, bool unsigned)
// Calculate number of 8-bit bytes needed to hold the input, rounded up to next whole number.
// This value will also be used as the indexer for our array.
int outputByteWalker = (digits.Length + 7) / 8;
Span<byte> outputBytes = new byte[outputByteWalker--];
Span<byte> outputBytes = outputByteWalker <= 512 ? stackalloc byte[outputByteWalker--] : new byte[outputByteWalker--];

This comment has been minimized.

Copy link
@iSazonov

iSazonov Oct 15, 2018

Collaborator

Please don't use outputByteWalker-- here - very bad readability is way to bugs.
I'd add new local variable - one for size/length, second for last index if needed.

This comment has been minimized.

Copy link
@vexx32

vexx32 Oct 15, 2018

Author Contributor

I'll take care of this momentarily. 😄

@vexx32 vexx32 force-pushed the vexx32:Tokenizer/Refactor branch Oct 15, 2018

src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved
src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved
src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved

@vexx32 vexx32 force-pushed the vexx32:Tokenizer/Refactor branch to 7d4ff02 Oct 16, 2018

case 16: // short
case 32: // int
case 64: // long
case 96: // decimal

This comment has been minimized.

Copy link
@iSazonov

iSazonov Oct 17, 2018

Collaborator

I wonder - is this constants the same for all hardware platforms? Can we catch a problem on some ARMs? Or more general question - do we sure that PowerShell signed constants (binary and hex) is portable?

@mklement0 have you any thoughts?

This comment has been minimized.

Copy link
@vexx32

vexx32 Oct 17, 2018

Author Contributor

To my knowledge, there aren't any supported platforms where these numbers will be different. However, if there is a programmatic way to get expected sign bit lengths, I would be more than open to checking against this instead.

This comment has been minimized.

Copy link
@iSazonov

iSazonov Oct 17, 2018

Collaborator

@vexx32 It is not low level problem because we say about digits - it can be PowerShell language propblem if ARM users expect sign bit on another position.

This comment has been minimized.

Copy link
@vexx32

vexx32 Oct 17, 2018

Author Contributor

Aye, that could definitely be a concern. I could always refactor this to incorporate this, but I'd need to know what to incorporate here to account for it. Not sure where to find this information at the moment. 🙁

This comment has been minimized.

Copy link
@iSazonov

iSazonov Oct 17, 2018

Collaborator

I found nothing useful. I hope @mklement0 help.

This comment has been minimized.

Copy link
@mklement0

mklement0 Oct 24, 2018

Contributor

@vexx32: The always-positive logic is fairly easy to implement, if I understand things correctly:

From a hex string: prepend a 0:

// 0xff as 255; without the leading 0, the result would be -1
BigInteger.Parse("0ff",NumberStyles.AllowHexSpecifier)

From a byte array - prepend a 0x byte (with Big-Endian ordering):

new BigInteger(new byte[] { 0x0, 0xff }, isBigEndian: true)

This comment has been minimized.

Copy link
@vexx32

vexx32 Oct 24, 2018

Author Contributor

Even easier, actually. Hex strings are currently given a leading zero by default with my current code. I am then stripping it out in cases where it appears appropriate to allow a sign bit. I would simply remove the additional logic.

And with binary, it would simply always mean invoking the BigInteger constructor with a value of true for the isUnsigned parameter.

This comment has been minimized.

Copy link
@vexx32

vexx32 Dec 5, 2018

Author Contributor

Thinking further, I could even trim out the extra leading 0 if we eventually opt for parsing as unsigned, as it will be possible to simply instruct the BigInteger constructor to always read it as unsigned.

This comment has been minimized.

Copy link
@SteveL-MSFT

SteveL-MSFT Feb 2, 2019

Member

Sorry for the delay, on the topic of the breaking change. I know it's desirable to change it, but it seems a bit risky to make such a change.

This comment has been minimized.

Copy link
@vexx32

vexx32 Feb 2, 2019

Author Contributor

I figured that would probably be the case. It is a fairly straightforward follow up if we find reason to change it later.

I could even put it in as an experimental feature later on, I think.

| (digits[byteWalker - 6] << 6)
| (digits[byteWalker - 5] << 5)
| (digits[byteWalker - 4] << 4))
| (((digits[byteWalker - 3] << 3)

This comment has been minimized.

Copy link
@iSazonov

iSazonov Oct 17, 2018

Collaborator

Why we use the extra parentheses?

This comment has been minimized.

Copy link
@HumanEquivalentUnit

HumanEquivalentUnit Oct 17, 2018

It lets us make (8x zero + 7x OR) operations down to (1x zeroing + 8x OR), and is faster.

The brackets are:

(byte) (
    ((7)|(6)|(5)|(4))   |  ( ((3)|(2)|(1)|0) & 0b1111 )
)

This works because the bit-patterns in the input (chars '0', '1') are 00110000 and 00110001. That is a coincidence of them being ASCII 48, ASCII 49. We can take advantage of those bit patterns, and do less work. We could zero the 0011 on every character, like this more readable slower code, and then shift the least significant bit, and OR them all together:

(byte)(
      ((digits[byteWalker - 7] - '0') << 7) |
      ((digits[byteWalker - 6] - '0') << 6) |
      ((digits[byteWalker - 5] - '0') << 5) |
      ((digits[byteWalker - 4] - '0') << 4) |
      ((digits[byteWalker - 3] - '0') << 3) |
      ((digits[byteWalker - 2] - '0') << 2) |
      ((digits[byteWalker - 1] - '0') << 1) |
      ((digits[byteWalker - 0] - '0')     )
      )

But, for the 7,6,5,4 shifts, they will push the 0011 part far out of the way. When we cast to (byte) they get dropped. Saves 4x zeroing operations if we treat those as a group.

The 3,2,1,0 shifts will combine with each other, but the high 0011 bits would clash if we weren't careful. So we merge these as a group, then do a single & 0b1111 of this group, and save 3x zeroing.

Then +1x OR to merge the two halves. ((7)|(6)|(5)|(4)) | ( ((3)|(2)|(1)|0) & 0b1111 )

This is quite hard to explain, I hope this is readable.

This comment has been minimized.

Copy link
@iSazonov

iSazonov Oct 17, 2018

Collaborator

Why not ( (7)|(6)|(5)|(4)|(3)|(2)|(1)|(0) ) & 0b1111?

This comment has been minimized.

Copy link
@HumanEquivalentUnit

HumanEquivalentUnit Oct 17, 2018

That would not give the right answer, it would make the high bits from inputs 3,2,1,0 overwrite the data from 7,6,5,4 (then & 0b1111 would leave only 0-15 output values).

String input:        "10101011"
index from right:     76543210

Bit patterns of shifted input characters, 
we want just the single far right bit from each line, 
all merged down:

    0011000|7
     001100|06
      00110|005   
       0011|0004
             |-------possible overlap when combining
        001|1000|3
         00|1100|02
          0|0110|001
           |0011|0000

             ^ removed by &0b1111 to prevent overlap
                only in the 3,2,1,0 section

      ^ far left overlap is OK, dropped by (byte) cast

This comment has been minimized.

Copy link
@iSazonov

iSazonov Oct 17, 2018

Collaborator

Thanks for clarification!

I think we need reformat the code and add the comment.

This comment has been minimized.

Copy link
@vexx32

vexx32 Oct 17, 2018

Author Contributor

I would appreciate some suggestions on how to approach the reformatting. The actual behaviour of the code is documented in the comments (although perhaps that can also be improved upon).

I suspect it will be necessary to skirt StyleCop rules heavily to make this much more readable than it is?

This comment has been minimized.

Copy link
@iSazonov

iSazonov Oct 17, 2018

Collaborator
  1. We could add the well-formatted sample ( (7)|(6)|(5)|(4) ) | ( ( (3)|(2)|(1)|(0) ) & 0b1111 ) in your comment.
  2. We could add white spaces in code for better readability.

This comment has been minimized.

Copy link
@vexx32

vexx32 Oct 17, 2018

Author Contributor

Sounds good!

I think this might annoy StyleCop, but... such is the price of readability. 😄

src/System.Management.Automation/engine/parser/CharTraits.cs Outdated
{
if (c < 128)
{
return (s_traits[c] & CharTraits.BinaryDigit) != 0;

This comment has been minimized.

Copy link
@iSazonov

iSazonov Oct 17, 2018

Collaborator
Suggested change
return (s_traits[c] & CharTraits.BinaryDigit) != 0;
internal static bool IsBinaryDigit(this char c) => c == '0' || c == '1';
test/powershell/Modules/Microsoft.PowerShell.Utility/Get-Random.Tests.ps1 Outdated Show resolved Hide resolved
@iSazonov

This comment has been minimized.

Copy link
Collaborator

commented Oct 17, 2018

@vexx32 I have no more ideas than to occupy your time :-) - thanks for great work! We are waiting PowerShell Committee.

@vexx32

This comment has been minimized.

Copy link
Contributor Author

commented Oct 17, 2018

Appreciate you taking the time to help us make it better, Ilya! 😄

I might rebase once more to tidy the commit history a little if I get a moment, but otherwise I don't mind waiting till the Committee makes a decision here.

@iSazonov

This comment has been minimized.

Copy link
Collaborator

commented Oct 17, 2018

Please don't rebase until maintainers ask you.

@stale

This comment has been minimized.

Copy link

commented Nov 23, 2018

This PR has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs within 10 days.
Thank you for your contributions.
Community members are welcome to grab these works.

@stale stale bot added the Stale label Nov 23, 2018

Add more tests for hex parsing
💬 Be as thorough as is reasonable to prevent regressions

@vexx32 vexx32 force-pushed the vexx32:Tokenizer/Refactor branch from 9a3e131 to 684cc31 Feb 1, 2019

@iSazonov iSazonov requested a review from SteveL-MSFT Feb 1, 2019

@iSazonov

This comment has been minimized.

Copy link
Collaborator

commented Feb 1, 2019

@SteveL-MSFT @TravisEz13 @daxian-dbw Please review the PR.

@vexx32

This comment has been minimized.

Copy link
Contributor Author

commented Feb 11, 2019

@TravisEz13 are we just pending Dongbo to get back from his well-deserved vacation and review?

Any chance this will make it for 6.2, or should I expect it to wait till after that is released? :slight_smile:

@TravisEz13

This comment has been minimized.

Copy link
Member

commented Feb 11, 2019

@vexx32 I'd like to get someone to review this that has some experience in the area. Or at least one more person from the PowerShell team.

@rjmholt
Copy link
Member

left a comment

This looks good to me. Left a minor comment, but not a blocker

return false;
}
// Return true if the character is a binary digit.
internal static bool IsBinaryDigit(this char c) => (uint)(c - '0') <= 1;

This comment has been minimized.

Copy link
@rjmholt

rjmholt Feb 11, 2019

Member

While this is true, it seems more straightforward to just have => c == '0' || c == '1';. It took me a minute or so to realise why this isn't a problem.

This comment has been minimized.

Copy link
@rjmholt

rjmholt Feb 11, 2019

Member

(Perhaps the coercion + comparison has a performance advantage?)

This comment has been minimized.

Copy link
@vexx32

vexx32 Feb 12, 2019

Author Contributor

@rjmholt I was following @iSazonov's lead on that one, he apparently pulled that from the CoreFX code somewhere or other. Knowing nothing about how these checks would be performed on a hardware level means I can only speculate on potential ways this might be better than the ultimately far more readable version you propose.

Should I change it?

This comment has been minimized.

Copy link
@iSazonov

iSazonov Feb 12, 2019

Collaborator

@rjmholt The micro optimization come from CoreFX - 2 vs 3 operations - this gives huge perf win.

This comment has been minimized.

Copy link
@vexx32

vexx32 Feb 12, 2019

Author Contributor

Should I add a comment to this effect?

This comment has been minimized.

Copy link
@iSazonov

iSazonov Feb 12, 2019

Collaborator

Because we get the question the comment can help code readers.

result = dmValue;
return true;
}
else if (Utils.TryCast(bigValue, out double d))

This comment has been minimized.

Copy link
@rjmholt

rjmholt Feb 11, 2019

Member

No need for the else here; following the branch above always returns

@vexx32

This comment has been minimized.

Copy link
Contributor Author

commented Feb 12, 2019

@TravisEz13 Seems like CI got stuck here, sorry; could you kick it for me when you get some time? Cheers! 😊

@TravisEz13

This comment has been minimized.

Copy link
Member

commented Feb 12, 2019

@vexx32 The jobs never started. I cannot restart them without closing the PR and re-opening it. I've seen in a few cases where you have force-pushed that the PR cannot be reopened. The easiest way is to force push again. For example, make a whitespace change to one of your commit messages and push.

@TravisEz13 TravisEz13 added this to the 6.3.0-preview.1 milestone Feb 13, 2019

@TravisEz13

This comment has been minimized.

Copy link
Member

commented Feb 13, 2019

Because this is a breaking change, I'm scheduling it to be merged first thing during 6.3.

@vexx32

This comment has been minimized.

Copy link
Contributor Author

commented Apr 2, 2019

@TravisEz13 Do you guys need anything more from me before you merge? 🙂

@iSazonov

This comment has been minimized.

Copy link
Collaborator

commented Apr 3, 2019

Perhaps @daxian-dbw could find something :-)

@iSazonov iSazonov added the CL-Engine label Apr 3, 2019

@TravisEz13 TravisEz13 merged commit 3f52adb into PowerShell:master Apr 3, 2019

7 of 8 checks passed

Codacy/PR Quality Review Not up to standards. This pull request quality could be better.
Details
CodeFactor 563 issues fixed. 17 issues found.
Details
PowerShell-CI-linux #PR-7993-20190304.01 succeeded
Details
PowerShell-CI-macos #PR-7993-20190304.01 succeeded
Details
PowerShell-CI-static-analysis #PR-7993-20190304.01 succeeded
Details
PowerShell-CI-windows #PR-7993-20190304.01 succeeded
Details
WIP Ready for review
Details
license/cla All CLA requirements met.
Details
@iSazonov

This comment has been minimized.

Copy link
Collaborator

commented Apr 4, 2019

@vexx32 Thanks for the great improvement!

@vexx32 vexx32 deleted the vexx32:Tokenizer/Refactor branch Apr 4, 2019

adityapatwardhan added a commit to adityapatwardhan/PowerShell that referenced this pull request Apr 4, 2019

Add Binary Parsing Support & Refactor TryGetNumberValue & ScanNumberH…
…elper (PowerShell#7993)

Fixes PowerShell#7557 

* Adds support for binary parsing in format echoing hex: `0b11010110`
  * Works with all existing type suffixes and multipliers.
  * Supports arbitrary length parsing with `n` suffix using BigInteger; details below.
* Adds `NumberFormat` enum to specify hex/binary/base 10 for the tokenizer, replacing old `bool hex`.
* Adds `n` suffix for all numeric literals to support returning value as a `BigInteger` if requested. This bypasses the issue of large literals losing accuracy when they cast through `double`.
* Adds tests for all new behaviours.

---

### Binary / Hex Parsing Implementation

* Mimics old sign bit behaviour for int and long types. Sign bits accepted for 8 or 16-bit Hex parsing, and 8, 16, 32, 64 for binary.
  * i.e., `0xFFFFFFFF -eq ([int]-1)` and `0xFFFFFFFFFFFFFFFF -eq ([long]-1)`, but suffixing `u` creates `int.MaxValue` and `long.MaxValue`, respectively, instead.
* Sign bits higher than this are accepted for bigint-suffixed numerals:
    * Hex: Bigint-suffixed hex treats the high bit of any literal with a length multiple of 8 as the sign bit
    * Binary: Bigint-suffixed binary accepts sign bits at 96 and 128 chars, and from there on every 8 characters.
    * Prefixing the literal with a 0 will bypass this and be treated as unsigned, e.g. `0b011111111`
* Specifying an `u`nsigned suffix (or combination suffix that includes `u`) ignores sign bits, similar to how parsing a hex string using `[Convert]::ToUint32()` would do so.
* Supports negating literals using `-` prefix. This can result in positive numbers due to sign bits being permitted, just like hex literals.

---

### Refactored numeric tokenizer parsing

**New flow:**

1. Check for `real` (`.01`, `0.0`, or `0e0` syntaxes)
    1. If the decimal suffix is present, TryParse directly into decimal. If the TryParse fails, TryGetNumberValue returns `false`.
    2. TryParse as `Double`, and apply multiplier to value. If the TryParse fails, TryGetNumberValue returns `false`.
        1. Check type suffixes and attempt to cast into appropriate type. This will return `false` if the value exceeds the specified type's bounds.
        2. Default to parsing as `double` where no suffix has been applied.
2. Check number format.
    * If binary, manually parse into BigInteger using optimized helper function to directly construct the BigInteger bytes from the string.
    * If hex, TryParse into `BigInteger` using some special casing to retain original behaviours in int/long ranges.
    * If neither binary nor hex, TryParse normally as a `BigInteger`.
3. Apply multiplier value before attempting any casts to ensure type bounds can be appropriately checked without overflows.
4. Check type suffixes.
    * If a specific type suffix is used, check type bounds and attempt to parse into that type.
      * If the value exceeds the type's available values, the parse fails. Otherwise, a straight cast is performed.
5. If no suffix is used, the following types are bounds-checked, in order, resulting in the first successful test determining the type of the number. 
    * `int`
    * `long`
    * `decimal` (base-10 literals only)
    * `double` (base-10 literals only)
    * ~~`BigInteger` for binary or hex literals.~~ If the value is outside `long` range (for hex and binary) or `double` range (for base 10), the parse will fail; higher values must be explicitly requested using the `n`/`N` BigInteger suffix.

---

*This is a breaking change* as binary literals are now read as numbers instead of generic tokens which could potentially have been used as function / cmdlet names or file names.

Notes:
* Binary literal support was approved by the committee in PowerShell#7557 
* ~~The same issue is still under further discussion for underscore support in numeric literals and whether BigInteger parsing ought to be exposed to the user at all.~~
    * ~~Supporting underscore literals is a further breaking change causing some generic tokens like `1_000_000` to be read as numerals instead.~~ Per @SteveL-MSFT's [comment](PowerShell#7993 (comment)) this proposal was rejected.
    * ~~Removing underscore support or preventing standard parsing from accepting BigInteger ranges is a relatively trivial matter. It is my personal opinion that there is no particular reason *not* to hand the user a BigInteger when they enter a sufficiently large literal, but I will defer to the PowerShell Committee's judgement on this.~~
@sdwheeler sdwheeler referenced this pull request Apr 9, 2019
1 of 8 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.