Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add debug flag -RegexBytecodeDebug and initial RegExp bytecode refactoring #3915

Merged
merged 3 commits into from
Oct 13, 2017

Conversation

dilijev
Copy link
Contributor

@dilijev dilijev commented Oct 10, 2017

(Note: I'll rebase this to a single commit shortly.)

-RegexDebug displays various views of RegExp compilation in ChakraCore. In an effort to improve the memory efficiency of the RegExp bytecode that makes up our compiled RegExp programs, this change adds a new view under the additional flag -RegexBytecodeDebug to view the bytecode binary layout.

For example, for this test case:

let re = /[ab]c+(d|e)?/i;
console.log(re.test('abcce')); // true

Running ch.exe -RegexDebug will display some output including the AST which is used to generate the bytecode. The bytecode is displayed last (in a human-readable disassembly), and looks like this:

REGEX PROGRAM /[ab]c+(d|e)?/
Program {
    source:       [ab]c+(d|e)?
    flags:        ignorecase
    numGroups:    2
    numLoops:     0
    instructions: {
        L0000: SyncToSetAndBackup(set: [Cc], backup: [1-inf])
        L0040: SyncToSetAndConsume(set: [ABab])
        L0078: ChompSet<Plus>(set: [Cc])
        L00b0: TryMatchSet(set: [DEde], failLabel: L0110)
        L00f0: DefineGroupFixed(groupId: 1, length: 1, noNeedToSave: true)
        L0110: Succ()
    }
}

Now you can run ch.exe -RegexDebug -RegexBytecodeDebug and see the following more detailed view of the disassembly which includes the actual bytecode layout.

REGEX PROGRAM /[ab]c+(d|e)?/
Program {
    source:       [ab]c+(d|e)?
    flags:        ignorecase
    numGroups:    2
    numLoops:     0
    instructions: {
        L0000: (0x040 bytes) SyncToNegatedSetAndBackup(set: [Cc], backup: [1-inf])
            0x000001C19403C128[+0x008](0x008) [Inst]: 2f000000 00000000
            0x000001C19403C130[+0x010](0x028) [SetMixin<false>]: 00000000 00000000 00000000 00000000 08000000 08000000 00000000 00000000 00000000 00000000
            0x000001C19403C158[+0x038](0x008) [BackupMixin]: 01000000 ffffffff
        L0040: (0x038 bytes) SyncToNegatedSetAndConsume(set: [ABab])
            0x000001C19403C168[+0x008](0x008) [Inst]: 27000000 00000000
            0x000001C19403C170[+0x010](0x028) [SetMixin<false>]: 00000000 00000000 00000000 00000000 06000000 06000000 00000000 00000000 00000000 00000000
        L0078: (0x038 bytes) ChompSet<Plus>(set: [Cc])
            0x000001C19403C1A0[+0x008](0x008) [Inst]: 4c000000 00000000
            0x000001C19403C1A8[+0x010](0x028) [SetMixin]: 00000000 00000000 00000000 00000000 08000000 08000000 00000000 00000000 00000000 00000000
        L00b0: (0x040 bytes) TryMatchSet(set: [DEde], failLabel: L0110)
            0x000001C19403C1D8[+0x008](0x008) [Inst]: 58000000 00000000
            0x000001C19403C1E0[+0x010](0x028) [SetMixin<false>]: 00000000 00000000 00000000 00000000 30000000 30000000 00000000 00000000 00000000 00000000
            0x000001C19403C208[+0x038](0x004) [TryMixin]: 10010000
        L00f0: (0x020 bytes) DefineGroupFixed(groupId: 1, length: 1, noNeedToSave: true)
            0x000001C19403C218[+0x008](0x008) [Inst]: 3a000000 00000000
            0x000001C19403C220[+0x010](0x004) [GroupMixin]: 01000000
            0x000001C19403C224[+0x014](0x004) [FixedLengthMixin]: 01000000
            0x000001C19403C228[+0x018](0x001) [NoNeedToSaveMixin]:01
        L0110: (0x010 bytes) Succ()
            0x000001C19403C238[+0x008](0x008) [Inst]: 02000000 00000000
            0x000001C19403C230[+0x000](0x010) [NopInst]: (no unique data -- skipping)
    }
}

Background and Future Opportunities

The following is some background on what you're seeing above and possible optimizations this exercise has illuminated.

The bytecode is constructed by using placement-new to allocate structs representing bytecode operations into a memory buffer. Since every opcode (aka instruction "tag") value fits into a single byte (at present), the above view makes it clear how much internal fragmentation results from adding padding for small quantities like the opcode (aka instruction tag), and small quantities.

For many opcodes for small regexes, many quantities also fit into a single byte or a short, instead of an int32, so there is a lot of support for the idea of adding variable-length encoding to save space.

Some portions of the opcode data like SetMixin are wasteful by always including space for a pointer (when we might know from the AST that this is guaranteed to be nullptr before compiling the bytecode), and we could split that opcode into two separate operations to save the 8 bytes used for the pointer when it is not needed. This would also remove a branch from execution and save on runtime cost.

Additionally, the bit vector representing the first 256 characters may be wasteful because 128-255 are not ASCII and characters in that range may be uncommon enough that it would be more efficient overall to encode that information in the same data structure used for other unicode code points (the data structure stored by the aforementioned sometimes-null pointer).

Many opcodes encode one or more booleans which get padded out to 4 or 8 bytes (depending) for a single bit of information. Encoding that information in a template param would follow the practice of other places in the bytecode and significantly save on the size of those bytecode layouts.

All of the bytecodes start at offset [+0x008] because I'm skipping over the Debug-only vtable pointer when displaying the layout. I'm planning on refactoring out the virtual Print method (in favor of a static alternative) because this code already uses static dispatch based on the bytecode's opcode in many places, and this vtable pointer distorts how the bytecode would be laid-out in production.

Throughout this exercise, I've been bouncing ideas off of @Penguinwizzard. Thanks to him for listening and reviewing my work in progress!

@MikeHolman
Copy link
Contributor

What does perf look like?

@dilijev
Copy link
Contributor Author

dilijev commented Oct 10, 2017

@MikeHolman This is a debug-only visualization tool. No changes to perf. This refactoring introduced a slight memory regression which will disappear when I move the boolean fields into template params.

@MikeHolman
Copy link
Contributor

Ah, reading is hard. Thought the optimizations you mention were part of this.

@rajatd
Copy link
Contributor

rajatd commented Oct 10, 2017

The comments in RegexOpcodes.h - are those left over or intended to serve as pointers for future work?

@dilijev
Copy link
Contributor Author

dilijev commented Oct 10, 2017

@rajatd Thanks, I meant to remove most of those comments. I'll do so. Any other TODOs left over in this patch will be indicators for future work.

@dilijev
Copy link
Contributor Author

dilijev commented Oct 10, 2017

I will rebase this PR soon. Original commits can be found at dilijev:re-bc-0 in case that aids in reviewing.

@dilijev
Copy link
Contributor Author

dilijev commented Oct 12, 2017

@dotnet-bot test Windows 7 ci_dev12_x64_debug please

@dilijev
Copy link
Contributor Author

dilijev commented Oct 12, 2017

^ ah, we removed the legacy checks so the dev12 check doesn't exist anymore.

/cc @Penguinwizzard I updated the set of required checks for master. (FYI @MSLaguana)

Copy link
Contributor

@Penguinwizzard Penguinwizzard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dilijev
Copy link
Contributor Author

dilijev commented Oct 12, 2017

    PRINT_BYTES(NopInst);

SuccInst

Not important enough to hold back this PR since this results in printing no additional bytes (all of NopInst, SuccInst and FailInst behave the same in this regard).

I'll fix in a subsequent change.

a la

        L0110: (0x010 bytes) Succ()
            0x000001C19403C238[+0x008](0x008) [Inst]: 02000000 00000000
            0x000001C19403C230[+0x000](0x010) [NopInst]: (no unique data -- skipping)

Refers to: lib/Parser/RegexRuntime.cpp:1028 in 27a2cb8. [](commit_id = 27a2cb8, deletion_comment = False)

@dilijev
Copy link
Contributor Author

dilijev commented Oct 12, 2017

    PRINT_BYTES(NopInst);

FailInst

Not important enough to hold back this PR since this results in printing no additional bytes (all of NopInst, SuccInst and FailInst behave the same in this regard).

I'll fix in a subsequent change.


Refers to: lib/Parser/RegexRuntime.cpp:1006 in 27a2cb8. [](commit_id = 27a2cb8, deletion_comment = False)

@dilijev
Copy link
Contributor Author

dilijev commented Oct 12, 2017

// actual opcode name, even though the logic is basically the same in each definition. See notes below.

Could avoid the duplication with a macro.

This is not important enough to hold back this PR. If I decide to make this change I'll fix it in a subsequent PR.

#WontFix


Refers to: lib/Parser/RegexRuntime.cpp:2113 in 27a2cb8. [](commit_id = 27a2cb8, deletion_comment = False)

@chakrabot chakrabot merged commit 27a2cb8 into chakra-core:master Oct 13, 2017
chakrabot pushed a commit that referenced this pull request Oct 13, 2017
… RegExp bytecode refactoring

Merge pull request #3915 from dilijev:re-bc

(Note: I'll rebase this to a single commit shortly.)

`-RegexDebug` displays various views of RegExp compilation in ChakraCore. In an effort to improve the memory efficiency of the RegExp bytecode that makes up our compiled RegExp programs, this change adds a new view under the additional flag -RegexBytecodeDebug to view the bytecode binary layout.

For example, for this test case:

```js
let re = /[ab]c+(d|e)?/i;
console.log(re.test('abcce')); // true
```

Running `ch.exe -RegexDebug` will display some output including the AST which is used to generate the bytecode. The bytecode is displayed last (in a human-readable disassembly), and looks like this:

```
REGEX PROGRAM /[ab]c+(d|e)?/
Program {
    source:       [ab]c+(d|e)?
    flags:        ignorecase
    numGroups:    2
    numLoops:     0
    instructions: {
        L0000: SyncToSetAndBackup(set: [Cc], backup: [1-inf])
        L0040: SyncToSetAndConsume(set: [ABab])
        L0078: ChompSet<Plus>(set: [Cc])
        L00b0: TryMatchSet(set: [DEde], failLabel: L0110)
        L00f0: DefineGroupFixed(groupId: 1, length: 1, noNeedToSave: true)
        L0110: Succ()
    }
}
```

Now you can run `ch.exe -RegexDebug -RegexBytecodeDebug` and see the following more detailed view of the disassembly which includes the actual bytecode layout.

```
REGEX PROGRAM /[ab]c+(d|e)?/
Program {
    source:       [ab]c+(d|e)?
    flags:        ignorecase
    numGroups:    2
    numLoops:     0
    instructions: {
        L0000: (0x040 bytes) SyncToNegatedSetAndBackup(set: [Cc], backup: [1-inf])
            0x000001C19403C128[+0x008](0x008) [Inst]: 2f000000 00000000
            0x000001C19403C130[+0x010](0x028) [SetMixin<false>]: 00000000 00000000 00000000 00000000 08000000 08000000 00000000 00000000 00000000 00000000
            0x000001C19403C158[+0x038](0x008) [BackupMixin]: 01000000 ffffffff
        L0040: (0x038 bytes) SyncToNegatedSetAndConsume(set: [ABab])
            0x000001C19403C168[+0x008](0x008) [Inst]: 27000000 00000000
            0x000001C19403C170[+0x010](0x028) [SetMixin<false>]: 00000000 00000000 00000000 00000000 06000000 06000000 00000000 00000000 00000000 00000000
        L0078: (0x038 bytes) ChompSet<Plus>(set: [Cc])
            0x000001C19403C1A0[+0x008](0x008) [Inst]: 4c000000 00000000
            0x000001C19403C1A8[+0x010](0x028) [SetMixin]: 00000000 00000000 00000000 00000000 08000000 08000000 00000000 00000000 00000000 00000000
        L00b0: (0x040 bytes) TryMatchSet(set: [DEde], failLabel: L0110)
            0x000001C19403C1D8[+0x008](0x008) [Inst]: 58000000 00000000
            0x000001C19403C1E0[+0x010](0x028) [SetMixin<false>]: 00000000 00000000 00000000 00000000 30000000 30000000 00000000 00000000 00000000 00000000
            0x000001C19403C208[+0x038](0x004) [TryMixin]: 10010000
        L00f0: (0x020 bytes) DefineGroupFixed(groupId: 1, length: 1, noNeedToSave: true)
            0x000001C19403C218[+0x008](0x008) [Inst]: 3a000000 00000000
            0x000001C19403C220[+0x010](0x004) [GroupMixin]: 01000000
            0x000001C19403C224[+0x014](0x004) [FixedLengthMixin]: 01000000
            0x000001C19403C228[+0x018](0x001) [NoNeedToSaveMixin]:01
        L0110: (0x010 bytes) Succ()
            0x000001C19403C238[+0x008](0x008) [Inst]: 02000000 00000000
            0x000001C19403C230[+0x000](0x010) [NopInst]: (no unique data -- skipping)
    }
}
```

# Background and Future Opportunities

The following is some background on what you're seeing above and possible optimizations this exercise has illuminated.

The bytecode is constructed by using placement-new to allocate structs representing bytecode operations into a memory buffer. Since every opcode (aka instruction "tag") value fits into a single byte (at present), the above view makes it clear how much internal fragmentation results from adding padding for small quantities like the opcode (aka instruction tag), and small quantities.

For many opcodes for small regexes, many quantities also fit into a single byte or a short, instead of an int32, so there is a lot of support for the idea of adding variable-length encoding to save space.

Some portions of the opcode data like SetMixin are wasteful by always including space for a pointer (when we might know from the AST that this is guaranteed to be `nullptr` before compiling the bytecode), and we could split that opcode into two separate operations to save the 8 bytes used for the pointer when it is not needed. This would also remove a branch from execution and save on runtime cost.

Additionally, the bit vector representing the first 256 characters may be wasteful because 128-255 are not ASCII and characters in that range may be uncommon enough that it would be more efficient overall to encode that information in the same data structure used for other unicode code points (the data structure stored by the aforementioned sometimes-null pointer).

Many opcodes encode one or more booleans which get padded out to 4 or 8 bytes (depending) for a single bit of information. Encoding that information in a template <bool> param would follow the practice of other places in the bytecode and significantly save on the size of those bytecode layouts.

All of the bytecodes start at offset [+0x008] because I'm skipping over the Debug-only vtable pointer when displaying the layout. I'm planning on refactoring out the virtual `Print` method (in favor of a static alternative) because this code already uses static dispatch based on the bytecode's opcode in many places, and this vtable pointer distorts how the bytecode would be laid-out in production.

Throughout this exercise, I've been bouncing ideas off of @Penguinwizzard. Thanks to him for listening and reviewing my work in progress!
@dilijev dilijev deleted the re-bc branch October 13, 2017 16:44
chakrabot pushed a commit that referenced this pull request Oct 27, 2017
…BoundaryTest's isNegation, converting bools to template parameters.

Merge pull request #4066 from dilijev:rebc-rm-bool-hardfail

Follow up to #3915 as part of the "remove booleans" opportunity. Other work on removing the rest of the booleans is in progress but more invasive.

This PR also:

* removes virtual Print from `Inst` so that Debug build `Inst`s don't have a vtable pointer to disrupt the bytecode layout.
* includes other changes to -RegexBytecodeDebug output to help view important aspects of bytecode layout.

I don't yet have data about runtime and memory improvement with this change, but since we are removing some branches through the use of a boolean template instead of a boolean member variable and decreasing memory by removing the boolean and all of the associated padding (4 or 8 bytes, depending) any time these bytecodes are used, there should be moderate gains in both runtime and memory.

BOITest / BOIHardFailTest / EOITest / EOIHardFailTest / WordBoundaryTest / NegatedWordBoundaryTest go from 0x8 to 0x4 bytes. (Opcode is 32 bit and the bool gets padded out 4 bytes.)

Before/after snippet for BOITest, for `/(^token|foo)/` (running as `ch.exe -RegexDebug -RegexBytecodeDebug test.js`):

Before:
```
        L0010: (0x008 bytes) BOITest(hardFail: false)
            0x000001F4591AC0B0[+0x000](0x004) [Inst]: 0c000000
            0x000001F4591AC0B4[+0x004](0x001) [HardFailMixin]:00
        L0018: (0x00c bytes) MatchLiteral(literal: "token")
```

After:
```
        L0010: (0x004 bytes) BOITest(<hardFail>: false)
            0x0000020E2BECC0B0[+0x000](0x004)(size:0x04)(align:0x04) [Inst]: 0d000000
        L0014: (0x00c bytes) MatchLiteral(literal: "token")
```

Again for `/\b\B/`:

Before:
```
        L0000: (0x008 bytes) WordBoundaryTest()
            0x0000023973F34100[+0x000](0x004) [Inst]: 10000000
            0x0000023973F34100[+0x000](0x008) [WordBoundaryTestInst]: 10000000 00000000
        L0008: (0x008 bytes) WordBoundaryTest()
            0x0000023973F34108[+0x000](0x004) [Inst]: 10000000
            0x0000023973F34108[+0x000](0x008) [WordBoundaryTestInst]: 10000000 01000000
        L0010: (0x004 bytes) Succ()
```

After:
```
        L0000: (0x004 bytes) WordBoundaryTest()
            0x0000019F284AC190[+0x000](0x004)(size:0x04)(align:0x04) [Inst]: 13000000
        L0004: (0x004 bytes) NegatedWordBoundaryTest()
            0x0000019F284AC194[+0x000](0x004)(size:0x04)(align:0x04) [Inst]: 12000000
        L0008: (0x004 bytes) Succ()
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants