Investigate alternate stack- and register-allocation strategies in backend #218

MatthewFluet · 2017-11-06T16:39:40Z

#217 updated the stack-allocation strategy in the RSSA to Machine translation:

Do not force Cont block arguments to stack (7ec42a1)
Improve Allocation.Stack.get (31b6e29)

Unfortunately, the improved Allocation.Stack.get has degraded the hamlet benchmark (but not others):

MLton0 -- /home/mtf/devel/mlton/builds/20171030.143531-ga81514e/bin/mlton -codegen amd64
MLton1 -- /home/mtf/devel/mlton/builds/20171030.143531-ga81514e/bin/mlton -codegen c
MLton2 -- /home/mtf/devel/mlton/builds/20171030.143531-ga81514e/bin/mlton -codegen llvm
MLton3 -- /home/mtf/devel/mlton/builds/20171101.181350-g7ec42a1/bin/mlton -codegen amd64
MLton4 -- /home/mtf/devel/mlton/builds/20171101.181350-g7ec42a1/bin/mlton -codegen c
MLton5 -- /home/mtf/devel/mlton/builds/20171101.181350-g7ec42a1/bin/mlton -codegen llvm
MLton6 -- /home/mtf/devel/mlton/builds/20171101.190843-g31b6e29/bin/mlton -codegen amd64
MLton7 -- /home/mtf/devel/mlton/builds/20171101.190843-g31b6e29/bin/mlton -codegen c
MLton8 -- /home/mtf/devel/mlton/builds/20171101.190843-g31b6e29/bin/mlton -codegen llvm
MLton9 -- /home/mtf/devel/mlton/builds/20171105.113517-gfd0d8e4/bin/mlton -codegen amd64
MLton10 -- /home/mtf/devel/mlton/builds/20171105.113517-gfd0d8e4/bin/mlton -codegen c
MLton11 -- /home/mtf/devel/mlton/builds/20171105.113517-gfd0d8e4/bin/mlton -codegen llvm
run time ratio
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 MLton9 MLton10 MLton11
hamlet      1.00   2.53   2.27   0.99   2.39   2.29   1.18   2.61   2.44   1.16    2.70    2.46
size
benchmark    MLton0    MLton1    MLton2    MLton3    MLton4    MLton5    MLton6    MLton7    MLton8    MLton9   MLton10   MLton11
hamlet    1,458,564 1,501,276 1,529,948 1,454,516 1,497,868 1,526,700 1,445,700 1,490,156 1,519,524 1,449,492 1,492,732 1,523,316
compile time
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 MLton9 MLton10 MLton11
hamlet     12.82  23.79  40.39  11.96  23.34  41.01  12.11  23.10  40.84  12.66   23.58   41.44
run time
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 MLton9 MLton10 MLton11
hamlet     33.70  85.28  76.34  33.37  80.64  77.29  39.73  87.84  82.11  39.26   90.87   82.93

(20171105.113517-gfd0d8e4 is 31b6e29 with 7ec42a1 reverted.)

The improved Allocation.Stack.get was not expected to have a negative effect on runtime; it should simply find different stack slots for variables and possibly decrease stack frame sizes.

One conjecture is that the improved Allocation.Stack.get causes variables that previously shared a stack slot to now have different stack slots. This could impact intraprocedural Goto transfers, where, previously, a formal and an actual argument were assigned the same stack slot (and the move of the actual to the formal was a noop), but, now, the formal and the actual are assigned different stack slots (and the move of the actual to the formal is an executed move).

In general, the current stack- and register-allocation strategy (i.e., the mapping of RSSA variables to Machine globals, stack-slots, or registers) in the RSSA to Machine translation is greedy. [Remember, in the Machine IR, "registers" are really local variables that are not live across a non-tail or runtime call. The hope is that they will map to hardware registers, but in the Machine IR, there can be an arbitrary number of registers.] The strategy is essentially to begin with the stack- and register-allocation dictated by the live-in of a block and then greedily assign stack-slots and registers to variables defined within the block. Allocations are not updated when variables go dead within a block (potentially freeing up stack-slots and/or registers). Nor is any attempt made to optimize the placement of variables to reduce moves.

For Machine IR registers, this is mostly acceptable, since they codegens will perform their own assignment of these Machine IR registers (remember, "local variables") to hardware registers and spill locations, presumably using coalescing and other heuristics.

For Machine IR stack-slots, though, this can be problematic, because the codegens will generally perform reads and writes to stack-slots as instructed; in particular, a stack-slot to stack-slot move can never be elided by the C and LLVM codegens (since they appear as reads and writes of memory) and aren't easily elided by the native x86/amd64 codegens (the -native-live-stack {false|true} option was meant to track the liveness of stack slots in the native codegens so as to allow a Machine IR stack slot to live entirely in hardware registers without requiring writing the value back to memory, but the management of liveness information in the native codeges is naive (and performs poorly) and the number of stack-allocated variables can be large, so it never performed well-enough on self-compiles to become the default).

In any case, one could investigate better allocation strategies for the RSSA to Machine translation. In particular, actually build an interference graph and recognize variable-to-variable moves and make assignments to avoid extraneous moves.

The text was updated successfully, but these errors were encountered:

Add new overflow-checking primitives Motivation MLton currently implements checked arithmetic with specialized primitives. These primitives are exposed to the Basis Library implementation as functions that implicitly raise a primitive exception: val +! = _prim "WordS8_addCheck": int * int -> int; In the XML IR, special care is taken to "remember" the primive exception associated with these primitives in order to implement exceptions correctly. In the SSA, SSA2, RSSA, and Machine IRs, these primitives are treated as transfers, rather than statements. This pull request implements a possibly better implementation of these operations as simple primitives that return a boolean: val +! = _prim "Word8_add": int * int -> int; val +? = _prim "WordS8_addCheckP": int * int -> bool; val +$ = fn (x, y) => let val r = +! (x, y) in if +? (x, y) then raise Overflow else r end This would eliminate the special cases in the XML IR and in the SSA, SSA2, RSSA, and Machine IRs, where the primitives would be treated as statements. Other compilers provide overflow checking via boolean-returning functions: * https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html * https://llvm.org/docs/LangRef.html#arithmetic-with-overflow-intrinsics Implementation This patch refactors the primitive checked-arithmetic operations such that the suffix `?` represents a new overflow-checking predicate, a suffix `!` represents the non-overflow-checking variant, and a suffix `$` represents the overflow-checking variant (mnemonic: `$` for "safe" or "expensive"). The behavior of the new `$`-operations is controlled by a compile-time constant, `MLton.newOverflow`. When set to false (the default), the `$`-operations make use of the old-style implicit-`Overflow` primitives. When set to true, the `$`-operations are implemented as an `if`-expression that branches on the result of the corresponding `?`-operation and either raises the `Overflow` exception or returns the result of the corresponding `!`-operation. Finally, the bare operation is aliased to either the `$`-form (with overflow detection enabled) or the `!`-form (with overflow detection disabled). Essentially: val +! = _prim "Word8_add": int * int -> int; val +? = _prim "WordS8_addCheckP": int * int -> bool; val +$ = if MLton.newOverflow then fn (x, y) => let val r = +! (x, y) in if +? (x, y) then raise Overflow else r end else fn (x, y) => ((_prim "WordS8_addCheckP": int * int -> int;) (x, y)) handle PrimOverflow => raise Overflow val + = if MLton.detectOverflow then +$ else +! Note that the checked-arithmetic using `!`- and `$`-operations is careful to perform the `!`-operation before the `$`-operation. With the native-codegens, a new peephole optimization combines the separate unchecked-arithmetic operation and checked-arithmetic-predicate operation into a single instruction. For the C-codgen, the new checked-arithmetic-predicate primitives are translated to uses of the `__builtin_{add,sub,mul}_overflow` intrinsics (which improves upon the previous explicit conditional checking, but requires gcc 5 or greater). Similarly, for the LLVM-codgen, the new checked-arithmetic-predicate primitives are translated to uses of the `{sadd,uadd,smul,umul,ssub,usub}.with.overflow` intrinsics. For both the C- and LLVM-codegens, it is expected that these intrinsics will be combined with the separate unchecked-arithmetic operation. In addition, the `RedundantTests` optimization has been extended to eliminate the overflow test when adding or subtracting 1 with the new primitives. Performance The native-codegen peephole optimization and `RedundantTests` have been mostly sufficient to keep performance on par with the older checked-arithmetic primitives, and in some cases performance has even significantly improved. Below is a summary of the exceptional runtime ratios in the different codegens (both positive and negative): | Benchmark | Native | LLVM | C | |-----------------|--------|------|------| | even-odd | 1.00 | 1.00 | 1.09 | | hamlet | 0.98 | 0.90 | 0.93 | | imp-for | 0.99 | 1.50 | 0.46 | | lexgen | 0.92 | 1.31 | 1.24 | | matrix-multiply | 0.99 | 1.00 | 0.87 | | md5 | 1.06 | 1.01 | 0.97 | | tensor | 1.01 | 1.00 | 0.57 | No benchmarks were consistently worse or better on all codegen, e.g., the `imp-for` benchmark performed exceptionally badly on the LLVM codegen, but was much faster on the C codegen and about even on the native codegen. For this particular benchmark, the cause of the slowdown with the LLVM codegen has yet to be discovered. Similarly, the cause of the slowdown in `lexgen` with the C- and LLVM-codegens is unknown. For the `md5` benchmark, on the other hand, the cause of the slowdown with the native codegen seems to be a failure to eliminate common subexpressions in certain circumstances, which can potentially be improved in the future. Improvements in the C-codegen are likely to be due to the better overflow checking with builtins. Future work The `CommonSubexp` optimization currently handles `Arith` transfers specially; in particular, the result of an `Arith` transfer can be used in common `Arith` transfers that it dominates. This was done so that code like: (n + m) + (n + m) can be transformed to let val r = n + m in r + r end With the new checked-arithmetic-predicate primitives, the computation of the boolean value may be common-subexpr eliminated, but `Case` discrimination will not. This forces the boolean value to be reified and to be discriminated multiple times (though, perhaps `KnownCase` could eliminate subsequent discriminations). Extending `CommonSubexpr` to propagate flow-sensitive relationships at `Case` transfers to the dominated targets could improve the performance `md5` with `MLton.newOverflow true` and potentially improve performance elsewhere as well (e.g., by propagating the results of comparisons as well). Once all performance slowdowns with `MLton.newOverflow true` have been eliminated, it would be desirable to remove the old-style implicit-`Overflow` primitives and `Arith` transfers. This would eliminate many previous instances of special-case code to handle these primitives and transfers. Finally, it may be worth investigating an implementation of the checked-operations via val +$ = fn (x, y) => let val b = +? (x, y) val r = +! (x, y) in if b then raise Overflow else r end rather than val +$ = fn (x, y) => let val r = +! (x, y) in if +? (x, y) then raise Overflow else r end The advantage of calculating the boolean first is that when `x` (or `y`) is a loop variable and `r` will be passed as the value for the next iteration, then both `x` and `r` could be assigned the same location (pseudo-register or stack slot). `x` and `r` cannot share the same location when the boolean is calculated second, because `x` and `r` are both live at the calculation of the boolean. See #218. This would require reworking the native-codegen peephole optimization.

Alternate strategies for globalization in ConstantPropagation Extend globalization aspect of ConstantPropagation to support globalization of arrays and to support different "small type" strategies. Closes #206. Space-safety prohibits ConstantPropagation from globalizing all arrays and refs that are allocated at most once by a program. In particular, because globals are live for the duration of the program, globalizing an `int list ref` (for example) would not be safe-for-space: an arbitrarily large list may be written to the reference and never be garbage collected (whereas, when the `int list ref` is not globalized, it will be garbage collected when it is no longer live). On the other hand, globalizing an `int ref` is safe-for-space. However, MLton previously used only a very conservative estimation for space safety. Only "small" types may be globalized, where smallness is defined as: fun isSmall t = case dest t of Array _ => false | Datatype _ => false | Ref t => isSmall t | Tuple ts => Vector.forall (ts, isSmall) | Vector _ => false | _ => true Note that no `Datatype` is small; this is conservative (since a recursive datatype could represent unbounded data), but prevents globalizing `bool ref`. Also, no `Array` is small; this is correct (because an `int array ref` should not be globalized), but the globalization of a `val a: t array = Array_alloc[t] (l)` was conditioned on the smallness of `t array`, not the smallness of `t`. It is correct to globalize an array if `t` were small; note that to globalize `val a: t array = Array_alloc[t] (l)`, `l` (the length) must be globalized and must, therefore, be a constant and the array is of constant size. (This is Stephen Weeks's relaxed notion of safe-for-space, where the constant factor blowup can be chosen per program.) This pull request adds support for alternate globalization strategies: * `-globalize-arrays {false|true}`: globalize arrays * `-globalize-refs {true|false}`: globalize refs * `-globalize-small-int-inf {true|false}`:globalize `IntInf` as a small type * `-globalize-small-type {1|0|2|3|4|9}`: strategies for classifying a type as "small": * `0`: constant `false` function (no types considered small) * `1`: no `Datatype` is considered small (original strategy) * `2`: `Datatype`s with all nullary constructors are considered small * `3`: `Datatype`s with all constructor arguments considered small according to strategy `2` are considered small * `4`: Fixed-point analysis of `Datatype`s to determine smallness * `9`: constant `true` function (all types considered small; not safe-for-space) The defaults correspond to the previous behavior. Unfortunately, additional globalization has little to no (positive) effect on benchmarks: MLton0 -- ~/devel/mlton/builds/20190106.115052-gfe996d4/bin/mlton MLton1 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 1 MLton2 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 2 MLton3 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 3 MLton4 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 4 MLton5 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 1 MLton6 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 2 MLton7 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 3 MLton8 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 4 run time ratio benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 DLXSimulator 1.00 1.00 1.03 1.00 1.00 0.99 1.00 1.01 1.01 checksum 1.00 1.00 1.13 1.14 1.13 1.00 1.12 1.13 1.14 flat-array 1.00 1.00 1.00 1.00 1.01 1.19 1.19 1.19 1.19 hamlet 1.00 1.00 1.02 1.01 1.01 1.00 0.96 0.97 0.97 imp-for 1.00 1.00 1.06 1.05 1.06 1.00 1.05 1.06 1.06 knuth-bendix 1.00 1.00 1.03 1.03 1.03 1.00 1.03 1.03 1.03 lexgen 1.00 0.99 1.05 0.99 1.03 1.04 1.03 1.02 1.07 model-elimination 1.00 1.01 1.02 1.02 1.02 1.00 1.03 1.02 1.03 peek 1.00 1.00 1.03 1.03 1.04 1.00 1.04 1.04 1.04 ray 1.00 1.03 0.99 0.99 1.01 0.99 0.99 0.99 0.98 raytrace 1.00 1.03 1.00 1.02 1.03 0.99 1.00 1.01 1.01 simple 1.00 0.98 0.99 1.00 0.99 0.97 0.97 0.96 0.97 tak 1.00 1.00 1.04 1.03 1.10 1.01 1.02 1.08 1.00 wc-scanStream 1.00 1.00 1.06 1.06 1.06 1.00 1.04 1.06 1.07 Note that `MLton0` and `MLton1` generate identical code (modulo the random magic number), so the slowdowns in `ray` and `raytrace` are noise, which also suggests that slowdowns/speedups of <= 3% are also likely noise. The slowdown in `flat-array` with `-globalize-array true` is explained as follows. The `flat-array` benchmark uses `Vector.tabulate` to allocate a vector that is used for all iterations of the benchmark. With `-globalize-array false`, the array is not globalized, and in SSA/SSA2, we have: x_1212: ((word32, word32) tuple) array = prim Array_alloc((word32, word32) tuple) (global_138 (*0xF4240*)) ... x_757: ((word32, word32) tuple) vector = prim Array_toVector((word32, word32) tuple) (x_1212) ... x_1287: (word32, word32) tuple = prim Vector_sub((word32, word32) tuple) (x_757, x_1283) but with `-globalize-array true`, the array is globalized, and in SSA/SS2, we have: global_490: ((word32, word32) tuple) array = prim Array_alloc((word32, word32) tuple) (global_138 (*0xF4240*)) ... x_757: ((word32, word32) tuple) vector = prim Array_toVector((word32, word32) tuple) (global_490) ... x_1286: (word32, word32) tuple = prim Vector_sub((word32, word32) tuple) (x_757, x_1282) At RSSA, the `Array_toVector` becomes a header update and the array variable is cast/copy-propagated for the vector variable; with `-globalize-arrays false`, we have L_531 (x_1212: Objptr (opt_11)) CReturn {func = {..., target = GC_sequenceAllocate}} = ... OW64 (x_1212, ~8): Word64 := opt_12 ... x_1354: Word32 = XW32 (Cast (x_1212, Objptr (opt_12)), x_1283, 8, 0) x_1353: Word32 = XW32 (Cast (x_1212, Objptr (opt_12)), x_1283, 8, 4) but with `-globalize-arrays true`, we have L_488 (global_490: Objptr (opt_7)) CReturn {func = {..., target = GC_sequenceAllocate}} = ... OW64 (global_490, ~8): Word64 := opt_12 ... x_1353: Word32 = XW32 (Cast (global_490, Objptr (opt_12)), x_1282, 8, 0) x_1352: Word32 = XW32 (Cast (global_490, Objptr (opt_12)), x_1282, 8, 4) Finally, with `-globalize-arrays false`, `x_1212` becomes a local (because the loops to initialize and use the vector are non-allocating): RW32(2): Word32 = XW32 (Cast (RP(0): Objptr (opt_11), Objptr (opt_12)), RW64(0): Word64, 8, 0): Word32 RW32(3): Word32 = XW32 (Cast (RP(0): Objptr (opt_11), Objptr (opt_12)), RW64(0): Word64, 8, 4): Word32 but with `-globalize-arrays true`: RW32(2): Word32 = XW32 (Cast (glob {index = 1, isRoot = true, ty = Objptr (opt_7)}, Objptr (opt_12)), RW64(0): Word64, 8, 0): Word32 RW32(3): Word32 = XW32 (Cast (glob {index = 1, isRoot = true, ty = Objptr (opt_7)}, Objptr (opt_12)), RW64(0): Word64, 8, 4): Word32 The innermost loop of the benchmark goes from indexing a sequence stored in a local (`RP(0)`) to indexing a sequence stored in a global (`GP(1)`). All of the codegens should implement the former by using a hardware register for `RP(0)`, but will implement the latter with a memory read. In light of the above, and related to #218, it may be beneficial to "deglobalize" object pointer globals; that is, in RSSA functions that have multiple accesses through the same object pointer global (particularly within loops) could be translated to copy the global to a local. The slowdown in `checksum` is less easily explained. The only new objects globalized with `-globalize-small-type 2` as compared to `-globalize-small-type 1` are two `bool ref` objects, corresponding to the `exiting` flag of `basis-library/mlton/exit.sml` and the `staticIsInUse` flag of `basis-library/util/one.sml` used by `Int.fmt`. That small change seems to lead to code layout and cache effects that result in the slowdown, because the assembly code is not substantial different. With `-enable-pass machineSuffle` and `-seed-rand <w>`, one can perturb the code layout and observe that the slowdowns are not universal: MLton0 -- ~/devel/mlton/builds/20190106.115052-gfe996d4/bin/mlton MLton1 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 1 MLton2 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 1 -enable-pass machineShuffle -seed-rand 42424242 MLton3 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 1 -enable-pass machineShuffle -seed-rand deadbeef MLton4 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 2 MLton5 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 2 -enable-pass machineShuffle -seed-rand 42424242 MLton6 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 2 -enable-pass machineShuffle -seed-rand deadbeef MLton7 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 3 MLton8 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 3 -enable-pass machineShuffle -seed-rand 42424242 MLton9 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 3 -enable-pass machineShuffle -seed-rand deadbeef MLton10 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 4 MLton11 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 4 -enable-pass machineShuffle -seed-rand 42424242 MLton12 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 4 -enable-pass machineShuffle -seed-rand deadbeef MLton13 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 1 MLton14 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 1 -enable-pass machineShuffle -seed-rand 42424242 MLton15 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 1 -enable-pass machineShuffle -seed-rand deadbeef MLton16 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 2 MLton17 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 2 -enable-pass machineShuffle -seed-rand 42424242 MLton18 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 2 -enable-pass machineShuffle -seed-rand deadbeef MLton19 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 3 MLton20 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 3 -enable-pass machineShuffle -seed-rand 42424242 MLton21 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 3 -enable-pass machineShuffle -seed-rand deadbeef MLton22 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 4 MLton23 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 4 -enable-pass machineShuffle -seed-rand 42424242 MLton24 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 4 -enable-pass machineShuffle -seed-rand deadbeef run time ratio benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 MLton9 MLton10 MLton11 MLton12 MLton13 MLton14 MLton15 MLton16 MLton17 MLton18 MLton19 MLton20 MLton21 MLton22 MLton23 MLton24 checksum 1.00 1.00 1.00 1.01 1.14 1.01 1.00 1.14 1.01 1.01 1.14 1.00 1.04 1.00 1.00 1.00 1.14 1.00 1.01 1.16 1.00 1.01 1.15 1.00 1.01 flat-array 1.00 1.01 1.02 1.00 1.00 1.01 1.01 1.03 1.01 1.01 1.01 1.01 1.01 1.23 1.19 1.19 1.19 1.20 1.19 1.20 1.20 1.19 1.20 1.19 1.20 hamlet 1.00 0.99 1.00 0.99 1.01 1.01 1.00 1.01 1.03 1.01 1.01 1.02 1.01 1.00 1.00 0.99 0.95 0.95 0.94 0.97 0.96 0.96 0.98 0.98 0.99 imp-for 1.00 1.00 1.05 1.05 1.05 1.02 1.00 1.05 0.99 1.00 1.05 1.00 1.01 1.00 1.05 1.05 1.05 0.99 1.00 1.05 1.00 1.00 1.05 1.00 0.99 lexgen 1.00 0.97 1.00 0.97 1.03 1.03 1.01 1.04 0.99 0.95 0.99 0.99 0.95 0.97 0.98 0.95 0.96 1.01 0.98 1.00 1.04 0.95 0.96 1.00 0.95 peek 1.00 1.00 1.00 1.01 1.03 1.01 1.04 1.03 1.01 1.04 1.03 1.01 1.03 1.00 1.00 1.01 1.04 1.01 1.05 1.03 1.01 1.04 1.04 1.00 1.04 simple 1.00 1.01 1.01 1.00 1.00 0.99 1.02 1.00 0.99 1.00 1.00 1.00 1.00 0.98 0.97 0.99 0.98 0.97 0.99 0.97 0.99 0.98 0.98 0.98 0.99 tak 1.00 0.99 0.90 1.00 1.05 0.90 0.99 1.02 0.89 0.99 1.04 0.90 1.00 0.99 0.89 0.99 0.99 0.90 0.99 1.01 0.90 1.00 1.00 0.90 1.00 wc-scanStream 1.00 1.01 1.01 1.03 1.06 1.02 1.03 1.05 1.02 1.02 1.04 1.01 1.00 1.01 1.01 1.00 1.06 1.01 1.00 1.07 1.01 1.00 1.05 1.03 1.02 Note that while `checksum` with MLton4 has a slowdown, `checksum` with MLton5 and MLton6 (which are identical up to shuffling of the functions and basic blocks at the MachineIR) do not have a slowdown. Similarly `tak` with MLton0 and MLton1 have similar running time, but `tak` with MLton3 has a speedup. On the other hand, `flat-array`'s slowdowns with `-globalize-arrays true` are not due to code layout effects. `hamlet` may have a slight speedup with `-globalize-arrays true`, but that is significantly outweighted by the slowdown in `flat-array`. The conclusion is to leave the defaults corresponding to the original behavior. Full benchmark results: MLton0 -- ~/devel/mlton/builds/20190106.115052-gfe996d4/bin/mlton MLton1 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 1 MLton2 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 2 MLton3 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 3 MLton4 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 4 MLton5 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 1 MLton6 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 2 MLton7 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 3 MLton8 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 4 run time ratio benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 DLXSimulator 1.00 1.00 1.03 1.00 1.00 0.99 1.00 1.01 1.01 barnes-hut 1.00 1.01 1.01 1.02 1.01 1.01 1.01 1.02 1.01 boyer 1.00 1.00 1.01 1.01 1.02 1.00 1.01 1.01 1.01 checksum 1.00 1.00 1.13 1.14 1.13 1.00 1.12 1.13 1.14 count-graphs 1.00 1.00 0.99 1.01 0.99 1.00 1.00 1.00 0.99 even-odd 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 fft 1.00 1.00 1.01 1.02 1.02 1.01 1.01 1.00 1.01 fib 1.00 1.00 1.00 1.00 1.01 1.00 1.00 1.00 1.00 flat-array 1.00 1.00 1.00 1.00 1.01 1.19 1.19 1.19 1.19 hamlet 1.00 1.00 1.02 1.01 1.01 1.00 0.96 0.97 0.97 imp-for 1.00 1.00 1.06 1.05 1.06 1.00 1.05 1.06 1.06 knuth-bendix 1.00 1.00 1.03 1.03 1.03 1.00 1.03 1.03 1.03 lexgen 1.00 0.99 1.05 0.99 1.03 1.04 1.03 1.02 1.07 life 1.00 1.00 1.00 1.00 1.00 1.01 1.01 1.00 1.00 logic 1.00 1.00 0.98 0.99 0.99 1.00 0.98 0.99 1.00 mandelbrot 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 matrix-multiply 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.01 1.00 md5 1.00 1.00 0.99 0.99 0.99 1.00 0.99 0.99 0.99 merge 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 mlyacc 1.00 1.01 1.01 1.01 1.01 1.00 0.99 0.99 1.01 model-elimination 1.00 1.01 1.02 1.02 1.02 1.00 1.03 1.02 1.03 mpuz 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 nucleic 1.00 1.01 1.00 1.00 0.99 1.00 0.99 0.99 1.00 output1 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 peek 1.00 1.00 1.03 1.03 1.04 1.00 1.04 1.04 1.04 psdes-random 1.00 1.00 1.00 1.00 1.00 1.00 1.01 1.00 1.00 ratio-regions 1.00 0.99 0.98 0.98 1.01 1.01 0.99 1.00 1.01 ray 1.00 1.03 0.99 0.99 1.01 0.99 0.99 0.99 0.98 raytrace 1.00 1.03 1.00 1.02 1.03 0.99 1.00 1.01 1.01 simple 1.00 0.98 0.99 1.00 0.99 0.97 0.97 0.96 0.97 smith-normal-form 1.00 1.00 1.01 0.99 1.00 1.00 1.00 1.00 1.00 string-concat 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 tailfib 1.00 1.00 1.00 0.99 1.00 1.00 1.00 1.00 1.00 tak 1.00 1.00 1.04 1.03 1.10 1.01 1.02 1.08 1.00 tensor 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 tsp 1.00 1.00 1.00 1.00 1.00 0.99 1.00 1.00 1.00 tyan 1.00 1.00 1.00 1.01 1.01 1.01 1.01 1.01 1.01 vector-rev 1.00 0.99 0.99 0.98 0.98 0.98 0.99 0.98 0.98 vector32-concat 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 vector64-concat 1.00 1.00 1.00 1.00 1.00 0.99 1.00 1.00 0.99 vliw 1.00 0.98 1.00 1.01 1.00 1.00 0.98 0.98 0.98 wc-input1 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 wc-scanStream 1.00 1.00 1.06 1.06 1.06 1.00 1.04 1.06 1.07 zebra 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 zern 1.00 1.02 1.02 0.99 1.01 0.99 0.99 1.01 0.99 size benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 DLXSimulator 209,076 209,076 209,140 209,140 209,140 209,076 208,836 208,340 208,340 barnes-hut 176,199 176,199 176,071 176,071 176,071 176,199 176,071 176,071 176,071 boyer 243,369 243,369 243,289 243,289 243,289 243,369 243,289 243,289 243,289 checksum 117,561 117,561 117,433 117,433 117,433 117,561 117,433 117,433 117,433 count-graphs 145,065 145,065 145,017 145,017 145,017 145,065 144,937 144,937 144,937 even-odd 117,529 117,529 117,433 117,433 117,433 117,529 117,433 117,433 117,433 fft 142,307 142,307 141,315 141,315 141,315 142,307 141,315 141,315 141,315 fib 117,449 117,449 117,321 117,321 117,321 117,449 117,321 117,321 117,321 flat-array 117,177 117,177 117,049 117,049 117,049 117,193 117,081 117,081 117,081 hamlet 1,434,228 1,434,228 1,433,220 1,433,220 1,433,220 1,434,228 1,432,564 1,427,956 1,427,396 imp-for 117,241 117,241 117,145 117,145 117,145 117,241 117,145 117,145 117,145 knuth-bendix 186,116 186,116 186,212 186,212 186,212 186,116 186,212 186,212 186,212 lexgen 290,931 290,931 290,819 290,819 290,819 290,931 290,819 290,819 290,819 life 141,113 141,113 141,065 141,065 141,065 141,113 141,065 141,065 141,065 logic 197,417 197,417 197,273 197,273 197,273 197,417 197,273 197,273 197,273 mandelbrot 117,273 117,273 117,177 117,177 117,177 117,273 117,177 117,177 117,177 matrix-multiply 119,577 119,577 119,417 119,417 119,417 119,577 119,417 119,417 119,417 md5 144,676 144,676 144,500 144,500 144,500 144,676 144,500 144,500 144,500 merge 118,953 118,953 118,857 118,857 118,857 118,953 118,857 118,857 118,857 mlyacc 643,555 643,555 643,651 643,651 643,651 643,555 643,651 643,651 643,475 model-elimination 796,054 796,054 793,958 793,798 793,798 796,054 794,166 792,246 792,246 mpuz 123,545 123,545 123,481 123,481 123,481 123,545 123,481 123,481 123,481 nucleic 297,249 297,249 297,233 297,233 297,233 297,249 297,233 297,233 297,233 output1 151,768 151,768 149,848 149,848 149,848 151,768 149,848 149,848 149,848 peek 150,164 150,164 150,132 150,132 150,132 150,164 150,132 150,132 150,132 psdes-random 121,545 121,545 121,401 121,401 121,401 121,545 121,401 121,401 121,401 ratio-regions 144,137 144,137 144,169 144,169 144,169 144,137 144,169 144,169 144,169 ray 250,058 250,058 250,218 250,218 250,218 250,058 249,818 249,066 249,066 raytrace 368,988 368,988 368,108 368,108 368,108 368,956 367,868 367,468 367,468 simple 345,205 345,205 345,381 345,381 345,381 329,557 329,557 329,317 329,317 smith-normal-form 279,837 279,837 279,645 279,645 279,645 279,837 279,341 279,341 279,341 string-concat 119,129 119,129 119,033 119,033 119,033 119,209 119,033 119,033 119,033 tailfib 117,273 117,273 117,177 117,177 117,177 117,273 117,177 117,177 117,177 tak 117,449 117,449 117,321 117,321 117,321 117,449 117,321 117,321 117,321 tensor 179,292 179,292 176,908 176,908 176,908 179,292 176,908 176,908 176,908 tsp 158,860 158,860 158,668 158,668 158,668 158,860 158,668 158,668 158,668 tyan 223,588 223,588 223,044 223,044 223,044 223,588 223,044 223,044 223,044 vector-rev 118,105 118,105 118,009 118,009 118,009 118,153 117,977 117,977 117,977 vector32-concat 118,297 118,297 118,201 118,201 118,201 118,329 118,185 118,185 118,185 vector64-concat 118,329 118,329 118,169 118,169 118,169 118,329 118,217 118,217 118,217 vliw 505,509 505,509 503,013 503,013 503,013 505,637 500,917 497,957 497,957 wc-input1 179,051 179,051 178,923 178,923 178,923 179,051 178,923 178,923 178,923 wc-scanStream 188,155 188,155 188,027 188,027 188,027 188,155 188,027 188,027 188,027 zebra 225,364 225,364 225,220 225,220 225,220 225,364 225,220 225,220 225,220 zern 153,241 153,241 152,521 152,521 152,521 153,241 152,585 152,585 152,585 compile time benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 DLXSimulator 3.19 3.14 3.45 3.51 3.02 3.27 3.18 3.45 3.24 barnes-hut 2.93 2.94 2.96 2.97 3.06 2.92 2.96 2.93 2.96 boyer 3.36 3.49 3.48 3.32 3.42 3.40 3.52 3.53 3.49 checksum 2.53 2.56 2.47 2.56 2.57 2.46 2.56 2.54 2.62 count-graphs 2.68 2.69 2.71 2.80 2.80 2.79 2.67 2.69 2.76 even-odd 2.45 2.56 2.46 2.56 2.48 2.50 2.48 2.55 2.57 fft 2.64 2.57 2.62 2.66 2.35 2.68 2.60 2.54 2.62 fib 2.46 2.46 2.58 2.48 2.55 2.45 2.55 2.54 2.46 flat-array 2.52 2.56 2.55 2.56 2.55 2.54 2.53 2.56 2.55 hamlet 15.27 15.74 14.48 14.35 14.48 15.63 15.21 14.88 15.04 imp-for 2.48 2.56 2.52 2.55 2.55 2.34 2.42 2.48 2.54 knuth-bendix 2.89 2.90 2.90 3.01 3.02 2.93 3.01 3.13 2.99 lexgen 3.40 3.88 3.76 3.82 3.74 3.62 3.50 3.81 3.71 life 2.66 2.71 2.70 2.74 2.72 2.64 2.74 2.68 2.60 logic 3.06 3.05 3.11 3.12 3.00 3.13 3.14 2.92 2.83 mandelbrot 2.51 2.54 2.53 2.58 2.57 2.53 2.56 2.55 2.45 matrix-multiply 2.48 2.50 2.52 2.49 2.57 2.60 2.60 2.57 2.53 md5 2.65 2.68 2.78 2.58 2.78 2.80 2.58 2.80 2.69 merge 2.47 2.49 2.58 2.57 2.55 2.49 2.57 2.50 2.52 mlyacc 7.85 7.94 8.00 7.98 7.90 7.69 8.05 8.09 7.54 model-elimination 7.08 7.98 7.62 7.11 7.40 8.24 7.84 7.96 8.15 mpuz 2.34 2.61 2.60 2.56 2.53 2.62 2.42 2.53 2.52 nucleic 4.06 4.17 4.07 4.05 4.08 4.06 4.28 4.33 4.14 output1 2.69 2.68 2.52 2.58 2.77 2.79 2.78 2.57 2.77 peek 2.74 2.78 2.58 2.80 2.80 2.73 2.70 2.68 2.72 psdes-random 2.56 2.53 2.49 2.48 2.50 2.48 2.53 2.64 2.50 ratio-regions 2.80 2.82 2.62 2.88 2.86 2.89 2.81 2.80 2.79 ray 3.34 3.62 3.69 3.47 3.45 3.60 3.48 3.37 3.40 raytrace 4.55 4.88 4.47 4.32 4.64 4.30 4.48 4.43 4.50 simple 4.01 4.07 4.04 4.00 3.91 3.83 3.77 3.81 3.74 smith-normal-form 3.75 3.60 3.82 3.58 3.40 3.79 3.60 3.58 3.58 string-concat 2.46 2.66 2.54 2.45 2.57 2.56 2.46 2.51 2.49 tailfib 2.44 2.54 2.53 2.57 2.50 2.57 2.57 2.36 2.54 tak 2.45 2.57 2.56 2.63 2.47 2.44 2.43 2.44 2.52 tensor 3.05 3.16 3.07 3.15 3.10 3.32 3.18 3.16 3.13 tsp 2.81 2.79 2.57 2.74 2.76 2.75 2.84 2.84 2.82 tyan 3.27 3.06 3.38 3.26 3.35 3.23 3.22 3.22 3.24 vector-rev 2.49 2.57 2.56 2.33 2.61 2.40 2.52 2.49 2.54 vector32-concat 2.53 2.49 2.52 2.47 2.55 2.63 2.49 2.49 2.51 vector64-concat 2.48 2.46 2.50 2.52 2.48 2.54 2.46 2.47 2.48 vliw 5.63 5.63 5.64 6.14 5.43 6.26 5.78 6.17 6.16 wc-input1 3.06 2.96 2.95 2.98 2.91 2.87 2.90 2.97 2.96 wc-scanStream 3.01 2.97 3.02 2.95 3.06 3.04 3.03 3.01 3.09 zebra 3.30 3.37 3.36 3.26 3.34 3.51 3.31 3.26 3.34 zern 2.69 2.74 2.62 2.64 2.72 2.74 2.73 2.43 2.70 run time benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 DLXSimulator 32.67 32.69 33.65 32.71 32.53 32.50 32.63 32.84 32.92 barnes-hut 28.33 28.63 28.64 28.85 28.59 28.50 28.71 28.78 28.61 boyer 55.98 56.01 56.81 56.75 56.82 56.08 56.29 56.72 56.76 checksum 25.35 25.39 28.74 28.79 28.64 25.39 28.42 28.68 28.87 count-graphs 39.96 39.98 39.51 40.23 39.38 40.13 40.16 39.92 39.75 even-odd 39.09 39.07 39.09 39.06 39.08 39.14 39.10 39.09 39.09 fft 30.56 30.69 30.75 31.05 31.20 30.98 30.86 30.62 30.77 fib 17.70 17.75 17.77 17.77 17.83 17.77 17.74 17.67 17.77 flat-array 23.60 23.54 23.65 23.70 23.82 28.18 28.09 28.02 28.01 hamlet 39.69 39.77 40.32 40.20 40.20 39.72 38.00 38.70 38.68 imp-for 24.40 24.44 25.78 25.74 25.79 24.51 25.73 25.98 25.84 knuth-bendix 34.10 34.09 35.05 34.97 35.03 34.16 35.02 35.11 35.07 lexgen 34.22 33.77 35.88 33.88 35.27 35.64 35.09 34.93 36.52 life 38.70 38.80 38.76 38.76 38.89 38.90 38.91 38.74 38.84 logic 35.01 34.93 34.44 34.82 34.82 35.06 34.47 34.77 34.93 mandelbrot 35.78 35.77 35.79 35.80 35.81 35.82 35.79 35.78 35.81 matrix-multiply 29.74 29.80 29.82 29.89 29.79 29.72 29.84 29.90 29.86 md5 28.38 28.39 28.04 28.05 28.01 28.42 28.12 28.14 27.98 merge 32.55 32.45 32.41 32.44 32.59 32.48 32.41 32.56 32.40 mlyacc 32.76 33.01 32.94 33.14 33.23 32.83 32.52 32.49 33.11 model-elimination 38.05 38.26 38.89 38.72 38.80 38.09 39.11 38.97 39.36 mpuz 29.94 29.86 29.88 29.91 29.89 29.91 29.89 29.86 29.90 nucleic 33.73 33.91 33.60 33.68 33.35 33.74 33.40 33.47 33.58 output1 30.01 30.01 29.99 30.01 30.02 30.06 29.99 30.10 29.91 peek 33.58 33.60 34.75 34.63 34.78 33.61 34.79 34.78 34.86 psdes-random 33.84 33.91 33.83 33.86 33.91 33.88 34.16 33.91 33.91 ratio-regions 49.08 48.63 48.33 48.33 49.40 49.45 48.70 48.95 49.45 ray 37.55 38.58 37.18 37.06 37.80 37.04 37.19 37.08 36.65 raytrace 34.20 35.34 34.35 34.94 35.23 34.00 34.08 34.61 34.52 simple 29.73 29.25 29.47 29.64 29.51 28.96 28.84 28.57 28.86 smith-normal-form 39.88 39.79 40.38 39.56 39.99 39.92 39.93 39.85 39.88 string-concat 91.43 91.56 91.68 91.40 91.65 91.58 91.75 91.50 91.36 tailfib 38.06 38.06 37.93 37.85 37.90 38.12 38.00 37.98 37.87 tak 30.73 30.77 31.96 31.60 33.85 30.94 31.41 33.22 30.87 tensor 39.62 39.59 39.67 39.70 39.70 39.54 39.71 39.76 39.67 tsp 37.88 37.80 37.84 37.79 37.75 37.62 37.86 37.99 37.82 tyan 30.48 30.45 30.55 30.86 30.79 30.78 30.74 30.85 30.79 vector-rev 27.00 26.60 26.61 26.44 26.55 26.37 26.68 26.53 26.39 vector32-concat 82.46 82.52 82.56 82.72 82.43 82.56 82.37 82.30 82.22 vector64-concat 91.66 91.86 91.57 91.77 91.63 91.13 91.54 91.42 91.13 vliw 28.91 28.43 28.94 29.05 28.90 28.82 28.29 28.27 28.44 wc-input1 43.95 43.81 43.87 43.95 43.86 43.93 43.95 43.89 43.79 wc-scanStream 21.63 21.64 22.96 23.03 23.02 21.65 22.51 22.89 23.17 zebra 30.39 30.46 30.32 30.29 30.28 30.37 30.37 30.36 30.39 zern 32.37 33.09 33.05 32.00 32.71 31.99 32.04 32.83 31.96

Bounce variables around forced stack allocations in RSSA loops See #218. In the translation from RSSA to Machine, each RSSA variable is assigned a location: either a stack slot or a register (i.e., local). An RSSA variable is assigned to a stack slot if it is live across a non-tail call or a `mayGC` ccall. Assigning an RSSA variable to a stack slot has a potential cost, because stack-slot accesses appear as memory reads and writes to the codegens and are not easily realized by hardware registers. This cost is multiplied when the stack-slot accesses are within a loop. For example, consider the tail recursive `revapp`: ``` standard-ml fun revapp (l, acc) = case l of [] => acc | h::t => revapp (t, h::acc) ``` This will be realized as an intraprocedural loop; however, due to the `h::acc` allocation in the loop, there will be a potential GC that forces `l` and `acc` to be assigned to stack slots (assuming the GC check occurs at the loop header). It would seem to be more efficient to assign `l` and `acc` to registers, moving them to and from stack slots at the GC, because a GC should occur on only a small fraction of the loop iterations. A new `BounceVars` RSSA pass attempts to split the live ranges of RSSA variables that are used within loops so that the within-loop instances of the variables are assigned registers (possibly being moved to and from stack slots around a GC). Unfortunately, it has not been easy to find good heuristics. The current defaults are biased towards not degrading performance. ## Benchmark results (cadmium) Specs: * 4 x AMD Opteron(tm) Processor 6172 (2.1GHz; 48 physical cores) * Ubuntu 16.04.6 LTS * gcc: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11) * gcc-8: gcc version 8.3.0 (Homebrew GCC 8.3.0) * llvm: LLVM version 8.0.0 ``` text config command C00 /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen amd64 C01 /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c C02 /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c -cc gcc-8 C03 /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen llvm C04 /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen amd64 C05 /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c C06 /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c -cc gcc-8 C07 /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen llvm ``` ### Run-Time Ratio ``` text program `C04/C00` `C05/C01` `C06/C02` `C07/C03` barnes-hut 1.050 1.004 0.9939 0.9724 boyer 1.020 0.9547 0.9826 0.9940 checksum 1.007 1.034 0.9795 1.003 count-graphs 0.9735 1.008 0.9826 0.9722 DLXSimulator 0.9397 0.9566 0.9553 0.8925 even-odd 0.9185 1.019 1.000 0.9754 fft 1.061 1.041 1.039 1.021 fib 1.013 1.046 1.040 0.9884 flat-array 1.064 0.9138 0.9901 0.9247 hamlet 0.9442 0.9713 0.9664 0.9955 imp-for 1.040 0.9223 0.8787 1.005 knuth-bendix 0.9967 1.007 0.9924 0.9744 lexgen 1.005 1.025 0.9892 0.9574 life 0.9805 0.9813 0.9779 1.012 logic 1.062 0.9721 1.002 0.9641 mandelbrot 0.9995 1.003 1.005 1.000 matrix-multiply 0.9424 1.028 1.040 0.9958 md5 0.9946 0.9913 0.9929 1.053 merge 0.9942 0.9397 0.9797 0.9558 mlyacc 0.9956 0.9558 1.021 1.026 model-elimination 0.9816 0.9842 1.000 0.9769 mpuz 0.6748 0.8899 0.9354 1.084 nucleic 0.9955 0.9901 1.008 0.9746 output1 0.9880 0.8836 0.8840 0.9090 peek 1.010 1.075 0.9669 1.000 pidigits 0.9778 1.055 0.9936 1.018 psdes-random 0.9905 0.9943 1.054 1.000 ratio-regions 1.030 1.071 0.9943 0.9448 ray 1.016 0.9944 0.9993 1.030 raytrace 1.024 1.009 0.9994 1.017 simple 0.9900 1.017 1.019 1.032 smith-normal-form 1.015 1.017 1.080 1.061 string-concat 0.9619 1.004 0.9863 1.173 tailfib 0.9867 1.056 1.001 0.9989 tailmerge 1.005 1.002 0.9844 0.9697 tak 0.9886 1.025 0.9999 1.013 tensor 1.015 0.9728 1.006 0.9927 tsp 1.028 0.9878 0.9956 0.9745 tyan 1.029 1.011 1.012 0.9404 vector32-concat 0.7653 0.6395 0.8281 1.003 vector64-concat 0.8517 0.7782 0.8571 0.9619 vector-rev 0.9514 0.9847 0.8541 0.9161 vliw 1.028 0.9368 1.017 1.011 wc-input1 0.9170 0.9157 0.9084 0.9240 wc-scanStream 0.9929 1.007 1.146 1.038 zebra 0.9984 1.008 1.008 0.9941 zern 1.026 0.9929 1.005 0.9942 MIN 0.6748 0.6395 0.8281 0.8925 GMEAN 0.9811 0.9771 0.9846 0.9912 MAX 1.064 1.075 1.146 1.173 ``` Notes: * Overall, there is less run-time improvements than were hoped, although there are some "big" wins and no "big" losses. * The LLVM codegen seems to benefit the least; it may be that LLVM is able to realizes some stack-slot accesses as hardware registers (although it is unclear how it justifies doing so --- LLVM should not be able to "see" that the access of stack slots and heap objects do not alias). * The speedups with `vector32-concat` and `vector64-concat` may be related to the observation made in #288, where it was observed that moving a sequence from a register to a global had a significant slowdown (since the access through the global required an additional level of indirection). In this case, a sequence may be moving from a stack slot to a register, eliminating a level of indirection. ### Compile-Time Ratio ``` text program `C04/C00` `C05/C01` `C06/C02` `C07/C03` barnes-hut 0.9721 0.9760 0.9171 0.9393 boyer 0.9054 0.9675 0.9988 1.024 checksum 0.9636 0.9630 0.9753 1.016 count-graphs 0.9822 1.002 0.9862 0.9151 DLXSimulator 0.9205 0.9248 0.9340 0.9704 even-odd 0.9808 0.9600 0.9705 1.015 fft 0.9938 0.9644 0.9817 0.9656 fib 0.9670 0.9777 0.9750 0.9473 flat-array 0.9781 0.9758 1.013 0.9805 hamlet 0.9942 1.104 1.029 1.000 imp-for 0.9380 0.9764 0.9429 0.9979 knuth-bendix 0.9409 0.9584 0.8814 0.8520 lexgen 0.9317 0.8911 0.8933 0.9525 life 0.9598 0.9610 0.9812 0.9046 logic 0.9704 0.9620 0.9474 0.9856 mandelbrot 0.9370 0.9926 1.011 0.9694 matrix-multiply 0.9510 0.9425 0.9634 0.9497 md5 0.9568 0.9609 0.9473 0.9282 merge 0.9896 0.9517 0.9780 0.9987 mlyacc 1.124 1.076 1.161 1.150 model-elimination 0.9912 0.9493 1.093 0.9879 mpuz 0.9568 0.9690 1.006 0.9763 nucleic 0.9010 0.9169 0.9384 1.079 output1 0.9668 0.9896 0.9598 0.9086 peek 0.9713 0.9736 0.9840 0.9626 pidigits 0.9720 0.9686 0.9494 0.9379 psdes-random 0.9761 1.004 1.002 0.9435 ratio-regions 0.9844 0.9421 0.9679 0.9262 ray 0.9605 0.8728 0.8533 0.8406 raytrace 0.9130 0.9501 0.8719 1.029 simple 0.8696 0.9130 0.9537 0.9602 smith-normal-form 0.9000 0.9400 0.9632 1.167 string-concat 0.9667 0.9934 0.9466 1.000 tailfib 0.9807 0.9598 0.9779 0.9713 tailmerge 0.9848 0.9871 0.9878 0.9870 tak 0.9720 1.0000 0.9728 0.9968 tensor 0.9739 0.9348 0.9057 0.8996 tsp 0.9955 1.001 0.9283 0.9518 tyan 0.9734 0.9536 0.9052 0.9488 vector32-concat 0.9527 0.9997 0.9673 0.9471 vector64-concat 0.9617 0.9722 0.9921 0.9885 vector-rev 0.9809 0.9846 0.9625 0.9955 vliw 1.020 1.004 1.111 1.086 wc-input1 0.9345 1.002 0.9502 0.9549 wc-scanStream 0.9890 0.9830 1.027 1.028 zebra 0.9712 0.9204 1.101 0.9702 zern 0.9416 1.013 1.098 0.8998 MIN 0.8696 0.8728 0.8533 0.8406 GMEAN 0.9635 0.9691 0.9740 0.9727 MAX 1.124 1.104 1.161 1.167 ``` Notes: * One concern with assigning more RSSA variables to registers is that increases the burden on the codegen (hardware) register allocator. * With the exception of `mlyacc`, there is a general improvement in compile time. Note that configs C04, C05, C06, and C07 are using a MLton compiled with `BounceVars` to compile programs with `BounceVars`. It isn't clear how much of the benefit is due to the effect of `BounceVars` on the compiled MLton and how much is due to the effect of `BounceVars` on the IR of the compiled programs. In any case, it seems that `BounceVars` "pays for itself", although the greatest improvements are not on the largest benchmark programs (`hamlet`, `mlyacc`, `model-elimination`, `nucleic`, `vliw`). ### Executable-Size Ratio ``` text program `C04/C00` `C05/C01` `C06/C02` `C07/C03` barnes-hut 1.005 1.005 1.009 0.9982 boyer 1.002 0.9940 0.9967 0.9972 checksum 1.005 1.001 1.001 1.000 count-graphs 1.007 1.004 1.005 1.001 DLXSimulator 1.015 1.005 1.006 1.007 even-odd 1.005 1.001 0.9999 1.0000 fft 1.009 1.006 1.004 1.003 fib 1.005 1.001 0.9998 1.000 flat-array 1.005 1.001 1.001 0.9993 hamlet 1.002 1.003 1.004 0.9999 imp-for 1.005 1.001 1.001 0.9991 knuth-bendix 1.004 0.9990 1.003 1.001 lexgen 1.025 1.001 1.002 1.004 life 1.003 0.9999 1.000 0.9999 logic 1.004 1.003 1.001 1.002 mandelbrot 1.005 1.001 1.001 0.9994 matrix-multiply 1.005 1.000 1.002 0.9993 md5 1.005 1.002 1.002 1.007 merge 1.004 1.000 0.9994 0.9995 mlyacc 1.176 1.154 1.183 1.180 model-elimination 1.006 1.009 1.011 1.003 mpuz 1.006 0.9998 1.001 1.001 nucleic 1.000 0.9996 0.9998 0.9985 output1 1.006 1.002 1.002 1.006 peek 1.005 1.001 1.001 0.9996 pidigits 1.003 0.9998 1.001 1.005 psdes-random 1.005 1.001 1.002 0.9976 ratio-regions 1.020 1.008 1.007 1.009 ray 1.023 1.021 1.018 1.013 raytrace 1.008 1.003 1.003 1.011 simple 1.009 1.033 1.034 1.055 smith-normal-form 1.005 1.005 1.006 1.003 string-concat 1.004 0.9997 1.002 0.9980 tailfib 1.005 1.000 1.000 0.9998 tak 1.005 1.001 0.9997 0.9999 tensor 1.019 1.019 1.018 1.020 tsp 1.005 1.002 1.004 1.007 tyan 1.034 1.026 1.025 1.030 vector32-concat 1.005 1.000 1.001 0.9984 vector64-concat 1.005 1.000 1.001 0.9982 vector-rev 1.005 1.001 1.001 0.9988 vliw 1.033 1.010 1.012 1.027 wc-input1 1.012 1.009 1.008 1.010 wc-scanStream 1.011 1.006 1.007 1.007 zebra 1.004 1.001 1.002 1.005 zern 1.006 1.003 1.001 1.002 MIN 1.000 0.9940 0.9967 0.9972 GMEAN 1.012 1.007 1.008 1.008 MAX 1.176 1.154 1.183 1.180 ``` Notes: * Consistent with the increase in compile-time for `mlyacc`, there is an increase in code size for `mlyacc`. * Otherwise, there is little change in code size. ## Benchmark results (sulfur) Specs: * 2 x Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz (8 physical cores; 16 logical cores) * Ubuntu 16.04.6 LTS * gcc: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11) * gcc-8: gcc version 8.3.0 (Homebrew GCC 8.3.0) * llvm: LLVM version 8.0.0 ``` text config command C00 /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen amd64 C01 /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c C02 /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c -cc gcc-8 C03 /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen llvm C04 /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen amd64 C05 /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c C06 /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c -cc gcc-8 C07 /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen llvm ``` ### Run-Time Ratio ``` text program `C04/C00` `C05/C01` `C06/C02` `C07/C03` barnes-hut 1.066 1.004 0.9963 0.8840 boyer 0.9890 0.9917 0.9902 0.9567 checksum 1.065 1.001 0.9935 0.9681 count-graphs 0.9580 0.9805 0.9665 0.9672 DLXSimulator 0.7266 0.7235 0.7374 0.7496 even-odd 1.009 0.9903 0.9873 0.9845 fft 0.9930 0.9983 0.9710 0.9682 fib 1.108 1.023 1.0000 0.9724 flat-array 0.9638 0.9581 0.9604 0.9620 hamlet 0.8991 0.9415 0.9295 0.9544 imp-for 0.9688 0.9359 0.9240 1.123 knuth-bendix 1.005 0.9915 1.025 1.005 lexgen 1.001 0.9954 0.9583 0.9974 life 0.9659 1.003 0.9652 0.9923 logic 0.9887 0.9938 0.9702 0.9919 mandelbrot 0.9989 1.015 1.024 1.008 matrix-multiply 1.017 0.9927 1.014 0.9881 md5 0.9846 1.006 0.9868 1.026 merge 1.071 1.077 1.039 1.074 mlyacc 0.9994 0.9986 1.012 1.011 model-elimination 0.9850 0.9656 0.9885 0.9879 mpuz 0.7695 0.8649 0.9525 1.018 nucleic 0.9267 0.9614 0.9359 0.9804 output1 0.9453 0.8869 0.8673 0.8782 peek 0.9988 1.000 1.011 0.9938 pidigits 0.9734 0.9961 1.003 0.9986 psdes-random 0.9902 0.9998 1.016 0.9956 ratio-regions 1.107 0.9887 0.9315 0.9852 ray 0.9905 0.9674 0.9930 0.9931 raytrace 1.008 1.032 1.005 1.057 simple 0.9659 0.9692 0.9842 0.9648 smith-normal-form 0.9858 0.9876 0.9712 0.9615 string-concat 0.9901 1.001 1.007 0.9731 tailfib 0.9968 1.015 1.000 0.9936 tailmerge 0.9893 0.9932 0.9810 0.8081 tak 0.9724 1.058 0.9762 0.9693 tensor 1.005 0.9982 0.9914 1.006 tsp 1.030 1.049 1.037 1.032 tyan 1.007 0.9887 0.9850 1.002 vector32-concat 0.3328 0.3212 0.6566 1.029 vector64-concat 0.3983 0.3806 0.7111 0.9922 vector-rev 0.9282 1.032 0.9473 0.9013 vliw 0.9068 0.8996 0.8662 0.9419 wc-input1 1.396 0.9463 1.120 1.348 wc-scanStream 0.9928 1.015 1.117 0.9642 zebra 0.9957 0.9966 0.9974 0.9967 zern 0.9958 1.013 1.000 0.9743 MIN 0.3328 0.3212 0.6566 0.7496 GMEAN 0.9468 0.9393 0.9640 0.9827 MAX 1.396 1.077 1.120 1.348 ``` Notes: * Again, there are some "big" wins and no "big" losses. * Interestingly, compared to `cadmium`, `vector32-concat` and `vector64-concat` see greater speedups, though `mpuz` sees a smaller speedup. ### Compile-Time Ratio ``` text program `C04/C00` `C05/C01` `C06/C02` `C07/C03` barnes-hut 0.9401 0.9632 1.013 1.006 boyer 0.9659 1.000 0.9725 0.9629 checksum 1.022 0.9911 0.9966 0.9967 count-graphs 1.005 1.042 1.004 1.011 DLXSimulator 0.9910 0.9526 0.9833 0.9914 even-odd 0.9654 0.9591 1.009 0.9795 fft 0.9841 0.9713 0.9885 0.9923 fib 0.9417 0.9946 1.000 0.9976 flat-array 1.002 1.049 0.9716 1.012 hamlet 1.103 1.018 0.9917 0.9781 imp-for 0.9586 1.041 0.9824 1.032 knuth-bendix 0.9690 1.007 1.034 0.9347 lexgen 1.049 0.9718 0.9930 1.029 life 0.9490 0.9798 1.023 1.018 logic 0.9936 0.9303 0.9768 0.9715 mandelbrot 0.9959 0.9518 0.9339 0.9740 matrix-multiply 1.117 1.060 1.013 0.9176 md5 1.007 0.9454 0.9904 0.9973 merge 0.9922 0.9493 0.9793 1.015 mlyacc 1.113 1.125 1.136 1.143 model-elimination 0.9702 1.030 1.025 1.017 mpuz 1.015 0.9249 1.053 0.9401 nucleic 0.9813 1.007 0.9919 1.043 output1 0.9285 0.9746 0.9603 1.049 peek 0.9547 0.9915 0.9410 0.9543 pidigits 0.9569 0.9757 0.9745 0.9973 psdes-random 0.9687 0.9835 1.006 0.9596 ratio-regions 0.9640 1.021 1.019 1.036 ray 1.079 0.9156 0.9577 0.9714 raytrace 1.008 0.9028 1.010 1.052 simple 1.025 1.154 0.9570 1.068 smith-normal-form 0.9611 0.9399 0.9215 0.9196 string-concat 1.023 0.9968 0.9789 1.072 tailfib 0.9869 0.9748 0.9711 1.001 tailmerge 0.9812 0.9766 0.9731 0.9051 tak 0.9694 0.9863 0.9719 0.9681 tensor 0.9759 1.061 0.9847 1.030 tsp 0.9818 1.094 0.9497 1.066 tyan 1.018 1.004 0.9927 1.011 vector32-concat 0.9783 0.9755 0.9766 0.9792 vector64-concat 0.9820 0.9876 0.9830 0.9607 vector-rev 0.9648 0.9215 1.066 1.002 vliw 1.075 1.045 1.035 1.052 wc-input1 1.012 1.025 1.005 0.9931 wc-scanStream 0.9747 0.9883 0.9940 0.9761 zebra 0.9838 0.9624 0.9758 1.003 zern 0.8921 0.9604 0.9794 0.9880 MIN 0.8921 0.9028 0.9215 0.9051 GMEAN 0.9921 0.9920 0.9919 0.9985 MAX 1.117 1.154 1.136 1.143 ``` Notes: * Compared to `cadmium`, there is less general improvement in compile time, though (with the exception of `mlyacc`), no "across the board" slowdowns. ### Executable-Size Ratio ``` text program `C04/C00` `C05/C01` `C06/C02` `C07/C03` barnes-hut 1.005 1.005 1.009 0.9983 boyer 1.002 0.9940 0.9967 0.9972 checksum 1.005 1.001 1.001 1.000 count-graphs 1.007 1.004 1.005 1.001 DLXSimulator 1.015 1.005 1.006 1.008 even-odd 1.005 1.001 0.9999 0.9999 fft 1.009 1.006 1.004 1.003 fib 1.005 1.001 0.9998 1.000 flat-array 1.005 1.001 1.001 0.9995 hamlet 1.002 1.003 1.004 0.9999 imp-for 1.005 1.001 1.001 0.9991 knuth-bendix 1.004 0.9989 1.003 1.000 lexgen 1.025 1.001 1.002 1.004 life 1.003 0.9998 1.000 0.9997 logic 1.004 1.003 1.001 0.9997 mandelbrot 1.005 1.001 1.001 0.9994 matrix-multiply 1.005 1.000 1.002 0.9996 md5 1.005 1.002 1.002 1.007 merge 1.004 1.000 0.9993 0.9994 mlyacc 1.176 1.154 1.183 1.179 model-elimination 1.006 1.009 1.011 1.003 mpuz 1.006 0.9998 1.001 1.001 nucleic 1.000 0.9996 0.9998 0.9985 output1 1.006 1.002 1.002 1.006 peek 1.005 1.001 1.001 0.9994 pidigits 1.003 0.9998 1.001 1.005 psdes-random 1.005 1.001 1.002 0.9978 ratio-regions 1.020 1.008 1.007 1.009 ray 1.023 1.021 1.018 1.013 raytrace 1.008 1.003 1.003 1.013 simple 1.009 1.033 1.034 1.055 smith-normal-form 1.005 1.005 1.006 1.003 string-concat 1.004 0.9995 1.002 0.9980 tailfib 1.005 1.000 1.000 0.9998 tailmerge 1.004 0.9995 1.001 0.9998 tak 1.005 1.001 0.9997 1.000 tensor 1.019 1.019 1.018 1.019 tsp 1.005 1.002 1.004 1.007 tyan 1.034 1.026 1.025 1.031 vector32-concat 1.005 1.000 1.001 0.9984 vector64-concat 1.005 1.000 1.001 0.9982 vector-rev 1.005 1.001 1.001 0.9987 vliw 1.033 1.010 1.012 1.026 wc-input1 1.012 1.009 1.008 1.010 wc-scanStream 1.011 1.006 1.007 1.007 zebra 1.004 1.001 1.002 1.005 zern 1.006 1.003 1.001 1.002 MIN 1.000 0.9940 0.9967 0.9972 GMEAN 1.011 1.007 1.008 1.008 MAX 1.176 1.154 1.183 1.179 ``` Notes: * As they should be (same versions of MLton, gcc, llvm), executable sizes are identical with `cadmium`.

MatthewFluet added Compiler enhancement labels Nov 6, 2017

daemanos mentioned this issue Oct 15, 2018

Add new overflow-checking primitives #273

Merged

MatthewFluet mentioned this issue Jan 16, 2019

Alternate strategies for globalization in ConstantPropagation #288

Merged

MatthewFluet mentioned this issue May 13, 2019

Slightly improved rssa shrink #298

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate alternate stack- and register-allocation strategies in backend #218

Investigate alternate stack- and register-allocation strategies in backend #218

MatthewFluet commented Nov 6, 2017

Investigate alternate stack- and register-allocation strategies in backend #218

Investigate alternate stack- and register-allocation strategies in backend #218

Comments

MatthewFluet commented Nov 6, 2017