Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate alternate stack- and register-allocation strategies in backend #218

Open
MatthewFluet opened this issue Nov 6, 2017 · 0 comments

Comments

@MatthewFluet
Copy link
Member

#217 updated the stack-allocation strategy in the RSSA to Machine translation:

  • Do not force Cont block arguments to stack (7ec42a1)
  • Improve Allocation.Stack.get (31b6e29)

Unfortunately, the improved Allocation.Stack.get has degraded the hamlet benchmark (but not others):

MLton0 -- /home/mtf/devel/mlton/builds/20171030.143531-ga81514e/bin/mlton -codegen amd64
MLton1 -- /home/mtf/devel/mlton/builds/20171030.143531-ga81514e/bin/mlton -codegen c
MLton2 -- /home/mtf/devel/mlton/builds/20171030.143531-ga81514e/bin/mlton -codegen llvm
MLton3 -- /home/mtf/devel/mlton/builds/20171101.181350-g7ec42a1/bin/mlton -codegen amd64
MLton4 -- /home/mtf/devel/mlton/builds/20171101.181350-g7ec42a1/bin/mlton -codegen c
MLton5 -- /home/mtf/devel/mlton/builds/20171101.181350-g7ec42a1/bin/mlton -codegen llvm
MLton6 -- /home/mtf/devel/mlton/builds/20171101.190843-g31b6e29/bin/mlton -codegen amd64
MLton7 -- /home/mtf/devel/mlton/builds/20171101.190843-g31b6e29/bin/mlton -codegen c
MLton8 -- /home/mtf/devel/mlton/builds/20171101.190843-g31b6e29/bin/mlton -codegen llvm
MLton9 -- /home/mtf/devel/mlton/builds/20171105.113517-gfd0d8e4/bin/mlton -codegen amd64
MLton10 -- /home/mtf/devel/mlton/builds/20171105.113517-gfd0d8e4/bin/mlton -codegen c
MLton11 -- /home/mtf/devel/mlton/builds/20171105.113517-gfd0d8e4/bin/mlton -codegen llvm
run time ratio
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 MLton9 MLton10 MLton11
hamlet      1.00   2.53   2.27   0.99   2.39   2.29   1.18   2.61   2.44   1.16    2.70    2.46
size
benchmark    MLton0    MLton1    MLton2    MLton3    MLton4    MLton5    MLton6    MLton7    MLton8    MLton9   MLton10   MLton11
hamlet    1,458,564 1,501,276 1,529,948 1,454,516 1,497,868 1,526,700 1,445,700 1,490,156 1,519,524 1,449,492 1,492,732 1,523,316
compile time
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 MLton9 MLton10 MLton11
hamlet     12.82  23.79  40.39  11.96  23.34  41.01  12.11  23.10  40.84  12.66   23.58   41.44
run time
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 MLton9 MLton10 MLton11
hamlet     33.70  85.28  76.34  33.37  80.64  77.29  39.73  87.84  82.11  39.26   90.87   82.93

(20171105.113517-gfd0d8e4 is 31b6e29 with 7ec42a1 reverted.)

The improved Allocation.Stack.get was not expected to have a negative effect on runtime; it should simply find different stack slots for variables and possibly decrease stack frame sizes.

One conjecture is that the improved Allocation.Stack.get causes variables that previously shared a stack slot to now have different stack slots. This could impact intraprocedural Goto transfers, where, previously, a formal and an actual argument were assigned the same stack slot (and the move of the actual to the formal was a noop), but, now, the formal and the actual are assigned different stack slots (and the move of the actual to the formal is an executed move).

In general, the current stack- and register-allocation strategy (i.e., the mapping of RSSA variables to Machine globals, stack-slots, or registers) in the RSSA to Machine translation is greedy. [Remember, in the Machine IR, "registers" are really local variables that are not live across a non-tail or runtime call. The hope is that they will map to hardware registers, but in the Machine IR, there can be an arbitrary number of registers.] The strategy is essentially to begin with the stack- and register-allocation dictated by the live-in of a block and then greedily assign stack-slots and registers to variables defined within the block. Allocations are not updated when variables go dead within a block (potentially freeing up stack-slots and/or registers). Nor is any attempt made to optimize the placement of variables to reduce moves.

For Machine IR registers, this is mostly acceptable, since they codegens will perform their own assignment of these Machine IR registers (remember, "local variables") to hardware registers and spill locations, presumably using coalescing and other heuristics.

For Machine IR stack-slots, though, this can be problematic, because the codegens will generally perform reads and writes to stack-slots as instructed; in particular, a stack-slot to stack-slot move can never be elided by the C and LLVM codegens (since they appear as reads and writes of memory) and aren't easily elided by the native x86/amd64 codegens (the -native-live-stack {false|true} option was meant to track the liveness of stack slots in the native codegens so as to allow a Machine IR stack slot to live entirely in hardware registers without requiring writing the value back to memory, but the management of liveness information in the native codeges is naive (and performs poorly) and the number of stack-allocated variables can be large, so it never performed well-enough on self-compiles to become the default).

In any case, one could investigate better allocation strategies for the RSSA to Machine translation. In particular, actually build an interference graph and recognize variable-to-variable moves and make assignments to avoid extraneous moves.

MatthewFluet added a commit that referenced this issue Oct 15, 2018
Add new overflow-checking primitives

Motivation

MLton currently implements checked arithmetic with specialized primitives.
These primitives are exposed to the Basis Library implementation as functions
that implicitly raise a primitive exception:

    val +! = _prim "WordS8_addCheck": int * int -> int;

In the XML IR, special care is taken to "remember" the primive exception
associated with these primitives in order to implement exceptions correctly.  In
the SSA, SSA2, RSSA, and Machine IRs, these primitives are treated as transfers,
rather than statements.

This pull request implements a possibly better implementation of these
operations as simple primitives that return a boolean:

    val +! = _prim "Word8_add": int * int -> int;
    val +? = _prim "WordS8_addCheckP": int * int -> bool;
    val +$ = fn (x, y) => let val r = +! (x, y)
                          in  if +? (x, y) then raise Overflow else r
                          end

This would eliminate the special cases in the XML IR and in the SSA, SSA2, RSSA,
and Machine IRs, where the primitives would be treated as statements.  Other
compilers provide overflow checking via boolean-returning functions:

 * https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html
 * https://llvm.org/docs/LangRef.html#arithmetic-with-overflow-intrinsics

Implementation

This patch refactors the primitive checked-arithmetic operations such that the
suffix `?` represents a new overflow-checking predicate, a suffix `!` represents
the non-overflow-checking variant, and a suffix `$` represents the
overflow-checking variant (mnemonic: `$` for "safe" or "expensive"). The
behavior of the new `$`-operations is controlled by a compile-time constant,
`MLton.newOverflow`.  When set to false (the default), the `$`-operations make
use of the old-style implicit-`Overflow` primitives.  When set to true, the
`$`-operations are implemented as an `if`-expression that branches on the result
of the corresponding `?`-operation and either raises the `Overflow` exception or
returns the result of the corresponding `!`-operation.  Finally, the bare
operation is aliased to either the `$`-form (with overflow detection enabled) or
the `!`-form (with overflow detection disabled).  Essentially:

    val +! = _prim "Word8_add": int * int -> int;
    val +? = _prim "WordS8_addCheckP": int * int -> bool;
    val +$ = if MLton.newOverflow
             then fn (x, y) => let val r = +! (x, y)
                               in  if +? (x, y) then raise Overflow else r
                               end
             else fn (x, y) => ((_prim "WordS8_addCheckP": int * int -> int;) (x, y))
                               handle PrimOverflow => raise Overflow
    val + = if MLton.detectOverflow then +$ else +!

Note that the checked-arithmetic using `!`- and `$`-operations is careful to
perform the `!`-operation before the `$`-operation.  With the native-codegens, a
new peephole optimization combines the separate unchecked-arithmetic operation
and checked-arithmetic-predicate operation into a single instruction.  For the
C-codgen, the new checked-arithmetic-predicate primitives are translated to uses
of the `__builtin_{add,sub,mul}_overflow` intrinsics (which improves upon the
previous explicit conditional checking, but requires gcc 5 or greater).
Similarly, for the LLVM-codgen, the new checked-arithmetic-predicate primitives
are translated to uses of the `{sadd,uadd,smul,umul,ssub,usub}.with.overflow`
intrinsics.  For both the C- and LLVM-codegens, it is expected that these
intrinsics will be combined with the separate unchecked-arithmetic operation.

In addition, the `RedundantTests` optimization has been extended to eliminate
the overflow test when adding or subtracting 1 with the new primitives.

Performance

The native-codegen peephole optimization and `RedundantTests` have been mostly
sufficient to keep performance on par with the older checked-arithmetic
primitives, and in some cases performance has even significantly improved.  Below
is a summary of the exceptional runtime ratios in the different codegens (both
positive and negative):

| Benchmark       | Native | LLVM |    C |
|-----------------|--------|------|------|
| even-odd        |   1.00 | 1.00 | 1.09 |
| hamlet          |   0.98 | 0.90 | 0.93 |
| imp-for         |   0.99 | 1.50 | 0.46 |
| lexgen          |   0.92 | 1.31 | 1.24 |
| matrix-multiply |   0.99 | 1.00 | 0.87 |
| md5             |   1.06 | 1.01 | 0.97 |
| tensor          |   1.01 | 1.00 | 0.57 |

No benchmarks were consistently worse or better on all codegen, e.g., the
`imp-for` benchmark performed exceptionally badly on the LLVM codegen, but was
much faster on the C codegen and about even on the native codegen.  For this
particular benchmark, the cause of the slowdown with the LLVM codegen has yet to
be discovered.  Similarly, the cause of the slowdown in `lexgen` with the C- and
LLVM-codegens is unknown.  For the `md5` benchmark, on the other hand, the cause
of the slowdown with the native codegen seems to be a failure to eliminate
common subexpressions in certain circumstances, which can potentially be
improved in the future.  Improvements in the C-codegen are likely to be due to
the better overflow checking with builtins.

Future work

The `CommonSubexp` optimization currently handles `Arith` transfers specially;
in particular, the result of an `Arith` transfer can be used in common `Arith`
transfers that it dominates.  This was done so that code like:

    (n + m) + (n + m)

can be transformed to

    let val r = n + m in r + r end

With the new checked-arithmetic-predicate primitives, the computation of the
boolean value may be common-subexpr eliminated, but `Case` discrimination will
not.  This forces the boolean value to be reified and to be discriminated
multiple times (though, perhaps `KnownCase` could eliminate subsequent
discriminations).  Extending `CommonSubexpr` to propagate flow-sensitive
relationships at `Case` transfers to the dominated targets could improve the
performance `md5` with `MLton.newOverflow true` and potentially improve
performance elsewhere as well (e.g., by propagating the results of comparisons
as well).

Once all performance slowdowns with `MLton.newOverflow true` have been
eliminated, it would be desirable to remove the old-style implicit-`Overflow`
primitives and `Arith` transfers.  This would eliminate many previous instances
of special-case code to handle these primitives and transfers.

Finally, it may be worth investigating an implementation of the
checked-operations via

    val +$ = fn (x, y) => let val b = +? (x, y)
                              val r = +! (x, y)
                          in  if b then raise Overflow else r
                          end

rather than

    val +$ = fn (x, y) => let val r = +! (x, y)
                          in  if +? (x, y) then raise Overflow else r
                          end

The advantage of calculating the boolean first is that when `x` (or `y`) is a
loop variable and `r` will be passed as the value for the next iteration, then
both `x` and `r` could be assigned the same location (pseudo-register or stack
slot).  `x` and `r` cannot share the same location when the boolean is
calculated second, because `x` and `r` are both live at the calculation of the
boolean.  See #218.  This would require reworking the native-codegen
peephole optimization.
MatthewFluet added a commit that referenced this issue Jan 16, 2019
Alternate strategies for globalization in ConstantPropagation

Extend globalization aspect of ConstantPropagation to support
globalization of arrays and to support different "small type"
strategies.

Closes #206.

Space-safety prohibits ConstantPropagation from globalizing all arrays
and refs that are allocated at most once by a program. In particular,
because globals are live for the duration of the program, globalizing
an `int list ref` (for example) would not be safe-for-space: an
arbitrarily large list may be written to the reference and never be
garbage collected (whereas, when the `int list ref` is not globalized,
it will be garbage collected when it is no longer live). On the other
hand, globalizing an `int ref` is safe-for-space.

However, MLton previously used only a very conservative estimation for
space safety. Only "small" types may be globalized, where smallness is
defined as:

    fun isSmall t =
        case dest t of
           Array _ => false
         | Datatype _ => false
         | Ref t => isSmall t
         | Tuple ts => Vector.forall (ts, isSmall)
         | Vector _ => false
         | _ => true

Note that no `Datatype` is small; this is conservative (since a
recursive datatype could represent unbounded data), but prevents
globalizing `bool ref`. Also, no `Array` is small; this is correct
(because an `int array ref` should not be globalized), but the
globalization of a `val a: t array = Array_alloc[t] (l)` was
conditioned on the smallness of `t array`, not the smallness of
`t`. It is correct to globalize an array if `t` were small; note that
to globalize `val a: t array = Array_alloc[t] (l)`, `l` (the length)
must be globalized and must, therefore, be a constant and the array is
of constant size. (This is Stephen Weeks's relaxed notion of
safe-for-space, where the constant factor blowup can be chosen per
program.)

This pull request adds support for alternate globalization strategies:
 * `-globalize-arrays {false|true}`: globalize arrays
 * `-globalize-refs {true|false}`: globalize refs
 * `-globalize-small-int-inf {true|false}`:globalize `IntInf` as a
   small type
 * `-globalize-small-type {1|0|2|3|4|9}`: strategies for classifying a
   type as "small":
   * `0`: constant `false` function (no types considered small)
   * `1`: no `Datatype` is considered small (original strategy)
   * `2`: `Datatype`s with all nullary constructors are considered
     small
   * `3`: `Datatype`s with all constructor arguments considered small
     according to strategy `2` are considered small
   * `4`: Fixed-point analysis of `Datatype`s to determine smallness
   * `9`: constant `true` function (all types considered small; not
     safe-for-space)

The defaults correspond to the previous behavior.

Unfortunately, additional globalization has little to no (positive)
effect on benchmarks:

MLton0 -- ~/devel/mlton/builds/20190106.115052-gfe996d4/bin/mlton
MLton1 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 1
MLton2 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 2
MLton3 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 3
MLton4 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 4
MLton5 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 1
MLton6 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 2
MLton7 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 3
MLton8 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 4
run time ratio
benchmark         MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8
DLXSimulator        1.00   1.00   1.03   1.00   1.00   0.99   1.00   1.01   1.01
checksum            1.00   1.00   1.13   1.14   1.13   1.00   1.12   1.13   1.14
flat-array          1.00   1.00   1.00   1.00   1.01   1.19   1.19   1.19   1.19
hamlet              1.00   1.00   1.02   1.01   1.01   1.00   0.96   0.97   0.97
imp-for             1.00   1.00   1.06   1.05   1.06   1.00   1.05   1.06   1.06
knuth-bendix        1.00   1.00   1.03   1.03   1.03   1.00   1.03   1.03   1.03
lexgen              1.00   0.99   1.05   0.99   1.03   1.04   1.03   1.02   1.07
model-elimination   1.00   1.01   1.02   1.02   1.02   1.00   1.03   1.02   1.03
peek                1.00   1.00   1.03   1.03   1.04   1.00   1.04   1.04   1.04
ray                 1.00   1.03   0.99   0.99   1.01   0.99   0.99   0.99   0.98
raytrace            1.00   1.03   1.00   1.02   1.03   0.99   1.00   1.01   1.01
simple              1.00   0.98   0.99   1.00   0.99   0.97   0.97   0.96   0.97
tak                 1.00   1.00   1.04   1.03   1.10   1.01   1.02   1.08   1.00
wc-scanStream       1.00   1.00   1.06   1.06   1.06   1.00   1.04   1.06   1.07

Note that `MLton0` and `MLton1` generate identical code (modulo the
random magic number), so the slowdowns in `ray` and `raytrace` are
noise, which also suggests that slowdowns/speedups of <= 3% are also
likely noise.

The slowdown in `flat-array` with `-globalize-array true` is explained
as follows.  The `flat-array` benchmark uses `Vector.tabulate` to
allocate a vector that is used for all iterations of the benchmark.
With `-globalize-array false`, the array is not globalized, and in
SSA/SSA2, we have:

    x_1212: ((word32, word32) tuple) array =
      prim Array_alloc((word32, word32) tuple) (global_138 (*0xF4240*))
...
    x_757: ((word32, word32) tuple) vector =
      prim Array_toVector((word32, word32) tuple) (x_1212)
...
    x_1287: (word32, word32) tuple =
      prim Vector_sub((word32, word32) tuple) (x_757, x_1283)

but with `-globalize-array true`, the array is globalized, and in
SSA/SS2, we have:

global_490: ((word32, word32) tuple) array =
  prim Array_alloc((word32, word32) tuple) (global_138 (*0xF4240*))
...
    x_757: ((word32, word32) tuple) vector =
      prim Array_toVector((word32, word32) tuple) (global_490)
...
    x_1286: (word32, word32) tuple =
      prim Vector_sub((word32, word32) tuple) (x_757, x_1282)

At RSSA, the `Array_toVector` becomes a header update and the array
variable is cast/copy-propagated for the vector variable;
with `-globalize-arrays false`, we have

  L_531 (x_1212: Objptr (opt_11)) CReturn {func = {..., target = GC_sequenceAllocate}} =
...
    OW64 (x_1212, ~8): Word64 := opt_12
...
    x_1354: Word32 = XW32 (Cast (x_1212, Objptr (opt_12)), x_1283, 8, 0)
    x_1353: Word32 = XW32 (Cast (x_1212, Objptr (opt_12)), x_1283, 8, 4)


but with `-globalize-arrays true`, we have

  L_488 (global_490: Objptr (opt_7)) CReturn {func = {..., target = GC_sequenceAllocate}} =
...
    OW64 (global_490, ~8): Word64 := opt_12
...
    x_1353: Word32 = XW32 (Cast (global_490, Objptr (opt_12)), x_1282, 8, 0)
    x_1352: Word32 = XW32 (Cast (global_490, Objptr (opt_12)), x_1282, 8, 4)

Finally, with `-globalize-arrays false`, `x_1212` becomes a local
(because the loops to initialize and use the vector are
non-allocating):

    RW32(2): Word32
     = XW32 (Cast (RP(0): Objptr (opt_11), Objptr (opt_12)),
             RW64(0): Word64, 8, 0): Word32
    RW32(3): Word32
     = XW32 (Cast (RP(0): Objptr (opt_11), Objptr (opt_12)),
             RW64(0): Word64, 8, 4): Word32

but with `-globalize-arrays true`:

    RW32(2): Word32
     = XW32 (Cast (glob {index = 1, isRoot = true, ty = Objptr (opt_7)},
                   Objptr (opt_12)),
             RW64(0): Word64, 8, 0): Word32
    RW32(3): Word32
     = XW32 (Cast (glob {index = 1, isRoot = true, ty = Objptr (opt_7)},
                   Objptr (opt_12)),
             RW64(0): Word64, 8, 4): Word32

The innermost loop of the benchmark goes from indexing a sequence
stored in a local (`RP(0)`) to indexing a sequence stored in a global
(`GP(1)`).  All of the codegens should implement the former by using a
hardware register for `RP(0)`, but will implement the latter with a
memory read.

In light of the above, and related to #218, it may be
beneficial to "deglobalize" object pointer globals; that is, in RSSA
functions that have multiple accesses through the same object pointer
global (particularly within loops) could be translated to copy the
global to a local.

The slowdown in `checksum` is less easily explained.  The only new
objects globalized with `-globalize-small-type 2` as compared to
`-globalize-small-type 1` are two `bool ref` objects, corresponding to
the `exiting` flag of `basis-library/mlton/exit.sml` and the
`staticIsInUse` flag of `basis-library/util/one.sml` used by
`Int.fmt`.  That small change seems to lead to code layout and cache
effects that result in the slowdown, because the assembly code is not
substantial different.  With `-enable-pass machineSuffle` and
`-seed-rand <w>`, one can perturb the code layout and observe that the
slowdowns are not universal:

MLton0 -- ~/devel/mlton/builds/20190106.115052-gfe996d4/bin/mlton
MLton1 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 1
MLton2 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 1 -enable-pass machineShuffle -seed-rand 42424242
MLton3 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 1 -enable-pass machineShuffle -seed-rand deadbeef
MLton4 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 2
MLton5 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 2 -enable-pass machineShuffle -seed-rand 42424242
MLton6 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 2 -enable-pass machineShuffle -seed-rand deadbeef
MLton7 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 3
MLton8 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 3 -enable-pass machineShuffle -seed-rand 42424242
MLton9 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 3 -enable-pass machineShuffle -seed-rand deadbeef
MLton10 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 4
MLton11 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 4 -enable-pass machineShuffle -seed-rand 42424242
MLton12 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 4 -enable-pass machineShuffle -seed-rand deadbeef
MLton13 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 1
MLton14 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 1 -enable-pass machineShuffle -seed-rand 42424242
MLton15 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 1 -enable-pass machineShuffle -seed-rand deadbeef
MLton16 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 2
MLton17 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 2 -enable-pass machineShuffle -seed-rand 42424242
MLton18 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 2 -enable-pass machineShuffle -seed-rand deadbeef
MLton19 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 3
MLton20 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 3 -enable-pass machineShuffle -seed-rand 42424242
MLton21 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 3 -enable-pass machineShuffle -seed-rand deadbeef
MLton22 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 4
MLton23 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 4 -enable-pass machineShuffle -seed-rand 42424242
MLton24 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 4 -enable-pass machineShuffle -seed-rand deadbeef
run time ratio
benchmark     MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 MLton9 MLton10 MLton11 MLton12 MLton13 MLton14 MLton15 MLton16 MLton17 MLton18 MLton19 MLton20 MLton21 MLton22 MLton23 MLton24
checksum        1.00   1.00   1.00   1.01   1.14   1.01   1.00   1.14   1.01   1.01    1.14    1.00    1.04    1.00    1.00    1.00    1.14    1.00    1.01    1.16    1.00    1.01    1.15    1.00    1.01
flat-array      1.00   1.01   1.02   1.00   1.00   1.01   1.01   1.03   1.01   1.01    1.01    1.01    1.01    1.23    1.19    1.19    1.19    1.20    1.19    1.20    1.20    1.19    1.20    1.19    1.20
hamlet          1.00   0.99   1.00   0.99   1.01   1.01   1.00   1.01   1.03   1.01    1.01    1.02    1.01    1.00    1.00    0.99    0.95    0.95    0.94    0.97    0.96    0.96    0.98    0.98    0.99
imp-for         1.00   1.00   1.05   1.05   1.05   1.02   1.00   1.05   0.99   1.00    1.05    1.00    1.01    1.00    1.05    1.05    1.05    0.99    1.00    1.05    1.00    1.00    1.05    1.00    0.99
lexgen          1.00   0.97   1.00   0.97   1.03   1.03   1.01   1.04   0.99   0.95    0.99    0.99    0.95    0.97    0.98    0.95    0.96    1.01    0.98    1.00    1.04    0.95    0.96    1.00    0.95
peek            1.00   1.00   1.00   1.01   1.03   1.01   1.04   1.03   1.01   1.04    1.03    1.01    1.03    1.00    1.00    1.01    1.04    1.01    1.05    1.03    1.01    1.04    1.04    1.00    1.04
simple          1.00   1.01   1.01   1.00   1.00   0.99   1.02   1.00   0.99   1.00    1.00    1.00    1.00    0.98    0.97    0.99    0.98    0.97    0.99    0.97    0.99    0.98    0.98    0.98    0.99
tak             1.00   0.99   0.90   1.00   1.05   0.90   0.99   1.02   0.89   0.99    1.04    0.90    1.00    0.99    0.89    0.99    0.99    0.90    0.99    1.01    0.90    1.00    1.00    0.90    1.00
wc-scanStream   1.00   1.01   1.01   1.03   1.06   1.02   1.03   1.05   1.02   1.02    1.04    1.01    1.00    1.01    1.01    1.00    1.06    1.01    1.00    1.07    1.01    1.00    1.05    1.03    1.02

Note that while `checksum` with MLton4 has a slowdown, `checksum` with
MLton5 and MLton6 (which are identical up to shuffling of the
functions and basic blocks at the MachineIR) do not have a slowdown.
Similarly `tak` with MLton0 and MLton1 have similar running time, but
`tak` with MLton3 has a speedup.  On the other hand, `flat-array`'s
slowdowns with `-globalize-arrays true` are not due to code layout
effects.

`hamlet` may have a slight speedup with `-globalize-arrays true`, but
that is significantly outweighted by the slowdown in `flat-array`.

The conclusion is to leave the defaults corresponding to the original
behavior.


Full benchmark results:

MLton0 -- ~/devel/mlton/builds/20190106.115052-gfe996d4/bin/mlton
MLton1 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 1
MLton2 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 2
MLton3 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 3
MLton4 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays false -globalize-refs true -globalize-small-type 4
MLton5 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 1
MLton6 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 2
MLton7 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 3
MLton8 -- ~/devel/mlton/builds/20190111.182738-g0847620/bin/mlton -globalize-arrays true -globalize-refs true -globalize-small-type 4
run time ratio
benchmark         MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8
DLXSimulator        1.00   1.00   1.03   1.00   1.00   0.99   1.00   1.01   1.01
barnes-hut          1.00   1.01   1.01   1.02   1.01   1.01   1.01   1.02   1.01
boyer               1.00   1.00   1.01   1.01   1.02   1.00   1.01   1.01   1.01
checksum            1.00   1.00   1.13   1.14   1.13   1.00   1.12   1.13   1.14
count-graphs        1.00   1.00   0.99   1.01   0.99   1.00   1.00   1.00   0.99
even-odd            1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
fft                 1.00   1.00   1.01   1.02   1.02   1.01   1.01   1.00   1.01
fib                 1.00   1.00   1.00   1.00   1.01   1.00   1.00   1.00   1.00
flat-array          1.00   1.00   1.00   1.00   1.01   1.19   1.19   1.19   1.19
hamlet              1.00   1.00   1.02   1.01   1.01   1.00   0.96   0.97   0.97
imp-for             1.00   1.00   1.06   1.05   1.06   1.00   1.05   1.06   1.06
knuth-bendix        1.00   1.00   1.03   1.03   1.03   1.00   1.03   1.03   1.03
lexgen              1.00   0.99   1.05   0.99   1.03   1.04   1.03   1.02   1.07
life                1.00   1.00   1.00   1.00   1.00   1.01   1.01   1.00   1.00
logic               1.00   1.00   0.98   0.99   0.99   1.00   0.98   0.99   1.00
mandelbrot          1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
matrix-multiply     1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.01   1.00
md5                 1.00   1.00   0.99   0.99   0.99   1.00   0.99   0.99   0.99
merge               1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
mlyacc              1.00   1.01   1.01   1.01   1.01   1.00   0.99   0.99   1.01
model-elimination   1.00   1.01   1.02   1.02   1.02   1.00   1.03   1.02   1.03
mpuz                1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
nucleic             1.00   1.01   1.00   1.00   0.99   1.00   0.99   0.99   1.00
output1             1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
peek                1.00   1.00   1.03   1.03   1.04   1.00   1.04   1.04   1.04
psdes-random        1.00   1.00   1.00   1.00   1.00   1.00   1.01   1.00   1.00
ratio-regions       1.00   0.99   0.98   0.98   1.01   1.01   0.99   1.00   1.01
ray                 1.00   1.03   0.99   0.99   1.01   0.99   0.99   0.99   0.98
raytrace            1.00   1.03   1.00   1.02   1.03   0.99   1.00   1.01   1.01
simple              1.00   0.98   0.99   1.00   0.99   0.97   0.97   0.96   0.97
smith-normal-form   1.00   1.00   1.01   0.99   1.00   1.00   1.00   1.00   1.00
string-concat       1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
tailfib             1.00   1.00   1.00   0.99   1.00   1.00   1.00   1.00   1.00
tak                 1.00   1.00   1.04   1.03   1.10   1.01   1.02   1.08   1.00
tensor              1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
tsp                 1.00   1.00   1.00   1.00   1.00   0.99   1.00   1.00   1.00
tyan                1.00   1.00   1.00   1.01   1.01   1.01   1.01   1.01   1.01
vector-rev          1.00   0.99   0.99   0.98   0.98   0.98   0.99   0.98   0.98
vector32-concat     1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
vector64-concat     1.00   1.00   1.00   1.00   1.00   0.99   1.00   1.00   0.99
vliw                1.00   0.98   1.00   1.01   1.00   1.00   0.98   0.98   0.98
wc-input1           1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
wc-scanStream       1.00   1.00   1.06   1.06   1.06   1.00   1.04   1.06   1.07
zebra               1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
zern                1.00   1.02   1.02   0.99   1.01   0.99   0.99   1.01   0.99
size
benchmark            MLton0    MLton1    MLton2    MLton3    MLton4    MLton5    MLton6    MLton7    MLton8
DLXSimulator        209,076   209,076   209,140   209,140   209,140   209,076   208,836   208,340   208,340
barnes-hut          176,199   176,199   176,071   176,071   176,071   176,199   176,071   176,071   176,071
boyer               243,369   243,369   243,289   243,289   243,289   243,369   243,289   243,289   243,289
checksum            117,561   117,561   117,433   117,433   117,433   117,561   117,433   117,433   117,433
count-graphs        145,065   145,065   145,017   145,017   145,017   145,065   144,937   144,937   144,937
even-odd            117,529   117,529   117,433   117,433   117,433   117,529   117,433   117,433   117,433
fft                 142,307   142,307   141,315   141,315   141,315   142,307   141,315   141,315   141,315
fib                 117,449   117,449   117,321   117,321   117,321   117,449   117,321   117,321   117,321
flat-array          117,177   117,177   117,049   117,049   117,049   117,193   117,081   117,081   117,081
hamlet            1,434,228 1,434,228 1,433,220 1,433,220 1,433,220 1,434,228 1,432,564 1,427,956 1,427,396
imp-for             117,241   117,241   117,145   117,145   117,145   117,241   117,145   117,145   117,145
knuth-bendix        186,116   186,116   186,212   186,212   186,212   186,116   186,212   186,212   186,212
lexgen              290,931   290,931   290,819   290,819   290,819   290,931   290,819   290,819   290,819
life                141,113   141,113   141,065   141,065   141,065   141,113   141,065   141,065   141,065
logic               197,417   197,417   197,273   197,273   197,273   197,417   197,273   197,273   197,273
mandelbrot          117,273   117,273   117,177   117,177   117,177   117,273   117,177   117,177   117,177
matrix-multiply     119,577   119,577   119,417   119,417   119,417   119,577   119,417   119,417   119,417
md5                 144,676   144,676   144,500   144,500   144,500   144,676   144,500   144,500   144,500
merge               118,953   118,953   118,857   118,857   118,857   118,953   118,857   118,857   118,857
mlyacc              643,555   643,555   643,651   643,651   643,651   643,555   643,651   643,651   643,475
model-elimination   796,054   796,054   793,958   793,798   793,798   796,054   794,166   792,246   792,246
mpuz                123,545   123,545   123,481   123,481   123,481   123,545   123,481   123,481   123,481
nucleic             297,249   297,249   297,233   297,233   297,233   297,249   297,233   297,233   297,233
output1             151,768   151,768   149,848   149,848   149,848   151,768   149,848   149,848   149,848
peek                150,164   150,164   150,132   150,132   150,132   150,164   150,132   150,132   150,132
psdes-random        121,545   121,545   121,401   121,401   121,401   121,545   121,401   121,401   121,401
ratio-regions       144,137   144,137   144,169   144,169   144,169   144,137   144,169   144,169   144,169
ray                 250,058   250,058   250,218   250,218   250,218   250,058   249,818   249,066   249,066
raytrace            368,988   368,988   368,108   368,108   368,108   368,956   367,868   367,468   367,468
simple              345,205   345,205   345,381   345,381   345,381   329,557   329,557   329,317   329,317
smith-normal-form   279,837   279,837   279,645   279,645   279,645   279,837   279,341   279,341   279,341
string-concat       119,129   119,129   119,033   119,033   119,033   119,209   119,033   119,033   119,033
tailfib             117,273   117,273   117,177   117,177   117,177   117,273   117,177   117,177   117,177
tak                 117,449   117,449   117,321   117,321   117,321   117,449   117,321   117,321   117,321
tensor              179,292   179,292   176,908   176,908   176,908   179,292   176,908   176,908   176,908
tsp                 158,860   158,860   158,668   158,668   158,668   158,860   158,668   158,668   158,668
tyan                223,588   223,588   223,044   223,044   223,044   223,588   223,044   223,044   223,044
vector-rev          118,105   118,105   118,009   118,009   118,009   118,153   117,977   117,977   117,977
vector32-concat     118,297   118,297   118,201   118,201   118,201   118,329   118,185   118,185   118,185
vector64-concat     118,329   118,329   118,169   118,169   118,169   118,329   118,217   118,217   118,217
vliw                505,509   505,509   503,013   503,013   503,013   505,637   500,917   497,957   497,957
wc-input1           179,051   179,051   178,923   178,923   178,923   179,051   178,923   178,923   178,923
wc-scanStream       188,155   188,155   188,027   188,027   188,027   188,155   188,027   188,027   188,027
zebra               225,364   225,364   225,220   225,220   225,220   225,364   225,220   225,220   225,220
zern                153,241   153,241   152,521   152,521   152,521   153,241   152,585   152,585   152,585
compile time
benchmark         MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8
DLXSimulator        3.19   3.14   3.45   3.51   3.02   3.27   3.18   3.45   3.24
barnes-hut          2.93   2.94   2.96   2.97   3.06   2.92   2.96   2.93   2.96
boyer               3.36   3.49   3.48   3.32   3.42   3.40   3.52   3.53   3.49
checksum            2.53   2.56   2.47   2.56   2.57   2.46   2.56   2.54   2.62
count-graphs        2.68   2.69   2.71   2.80   2.80   2.79   2.67   2.69   2.76
even-odd            2.45   2.56   2.46   2.56   2.48   2.50   2.48   2.55   2.57
fft                 2.64   2.57   2.62   2.66   2.35   2.68   2.60   2.54   2.62
fib                 2.46   2.46   2.58   2.48   2.55   2.45   2.55   2.54   2.46
flat-array          2.52   2.56   2.55   2.56   2.55   2.54   2.53   2.56   2.55
hamlet             15.27  15.74  14.48  14.35  14.48  15.63  15.21  14.88  15.04
imp-for             2.48   2.56   2.52   2.55   2.55   2.34   2.42   2.48   2.54
knuth-bendix        2.89   2.90   2.90   3.01   3.02   2.93   3.01   3.13   2.99
lexgen              3.40   3.88   3.76   3.82   3.74   3.62   3.50   3.81   3.71
life                2.66   2.71   2.70   2.74   2.72   2.64   2.74   2.68   2.60
logic               3.06   3.05   3.11   3.12   3.00   3.13   3.14   2.92   2.83
mandelbrot          2.51   2.54   2.53   2.58   2.57   2.53   2.56   2.55   2.45
matrix-multiply     2.48   2.50   2.52   2.49   2.57   2.60   2.60   2.57   2.53
md5                 2.65   2.68   2.78   2.58   2.78   2.80   2.58   2.80   2.69
merge               2.47   2.49   2.58   2.57   2.55   2.49   2.57   2.50   2.52
mlyacc              7.85   7.94   8.00   7.98   7.90   7.69   8.05   8.09   7.54
model-elimination   7.08   7.98   7.62   7.11   7.40   8.24   7.84   7.96   8.15
mpuz                2.34   2.61   2.60   2.56   2.53   2.62   2.42   2.53   2.52
nucleic             4.06   4.17   4.07   4.05   4.08   4.06   4.28   4.33   4.14
output1             2.69   2.68   2.52   2.58   2.77   2.79   2.78   2.57   2.77
peek                2.74   2.78   2.58   2.80   2.80   2.73   2.70   2.68   2.72
psdes-random        2.56   2.53   2.49   2.48   2.50   2.48   2.53   2.64   2.50
ratio-regions       2.80   2.82   2.62   2.88   2.86   2.89   2.81   2.80   2.79
ray                 3.34   3.62   3.69   3.47   3.45   3.60   3.48   3.37   3.40
raytrace            4.55   4.88   4.47   4.32   4.64   4.30   4.48   4.43   4.50
simple              4.01   4.07   4.04   4.00   3.91   3.83   3.77   3.81   3.74
smith-normal-form   3.75   3.60   3.82   3.58   3.40   3.79   3.60   3.58   3.58
string-concat       2.46   2.66   2.54   2.45   2.57   2.56   2.46   2.51   2.49
tailfib             2.44   2.54   2.53   2.57   2.50   2.57   2.57   2.36   2.54
tak                 2.45   2.57   2.56   2.63   2.47   2.44   2.43   2.44   2.52
tensor              3.05   3.16   3.07   3.15   3.10   3.32   3.18   3.16   3.13
tsp                 2.81   2.79   2.57   2.74   2.76   2.75   2.84   2.84   2.82
tyan                3.27   3.06   3.38   3.26   3.35   3.23   3.22   3.22   3.24
vector-rev          2.49   2.57   2.56   2.33   2.61   2.40   2.52   2.49   2.54
vector32-concat     2.53   2.49   2.52   2.47   2.55   2.63   2.49   2.49   2.51
vector64-concat     2.48   2.46   2.50   2.52   2.48   2.54   2.46   2.47   2.48
vliw                5.63   5.63   5.64   6.14   5.43   6.26   5.78   6.17   6.16
wc-input1           3.06   2.96   2.95   2.98   2.91   2.87   2.90   2.97   2.96
wc-scanStream       3.01   2.97   3.02   2.95   3.06   3.04   3.03   3.01   3.09
zebra               3.30   3.37   3.36   3.26   3.34   3.51   3.31   3.26   3.34
zern                2.69   2.74   2.62   2.64   2.72   2.74   2.73   2.43   2.70
run time
benchmark         MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8
DLXSimulator       32.67  32.69  33.65  32.71  32.53  32.50  32.63  32.84  32.92
barnes-hut         28.33  28.63  28.64  28.85  28.59  28.50  28.71  28.78  28.61
boyer              55.98  56.01  56.81  56.75  56.82  56.08  56.29  56.72  56.76
checksum           25.35  25.39  28.74  28.79  28.64  25.39  28.42  28.68  28.87
count-graphs       39.96  39.98  39.51  40.23  39.38  40.13  40.16  39.92  39.75
even-odd           39.09  39.07  39.09  39.06  39.08  39.14  39.10  39.09  39.09
fft                30.56  30.69  30.75  31.05  31.20  30.98  30.86  30.62  30.77
fib                17.70  17.75  17.77  17.77  17.83  17.77  17.74  17.67  17.77
flat-array         23.60  23.54  23.65  23.70  23.82  28.18  28.09  28.02  28.01
hamlet             39.69  39.77  40.32  40.20  40.20  39.72  38.00  38.70  38.68
imp-for            24.40  24.44  25.78  25.74  25.79  24.51  25.73  25.98  25.84
knuth-bendix       34.10  34.09  35.05  34.97  35.03  34.16  35.02  35.11  35.07
lexgen             34.22  33.77  35.88  33.88  35.27  35.64  35.09  34.93  36.52
life               38.70  38.80  38.76  38.76  38.89  38.90  38.91  38.74  38.84
logic              35.01  34.93  34.44  34.82  34.82  35.06  34.47  34.77  34.93
mandelbrot         35.78  35.77  35.79  35.80  35.81  35.82  35.79  35.78  35.81
matrix-multiply    29.74  29.80  29.82  29.89  29.79  29.72  29.84  29.90  29.86
md5                28.38  28.39  28.04  28.05  28.01  28.42  28.12  28.14  27.98
merge              32.55  32.45  32.41  32.44  32.59  32.48  32.41  32.56  32.40
mlyacc             32.76  33.01  32.94  33.14  33.23  32.83  32.52  32.49  33.11
model-elimination  38.05  38.26  38.89  38.72  38.80  38.09  39.11  38.97  39.36
mpuz               29.94  29.86  29.88  29.91  29.89  29.91  29.89  29.86  29.90
nucleic            33.73  33.91  33.60  33.68  33.35  33.74  33.40  33.47  33.58
output1            30.01  30.01  29.99  30.01  30.02  30.06  29.99  30.10  29.91
peek               33.58  33.60  34.75  34.63  34.78  33.61  34.79  34.78  34.86
psdes-random       33.84  33.91  33.83  33.86  33.91  33.88  34.16  33.91  33.91
ratio-regions      49.08  48.63  48.33  48.33  49.40  49.45  48.70  48.95  49.45
ray                37.55  38.58  37.18  37.06  37.80  37.04  37.19  37.08  36.65
raytrace           34.20  35.34  34.35  34.94  35.23  34.00  34.08  34.61  34.52
simple             29.73  29.25  29.47  29.64  29.51  28.96  28.84  28.57  28.86
smith-normal-form  39.88  39.79  40.38  39.56  39.99  39.92  39.93  39.85  39.88
string-concat      91.43  91.56  91.68  91.40  91.65  91.58  91.75  91.50  91.36
tailfib            38.06  38.06  37.93  37.85  37.90  38.12  38.00  37.98  37.87
tak                30.73  30.77  31.96  31.60  33.85  30.94  31.41  33.22  30.87
tensor             39.62  39.59  39.67  39.70  39.70  39.54  39.71  39.76  39.67
tsp                37.88  37.80  37.84  37.79  37.75  37.62  37.86  37.99  37.82
tyan               30.48  30.45  30.55  30.86  30.79  30.78  30.74  30.85  30.79
vector-rev         27.00  26.60  26.61  26.44  26.55  26.37  26.68  26.53  26.39
vector32-concat    82.46  82.52  82.56  82.72  82.43  82.56  82.37  82.30  82.22
vector64-concat    91.66  91.86  91.57  91.77  91.63  91.13  91.54  91.42  91.13
vliw               28.91  28.43  28.94  29.05  28.90  28.82  28.29  28.27  28.44
wc-input1          43.95  43.81  43.87  43.95  43.86  43.93  43.95  43.89  43.79
wc-scanStream      21.63  21.64  22.96  23.03  23.02  21.65  22.51  22.89  23.17
zebra              30.39  30.46  30.32  30.29  30.28  30.37  30.37  30.36  30.39
zern               32.37  33.09  33.05  32.00  32.71  31.99  32.04  32.83  31.96
MatthewFluet added a commit that referenced this issue May 31, 2019
Bounce variables around forced stack allocations in RSSA loops

See #218.

In the translation from RSSA to Machine, each RSSA variable is
assigned a location: either a stack slot or a register (i.e., local).
An RSSA variable is assigned to a stack slot if it is live across a
non-tail call or a `mayGC` ccall.  Assigning an RSSA variable to a
stack slot has a potential cost, because stack-slot accesses appear as
memory reads and writes to the codegens and are not easily realized by
hardware registers.  This cost is multiplied when the stack-slot
accesses are within a loop.  For example, consider the tail recursive
`revapp`:

``` standard-ml
fun revapp (l, acc) =
  case l of
     [] => acc
   | h::t => revapp (t, h::acc)
```

This will be realized as an intraprocedural loop; however, due to the
`h::acc` allocation in the loop, there will be a potential GC that
forces `l` and `acc` to be assigned to stack slots (assuming the GC
check occurs at the loop header).  It would seem to be more efficient
to assign `l` and `acc` to registers, moving them to and from stack
slots at the GC, because a GC should occur on only a small fraction of
the loop iterations.

A new `BounceVars` RSSA pass attempts to split the live ranges of RSSA
variables that are used within loops so that the within-loop instances
of the variables are assigned registers (possibly being moved to and
from stack slots around a GC).

Unfortunately, it has not been easy to find good heuristics.  The
current defaults are biased towards not degrading performance.

## Benchmark results (cadmium)

Specs:
* 4 x AMD Opteron(tm) Processor 6172 (2.1GHz; 48 physical cores)
* Ubuntu 16.04.6 LTS
* gcc: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11)
* gcc-8: gcc version 8.3.0 (Homebrew GCC 8.3.0)
* llvm: LLVM version 8.0.0

``` text
config command                                                               
C00    /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen amd64      
C01    /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c          
C02    /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c -cc gcc-8
C03    /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen llvm       
C04    /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen amd64      
C05    /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c          
C06    /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c -cc gcc-8
C07    /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen llvm       
```

### Run-Time Ratio

``` text
program           `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut           1.050     1.004     0.9939    0.9724
boyer                1.020     0.9547    0.9826    0.9940
checksum             1.007     1.034     0.9795    1.003 
count-graphs         0.9735    1.008     0.9826    0.9722
DLXSimulator         0.9397    0.9566    0.9553    0.8925
even-odd             0.9185    1.019     1.000     0.9754
fft                  1.061     1.041     1.039     1.021 
fib                  1.013     1.046     1.040     0.9884
flat-array           1.064     0.9138    0.9901    0.9247
hamlet               0.9442    0.9713    0.9664    0.9955
imp-for              1.040     0.9223    0.8787    1.005 
knuth-bendix         0.9967    1.007     0.9924    0.9744
lexgen               1.005     1.025     0.9892    0.9574
life                 0.9805    0.9813    0.9779    1.012 
logic                1.062     0.9721    1.002     0.9641
mandelbrot           0.9995    1.003     1.005     1.000 
matrix-multiply      0.9424    1.028     1.040     0.9958
md5                  0.9946    0.9913    0.9929    1.053 
merge                0.9942    0.9397    0.9797    0.9558
mlyacc               0.9956    0.9558    1.021     1.026 
model-elimination    0.9816    0.9842    1.000     0.9769
mpuz                 0.6748    0.8899    0.9354    1.084 
nucleic              0.9955    0.9901    1.008     0.9746
output1              0.9880    0.8836    0.8840    0.9090
peek                 1.010     1.075     0.9669    1.000 
pidigits             0.9778    1.055     0.9936    1.018 
psdes-random         0.9905    0.9943    1.054     1.000 
ratio-regions        1.030     1.071     0.9943    0.9448
ray                  1.016     0.9944    0.9993    1.030 
raytrace             1.024     1.009     0.9994    1.017 
simple               0.9900    1.017     1.019     1.032 
smith-normal-form    1.015     1.017     1.080     1.061 
string-concat        0.9619    1.004     0.9863    1.173 
tailfib              0.9867    1.056     1.001     0.9989
tailmerge            1.005     1.002     0.9844    0.9697
tak                  0.9886    1.025     0.9999    1.013 
tensor               1.015     0.9728    1.006     0.9927
tsp                  1.028     0.9878    0.9956    0.9745
tyan                 1.029     1.011     1.012     0.9404
vector32-concat      0.7653    0.6395    0.8281    1.003 
vector64-concat      0.8517    0.7782    0.8571    0.9619
vector-rev           0.9514    0.9847    0.8541    0.9161
vliw                 1.028     0.9368    1.017     1.011 
wc-input1            0.9170    0.9157    0.9084    0.9240
wc-scanStream        0.9929    1.007     1.146     1.038 
zebra                0.9984    1.008     1.008     0.9941
zern                 1.026     0.9929    1.005     0.9942
MIN                  0.6748    0.6395    0.8281    0.8925
GMEAN                0.9811    0.9771    0.9846    0.9912
MAX                  1.064     1.075     1.146     1.173 
```

Notes:
* Overall, there is less run-time improvements than were hoped,
  although there are some "big" wins and no "big" losses.
* The LLVM codegen seems to benefit the least; it may be that LLVM is
  able to realizes some stack-slot accesses as hardware registers
  (although it is unclear how it justifies doing so --- LLVM should
  not be able to "see" that the access of stack slots and heap objects
  do not alias).
* The speedups with `vector32-concat` and `vector64-concat` may be
  related to the observation made in #288, where it was
  observed that moving a sequence from a register to a global had a
  significant slowdown (since the access through the global required
  an additional level of indirection).  In this case, a sequence may
  be moving from a stack slot to a register, eliminating a level of
  indirection.

### Compile-Time Ratio

``` text
program           `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut           0.9721    0.9760    0.9171    0.9393
boyer                0.9054    0.9675    0.9988    1.024 
checksum             0.9636    0.9630    0.9753    1.016 
count-graphs         0.9822    1.002     0.9862    0.9151
DLXSimulator         0.9205    0.9248    0.9340    0.9704
even-odd             0.9808    0.9600    0.9705    1.015 
fft                  0.9938    0.9644    0.9817    0.9656
fib                  0.9670    0.9777    0.9750    0.9473
flat-array           0.9781    0.9758    1.013     0.9805
hamlet               0.9942    1.104     1.029     1.000 
imp-for              0.9380    0.9764    0.9429    0.9979
knuth-bendix         0.9409    0.9584    0.8814    0.8520
lexgen               0.9317    0.8911    0.8933    0.9525
life                 0.9598    0.9610    0.9812    0.9046
logic                0.9704    0.9620    0.9474    0.9856
mandelbrot           0.9370    0.9926    1.011     0.9694
matrix-multiply      0.9510    0.9425    0.9634    0.9497
md5                  0.9568    0.9609    0.9473    0.9282
merge                0.9896    0.9517    0.9780    0.9987
mlyacc               1.124     1.076     1.161     1.150 
model-elimination    0.9912    0.9493    1.093     0.9879
mpuz                 0.9568    0.9690    1.006     0.9763
nucleic              0.9010    0.9169    0.9384    1.079 
output1              0.9668    0.9896    0.9598    0.9086
peek                 0.9713    0.9736    0.9840    0.9626
pidigits             0.9720    0.9686    0.9494    0.9379
psdes-random         0.9761    1.004     1.002     0.9435
ratio-regions        0.9844    0.9421    0.9679    0.9262
ray                  0.9605    0.8728    0.8533    0.8406
raytrace             0.9130    0.9501    0.8719    1.029 
simple               0.8696    0.9130    0.9537    0.9602
smith-normal-form    0.9000    0.9400    0.9632    1.167 
string-concat        0.9667    0.9934    0.9466    1.000 
tailfib              0.9807    0.9598    0.9779    0.9713
tailmerge            0.9848    0.9871    0.9878    0.9870
tak                  0.9720    1.0000    0.9728    0.9968
tensor               0.9739    0.9348    0.9057    0.8996
tsp                  0.9955    1.001     0.9283    0.9518
tyan                 0.9734    0.9536    0.9052    0.9488
vector32-concat      0.9527    0.9997    0.9673    0.9471
vector64-concat      0.9617    0.9722    0.9921    0.9885
vector-rev           0.9809    0.9846    0.9625    0.9955
vliw                 1.020     1.004     1.111     1.086 
wc-input1            0.9345    1.002     0.9502    0.9549
wc-scanStream        0.9890    0.9830    1.027     1.028 
zebra                0.9712    0.9204    1.101     0.9702
zern                 0.9416    1.013     1.098     0.8998
MIN                  0.8696    0.8728    0.8533    0.8406
GMEAN                0.9635    0.9691    0.9740    0.9727
MAX                  1.124     1.104     1.161     1.167 
```

Notes:
* One concern with assigning more RSSA variables to registers is that
  increases the burden on the codegen (hardware) register allocator.
* With the exception of `mlyacc`, there is a general improvement in
  compile time.  Note that configs C04, C05, C06, and C07 are using a
  MLton compiled with `BounceVars` to compile programs with
  `BounceVars`.  It isn't clear how much of the benefit is due to the
  effect of `BounceVars` on the compiled MLton and how much is due to
  the effect of `BounceVars` on the IR of the compiled programs.  In
  any case, it seems that `BounceVars` "pays for itself", although the
  greatest improvements are not on the largest benchmark programs
  (`hamlet`, `mlyacc`, `model-elimination`, `nucleic`, `vliw`).

### Executable-Size Ratio

``` text
program           `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut            1.005    1.005     1.009     0.9982
boyer                 1.002    0.9940    0.9967    0.9972
checksum              1.005    1.001     1.001     1.000 
count-graphs          1.007    1.004     1.005     1.001 
DLXSimulator          1.015    1.005     1.006     1.007 
even-odd              1.005    1.001     0.9999    1.0000
fft                   1.009    1.006     1.004     1.003 
fib                   1.005    1.001     0.9998    1.000 
flat-array            1.005    1.001     1.001     0.9993
hamlet                1.002    1.003     1.004     0.9999
imp-for               1.005    1.001     1.001     0.9991
knuth-bendix          1.004    0.9990    1.003     1.001 
lexgen                1.025    1.001     1.002     1.004 
life                  1.003    0.9999    1.000     0.9999
logic                 1.004    1.003     1.001     1.002 
mandelbrot            1.005    1.001     1.001     0.9994
matrix-multiply       1.005    1.000     1.002     0.9993
md5                   1.005    1.002     1.002     1.007 
merge                 1.004    1.000     0.9994    0.9995
mlyacc                1.176    1.154     1.183     1.180 
model-elimination     1.006    1.009     1.011     1.003 
mpuz                  1.006    0.9998    1.001     1.001 
nucleic               1.000    0.9996    0.9998    0.9985
output1               1.006    1.002     1.002     1.006 
peek                  1.005    1.001     1.001     0.9996
pidigits              1.003    0.9998    1.001     1.005 
psdes-random          1.005    1.001     1.002     0.9976
ratio-regions         1.020    1.008     1.007     1.009 
ray                   1.023    1.021     1.018     1.013 
raytrace              1.008    1.003     1.003     1.011 
simple                1.009    1.033     1.034     1.055 
smith-normal-form     1.005    1.005     1.006     1.003 
string-concat         1.004    0.9997    1.002     0.9980
tailfib               1.005    1.000     1.000     0.9998
tak                   1.005    1.001     0.9997    0.9999
tensor                1.019    1.019     1.018     1.020 
tsp                   1.005    1.002     1.004     1.007 
tyan                  1.034    1.026     1.025     1.030 
vector32-concat       1.005    1.000     1.001     0.9984
vector64-concat       1.005    1.000     1.001     0.9982
vector-rev            1.005    1.001     1.001     0.9988
vliw                  1.033    1.010     1.012     1.027 
wc-input1             1.012    1.009     1.008     1.010 
wc-scanStream         1.011    1.006     1.007     1.007 
zebra                 1.004    1.001     1.002     1.005 
zern                  1.006    1.003     1.001     1.002 
MIN                   1.000    0.9940    0.9967    0.9972
GMEAN                 1.012    1.007     1.008     1.008 
MAX                   1.176    1.154     1.183     1.180 
```

Notes:
* Consistent with the increase in compile-time for `mlyacc`, there is
  an increase in code size for `mlyacc`.
* Otherwise, there is little change in code size.


## Benchmark results (sulfur)

Specs:
* 2 x Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz (8 physical cores; 16 logical cores)
* Ubuntu 16.04.6 LTS
* gcc: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11)
* gcc-8: gcc version 8.3.0 (Homebrew GCC 8.3.0)
* llvm: LLVM version 8.0.0

``` text
config command                                                               
C00    /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen amd64      
C01    /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c          
C02    /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen c -cc gcc-8
C03    /home/mtf/devel/mlton/builds/g9ba73ad1e/bin/mlton -codegen llvm       
C04    /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen amd64      
C05    /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c          
C06    /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen c -cc gcc-8
C07    /home/mtf/devel/mlton/builds/ga353d7851/bin/mlton -codegen llvm       
```

### Run-Time Ratio

``` text
program           `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut           1.066     1.004     0.9963    0.8840
boyer                0.9890    0.9917    0.9902    0.9567
checksum             1.065     1.001     0.9935    0.9681
count-graphs         0.9580    0.9805    0.9665    0.9672
DLXSimulator         0.7266    0.7235    0.7374    0.7496
even-odd             1.009     0.9903    0.9873    0.9845
fft                  0.9930    0.9983    0.9710    0.9682
fib                  1.108     1.023     1.0000    0.9724
flat-array           0.9638    0.9581    0.9604    0.9620
hamlet               0.8991    0.9415    0.9295    0.9544
imp-for              0.9688    0.9359    0.9240    1.123 
knuth-bendix         1.005     0.9915    1.025     1.005 
lexgen               1.001     0.9954    0.9583    0.9974
life                 0.9659    1.003     0.9652    0.9923
logic                0.9887    0.9938    0.9702    0.9919
mandelbrot           0.9989    1.015     1.024     1.008 
matrix-multiply      1.017     0.9927    1.014     0.9881
md5                  0.9846    1.006     0.9868    1.026 
merge                1.071     1.077     1.039     1.074 
mlyacc               0.9994    0.9986    1.012     1.011 
model-elimination    0.9850    0.9656    0.9885    0.9879
mpuz                 0.7695    0.8649    0.9525    1.018 
nucleic              0.9267    0.9614    0.9359    0.9804
output1              0.9453    0.8869    0.8673    0.8782
peek                 0.9988    1.000     1.011     0.9938
pidigits             0.9734    0.9961    1.003     0.9986
psdes-random         0.9902    0.9998    1.016     0.9956
ratio-regions        1.107     0.9887    0.9315    0.9852
ray                  0.9905    0.9674    0.9930    0.9931
raytrace             1.008     1.032     1.005     1.057 
simple               0.9659    0.9692    0.9842    0.9648
smith-normal-form    0.9858    0.9876    0.9712    0.9615
string-concat        0.9901    1.001     1.007     0.9731
tailfib              0.9968    1.015     1.000     0.9936
tailmerge            0.9893    0.9932    0.9810    0.8081
tak                  0.9724    1.058     0.9762    0.9693
tensor               1.005     0.9982    0.9914    1.006 
tsp                  1.030     1.049     1.037     1.032 
tyan                 1.007     0.9887    0.9850    1.002 
vector32-concat      0.3328    0.3212    0.6566    1.029 
vector64-concat      0.3983    0.3806    0.7111    0.9922
vector-rev           0.9282    1.032     0.9473    0.9013
vliw                 0.9068    0.8996    0.8662    0.9419
wc-input1            1.396     0.9463    1.120     1.348 
wc-scanStream        0.9928    1.015     1.117     0.9642
zebra                0.9957    0.9966    0.9974    0.9967
zern                 0.9958    1.013     1.000     0.9743
MIN                  0.3328    0.3212    0.6566    0.7496
GMEAN                0.9468    0.9393    0.9640    0.9827
MAX                  1.396     1.077     1.120     1.348 
```

Notes:
* Again, there are some "big" wins and no "big" losses.
* Interestingly, compared to `cadmium`, `vector32-concat` and
  `vector64-concat` see greater speedups, though `mpuz` sees a smaller
  speedup.

### Compile-Time Ratio

``` text
program           `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut           0.9401    0.9632    1.013     1.006 
boyer                0.9659    1.000     0.9725    0.9629
checksum             1.022     0.9911    0.9966    0.9967
count-graphs         1.005     1.042     1.004     1.011 
DLXSimulator         0.9910    0.9526    0.9833    0.9914
even-odd             0.9654    0.9591    1.009     0.9795
fft                  0.9841    0.9713    0.9885    0.9923
fib                  0.9417    0.9946    1.000     0.9976
flat-array           1.002     1.049     0.9716    1.012 
hamlet               1.103     1.018     0.9917    0.9781
imp-for              0.9586    1.041     0.9824    1.032 
knuth-bendix         0.9690    1.007     1.034     0.9347
lexgen               1.049     0.9718    0.9930    1.029 
life                 0.9490    0.9798    1.023     1.018 
logic                0.9936    0.9303    0.9768    0.9715
mandelbrot           0.9959    0.9518    0.9339    0.9740
matrix-multiply      1.117     1.060     1.013     0.9176
md5                  1.007     0.9454    0.9904    0.9973
merge                0.9922    0.9493    0.9793    1.015 
mlyacc               1.113     1.125     1.136     1.143 
model-elimination    0.9702    1.030     1.025     1.017 
mpuz                 1.015     0.9249    1.053     0.9401
nucleic              0.9813    1.007     0.9919    1.043 
output1              0.9285    0.9746    0.9603    1.049 
peek                 0.9547    0.9915    0.9410    0.9543
pidigits             0.9569    0.9757    0.9745    0.9973
psdes-random         0.9687    0.9835    1.006     0.9596
ratio-regions        0.9640    1.021     1.019     1.036 
ray                  1.079     0.9156    0.9577    0.9714
raytrace             1.008     0.9028    1.010     1.052 
simple               1.025     1.154     0.9570    1.068 
smith-normal-form    0.9611    0.9399    0.9215    0.9196
string-concat        1.023     0.9968    0.9789    1.072 
tailfib              0.9869    0.9748    0.9711    1.001 
tailmerge            0.9812    0.9766    0.9731    0.9051
tak                  0.9694    0.9863    0.9719    0.9681
tensor               0.9759    1.061     0.9847    1.030 
tsp                  0.9818    1.094     0.9497    1.066 
tyan                 1.018     1.004     0.9927    1.011 
vector32-concat      0.9783    0.9755    0.9766    0.9792
vector64-concat      0.9820    0.9876    0.9830    0.9607
vector-rev           0.9648    0.9215    1.066     1.002 
vliw                 1.075     1.045     1.035     1.052 
wc-input1            1.012     1.025     1.005     0.9931
wc-scanStream        0.9747    0.9883    0.9940    0.9761
zebra                0.9838    0.9624    0.9758    1.003 
zern                 0.8921    0.9604    0.9794    0.9880
MIN                  0.8921    0.9028    0.9215    0.9051
GMEAN                0.9921    0.9920    0.9919    0.9985
MAX                  1.117     1.154     1.136     1.143 
```

Notes:
* Compared to `cadmium`, there is less general improvement in compile
  time, though (with the exception of `mlyacc`), no "across the board"
  slowdowns.

### Executable-Size Ratio

``` text
program           `C04/C00` `C05/C01` `C06/C02` `C07/C03`
barnes-hut            1.005    1.005     1.009     0.9983
boyer                 1.002    0.9940    0.9967    0.9972
checksum              1.005    1.001     1.001     1.000 
count-graphs          1.007    1.004     1.005     1.001 
DLXSimulator          1.015    1.005     1.006     1.008 
even-odd              1.005    1.001     0.9999    0.9999
fft                   1.009    1.006     1.004     1.003 
fib                   1.005    1.001     0.9998    1.000 
flat-array            1.005    1.001     1.001     0.9995
hamlet                1.002    1.003     1.004     0.9999
imp-for               1.005    1.001     1.001     0.9991
knuth-bendix          1.004    0.9989    1.003     1.000 
lexgen                1.025    1.001     1.002     1.004 
life                  1.003    0.9998    1.000     0.9997
logic                 1.004    1.003     1.001     0.9997
mandelbrot            1.005    1.001     1.001     0.9994
matrix-multiply       1.005    1.000     1.002     0.9996
md5                   1.005    1.002     1.002     1.007 
merge                 1.004    1.000     0.9993    0.9994
mlyacc                1.176    1.154     1.183     1.179 
model-elimination     1.006    1.009     1.011     1.003 
mpuz                  1.006    0.9998    1.001     1.001 
nucleic               1.000    0.9996    0.9998    0.9985
output1               1.006    1.002     1.002     1.006 
peek                  1.005    1.001     1.001     0.9994
pidigits              1.003    0.9998    1.001     1.005 
psdes-random          1.005    1.001     1.002     0.9978
ratio-regions         1.020    1.008     1.007     1.009 
ray                   1.023    1.021     1.018     1.013 
raytrace              1.008    1.003     1.003     1.013 
simple                1.009    1.033     1.034     1.055 
smith-normal-form     1.005    1.005     1.006     1.003 
string-concat         1.004    0.9995    1.002     0.9980
tailfib               1.005    1.000     1.000     0.9998
tailmerge             1.004    0.9995    1.001     0.9998
tak                   1.005    1.001     0.9997    1.000 
tensor                1.019    1.019     1.018     1.019 
tsp                   1.005    1.002     1.004     1.007 
tyan                  1.034    1.026     1.025     1.031 
vector32-concat       1.005    1.000     1.001     0.9984
vector64-concat       1.005    1.000     1.001     0.9982
vector-rev            1.005    1.001     1.001     0.9987
vliw                  1.033    1.010     1.012     1.026 
wc-input1             1.012    1.009     1.008     1.010 
wc-scanStream         1.011    1.006     1.007     1.007 
zebra                 1.004    1.001     1.002     1.005 
zern                  1.006    1.003     1.001     1.002 
MIN                   1.000    0.9940    0.9967    0.9972
GMEAN                 1.011    1.007     1.008     1.008 
MAX                   1.176    1.154     1.183     1.179 
```

Notes:
* As they should be (same versions of MLton, gcc, llvm), executable
  sizes are identical with `cadmium`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant