
Compress the MCTS tree #36

Closed
kblomdahl opened this issue Nov 19, 2018 · 2 comments
@kblomdahl
Owner

Today a single instance of a Node is 10,688 bytes. For an average search tree, which contains about 1,600 nodes, that is roughly 16 MB. This is excessive, and it also causes performance concerns during deallocation:

  • When freeing a node we iterate over each of its children, and each node has 362 of them. Most of these checks are unnecessary.
  • free() is fairly expensive, and we bulk-free entire sub-trees at a time.

To avoid all of these issues I suggest that we change Node to a sparse representation, such that:

  • We store a lot of the f32 as f16.
  • We store some of the i32 as i16.
  • We store a smaller sparse version of many of the Node fields on the stack, or a full version on the heap.
trait NodeInner { ... }

// f16 here is a half-precision float type (e.g. from a half-float crate);
// Mutex and Color are the project's own types.

/// Sparse representation that tracks at most eight children inline.
struct SmallNodeImpl {
    count: [i32; 8],          // visit counts
    vcount: [i16; 8],         // virtual loss counts
    value: [f16; 8],          // estimated values
    expanding: [bool; 8],     // true while a child is being expanded
    children: [*mut Node; 8], // pointers to the expanded children
    indices: [i32; 8]         // which move each slot refers to
}

impl NodeInner for SmallNodeImpl { ... }

/// Dense representation covering all 362 moves (19x19 + pass); the
/// statistics arrays are padded to 368 elements for alignment.
struct BigNodeImpl {
    count: [i32; 368],
    vcount: [i16; 368],
    value: [f16; 368],
    expanding: [bool; 362],
    children: [*mut Node; 362]
}

impl NodeInner for BigNodeImpl { ... }

enum NodeImpl {
    Small(SmallNodeImpl),
    Big(Box<BigNodeImpl>)
}

pub struct Node {
    lock: Mutex,
    pub color: Color,
    pub pass_count: i32,
    pub total_count: i32,
    pub vtotal_count: i32,
    pub prior: [f16; 368],

    pub inner: NodeImpl
}
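The listing above doesn't spell out when a node switches from Small to Big; presumably the Small variant is promoted once a ninth distinct child is needed. A minimal sketch of that promotion logic, using simplified field types (plain visit counts, no pointers, no lock — the names follow the structs above but everything else is illustrative):

```rust
const SMALL: usize = 8;
const BIG: usize = 362; // 19x19 intersections + pass

struct SmallNode {
    len: usize,              // number of occupied sparse slots
    indices: [usize; SMALL], // which move each slot refers to
    count: [i32; SMALL],     // visit count per slot
}

struct BigNode {
    count: [i32; BIG], // dense visit counts, indexed by move
}

enum NodeImpl {
    Small(SmallNode),
    Big(Box<BigNode>),
}

impl NodeImpl {
    /// Increment the visit count for `index`, promoting Small -> Big
    /// once all eight sparse slots are exhausted.
    fn increment(&mut self, index: usize) {
        match self {
            NodeImpl::Small(small) => {
                // re-use an existing slot if this move was seen before
                for i in 0..small.len {
                    if small.indices[i] == index {
                        small.count[i] += 1;
                        return;
                    }
                }
                if small.len < SMALL {
                    // claim a fresh sparse slot
                    small.indices[small.len] = index;
                    small.count[small.len] = 1;
                    small.len += 1;
                } else {
                    // promote: scatter the sparse slots into a dense array
                    let mut big = Box::new(BigNode { count: [0; BIG] });
                    for i in 0..SMALL {
                        big.count[small.indices[i]] = small.count[i];
                    }
                    big.count[index] += 1;
                    *self = NodeImpl::Big(big);
                }
            }
            NodeImpl::Big(big) => big.count[index] += 1,
        }
    }

    /// Read the visit count for `index` in either representation.
    fn count(&self, index: usize) -> i32 {
        match self {
            NodeImpl::Small(s) => (0..s.len)
                .find(|&i| s.indices[i] == index)
                .map_or(0, |i| s.count[i]),
            NodeImpl::Big(b) => b.count[index],
        }
    }
}
```

The real implementation would also have to move the `vcount`, `value`, `expanding`, and `children` arrays across during promotion, and do so under the node's lock.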

This should reduce the average size to 928 bytes (more than a 10x reduction) and cut the run time of freeing a single Node by about 46x, since we only check 8 child slots instead of 362. We will still need to bulk-free many of these nodes at the same time; to solve that problem I suggest we keep a global object pool.
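A global object pool along these lines could be as simple as a mutex-guarded free list. A hedged sketch, with a hypothetical `PooledNode` standing in for the real `Node` (which would also need its child pointers and statistics reset on recycle):

```rust
use std::sync::Mutex;

// Stand-in for the real Node type; only one field is shown.
struct PooledNode {
    total_count: i32,
}

/// Recycles boxed nodes instead of calling free() on every node of a
/// released sub-tree.
struct NodePool {
    free_list: Mutex<Vec<Box<PooledNode>>>,
}

impl NodePool {
    fn new() -> NodePool {
        NodePool { free_list: Mutex::new(Vec::new()) }
    }

    /// Pop a recycled node if one is available, otherwise allocate.
    fn alloc(&self, total_count: i32) -> Box<PooledNode> {
        match self.free_list.lock().unwrap().pop() {
            Some(mut node) => {
                node.total_count = total_count; // re-initialise recycled node
                node
            }
            None => Box::new(PooledNode { total_count }),
        }
    }

    /// Return a node to the pool instead of dropping it.
    fn free(&self, node: Box<PooledNode>) {
        self.free_list.lock().unwrap().push(node);
    }
}
```

With this design, releasing a sub-tree becomes a walk that pushes each node onto the free list, and the expensive allocator round-trips are amortised across searches.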

@kblomdahl
Owner Author

I pushed a branch, bbc4464, that contains the suggested changes. It seems successful overall, but I am seeing very high CPU utilization, with most of the time spent in LLVM's built-in f16 to f32 conversion routine.

This is odd, since the current processor that I am testing on (Zen+) should support the F16C instruction set. I will need to confirm whether the same behavior is observed on an Intel processor before proceeding.

If LLVM does not make use of the F16C instruction set, then I will need to re-write some of the code in inline assembly to get acceptable performance.
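For reference, the scalar fallback that gets used when F16C is unavailable is roughly equivalent to the following bit-level conversion (a sketch, not the actual compiler-rt routine):

```rust
/// Convert an IEEE 754 half-precision bit pattern to f32 in software.
fn f16_to_f32(bits: u16) -> f32 {
    let sign = ((bits >> 15) as u32) << 31;   // sign bit, moved to f32 position
    let exp = ((bits >> 10) & 0x1f) as u32;   // 5-bit exponent
    let frac = (bits & 0x3ff) as u32;         // 10-bit fraction

    let out = if exp == 0 {
        if frac == 0 {
            sign // signed zero
        } else {
            // subnormal half: normalize by shifting until the implicit
            // bit (0x400) appears, adjusting the exponent as we go
            let mut e: u32 = 127 - 15 + 1;
            let mut f = frac;
            while f & 0x400 == 0 {
                f <<= 1;
                e -= 1;
            }
            sign | (e << 23) | ((f & 0x3ff) << 13)
        }
    } else if exp == 31 {
        // infinity or NaN: max out the f32 exponent, keep the payload
        sign | 0x7f80_0000 | (frac << 13)
    } else {
        // normal number: re-bias the exponent (15 -> 127), widen the fraction
        sign | ((exp + 127 - 15) << 23) | (frac << 13)
    };

    f32::from_bits(out)
}
```

The branching and bit-shuffling above is what makes the software path so much slower than a single `vcvtph2ps` instruction, which is why it dominates the profile when the hardware path is not taken.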

@kblomdahl
Owner Author

As a side effect of the memory savings, we also seem to gain about a 25% performance improvement (with an f32 prior and value).

Original

test mcts::tree::tests::bench_probe_insert           ... bench:   8,024,164 ns/iter (+/- 1,589,956)
test mcts::tree::tests::bench_probe_insert           ... bench:   8,304,612 ns/iter (+/- 1,853,008)
test mcts::tree::tests::bench_probe_insert           ... bench:   8,038,840 ns/iter (+/- 1,758,363)

Small Node

test mcts::tree::tests::bench_probe_insert           ... bench:   6,043,870 ns/iter (+/- 456,064)
test mcts::tree::tests::bench_probe_insert           ... bench:   6,102,277 ns/iter (+/- 189,477)
test mcts::tree::tests::bench_probe_insert           ... bench:   6,062,850 ns/iter (+/- 116,827)
test mcts::tree::tests::bench_probe_insert           ... bench:   6,090,336 ns/iter (+/- 79,252)
