
Compress the MCTS tree #36

Closed
kblomdahl opened this issue Nov 19, 2018 · 2 comments
@kblomdahl
Owner

Today a single instance of a Node is 10,688 bytes. For an average search tree, which contains about 1,600 nodes, that is roughly 16 MB. This is excessive, and it also causes performance concerns during deallocation:

  • When freeing a node we iterate over each of its children, and each node has 362 of them. Most of these checks are unnecessary.
  • free() is fairly expensive, and we bulk-free entire sub-trees at a time.

To avoid all of these issues I suggest that we change Node to a sparse representation, such that:

  • We store a lot of the f32 as f16.
  • We store some of the i32 as i16.
  • We store a smaller sparse version of many of the Node fields on the stack, or a full version on the heap.
trait NodeInner { ... }

// f16 here is a half-precision float type (e.g. from a half-float crate);
// Mutex and Color are the project's own types.

/// Sparse representation that tracks at most eight children inline.
struct SmallNodeImpl {
    count: [i32; 8],          // visit counts
    vcount: [i16; 8],         // virtual loss counts
    value: [f16; 8],          // estimated values
    expanding: [bool; 8],     // true while a child is being expanded
    children: [*mut Node; 8], // pointers to the expanded children
    indices: [i32; 8]         // which move each slot refers to
}

impl NodeInner for SmallNodeImpl { ... }

/// Dense representation covering all 362 moves (19x19 + pass); the
/// statistics arrays are padded to 368 elements for alignment.
struct BigNodeImpl {
    count: [i32; 368],
    vcount: [i16; 368],
    value: [f16; 368],
    expanding: [bool; 362],
    children: [*mut Node; 362]
}

impl NodeInner for BigNodeImpl { ... }

enum NodeImpl {
    Small(SmallNodeImpl),
    Big(Box<BigNodeImpl>)
}

pub struct Node {
    lock: Mutex,
    pub color: Color,
    pub pass_count: i32,
    pub total_count: i32,
    pub vtotal_count: i32,
    pub prior: [f16; 368],

    pub inner: NodeImpl
}
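The listing above doesn't spell out when a node switches from Small to Big; presumably the Small variant is promoted once a ninth distinct child is needed. A minimal sketch of that promotion logic, using simplified field types (plain visit counts, no pointers, no lock — the names follow the structs above but everything else is illustrative):

```rust
const SMALL: usize = 8;
const BIG: usize = 362; // 19x19 intersections + pass

struct SmallNode {
    len: usize,              // number of occupied sparse slots
    indices: [usize; SMALL], // which move each slot refers to
    count: [i32; SMALL],     // visit count per slot
}

struct BigNode {
    count: [i32; BIG], // dense visit counts, indexed by move
}

enum NodeImpl {
    Small(SmallNode),
    Big(Box<BigNode>),
}

impl NodeImpl {
    /// Increment the visit count for `index`, promoting Small -> Big
    /// once all eight sparse slots are exhausted.
    fn increment(&mut self, index: usize) {
        match self {
            NodeImpl::Small(small) => {
                // re-use an existing slot if this move was seen before
                for i in 0..small.len {
                    if small.indices[i] == index {
                        small.count[i] += 1;
                        return;
                    }
                }
                if small.len < SMALL {
                    // claim a fresh sparse slot
                    small.indices[small.len] = index;
                    small.count[small.len] = 1;
                    small.len += 1;
                } else {
                    // promote: scatter the sparse slots into a dense array
                    let mut big = Box::new(BigNode { count: [0; BIG] });
                    for i in 0..SMALL {
                        big.count[small.indices[i]] = small.count[i];
                    }
                    big.count[index] += 1;
                    *self = NodeImpl::Big(big);
                }
            }
            NodeImpl::Big(big) => big.count[index] += 1,
        }
    }

    /// Read the visit count for `index` in either representation.
    fn count(&self, index: usize) -> i32 {
        match self {
            NodeImpl::Small(s) => (0..s.len)
                .find(|&i| s.indices[i] == index)
                .map_or(0, |i| s.count[i]),
            NodeImpl::Big(b) => b.count[index],
        }
    }
}
```

The real implementation would also have to move the `vcount`, `value`, `expanding`, and `children` arrays across during promotion, and do so under the node's lock.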

This should reduce the average size to 928 bytes (more than a 10x reduction) and cut the run time of freeing a single Node by about 46x, since we only check 8 child slots instead of 362. We will still need to bulk-free many of these nodes at the same time; to solve that problem I suggest we keep a global object pool.
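A global object pool along these lines could be as simple as a mutex-guarded free list. A hedged sketch, with a hypothetical `PooledNode` standing in for the real `Node` (which would also need its child pointers and statistics reset on recycle):

```rust
use std::sync::Mutex;

// Stand-in for the real Node type; only one field is shown.
struct PooledNode {
    total_count: i32,
}

/// Recycles boxed nodes instead of calling free() on every node of a
/// released sub-tree.
struct NodePool {
    free_list: Mutex<Vec<Box<PooledNode>>>,
}

impl NodePool {
    fn new() -> NodePool {
        NodePool { free_list: Mutex::new(Vec::new()) }
    }

    /// Pop a recycled node if one is available, otherwise allocate.
    fn alloc(&self, total_count: i32) -> Box<PooledNode> {
        match self.free_list.lock().unwrap().pop() {
            Some(mut node) => {
                node.total_count = total_count; // re-initialise recycled node
                node
            }
            None => Box::new(PooledNode { total_count }),
        }
    }

    /// Return a node to the pool instead of dropping it.
    fn free(&self, node: Box<PooledNode>) {
        self.free_list.lock().unwrap().push(node);
    }
}
```

With this design, releasing a sub-tree becomes a walk that pushes each node onto the free list, and the expensive allocator round-trips are amortised across searches.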

@kblomdahl
Owner Author

I pushed a branch, bbc4464, that contains the suggested changes. It seems successful overall, but I am seeing very high CPU utilization, with most of the time spent in LLVM's built-in f16 to f32 conversion routine.

This is odd, since the current processor that I am testing on (Zen+) should support the F16C instruction set. I will need to confirm whether the same behavior is observed on an Intel processor before proceeding.

If LLVM does not make use of the F16C instruction set, then I will need to re-write some of the code in inline assembly to get acceptable performance.
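For reference, the scalar fallback that gets used when F16C is unavailable is roughly equivalent to the following bit-level conversion (a sketch, not the actual compiler-rt routine):

```rust
/// Convert an IEEE 754 half-precision bit pattern to f32 in software.
fn f16_to_f32(bits: u16) -> f32 {
    let sign = ((bits >> 15) as u32) << 31;   // sign bit, moved to f32 position
    let exp = ((bits >> 10) & 0x1f) as u32;   // 5-bit exponent
    let frac = (bits & 0x3ff) as u32;         // 10-bit fraction

    let out = if exp == 0 {
        if frac == 0 {
            sign // signed zero
        } else {
            // subnormal half: normalize by shifting until the implicit
            // bit (0x400) appears, adjusting the exponent as we go
            let mut e: u32 = 127 - 15 + 1;
            let mut f = frac;
            while f & 0x400 == 0 {
                f <<= 1;
                e -= 1;
            }
            sign | (e << 23) | ((f & 0x3ff) << 13)
        }
    } else if exp == 31 {
        // infinity or NaN: max out the f32 exponent, keep the payload
        sign | 0x7f80_0000 | (frac << 13)
    } else {
        // normal number: re-bias the exponent (15 -> 127), widen the fraction
        sign | ((exp + 127 - 15) << 23) | (frac << 13)
    };

    f32::from_bits(out)
}
```

The branching and bit-shuffling above is what makes the software path so much slower than a single `vcvtph2ps` instruction, which is why it dominates the profile when the hardware path is not taken.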

@kblomdahl
Owner Author

As a side effect of the memory savings, we also seem to gain about a 25% performance improvement (with an f32 prior and value).

Original

test mcts::tree::tests::bench_probe_insert           ... bench:   8,024,164 ns/iter (+/- 1,589,956)
test mcts::tree::tests::bench_probe_insert           ... bench:   8,304,612 ns/iter (+/- 1,853,008)
test mcts::tree::tests::bench_probe_insert           ... bench:   8,038,840 ns/iter (+/- 1,758,363)

Small Node

test mcts::tree::tests::bench_probe_insert           ... bench:   6,043,870 ns/iter (+/- 456,064)
test mcts::tree::tests::bench_probe_insert           ... bench:   6,102,277 ns/iter (+/- 189,477)
test mcts::tree::tests::bench_probe_insert           ... bench:   6,062,850 ns/iter (+/- 116,827)
test mcts::tree::tests::bench_probe_insert           ... bench:   6,090,336 ns/iter (+/- 79,252)
