### Monte-Carlo Tree Search with Neural Networks applied to play Checkers

#### Jair Taylor

Inspired in part by DeepMind's success with AlphaGo, we have written code that learns to play the game of Checkers using a somewhat similar methodology.  The algorithm uses a version of Monte-Carlo Tree Search (MCTS) to estimate the proportion of wins that a good player should get from a given board state, and then trains a neural net to learn these probabilities.

TODO:

1.  Create framework for evaluating strength of play. e.g., given an algorithm, what is its strength in terms of budget for a tree?
2.  Implement human-playable games.
3.  Create more general structure for DNN training.

In [1]:
from montecarlo_lib import *  # Here we have implemented the Monte-Carlo Tree Search algorithm 
                              # as well as defining the rules of Checkers.

import matplotlib.pyplot as plt

from ipywidgets import interact, IntSlider

from keras.models import Sequential
from keras.layers import Dense

%matplotlib inline

from pylab import rcParams

Using TensorFlow backend.


RuntimeError: module compiled against API version 0xb but this version of numpy is 0xa

We begin by using MCTS to generate a large number of games of checkers.  No human-played games are used in training.
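
The `UCTSearch` method used in the next cell lives in `montecarlo_lib` and is not shown here. As a rough guide, UCT-style searches typically decide which child to explore with the UCB1 rule, which trades off a child's average reward against how rarely it has been visited. The sketch below illustrates that general rule under stated assumptions; it is not the library's code, and `ucb1_score`, `select_child`, and the exploration constant `c` are illustrative names.

```python
import math

def ucb1_score(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 value of a child: mean reward plus an exploration bonus."""
    if visits == 0:
        return float('inf')  # unvisited children are tried first
    mean = total_reward / float(visits)
    return mean + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits, c=math.sqrt(2)):
    """children: list of (total_reward, visits) pairs; returns the index to explore."""
    scores = [ucb1_score(r, n, parent_visits, c) for (r, n) in children]
    return scores.index(max(scores))
```

With `c = 0` the rule is purely greedy on the mean reward; larger values of `c` push the search toward rarely visited children.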

In [2]:
games_list = []
winners_list = []
all_game_trees_list = []
max_turns = 50
total_budget = 50
num_games = 20

for game_num in range(num_games):
    game = checkers_state(board_size = 6, max_turns = max_turns, tiebreaker_rule = True)
    #game.show_board()

    game_states_list = []
    game_trees_list = []

    print "Game %d commencing." % game_num
    for i in range(max_turns):
        num_actions = game.num_actions()     
        game.notes = ''
        game.notes += 'Turn %d: Player %d now choosing action from board above.\n' %  (i, game.player)
        game_to_play = deepcopy(game)
        game_to_play.max_turns = 60 #This setting stops stalling-for-time strategies
            
        game_tree = MonteCarloTree( deepcopy(game_to_play), 
                       budget = total_budget, 
                       num_simulations = 10, 
                       max_steps_to_simulate = 60)
        
        action = game_tree.UCTSearch()

        if not game_tree.root.is_complete and num_actions > 1 and len(game_tree.tree) < game_tree.budget:
            raise ValueError("MCTS stopped with an incomplete root and a tree smaller than its budget.")
        game_trees_list.append(deepcopy(game_tree))
        game.notes +=  'Size of tree: %d\n' % len(game_tree.tree)

        if action is None:
            game.notes +=  "No good move.  Taking random action.\n"
            action = game.random_action()
            
        game_states_list.append(deepcopy(game))
        (observation, reward, done, info) = game.step(action)        




        if done:
            winner = game.winner()
            if winner == 'draw':
                note = "Game is a draw."
            else:
                note = 'Player %d wins! (after %d moves)' % (game.winner(), i)
            game.notes = note
            print note
            winners_list.append(game.winner())
        #print game.notes
        #game.show_board()
        if done:
            game_states_list.append(deepcopy(game))
            break
    else:
        print "Game timed out."
        winners_list.append(None)
        
    games_list.append(deepcopy(game_states_list))
    all_game_trees_list.append(game_trees_list)
print winners_list

Game 0 commencing.
Game 1 commencing.
Game 2 commencing.
Game 3 commencing.
Game 4 commencing.
Game 5 commencing.
Game 6 commencing.
Game 7 commencing.
Game 8 commencing.
Game 9 commencing.
Game 10 commencing.
Game 11 commencing.
Game 12 commencing.
Game 13 commencing.
Game 14 commencing.
Game 15 commencing.
Game 16 commencing.
Game 17 commencing.
Game 18 commencing.
Game 19 commencing.
[1, 1, 1, 0, 'draw', 1, 'draw', 1, 1, 0, 'draw', 1, 'draw', 0, 0, 'draw', 1, 1, 'draw', 0]
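
As a quick summary of the list above, the outcomes can be tallied and turned into a single score rate by counting a draw as half a win (the same convention used for $\alpha$ later in this notebook). This is a minimal sketch; `score_rate` is an illustrative helper, and it assumes that timed-out games (recorded as `None`) should be left out of the denominator.

```python
from collections import Counter

def score_rate(winners, player):
    """Fraction of decided games scored by `player`, counting a draw as half a win."""
    counts = Counter(w for w in winners if w is not None)  # drop timed-out games
    total = sum(counts.values())
    if total == 0:
        return None
    return (counts[player] + 0.5 * counts['draw']) / total

winners_list = [1, 1, 1, 0, 'draw', 1, 'draw', 1, 1, 0,
                'draw', 1, 'draw', 0, 0, 'draw', 1, 1, 'draw', 0]
rate_p1 = score_rate(winners_list, 1)  # 9 wins + 6 draws out of 20 games
rate_p0 = score_rate(winners_list, 0)  # 5 wins + 6 draws out of 20 games
```

For the run above this gives 0.6 for Player 1 and 0.4 for Player 0.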


Now we assemble the training set.  The training set consists of all nodes (that is, game states) in all the trees created above whose depth is below a fixed maximum (4 in the cell below), together with the win/lose/draw proportions from Player 0's point of view as the output to be learned.
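
The target construction can be sketched in isolation. The snippet assumes, as the cell that follows appears to, that each simulation result is `1` for a win by the tree's root player, `0` for a loss, and `0.5` for a draw; `outcome_proportions` is an illustrative name, not part of `montecarlo_lib`.

```python
def outcome_proportions(results, root_player):
    """Return [Player 0 wins, Player 1 wins, draws] as fractions of the simulations."""
    total = float(len(results))
    wins = results.count(1)      # wins from the root player's point of view
    losses = results.count(0)
    draws = results.count(0.5)
    if root_player != 0:
        # Re-express the counts from Player 0's point of view.
        wins, losses = losses, wins
    return [wins / total, losses / total, draws / total]

# Four simulations from a tree rooted at Player 1: two wins, one loss, one draw.
y = outcome_proportions([1, 1, 0, 0.5], root_player=1)
```

Here `y` is `[0.25, 0.5, 0.25]`: the two root-player wins count as Player 1 victories after the swap.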



In [3]:
all_X = []
all_Y = []

for game_index in range(num_games):
    game_states_list = games_list[game_index]
    game_trees_list = all_game_trees_list[game_index]
    
    for turn_index in range(len(game_trees_list)):
        game_tree = game_trees_list[turn_index]
        for node in game_tree.tree.values():
            
            results = node.all_simulation_results
            if len(results) > 0 and node.depth < 4:
                num_victories = results.count(1)
                num_losses = results.count(0)
                num_draws = results.count(0.5)

                if game_tree.root_player != 0:
                    num_victories, num_losses = num_losses, num_victories
                x = get_single_board_vector(node.state)
                totes = float(len(results))
                y = [num_victories/totes, num_losses/totes, num_draws/totes]
                #each y is: proportion of player 0 victories, player 1 victories, draws
                
                all_X.append(x)
                all_Y.append(y)
    
    
train_indices = get_random_subset(len(all_X), .8)

train_X = np.array([all_X[i] for i in range(len(all_X)) if i in train_indices])
train_Y = np.array([all_Y[i] for i in range(len(all_X)) if i in train_indices])

test_X = np.array([all_X[i] for i in range(len(all_X)) if i not in train_indices])
test_Y = np.array([all_Y[i] for i in range(len(all_X)) if i not in train_indices])

print 'train: %d / %d states (%.f pct)' % (len(train_X), len(all_X), 100 * len(train_X)/ float( len(all_X)))

print 'test: %d / %d states (%.f pct)' % (len(test_X), len(all_X), 100 * len(test_X)/ float( len(all_X))) 

train: 20257 / 25281 states (80 pct)
test: 5024 / 25281 states (20 pct)
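
`get_random_subset` comes from `montecarlo_lib` and its definition is not shown. Judging from how it is used (membership tests on indices, with roughly 80% of states landing in the training set), it presumably returns a random collection of indices. A hypothetical stand-in, returning a `set` so that the `i in train_indices` checks stay fast, might look like this:

```python
import random

def get_random_subset_sketch(n, fraction, seed=None):
    """Pick round(fraction * n) distinct indices from range(n), as a set."""
    rng = random.Random(seed)
    k = int(round(fraction * n))
    return set(rng.sample(range(n), k))

train_indices = get_random_subset_sketch(10, 0.8, seed=0)
test_indices = set(range(10)) - train_indices
```

If the real helper returns a list, each `in` test scans the whole list, which gets slow with tens of thousands of states; wrapping the result in `set(...)` avoids that.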


We train a neural net model using Keras to learn these probabilities.

In [4]:
layers = [30,20,10]
activations = ['relu', 'relu', 'relu']

model = Sequential()

for i in range(len(layers)):
    if activations is None:
        activation = 'relu'
    else:
        activation = activations[i]

    if i == 0:
        model.add(Dense(layers[i],  activation = activation, input_dim = train_X.shape[1]))
    else:
        model.add(Dense(layers[i],  activation = activation))
        
model.add(Dense(3, activation='softmax'))
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['mean_squared_error'])




# model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
# Fit the model




model.fit(train_X, train_Y, epochs=1000, batch_size=100, verbose = 2)
# evaluate the model
scores = model.evaluate(test_X, test_Y)

Epoch 1/1000
 - 1s - loss: 0.9032 - mean_squared_error: 0.0513
Epoch 2/1000
 - 0s - loss: 0.8549 - mean_squared_error: 0.0401
Epoch 3/1000
 - 0s - loss: 0.8383 - mean_squared_error: 0.0357
Epoch 4/1000
 - 0s - loss: 0.8291 - mean_squared_error: 0.0332
Epoch 5/1000
 - 0s - loss: 0.8233 - mean_squared_error: 0.0317
Epoch 6/1000
 - 0s - loss: 0.8193 - mean_squared_error: 0.0307
Epoch 7/1000
 - 0s - loss: 0.8165 - mean_squared_error: 0.0300
Epoch 8/1000
 - 0s - loss: 0.8142 - mean_squared_error: 0.0294
Epoch 9/1000
 - 0s - loss: 0.8120 - mean_squared_error: 0.0288
Epoch 10/1000
 - 0s - loss: 0.8108 - mean_squared_error: 0.0285
Epoch 11/1000
 - 0s - loss: 0.8088 - mean_squared_error: 0.0280
Epoch 12/1000
 - 0s - loss: 0.8080 - mean_squared_error: 0.0278
Epoch 13/1000
 - 0s - loss: 0.8067 - mean_squared_error: 0.0275
Epoch 14/1000
 - 0s - loss: 0.8057 - mean_squared_error: 0.0272
Epoch 15/1000
 - 0s - loss: 0.8048 - mean_squared_error: 0.0270
Epoch 16/1000
 - 0s - loss: 0.8042
...
Epoch 129/1000
 - 0s - loss: 0.7862 - mean_squared_error: 0.0225
Epoch 130/1000
 - 0s - loss: 0.7863 - mean_squared_error: 0.0226
Epoch 131/1000
 - 0s - loss: 0.7861 - mean_squared_error: 0.0225
Epoch 132/1000
 - 0s - loss: 0.7861 - mean_squared_error: 0.0226
Epoch 133/1000
 - 0s - loss: 0.7854 - mean_squared_error: 0.0224
Epoch 134/1000
 - 0s - loss: 0.7858 - mean_squared_error: 0.0225
Epoch 135/1000
 - 0s - loss: 0.7858 - mean_squared_error: 0.0225
Epoch 136/1000
 - 0s - loss: 0.7858 - mean_squared_error: 0.0225
Epoch 137/1000
 - 0s - loss: 0.7856 - mean_squared_error: 0.0224
Epoch 138/1000
 - 0s - loss: 0.7856 - mean_squared_error: 0.0224
Epoch 139/1000
 - 0s - loss: 0.7856 - mean_squared_error: 0.0224
Epoch 140/1000
 - 0s - loss: 0.7857 - mean_squared_error: 0.0224
Epoch 141/1000
 - 0s - loss: 0.7856 - mean_squared_error: 0.0224
Epoch 142/1000
 - 0s - loss: 0.7852 - mean_squared_error: 0.0223
Epoch 143/1000
 - 0s - loss: 0.7853 - mean_squared_error: 0.0224
Epoch 144/1000
...
 - 0s - loss: 0.7823 - mean_squared_error: 0.0217
Epoch 256/1000
 - 0s - loss: 0.7821 - mean_squared_error: 0.0216
Epoch 257/1000
 - 0s - loss: 0.7822 - mean_squared_error: 0.0216
Epoch 258/1000
 - 0s - loss: 0.7825 - mean_squared_error: 0.0217
Epoch 259/1000
 - 0s - loss: 0.7821 - mean_squared_error: 0.0216
Epoch 260/1000
 - 0s - loss: 0.7824 - mean_squared_error: 0.0217
Epoch 261/1000
 - 0s - loss: 0.7820 - mean_squared_error: 0.0216
Epoch 262/1000
 - 0s - loss: 0.7821 - mean_squared_error: 0.0217
Epoch 263/1000
 - 0s - loss: 0.7822 - mean_squared_error: 0.0217
Epoch 264/1000
 - 0s - loss: 0.7823 - mean_squared_error: 0.0217
Epoch 265/1000
 - 0s - loss: 0.7822 - mean_squared_error: 0.0217
Epoch 266/1000
 - 0s - loss: 0.7824 - mean_squared_error: 0.0217
Epoch 267/1000
 - 0s - loss: 0.7820 - mean_squared_error: 0.0216
Epoch 268/1000
 - 0s - loss: 0.7817 - mean_squared_error: 0.0215
Epoch 269/1000
 - 0s - loss: 0.7818 - mean_squared_error: 0.0215
Epoch 270/1000
 - 0s - loss: 0.7821
...
Epoch 382/1000
 - 0s - loss: 0.7806 - mean_squared_error: 0.0213
Epoch 383/1000
 - 0s - loss: 0.7806 - mean_squared_error: 0.0213
Epoch 384/1000
 - 0s - loss: 0.7805 - mean_squared_error: 0.0213
Epoch 385/1000
 - 0s - loss: 0.7806 - mean_squared_error: 0.0213
Epoch 386/1000
 - 0s - loss: 0.7807 - mean_squared_error: 0.0213
Epoch 387/1000
 - 0s - loss: 0.7807 - mean_squared_error: 0.0213
Epoch 388/1000
 - 0s - loss: 0.7804 - mean_squared_error: 0.0212
Epoch 389/1000
 - 0s - loss: 0.7805 - mean_squared_error: 0.0213
Epoch 390/1000
 - 0s - loss: 0.7807 - mean_squared_error: 0.0213
Epoch 391/1000
 - 0s - loss: 0.7805 - mean_squared_error: 0.0213
Epoch 392/1000
 - 0s - loss: 0.7807 - mean_squared_error: 0.0213
Epoch 393/1000
 - 0s - loss: 0.7804 - mean_squared_error: 0.0213
Epoch 394/1000
 - 0s - loss: 0.7805 - mean_squared_error: 0.0213
Epoch 395/1000
 - 0s - loss: 0.7804 - mean_squared_error: 0.0212
Epoch 396/1000
 - 0s - loss: 0.7804 - mean_squared_error: 0.0212
Epoch 397/1000
...
 - 0s - loss: 0.7795 - mean_squared_error: 0.0211
Epoch 509/1000
 - 0s - loss: 0.7793 - mean_squared_error: 0.0210
Epoch 510/1000
 - 0s - loss: 0.7796 - mean_squared_error: 0.0211
Epoch 511/1000
 - 0s - loss: 0.7795 - mean_squared_error: 0.0210
Epoch 512/1000
 - 0s - loss: 0.7794 - mean_squared_error: 0.0210
Epoch 513/1000
 - 0s - loss: 0.7794 - mean_squared_error: 0.0210
Epoch 514/1000
 - 0s - loss: 0.7796 - mean_squared_error: 0.0211
Epoch 515/1000
 - 0s - loss: 0.7796 - mean_squared_error: 0.0211
Epoch 516/1000
 - 0s - loss: 0.7792 - mean_squared_error: 0.0210
Epoch 517/1000
 - 0s - loss: 0.7794 - mean_squared_error: 0.0210
Epoch 518/1000
 - 0s - loss: 0.7794 - mean_squared_error: 0.0210
Epoch 519/1000
 - 0s - loss: 0.7794 - mean_squared_error: 0.0210
Epoch 520/1000
 - 0s - loss: 0.7792 - mean_squared_error: 0.0210
Epoch 521/1000
 - 0s - loss: 0.7793 - mean_squared_error: 0.0210
Epoch 522/1000
 - 0s - loss: 0.7794 - mean_squared_error: 0.0210
Epoch 523/1000
 - 0s - loss: 0.7792
...
Epoch 635/1000
 - 0s - loss: 0.7786 - mean_squared_error: 0.0208
Epoch 636/1000
 - 0s - loss: 0.7786 - mean_squared_error: 0.0208
Epoch 637/1000
 - 0s - loss: 0.7790 - mean_squared_error: 0.0209
Epoch 638/1000
 - 0s - loss: 0.7786 - mean_squared_error: 0.0208
Epoch 639/1000
 - 0s - loss: 0.7789 - mean_squared_error: 0.0209
Epoch 640/1000
 - 0s - loss: 0.7783 - mean_squared_error: 0.0208
Epoch 641/1000
 - 0s - loss: 0.7786 - mean_squared_error: 0.0208
Epoch 642/1000
 - 0s - loss: 0.7785 - mean_squared_error: 0.0208
Epoch 643/1000
 - 0s - loss: 0.7786 - mean_squared_error: 0.0208
Epoch 644/1000
 - 0s - loss: 0.7788 - mean_squared_error: 0.0209
Epoch 645/1000
 - 0s - loss: 0.7784 - mean_squared_error: 0.0208
Epoch 646/1000
 - 0s - loss: 0.7785 - mean_squared_error: 0.0208
Epoch 647/1000
 - 0s - loss: 0.7787 - mean_squared_error: 0.0209
Epoch 648/1000
 - 0s - loss: 0.7786 - mean_squared_error: 0.0209
Epoch 649/1000
 - 0s - loss: 0.7788 - mean_squared_error: 0.0209
Epoch 650/1000
...
 - 0s - loss: 0.7782 - mean_squared_error: 0.0208
Epoch 762/1000
 - 0s - loss: 0.7780 - mean_squared_error: 0.0207
Epoch 763/1000
 - 0s - loss: 0.7780 - mean_squared_error: 0.0207
Epoch 764/1000
 - 0s - loss: 0.7782 - mean_squared_error: 0.0208
Epoch 765/1000
 - 0s - loss: 0.7782 - mean_squared_error: 0.0208
Epoch 766/1000
 - 0s - loss: 0.7785 - mean_squared_error: 0.0208
Epoch 767/1000
 - 0s - loss: 0.7783 - mean_squared_error: 0.0208
Epoch 768/1000
 - 0s - loss: 0.7782 - mean_squared_error: 0.0207
Epoch 769/1000
 - 0s - loss: 0.7782 - mean_squared_error: 0.0208
Epoch 770/1000
 - 0s - loss: 0.7781 - mean_squared_error: 0.0207
Epoch 771/1000
 - 0s - loss: 0.7780 - mean_squared_error: 0.0207
Epoch 772/1000
 - 0s - loss: 0.7781 - mean_squared_error: 0.0207
Epoch 773/1000
 - 0s - loss: 0.7782 - mean_squared_error: 0.0208
Epoch 774/1000
 - 0s - loss: 0.7781 - mean_squared_error: 0.0207
Epoch 775/1000
 - 0s - loss: 0.7781 - mean_squared_error: 0.0207
Epoch 776/1000
 - 0s - loss: 0.7782
...
Epoch 888/1000
 - 0s - loss: 0.7778 - mean_squared_error: 0.0207
Epoch 889/1000
 - 0s - loss: 0.7777 - mean_squared_error: 0.0206
Epoch 890/1000
 - 0s - loss: 0.7777 - mean_squared_error: 0.0206
Epoch 891/1000
 - 0s - loss: 0.7776 - mean_squared_error: 0.0206
Epoch 892/1000
 - 0s - loss: 0.7775 - mean_squared_error: 0.0206
Epoch 893/1000
 - 0s - loss: 0.7778 - mean_squared_error: 0.0207
Epoch 894/1000
 - 0s - loss: 0.7776 - mean_squared_error: 0.0206
Epoch 895/1000
 - 0s - loss: 0.7775 - mean_squared_error: 0.0206
Epoch 896/1000
 - 0s - loss: 0.7774 - mean_squared_error: 0.0206
Epoch 897/1000
 - 0s - loss: 0.7778 - mean_squared_error: 0.0207
Epoch 898/1000
 - 0s - loss: 0.7777 - mean_squared_error: 0.0206
Epoch 899/1000
 - 0s - loss: 0.7775 - mean_squared_error: 0.0206
Epoch 900/1000
 - 0s - loss: 0.7774 - mean_squared_error: 0.0206
Epoch 901/1000
 - 0s - loss: 0.7777 - mean_squared_error: 0.0206
Epoch 902/1000
 - 0s - loss: 0.7780 - mean_squared_error: 0.0207
Epoch 903/1000
...

In [5]:
scores = model.evaluate(test_X, test_Y, verbose = 0)

for i in range(len(model.metrics_names)):
    
    print "%s on test set: %f" % (model.metrics_names[i], scores[i])
    

loss on test set: 0.807098
mean_squared_error on test set: 0.025627
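
The cross-entropy loss looks large next to the small mean squared error, but with soft targets the two are not on the same scale. When the target $y$ is itself a probability vector, the cross-entropy $-\sum_i y_i \log p_i$ is minimized at $p = y$, where it equals the entropy of $y$ rather than zero. The sketch below illustrates this with NumPy on a made-up target (the numbers are illustrative, not taken from the training set).

```python
import numpy as np

def cross_entropy(y, p):
    """Cross-entropy -sum(y * log p) between target y and prediction p."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = y > 0  # terms with y_i = 0 contribute nothing
    return float(-np.sum(y[mask] * np.log(p[mask])))

y = [0.5, 0.3, 0.2]           # illustrative soft target (win, lose, draw)
floor = cross_entropy(y, y)   # entropy of y: the best achievable loss
worse = cross_entropy(y, [0.4, 0.4, 0.2])
```

So a test loss near 0.8 does not by itself indicate a poor fit. Subtracting the mean entropy of the rows of `test_Y` from the reported loss would leave the average KL divergence, which is arguably easier to interpret.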


Here we visualize the progress of a particular game of checkers and the associated win/lose/draw probabilities computed by our trained neural net.

In [6]:
game_to_view = 0
game_states_list = games_list[game_to_view]
rcParams['figure.figsize'] = 4,4



def game_slider(i):
    game_states_list[i].show_board()
    
    v = get_single_board_vector(game_states_list[i])
    X = np.array([v])
    model_output = list(100 * model.predict(X)[0])
    print game_states_list[i].notes
    print "Player 0 victory: ", model_output[0], '%'
    print "Player 1 victory: ", model_output[1], '%'
    print "Draw:             ", model_output[2], '%'
    #return 'Player: %d' % game_states_list[i].player

interact(game_slider, i = IntSlider(min=0,max=len(game_states_list)-1,step=1,value=0)  )
None

A Jupyter Widget

Below is an outline of a single step of a more general training algorithm, which is yet to be implemented.  The idea is to train the neural net while simultaneously using the output of this model to improve the MCTS algorithm.

- At each step, we have a previously determined weight $\alpha \in [0,1]$ and a neural net $f$ that takes as input a board state $B$ and outputs a triple $(\hat{w},\hat{l},\hat{d})$ estimating the probabilities of winning, losing, or drawing under "near-ideal" play.
- Play a large number of games using Monte-Carlo Tree Search.  
- During backpropagation, two 3-tuples are backpropagated:
    - $Q = (w, l, d)$, the proportions of simulations ending in a win, a loss, and a draw.
    - $\hat{Q}  = (\hat{w}, \hat{l}, \hat{d}) = f(B)$, the output of the previously trained neural net $f$.
- For each node, set the node's value to $y = \alpha Q + (1-\alpha) \hat{Q}$. (Initially, set $\alpha = 1$, so that all weight is given to $Q$ while the neural net is not yet trained.)
- The best child in MCTS is chosen according to the value of $y$.
- After many games have been played, create a training set for the neural net consisting of:
   - inputs $X$, the game states corresponding to all nodes of depth < max_depth in all game trees
   - outputs $Y$, the $y$ values assigned to these game states.
- Train neural net.
- Periodically, update the value of $\alpha$.  To find the new value, play a number of games between the following strategies: 
    - Strategy A: Choose moves by performing MCTS using random policy.
    - Strategy B: Choose moves by performing MCTS using scoring by $f$.

    Choose $\alpha$ to be the proportion of games won by Strategy A (plus half the proportion of games drawn.)
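
The blending and $\alpha$-update steps above can be sketched as follows. This is a minimal illustration under the assumption that $Q$ and $\hat{Q}$ are both length-3 probability vectors; `blend_target` and `alpha_from_match` are hypothetical helper names, and the match counts are made up.

```python
def blend_target(q, q_hat, alpha):
    """y = alpha * Q + (1 - alpha) * Q_hat, applied componentwise."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(q, q_hat)]

def alpha_from_match(wins_a, draws, num_games):
    """Share of games scored by Strategy A, counting a draw as half a win."""
    return (wins_a + 0.5 * draws) / float(num_games)

# Made-up match: Strategy A (random rollouts) wins 6, draws 4, loses 10 of 20 games.
alpha = alpha_from_match(wins_a=6, draws=4, num_games=20)
y = blend_target([0.6, 0.3, 0.1], [0.2, 0.5, 0.3], alpha)
```

Because both inputs sum to 1, so does the blend, which keeps $y$ usable as a target for a softmax output layer.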
    
If the algorithm is succeeding, the value of $\alpha$ should gradually decrease on average.  Eventually, if $\alpha$ becomes small enough, set $\alpha = 0$ to save time (since evaluation of $\hat{Q}$ should be much faster than evaluation of $Q$.)

Periodically, the 'skill level' of the neural net $f$ should be measured by the number of games won by Strategy B against Strategy A above, possibly with B handicapped by a smaller budget than A.  For example, if $f$ is very accurate, Strategy B with a budget of 20 may be able to beat Strategy A with a budget of 100.
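
One hypothetical way to turn this into a single number is to search for the largest budget the random-rollout strategy (Strategy A) can be given before the net-guided strategy (Strategy B), at a fixed smaller budget, stops scoring at least 50%. The sketch below assumes a caller-supplied `play_match(budget_a, budget_b)` that returns B's score rate (wins plus half of draws, divided by games played). No such function exists in the notebook yet, and the stub used here is purely for demonstration.

```python
def max_budget_beaten(play_match, budget_b, candidate_budgets_a, threshold=0.5):
    """Largest budget for A at which B (with budget_b) still scores >= threshold.

    Returns None if B falls short even against the smallest candidate.
    """
    best = None
    for budget_a in sorted(candidate_budgets_a):
        if play_match(budget_a, budget_b) >= threshold:
            best = budget_a
        else:
            break  # assumes A only gets stronger as its budget grows
    return best

def stub_match(budget_a, budget_b):
    # Stand-in for real games: pretend f is worth a 6x budget multiplier.
    effective_b = 6.0 * budget_b
    return effective_b / (effective_b + budget_a)

skill = max_budget_beaten(stub_match, budget_b=20,
                          candidate_budgets_a=[20, 50, 100, 200])
```

With this stub, `skill` comes out as 100. In practice `play_match` would have to play enough games for the score rate to be stable before the comparison with `threshold` means much.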



Other notes / questions - 

- Should at first play with small board size to see if training is working
- Over time, should increase the MCTS budget as $f$ becomes stronger
- Should Q-values of game trees be shared across turns or games? If so, how?