### Monte-Carlo Tree Search with Neural Networks applied to play Checkers

#### Jair Taylor

Inspired in part by DeepMind's success with AlphaGo, we have written code that learns to play the game of Checkers using a somewhat similar methodology.  The algorithm uses a version of the Monte-Carlo Tree Search algorithm to learn find the proportion of wins that a good player should get from a given board state, and then trains a neural net to learn these probabilities.  

TODO:

1.  Create framework for evaluating strength of play. e.g., given an algorithm, what is its strength in terms of budget for a tree?
2.  Implement human-playable games.
3.  Create more general structure for DNN training.

In [1]:
from montecarlo_lib import *  # Here we have implemented the Monte-Carlo Tree Search algorithm 
                              # as well as defining the rules of Checkers.

import matplotlib.pyplot as plt

from ipywidgets import interact, IntSlider

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Activation



import tensorflow as tf 


%matplotlib inline

from pylab import rcParams

import time

Using TensorFlow backend.


In [2]:
class PolicyChooser:
    def __init__(self, starting_game, policy0, policy1, num_games = 5):
        self.policy0 = policy0
        self.policy1 = policy1
        self.starting_game = starting_game
        self.num_games = num_games
        
    def BestPolicy(self, verbose = False):
        games_list = []
        winners_list = []
        all_game_trees_list = []
        for game_num in range(self.num_games):
            game = deepcopy(self.starting_game)
            #game = checkers_state(board_size = 6, max_turns = max_turns, tiebreaker_rule = True)
            #game.show_board()

            game_states_list = []
            game_trees_list = []
            if verbose:
                print("Game %d commencing." % game_num)
            
            
            for i in range(max_turns):
                num_actions = game.num_actions()     
                game.notes = ''
                game.notes += 'Turn %d: Player %d now choosing action from board above.\n' %  (i, game.player)
                game_to_play = deepcopy(game)
                game_to_play.max_turns = 60 #This setting stops stalling-for-time strategies

                if game.player == 0:
                    policy = self.policy0
                elif game.player == 1:
                    policy = self.policy1
                
                action = policy.play(game)

                if action is None:
                    game.notes +=  "No good move.  Taking random action.\n"
                    #print(game.notes)
                    action = game.random_action()

                game_states_list.append(deepcopy(game))
                (observation, reward, done, info) = game.step(action)        


                if done:
                    winner = game.winner()

                    if winner == 'draw':
                        note = "Game is a draw."
                    else:
                        note = 'Player %d wins! (after %d moves)' % (game.winner(), i)
                    game.notes = note
                    if verbose:
                        print(note)
                    winners_list.append(game.winner())
                #print(game.notes)
                #game.show_board()
                if done:
                    game_states_list.append(deepcopy(game))
                    break
            else:
                if verbose:
                    print("Game timed out.")
                winners_list.append(None)

            games_list.append(deepcopy(game_states_list))
            all_game_trees_list.append(game_trees_list)
        if verbose:
            print(winners_list)
        
        
        self.games_list = games_list
        self.all_game_trees_list = all_game_trees_list
        self.winners_list = winners_list
        
        
        if winners_list.count(0) != winners_list.count(1):
            best_policy = argmax({0:winners_list.count(0), 1:winners_list.count(1)})
            print("Winner is policy %d with %d:%d wins" % (best_policy, winners_list.count(best_policy), winners_list.count(1-best_policy)   ))
        else:
            print("Tie!!")
            best_policy = np.random.randint(2)
            
        self.best_policy = best_policy
        return best_policy

We begin by using MCTS to generate a large number of games of checkers.  No human-played games are used in training.

In [3]:

max_turns = 50  #max number of turns allowed in a single game. (after that winner is the player with most pieces)
total_budget = 100
num_games = 20
num_training_steps = 10
simulation_policies = [RandomPolicy(), RandomPolicy()]
orderliness = 30
board_size = 6
num_simulations = 3
epochs = 300
layers = [30,30,30]
activations = ['relu', 'relu', 'relu']

symmetrize = True






model = Sequential()

starting_game = checkers_state(board_size = board_size, max_turns = max_turns, tiebreaker_rule = True)
input_dim = len(get_single_board_vector(starting_game))

for i in range(len(layers)):
    if activations is None:
        activation = 'relu'
    else:
        activation = activations[i]

    if i == 0:
        model.add(Dense(layers[i],  activation = activation, input_dim = input_dim))
    else:
        model.add(Dense(layers[i],  activation = activation))

model.add(Dense(3))
#model.add(Activation('softmax'))

model.add(Activation(tf.nn.softmax))

# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['mean_squared_error'])






for training_step in range(num_training_steps):
    print ("\n\n\n*************************")
    print ("BEGINNING TRAINING STEP %d" % training_step)
    print ("*************************\n\n\n")
    
    
    games_list = []
    winners_list = []
    all_game_trees_list = []

    for game_num in range(num_games):
        game = checkers_state(board_size = board_size, max_turns = max_turns, tiebreaker_rule = True)
        #game.show_board()

        game_states_list = []
        game_trees_list = []

        print("Game %d commencing." % game_num)
        for i in range(max_turns):
            num_actions = game.num_actions()     
            game.notes = ''
            game.notes += 'Turn %d: Player %d now choosing action from board above.\n' %  (i, game.player)
            game_to_play = deepcopy(game)
            game_to_play.max_turns = 60 #This setting stops stalling-for-time strategies

            total_budget = (game_num + 1) * 10
            
            game_tree = MonteCarloTree( deepcopy(game_to_play), 
                           budget = total_budget, 
                           num_simulations = num_simulations, 
                           max_steps_to_simulate = 60,
                           simulation_policy = simulation_policies[-1 - game.player])

            action = game_tree.play()

            if not game_tree.root.is_complete and num_actions > 1 and len(game_tree.tree) < game_tree.budget:
                raise ValueError("Game tree not fully built")
            game_trees_list.append(game_tree)
            game.notes +=  'Size of tree: %d\n' % len(game_tree.tree)

            if action is None:
                game.notes +=  "No good move.  Taking random action.\n"
                action = game.random_action()

            game_states_list.append(deepcopy(game))
            (observation, reward, done, info) = game.step(action)        




            if done:
                winner = game.winner()

                if winner == 'draw':
                    note = "Game is a draw."
                else:
                    note = 'Player %d wins! (after %d moves)' % (game.winner(), i)
                game.notes = note
                print(note)
                winners_list.append(game.winner())
            #print(game.notes)
            #game.show_board()
            if done:
                game_states_list.append(deepcopy(game))
                break
        else:
            print("Game timed out.")
            winners_list.append(None)

        games_list.append(deepcopy(game_states_list))
        all_game_trees_list.append(game_trees_list)
    print(winners_list)
    
    if winners_list.count(0) != winners_list.count(1):
        best_policy = argmax({0:winners_list.count(0), 1:winners_list.count(1)})
        print("Winner is policy %d with %d:%d wins" % (best_policy, winners_list.count(best_policy), winners_list.count(1-best_policy)   ))
    else:
        print("Tie!!")
    if training_step > 0:
        if best_policy != 1:
            print ("hmm.. older version was better...?")
        else:
            print("New policy seems to be better...")






    all_X = []
    all_Y = []

    for game_index in range(num_games):
        game_states_list = games_list[game_index]
        game_trees_list = all_game_trees_list[game_index]

        for turn_index in range(len(game_trees_list)):
            game_tree = game_trees_list[turn_index]
            for node in game_tree.tree.values():

                results = node.all_simulation_results
                if len(results) > 0 and node.depth < 4:

                    num_victories = results.count(1)
                    num_losses = results.count(0)
                    num_draws = results.count(0.5)

                    if game_tree.root_player != 0 and not symmetrize:
                        num_victories, num_losses = num_losses, num_victories


                    x = get_single_board_vector(node.state, symmetrize = symmetrize)

                    totes = float(len(results))
                    y = [num_victories/totes, num_losses/totes, num_draws/totes]
                    #each y is: proportion of current player victories, current player losses, draws

                    all_X.append(x)
                    all_Y.append(y)


    train_indices = get_random_subset(len(all_X), .8)

    train_X = np.array([all_X[i] for i in range(len(all_X)) if i in train_indices])
    train_Y = np.array([all_Y[i] for i in range(len(all_X)) if i in train_indices])

    test_X = np.array([all_X[i] for i in range(len(all_X)) if i not in train_indices])
    test_Y = np.array([all_Y[i] for i in range(len(all_X)) if i not in train_indices])

    print('train: %d / %d states (%.f pct)' % (len(train_X), len(all_X), 100 * len(train_X)/ float( len(all_X)))  )

    print('test: %d / %d states (%.f pct)' % (len(test_X), len(all_X), 100 * len(test_X)/ float( len(all_X)))  )





    # model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
    # Fit the model


    model.fit(train_X, train_Y, epochs=epochs, batch_size=100, verbose = 2)
    # evaluate the model

    scores = model.evaluate(test_X, test_Y, verbose = 0)

    for i in range(len(model.metrics_names)):

        print( "%s on test set: %f" % (model.metrics_names[i], scores[i]) )
     
    

    simulation_policy = NeuralNetPolicy(model, orderliness = orderliness)
    
    simulation_policies.append(simulation_policy)
    
    
    #starting_game = checkers_state(board_size = board_size, max_turns = max_turns, tiebreaker_rule = True)

#     policy0 = MonteCarloTree(starting_game, 
#                            budget = total_budget, 
#                            num_simulations = 1, 
#                            max_steps_to_simulate = 60,
#                            simulation_policy = simulation_policies[-2])

#     policy1 = MonteCarloTree(starting_game, 
#                              budget = total_budget, 
#                              num_simulations = 1, 
#                              max_steps_to_simulate = 60,
#                              simulation_policy = simulation_policies[-1])

#     policy_chooser = PolicyChooser(starting_game, simulation_policies[-2], simulation_policies[-1], num_games = 10)

#     policy_num = policy_chooser.BestPolicy()

#     print ("Winning policy:", policy_num)






*************************
BEGINNING TRAINING STEP 0
*************************



Game 0 commencing.
Player 0 wins! (after 49 moves)
Game 1 commencing.
Player 0 wins! (after 33 moves)
Game 2 commencing.
Player 0 wins! (after 22 moves)
Game 3 commencing.
Player 0 wins! (after 34 moves)
Game 4 commencing.
Player 1 wins! (after 46 moves)
Game 5 commencing.
Player 0 wins! (after 32 moves)
Game 6 commencing.
Player 1 wins! (after 41 moves)
Game 7 commencing.
Player 0 wins! (after 49 moves)
Game 8 commencing.
Player 1 wins! (after 26 moves)
Game 9 commencing.
Game is a draw.
Game 10 commencing.
Game is a draw.
Game 11 commencing.
Game is a draw.
Game 12 commencing.
Player 0 wins! (after 49 moves)
Game 13 commencing.
Player 1 wins! (after 22 moves)
Game 14 commencing.
Player 1 wins! (after 49 moves)
Game 15 commencing.
Game is a draw.
Game 16 commencing.
Game is a draw.
Game 17 commencing.
Player 1 wins! (after 36 moves)
Game 18 commencing.
Player 1 wins! (after 19 moves)
Game 19 commencing

Epoch 112/300
 - 1s - loss: 0.8391 - mean_squared_error: 0.0552
Epoch 113/300
 - 1s - loss: 0.8390 - mean_squared_error: 0.0552
Epoch 114/300
 - 1s - loss: 0.8391 - mean_squared_error: 0.0552
Epoch 115/300
 - 1s - loss: 0.8391 - mean_squared_error: 0.0552
Epoch 116/300
 - 1s - loss: 0.8390 - mean_squared_error: 0.0552
Epoch 117/300
 - 1s - loss: 0.8391 - mean_squared_error: 0.0552
Epoch 118/300
 - 1s - loss: 0.8388 - mean_squared_error: 0.0551
Epoch 119/300
 - 1s - loss: 0.8389 - mean_squared_error: 0.0552
Epoch 120/300
 - 1s - loss: 0.8385 - mean_squared_error: 0.0550
Epoch 121/300
 - 1s - loss: 0.8389 - mean_squared_error: 0.0552
Epoch 122/300
 - 1s - loss: 0.8387 - mean_squared_error: 0.0551
Epoch 123/300
 - 1s - loss: 0.8387 - mean_squared_error: 0.0551
Epoch 124/300
 - 1s - loss: 0.8385 - mean_squared_error: 0.0550
Epoch 125/300
 - 1s - loss: 0.8383 - mean_squared_error: 0.0550
Epoch 126/300
 - 1s - loss: 0.8384 - mean_squared_error: 0.0550
Epoch 127/300
 - 1s - loss: 0.8384 - mea

 - 1s - loss: 0.8337 - mean_squared_error: 0.0539
Epoch 241/300
 - 1s - loss: 0.8336 - mean_squared_error: 0.0539
Epoch 242/300
 - 1s - loss: 0.8339 - mean_squared_error: 0.0540
Epoch 243/300
 - 1s - loss: 0.8332 - mean_squared_error: 0.0538
Epoch 244/300
 - 1s - loss: 0.8337 - mean_squared_error: 0.0539
Epoch 245/300
 - 1s - loss: 0.8338 - mean_squared_error: 0.0540
Epoch 246/300
 - 1s - loss: 0.8336 - mean_squared_error: 0.0539
Epoch 247/300
 - 1s - loss: 0.8333 - mean_squared_error: 0.0538
Epoch 248/300
 - 1s - loss: 0.8335 - mean_squared_error: 0.0539
Epoch 249/300
 - 1s - loss: 0.8335 - mean_squared_error: 0.0539
Epoch 250/300
 - 1s - loss: 0.8334 - mean_squared_error: 0.0539
Epoch 251/300
 - 1s - loss: 0.8334 - mean_squared_error: 0.0539
Epoch 252/300
 - 1s - loss: 0.8338 - mean_squared_error: 0.0540
Epoch 253/300
 - 1s - loss: 0.8335 - mean_squared_error: 0.0539
Epoch 254/300
 - 1s - loss: 0.8336 - mean_squared_error: 0.0539
Epoch 255/300
 - 1s - loss: 0.8334 - mean_squared_erro

Epoch 48/300
 - 1s - loss: 0.8272 - mean_squared_error: 0.0597
Epoch 49/300
 - 1s - loss: 0.8268 - mean_squared_error: 0.0596
Epoch 50/300
 - 1s - loss: 0.8265 - mean_squared_error: 0.0596
Epoch 51/300
 - 1s - loss: 0.8265 - mean_squared_error: 0.0596
Epoch 52/300
 - 1s - loss: 0.8260 - mean_squared_error: 0.0594
Epoch 53/300
 - 1s - loss: 0.8258 - mean_squared_error: 0.0594
Epoch 54/300
 - 1s - loss: 0.8263 - mean_squared_error: 0.0595
Epoch 55/300
 - 1s - loss: 0.8258 - mean_squared_error: 0.0594
Epoch 56/300
 - 1s - loss: 0.8256 - mean_squared_error: 0.0593
Epoch 57/300
 - 1s - loss: 0.8254 - mean_squared_error: 0.0593
Epoch 58/300
 - 1s - loss: 0.8253 - mean_squared_error: 0.0593
Epoch 59/300
 - 1s - loss: 0.8251 - mean_squared_error: 0.0592
Epoch 60/300
 - 1s - loss: 0.8247 - mean_squared_error: 0.0591
Epoch 61/300
 - 1s - loss: 0.8247 - mean_squared_error: 0.0591
Epoch 62/300
 - 1s - loss: 0.8243 - mean_squared_error: 0.0590
Epoch 63/300
 - 1s - loss: 0.8246 - mean_squared_error:

Epoch 177/300
 - 1s - loss: 0.8162 - mean_squared_error: 0.0569
Epoch 178/300
 - 1s - loss: 0.8161 - mean_squared_error: 0.0570
Epoch 179/300
 - 1s - loss: 0.8162 - mean_squared_error: 0.0570
Epoch 180/300
 - 1s - loss: 0.8163 - mean_squared_error: 0.0570
Epoch 181/300
 - 1s - loss: 0.8159 - mean_squared_error: 0.0569
Epoch 182/300
 - 1s - loss: 0.8162 - mean_squared_error: 0.0570
Epoch 183/300
 - 1s - loss: 0.8159 - mean_squared_error: 0.0569
Epoch 184/300
 - 1s - loss: 0.8163 - mean_squared_error: 0.0570
Epoch 185/300
 - 1s - loss: 0.8159 - mean_squared_error: 0.0569
Epoch 186/300
 - 1s - loss: 0.8162 - mean_squared_error: 0.0569
Epoch 187/300
 - 1s - loss: 0.8161 - mean_squared_error: 0.0570
Epoch 188/300
 - 1s - loss: 0.8160 - mean_squared_error: 0.0569
Epoch 189/300
 - 1s - loss: 0.8156 - mean_squared_error: 0.0568
Epoch 190/300
 - 1s - loss: 0.8156 - mean_squared_error: 0.0568
Epoch 191/300
 - 1s - loss: 0.8159 - mean_squared_error: 0.0569
Epoch 192/300
 - 1s - loss: 0.8154 - mea

Player 0 wins! (after 49 moves)
Game 3 commencing.
Player 0 wins! (after 39 moves)
Game 4 commencing.
Game is a draw.
Game 5 commencing.
Player 1 wins! (after 49 moves)
Game 6 commencing.
Player 1 wins! (after 49 moves)
Game 7 commencing.
Player 1 wins! (after 49 moves)
Game 8 commencing.
Player 1 wins! (after 49 moves)
Game 9 commencing.
Player 0 wins! (after 49 moves)
Game 10 commencing.
Player 0 wins! (after 49 moves)
Game 11 commencing.
Player 0 wins! (after 27 moves)
Game 12 commencing.
Game is a draw.
Game 13 commencing.
Player 0 wins! (after 37 moves)
Game 14 commencing.
Player 1 wins! (after 49 moves)
Game 15 commencing.
Game is a draw.
Game 16 commencing.
Player 1 wins! (after 49 moves)
Game 17 commencing.
Player 1 wins! (after 49 moves)
Game 18 commencing.
Game is a draw.
Game 19 commencing.
Player 1 wins! (after 49 moves)
[1, 0, 0, 0, 'draw', 1, 1, 1, 1, 0, 0, 0, 'draw', 0, 1, 'draw', 1, 1, 'draw', 1]
Winner is policy 1 with 9:7 wins
New policy seems to be better...
train: 3

 - 1s - loss: 0.8679 - mean_squared_error: 0.0618
Epoch 115/300
 - 1s - loss: 0.8676 - mean_squared_error: 0.0617
Epoch 116/300
 - 1s - loss: 0.8677 - mean_squared_error: 0.0617
Epoch 117/300
 - 1s - loss: 0.8674 - mean_squared_error: 0.0616
Epoch 118/300
 - 1s - loss: 0.8675 - mean_squared_error: 0.0617
Epoch 119/300
 - 1s - loss: 0.8676 - mean_squared_error: 0.0617
Epoch 120/300
 - 1s - loss: 0.8672 - mean_squared_error: 0.0616
Epoch 121/300
 - 1s - loss: 0.8673 - mean_squared_error: 0.0616
Epoch 122/300
 - 1s - loss: 0.8674 - mean_squared_error: 0.0616
Epoch 123/300
 - 1s - loss: 0.8673 - mean_squared_error: 0.0617
Epoch 124/300
 - 1s - loss: 0.8674 - mean_squared_error: 0.0617
Epoch 125/300
 - 1s - loss: 0.8669 - mean_squared_error: 0.0615
Epoch 126/300
 - 1s - loss: 0.8670 - mean_squared_error: 0.0616
Epoch 127/300
 - 1s - loss: 0.8670 - mean_squared_error: 0.0616
Epoch 128/300
 - 1s - loss: 0.8669 - mean_squared_error: 0.0615
Epoch 129/300
 - 1s - loss: 0.8669 - mean_squared_erro

Epoch 243/300
 - 1s - loss: 0.8633 - mean_squared_error: 0.0607
Epoch 244/300
 - 1s - loss: 0.8629 - mean_squared_error: 0.0606
Epoch 245/300
 - 1s - loss: 0.8631 - mean_squared_error: 0.0607
Epoch 246/300
 - 1s - loss: 0.8629 - mean_squared_error: 0.0606
Epoch 247/300
 - 1s - loss: 0.8632 - mean_squared_error: 0.0607
Epoch 248/300
 - 1s - loss: 0.8628 - mean_squared_error: 0.0606
Epoch 249/300
 - 1s - loss: 0.8634 - mean_squared_error: 0.0607
Epoch 250/300
 - 1s - loss: 0.8628 - mean_squared_error: 0.0606
Epoch 251/300
 - 1s - loss: 0.8626 - mean_squared_error: 0.0605
Epoch 252/300
 - 1s - loss: 0.8629 - mean_squared_error: 0.0606
Epoch 253/300
 - 1s - loss: 0.8628 - mean_squared_error: 0.0606
Epoch 254/300
 - 1s - loss: 0.8626 - mean_squared_error: 0.0605
Epoch 255/300
 - 1s - loss: 0.8632 - mean_squared_error: 0.0607
Epoch 256/300
 - 1s - loss: 0.8628 - mean_squared_error: 0.0606
Epoch 257/300
 - 1s - loss: 0.8630 - mean_squared_error: 0.0606
Epoch 258/300
 - 1s - loss: 0.8626 - mea

Epoch 51/300
 - 1s - loss: 0.8404 - mean_squared_error: 0.0620
Epoch 52/300
 - 1s - loss: 0.8400 - mean_squared_error: 0.0619
Epoch 53/300
 - 1s - loss: 0.8400 - mean_squared_error: 0.0619
Epoch 54/300
 - 1s - loss: 0.8398 - mean_squared_error: 0.0619
Epoch 55/300
 - 0s - loss: 0.8400 - mean_squared_error: 0.0619
Epoch 56/300
 - 1s - loss: 0.8397 - mean_squared_error: 0.0619
Epoch 57/300
 - 1s - loss: 0.8394 - mean_squared_error: 0.0618
Epoch 58/300
 - 0s - loss: 0.8395 - mean_squared_error: 0.0618
Epoch 59/300
 - 0s - loss: 0.8391 - mean_squared_error: 0.0617
Epoch 60/300
 - 1s - loss: 0.8390 - mean_squared_error: 0.0617
Epoch 61/300
 - 1s - loss: 0.8385 - mean_squared_error: 0.0616
Epoch 62/300
 - 0s - loss: 0.8384 - mean_squared_error: 0.0615
Epoch 63/300
 - 1s - loss: 0.8381 - mean_squared_error: 0.0614
Epoch 64/300
 - 1s - loss: 0.8380 - mean_squared_error: 0.0614
Epoch 65/300
 - 1s - loss: 0.8382 - mean_squared_error: 0.0615
Epoch 66/300
 - 1s - loss: 0.8378 - mean_squared_error:

Epoch 180/300
 - 1s - loss: 0.8294 - mean_squared_error: 0.0594
Epoch 181/300
 - 1s - loss: 0.8292 - mean_squared_error: 0.0594
Epoch 182/300
 - 1s - loss: 0.8294 - mean_squared_error: 0.0594
Epoch 183/300
 - 1s - loss: 0.8295 - mean_squared_error: 0.0594
Epoch 184/300
 - 1s - loss: 0.8292 - mean_squared_error: 0.0594
Epoch 185/300
 - 0s - loss: 0.8297 - mean_squared_error: 0.0595
Epoch 186/300
 - 1s - loss: 0.8294 - mean_squared_error: 0.0594
Epoch 187/300
 - 1s - loss: 0.8297 - mean_squared_error: 0.0595
Epoch 188/300
 - 1s - loss: 0.8294 - mean_squared_error: 0.0594
Epoch 189/300
 - 0s - loss: 0.8293 - mean_squared_error: 0.0594
Epoch 190/300
 - 1s - loss: 0.8289 - mean_squared_error: 0.0593
Epoch 191/300
 - 1s - loss: 0.8289 - mean_squared_error: 0.0593
Epoch 192/300
 - 1s - loss: 0.8289 - mean_squared_error: 0.0593
Epoch 193/300
 - 1s - loss: 0.8285 - mean_squared_error: 0.0592
Epoch 194/300
 - 1s - loss: 0.8287 - mean_squared_error: 0.0593
Epoch 195/300
 - 1s - loss: 0.8287 - mea

Player 1 wins! (after 36 moves)
Game 7 commencing.
Game is a draw.
Game 8 commencing.
Player 1 wins! (after 22 moves)
Game 9 commencing.
Game is a draw.
Game 10 commencing.
Player 1 wins! (after 49 moves)
Game 11 commencing.
Player 0 wins! (after 36 moves)
Game 12 commencing.
Player 0 wins! (after 42 moves)
Game 13 commencing.
Player 1 wins! (after 49 moves)
Game 14 commencing.
Player 1 wins! (after 37 moves)
Game 15 commencing.
Player 1 wins! (after 49 moves)
Game 16 commencing.
Player 1 wins! (after 49 moves)
Game 17 commencing.
Player 0 wins! (after 40 moves)
Game 18 commencing.
Game is a draw.
Game 19 commencing.
Player 1 wins! (after 49 moves)
[1, 0, 0, 1, 1, 0, 1, 'draw', 1, 'draw', 1, 0, 0, 1, 1, 1, 1, 0, 'draw', 1]
Winner is policy 1 with 11:6 wins
New policy seems to be better...
train: 26826 / 33655 states (80 pct)
test: 6829 / 33655 states (20 pct)
Epoch 1/300
 - 1s - loss: 0.8842 - mean_squared_error: 0.0726
Epoch 2/300
 - 1s - loss: 0.8650 - mean_squared_error: 0.0687
Epoc

 - 1s - loss: 0.8329 - mean_squared_error: 0.0611
Epoch 118/300
 - 1s - loss: 0.8330 - mean_squared_error: 0.0611
Epoch 119/300
 - 1s - loss: 0.8329 - mean_squared_error: 0.0611
Epoch 120/300
 - 1s - loss: 0.8327 - mean_squared_error: 0.0610
Epoch 121/300
 - 1s - loss: 0.8326 - mean_squared_error: 0.0610
Epoch 122/300
 - 1s - loss: 0.8328 - mean_squared_error: 0.0611
Epoch 123/300
 - 1s - loss: 0.8325 - mean_squared_error: 0.0610
Epoch 124/300
 - 1s - loss: 0.8331 - mean_squared_error: 0.0611
Epoch 125/300
 - 1s - loss: 0.8327 - mean_squared_error: 0.0610
Epoch 126/300
 - 1s - loss: 0.8322 - mean_squared_error: 0.0609
Epoch 127/300
 - 1s - loss: 0.8325 - mean_squared_error: 0.0610
Epoch 128/300
 - 1s - loss: 0.8321 - mean_squared_error: 0.0609
Epoch 129/300
 - 1s - loss: 0.8322 - mean_squared_error: 0.0609
Epoch 130/300
 - 1s - loss: 0.8325 - mean_squared_error: 0.0610
Epoch 131/300
 - 1s - loss: 0.8319 - mean_squared_error: 0.0608
Epoch 132/300
 - 1s - loss: 0.8322 - mean_squared_erro

Epoch 246/300
 - 1s - loss: 0.8284 - mean_squared_error: 0.0600
Epoch 247/300
 - 1s - loss: 0.8282 - mean_squared_error: 0.0600
Epoch 248/300
 - 1s - loss: 0.8282 - mean_squared_error: 0.0600
Epoch 249/300
 - 1s - loss: 0.8279 - mean_squared_error: 0.0599
Epoch 250/300
 - 1s - loss: 0.8278 - mean_squared_error: 0.0599
Epoch 251/300
 - 1s - loss: 0.8281 - mean_squared_error: 0.0599
Epoch 252/300
 - 1s - loss: 0.8278 - mean_squared_error: 0.0599
Epoch 253/300
 - 1s - loss: 0.8282 - mean_squared_error: 0.0600
Epoch 254/300
 - 1s - loss: 0.8279 - mean_squared_error: 0.0599
Epoch 255/300
 - 1s - loss: 0.8282 - mean_squared_error: 0.0600
Epoch 256/300
 - 1s - loss: 0.8278 - mean_squared_error: 0.0599
Epoch 257/300
 - 1s - loss: 0.8278 - mean_squared_error: 0.0599
Epoch 258/300
 - 1s - loss: 0.8280 - mean_squared_error: 0.0599
Epoch 259/300
 - 1s - loss: 0.8280 - mean_squared_error: 0.0600
Epoch 260/300
 - 1s - loss: 0.8276 - mean_squared_error: 0.0598
Epoch 261/300
 - 1s - loss: 0.8278 - mea

Epoch 54/300
 - 1s - loss: 0.8428 - mean_squared_error: 0.0632
Epoch 55/300
 - 1s - loss: 0.8426 - mean_squared_error: 0.0631
Epoch 56/300
 - 1s - loss: 0.8424 - mean_squared_error: 0.0631
Epoch 57/300
 - 1s - loss: 0.8423 - mean_squared_error: 0.0631
Epoch 58/300
 - 1s - loss: 0.8427 - mean_squared_error: 0.0631
Epoch 59/300
 - 1s - loss: 0.8418 - mean_squared_error: 0.0629
Epoch 60/300
 - 1s - loss: 0.8418 - mean_squared_error: 0.0630
Epoch 61/300
 - 1s - loss: 0.8419 - mean_squared_error: 0.0630
Epoch 62/300
 - 1s - loss: 0.8417 - mean_squared_error: 0.0629
Epoch 63/300
 - 1s - loss: 0.8418 - mean_squared_error: 0.0629
Epoch 64/300
 - 1s - loss: 0.8413 - mean_squared_error: 0.0628
Epoch 65/300
 - 1s - loss: 0.8414 - mean_squared_error: 0.0628
Epoch 66/300
 - 1s - loss: 0.8412 - mean_squared_error: 0.0628
Epoch 67/300
 - 1s - loss: 0.8411 - mean_squared_error: 0.0628
Epoch 68/300
 - 1s - loss: 0.8409 - mean_squared_error: 0.0627
Epoch 69/300
 - 1s - loss: 0.8410 - mean_squared_error:

Epoch 183/300
 - 1s - loss: 0.8338 - mean_squared_error: 0.0610
Epoch 184/300
 - 1s - loss: 0.8339 - mean_squared_error: 0.0610
Epoch 185/300
 - 1s - loss: 0.8339 - mean_squared_error: 0.0610
Epoch 186/300
 - 1s - loss: 0.8339 - mean_squared_error: 0.0610
Epoch 187/300
 - 1s - loss: 0.8335 - mean_squared_error: 0.0609
Epoch 188/300
 - 1s - loss: 0.8339 - mean_squared_error: 0.0610
Epoch 189/300
 - 1s - loss: 0.8337 - mean_squared_error: 0.0610
Epoch 190/300
 - 1s - loss: 0.8338 - mean_squared_error: 0.0610
Epoch 191/300
 - 1s - loss: 0.8339 - mean_squared_error: 0.0610
Epoch 192/300
 - 1s - loss: 0.8335 - mean_squared_error: 0.0609
Epoch 193/300
 - 1s - loss: 0.8335 - mean_squared_error: 0.0609
Epoch 194/300
 - 1s - loss: 0.8338 - mean_squared_error: 0.0610
Epoch 195/300
 - 1s - loss: 0.8330 - mean_squared_error: 0.0608
Epoch 196/300
 - 1s - loss: 0.8332 - mean_squared_error: 0.0608
Epoch 197/300
 - 1s - loss: 0.8335 - mean_squared_error: 0.0609
Epoch 198/300
 - 1s - loss: 0.8334 - mea

Game is a draw.
Game 11 commencing.
Player 1 wins! (after 16 moves)
Game 12 commencing.
Game is a draw.
Game 13 commencing.
Player 0 wins! (after 49 moves)
Game 14 commencing.
Player 0 wins! (after 18 moves)
Game 15 commencing.
Player 1 wins! (after 49 moves)
Game 16 commencing.
Game is a draw.
Game 17 commencing.
Player 0 wins! (after 41 moves)
Game 18 commencing.
Player 1 wins! (after 49 moves)
Game 19 commencing.
Game is a draw.
[0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 'draw', 1, 'draw', 0, 0, 1, 'draw', 0, 1, 'draw']
Tie!!
New policy seems to be better...
train: 28052 / 35056 states (80 pct)
test: 7004 / 35056 states (20 pct)
Epoch 1/300
 - 1s - loss: 0.9166 - mean_squared_error: 0.0685
Epoch 2/300
 - 1s - loss: 0.8999 - mean_squared_error: 0.0653
Epoch 3/300
 - 1s - loss: 0.8963 - mean_squared_error: 0.0644
Epoch 4/300
 - 1s - loss: 0.8944 - mean_squared_error: 0.0640
Epoch 5/300
 - 1s - loss: 0.8930 - mean_squared_error: 0.0637
Epoch 6/300
 - 1s - loss: 0.8918 - mean_squared_error: 0.0634


Epoch 121/300
 - 1s - loss: 0.8726 - mean_squared_error: 0.0590
Epoch 122/300
 - 1s - loss: 0.8721 - mean_squared_error: 0.0589
Epoch 123/300
 - 1s - loss: 0.8723 - mean_squared_error: 0.0590
Epoch 124/300
 - 1s - loss: 0.8727 - mean_squared_error: 0.0591
Epoch 125/300
 - 1s - loss: 0.8725 - mean_squared_error: 0.0590
Epoch 126/300
 - 1s - loss: 0.8723 - mean_squared_error: 0.0590
Epoch 127/300
 - 1s - loss: 0.8723 - mean_squared_error: 0.0590
Epoch 128/300
 - 1s - loss: 0.8721 - mean_squared_error: 0.0589
Epoch 129/300
 - 1s - loss: 0.8721 - mean_squared_error: 0.0589
Epoch 130/300
 - 1s - loss: 0.8721 - mean_squared_error: 0.0589
Epoch 131/300
 - 1s - loss: 0.8721 - mean_squared_error: 0.0589
Epoch 132/300
 - 1s - loss: 0.8718 - mean_squared_error: 0.0589
Epoch 133/300
 - 1s - loss: 0.8720 - mean_squared_error: 0.0589
Epoch 134/300
 - 1s - loss: 0.8721 - mean_squared_error: 0.0589
Epoch 135/300
 - 1s - loss: 0.8719 - mean_squared_error: 0.0589
Epoch 136/300
 - 1s - loss: 0.8716 - mea

 - 1s - loss: 0.8687 - mean_squared_error: 0.0581
Epoch 250/300
 - 1s - loss: 0.8684 - mean_squared_error: 0.0580
Epoch 251/300
 - 1s - loss: 0.8684 - mean_squared_error: 0.0581
Epoch 252/300
 - 1s - loss: 0.8684 - mean_squared_error: 0.0581
Epoch 253/300
 - 1s - loss: 0.8687 - mean_squared_error: 0.0581
Epoch 254/300
 - 1s - loss: 0.8685 - mean_squared_error: 0.0581
Epoch 255/300
 - 1s - loss: 0.8684 - mean_squared_error: 0.0581
Epoch 256/300
 - 1s - loss: 0.8682 - mean_squared_error: 0.0580
Epoch 257/300
 - 1s - loss: 0.8684 - mean_squared_error: 0.0580
Epoch 258/300
 - 1s - loss: 0.8684 - mean_squared_error: 0.0581
Epoch 259/300
 - 1s - loss: 0.8682 - mean_squared_error: 0.0580
Epoch 260/300
 - 1s - loss: 0.8686 - mean_squared_error: 0.0581
Epoch 261/300
 - 1s - loss: 0.8685 - mean_squared_error: 0.0580
Epoch 262/300
 - 1s - loss: 0.8682 - mean_squared_error: 0.0580
Epoch 263/300
 - 1s - loss: 0.8684 - mean_squared_error: 0.0580
Epoch 264/300
 - 1s - loss: 0.8683 - mean_squared_erro

Epoch 57/300
 - 1s - loss: 0.8387 - mean_squared_error: 0.0620
Epoch 58/300
 - 1s - loss: 0.8386 - mean_squared_error: 0.0620
Epoch 59/300
 - 1s - loss: 0.8383 - mean_squared_error: 0.0619
Epoch 60/300
 - 1s - loss: 0.8384 - mean_squared_error: 0.0619
Epoch 61/300
 - 1s - loss: 0.8384 - mean_squared_error: 0.0619
Epoch 62/300
 - 1s - loss: 0.8383 - mean_squared_error: 0.0619
Epoch 63/300
 - 1s - loss: 0.8379 - mean_squared_error: 0.0618
Epoch 64/300
 - 1s - loss: 0.8380 - mean_squared_error: 0.0618
Epoch 65/300
 - 1s - loss: 0.8381 - mean_squared_error: 0.0619
Epoch 66/300
 - 1s - loss: 0.8376 - mean_squared_error: 0.0617
Epoch 67/300
 - 1s - loss: 0.8377 - mean_squared_error: 0.0617
Epoch 68/300
 - 1s - loss: 0.8378 - mean_squared_error: 0.0618
Epoch 69/300
 - 1s - loss: 0.8376 - mean_squared_error: 0.0617
Epoch 70/300
 - 1s - loss: 0.8375 - mean_squared_error: 0.0617
Epoch 71/300
 - 1s - loss: 0.8372 - mean_squared_error: 0.0616
Epoch 72/300
 - 1s - loss: 0.8373 - mean_squared_error:

Epoch 186/300
 - 1s - loss: 0.8312 - mean_squared_error: 0.0601
Epoch 187/300
 - 1s - loss: 0.8313 - mean_squared_error: 0.0601
Epoch 188/300
 - 1s - loss: 0.8310 - mean_squared_error: 0.0601
Epoch 189/300
 - 1s - loss: 0.8311 - mean_squared_error: 0.0600
Epoch 190/300
 - 1s - loss: 0.8309 - mean_squared_error: 0.0600
Epoch 191/300
 - 1s - loss: 0.8310 - mean_squared_error: 0.0600
Epoch 192/300
 - 1s - loss: 0.8314 - mean_squared_error: 0.0602
Epoch 193/300
 - 1s - loss: 0.8310 - mean_squared_error: 0.0600
Epoch 194/300
 - 1s - loss: 0.8307 - mean_squared_error: 0.0600
Epoch 195/300
 - 1s - loss: 0.8309 - mean_squared_error: 0.0600
Epoch 196/300
 - 1s - loss: 0.8309 - mean_squared_error: 0.0600
Epoch 197/300
 - 1s - loss: 0.8307 - mean_squared_error: 0.0600
Epoch 198/300
 - 1s - loss: 0.8310 - mean_squared_error: 0.0600
Epoch 199/300
 - 1s - loss: 0.8308 - mean_squared_error: 0.0600
Epoch 200/300
 - 1s - loss: 0.8307 - mean_squared_error: 0.0599
Epoch 201/300
 - 1s - loss: 0.8306 - mea

Player 0 wins! (after 21 moves)
Game 15 commencing.
Player 1 wins! (after 28 moves)
Game 16 commencing.
Player 0 wins! (after 22 moves)
Game 17 commencing.
Game is a draw.
Game 18 commencing.
Game is a draw.
Game 19 commencing.
Player 0 wins! (after 49 moves)
[1, 1, 1, 1, 1, 1, 0, 'draw', 1, 1, 1, 1, 'draw', 0, 0, 1, 0, 'draw', 'draw', 0]
Winner is policy 1 with 11:5 wins
New policy seems to be better...
train: 24565 / 30620 states (80 pct)
test: 6055 / 30620 states (20 pct)
Epoch 1/300
 - 1s - loss: 0.9076 - mean_squared_error: 0.0639
Epoch 2/300
 - 1s - loss: 0.8910 - mean_squared_error: 0.0612
Epoch 3/300
 - 1s - loss: 0.8880 - mean_squared_error: 0.0606
Epoch 4/300
 - 1s - loss: 0.8862 - mean_squared_error: 0.0602
Epoch 5/300
 - 1s - loss: 0.8850 - mean_squared_error: 0.0599
Epoch 6/300
 - 1s - loss: 0.8843 - mean_squared_error: 0.0598
Epoch 7/300
 - 1s - loss: 0.8834 - mean_squared_error: 0.0596
Epoch 8/300
 - 1s - loss: 0.8828 - mean_squared_error: 0.0594
Epoch 9/300
 - 1s - loss

 - 1s - loss: 0.8666 - mean_squared_error: 0.0557
Epoch 124/300
 - 1s - loss: 0.8665 - mean_squared_error: 0.0556
Epoch 125/300
 - 1s - loss: 0.8666 - mean_squared_error: 0.0556
Epoch 126/300
 - 1s - loss: 0.8666 - mean_squared_error: 0.0556
Epoch 127/300
 - 1s - loss: 0.8667 - mean_squared_error: 0.0557
Epoch 128/300
 - 1s - loss: 0.8665 - mean_squared_error: 0.0556
Epoch 129/300
 - 1s - loss: 0.8668 - mean_squared_error: 0.0557
Epoch 130/300
 - 1s - loss: 0.8667 - mean_squared_error: 0.0557
Epoch 131/300
 - 1s - loss: 0.8662 - mean_squared_error: 0.0556
Epoch 132/300
 - 1s - loss: 0.8662 - mean_squared_error: 0.0556
Epoch 133/300
 - 1s - loss: 0.8664 - mean_squared_error: 0.0556
Epoch 134/300
 - 1s - loss: 0.8663 - mean_squared_error: 0.0556
Epoch 135/300
 - 1s - loss: 0.8661 - mean_squared_error: 0.0556
Epoch 136/300
 - 1s - loss: 0.8663 - mean_squared_error: 0.0556
Epoch 137/300
 - 1s - loss: 0.8661 - mean_squared_error: 0.0555
Epoch 138/300
 - 1s - loss: 0.8661 - mean_squared_erro

Epoch 252/300
 - 1s - loss: 0.8634 - mean_squared_error: 0.0549
Epoch 253/300
 - 1s - loss: 0.8634 - mean_squared_error: 0.0549
Epoch 254/300
 - 1s - loss: 0.8632 - mean_squared_error: 0.0548
Epoch 255/300
 - 1s - loss: 0.8630 - mean_squared_error: 0.0548
Epoch 256/300
 - 1s - loss: 0.8633 - mean_squared_error: 0.0548
Epoch 257/300
 - 1s - loss: 0.8632 - mean_squared_error: 0.0548
Epoch 258/300
 - 1s - loss: 0.8630 - mean_squared_error: 0.0548
Epoch 259/300
 - 1s - loss: 0.8633 - mean_squared_error: 0.0548
Epoch 260/300
 - 1s - loss: 0.8632 - mean_squared_error: 0.0548
Epoch 261/300
 - 1s - loss: 0.8628 - mean_squared_error: 0.0547
Epoch 262/300
 - 1s - loss: 0.8629 - mean_squared_error: 0.0548
Epoch 263/300
 - 1s - loss: 0.8630 - mean_squared_error: 0.0548
Epoch 264/300
 - 1s - loss: 0.8629 - mean_squared_error: 0.0547
Epoch 265/300
 - 1s - loss: 0.8629 - mean_squared_error: 0.0547
Epoch 266/300
 - 1s - loss: 0.8631 - mean_squared_error: 0.0548
Epoch 267/300
 - 1s - loss: 0.8630 - mea

Epoch 60/300
 - 1s - loss: 0.8593 - mean_squared_error: 0.0664
Epoch 61/300
 - 1s - loss: 0.8593 - mean_squared_error: 0.0664
Epoch 62/300
 - 1s - loss: 0.8591 - mean_squared_error: 0.0663
Epoch 63/300
 - 1s - loss: 0.8590 - mean_squared_error: 0.0663
Epoch 64/300
 - 1s - loss: 0.8589 - mean_squared_error: 0.0663
Epoch 65/300
 - 1s - loss: 0.8588 - mean_squared_error: 0.0663
Epoch 66/300
 - 1s - loss: 0.8586 - mean_squared_error: 0.0662
Epoch 67/300
 - 1s - loss: 0.8583 - mean_squared_error: 0.0661
Epoch 68/300
 - 1s - loss: 0.8584 - mean_squared_error: 0.0662
Epoch 69/300
 - 1s - loss: 0.8582 - mean_squared_error: 0.0661
Epoch 70/300
 - 1s - loss: 0.8580 - mean_squared_error: 0.0661
Epoch 71/300
 - 1s - loss: 0.8578 - mean_squared_error: 0.0660
Epoch 72/300
 - 1s - loss: 0.8577 - mean_squared_error: 0.0660
Epoch 73/300
 - 1s - loss: 0.8579 - mean_squared_error: 0.0660
Epoch 74/300
 - 1s - loss: 0.8579 - mean_squared_error: 0.0660
Epoch 75/300
 - 1s - loss: 0.8577 - mean_squared_error:

Epoch 189/300
 - 1s - loss: 0.8523 - mean_squared_error: 0.0646
Epoch 190/300
 - 1s - loss: 0.8521 - mean_squared_error: 0.0645
Epoch 191/300
 - 1s - loss: 0.8521 - mean_squared_error: 0.0645
Epoch 192/300
 - 1s - loss: 0.8523 - mean_squared_error: 0.0646
Epoch 193/300
 - 1s - loss: 0.8522 - mean_squared_error: 0.0646
Epoch 194/300
 - 1s - loss: 0.8520 - mean_squared_error: 0.0645
Epoch 195/300
 - 1s - loss: 0.8522 - mean_squared_error: 0.0645
Epoch 196/300
 - 1s - loss: 0.8518 - mean_squared_error: 0.0644
Epoch 197/300
 - 1s - loss: 0.8517 - mean_squared_error: 0.0644
Epoch 198/300
 - 1s - loss: 0.8517 - mean_squared_error: 0.0644
Epoch 199/300
 - 1s - loss: 0.8516 - mean_squared_error: 0.0644
Epoch 200/300
 - 1s - loss: 0.8520 - mean_squared_error: 0.0645
Epoch 201/300
 - 1s - loss: 0.8518 - mean_squared_error: 0.0645
Epoch 202/300
 - 1s - loss: 0.8521 - mean_squared_error: 0.0646
Epoch 203/300
 - 1s - loss: 0.8516 - mean_squared_error: 0.0644
Epoch 204/300
 - 1s - loss: 0.8519 - mea

In [4]:
#     policy0 = MonteCarloTree(starting_game, 
#                            budget = 100, 
#                            num_simulations = 1, 
#                            max_steps_to_simulate = simulation_policies[-2])

#     policy1 = MonteCarloTree(starting_game, 
#                              budget = 100, 
#                              num_simulations = 1, 
#                              max_steps_to_simulate = 60,
#                              simulation_policy = simulation_policies[-1])

#     policy_chooser = PolicyChooser(starting_game, policy0, policy1, num_games = 10)

#     policy_num = policy_chooser.BestPolicy()

Now we assemble the training set.  The training set consists of all nodes (that is, game states) in all the trees created above with depth below a fixed max_depth, together with the win/lose/draw probabilities for Player 0 as an output to be learned.



We train a neural net model using Keras to learn these probabilities.


The model INPUTS:

 - A vector representing the current board state
 
The model OUTPUTS:
 
 - A vector (p0, p1, pdraw) where 
     p0 = Probability of Player 0 winning
     p1 = Probability of Player 1 winning
     pdraw = Probability of draw
     
   ASSUMING that Player 0 is the current player.
   
   (If Player 1 is the current player, we reverse p0 and p1.)

In [14]:
model.input_shape

(None, 72)

In [9]:
timestamp = time.strftime("%Y-%m-%d_%H:%M")

filename1 = 'data_%s_neural_net_model.h5' % timestamp
model.save(filename1)
print("Saved file %s" % filename1)

filename2 = 'data_%s_train_X.h5' % timestamp
train_X.tofile(filename2)
print("Saved file %s" % filename2)


filename3 = 'data_%s_train_Y.h5' % timestamp
train_Y.tofile(filename3)
print("Saved file %s" % filename3)

filename4 = 'data_%s_neural_net_model.json' % timestamp
model_json = model.to_json()
with open(filename4, "w") as json_file:
    json_file.write(model_json)
print("Saved file %s" % filename4)


Saved file data_2018-10-15_15:04_neural_net_model.h5
Saved file data_2018-10-15_15:04_train_X.h5
Saved file data_2018-10-15_15:04_train_Y.h5
Saved file data_2018-10-15_15:04_neural_net_model.json


Here we visualize the progress of a particular game of checkers and the associated win/lose/draw probabilites computed by our trained neural net.

In [6]:
game_to_view = 1
game_states_list = games_list[game_to_view]
rcParams['figure.figsize'] = 4,4

def game_slider(i):
    game_states_list[i].show_board()
    
    v = get_single_board_vector(game_states_list[i], symmetrize = symmetrize)
    X = np.array([v])
    model_output = list(100 * model.predict(X)[0])
    
    if game_states_list[i].player == 1 and symmetrize:
        model_output = [model_output[1], model_output[0], model_output[2]]
        
        
    print(game_states_list[i].notes)
    print("Chance of Player 0 victory: ", model_output[0], '%')
    print("Chance of Player 1 victory: ", model_output[1], '%')
    print("Chance of draw:             ", model_output[2], '%')
    return 'Player: %d' % game_states_list[i].player

interact(game_slider, i = IntSlider(min=0,max=len(game_states_list)-1,step=1,value=0))



interactive(children=(IntSlider(value=0, description='i', max=50), Output()), _dom_classes=('widget-interact',…

<function __main__.game_slider(i)>

In [7]:
# starting_game = checkers_state(board_size = board_size, max_turns = max_turns, tiebreaker_rule = True)

# policy0 = MonteCarloTree(starting_game, 
#                        budget = 100, 
#                        num_simulations = 1, 
#                        max_steps_to_simulate = 60)

# policy1 = MonteCarloTree(starting_game, 
#                          budget = 100, 
#                          num_simulations = 1, 
#                          max_steps_to_simulate = 60,
#                          simulation_policy = NeuralNetPolicy(model, orderliness = 80))

# # policy0 = NeuralNetPolicy(model, orderliness = 10)

# # policy1 = NeuralNetPolicy(model, deterministic = True)

# policy_chooser = PolicyChooser(starting_game, policy0, policy1, num_games = 10)

# policy_num = policy_chooser.BestPolicy()

# print ("Winning policy:", policy_num)


Here we have a single step of a more general training algorithm to be implemented.  The idea is to simultaneously train the neural net while using the output from this model to improve the MCTS algorithm.

- At each step, we have previously determined a weight $\alpha \in [0,1]$ and a neural net f that takes as input a board state $B$ and outputs a triple $(w,l,d)$ corresponding to its estimate of winning, losing or drawing assuming "near-ideal" play.
- Play a large number of games using Monte-Carlo Tree Search.  
- During backpropogation, two 3-tuples are backpropogated:
    - $Q = (w, l, d)$ corresponding to the average number of wins, losses, and draws.
    - $\hat{Q}  = (\hat{w}, \hat{l}, \hat{d}) = f(B)$ corresponding to output of the previously-neural net $f$.
- For each node, set a value y corresponding to that node with $y = \alpha Q + (1-\alpha) \hat{Q}$ for some fixed $\alpha$. (Initially, set $\alpha = 1$; all weight is given to $Q$ when neural net is not yet trained.)
- The best child in MCTS is chosen according to the value of $y$.
- After many games have been played, create training set for neural net consisting of:
   - inputs $X$, game states corresponding to all nodes of depth < max_depth in all game trees
   - outputs $Y$, the the $y$ values given to these game states.
- Train neural net.
- Periodically, re-set the value of $\alpha$.  To find it, play a number of games between the following strategies: 
    - Strategy A: Choose moves by performing MCTS using random policy.
    - Strategy B: Choose moves by performing MCTS using scoring by $f$.

    Choose $\alpha$ to be the proportion of games won by Strategy A (plus half the proportion of games drawn.)
    
If the algorithm is succeeding, the value of $\alpha$ should gradually decrease on average.  Eventually, if $\alpha$ becomes small enough, set $\alpha = 0$ to save time (since evaluation of $\hat{Q}$ should be much faster than evaluation of $Q$.)

Periodically, the 'skill level' of neural net $f$ should be measured by the number of games won by Strategy A vs. Strategy B above, but with possibly with B given a handicap by having a large budget.  e.g., if $f$ is very accurate, Strategy A with a budget of 20 may be able to beat Strategy B with a budget of 100.



Other notes / questions - 

- Should at first play with small board size to see if training is working
- Over time, should increase the MCTS budget as $f$ becomes stronger
- Should Q-values of game trees be shared across turns or games? If so, how?