# Selective Three-Ply

"To save computer time, the second ply of search was conducted only for candidate moves that were ranked highly after
the first ply, about four or five moves on average. Two-ply search affected only the moves selected; the
learning process proceeded exactly as before. The final versions of the program, TD-Gammon 3.0 and 3.1, used 160 hidden units and a selective three-ply search."

__Reinforcement Learning: An Introduction, Sutton & Barto, 2017__


To make 3-ply search tolerable, the number of moves searched at the top level is limited to a small number. This may reduce the potential of 3-ply, but it should produce results in a reasonable amount of time.
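The pruning idea can be sketched independently of the backgammon code. The following is a minimal illustration, not the notebook's implementation: `select_top_k`, `selective_search`, and the toy scoring functions are hypothetical names introduced here, and integers stand in for moves.

```python
import heapq

def select_top_k(actions, evaluate_shallow, k=5):
    """Return the k actions with the highest shallow (1-ply) score, best first."""
    return heapq.nlargest(k, actions, key=evaluate_shallow)

def selective_search(actions, evaluate_shallow, evaluate_deep, k=5):
    """Run the expensive evaluation only on the k best shallow candidates."""
    candidates = select_top_k(actions, evaluate_shallow, k)
    return max(candidates, key=evaluate_deep)

# Toy example: integers stand in for moves (assumption for illustration only).
moves = list(range(10))
shallow = lambda m: -abs(m - 6)   # cheap estimate, prefers move 6
deep = lambda m: -abs(m - 5)      # expensive estimate, prefers move 5

# The deep evaluation is called for 5 moves instead of 10.
chosen = selective_search(moves, shallow, deep, k=5)

# Pruning can miss the truly best move if the shallow score ranks it too low.
deep_far = lambda m: -m           # expensive estimate, prefers move 0
pruned_choice = selective_search(moves, shallow, deep_far, k=5)
```

In the second call the deep evaluation would prefer move 0, but that move never reaches the candidate list, which is the loss of potential mentioned above.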

### SelectiveTPVPlayer
The name is abbreviated for obvious reasons.

In [1]:
%load_ext Cython

In [2]:
%%cython

from Player import ThreePlyValuePlayer

class SelectiveTPVPlayer(ThreePlyValuePlayer):

    # Ranks all moves with a 1-ply evaluation and passes only the five best
    # to the 3-ply search to reduce the computational cost
    def get_action(self, actions, game):
        # Save the game state so each trial move can be undone
        old_state = game.get_state()
        moves_score = []
        # Evaluate every legal move with 1-ply
        for a in actions:
            # Execute the move
            game.execute_moves(a, self.player)
            # Evaluate the resulting position and record it
            moves_score.append((a, self.value(game, self.player)))
            # Restore the game state
            game.reset_to_state(old_state)
        # Sort ascending by value; the last five entries are the best moves
        moves_score = sorted(moves_score, key=lambda tup: tup[1])
        top_five = moves_score[-5:]
        top_five_actions = [x for (x, _) in top_five]
        # Evaluate only these candidates with the full 3-ply search
        return ThreePlyValuePlayer.get_action(self, top_five_actions, game)
    
    def get_name(self):
        return "SelectiveTPVPlayer [" + self.value.__name__ + "]"
    
class SelectiveTPMPlayer(SelectiveTPVPlayer):

    def __init__(self, player, model):
        SelectiveTPVPlayer.__init__(self, player, self.get_value)
        self.model = model
        
    def get_value(self, game, player):
        features = game.extractFeatures(player)
        v = self.model.get_output(features)
        v = 1 - v if self.player == game.players[0] else v
        return v
    
    def get_name(self):
        return "SelectiveTPMPlayer [" + self.model.get_name() +"]"

#### 1. Test game

How long does one game take?

PlayerTest now also uses CythonBackgammon to speed up the game.

In [5]:
import Player
import PlayerTest

players = [Player.ValuePlayer('black', Player.singleton), SelectiveTPVPlayer('white', Player.singleton)]

PlayerTest.test(players, 1)

Spiel 0 von 1 geht an ValuePlayer [singleton] ( black )

{'white': 0, 'black': 1}
1 Spiele in  233.90840005874634 Sekunden


__4 minutes__ per game instead of 25 is already a clear improvement!
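As a quick check of that comparison, the measured time can be converted to minutes and set against the quoted 25 minutes. This is only a sketch: the 25-minute figure is taken from the sentence above and is assumed to refer to the unpruned three-ply player, and a single game is a very rough measurement.

```python
# Figures taken from the text above
selective_seconds = 233.90840005874634  # measured time for one game
full_three_ply_minutes = 25             # quoted time per game (assumed: unpruned 3-ply)

selective_minutes = selective_seconds / 60
speedup = full_three_ply_minutes / selective_minutes
print(f"{selective_minutes:.1f} min per game, about {speedup:.1f}x faster")
```

This works out to roughly 3.9 minutes per game, a speedup of about 6.4x.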

In [6]:
PlayerTest.test(players, 10)

Spiel 0 von 10 geht an SelectiveTPVPlayer [singleton] ( white )
Spiel 1 von 10 geht an SelectiveTPVPlayer [singleton] ( white )
Spiel 2 von 10 geht an ValuePlayer [singleton] ( black )
Spiel 3 von 10 geht an SelectiveTPVPlayer [singleton] ( white )
Spiel 4 von 10 geht an ValuePlayer [singleton] ( black )
Spiel 5 von 10 geht an SelectiveTPVPlayer [singleton] ( white )
Spiel 6 von 10 geht an ValuePlayer [singleton] ( black )
Spiel 7 von 10 geht an SelectiveTPVPlayer [singleton] ( white )
Spiel 8 von 10 geht an SelectiveTPVPlayer [singleton] ( white )
Spiel 9 von 10 geht an ValuePlayer [singleton] ( black )

{'white': 6, 'black': 4}
10 Spiele in  2401.3864953517914 Sekunden


Can the selective three-ply singleton player beat TD-Gammon80?

#### TD-Gammon80 vs SelectiveTPVPlayer Singleton

In [9]:
import tensorflow as tf
from NeuralNetModel import TDGammonModel

graph = tf.Graph()
sess = tf.Session(graph=graph)
with sess.as_default(), graph.as_default():
    model = TDGammonModel(sess, hidden_size = 80, name = "TD-Gammon80", restore=True)
    model.test(games = 10, enemyPlayer = SelectiveTPVPlayer('white', Player.singleton))

Restoring checkpoint: checkpoints/TD-Gammon80/checkpoint.ckpt-1527600
INFO:tensorflow:Restoring parameters from checkpoints/TD-Gammon80/checkpoint.ckpt-1527600
[Game 0] ModelPlayer [TD-Gammon] (black) vs SelectiveTPVPlayer [singleton] (white) 1:0 of 1 games (100.00%)
[Game 1] ModelPlayer [TD-Gammon] (black) vs SelectiveTPVPlayer [singleton] (white) 2:0 of 2 games (100.00%)
[Game 2] ModelPlayer [TD-Gammon] (black) vs SelectiveTPVPlayer [singleton] (white) 3:0 of 3 games (100.00%)
[Game 3] ModelPlayer [TD-Gammon] (black) vs SelectiveTPVPlayer [singleton] (white) 4:0 of 4 games (100.00%)
[Game 4] ModelPlayer [TD-Gammon] (black) vs SelectiveTPVPlayer [singleton] (white) 4:1 of 5 games (80.00%)
[Game 5] ModelPlayer [TD-Gammon] (black) vs SelectiveTPVPlayer [singleton] (white) 5:1 of 6 games (83.33%)
[Game 6] ModelPlayer [TD-Gammon] (black) vs SelectiveTPVPlayer [singleton] (white) 6:1 of 7 games (85.71%)
[Game 7] ModelPlayer [TD-Gammon] (black) vs SelectiveTPVPlayer [singleton] (white) 6:2 

In [3]:
import Player
import PlayerTest
import tensorflow as tf
from NeuralNetModel import TDGammonModel

graph = tf.Graph()
sess = tf.Session(graph=graph)
with sess.as_default(), graph.as_default():
    model = TDGammonModel(sess, restore=True)
    players = [Player.ModelPlayer('black', model), SelectiveTPMPlayer('white', model)]
    PlayerTest.test(players, games=10)

Restoring checkpoint: checkpoints/TD-Gammon/checkpoint.ckpt-1593683
INFO:tensorflow:Restoring parameters from checkpoints/TD-Gammon/checkpoint.ckpt-1593683
Spiel 0 von 10 geht an SelectiveTPMPlayer [TD-Gammon] ( white )
Spiel 1 von 10 geht an SelectiveTPMPlayer [TD-Gammon] ( white )
Spiel 2 von 10 geht an SelectiveTPMPlayer [TD-Gammon] ( white )
Spiel 3 von 10 geht an SelectiveTPMPlayer [TD-Gammon] ( white )
Spiel 4 von 10 geht an ModelPlayer [TD-Gammon] ( black )
Spiel 5 von 10 geht an SelectiveTPMPlayer [TD-Gammon] ( white )
Spiel 6 von 10 geht an SelectiveTPMPlayer [TD-Gammon] ( white )
Spiel 7 von 10 geht an SelectiveTPMPlayer [TD-Gammon] ( white )
Spiel 8 von 10 geht an SelectiveTPMPlayer [TD-Gammon] ( white )
Spiel 9 von 10 geht an SelectiveTPMPlayer [TD-Gammon] ( white )

{'white': 9, 'black': 1}
ModelPlayer [TD-Gammon] vs. SelectiveTPMPlayer [TD-Gammon] : 10.0 %
10 Spiele in  74149.61713314056 Sekunden
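Ten games are a small sample, so the win counts above should be read with some caution. As a rough aid, the sketch below computes a 95% Wilson score interval for a win rate. The helper `wilson_interval` is a hypothetical name introduced here and is not part of `PlayerTest`. Using this interval is an assumption about how one might quantify the uncertainty.

```python
import math

def wilson_interval(wins, games, z=1.96):
    """95% Wilson score interval for a win rate (z=1.96 by default)."""
    p = wins / games
    denom = 1 + z**2 / games
    center = (p + z**2 / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    return center - half, center + half

# SelectiveTPMPlayer vs. ModelPlayer: 9 wins in 10 games
low_9, high_9 = wilson_interval(9, 10)
# SelectiveTPVPlayer vs. ValuePlayer: 6 wins in 10 games
low_6, high_6 = wilson_interval(6, 10)

# Average duration of the last run, in hours per game
hours_per_game = 74149.61713314056 / 10 / 3600

print(f"9/10: {low_9:.2f} to {high_9:.2f}")
print(f"6/10: {low_6:.2f} to {high_6:.2f}")
print(f"{hours_per_game:.1f} h per game")
```

Under these assumptions the interval for 9 of 10 wins stays above 50%, while the interval for 6 of 10 wins includes 50%, so the singleton result alone does not show that the selective player is stronger.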
