### 1.4 QNN (Q-Value-Based Neural Network)

While thinking about the extensibility of our code, we noticed that the Q-algorithm reaches its limits in games with a nearly infinite state space. Games such as chess or Go have such a huge state space of possible moves that it cannot be explored completely by trial and error alone. To stay close to the principle of reinforcement learning, other methods have to be found for predicting future moves, or even strategies, in state spaces that appear infinite. For this purpose, a neural network with several layers is to be used.
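To give a feeling for the growth described above, the following minimal sketch computes a rough upper bound on the number of board configurations. It assumes that every field is either empty or occupied by one of two players and ignores whether a position is actually reachable, so the numbers overestimate the true state count. The function name `upper_bound_states` is hypothetical and not part of the project code.

```python
def upper_bound_states(fields, symbols=3):
    """Upper bound on board configurations: every field takes one of `symbols` values."""
    return symbols ** fields


tictactoe = upper_bound_states(9)       # 3x3 tic-tac-toe board
go = upper_bound_states(19 * 19)        # 19x19 Go board

print(tictactoe)          # upper bound on tic-tac-toe boards
print(tictactoe * 9)      # table entries if 9 actions are stored per board
print(len(str(go)))       # number of decimal digits of the bound for Go
```

A Q-table for tic-tac-toe still fits into memory easily, while the bound for a Go board already has well over a hundred decimal digits, which is why a function approximator is used instead of a table.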

The idea is that we no longer try to explore all states and predict them perfectly, but instead try to recognize a structure, or even a strategy, in certain moves. The neural network should therefore focus on recognizing structures in temporally consecutive moves and, for unknown states, output an estimate based on the strategies learned so far.

A neural network usually consists of an input layer, several hidden layers and an output layer. Just like the Q-function, the QNN receives certain parameters as input vectors. These include the current state of the board, the action selected on the basis of previous values from the QNN model and, if applicable, the reward. As with the Q-function, the output is a Q-value that describes the maximum expected reward of the current state-action pair.
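The input described above can be sketched as follows. This is only an illustration, assuming a 3x3 tic-tac-toe board whose fields contain `'x'`, `'o'` or `None` and a move given as an index from 0 to 8. The helper name `encode_input` is hypothetical and not part of the project code.

```python
import numpy as np


def encode_input(board, move):
    """Hypothetical helper: build a (1, 2, 9) input tensor from a board and a move index."""
    # Map the board symbols to numbers: 'x' -> 1, 'o' -> -1, empty field -> 0
    symbols = {'x': 1.0, 'o': -1.0, None: 0.0}
    state = np.array([symbols[field] for row in board for field in row],
                     dtype='float32')

    # One-hot vector for the chosen action (1 at the move index, 0 elsewhere)
    action = np.zeros(9, dtype='float32')
    action[move] = 1.0

    # Stack action and state, then add a batch dimension of size 1
    return np.stack((action, state)).reshape((1, 2, 9))


board = [['o', None, 'o'],
         [None, 'x', None],
         [None, 'x', None]]
tensor = encode_input(board, move=1)
print(tensor.shape)
```

The network then maps such a tensor to a single number, the estimated Q-value of that state-action pair.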

The QNN agent generates its training data itself by exploring the game with an exploration factor called "Theta". Based on the explored data and the rewards obtained, the QNN is expected to find the best moves on its own.
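The exploration step can be sketched as an epsilon-greedy style selection in which "Theta" plays the role of the exploration probability. The snippet is a simplified, self-contained illustration: it works on a plain list of already estimated values instead of a Keras model, and the list `q_values` is made up for the example.

```python
import random


def select_move(q_values, theta=0.1, rng=random):
    """Sketch: explore with probability theta, otherwise pick a best-valued action."""
    if rng.random() < theta:
        # Exploration: pick any action index uniformly at random
        return rng.randrange(len(q_values))
    # Exploitation: choose among all actions that share the highest value
    best = max(q_values)
    return rng.choice([i for i, q in enumerate(q_values) if q == best])


q_values = [0.0, 0.2, -0.1, 0.2]          # made-up estimates for four actions
print(select_move(q_values, theta=0.0))   # with theta=0.0 only index 1 or 3 is possible
```

A larger `theta` produces more varied training data, while a smaller one exploits what the network has already learned.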

In [1]:
from abc import abstractmethod
from pathlib import Path
from build.player.QPlayer import QPlayer
from build.player.qlearner.QLearner import QLearner
from build.Board import Board

import numpy as np
import keras.models as Km
import keras.layers as kl

import os
import random

In [2]:
class QNNLearner(QLearner):
    """
    QLearner specification for neural network learning
    """

    def __init__(self, model=None, learn_rate=0.1, discount_factor=0.8, batch_size=10):
        """
        :param model: used qnn model structure
        :param learn_rate: learning rate of this q learner
        :param discount_factor: discount factor of this q learner
        :param batch_size: number of stored transitions that triggers a batch training run
        """
        self.possible_actions = {0:[1, 1], 1:[2, 1], 2:[3, 1], 3:[1, 2], 4:[2, 2], 5:[3, 2], 6:[1, 3], 7:[2, 3], 8:[3, 3]}
        self.learn_rate = learn_rate
        self.model = model
        self.discount_factor = discount_factor
        self.memory = []
        self.count_memory = 0
        self.batch_size = batch_size

    def update(self, prev_state, state, prev_move, reward):
        """
        Update the Q-player's knowledge by learning offline (in batches) or online
        :param prev_state: previous known state
        :param state: new state
        :param prev_move: previous made move
        :param reward: previous reward
        """
        self.load_to_memory(prev_state, prev_move, state, reward)

        self.count_memory += 1

        #print(self.count_memory)
        if self.count_memory == self.batch_size:
            self.count_memory = 0
            # Offline training
            self.model.learn_batch(self.memory)
            # Online training
            #self.learn(self.prev_state, self.prev_move, state,  -1, self.reward)
            self.memory = []


    def load_to_memory(self, prev_state, prev_move, state, reward):
        """
        Load all q related things into memory to learn in batch
        :param prev_state: previous known state
        :param prev_move: previous made move
        :param state: new state
        :param reward: previous reward
        """
        self.memory.append([prev_state, prev_move, state, reward])

    def select_move(self, state, theta=0.1):
        """
        Select the best move or, if exploring, a random move
        :param state: current state
        :param theta: exploration probability (optional)
        :return: index of the chosen action
        """
        p = random.uniform(0, 1)

        if p > theta:
            action = self.choose_optimal_move(state)
        else:
            # Draw a random key (move index) so that both branches return the same type
            action = random.choice(list(self.possible_actions))

        return action # return chosen move


    def choose_optimal_move(self, state):
        """
        Choose optimal move based on the predicted values of the current state.
        :param state: current state
        :return: index of the move with the highest value (ties are broken randomly)
        """
        best_value = -float('Inf') # start below every possible prediction
        best_moves = [] # indices of all moves sharing the highest value
        for move in self.possible_actions:
            # calc_value returns an array with a single entry; reduce it to a plain float
            value = np.asarray(self.model.calc_value(state, move)).item()

            if value > best_value:
                best_value = value
                best_moves = [move]
            elif value == best_value:
                best_moves.append(move)

        return random.choice(best_moves)


In [3]:
class Model:
    """
    Model class for all 2 player based games with neural network training
    """

    def __init__(self, tag):
        """
        :param tag: used tag for neural network model (e.g. 1 for first player and -1 for second)
        """
        self.tag = tag
        self.epsilon = 0.1
        self.alpha = 0.5
        self.gamma = 1
        self.model = self.load_model()

    def load_model(self):
        """
        Loads previously saved model
        :return: loaded model
        """

        # Use the same tag-dependent file name as save_model(), so that a
        # previously saved model is actually found again
        if self.tag == 1:
            tag = '_first'
        else:
            tag = '_second'
        s = 'model_values' + tag + '.h5'
        model_file = Path(s)

        if model_file.is_file():
            model = Km.load_model(s)
            print('load model: ' + s)
        else:
            model = self.create_model()
        return model

    @abstractmethod
    def create_model(self):
        """
        Create new model with appropriate number of layers and network structure
        :return: created model
        """
        pass

    @abstractmethod
    def state_to_tensor(self, state, move):
        """
        Creates a tensor (2 dim array) based on a state and a move as input vector for nn
        :param state: current state
        :param move: current move
        :return: created tensor
        """
        pass

    def calc_value(self, state, move):
        """
        Build the input tensor and predict its value
        :param state: current state
        :param move: current move
        :return: predicted value (estimated Q-value of the state-action pair)
        """
        tensor = self.state_to_tensor(state, move)
        value = self.model.predict(tensor)
        # K.backend.clear_session()
        return value

    def calc_target(self, prev_state, prev_move, state, reward):
        """
        Calculate the target vector (q value or reward)
        :param prev_state: previous state
        :param prev_move: previous move
        :param state: current state
        :param reward: previous reward
        :return: calculated target value
        """

        qvalue = self.calc_value(prev_state, prev_move)
        v = []
        tensor = self.state_to_tensor(prev_state, prev_move)

        for move in range(len(tensor[:,0][0])):
            v.append(self.calc_value(state, move))

        if reward == 0:
            v_s_tag = self.gamma * np.max(v)
            target = np.array(qvalue + self.alpha * (reward + v_s_tag - qvalue))
        else:
            # v_s_tag = 0
            target = reward

        # target = np.array(v_s + self.alpha * (reward + v_s_tag - v_s))

        # if self.tag == 1:
        #     print('learn general')
        #     print(prev_state, prev_move, state, ava_moves, reward)
        # print('target: ', target)

        return target

    def train_model(self, prev_state, prev_move, target, epochs):
        """
        Train the model based on an input tensor
        :param prev_state: previous state
        :param prev_move: previous move
        :param target: calculated q value or reward (target vector)
        :param epochs: number of epochs
        """

        tensor = self.state_to_tensor(prev_state, prev_move)

        if target is not None:

            if self.tag == 1:
                print('value before training:', self.model.predict(tensor))
            self.model.fit(tensor, target, epochs=epochs, verbose=0)
            # K.backend.clear_session()

            if self.tag == 1:
                print('target:', target)
                print('value after training:', self.model.predict(tensor))

    def save_model(self):
        """
        save model as h5 file
        """
        if self.tag == 1:
            tag = '_first'
        else:
            tag = '_second'
        s = 'model_values' + tag + '.h5'

        try:
            os.remove(s)
        except OSError:
            # No previous model file to delete (or it could not be removed)
            pass

        self.model.save(s)

    def learn_batch(self, memory):
        """
        Learn model with a batch of states and actions from memory
        :param memory: saved states, actions and rewards
        """
        print('start learning player', self.tag)
        print('data length:', len(memory))

        # build x_train
        ind = 0
        #x_train = np.zeros((len(memory), 7, 7, 1))
        x_train = np.zeros((len(memory), 2, 9))
        for v in memory:
            [prev_state, prev_move, _, _] = v
            sample = self.state_to_tensor(prev_state, prev_move)
            x_train[ind, :, :] = sample
            ind += 1

        # train with planning
        # for i in range(5):
        loss = 20
        count = 0
        while loss > 0.02 and count < 10:
            # tic()
            y_train = self.create_targets(memory)
            # toc()
            self.model.fit(x_train, y_train, epochs=5, batch_size=256, verbose=0)
            loss = self.model.evaluate(x_train, y_train, batch_size=256, verbose=0)[0]
            count += 1
            print('planning number:', count, 'loss', loss)

        loss = self.model.evaluate(x_train, y_train, batch_size=256, verbose=0)
        print('player:', self.tag, loss, 'loops', count)

        self.save_model()

    def create_targets(self, memory):
        """
        Create target vector for each state-action-pair in memory
        :param memory: saved states, actions and rewards
        :return: target vector
        """
        y_train_ = np.zeros((len(memory), 1))
        count_ = 0
        for v_ in memory:
            [prev_state_, prev_move_, state_, reward_] = v_
            target = self.calc_target(prev_state_, prev_move_, state_, reward_)
            y_train_[count_, :] = target
            count_ += 1

            # print('---------')
            # print('player', self.tag)
            # print('prev state', prev_state_)
            # print('prev move', prev_move_)
            # print('state', state_)
            # print('ava moves', ava_moves_)
            # print('reward', reward_)
            # print('target', target)
            #
            # value = self.calc_value(prev_state_, prev_move_)
            # print('value through net', value)
            # time.sleep(0.2)

        return y_train_

In [4]:
class TicTacToeModel(Model):
    """
    Special model for tic tac toe games.

    Consists of a 2x9 input tensor, a dense network of 9 layers and a single output value.
    The input tensor holds an array of length 9 for the chosen move (one-hot) and an array of length 9 for the state.
    The output is one predicted value for the given state-action pair.
    """

    def __init__(self, tag):
        super().__init__(tag)

    def create_model(self):
        """
        Creates keras model
        :return: keras model
        """
        print('new model')

        #model = km.load_model("qnn_model")

        model = Km.Sequential()
        model.add(kl.Flatten(input_shape=(2, 9)))
        model.add(kl.Dense(18))
        model.add(kl.LeakyReLU(alpha=0.3))
        model.add(kl.Dense(18))
        model.add(kl.LeakyReLU(alpha=0.3))
        model.add(kl.Dense(18))
        model.add(kl.LeakyReLU(alpha=0.3))
        model.add(kl.Dense(18))
        model.add(kl.LeakyReLU(alpha=0.3))
        model.add(kl.Dense(18))
        model.add(kl.LeakyReLU(alpha=0.3))
        model.add(kl.Dense(18))
        model.add(kl.LeakyReLU(alpha=0.3))
        model.add(kl.Dense(18))
        model.add(kl.LeakyReLU(alpha=0.3))
        model.add(kl.Dense(9))
        model.add(kl.LeakyReLU(alpha=0.3))
        model.add(kl.Dense(1, activation='linear'))

        # adam = ko.Adam(lr=0.001)

        model.compile(optimizer='Adam', loss='mean_absolute_error', metrics=['accuracy'])
        #model.save("qnn_model")

        model.summary()

        return model

    def state_to_tensor(self, state, move):
        """
        Generates a tensor of state and move index
        :param state: current state
        :param move: current move
        :return: tensor (2 dim array)
        """

        # dtype=object keeps None, 'x' and 'o' entries intact until they are encoded
        state = np.array(state, dtype=object)
        state = state.flatten() # flatten the 3x3 matrix into a state vector of length 9
        state = self.one_hot_encode_state(state)

        a = np.zeros(9).astype('float32')
        a[move] = 1 # one-hot encoding of the chosen action (1 for the chosen action and 0 otherwise)

        state = np.asarray(state).astype('float32')
        tensor = np.array((a, state))
        #print(tensor)
        tensor = tensor.reshape((1, 2, 9))

        return tensor

    def one_hot_encode_state(self, state):
        """
        Numeric encoding of the state (despite the method name, not a true one-hot encoding).
        Each field of the 3x3 matrix is mapped to 0 (blank), 1 (player 'x') or -1 (player 'o')
        :param state: state to encode
        :return: encoded state
        """
        for i in range(len(state)):
            if state[i] is None:
                state[i] = 0
            elif state[i] == 'x':
                state[i] = 1
            elif state[i] == 'o':
                state[i] = -1

        return state

In [5]:
model = TicTacToeModel(1)
value = np.zeros((3, 3))   # predicted value for each of the 9 moves
pstate = np.zeros((3, 3))  # board state as a 3x3 matrix
state = [-1, 0, -1, 0, 1, 0, 0, 1, 0]
for i in range(9):
    # Reuse the model created above instead of calling load_model() for every move
    prediction = model.calc_value(state, i)
    value[i // 3][i % 3] = np.asarray(prediction).item()
    pstate[i // 3][i % 3] = state[i]


new model
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 18)                0         
_________________________________________________________________
dense (Dense)                (None, 18)                342       
_________________________________________________________________
leaky_re_lu (LeakyReLU)      (None, 18)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 18)                342       
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 18)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 18)                342       
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU)    (None, 18)       

In [6]:
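print(value)
print(pstate)
best = int(np.argmax(value))  # flat index (0-8) of the highest predicted value
print(f"Optimal: {best}")
# Copy the board so that pstate itself is not modified; note that an untrained
# model may pick a field that is already occupied
newstate = pstate.copy()
newstate[best // 3][best % 3] = 1
print(newstate)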

[[ 0.004319   -0.05889302  0.0786997 ]
 [-0.02056507  0.11485811  0.00638656]
 [-0.02371169 -0.01418965  0.00853384]]
[[-1.  0. -1.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]]
Optimal: 4
[[-1.  0. -1.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]]


In [7]:
from build.Game import Game
from tqdm.notebook import trange

In [8]:
for j in trange(100):

    player1 = QPlayer('x', QNNLearner(model=TicTacToeModel(1)))
    player2 = QPlayer('o', QNNLearner(model=TicTacToeModel(-1)))
    players = [player1, player2]

    game = Game(players)

    for i in range(1000):
        game.run()

  0%|          | 0/100 [00:00<?, ?it/s]

new model
Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_10 (Flatten)         (None, 18)                0         
_________________________________________________________________
dense_90 (Dense)             (None, 18)                342       
_________________________________________________________________
leaky_re_lu_80 (LeakyReLU)   (None, 18)                0         
_________________________________________________________________
dense_91 (Dense)             (None, 18)                342       
_________________________________________________________________
leaky_re_lu_81 (LeakyReLU)   (None, 18)                0         
_________________________________________________________________
dense_92 (Dense)             (None, 18)                342       
_________________________________________________________________
leaky_re_lu_82 (LeakyReLU)   (None, 18)    

KeyboardInterrupt: 