### 1.4 QNN (Q-Value-Based Neural Network)

While considering how to extend our code, we noticed that the Q-learning algorithm reaches its limits in games with a practically unbounded state space. Games such as chess or Go have so many possible positions that they cannot be explored exhaustively by trial and error alone. To keep following the principle of reinforcement learning, other methods are needed to predict future moves, or even strategies, in such seemingly infinite state spaces. A neural network with several layers is intended to fill this role.
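
To give a feeling for the difference in scale, the following sketch computes a crude upper bound on the number of board configurations when every cell can be empty or hold one of two stones. The helper name `upper_bound_states` is an assumption made for illustration, and the bound deliberately ignores game rules, so it also counts positions that can never occur in a real game.

```python
def upper_bound_states(cells, values_per_cell=3):
    """Crude upper bound: every cell independently takes one of the given values."""
    return values_per_cell ** cells


# Tic-tac-toe: 9 cells, each empty, 'x' or 'o'
tic_tac_toe = upper_bound_states(9)

# Go on a 19x19 board: 361 intersections, each empty, black or white
go_19x19 = upper_bound_states(19 * 19)

print(tic_tac_toe)          # 19683
print(len(str(go_19x19)))   # number of decimal digits of the Go bound: 173
```

A table with fewer than twenty thousand rows is easy to fill by playing games, whereas a number with 173 digits rules out any table-based approach.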

The idea is that we no longer try to explore every state and predict it perfectly, but instead recognise structure, or even a strategy, in certain moves. The neural network should therefore focus on recognising patterns in sequences of consecutive moves and, for unknown states, output an estimate based on the strategies learned so far.

A neural network usually consists of an input layer, several hidden layers, and an output layer. Like the Q-function, the QNN is to receive certain parameters as input: the current state of the board, the action selected on the basis of previous values from the QNN model, and, where applicable, the reward. As with the Q-function, the output is to be a Q-value describing the maximum expected reward of the current state-action pair. In the implementation below, the network itself receives only the encoded board state and outputs one Q-value per action; the chosen action and the reward are used to build the training target.
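
As a rough illustration of how a reward can turn into a training signal for such a network, the sketch below builds a target vector in plain Python. The function name `build_target_vector` and the example numbers are assumptions made for illustration; the lists `q_prev` and `q_next` stand in for the network's predictions for the previous and the new state.

```python
def build_target_vector(q_prev, q_next, action, reward, discount_factor=0.8):
    """Copy q_prev and overwrite the entry of the taken action with the Q-learning target."""
    # target = reward + discount * best predicted Q-value of the new state
    target = reward + discount_factor * max(q_next)
    target_vector = list(q_prev)   # leave the other actions unchanged
    target_vector[action] = target
    return target_vector


# Hypothetical predictions for a game with three possible actions
q_prev = [0.0, 0.2, -0.1]
q_next = [0.5, 0.1, 0.3]

# Action 1 was taken and rewarded with 1.0, so its target is 1.0 + 0.8 * 0.5
updated = build_target_vector(q_prev, q_next, action=1, reward=1.0)
print(updated)
```

Only the entry of the action that was actually played changes, so a regression loss on this vector pushes the network towards the new estimate without disturbing its predictions for the other actions.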

The QNN agent generates its training data itself through exploration, controlled by an exploration factor called "Theta". From the explored data and the rewards obtained, the QNN is expected to work out the best moves on its own.
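
The role of "Theta" can be sketched as follows, assuming it is the probability of choosing a random action instead of the currently best one. The function `select_action` and its `rng` parameter are hypothetical names used only in this example.

```python
import random


def select_action(q_values, theta=0.1, rng=random):
    """With probability theta explore randomly, otherwise take the best known action."""
    if rng.random() < theta:
        # explore: any action index, chosen uniformly at random
        return rng.randrange(len(q_values))
    # exploit: index of the highest Q-value
    return max(range(len(q_values)), key=lambda i: q_values[i])


q_values = [0.1, 0.9, 0.3]
print(select_action(q_values, theta=0.0))   # never explores, so always action 1
print(select_action(q_values, theta=1.0))   # always explores, any of 0, 1, 2
```

A small Theta mostly replays what the agent already believes to be good, while a larger value produces more varied training data at the cost of weaker play during training.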

In [26]:
from abc import abstractmethod
from pathlib import Path
from build.player.QPlayer import QPlayer
from build.player.qlearner.QLearner import QLearner
from build.Board import Board

import numpy as np
import keras.models as Km
import keras.layers as kl

import os
import random

In [44]:
class QNNLearner(QLearner):
    """
    QLearner specialization that learns q values with a neural network
    """

    def __init__(self, model=None, learn_rate=0.1, discount_factor=0.8, batch_size=10):
        """
        :param model: qnn model wrapper used for prediction and training
        :param learn_rate: learning rate of this q learner
        :param discount_factor: discount factor of this q learner
        :param batch_size: batch size stored on this q learner
        """
        self.possible_actions = {0:[1, 1], 1:[2, 1], 2:[3, 1], 3:[1, 2], 4:[2, 2], 5:[3, 2], 6:[1, 3], 7:[2, 3], 8:[3, 3]}
        self.learn_rate = learn_rate
        self.model = model
        self.discount_factor = discount_factor
        self.memory = []
        self.count_memory = 0
        self.batch_size = batch_size

    def update(self, prev_state, state, prev_move, reward):
        """
        Update the q player's knowledge with one training step on the latest transition
        :param prev_state: previous known state
        :param state: new state
        :param prev_move: previous made move (action index)
        :param reward: reward received for the previous move
        """
        self.model.train_model(prev_state, state, prev_move, reward, self.discount_factor)

    def select_move(self, state, theta=0.1):
        """
        Select the best known move or, with probability theta, a random move
        :param state: current state
        :param theta: exploration probability (optional)
        :return: index of the chosen action
        """
        p = random.uniform(0, 1)
        if p > theta and state is not None:
            # exploit: take the action with the highest predicted q value
            action = np.argmax(self.model.predict_model(state))
        else:
            # explore: pick any of the possible actions uniformly at random
            action = np.random.randint(0, len(self.possible_actions))

        return action  # return chosen move


In [82]:
class Model:
    """
    Model class for all 2 player based games with neural network training
    """

    def __init__(self, tag, observation_space=9, action_space=9, discount_factor=0.8):
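        """
        :param tag: used tag for neural network model (e.g. 1 for first player and -1 for second)
        :param observation_space: size of the input vector
        :param action_space: number of possible actions (size of the output vector)
        :param discount_factor: discount factor (not stored here; train_model receives it per call)
        """
        self.tag = tag
        self.epsilon = 0.1
        self.alpha = 0.5
        self.gamma = 1
        # set the dimensions first: create_model(), called via load_model(), reads them
        self.observation_space = observation_space
        self.action_space = action_space
        self.model = self.load_model()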

    def load_model(self):
        """
        Loads previously saved model
        :return: loaded model
        """
        if self.tag == 1:
            tag = '_first'
        else:
            tag = '_second'

        # note: save_model() writes 'model_values' + tag + '.h5', without the '_performant' suffix
        s = 'model_values' + tag + '_performant.h5'
        model_file = Path(s)

        if model_file.is_file():
            model = Km.load_model(s)
            print('load model: ' + s)
        else:
            model = self.create_model()
        return model

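    def predict_model(self, state):
        """
        Predict the q values of all actions for a state
        :param state: current state
        :return: array with one q value per action
        """
        state = self.one_hot_encode_state(state)
        target = self.model.predict(state)
        return target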

    @abstractmethod
    def create_model(self):
        """
        Create new model with appropriate number of layers and network structure
        :return: created model
        """
        pass

    @abstractmethod
    def one_hot_encode_state(self, state):
        """
        Creates a tensor (2 dim array) based on a state as input vector for nn
        :param state: current state
        :return: created tensor
        """
        pass

    def train_model(self, prev_state, new_state, prev_move, reward, discount_factor=0.8):
        """
        Train the model on a single transition
        :param prev_state: previous state
        :param new_state: state reached after the previous move
        :param prev_move: previous move (action index)
        :param reward: reward received for the previous move
        :param discount_factor: discount factor for the best q value of the new state
        """
        prev_state = self.one_hot_encode_state(prev_state)
        new_state = self.one_hot_encode_state(new_state)

        # q learning target: reward plus discounted best q value of the new state
        target = reward + discount_factor * np.max(self.model.predict(new_state))
        # keep the current predictions for all other actions, overwrite only the taken one
        target_vector = self.model.predict(prev_state)[0]
        target_vector[prev_move] = target
        self.model.fit(prev_state, target_vector.reshape(-1, self.action_space), epochs=1, verbose=0)

        # loss = self.model.evaluate(prev_state, target_vector.reshape(-1, self.action_space), batch_size=256, verbose=0)[0]
        # print('planning number: loss', loss)

    def save_model(self):
        """
        save model as h5 file
        """
        if self.tag == 1:
            tag = '_first'
        else:
            tag = '_second'
        s = 'model_values' + tag + '.h5'

        try:
            os.remove(s)
        except OSError:
            # nothing to remove, e.g. no model has been saved yet
            pass

        self.model.save(s)

In [83]:
class TicTacToeModel(Model):
    """
    Special model for tic tac toe games.

    Consists of a 1x9 input vector, one hidden dense layer with 20 units and a 9 sized output vector.
    Input vector is the encoded board state (one entry per field).
    Output vector holds 9 q values (one for each action).
    """

    def __init__(self, tag, observation_space=9, action_space=9, discount_factor=0.8):
        # set the dimensions before super().__init__, which builds or loads the keras model
        self.observation_space = observation_space
        self.action_space = action_space
        super().__init__(tag, self.observation_space, self.action_space, discount_factor)

    def create_model(self):
        """
        Creates keras model
        :return: keras model
        """
        print('new model')

        model = Km.Sequential()
        # one encoded board per call, hence a fixed batch size of 1
        model.add(kl.InputLayer(batch_input_shape=(1, self.observation_space)))
        model.add(kl.Dense(20, activation='relu'))
        # linear output: one unbounded q value per action
        model.add(kl.Dense(self.action_space, activation='linear'))
        model.compile(loss='mse', optimizer='adam', metrics=['mae'])

        model.summary()

        return model

    def one_hot_encode_state(self, state):
        """
        Encodes the state as a numeric vector (values -1/0/1, so not a strict one hot encoding).
        Each field of the 3x3 matrix is represented as 0 (blank), 1 (player 1, 'x') or -1 (player 2, 'o')
        :param state: state to encode
        :return: encoded state with shape (1, 9)
        """
        state = state.flatten()
        for i in range(len(state)):
            if state[i] is None:
                state[i] = 0
            if state[i] == 'x':
                state[i] = 1
            if state[i] == 'o':
                state[i] = -1
        state = np.asarray(state).astype('int')
        state = state.reshape((1, 9))

        return state
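
The encoding used by `one_hot_encode_state` can be tried out in isolation. The sketch below is a dependency-free variant under the assumption that the board is passed as three rows of `None`, `'x'` or `'o'`; the names `SYMBOL_TO_VALUE` and `encode_board` are hypothetical and not part of the classes above.

```python
# Same mapping as in one_hot_encode_state: blank -> 0, 'x' -> 1, 'o' -> -1
SYMBOL_TO_VALUE = {None: 0, "x": 1, "o": -1}


def encode_board(board):
    """Flatten a 3x3 board row by row into a list of 9 integers."""
    return [SYMBOL_TO_VALUE[cell] for row in board for cell in row]


board = [
    [None, "x", "x"],
    [None, "o", "o"],
    [None, None, None],
]
encoded = encode_board(board)
print(encoded)   # [0, 1, 1, 0, -1, -1, 0, 0, 0]
```

The position in the list corresponds to the action index, so entry 0 is the top-left field and entry 8 the bottom-right one.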

In [84]:
from build.Game import Game
from tqdm.notebook import trange

In [None]:
for j in trange(100):

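    # the models are built inside the loop: unless a saved '*_performant.h5' file exists,
    # every outer iteration starts from a freshly initialised network
    model1 = TicTacToeModel(1)
    model2 = TicTacToeModel(-1)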
    player1 = QPlayer('x', QNNLearner(model=model1))
    player2 = QPlayer('o', QNNLearner(model=model2))
    players = [player1, player2]

    game = Game(players)

    for i in range(10):
        game.run()

  0%|          | 0/100 [00:00<?, ?it/s]

new model
Model: "sequential_42"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_72 (Dense)             (1, 20)                   200       
_________________________________________________________________
dense_73 (Dense)             (1, 9)                    189       
Total params: 389
Trainable params: 389
Non-trainable params: 0
_________________________________________________________________
new model
Model: "sequential_43"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_74 (Dense)             (1, 20)                   200       
_________________________________________________________________
dense_75 (Dense)             (1, 9)                    189       
Total params: 389
Trainable params: 389
Non-trainable params: 0
_________________________________________________________________
new model
Mode

In [80]:
model = TicTacToeModel(1)
state = np.array([[None,"x","x"],[None,"o","o"],[None,None,None]])
model.predict_model(state)

new model
Model: "sequential_37"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_62 (Dense)             (1, 20)                   200       
_________________________________________________________________
dense_63 (Dense)             (1, 9)                    189       
Total params: 389
Trainable params: 389
Non-trainable params: 0
_________________________________________________________________


array([[-0.75077385,  0.39786673, -0.15094791,  0.4045996 , -0.13271517,
         0.5515286 ,  0.11903673, -0.11275794,  0.23266867]],
      dtype=float32)
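
In the prediction above, the largest value belongs to index 5, a field that is already occupied by 'o'. One possible way to deal with this, assuming occupied fields should be excluded when a move is chosen rather than being learned through rewards, is to restrict the argmax to empty fields. The helper `best_legal_action` is a hypothetical name for this sketch; the Q-values are copied from the output above.

```python
def best_legal_action(q_values, encoded_board):
    """Return the index of the highest Q-value among empty fields, or None if the board is full."""
    legal = [i for i, cell in enumerate(encoded_board) if cell == 0]
    if not legal:
        return None
    return max(legal, key=lambda i: q_values[i])


# Q-values of the untrained model shown above
q_values = [-0.75077385, 0.39786673, -0.15094791, 0.4045996, -0.13271517,
            0.5515286, 0.11903673, -0.11275794, 0.23266867]
# Encoded board of the same example: 0 blank, 1 'x', -1 'o'
encoded_board = [0, 1, 1, 0, -1, -1, 0, 0, 0]

greedy = max(range(len(q_values)), key=lambda i: q_values[i])
print(greedy)                                       # 5, an occupied field
print(best_legal_action(q_values, encoded_board))   # 3, the best empty field
```

Since the model here is freshly initialised, the values themselves carry no meaning yet; the example only shows the selection step.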