# State of tictactoe game

This code defines a class `State` representing the state of a Tic-Tac-Toe game board.

1. **Initialization**: 
    - `__init__()` initializes the board dimensions, win streak length, the board itself (initialized as an empty numpy array), current player color, winning color, and a record of actions.
   
   
2. **String Representations**: 
    - `action2str()` converts a numerical action to its corresponding string representation (e.g., 0 to 'A1').
    - `str2action()` converts a string representation of an action to its numerical counterpart.
    - `record_string()` generates a string representation of the record of actions taken.


3. **Printable Representation**:
    - `__str__()` returns a string representing the current state of the board, with row and column labels, and marks for player positions.


4. **Game Mechanics**:
    - `play()` updates the board state with a given action, checks for win conditions, and switches player turns.
    - `terminal()` checks if the game has ended either by a win or by filling the board.
    - `terminal_reward()` returns the reward for the terminal state.
    - `legal_actions()` returns a list of legal actions available in the current state.


5. **Neural Network Features**:
    - `feature()` generates an input tensor representing the current state of the board for a neural network.
    - `action_feature()` generates an input tensor representing a single action for a neural network.

In [1]:
import numpy as np

class State:
    '''Board implementation of Tic-Tac-Toe'''
    def __init__(self, width=3, height=3, winstreak=3):
        self.width = width
        self.height = height
        self.winstreak = winstreak
        self.board = np.zeros((height, width)) # (y, x)
        self.color = 1
        self.win_color = 0
        self.record = []

    def action2str(self, a):
        return f"{chr(65 + a // self.width)}{a % self.width + 1}"

    def str2action(self, s):
        return (ord(s[0]) - 65) * self.width + int(s[1]) - 1

    def record_string(self):
        return ' '.join([self.action2str(a) for a in self.record])

    def __str__(self):
        s = '  ' + ' '.join(str(i + 1) for i in range(self.width)) + '\n'
        for i in range(self.height):
            s += chr(65 + i) + ' ' + ' '.join(['_' if cell == 0 else ('O' if cell == 1 else 'X') for cell in self.board[i]]) + '\n'
        s += 'record = ' + self.record_string()
        return s

    def play(self, action):
        if isinstance(action, str):
            for astr in action.split():
                self.play(self.str2action(astr))
            return self

        x, y = action % self.width, action // self.width
        self.board[y, x] = self.color

        for direction in [(0, 1), (1, 0), (1, 1), (1, -1)]:
            dx, dy = direction
            count = 0
            while 0 <= x < self.width and 0 <= y < self.height and self.board[y, x] == self.color:
                count += 1
                if count >= self.winstreak:
                    self.win_color = self.color
                    break
                x += dx
                y += dy
            x, y = action % self.width, action // self.width

            dx, dy = -direction[0], -direction[1]
            count = 0  # Adjusting for the initial stone counted twice
            while 0 <= x < self.width and 0 <= y < self.height and self.board[y, x] == self.color:
                count += 1
                if count >= self.winstreak:
                    self.win_color = self.color
                    break
                x += dx
                y += dy

        self.color = -self.color
        self.record.append(action)
        return self

    def terminal(self):
        return self.win_color != 0 or len(self.record) == self.width * self.height

    def terminal_reward(self):
        return self.win_color if self.color == 1 else -self.win_color

    def legal_actions(self):
        return [a for a in range(self.width * self.height) if self.board[a // self.width, a % self.width] == 0]
    
    def feature(self):
        # input tensor for neural net (state)
        return np.stack([self.board == self.color, self.board == -self.color]).astype(np.float32)

    def action_feature(self, action):
        # input tensor for neural net (action)
        a = np.zeros((1, self.height, self.width), dtype=np.float32)
        a[0, action // self.width, action % self.width] = 1
        return a

In [2]:
# examples

state = State().play('A1 B1 A2 B2')
print(state)
print(state.legal_actions())
print(state.terminal())
print()

state = State().play('A1 A2 B2 A3')
print(state)
print(state.legal_actions())
print(state.terminal())
print()

state = State().play('A1 B1 A2 B2 A3')
print(state)
print(state.legal_actions())
print(state.terminal())
print()

state = State().play('A1 A2 B2 A3 C3')
print(state)
print(state.legal_actions())
print(state.terminal())

  1 2 3
A O O _
B X X _
C _ _ _
record = A1 B1 A2 B2
[2, 5, 6, 7, 8]
False

  1 2 3
A O X X
B _ O _
C _ _ _
record = A1 A2 B2 A3
[3, 5, 6, 7, 8]
False

  1 2 3
A O O O
B X X _
C _ _ _
record = A1 B1 A2 B2 A3
[5, 6, 7, 8]
True

  1 2 3
A O X X
B _ O _
C _ _ O
record = A1 A2 B2 A3 C3
[3, 5, 6, 7]
True


In [3]:
state = State().play('A1 B1 A2 B2 A3')
print(state)
print(state.terminal())
print(state.legal_actions())

  1 2 3
A O O O
B X X _
C _ _ _
record = A1 B1 A2 B2 A3
True
[5, 6, 7, 8]


# Neural network architecture

This code defines a convolutional neural network architecture using PyTorch.

1. **Convolutional Layer** (`Conv` class):
   - `__init__()`: Initializes a 2D convolutional layer with the specified number of input and output channels (filters), kernel size, and optional batch normalization.
   - `forward()`: Performs the forward pass through the convolutional layer, applying convolution and batch normalization if specified.


2. **Residual Block** (`ResidualBlock` class):
   - `__init__()`: Initializes a residual block consisting of two convolutional layers with the same number of input and output channels, followed by ReLU activation.
   - `forward()`: Defines the forward pass of the residual block, adding the input to the output of the convolutional layer and applying ReLU activation.

In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Conv(nn.Module):
    def __init__(self, filters0, filters1, kernel_size, bn=False):
        super().__init__()
        self.conv = nn.Conv2d(filters0, filters1, kernel_size, stride=1, padding=kernel_size//2, bias=False)
        self.bn = None
        if bn:
            self.bn = nn.BatchNorm2d(filters1)

    def forward(self, x):
        h = self.conv(x)
        if self.bn is not None:
            h = self.bn(h)
        return h

class ResidualBlock(nn.Module):
    def __init__(self, filters):
        super().__init__()
        self.conv = Conv(filters, filters, 3, True)

    def forward(self, x):
        return F.relu(x + (self.conv(x)))

This code defines a neural network architecture for a reinforcement learning agent using PyTorch.

1. **Representation Network** (`Representation` class):
   - This class converts observations into an inner abstract state.
   - `__init__()`: Initializes the network with convolutional layers and residual blocks.
   - `forward()`: Performs the forward pass through the network, applying ReLU activation after each convolutional layer.


2. **Prediction Network** (`Prediction` class):
   - This network predicts policy (action probabilities) and value (expected outcome) from the inner abstract state.
   - `__init__()`: Initializes the network with convolutional and fully connected layers.
   - `forward()`: Performs the forward pass, returning softmax probabilities for actions and a value estimate.


3. **Dynamics Network** (`Dynamics` class):
   - This network models abstract state transitions given the inner abstract state and an action.
   - `__init__()`: Initializes the network with convolutional layers and residual blocks.
   - `forward()`: Performs the forward pass, concatenating the inner abstract state and action before processing.


4. **Overall Network** (`Net` class):
   - Combines the representation, prediction, and dynamics networks into a single model.
   - `__init__()`: Initializes the entire network by creating instances of the representation, prediction, and dynamics networks.
   - `predict()`: Predicts policy and value from the original state and a sequence of actions.


Overall, this architecture is designed for a reinforcement learning agent to learn and make decisions in an environment, where the representation network encodes observations, the prediction network estimates policy and value, and the dynamics network models state transitions. The `predict()` method combines these components to provide predictions based on the current state and action sequences.

In [5]:
num_filters = 16
num_blocks = 4

class Representation(nn.Module):
    ''' Conversion from observation to inner abstract state '''
    def __init__(self, input_shape):
        super().__init__()
        self.input_shape = input_shape
        self.board_size = self.input_shape[1] * self.input_shape[2]

        self.layer0 = Conv(self.input_shape[0], num_filters, 3, bn=True)
        self.blocks = nn.ModuleList([ResidualBlock(num_filters) for _ in range(num_blocks)])

    def forward(self, x):
        h = F.relu(self.layer0(x))
        for block in self.blocks:
            h = block(h)
        return h

    def inference(self, x):
        self.eval()
        with torch.no_grad():
            rp = self(torch.from_numpy(x).unsqueeze(0))
        return rp.cpu().numpy()[0]

class Prediction(nn.Module):
    ''' Policy and value prediction from inner abstract state '''
    def __init__(self, action_shape):
        super().__init__()
        self.board_size = np.prod(action_shape[1:])
        self.action_size = action_shape[0] * self.board_size

        self.conv_p1 = Conv(num_filters, 4, 1, bn=True)
        self.conv_p2 = Conv(4, 1, 1)

        self.conv_v = Conv(num_filters, 4, 1, bn=True)
        self.fc_v = nn.Linear(self.board_size * 4, 1, bias=False)

    def forward(self, rp):
        h_p = F.relu(self.conv_p1(rp))
        h_p = self.conv_p2(h_p).view(-1, self.action_size)

        h_v = F.relu(self.conv_v(rp))
        h_v = self.fc_v(h_v.view(-1, self.board_size * 4))

        # range of value is -1 ~ 1
        return F.softmax(h_p, dim=-1), torch.tanh(h_v)

    def inference(self, rp):
        self.eval()
        with torch.no_grad():
            p, v = self(torch.from_numpy(rp).unsqueeze(0))
        return p.cpu().numpy()[0], v.cpu().numpy()[0][0]

class Dynamics(nn.Module):
    '''Abstract state transition'''
    def __init__(self, rp_shape, act_shape):
        super().__init__()
        self.rp_shape = rp_shape
        self.layer0 = Conv(rp_shape[0] + act_shape[0], num_filters, 3, bn=True)
        self.blocks = nn.ModuleList([ResidualBlock(num_filters) for _ in range(num_blocks)])

    def forward(self, rp, a):
        h = torch.cat([rp, a], dim=1)
        h = self.layer0(h)
        for block in self.blocks:
            h = block(h)
        return h

    def inference(self, rp, a):
        self.eval()
        with torch.no_grad():
            rp = self(torch.from_numpy(rp).unsqueeze(0), torch.from_numpy(a).unsqueeze(0))
        return rp.cpu().numpy()[0]

class Net(nn.Module):
    '''Whole net'''
    def __init__(self):
        super().__init__()
        state = State()
        input_shape = state.feature().shape
        action_shape = state.action_feature(0).shape
        rp_shape = (num_filters, *input_shape[1:])

        self.representation = Representation(input_shape)
        self.prediction = Prediction(action_shape)
        self.dynamics = Dynamics(rp_shape, action_shape)

    def predict(self, state0, path):
        '''Predict p and v from original state and path'''
        outputs = []
        x = state0.feature()
        rp = self.representation.inference(x)
        outputs.append(self.prediction.inference(rp))
        for action in path:
            a = state0.action_feature(action)
            rp = self.dynamics.inference(rp, a)
            outputs.append(self.prediction.inference(rp))
        return outputs

In [6]:
def show_net(net, state):
    '''Display policy (p) and value (v)'''
    print(state)
    p, v = net.predict(state, [])[-1]
    print('p = ')
    print((p * 1000).astype(int).reshape((-1, *net.representation.input_shape[1:3])))
    print('v = ', v)
    print()

#  Outputs before training
show_net(Net(), State())

  1 2 3
A _ _ _
B _ _ _
C _ _ _
record = 
p = 
[[[111 111 111]
  [111 111 111]
  [111 111 111]]]
v =  0.0



# Monte Carlo Tree Search (MCTS)

This code defines a class `Node` representing the search result of one abstract (or root) state in a tree search algorithm, such as Monte Carlo Tree Search (MCTS). Here's a breakdown of its components:


1. **Initialization**: 
    - `__init__()`: Initializes a node with policy (`p`) and value (`v`) estimates for the state. It also initializes arrays to store visit counts (`n`) and cumulative action values (`q_sum`). Additionally, it maintains overall visit count (`n_all`) and cumulative value (`q_sum_all`) for the node.


2. **Update Method**: 
    - `update()`: Updates the node statistics after selecting and evaluating an action. It takes the selected action index and the new value estimate (`q_new`) as inputs. It increments the visit count and adds the new value to the cumulative sum for the selected action. Then, it updates the overall visit count and cumulative value for the node.

Overall, this class is used to maintain statistics for actions taken from a specific state during the search process, enabling informed action selection based on exploration-exploitation trade-offs.

In [7]:
# Implementation of Monte Carlo Tree Search

class Node:
    '''Search result of one abstract (or root) state'''
    def __init__(self, p, v):
        self.p, self.v = p, v
        self.n, self.q_sum = np.zeros_like(p), np.zeros_like(p)
        self.n_all, self.q_sum_all = 1, v / 2 # prior

    def update(self, action, q_new):
        # Update
        self.n[action] += 1
        self.q_sum[action] += q_new

        # Update overall stats
        self.n_all += 1
        self.q_sum_all += q_new

This code implements a Monte Carlo Tree Search (MCTS) algorithm along with a supporting tree structure for reinforcement learning tasks. Here's a detailed breakdown:

1. **Node Class**
    - Purpose: Represents the search result of one abstract (or root) state in the tree search.
    - `__init__()`: Initializes the node with policy (`p`) and value (`v`) estimates. Also initializes arrays to store visit counts (`n`) and cumulative action values (`q_sum`). Additionally, maintains overall visit count (`n_all`) and cumulative value (`q_sum_all`) for the node.
    - `update()`: Updates the node statistics after selecting and evaluating an action.


2. **Tree Class (Monte Carlo Tree Search)**
    - Purpose: Implements the MCTS algorithm.
    - `__init__()`: Initializes the MCTS tree with a neural network (`net`) and an empty dictionary to store nodes.
    - `search()`: Conducts a search from the current state recursively. Updates nodes and selects actions based on exploration-exploitation trade-offs (PUCB formula).
    - `think()`: Performs a series of MCTS simulations to determine the best action given the current state. Optionally, it can display the search progress.
    - `pv()`: Returns the principal variation, i.e., the action sequence considered as the best.


3. **General Comments**
    - The MCTS algorithm iteratively builds a search tree by simulating trajectories from the current state and updating node statistics accordingly.
    - It balances exploration (trying new actions) and exploitation (favoring actions with high estimated value) to find the most promising actions.
    - The code is well-documented, making it clear and understandable. It's a crucial aspect, especially for complex algorithms like MCTS.

Overall, this code provides a robust implementation of MCTS, which can be integrated with various reinforcement learning tasks, especially in game-playing scenarios.

In [8]:
import time
import copy

class Tree:
    '''Monte Carlo Tree Search (MCTS)'''
    def __init__(self, net):
        self.net = net
        self.nodes = {}

    def search(self, state, path, rp, depth):
        # Return predicted value from new state
        key = state.record_string()
        if len(path) > 0:
            key += '|' + ' '.join(map(state.action2str, path))
        if key not in self.nodes:
            p, v = self.net.prediction.inference(rp)
            self.nodes[key] = Node(p, v)
            return v

        # State transition by an action selected from bandit
        node = self.nodes[key]
        p = node.p
        mask = np.zeros_like(p)
        if depth == 0:
            # Add noise to policy on the root node
            p = 0.75 * p + 0.25 * np.random.dirichlet([0.15] * len(p))
            # On the root node, we choose action only from legal actions
            mask[state.legal_actions()] = 1
            p *= mask
            p /= p.sum() + 1e-16

        n, q_sum = 1 + node.n, node.q_sum_all / node.n_all + node.q_sum
        ucb = q_sum / n + 2.0 * np.sqrt(node.n_all) * p / n + mask * 4 # PUCB formula
        best_action = np.argmax(ucb)

        # Search next state by recursively calling this function
        rp_next = self.net.dynamics.inference(rp, state.action_feature(best_action))
        path.append(best_action)
        q_new = -self.search(state, path, rp_next, depth + 1) # With the assumption of changing player by turn
        node.update(best_action, q_new)

        return q_new

    def think(self, state, num_simulations, temperature = 0, show=False):
        # End point of MCTS
        if show:
            print(state)
        start, prev_time = time.time(), 0
        for _ in range(num_simulations):
            self.search(state, [], self.net.representation.inference(state.feature()), depth=0)

            # Display search result on every second
            if show:
                tmp_time = time.time() - start
                if int(tmp_time) > int(prev_time):
                    prev_time = tmp_time
                    root, pv = self.nodes[state.record_string()], self.pv(state)
                    print('%.2f sec. best %s. q = %.4f. n = %d / %d. pv = %s'
                          % (tmp_time, state.action2str(pv[0]), root.q_sum[pv[0]] / root.n[pv[0]],
                             root.n[pv[0]], root.n_all, ' '.join([state.action2str(a) for a in pv])))

        #  Return probability distribution weighted by the number of simulations
        root = self.nodes[state.record_string()]
        n = root.n + 1
        n = (n / np.max(n)) ** (1 / (temperature + 1e-8))
        return n / n.sum()

    def pv(self, state):
        # Return principal variation (action sequence which is considered as the best)
        s, pv_seq = copy.deepcopy(state), []
        while True:
            key = s.record_string()
            if key not in self.nodes or self.nodes[key].n.sum() == 0:
                break
            best_action = sorted([(a, self.nodes[key].n[a]) for a in s.legal_actions()], key=lambda x: -x[1])[0][0]
            pv_seq.append(best_action)
            s.play(best_action)
        return pv_seq

In [9]:
# Search with initialized net

tree = Tree(Net())
tree.think(State(), 100, show=True)

tree = Tree(Net())
tree.think(State().play('A1 B1 A2 B2'), 200, show=True)

  1 2 3
A _ _ _
B _ _ _
C _ _ _
record = 
  1 2 3
A O O _
B X X _
C _ _ _
record = A1 B1 A2 B2


array([0., 0., 0., 0., 0., 1., 0., 0., 0.], dtype=float32)

# Training of neural network

This code defines a training function for a neural network used in reinforcement learning. Here's a breakdown of its components:

1. **`gen_target()` Function**:
   - **Purpose**: Generates inputs and targets for training by extracting information from episodes.
   - **Inputs**: `ep` (an episode), `k` (number of steps to consider).
   - **Output**: Returns inputs (`x`), action targets (`ax`), policy targets (`p_target`), and value targets (`v_target`) for training.


2. **`train()` Function**:
   - **Purpose**: Trains the neural network.
   - **Inputs**: `episodes` (list of episodes), `net` (neural network model), `opt` (optimizer).
   - **Steps**:
     - Samples a batch of training data from episodes using `gen_target()`.
     - Computes losses for policy and value predictions for each step in the batch.
     - Updates the neural network parameters based on the total loss.
     - Prints the average policy and value losses over the training data.

Overall, this function facilitates training the neural network model for reinforcement learning tasks, optimizing it to predict policies and values that maximize rewards.

In [10]:
import torch.optim as optim

batch_size = 32
num_steps = 100

def gen_target(ep, k):
    '''Generate inputs and targets for training'''
    # path, reward, observation, action, policy
    turn_idx = np.random.randint(len(ep[0]))
    ps, vs, ax = [], [], []
    for t in range(turn_idx, turn_idx + k + 1):
        if t < len(ep[0]):
            p = ep[4][t]
            a = ep[3][t]
        else: # state after finishing game
            # p is 0 (loss is 0)
            p = np.zeros_like(ep[4][-1])
            # random action selection
            a = np.zeros(np.prod(ep[3][-1].shape), dtype=np.float32)
            a[np.random.randint(len(a))] = 1
            a = a.reshape(ep[3][-1].shape)
        vs.append([ep[1] if t % 2 == 0 else -ep[1]])
        ps.append(p)
        ax.append(a)
        
    return ep[2][turn_idx], ax, ps, vs

def train(episodes, net, opt):
    '''Train neural net'''
    p_loss_sum, v_loss_sum = 0, 0
    net.train()
    k = 4
    for _ in range(num_steps):
        x, ax, p_target, v_target = zip(*[gen_target(episodes[np.random.randint(len(episodes))], k) for j in range(batch_size)])
        x = torch.from_numpy(np.array(x))
        ax = torch.from_numpy(np.array(ax))
        p_target = torch.from_numpy(np.array(p_target))
        v_target = torch.FloatTensor(np.array(v_target))

        # Change the order of axis as [time step, batch, ...]
        ax = torch.transpose(ax, 0, 1)
        p_target = torch.transpose(p_target, 0, 1)
        v_target = torch.transpose(v_target, 0, 1)

        # Compute losses for k (+ current) steps
        p_loss, v_loss = 0, 0
        for t in range(k + 1):
            rp = net.representation(x) if t == 0 else net.dynamics(rp, ax[t - 1])
            p, v = net.prediction(rp)
            p_loss += F.kl_div(torch.log(p), p_target[t], reduction='sum')
            v_loss += torch.sum(((v_target[t] - v) ** 2) / 2)

        p_loss_sum += p_loss.item()
        v_loss_sum += v_loss.item()

        optimizer.zero_grad()
        (p_loss + v_loss).backward()
        optimizer.step()

    num_train_datum = num_steps * batch_size
    print('p_loss %f v_loss %f' % (p_loss_sum / num_train_datum, v_loss_sum / num_train_datum))
    return net

This function simulates a specified number of games (default: 100) between a neural network-based player and a random player. Here's how it works:

- **Input**:
  - `net`: The neural network model used for making decisions.
  - `n`: The number of games to simulate.


- **Steps**:
  1. Initialize an empty dictionary `results` to store the outcome statistics of the games.
  2. Iterate over each game, determining the starting player based on the game index.
  3. Create a new game state.
  4. Play the game until it reaches a terminal state (win, lose, or draw).
  5. At each turn, if it's the neural network's turn, it predicts the policy for the current state and chooses the action with the highest predicted probability. Otherwise, the random player selects a random legal action.
  6. Update the game state with the chosen action and switch the turn to the other player.
  7. After the game ends, determine the final reward based on the terminal state and update the `results` dictionary with the outcome.


- **Output**:
  - `results`: A dictionary containing the frequency of different game outcomes (rewards) observed over the simulated games. Keys represent the final rewards, and values represent the number of occurrences.

In [11]:
# against random agents

def vs_random(net, n=100):
    results = {}
    for i in range(n):
        first_turn = i % 2 == 0
        turn = first_turn
        state = State()
        while not state.terminal():
            if turn:
                p, _ = net.predict(state, [])[-1]
                action = sorted([(a, p[a]) for a in state.legal_actions()], key=lambda x:-x[1])[0][0]
            else:
                action = np.random.choice(state.legal_actions())
            state.play(action)
            turn = not turn
        r = state.terminal_reward() if turn else -state.terminal_reward()
        results[r] = results.get(r, 0) + 1
    return results

# Training loop (using CPU)

This script trains a neural network model using reinforcement learning by generating and simulating episodes of gameplay. Here's a breakdown of its components:

- **Initialization**:
  - `num_games`, `num_games_one_epoch`, and `num_simulations` define the parameters for training and simulation.


- **Neural Network Initialization**:
  - `net` is initialized as a neural network model.
  - `optimizer` is initialized using stochastic gradient descent (SGD) with specified learning rate, weight decay, and momentum.


- **Simulating Games and Training**:
  - The script iterates over a specified number of games (`num_games`).
  - Within each game iteration:
    - An episode is generated by simulating gameplay using a combination of neural network predictions and random actions.
    - The episode's outcome and details are recorded.
    - If a certain number of games have been played (`num_games_one_epoch`), the recorded episodes are used to train the neural network using the `train()` function.
    - The results of training and the performance of the network against random play are printed periodically.


- **Output**:
  - The script prints various details during execution, such as the results of battles against random play, the distribution of episode outcomes, and the progress of training.
  - After training completes, a "finished" message is printed.


This script enables the iterative training of a neural network model for reinforcement learning tasks, improving its performance through simulated gameplay experiences.

In [12]:
# Main algorithm of MuZero

num_games = 100
num_games_one_epoch = 10
num_simulations = 40

"""
for 3x3 try:
num_games = 500
num_games_one_epoch = 20
num_simulations = 40
"""

net = Net()
optimizer = optim.SGD(net.parameters(), lr=3e-4, weight_decay=3e-5, momentum=0.8)

# Display battle results as {-1: lose 0: draw 1: win} (for episode generated for training, 1 means that the first player won)
vs_random_sum = vs_random(net)
print('vs_random = ', sorted(vs_random_sum.items()))

episodes = []
result_distribution = {1: 0, 0: 0, -1: 0}

for g in range(num_games):
    # Generate one episode
    record, p_targets, features, action_features = [], [], [], []
    state = State()
    # temperature using to make policy targets from search results
    temperature = 0.7

    while not state.terminal():
        tree = Tree(net)
        p_target = tree.think(state, num_simulations, temperature)
        p_targets.append(p_target)
        features.append(state.feature())

        # Select action with generated distribution, and then make a transition by that action
        action = np.random.choice(np.arange(len(p_target)), p=p_target)
        record.append(action)
        action_features.append(state.action_feature(action))
        state.play(action)
        temperature *= 0.8

    # reward seen from the first turn player
    reward = state.terminal_reward() * (1 if len(record) % 2 == 0 else -1)
    result_distribution[reward] += 1
    episodes.append((record, reward, features, action_features, p_targets))

    if g % num_games_one_epoch == 0:
        print('game ', end='')
    print(g, ' ', end='')

    # Training of neural net
    if (g + 1) % num_games_one_epoch == 0:
        # Show the result distributiuon of generated episodes
        print('generated = ', sorted(result_distribution.items()))
        net = train(episodes, net, optimizer)
        vs_random_once = vs_random(net)
        print('vs_random = ', sorted(vs_random_once.items()), end='')
        for r, n in vs_random_once.items():
            vs_random_sum[r] += n
        print(' sum = ', sorted(vs_random_sum.items()))
print('finished')

vs_random =  [(-1, 31), (0, 41), (1, 28)]
game 0  1  2  3  4  5  6  7  8  9  generated =  [(-1, 1), (0, 6), (1, 3)]
p_loss 1.978555 v_loss 0.378698
vs_random =  [(-1, 26), (0, 41), (1, 33)] sum =  [(-1, 57), (0, 82), (1, 61)]
game 10  11  12  13  14  15  16  17  18  19  generated =  [(-1, 2), (0, 10), (1, 8)]
p_loss 1.461495 v_loss 0.435645
vs_random =  [(-1, 29), (0, 40), (1, 31)] sum =  [(-1, 86), (0, 122), (1, 92)]
game 20  21  22  23  24  25  26  27  28  29  generated =  [(-1, 6), (0, 13), (1, 11)]
p_loss 0.767429 v_loss 0.416586
vs_random =  [(-1, 39), (0, 38), (1, 23)] sum =  [(-1, 125), (0, 160), (1, 115)]
game 30  31  32  33  34  35  36  37  38  39  generated =  [(-1, 8), (0, 18), (1, 14)]
p_loss 0.746043 v_loss 0.380497
vs_random =  [(-1, 25), (0, 37), (1, 38)] sum =  [(-1, 150), (0, 197), (1, 153)]
game 40  41  42  43  44  45  46  47  48  49  generated =  [(-1, 8), (0, 20), (1, 22)]
p_loss 0.676101 v_loss 0.419888
vs_random =  [(-1, 26), (0, 34), (1, 40)] sum =  [(-1, 176), (

In [13]:
# player first

state = State()
tree = Tree(net)
print(state)

while not state.terminal():
    print('---------------')
    
    user_action = input('player move = ')
    state.play(user_action)
    
    tree.think(state, 300)
    best_choise = tree.pv(state)
    bot_action = state.action2str(best_choise[0])
    print('ai move =', bot_action)
    state.play(bot_action)
    
    print(state)

  1 2 3
A _ _ _
B _ _ _
C _ _ _
record = 
---------------
player move = A1
ai move = C1
  1 2 3
A O _ _
B _ _ _
C X _ _
record = A1 C1
---------------
player move = B2
ai move = A2
  1 2 3
A O X _
B _ O _
C X _ _
record = A1 C1 B2 A2
---------------
player move = C3
ai move = B3
  1 2 3
A O X _
B _ O X
C X _ O
record = A1 C1 B2 A2 C3 B3


In [14]:
# ai first

state = State()
tree = Tree(net)

while not state.terminal():
    print('---------------')
    
    tree.think(state, 300)
    best_choise = tree.pv(state)
    bot_action = state.action2str(best_choise[0])
    state.play(bot_action)
    
    print(state)
    
    print('ai move = ', bot_action)
    user_action = input('player move = ')
    state.play(user_action)
    
print(state)

---------------
  1 2 3
A _ _ _
B O _ _
C _ _ _
record = B1
ai move =  B1
player move = B2
---------------
  1 2 3
A O _ _
B O X _
C _ _ _
record = B1 B2 A1
ai move =  A1
player move = C3
---------------
  1 2 3
A O O _
B O X _
C _ _ X
record = B1 B2 A1 C3 A2
ai move =  A2
player move = B2
---------------
  1 2 3
A O O _
B O X _
C O _ X
record = B1 B2 A1 C3 A2 B2 C1
ai move =  C1
player move = 
  1 2 3
A O O _
B O X _
C O _ X
record = B1 B2 A1 C3 A2 B2 C1


In [15]:
# cuda check
print(torch.cuda.is_available())
print(torch.cuda.device_count())
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

True
1
cuda
