# Implementation Ass_5

### Implement tree search 

Alternatives to monte carlo_

There are several alternatives to Monte Carlo for tree search implementations to learn to win at Tic-tac-toe:

- Minimax algorithm: This is a classic algorithm for turn-based games. It works by recursively exploring all possible moves and their outcomes, and then selecting the move that maximizes the minimum outcome.

- Alpha-beta pruning: This is a variant of the minimax algorithm that eliminates branches of the search tree that are guaranteed to be worse than previously explored branches. This can significantly reduce the search space and speed up the algorithm.

- Monte Carlo Tree Search (MCTS) with a policy network: Instead of using random rollouts, MCTS can be augmented with a neural network that predicts the value of each move based on the current board state. This allows the algorithm to focus its search on promising moves, improving its efficiency.

- Reinforcement learning: This approach involves training a neural network to predict the optimal move for any given board state, using a combination of supervised learning and self-play. The network is then used to guide the search during gameplay.

Each of these approaches has its own strengths and weaknesses, and the best choice depends on the specific requirements of the task at hand. For example, if computation time is limited, then MCTS with a policy network might be the best choice, while if accurate evaluation of all possible moves is critical, then minimax with alpha-beta pruning might be the way to go.

Kai: tycker vi kör på MCTS (eventuellt med tillägget av en neural network policy) då den kräver mindre beräkningtid/kraft, skalar bättre och är en lite coolare implementation som används för modeller av shack eller go 🤩 MCTS har också förmågan att utforska nya strategier som kanske inte är uppenbara med traditionella sökmetoder som minimax med alfa-beta-sortering. ->På bekostnad av att den inte garanterar bästa valet. Just för 3x3 brädet så hade vi likväl kunnat garantera bästa valet.   

In [1]:
# imports
import numpy as np 

## Classes Node & MCTS

In [2]:
"""
Overall we need to define two classes: Node and MCTS. 
The Node class represents a node in the MCTS tree, and contains 
the state, parent node, child nodes, win count, and visit count. 

The MCTS class represents the MCTS algorithm itself, and contains the exploration constant, 
maximum number of iterations, and the methods for running the search and simulating games.
"""

class Node:
    def __init__(self, state, player):
        self.state = state
        self.player = player
        self.parent = None
        self.children = []
        self.wins = 0 # use for V = average reward? 
        self.visits = 0 # cannot divide by zero --> adjust the UCB expression with "1+visists" --> evalaute

    def select_child(self, exploration_constant = np.sqrt(2)):
        # experiment with different exploration_constant should be 
        # wiki: c is the exploration parameter—theoretically equal to √2; in practice usually chosen empirically
        # look over the first term  (V_i is the average reward/value of all nodes beneath this node)
        ucb_values = [child.wins/(1+child.visits) + exploration_constant * np.sqrt(np.log(self.visits) / (1+child.visits)) for child in self.children]
        return self.children[np.argmax(ucb_values)]

    def expand(self):
        possible_moves = [(i, j) for i in range(3) for j in range(3) if self.state[i][j] == 0]
        #print("possible moves are: ")
        #print(possible_moves)
        new_state = np.copy(self.state)
        move = possible_moves[np.random.randint(len(possible_moves))] # currently chooses the new state at random
        new_state[move] = self.player # sets currect player on new state
        child_node = Node(new_state, -self.player)
        child_node.parent = self
        self.children.append(child_node)
        return child_node

    """Implement better policy for opponent.
    For example, use a Minimax algorithm to choose the opponent's moves.
    If first move (middle), else if, the connecting positions"""

    def update(self, result):
        self.visits += 1
        if result == self.player:
            self.wins += 1       

class MCTS:
    def __init__(self, exploration_constant= np.sqrt(2), max_iterations=1000):
        self.exploration_constant = exploration_constant
        self.max_iterations = max_iterations

    def search(self, initial_state):
        root_node = Node(initial_state, 1)
        for i in range(self.max_iterations):
            node = root_node
            #assumes that the children all will be explored in every expansion?  
            while node.children: # stops when the current node has no more children, ie. in leaf node
                node = node.select_child(self.exploration_constant) # select child with highest UCB score
            child_node = node.expand() # expand leaf with new child
            result = self.run_simulations(child_node) # roll out form new child
            while node: #stops when at root node
                node.update(result) # back prop
                node = node.parent
        best_child = root_node.children[np.argmax([child.visits for child in root_node.children])]
        return self.select_action(best_child)

    def run_simulations(self, node):
        state = node.state
        player = node.player
        
        print("New simulation with: \n" +str(state)) #just test
        c =0
        while True:
            possible_moves = [(i, j) for i in range(3) for j in range(3) if state[i][j] == 0] # detected bug, it misses the certain values before in row?
            c+=1
            if not possible_moves: # if empty
                print("Possible moves is empty!!!")
                print(c)
                return 0 # no update, only on terminal states
            print("Possible moves NOT empty, but:" + str(possible_moves))
            
            move = possible_moves[np.random.randint(len(possible_moves))] # select random possible move --> best policy
            state[move] = player
            if self.check_winner(state, player):
                return player
            player = -player

    def select_action(self, node):
        for child in node.children:
            #Loop below is done to promote exploration and avoid always choosing the same child node
            # checks ofr 
            if child.visits == node.visits:
                #action is represented as the difference between the state 
                #of the best child node and the state of the current node.
                return child.state - node.state 
        # resturn the action that selects the most visited child
        return node.children[np.argmax([child.visits for child in node.children])].state - node.state

    def check_winner(self, state, current_player):
        for i in range(3): # check every row and colum for "straight wins"
            if state[i][0] == state[i][1] == state[i][2] == current_player:
                return True
            if state[0][i] == state[1][i] == state[2][i] == current_player:
                return True
        if state[0][0] == state[1][1] == state[2][2] == current_player: # "diagonal wins"
            return True
        if state[0][2] == state[1][1] == state[2][0] == current_player:
            return True
        return False

In [3]:
player = -1
state1 = np.zeros((3, 3))

possible_moves = [(i, j) for i in range(3) for j in range(3) if state1[i][j] == 0]
print(possible_moves)
move = possible_moves[np.random.randint(len(possible_moves))]
print(move)
state1[move] = player
print(state1)

[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
(0, 0)
[[-1.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]


## Run an example

In [4]:
# create a new MCTS instance
mcts = MCTS(exploration_constant=1.0, max_iterations=1000)

# initialize the game state
game_state = np.zeros((3, 3))

# play the game
while True:
    print("Current borad: ")
    print(game_state)
    
    # human player's turn
    print("###human player's turn")
    row = int(input("Enter row: "))
    col = int(input("Enter col: "))
    game_state[row][col] = -1
    print(game_state)
    if mcts.check_winner(game_state, -1):
        print("Human player wins!")
        break
    
    # computer player's turn
    print("###Computer player's turn")
    action = mcts.search(game_state)
    game_state += action
    print(game_state)
    if mcts.check_winner(game_state, 1): # change nanme to termination control
        print("Computer player wins!")
        break
    if not np.any(game_state == 0):
        print("It's a tie!")
        break


Current borad: 
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
###human player's turn
[[ 0.  0.  0.]
 [ 0. -1.  0.]
 [ 0.  0.  0.]]
###Computer player's turn
New simulation with: 
[[ 0.  0.  0.]
 [ 0. -1.  0.]
 [ 0.  1.  0.]]
Possible moves NOT empty, but:[(0, 0), (0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 2)]
Possible moves NOT empty, but:[(0, 0), (0, 1), (0, 2), (1, 0), (2, 0), (2, 2)]
Possible moves NOT empty, but:[(0, 1), (0, 2), (1, 0), (2, 0), (2, 2)]
Possible moves NOT empty, but:[(0, 2), (1, 0), (2, 0), (2, 2)]
Possible moves NOT empty, but:[(0, 2), (2, 0), (2, 2)]
Possible moves NOT empty, but:[(2, 0), (2, 2)]
New simulation with: 
[[ 1. -1. -1.]
 [ 1. -1. -1.]
 [ 1.  1. -1.]]
Possible moves is empty!!!
1


ValueError: high <= 0

## Experimentation with exploration constant

Try different exploration constants and maximum iterations to see how they affect the performance of the algorithm.

# Evaluation

Evaluate your algorithm and comment on its pros and cons. For example, is it fast? Is it sample efficient? Is the learned policy competitive? Does it lose? Would you, as a human, beat it? Would it scale well to larger grids such as 4x4 or 5x5?

# Next gen TTT AI

We could try using a neural network to estimate the value of each state instead of simulating games to the end. This could speed up the search and improve the performance of the algorithm.

<a style='text-decoration:none;line-height:16px;display:flex;color:#5B5B62;padding:10px;justify-content:end;' href='https://deepnote.com?utm_source=created-in-deepnote-cell&projectId=1262dda2-abb7-4af7-a1b6-72164064af5a' target="_blank">
 </img>
Created in <span style='font-weight:600;margin-left:4px;'>Deepnote</span></a>