Tic Tac Toe
---

<img style="float:center" src="../images/tris.png" alt="drawing" width="200"/>

In [86]:
import numpy as np
import random
import pickle
from tqdm.auto import tqdm

## Board State
---
The ``TicTacToe`` class holds the state of the board.
We use 1 to indicate ``player1``, -1 to indicate ``player2`` and 0 for an empty cell.

#### Parameter description
- ``board``: numpy array of dimension 3x3 that represents the game board.
- ``players``: contains the numbers that identify our players (1 is ``player1`` and -1 is ``player2``).
- ``current_player``: indicates who has to take the turn.
- ``winner``: indicates the winner of the game (1 if ``player1`` won, -1 if ``player2`` won, 0 for a tie).
- ``game_over``: boolean that indicates whether the game has finished.

#### Methods description
- ``available_moves``: It returns a list of the possible moves (each element is a tuple of two integers that indicates where to play).
- ``make_move``: It takes as input the location where the current player wants to play and writes the value of ``current_player`` at that position on the board. If the cell is already occupied, it returns ``False``. Otherwise it calls ``check_winner`` and ``switch_player`` to check whether the move ends the game and to give the other player the turn. It returns the new board not as a matrix but as a tuple of tuples, using the ``convert_matrix_to_tuple`` function, so that the board can be used as a key in the QAgent dictionary.
- ``switch_player``: It gives the other player the turn.
- ``check_winner``: It checks if there is a winner or a tie. If there is, it sets the ``winner`` param to the player who won (1 or -1), or to 0 for a tie, and also sets ``game_over`` to ``True``.
- ``convert_matrix_to_tuple``: It converts a matrix (that will always be the board status) to a tuple of tuples. For example, a matrix [[1,0,0],[0,1,0],[0,0,1]] will become ((1,0,0), (0,1,0), (0,0,1)).
- ``reset``: It resets the board by emptying all the cells and restores the other parameters to their initial values.
- ``show_board``: It prints the board status. ``player1`` is the X and ``player2`` is the O.
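
To see why ``make_move`` returns the board as a tuple of tuples, here is a minimal sketch, assuming a plain nested list as the board (a NumPy array is unhashable in the same way). The helper name ``board_to_key`` and the ``q_table`` dictionary are hypothetical and only mirror what ``convert_matrix_to_tuple`` and the agent's dictionary do.

```python
def board_to_key(board):
    # Hypothetical stand-alone helper mirroring convert_matrix_to_tuple:
    # turn each row into a tuple so the whole board becomes hashable.
    return tuple(tuple(row) for row in board)

# Assumption: a nested list stands in for the 3x3 NumPy board.
board = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # player1 has taken the centre

key = board_to_key(board)
q_table = {(key, (0, 0)): 0.0}  # a (state, action) pair used as a dictionary key

try:
    hash(board)  # mutable containers cannot be hashed
    board_is_hashable = True
except TypeError:
    board_is_hashable = False
```

The mutable board cannot be hashed, while the converted ``key`` can be stored and looked up in the dictionary.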

In [87]:
class TicTacToe:
    def __init__(self):
        self.board = np.zeros((3, 3))
        self.players = [1, -1]
        self.current_player = 1
        self.winner = None
        self.game_over = False

    def available_moves(self):
        moves = []
        for i in range(3):
            for j in range(3):
                if self.board[i][j] == 0:
                    moves.append((i, j))
        return moves

    def make_move(self, move):
        if self.board[move[0]][move[1]] != 0:
            return False
        self.board[move[0]][move[1]] = self.current_player
        self.check_winner()
        self.switch_player()
        return self.convert_matrix_to_tuple(self.board)

    def switch_player(self):
        if self.current_player == self.players[0]:
            self.current_player = self.players[1]
        else:
            self.current_player = self.players[0]

    def check_winner(self):
        # Check rows
        for i in range(3):
            if self.board[i][0] == self.board[i][1] == self.board[i][2] != 0:
                self.winner = self.board[i][0]
                self.game_over = True
        # Check columns
        for j in range(3):
            if self.board[0][j] == self.board[1][j] == self.board[2][j] != 0:
                self.winner = self.board[0][j]
                self.game_over = True
        # Check diagonals
        if self.board[0][0] == self.board[1][1] == self.board[2][2] != 0:
            self.winner = self.board[0][0]
            self.game_over = True
        if self.board[0][2] == self.board[1][1] == self.board[2][0] != 0:
            self.winner = self.board[0][2]
            self.game_over = True
        # Check tie: the board is full and nobody has won with the last move
        if not self.game_over and len(self.available_moves()) == 0:
            self.winner = 0
            self.game_over = True
    
    def convert_matrix_to_tuple(self, board):
        current_board = tuple(tuple(row) for row in board)
        return current_board
    
    def reset(self):
        self.board = np.zeros((3, 3))
        self.current_player = 1
        self.winner = None  # None (not 0), so a fresh game is not mistaken for a tie
        self.game_over = False
        
    def show_board(self):
        # p1: x  p2: o
        for i in range(0, 3):
            print('-------------')
            out = '| '
            for j in range(0, 3):
                if self.board[i, j] == 1:
                    token = 'x'
                if self.board[i, j] == -1:
                    token = 'o'
                if self.board[i, j] == 0:
                    token = ' '
                out += token + ' | '
            print(out)
        print('-------------')     
        print()

## Q Learning Player
---
The ``QLearningAgent`` represents the player that will be trained with Q-Learning. Q-learning is a reinforcement learning technique that updates the action-value estimate using the difference between the current estimate and the reward actually received.

We will represent the Q-values as a dictionary of state-action pairs, where each state is a tuple representing the current state of the board, and each action is a tuple representing the coordinates of the move. This state-action pair will be the key of our dictionary and the Q-values are the values. The initial Q-values will be set to zero. We update the action-value $Q(s_t, a_t)$ according to the formula below, where $\alpha$ is the learning rate, $\gamma$ is the discount factor and $R_{t+1}$ is the reward propagated back from the following move (the final reward of the game for the last move):

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( \gamma R_{t+1} - Q(s_t, a_t) \right)
$$
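
As a worked example of the formula above, here is a minimal sketch of the backward update over one game. It assumes three moves whose Q-values all start at zero, a final reward of 1, and the values 0.2 and 0.9 for the learning rate and the discount factor. The function name ``backup_rewards`` is hypothetical.

```python
def backup_rewards(q_values, reward, alpha, gamma):
    """Apply Q <- Q + alpha * (gamma * R - Q) from the last move to the first.

    q_values lists the current Q-value of each move in playing order.
    """
    updated = list(q_values)
    target = reward  # the last move sees the final reward
    for i in reversed(range(len(updated))):
        updated[i] = updated[i] + alpha * (gamma * target - updated[i])
        target = updated[i]  # earlier moves see the freshly updated value
    return updated

# Three moves with Q-values initialised to zero, final reward 1.
new_q = backup_rewards([0.0, 0.0, 0.0], reward=1.0, alpha=0.2, gamma=0.9)
```

The last move ends up with a value of about 0.18, the one before with about 0.0324 and the first with about 0.005832, so moves closer to the end of the game receive more of the reward.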

#### Parameters description
- ``Q``: Dictionary with state-action pairs as keys and the Q-values as values.
- ``alpha``: Learning rate.
- ``epsilon``: Probability of doing a random move instead of the action with max Q-value.
- ``discount_factor``: The discount factor $\gamma$ applied to the reward when it is propagated back through the moves of a match.
- ``states``: List of all the state-action pairs a player has seen during a single match. It is used at the end of each match to update the Q-values.

#### Methods description
- ``get_Q_value``: It returns the Q-value of a given state-action pair. If the key is not present in the dictionary, it creates it with a Q-value equal to 0.
- ``add_state``: It appends a state-action pair to the ``states`` list.
- ``reset``: It empties the ``states`` list so that a new game can start.
- ``choose_action``: It first adds to the dictionary every state-action pair that is new for the current board. It then chooses the action, either at random (with probability ``epsilon``) or as the one with the highest Q-value in the ``Q`` dictionary, breaking ties at random.
- ``update_Q_value``: It is called at the end of each game. It updates the Q-values in the ``Q`` dictionary for the state-action pairs the player has seen during the game, going backwards from the last move and propagating the final reward.
- ``save_policy``: It saves the ``Q`` dictionary that we have trained to a file.
- ``load_policy``: It loads the ``Q`` dictionary from a file.
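
The selection rule described for ``choose_action`` can be sketched in isolation as follows. This is a minimal sketch, not the class method itself: the function name ``epsilon_greedy`` and its ``rng`` parameter are hypothetical, and the Q-values are passed in as a plain list aligned with the list of moves.

```python
import random

def epsilon_greedy(q_values, moves, epsilon, rng=random):
    """Pick a move: random with probability epsilon, otherwise a best-valued one."""
    if rng.random() < epsilon:
        return rng.choice(moves)  # explore
    best = max(q_values)
    # Break ties at random so equally valued moves all get tried.
    candidates = [m for m, q in zip(moves, q_values) if q == best]
    return rng.choice(candidates)  # exploit

moves = [(0, 0), (1, 1), (2, 2)]
# With epsilon=0.0 the choice is always greedy.
greedy_pick = epsilon_greedy([0.1, 0.7, 0.3], moves, epsilon=0.0)
```

With ``epsilon`` set to 0 the function always returns the move with the highest Q-value, and with ``epsilon`` set to 1 it always plays at random.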

In [88]:
class QLearningAgent:
    def __init__(self, alpha, epsilon, discount_factor):
        self.Q = {}
        self.alpha = alpha
        self.epsilon = epsilon
        self.discount_factor = discount_factor
        self.states = [] # record all positions taken + action

    def get_Q_value(self, state, action):
        if (state, action) not in self.Q:
            self.Q[(state, action)] = 0.0
        return self.Q[(state, action)]
    
    def add_state(self, state, action):
        self.states.append((state,action))
    
    def reset(self):
        self.states = []
        
    def choose_action(self, state, available_moves):
        Q_values = [self.get_Q_value(state, action) for action in available_moves]
        if random.uniform(0, 1) < self.epsilon:
            return random.choice(available_moves)
        else:
            max_Q = max(Q_values)
            if Q_values.count(max_Q) > 1:
                best_moves = [i for i in range(len(available_moves)) if Q_values[i] == max_Q]
                i = random.choice(best_moves)
            else:
                i = Q_values.index(max_Q)
            return available_moves[i]

    def update_Q_value(self, reward):     
        for st in reversed(self.states):
            current_q_value = self.Q[(st[0], st[1])] # st[0] = board state st[1] = action
            reward = current_q_value + self.alpha * (self.discount_factor * reward - current_q_value)
            self.Q[(st[0], st[1])] = reward
            
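    def save_policy(self):
        # Write the Q dictionary to the file 'policy_QL'
        with open('policy_QL', 'wb') as fw:
            pickle.dump(self.Q, fw)

    def load_policy(self, file):
        # Read a previously saved Q dictionary from the given file
        with open(file, 'rb') as fr:
            self.Q = pickle.load(fr)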

## Train and Test functions
---
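We train two Q-learning agents by letting them play against each other. The ``test`` function then lets a trained agent play as player 1 against an opponent that moves at random.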
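
The reward scheme applied at the end of each training game can be summarised in a small helper. This is only a sketch: the function name ``rewards_for`` is hypothetical and is not used by the training loop, but the values are the ones the loop passes to ``update_Q_value``.

```python
def rewards_for(winner):
    """Return (player1_reward, player2_reward) for a finished game.

    winner is 1, -1 or 0 (tie), as in the TicTacToe class.
    """
    if winner == 1:
        return 1.0, 0.0
    if winner == -1:
        return 0.0, 1.0
    # Tie: the first player is rewarded less than the second one.
    return 0.1, 0.5

p1_reward, p2_reward = rewards_for(0)
```

Collecting the rewards in one place like this would also avoid repeating the same branches after each player's move.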

In [89]:
def train(player1, player2, num_episodes):
    state = TicTacToe()
    for epoch in tqdm(range(num_episodes)):
        state.reset()
        player1.reset()
        player2.reset()
        state_board = state.convert_matrix_to_tuple(state.board)
        while not state.game_over:
            #Player 1
            action = player1.choose_action(state_board, state.available_moves())
            player1.add_state(state_board, action)
            state_board = state.make_move(action)
            
            if state.winner is not None:
                if state.winner == 1:
                    player1.update_Q_value(1) #player 1 won, so give 1 reward
                    player2.update_Q_value(0)
                elif state.winner == -1:
                    player1.update_Q_value(0)
                    player2.update_Q_value(1)
                else:
                    player1.update_Q_value(0.1) # tie: give a smaller reward because we don't want ties
                    player2.update_Q_value(0.5)
            
            else:
                #Player 2
                action = player2.choose_action(state_board, state.available_moves())
                player2.add_state(state_board, action)
                state_board = state.make_move(action)
                
                if state.winner is not None:
                    if state.winner == 1:
                        player1.update_Q_value(1) #player 1 won, so give 1 reward
                        player2.update_Q_value(0)
                    elif state.winner == -1:
                        player1.update_Q_value(0)
                        player2.update_Q_value(1)
                    else:
                        player1.update_Q_value(0.1) # tie: give a smaller reward because we don't want ties
                        player2.update_Q_value(0.5)
    return player1, player2

def test(agent, num_games, print_board=False):
    num_wins = 0
    num_draws = 0
    for i in range(num_games):
        state = TicTacToe()
        state_board = state.convert_matrix_to_tuple(state.board)
        while not state.game_over:
            if state.current_player == 1:
                action = agent.choose_action(state_board, state.available_moves())
            else:
                action = random.choice(state.available_moves())               
            state_board = state.make_move(action)
            if print_board:
                state.show_board() 
        if state.winner == 1:
            num_wins += 1
        if state.winner == 0:
            num_draws += 1
    return num_wins, num_draws

## Hyperparameters
---
- ``epochs``: number of training games (episodes)
- ``alpha``: learning rate
- ``epsilon``: probability of doing a random move instead of the action with max value
- ``discount_factor``: the discount rate of the Bellman equation
- ``num_games``: number of games for testing

In [129]:
epochs = 2000000
alpha = 0.2
epsilon = 0.2
discount_factor = 0.9
num_games = 1000

## Let's do some computation
---

In [None]:
player1 = QLearningAgent(alpha, epsilon, discount_factor)
player2 = QLearningAgent(alpha, epsilon, discount_factor)

Trained_player1, Trained_player2 = train(player1, player2, epochs)

In [95]:
# Trained_player1 is the X
_ = test(agent=Trained_player1, num_games=1, print_board=True)

-------------
|   |   |   | 
-------------
|   | x |   | 
-------------
|   |   |   | 
-------------

-------------
|   | o |   | 
-------------
|   | x |   | 
-------------
|   |   |   | 
-------------

-------------
|   | o | x | 
-------------
|   | x |   | 
-------------
|   |   |   | 
-------------

-------------
| o | o | x | 
-------------
|   | x |   | 
-------------
|   |   |   | 
-------------

-------------
| o | o | x | 
-------------
|   | x |   | 
-------------
| x |   |   | 
-------------


In [151]:
num_wins, num_draws = test(agent=Trained_player1, num_games=num_games)
print(f"Over {num_games} matches: {num_wins} wins, {num_games - num_wins - num_draws} losses, {num_draws} draws")
print(f"Wins + Draws percentage: {(num_wins + num_draws) / num_games * 100}")

Over 1000 matches: 911 wins, 47 losses, 42 draws
Wins + Draws percentage: 95.3
