# LAB10

Use reinforcement learning to devise a tic-tac-toe player.

### Deadlines:

* Submission: Sunday, December 17 ([CET](https://www.timeanddate.com/time/zones/cet))
* Reviews: Dies Natalis Solis Invicti ([CET](https://en.wikipedia.org/wiki/Sol_Invictus))

Notes:

* Reviews will be assigned  on Monday, December 4
* You need to commit in order to be selected as a reviewer (ie. better to commit an empty work than not to commit)

## References
* _Sutton & Barto, Reinforcement Learning: An Introduction_ [2nd Edition]
* [_Reinforcement Learning - Developing Intelligent Agents_](https://www.youtube.com/playlist?list=PLZbbT5o_s2xoWNVdDudn51XM8lOuZ_Njv)

### General imports

In [1]:
import numpy as np
import pickle
from abc import ABC, abstractmethod
from copy import deepcopy
from itertools import combinations
from random import randint, random
from tqdm import trange
from typing import Literal

### Tic-Tac-Toe game definition

In [2]:
class TicTacToe:
    """
    Class representing the Tic-Tac-Toe game.
    """

    def __init__(self, board: np.ndarray = None) -> None:
        """
        Constructor of the Tic-Tac-Toe game.

        Args:
            board: the board of the game in a given state.

        Returns:
            None.
        """
        # if board is None
        if board is None:
            # create the starting board of the game
            board = np.ones((3, 3), dtype=np.uint8) * -1
        # define the board game
        self._board = board
        # define a board to ease the check winner computation
        self._eqv_board = np.array([[1, 6, 5], [8, 4, 0], [3, 2, 7]], dtype=np.uint8)
        # define a mapping for pretty printing
        self._id_to_block = {-1: '⬜️', 0: '❌', 1: '⭕️'}

    @property
    def board(self) -> np.ndarray:
        """
        Returns a copy of the game board.

        Args:
            None.

        Returns:
            A copy of the game board is returned.
        """
        # return a copy of the board so that the board cannot be modified from outside
        return deepcopy(self._board)

    def print(self) -> None:
        """
        Print the table in a pretty way.

        Args:
            None.

        Returns:
            None.
        """
        # define a board for pretty printing
        fancy_board = np.chararray(self._board.shape, itemsize=1, unicode=True)
        for i in range(fancy_board.shape[0]):
            for j in range(fancy_board.shape[1]):
                # fill the fancy board
                fancy_board[(i, j)] = self._id_to_block[self._board[(i, j)]]
        print(fancy_board)

    def check_winner(self) -> Literal[-1, 0, 1]:
        """
        Check the winner.

        Args:
            None.

        Returns:
            0 is returned if the first player has won;
            1 is returned if the second player has won;
            -1 is returned if no one has won.
        """
        # take the tiles belonging to the first player
        player1_tiles = self._board == 0
        # take the tiles belonging to the second player
        player2_tiles = self._board == 1
        # check if the first player has won
        if any(sum(h) == 12 for h in combinations(self._eqv_board[player1_tiles], 3)):
            return 0
        # check if the second player has won
        if any(sum(h) == 12 for h in combinations(self._eqv_board[player2_tiles], 3)):
            return 1
        # no player has won
        return -1

    def is_still_playable(self) -> bool:
        """
        Check if the board game contains not taken tiles.

        Args:
            None.

        Returns:
            A boolean representing if the game is still playable
            is returned.
        """
        # check if still there are not taken tiles
        return any((self._board == -1).flatten())

    def move(self, move: tuple[int, int], player_id: int) -> bool:
        """
        Perform a move for a given player if it is acceptable.

        Args:
            move: the move to play;
            player_id: the id of the moving player.

        Returns:
            The acceptability of the move is returned.
        """
        # if the player id is not valid
        if player_id >= 2 or player_id <= -1:
            return False
        # check if the move is acceptable
        acceptable = self.is_acceptable(move)
        # if it is
        if acceptable:
            # update the board
            self._board[move] = player_id
        return acceptable

    def is_acceptable(self, move: tuple[int, int]) -> bool:
        """
        Check if a move is acceptable.

        Args:
            move: the move to play.

        Returns:
            The acceptability of the move is returned.
        """
        acceptable: bool = move[0] >= 0 and move[0] <= 3 and move[1] >= 0 and move[1] <= 3 and self._board[move] < 0
        return acceptable

### Tic-Tac-Toe match definition

In [3]:
def play(game: 'TicTacToe', player1: 'Player', player2: 'Player', log: bool = False) -> Literal[-1, 0, 1]:
    """
    Play a game between two given players.

    Args:
        game: a Tic-Tac-Toe game instance;
        player1: the player who starts the game;
        player2: the second player of the game;
        log: a boolean flag to print the match log or not.

    Returns:
        0 is returned if the first player has won;
        1 is returned if the second player has won;
        -1 is returned if no one has won.
    """
    # if the user wants to see the full game
    if log:
        game.print()
    # define the players
    players = [player1, player2]
    # set the moving player index
    current_player_idx = 1
    # define a variable to indicate if there is a winner
    winner = -1
    # if we can still play
    while winner < 0 and game.is_still_playable():
        # update the current moving player index
        current_player_idx += 1
        current_player_idx %= len(players)
        # define a variable to check if the chosen move is ok or not
        ok = False
        # while the chosen move is not ok
        while not ok:
            # let the current player make a move
            move = players[current_player_idx].make_move(game)
            # check if now it is ok
            ok = game.move(move, current_player_idx)
        # if the user wants to see the full game
        if log:
            game.print()
        # check if there is a winner
        winner = game.check_winner()
    # if the user wants to see the full game
    if log:
        if winner == -1:
            print(f"Draw!")
        else:
            print(f"Winner: Player {winner}")
    # return the winner
    return winner

### Additional functions

The `show_statistics` function is used to print some statistics about a batch of matches played between two players.

In [4]:
def show_statistics(player_id: int, player1: 'Player', player2: 'Player', n_matches: int = 1_000) -> None:
    """
    Play a certain number of games between two players and print the statistics.

    Args:
        player_id: the player whose statistics we want to track;
        player1: the player who starts the game;
        player2: the second player of the game;
        n_matches: the number of matches to play.

    Return:
        None.
    """
    counter_wins = 0
    counter_losses = 0
    counter_draws = 0
    for _ in range(n_matches):
        game = TicTacToe()
        winner = play(game, player1, player2)
        counter_wins = counter_wins + 1 if winner == player_id else counter_wins
        counter_losses = counter_losses + 1 if winner == (player_id + 1) % 2 else counter_losses
        counter_draws = counter_draws + 1 if winner == -1 else counter_draws
    print(f'Over {n_matches} matches: {counter_wins} wins, {counter_losses} losses and {counter_draws} draw')
    print(f'Wins + Draws percentage: {(counter_wins + counter_draws) / n_matches:.2%}')

### Abstract Player definition

In [5]:
class Player(ABC):
    """
    Class representing an abstract player.
    """

    def __init__(self) -> None:
        """
        Player constructor to be implemented.

        Args:
            None.

        Returns:
            None.
        """
        pass

    @abstractmethod
    def make_move(self, game: 'TicTacToe') -> tuple[int, int]:
        """
        Abstract Method for deciding which move to play.

        Args:
            game: a Tic-Tac-Toe game instance.

        Returns:
            A move is returned.
        """
        pass

### Random Player definition

In [6]:
class RandomPlayer(Player):
    """
    Class representing a player who plays randomly.
    """

    def __init__(self) -> None:
        """
        Constructor of the random player.

        Args:
            None.

        Returns:
            None.
        """
        super().__init__()

    def make_move(self, game: TicTacToe) -> tuple[int, int]:
        """
        Pick a randome move.

        Args:
            game: a Tic-Tac-Toe game instance.

        Returns:
            A random move is returned.
        """
        return (randint(0, game.board.shape[0] - 1), randint(0, game.board.shape[1] - 1))

## Reinforcement Learning: Q-learning

**Q-learning** is a reinforcement learning technique which is based on updating the action-value function $Q(s_t, a_t)$ according to the following formula:
$$
Q(s_t, a_t) \leftarrow (1 - \alpha) * Q(s_t, a_t) + \alpha * ( R_{t+1} + \gamma * \max_a Q(s_{t+1}, a) )
$$
where $\gamma$ is the _discount rate_ and $\alpha$ is a learning rate parameter. \
Here $s_{t+1}$ is supposed to be a state in which our player is again called upon to make a move. \
Since in our case $s_{t+1}$ is a state in which the opponent has to play, we should take this into account by putting a _minus_ sign in front of $\max_a Q(s_{t+1}, a) )$. Thus, the update formula becomes:
$$
Q(s_t, a_t) \leftarrow (1 - \alpha) * Q(s_t, a_t) + \alpha * ( R_{t+1} + \gamma * ( - \max_a Q(s_{t+1}, a) ) )
$$
> Note: This idea came to [Davide Vitabile](https://github.com/Vitabile/Computational-Intelligence/tree/main)'s mind

The cells below define a Q-learning player and train it against a random player. \
The obtained results are also shown.

In [26]:
class QLearningRLPlayer(Player):
    """
    Class representing player who learns to play thanks to the Q-learning technique.
    """

    def __init__(
        self,
        n_episodes: int,
        alpha: float,
        gamma: float,
        min_exploration_rate: float,
        exploration_decay_rate: float,
        opponent: 'Player',
    ) -> None:
        """
        The Q-learning player constructor.

        Args:
            n_episodes: the number of episodes for the training phase;
            alpha: how much information to incorporate from the new experience;
            gamma: the discount rate of the Bellman equation;
            min_exploration_rate: the minimum rate for exploration during the training phase;
            exploration_decay_rate: the exploration decay rate used during the training;
            opponent: the opponent to play against.

        Returns:
            None.
        """
        super().__init__()
        self._q_table = {}  # define the Action-value function
        self._n_episodes = n_episodes  # define the number of episodes for the training phase
        self._alpha = alpha  # define how much information to incorporate from the new experience
        self._gamma = gamma  # define the discount rate of the Bellman equation
        self._exploration_rate = 1  # define the exploration rate for the training phase
        self._min_exploration_rate = (
            min_exploration_rate  # define the minimum rate for exploration during the training phase
        )
        self._exploration_decay_rate = (
            exploration_decay_rate  # define the exploration decay rate used during the training
        )
        self._opponent = opponent  # define the opponent to play against

    def _move_reward(self, game: 'TicTacToe', move: tuple[int, int], player_id: int) -> tuple[Literal[-1, 1], bool]:
        """
        Try a move and return the corresponding reward.

        Args:
            game: a Tic-Tac-Toe game instance;
            move: the move to try;
            player_id: my player's id.

        Returns:
            The reward and the acceptability of the move are returned.
        """
        # play a move
        acceptable = game.move(move, player_id)
        # give a negative reward to the agent
        reward = -1
        # if the move is acceptable
        if acceptable:
            # give a positive reward to the agent
            reward = 1
        return reward, acceptable

    def _game_reward(self, player: 'TicTacToe', winner: int) -> Literal[-10, 0, 10]:
        """
        Calculate the reward based on how the game ended.

        Args:
            player: the winning player;
            winner: the winner's player id.

        Returns:
            The game reward is returned.
        """
        # if there was no winner
        if winner == -1:
            # return no reward
            return 0
        # if the agent is the winner
        elif self == player:
            # give a big positive reward
            return 10
        # give a big negative reward, otherwise
        return -10

    def _map_state_to_index(self, game: 'TicTacToe') -> str:
        """
        Given a game state, this function translates it into an index to access the Q_table.

        Args:
            game: a Tic-Tac-Toe game instance.
        """
        # take the current game state
        state = game.board
        # change not taken tiles values to 2
        state[state == -1] = 2
        # map the state to a string in base 3
        state_repr_index = ''.join(str(_) for _ in state.flatten())
        return state_repr_index

    def _update_q_table(self, state_repr_index: str, new_state_repr_index: str, action: int, reward: int) -> None:
        """
        Update the Q_table according to the Q-learning update formula.

        Args:
            state_repr_index: the current state index;
            new_state_repr_index: the next state index;
            action: the performed action;
            reward: the reward obtained by applying the action on the current state.

        Returns:
            None.
        """
        # if the current state is unknown
        if state_repr_index not in self._q_table:
            # create its entry in the action-value mapping table
            self._q_table[state_repr_index] = np.zeros((9,))
        # if the next state is unknown
        if new_state_repr_index not in self._q_table:
            # create its entry in the action-value mapping table
            self._q_table[new_state_repr_index] = np.zeros((9,))
        prev_value = self._q_table[state_repr_index][action]
        # update the action-value mapping entry for the current state using Q-learning
        self._q_table[state_repr_index][action] = (1 - self._alpha) * prev_value + self._alpha * (
            reward + self._gamma * (-np.max(self._q_table[new_state_repr_index]))
        )

    def _make_move(self, game: 'TicTacToe') -> tuple[int, int]:
        """
        Construct a move during the training phase to update the Q_table.

        Args:
            game: a Tic-Tac-Toe game instance.

        Returns:
            A move to play is returned.
        """
        # get the current state representation
        state_repr_index = self._map_state_to_index(game)

        # randomly perform exploration
        if random() < self._exploration_rate:
            # by returning a random move
            move = randint(0, 8)
        # perform eploitation, otherwise
        else:
            # if the current state is unknown
            if state_repr_index not in self._q_table:
                # create its entry in the action-value mapping table
                self._q_table[state_repr_index] = np.zeros((9,))
            # take the action with maximum return of rewards
            move = np.argmax(self._q_table[state_repr_index])

        # reshape the move to match the board shape
        move = move // 3, move % 3

        return move

    def make_move(self, game: 'TicTacToe') -> tuple[int, int]:
        """
        Construct a move to be played according to the Q_table.

        Args:
            game: a Tic-Tac-Toe game instance.

        Returns:
            A move to play is returned.
        """
        # get the current state representation
        state_repr_index = self._map_state_to_index(game)
        # if the current state is known
        if state_repr_index in self._q_table:
            # take the action with maximum return of rewards
            move = np.argmax(self._q_table[state_repr_index])
            # reshape the move to match the board shape
            move = move // 3, move % 3
            # if the move is acceptable
            if game.is_acceptable(move):
                # return it
                return move
        # perform a random move, otherwise
        return (randint(0, game.board.shape[0] - 1), randint(0, game.board.shape[1] - 1))

    def train(self) -> None:
        """
        Train the Q-learning player.

        Args:
            None.

        Returns:
            None.
        """
        # define the history of rewards
        all_rewards = []
        # define how many episodes to run
        pbar = trange(self._n_episodes)
        # define the players
        players = (self, self._opponent)
        # for each episode
        for episode in pbar:
            # define a new game
            game = TicTacToe()
            # sets the rewards to zero
            rewards = 0

            # define a variable to indicate if there is a winner
            winner = -1
            # change players order
            players = (players[1], players[0])
            # define the current player index
            player_idx = 1

            # if we can still play
            while winner < 0 and game.is_still_playable():
                # change player
                player_idx = (player_idx + 1) % 2
                player = players[player_idx]

                # define a variable to check if the chosen move is ok or not
                ok = False
                # if it is our turn
                if self == player:
                    # while the chosen move is not ok
                    while not ok:
                        # get the current state representation
                        state_repr_index = self._map_state_to_index(game)
                        # get a move
                        move = self._make_move(game)
                        # reshape the move to form an index
                        action = move[0] * 3 + move[1]
                        # perform the move and get the reward
                        reward, ok = self._move_reward(game, move, player_idx)
                        # get the next state representation
                        new_state_repr_index = self._map_state_to_index(game)

                        # update the action-value function
                        self._update_q_table(state_repr_index, new_state_repr_index, action, reward)

                        # update the rewards
                        rewards += reward
                # if it is the opponent turn
                else:
                    # while the chosen move is not ok
                    while not ok:
                        # get a move
                        move = player.make_move(game)
                        # perform the move
                        ok = game.move(move, player_idx)

                # check if there is a winner
                winner = game.check_winner()

            # update the exploration rate
            self._exploration_rate = np.clip(
                np.exp(-self._exploration_decay_rate * episode), self._min_exploration_rate, 1
            )
            # get the game reward
            reward = self._game_reward(player, winner)
            # update the action-value function
            self._update_q_table(state_repr_index, new_state_repr_index, action, reward)
            # update the rewards
            rewards += reward
            # update the rewards history
            all_rewards.append(rewards)
            pbar.set_description(f'rewards value: {rewards}, current exploration rate: {self._exploration_rate:2f}')

        print(f'** Last 1_000 episodes - Mean rewards value: {sum(all_rewards[-1_000:]) / 1_000:.2f} **')
        print(f'** Last rewards value: {all_rewards[-1]:} **')

In [38]:
# create the Q-learning player
q_learning_rl_agent = QLearningRLPlayer(
    n_episodes=500_000,
    alpha=0.1,
    gamma=0.99,
    min_exploration_rate=0.01,
    exploration_decay_rate=3e-6,
    opponent=RandomPlayer(),
)
# train the Q-learning player
q_learning_rl_agent.train()

rewards value: 13, current exploration rate: 0.223131: 100%|██████████| 500000/500000 [09:39<00:00, 862.85it/s]  

** Last 1_000 episodes - Mean rewards value: 11.12 **
** Last rewards value: 13 **





In [44]:
# print the number of explored states
print(f'Number of explored states: {len(q_learning_rl_agent._q_table.keys())}')

Number of explored states: 5478


In [46]:
# serialize the Q-learning player
with open('./q_learning_rl_agent.pkl', 'wb') as f:
    pickle.dump(q_learning_rl_agent, f)

In [47]:
# load the serialized Q-learning player
with open('./q_learning_rl_agent.pkl', 'rb') as f:
    q_learning_rl_agent = pickle.load(f)

### Q-learning 🆚 Random Player

In [178]:
# let the Q-learning player play against a random player
# Q-learning player is second to move
game = TicTacToe()
player1 = RandomPlayer()
player2 = q_learning_rl_agent
winner = play(game, player1, player2, log=True)

[['⬜' '⬜' '⬜']
 ['⬜' '⬜' '⬜']
 ['⬜' '⬜' '⬜']]
[['❌' '⬜' '⬜']
 ['⬜' '⬜' '⬜']
 ['⬜' '⬜' '⬜']]
[['❌' '⬜' '⬜']
 ['⬜' '⭕' '⬜']
 ['⬜' '⬜' '⬜']]
[['❌' '⬜' '⬜']
 ['⬜' '⭕' '⬜']
 ['⬜' '❌' '⬜']]
[['❌' '⬜' '⬜']
 ['⭕' '⭕' '⬜']
 ['⬜' '❌' '⬜']]
[['❌' '⬜' '❌']
 ['⭕' '⭕' '⬜']
 ['⬜' '❌' '⬜']]
[['❌' '⬜' '❌']
 ['⭕' '⭕' '⭕']
 ['⬜' '❌' '⬜']]
Winner: Player 1


#### Q-learning plays as first player

In [48]:
# let the Q-learning player play 1_000 games against a random player
# Q-learning player is first to move
show_statistics(0, q_learning_rl_agent, RandomPlayer())

Over 1000 matches: 996 wins, 0 losses and 4 draw
Wins + Draws percentage: 100.00%


#### Q-learning plays as second player

In [54]:
# let the Q-learning player play 1_000 games against a random player
# Q-learning player is second to move
show_statistics(1, RandomPlayer(), q_learning_rl_agent)

Over 1000 matches: 819 wins, 0 losses and 181 draw
Wins + Draws percentage: 100.00%


## Reinforcement Learning: Monte Carlo-learning

**Monte Carlo-learning** is a reinforcement learning technique which is based on updating the action-value function $Q(s_t, a_t)$ according to the following formula:
$$
Q(s_t, a_t) \leftarrow \cfrac{1}{N} \sum_{i = 1}^{N} G_i (s_t, a_t)
$$
where $N$ is the number of episodes and $G_i (s_t, a_t)$ is the return of rewards experienced at the $i$-th episode for the first occurrence of $(s_t, a_t)$ in that episode. \
The action-value function is updated iteratively over the $N$ episodes.

The cells below define a Monte Carlo-learning player and train it against a random player. \
The obtained results are also shown.

In [13]:
class MonteCarloRLPlayer(Player):
    """
    Class representing player who learns to play thanks to the Monte Carlo-learning technique.
    """

    def __init__(
        self,
        n_episodes: int,
        gamma: float,
        min_exploration_rate: float,
        exploration_decay_rate: float,
        opponent: 'Player',
    ) -> None:
        """
        The Monte Carlo-learning player constructor.

        Args:
            n_episodes: the number of episodes for the training phase;
            gamma: the discount rate of the Bellman equation;
            min_exploration_rate: the minimum rate for exploration during the training phase;
            exploration_decay_rate: the exploration decay rate used during the training;
            opponent: the opponent to play against.

        Returns:
            None.
        """
        super().__init__()
        self._q_table = {}  # define the Action-value function
        self._q_counters = {}  # define counters for the return of rewards
        self._n_episodes = n_episodes  # define the number of episodes for the training phase
        self._gamma = gamma  # define the discount rate of the Bellman equation
        self._exploration_rate = 1  # define the exploration rate for the training phase
        self._min_exploration_rate = (
            min_exploration_rate  # define the minimum rate for exploration during the training phase
        )
        self._exploration_decay_rate = (
            exploration_decay_rate  # define the exploration decay rate used during the training
        )
        self._opponent = opponent  # define the opponent to play against

    def _move_reward(self, game: 'TicTacToe', move: tuple[int, int], player_id: int) -> tuple[Literal[-1, 1], bool]:
        """
        Try a move and return the corresponding reward.

        Args:
            game: a Tic-Tac-Toe game instance;
            move: the move to try;
            player_id: my player's id.

        Returns:
            The reward and the acceptability of the move are returned.
        """
        # play a move
        acceptable = game.move(move, player_id)
        # give a negative reward to the agent
        reward = -1
        # if the move is acceptable
        if acceptable:
            # give a positive reward to the agent
            reward = 1
        return reward, acceptable

    def _game_reward(self, player: 'TicTacToe', winner: int) -> Literal[-10, 0, 10]:
        """
        Calculate the reward based on how the game ended.

        Args:
            player: the winning player;
            winner: the winner's player id.

        Returns:
            The game reward is returned.
        """
        # if there was no winner
        if winner == -1:
            # return no reward
            return 0
        # if the agent is the winner
        elif self == player:
            # give a big positive reward
            return 10
        # give a big negative reward, otherwise
        return -10

    def _map_state_to_index(self, game: 'TicTacToe') -> str:
        """
        Given a game state, this function translates it into an index to access the Q_table.

        Args:
            game: a Tic-Tac-Toe game instance;
            player_id: my player's id.
        """
        # take the current game state
        state = game.board
        # change not taken tiles values to 2
        state[state == -1] = 2
        # map the state to a string in base 3
        state_repr_index = ''.join(str(_) for _ in state.flatten())
        return state_repr_index

    def _update_q_table(self, state_repr_index: str, action: int, return_of_rewards: float) -> None:
        """
        Update the Q_table according to the Monte Carlo-learning technique.

        Args:
            state_repr_index: the current state index;
            action: the performed action;
            return_of_rewards: the return of rewards for the current state.

        Returns:
            None.
        """
        # if the current state is unknown
        if state_repr_index not in self._q_counters:
            # create its entry in the action-value mapping table
            self._q_table[state_repr_index] = np.zeros((9,))
            # create its entry in the counters of the return of rewards
            self._q_counters[state_repr_index] = np.zeros((9,))
        # update the counters of the return of rewards
        self._q_counters[state_repr_index][action] += 1
        # update the action-value mapping table
        self._q_table[state_repr_index][action] = (
            self._q_table[state_repr_index][action]
            + (return_of_rewards - self._q_table[state_repr_index][action]) / self._q_counters[state_repr_index][action]
        )

    def _make_move(self, game: 'TicTacToe') -> tuple[int, int]:
        """
        Construct a move during the training phase to update the Q_table.

        Args:
            game: a Tic-Tac-Toe game instance;
            player_id: my player's id.

        Returns:
            A move to play is returned.
        """
        # get the current state representation
        state_repr_index = self._map_state_to_index(game)

        # randomly perform exploration
        if random() < self._exploration_rate:
            # by returning a random move
            move = randint(0, 8)
        # perform eploitation, otherwise
        else:
            # if the current state is unknown
            if state_repr_index not in self._q_table:
                # create its entry in the action-value mapping table
                self._q_table[state_repr_index] = np.zeros((9,))
                # create its entry in the counters of the return of rewards
                self._q_counters[state_repr_index] = np.zeros((9,))
            # take the action with maximum return of rewards
            move = np.argmax(self._q_table[state_repr_index])

        # reshape the move to match the board shape
        move = move // 3, move % 3

        return move

    def make_move(self, game: 'TicTacToe') -> tuple[int, int]:
        """
        Construct a move to be played according to the Q_table.

        Args:
            game: a Tic-Tac-Toe game instance;
            player_id: my player's id.

        Returns:
            A move to play is returned.
        """
        # get the current state representation
        state_repr_index = self._map_state_to_index(game)
        # if the current state is known
        if state_repr_index in self._q_table:
            # take the action with maximum return of rewards
            move = np.argmax(self._q_table[state_repr_index])
            # reshape the move to match the board shape
            move = move // 3, move % 3
            # if the move is acceptable
            if game.is_acceptable(move):
                # return it
                return move
        # perform a random move, otherwise
        return (randint(0, game.board.shape[0] - 1), randint(0, game.board.shape[1] - 1))

    def train(self) -> None:
        """
        Train the Monte Carlo-learning player.

        Args:
            None.

        Returns:
            None.
        """
        # define the history of rewards
        all_rewards = []
        # define how many episodes to run
        pbar = trange(self._n_episodes)
        # define the players
        players = (self, self._opponent)

        # for each episode
        for episode in pbar:
            # define a new game
            game = TicTacToe()
            # sets the rewards to zero
            rewards = 0

            # define the trajectory
            trajectory = []

            # define a variable to indicate if there is a winner
            winner = -1
            # swap players order
            players = (players[1], players[0])
            # define the current player index
            player_idx = 1

            # if we can still play
            while winner < 0 and game.is_still_playable():
                # change player
                player_idx = (player_idx + 1) % 2
                player = players[player_idx]

                # define a variable to check if the chosen move is ok or not
                ok = False
                # if it is our turn
                if self == player:
                    # while the chosen move is not ok
                    while not ok:
                        # get the current state representation
                        state_repr_index = self._map_state_to_index(game)
                        # get a move
                        move = self._make_move(game)
                        # reshape the move to form an index
                        action = move[0] * 3 + move[1]
                        # perform the move and get the reward
                        reward, ok = self._move_reward(game, move, player_idx)

                        # update the trajectory
                        trajectory.append((state_repr_index, action, reward))

                        # update the rewards
                        rewards += reward
                # if it is the opponent turn
                else:
                    # while the chosen move is not ok
                    while not ok:
                        # get a move
                        move = player.make_move(game)
                        # perform the move
                        ok = game.move(move, player_idx)

                # check if there is a winner
                winner = game.check_winner()

            # update the exploration rate
            self._exploration_rate = np.clip(
                np.exp(-self._exploration_decay_rate * episode), self._min_exploration_rate, 1
            )
            # delete last reward
            rewards -= reward
            # delete last tuple in trajectory
            trajectory.pop()
            # get the game reward
            reward = self._game_reward(player, winner)
            # update the trajectory
            trajectory.append((state_repr_index, action, reward))
            # update the rewards
            rewards += reward
            # update the rewards history
            all_rewards.append(rewards)

            # set the current return of rewards
            return_of_rewards = 0
            # for all tuples in trajectory
            for state_repr_index, action, reward in trajectory:
                # update the return of rewards
                return_of_rewards = reward + self._gamma * return_of_rewards
                # update the action-value function
                self._update_q_table(state_repr_index, action, return_of_rewards)

            pbar.set_description(f'rewards value: {rewards}, current exploration rate: {self._exploration_rate:2f}')

        print(f'** Last 1_000 episodes - Mean rewards value: {sum(all_rewards[-1_000:]) / 1_000:.2f} **')
        print(f'** Last rewards value: {all_rewards[-1]:} **')

In [15]:
# create the Monte Carlo-learning player
monte_carlo_rl_agent = MonteCarloRLPlayer(
    n_episodes=100_000,
    gamma=0.99,
    min_exploration_rate=0.01,
    exploration_decay_rate=2.5e-5,
    opponent=RandomPlayer(),
)
# train the Monte Carlo-learning player
monte_carlo_rl_agent.train()

rewards value: 12, current exploration rate: 0.082087: 100%|██████████| 100000/100000 [01:56<00:00, 859.23it/s]  

** Last 1_000 episodes - Mean rewards value: 8.55 **
** Last rewards value: 12 **





In [55]:
# print the number of explored states
print(f'Number of explored states: {len(monte_carlo_rl_agent._q_table.keys())}')

Number of explored states: 4516


In [64]:
# serialize the Monte Carlo-learning player
with open('./monte_carlo_rl_agent.pkl', 'wb') as f:
    pickle.dump(monte_carlo_rl_agent, f)

In [65]:
# load the serialized Monte Carlo-learning player
with open('./monte_carlo_rl_agent.pkl', 'rb') as f:
    monte_carlo_rl_agent = pickle.load(f)

### Monte Carlo-learning 🆚 Random Player

In [289]:
# let the Monte Carlo-learning player play against a random player
# Monte Carlo-learning player is second to move
game = TicTacToe()
player1 = RandomPlayer()
player2 = monte_carlo_rl_agent
winner = play(game, player1, player2, log=True)

[['⬜' '⬜' '⬜']
 ['⬜' '⬜' '⬜']
 ['⬜' '⬜' '⬜']]
[['⬜' '⬜' '❌']
 ['⬜' '⬜' '⬜']
 ['⬜' '⬜' '⬜']]
[['⭕' '⬜' '❌']
 ['⬜' '⬜' '⬜']
 ['⬜' '⬜' '⬜']]
[['⭕' '❌' '❌']
 ['⬜' '⬜' '⬜']
 ['⬜' '⬜' '⬜']]
[['⭕' '❌' '❌']
 ['⬜' '⬜' '⬜']
 ['⭕' '⬜' '⬜']]
[['⭕' '❌' '❌']
 ['⬜' '❌' '⬜']
 ['⭕' '⬜' '⬜']]
[['⭕' '❌' '❌']
 ['⭕' '❌' '⬜']
 ['⭕' '⬜' '⬜']]
Winner: Player 1


#### Monte Carlo-learning plays as first player

In [273]:
# let the Monte Carlo-learning player play 1_000 games against a random player
# Monte Carlo-learning player is first to move
show_statistics(0, monte_carlo_rl_agent, RandomPlayer())

Over 1000 matches: 959 wins, 8 losses and 33 draw
Wins + Draws percentage: 99.20%


#### Monte Carlo-learning plays as second player

In [66]:
# let the Monte Carlo-learning player play 1_000 games against a random player
# Monte Carlo-learning player is second to move
show_statistics(1, RandomPlayer(), monte_carlo_rl_agent)

Over 1000 matches: 692 wins, 48 losses and 260 draw
Wins + Draws percentage: 95.20%


## Monte Carlo-learning 🆚 Q-learning

#### Monte Carlo-learning plays as first player

In [138]:
# let the Monte Carlo-learning player and Q-learning player play 1_000 games against each other
# Monte Carlo-learning player is first to move
show_statistics(0, monte_carlo_rl_agent, q_learning_rl_agent)

Over 1000 matches: 0 wins, 0 losses and 1000 draw
Wins + Draws percentage: 100.00%


#### Q-learning plays as first player

In [139]:
# let the Q-learning player and Monte Carlo-learning player play 1_000 games against each other
# Q-learning player player is first to move
show_statistics(0, q_learning_rl_agent, monte_carlo_rl_agent)

Over 1000 matches: 1000 wins, 0 losses and 0 draw
Wins + Draws percentage: 100.00%
