Copyright **`(c)`** 2023 Giovanni Squillero `<giovanni.squillero@polito.it>`  
[`https://github.com/squillero/computational-intelligence`](https://github.com/squillero/computational-intelligence)  
Free for personal or classroom use; see [`LICENSE.md`](https://github.com/squillero/computational-intelligence/blob/master/LICENSE.md) for details.  

# LAB10

Use reinforcement learning to devise a tic-tac-toe player.

### Deadlines:

* Submission: [Dies Natalis Solis Invicti](https://en.wikipedia.org/wiki/Sol_Invictus)
* Reviews: [Befana](https://en.wikipedia.org/wiki/Befana)

Notes:

* Reviews will be assigned on Monday, December 4
* You need to commit in order to be selected as a reviewer (i.e., better to commit an empty work than not to commit)

In [56]:
import numpy as np
import random


I used this site to help me with the code:

- [Building a Tic-Tac-Toe Game with Reinforcement Learning in Python: A Step-by-Step Tutorial](https://plainenglish.io/blog/building-a-tic-tac-toe-game-with-reinforcement-learning-in-python)


### Instantiate a tic-tac-toe game
The following class provides all the tools needed to play a **tic-tac-toe** game. The board is a 3x3 grid, stored as a flat list of 9 cells indexed 0-8 row by row; each cell is either empty or holds a player's token (`X` or `O`). The two players, `X` and `O`, alternate in placing their tokens, with `X` moving first. The game ends when one player places three tokens in a row, horizontally, vertically, or diagonally. If all the cells are occupied and no player has three in a row, the game ends in a draw.

In [57]:
class TicTacToe:
    def __init__(self):
        self.board = [" "] * 9
        self.current_player = "X"
        self.winner = None
    
    def display_board(self):
        """ Display the current state of the board. """
        for i in range(0, 9, 3):
            print(f"{self.board[i]} | {self.board[i+1]} | {self.board[i+2]}")
            if i != 6:
                print("---------")
        print("\n")
    
    def check_winner(self):
        """ Check if there is a winner. If there is, set self.winner to the winner's symbol. """
        # check rows
        for i in range(0, 9, 3):
            if self.board[i] == self.board[i+1] == self.board[i+2] != " ":
                self.winner = self.current_player
                return
        
        # check columns
        for i in range(3):
            if self.board[i] == self.board[i+3] == self.board[i+6] != " ":
                self.winner = self.current_player
                return
        
        # check diagonals
        if self.board[0] == self.board[4] == self.board[8] != " " or self.board[2] == self.board[4] == self.board[6] != " ":
            self.winner = self.current_player
            return
    
    def switch_player(self):
        """ Switch the current player. """
        self.current_player = "O" if self.current_player == "X" else "X"
    
    def make_move(self, position):
        """ Place the current player's token at `position` (0-8). Return True if the move was legal. """
        if self.board[position] == " ":
            self.board[position] = self.current_player
            self.check_winner()
            self.switch_player()
            return True
        else:
            print("Invalid move! You can't take a spot that's already taken.")
            return False
    
    def full_board(self):
        """ Check if the board is full. """
        return " " not in self.board
    
    def game_over(self):
        """ Check if the game is over."""
        return self.winner is not None or self.full_board()
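
The three checks in `check_winner` can also be expressed as a single table of winning lines. The sketch below is a standalone illustration under that assumption; the names `WINNING_LINES` and `find_winner` are hypothetical and not part of the class above.

```python
# All eight winning lines on a board stored as a flat list of 9 cells
WINNING_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def find_winner(board):
    """Return "X" or "O" if a line is complete, otherwise None."""
    for a, b, c in WINNING_LINES:
        if board[a] == board[b] == board[c] != " ":
            return board[a]
    return None


board = ["X", "X", "X",
         "O", "O", " ",
         " ", " ", " "]
print(find_winner(board))      # X
print(find_winner([" "] * 9))  # None
```

Unlike `check_winner`, this version reads the winner from the board itself instead of relying on `current_player`, so it does not depend on when the players are switched.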
            
            

The RandomPlayer class is a player that selects a random move among the available ones.

In [58]:
class RandomPlayer:
    def get_move(self, game):
        valid_moves = [i for i in range(9) if game.board[i] == " "]
        return random.choice(valid_moves)

The QLearningPlayer class is a player that uses Q-learning to learn how to play the game.

The `get_move` method chooses a random move with probability `epsilon` (that is, when a uniform random number between 0 and 1 is smaller than `epsilon`); otherwise it chooses the available move with the highest Q-value.
The `update_q_value` method updates the Q-value of a state-action pair using the Q-learning rule.
- Q-learning rule: `Q(s,a) = Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))`
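
As a worked instance of the rule above, the sketch below applies one update to a small Q-table stored as a dictionary keyed by `(state, action)`. The function name `q_update`, the string state labels, and the numeric values are illustrative assumptions.

```python
def q_update(q_values, state, action, reward, next_state, next_actions, alpha, gamma):
    """Apply one Q-learning update in place and return the new Q(state, action)."""
    current = q_values.get((state, action), 0.0)
    # Best value reachable from the next state; 0 if no actions are left (terminal state)
    best_next = max((q_values.get((next_state, a), 0.0) for a in next_actions), default=0.0)
    new_value = current + alpha * (reward + gamma * best_next - current)
    q_values[(state, action)] = new_value
    return new_value


# Hypothetical values: the next state already has one action valued at 1.0
q = {("s1", 4): 1.0}
value = q_update(q, "s0", 0, reward=0, next_state="s1", next_actions=[4, 8], alpha=0.5, gamma=0.9)
print(value)  # 0.45
```

With no immediate reward, the new value is `alpha * gamma * 1.0 = 0.45`: the value of the next state is discounted by `gamma` and blended in at rate `alpha`.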


In [59]:
class QLearningPlayer:
    def __init__(self, alpha, epsilon, gamma):
        self.q_values = {}
        self.alpha = alpha
        self.epsilon = epsilon
        self.gamma = gamma
    
    def get_move(self, game):
        """ Epsilon-greedy choice among the empty cells. """
        available_moves = [i for i in range(9) if game.board[i] == " "]
        if random.uniform(0, 1) < self.epsilon:
            return random.choice(available_moves)  # explore
        current_board_state = tuple(game.board)
        q_values = {move: self.q_values.get((current_board_state, move), 0) for move in available_moves}
        return max(q_values, key=q_values.get)  # exploit; ties go to the lowest index
    
    def update_q_value(self, state, action, reward, next_state):
        """ Apply the Q-learning rule to the pair (state, action). Unseen pairs default to a Q-value of 0. """
        current_q_value = self.q_values.get((state, action), 0)
        max_next_q_value = max(self.q_values.get((next_state, next_action), 0) for next_action in range(9))
        new_q_value = current_q_value + self.alpha * (reward + self.gamma * max_next_q_value - current_q_value)
        self.q_values[(state, action)] = new_q_value

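The `train_agent` function trains the player for a given number of epochs. At each epoch, the learning agent (playing `X`) plays one game against a random player (playing `O`). When the game ends, the Q-value of the final move is updated according to the Q-learning rule, with reward 1 if `X` won and 0 otherwise. The function returns nothing; what the agent learns is stored in its `q_values` dictionary.
- number of epochs: `10_000` (first experiment; the parameter searches use `1_000`)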

In [60]:
def train_agent(random_agent, learning_agent, epochs):
    for _ in range(epochs):
        game = TicTacToe()
        
        while not game.game_over():
            if game.current_player == "X":
                move = learning_agent.get_move(game)
            else:
                move = random_agent.get_move(game)
            
            current_board_state = tuple(game.board)
            game.make_move(move)
            
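            # Only the final move of each game is updated (whichever player made it):
            # reward 1 if X (the learning agent) won, 0 for a draw or a loss.
            if game.game_over():
                reward = 1 if game.winner == "X" else 0
                learning_agent.update_q_value(current_board_state, move, reward, tuple(game.board))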
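
In `train_agent` above, `update_q_value` is called only when the game ends, so only the last move of each game receives a value. One possible variation, which is an assumption of this sketch and not what the notebook does, records every move of the learning agent and applies the Q-learning rule to each of them once the game is over. The name `backup_episode` and the shortened state labels are hypothetical.

```python
def backup_episode(q_values, steps, final_reward, alpha=0.5, gamma=0.9):
    """Update Q-values for every learner move of one finished game.

    steps: list of (state, action, next_state, next_actions) tuples in play order,
    where next_state is the board the learner sees on its next turn and
    next_actions are the moves available there (empty for the final move).
    """
    last = len(steps) - 1
    # Walk backwards so the terminal reward reaches earlier moves in a single pass
    for index in range(last, -1, -1):
        state, action, next_state, next_actions = steps[index]
        reward = final_reward if index == last else 0.0
        best_next = max((q_values.get((next_state, a), 0.0) for a in next_actions), default=0.0)
        current = q_values.get((state, action), 0.0)
        q_values[(state, action)] = current + alpha * (reward + gamma * best_next - current)
    return q_values


# Hypothetical two-move game won by the learner (states shortened to labels)
q = backup_episode({}, [("s0", 4, "s1", [0, 8]), ("s1", 0, None, [])], final_reward=1.0)
print(q[("s1", 0)], q[("s0", 4)])  # 0.5 0.225
```

The earlier move ends up with a smaller value than the winning move because its target is discounted by `gamma` and it carries no immediate reward.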

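The `test_agent` function tests the player for a given number of epochs with exploration turned off (`epsilon = 0`). At each epoch, the player plays a game against a random player. The function returns the win rate, that is, the fraction of games won by the learning agent; draws and losses are counted together.
- number of epochs: `1_000`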

In [61]:
def test_agent(random_agent, learning_agent, epochs):
    learning_agent.epsilon = 0 # turn off exploration
    wins = 0
    non_wins = 0 # draws and losses, counted together
    
    for _ in range(epochs):
        game = TicTacToe()
        while not game.game_over():
            if game.current_player == "X":
                move = learning_agent.get_move(game)
            else:
                move = random_agent.get_move(game)
            
            game.make_move(move)
            
            if game.game_over():
                if game.winner == "X":
                    wins += 1
                else:
                    non_wins += 1
    
    win_rate = wins / (wins + non_wins)
    return win_rate
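
`test_agent` above reports only the win rate, so draws and losses are not distinguished. A minimal sketch of a separate tally follows, assuming the winner of each game (`"X"`, `"O"`, or `None` for a draw) has been collected in a list; the helper name `tally_outcomes` is hypothetical.

```python
from collections import Counter


def tally_outcomes(winners, learner_symbol="X"):
    """Count wins, draws, and losses from the learner's point of view."""
    counts = Counter()
    for winner in winners:
        if winner is None:
            counts["draws"] += 1
        elif winner == learner_symbol:
            counts["wins"] += 1
        else:
            counts["losses"] += 1
    return {key: counts[key] for key in ("wins", "draws", "losses")}


print(tally_outcomes(["X", "O", None, "X"]))  # {'wins': 2, 'draws': 1, 'losses': 1}
```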

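This function plays a game between two players, displays the final board, and prints the outcome. `player1` plays `X` and is reported as `QLearningPlayer`, so the learning agent must be passed first.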

In [62]:
def play_game(player1, player2):
    game = TicTacToe()
    
    while not game.game_over():
        if game.current_player == "X":
            move = player1.get_move(game)
        else:
            move = player2.get_move(game)
        
        game.make_move(move)
        
        if game.game_over():
            game.display_board()
            if game.winner is not None:
                winner = "QLearningPlayer" if game.winner == "X" else "RandomPlayer"
                print(f"{winner} wins!")
            else:
                print("It's a tie!")

Parameters for learning agent (Q-learning):
- Alpha: learning rate
  - determines the extent to which newly acquired information overrides old information
  - 0: the agent does not learn anything
  - 1: the agent considers only the most recent information
  - 0.1 is a good starting point
- Gamma: discount factor
  - determines the importance of future rewards
  - 0: the agent is myopic (short-sighted)
  - 1: the agent is far-sighted
  - 0.9 is a good starting point
- Epsilon: exploration rate
  - determines the probability that the agent will explore the environment rather than exploiting it
  - 0: the agent is greedy
  - 1: the agent always explores
  - 0.1 is a good starting point
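
The two extremes of `epsilon` listed above can be checked with a small standalone sketch of epsilon-greedy selection. The function name `epsilon_greedy` and the Q-values are illustrative assumptions; a seeded `random.Random` instance is used so that the run is repeatable.

```python
import random


def epsilon_greedy(q_by_action, epsilon, rng):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_by_action))
    return max(q_by_action, key=q_by_action.get)


rng = random.Random(0)                  # seeded so the run is repeatable
q_by_action = {0: 0.1, 4: 0.9, 8: 0.3}  # hypothetical Q-values for three free cells
greedy_picks = {epsilon_greedy(q_by_action, 0.0, rng) for _ in range(200)}
random_picks = {epsilon_greedy(q_by_action, 1.0, rng) for _ in range(200)}
print(greedy_picks)       # {4}
print(len(random_picks))  # number of distinct actions chosen while exploring
```

With `epsilon = 0` the agent always picks action 4, the one with the highest value; with `epsilon = 1` the Q-values are ignored and the choice is uniform among the free cells.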

Now we will train an agent to play tic-tac-toe using pre-defined parameters for the learning agent. The agent trains against a random player for 10,000 games, and we then measure its win rate over 1,000 test games.
- `alpha = 0.5`
- `epsilon = 0.1`
- `gamma = 0.9`

In [63]:
random_player = RandomPlayer()
learning_player = QLearningPlayer(
    alpha=0.5,
    epsilon=0.1,
    gamma=0.9
)
print("Training...")
train_agent(random_player, learning_player, 10000)
print("Testing...\n")
win_rate = test_agent(random_player, learning_player, 1000)
print(f"Learning agent has a win rate of {(win_rate*100):.2f}% against a random agent")

print("\nPlay one more game against the agent after the training:")
play_game(learning_player, random_player) # the learning agent must be player1 ("X")

Training...
Testing...

Learning agent has a win rate of 86.40% against a random agent

Play one more game against the agent after the training:
O | O | X
---------
X | O | X
---------
  |   | X


QLearningPlayer wins!


Now we will train an agent for each combination of parameters (1,000 training games and 1,000 test games per combination) and keep the combination with the highest win rate, to see how the agent's performance changes with different values of `alpha`, `epsilon`, and `gamma`.
- `alpha = np.linspace(0.01, 1, 5)`
- `epsilon = np.linspace(0.01, 1, 5)`
- `gamma = np.linspace(0.01, 1, 5)`

In [64]:
from itertools import product

random_player = RandomPlayer()
alpha = np.linspace(0.01, 1, 5)
epsilon = np.linspace(0.01, 1, 5)
gamma = np.linspace(0.01, 1, 5)
best_win_rate = 0
best_params = None

for a, e, g in product(alpha, epsilon, gamma):
    learning_player = QLearningPlayer(
        alpha=a,
        epsilon=e,
        gamma=g
    )
    train_agent(random_player, learning_player, 1000)
    win_rate = test_agent(random_player, learning_player, 1000)
    if win_rate > best_win_rate:
        best_win_rate = win_rate
        best_params = (a, e, g)

print(f"Best win rate: {(best_win_rate*100):.2f}%")
print(f"Best parameters: {best_params}")

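random_player = RandomPlayer()
learning_player = QLearningPlayer(
    alpha=best_params[0],
    epsilon=best_params[1],
    gamma=best_params[2]
)
# Retrain with the best parameters: the agents created inside the loop were discarded
train_agent(random_player, learning_player, 1000)
learning_player.epsilon = 0 # play greedily in the demo game

print("\nPlay one more game against the agent after the training with the best parameters:")
play_game(learning_player, random_player) # the learning agent must be player1 ("X")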

Best win rate: 87.10%
Best parameters: (0.01, 0.505, 1.0)

Play one more game against the agent after the training with the best parameters:
X | O |  
---------
O | X | X
---------
  | O | X


QLearningPlayer wins!
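
To make the size of this search concrete, the sketch below enumerates the grid used in the cell above and counts the games it requires. The `linspace` helper is a pure-Python stand-in for `np.linspace`, written here only to keep the example free of dependencies; it is an assumption of this sketch, not code from the notebook.

```python
from itertools import product


def linspace(start, stop, num):
    """Pure-Python stand-in for np.linspace(start, stop, num), for num >= 2."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


values = linspace(0.01, 1, 5)
grid = list(product(values, values, values))  # every (alpha, epsilon, gamma) triple
games_per_combination = 1000 + 1000           # training games + test games
print([round(v, 4) for v in values])          # [0.01, 0.2575, 0.505, 0.7525, 1.0]
print(len(grid), len(grid) * games_per_combination)  # 125 250000
```

The reported best parameters `(0.01, 0.505, 1.0)` are three of these grid points. Because the number of combinations grows with the cube of the points per parameter, a grid of 10 values per parameter needs 1,000 combinations instead of 125.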


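I tried to reduce the execution time by using the `concurrent.futures` module to run the training of the agents in parallel threads, this time over a finer grid of 10 values per parameter. Since the training is CPU-bound Python code, threads are limited by CPython's global interpreter lock, so the speed-up may be small. I also tried the `multiprocessing` module, but it does not seem to work well with Jupyter notebooks.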

In [65]:
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def train_and_test_agent(args):
    alpha, epsilon, gamma = args
    random_player = RandomPlayer()
    learning_player = QLearningPlayer(alpha=alpha, epsilon=epsilon, gamma=gamma)
    train_agent(random_player, learning_player, 1000)
    win_rate = test_agent(random_player, learning_player, 1000)
    return win_rate, (alpha, epsilon, gamma)
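
`train_and_test_agent` above relies on the module-level `random` functions, whose single generator is shared by all threads, so individual runs may be hard to reproduce. A minimal sketch of one alternative gives each task its own seeded generator. The worker `evaluate`, the helper `search`, and the random placeholder score are assumptions made for illustration; a real worker would train and test an agent instead.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(task):
    """Stand-in worker: returns (score, params) like train_and_test_agent."""
    seed, params = task
    rng = random.Random(seed)  # private generator, independent of other threads
    score = rng.random()       # placeholder for a real train-and-test run
    return score, params


def search(grid, max_workers=4):
    tasks = list(enumerate(grid))  # the task index doubles as the seed
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(evaluate, tasks))  # results keep the input order


grid = list(product([0.1, 0.5], [0.1, 0.9]))  # hypothetical (alpha, epsilon) pairs
results = search(grid)
best_score, best_params = max(results, key=lambda r: r[0])
print(best_params in grid)      # True
print(search(grid) == results)  # True: same seeds, same scores
```

`executor.map` returns results in the order of its inputs, so each score stays paired with the parameters that produced it regardless of which thread finishes first.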

In [66]:
random_player = RandomPlayer()
alpha = np.linspace(0.1, 1, 10)
epsilon = np.linspace(0.1, 1, 10)
gamma = np.linspace(0.1, 1, 10)

params_combinations = list(product(alpha, epsilon, gamma))

with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(train_and_test_agent, params_combinations))

best_win_rate, best_params = max(results, key=lambda x: x[0])

print(f"Best win rate: {(best_win_rate*100):.2f}%")
print(f"Best parameters: {best_params}")

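random_player = RandomPlayer()
learning_player = QLearningPlayer(alpha=best_params[0], epsilon=best_params[1], gamma=best_params[2])
# Retrain with the best parameters: the agents created by the workers were discarded
train_agent(random_player, learning_player, 1000)
learning_player.epsilon = 0 # play greedily in the demo game

print("\nPlay one more game against the agent after the training with the best parameters:")
play_game(learning_player, random_player) # the learning agent must be player1 ("X")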

Best win rate: 88.80%
Best parameters: (0.8, 0.2, 0.2)

Play one more game against the agent after the training with the best parameters:
X | O |  
---------
O | X |  
---------
  |   | X


QLearningPlayer wins!
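
The three win rates reported in this notebook (86.40%, 87.10%, and 88.80%) each come from 1,000 test games, so part of the difference between them may be sampling noise. The sketch below estimates that noise with the standard error of a proportion, assuming the test games are independent; the function name `win_rate_standard_error` is hypothetical.

```python
import math


def win_rate_standard_error(win_rate, games):
    """Standard error of a win rate estimated from `games` independent games."""
    return math.sqrt(win_rate * (1 - win_rate) / games)


for rate in (0.864, 0.871, 0.888):
    margin = 1.96 * win_rate_standard_error(rate, 1000)  # approximate 95% interval
    print(f"{rate:.3f} +/- {margin:.3f}")
```

Each margin is roughly two percentage points, which is comparable to the gaps between the three results. Picking the maximum over many combinations also tends to favour lucky runs, so the best win rate found by a search is likely an optimistic estimate.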
