Copyright **`(c)`** 2023 Giovanni Squillero `<giovanni.squillero@polito.it>`  
[`https://github.com/squillero/computational-intelligence`](https://github.com/squillero/computational-intelligence)  
Free for personal or classroom use; see [`LICENSE.md`](https://github.com/squillero/computational-intelligence/blob/master/LICENSE.md) for details.  

# LAB10

Use reinforcement learning to devise a tic-tac-toe player.

### Deadlines:

* Submission: [Dies Natalis Solis Invicti](https://en.wikipedia.org/wiki/Sol_Invictus)
* Reviews: [Befana](https://en.wikipedia.org/wiki/Befana)

Notes:

* Reviews will be assigned on Monday, December 4
* You need to commit in order to be selected as a reviewer (i.e., better to commit an empty work than not to commit)

In [109]:
import numpy as np
from itertools import permutations
from tqdm import tqdm

# Class Definition

Here I define the `TicTacToe` class. Its methods are:

- `print_board_nice()` -> prints the board with `X`, `O`, and separators
- `state()` -> returns the current state, described as the flattened board
- `next_actions()` -> returns the available moves as `(row, column)` pairs, or an empty list once a winner has been recorded
- `make_move()` -> writes the current player's index (1 or -1) into the chosen cell and switches player
- `check_winner()` -> checks whether the given player owns three cells whose magic-square values sum to 12, i.e. a full row, column, or diagonal
- `reward()` -> returns 1 if the given player has won, -1 if the opponent has won, and 0 otherwise
- `is_over()` -> checks if a game is over (someone has won or no moves are left)
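
The magic-square trick behind `check_winner()` can be tried in isolation. The sketch below is only an illustration: `has_won` is a hypothetical standalone helper (not part of the class), and it uses `combinations` instead of `permutations`, which gives the same answer while testing each triple only once.

```python
from itertools import combinations

import numpy as np

# Same magic square as in the class: every row, column, and diagonal sums to 12.
MAGIC = np.array([[1, 6, 5], [8, 4, 0], [3, 2, 7]])


def has_won(board, player):
    """Return True if `player` owns three cells whose magic values sum to 12."""
    cells = MAGIC[board == player]
    return any(sum(triple) == 12 for triple in combinations(cells, 3))


# X (1) holds the main diagonal: magic values 1 + 4 + 7 = 12.
board = np.array([[1, -1, 0],
                  [-1, 1, 0],
                  [0, 0, 1]], dtype=np.int8)

print(has_won(board, 1))   # True
print(has_won(board, -1))  # False (O owns only two cells)
```

The check is sound because exactly eight triples of distinct values in 0..8 sum to 12, and they are the eight lines of the square.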

In [110]:
class TicTacToe:
  def __init__(self):
    self.board = np.zeros((3,3), dtype=np.int8)
    self.player = 1
    self.MAGIC = np.array([[1,6,5],[8,4,0],[3,2,7]])
    self.winner = None

  def print_board_nice(self):
    for i in range(3):
      for j in range(3):
        if self.board[i,j] == 1:
          print("X", end="")
        elif self.board[i,j] == -1:
          print("O", end="")
        else:
          print(" ", end="")
        if j != 2:
          print("|", end="")
      print()
      if i != 2:
        print("-----")
    print()
  
  def state(self):
    return self.board.flatten()
  
  def next_actions(self):
    if self.winner is not None:
      return list()
    row, col = np.where(self.board == 0)
    return list(zip(row, col))
  
  def make_move(self, move):
    if self.board[move] != 0:
      raise ValueError("Invalid move")
    self.board[move] = self.player
    self.player *= -1

  def check_winner(self, player):
    # Magic-square values of the cells owned by `player`; any three summing to 12 form a line
    cells = self.MAGIC[self.board == player]
    if any(sum(triple) == 12 for triple in permutations(cells, 3)):
      self.winner = player
      return True
    return False
  
  def reward(self, player):
    if self.check_winner(player):
      return 1
    if self.check_winner(-player):
      return -1
    return 0

  def is_over(self):
    # Check for a winner first, so self.winner is also set when the winning move fills the board
    return self.check_winner(1) or self.check_winner(-1) or len(self.next_actions()) == 0

Here I define the `QLearning` class, which follows the standard tabular approach: the Q-table is a dictionary keyed by `(state, action)`, and unseen pairs start at 0.
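
For reference, the standard tabular Q-learning update can be written as follows. This is a restatement in conventional notation, not something taken from the notebook: $\alpha$ is the learning rate (`alpha`), $\gamma$ the discount factor (`gamma`), $r$ the reward, $s'$ the next state, and the maximum is taken as 0 when $s'$ has no available actions.

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$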

To balance exploration and exploitation, I used an epsilon that decreases linearly from 1 to 0.1 over the training games, so exploration is favored at the beginning and exploitation toward the end of the training process.
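
The schedule and the epsilon-greedy choice can be sketched on their own. This is a minimal illustration, assuming the same `np.linspace` schedule as the training cell; `epsilon_greedy` and the seeded `rng` are hypothetical names introduced here, not part of the notebook.

```python
import numpy as np

games = 50_000
# Linear decay from pure exploration (1.0) to mostly greedy play (0.1).
epsilon = np.linspace(1, 0.1, num=games, endpoint=True)

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable


def epsilon_greedy(q_values, eps):
    """Pick an index: random with probability eps, otherwise a random argmax."""
    if rng.uniform() < eps:
        return int(rng.integers(len(q_values)))
    best = np.flatnonzero(q_values == np.max(q_values))
    return int(rng.choice(best))  # ties are broken at random


print(epsilon[0], epsilon[-1])      # first and last value of the schedule
print(epsilon[games // 2])          # roughly 0.55 halfway through training
print(epsilon_greedy(np.array([0.0, 1.0, 0.5]), eps=0.0))  # always the argmax
```

With `eps=0.0` the random branch is never taken, which is why the evaluation cell sets epsilon to 0 before playing.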

In [111]:
class QLearning:
  def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
    self.alpha = alpha
    self.gamma = gamma
    self.epsilon = epsilon
    self.Q = dict()

  def set_epsilon(self, epsilon):
    self.epsilon = epsilon

  def get_Q(self, state, action):
    if (state, action) not in self.Q:
      self.Q[(state, action)] = 0
    return self.Q[(state, action)]

  def get_action(self, state, actions):
    if np.random.uniform() < self.epsilon:
      return actions[np.random.choice(range(len(actions)))]
    else:
      Qs = np.array([self.get_Q(state, action) for action in actions])
      max_Q = np.max(Qs)
      return actions[np.random.choice(np.where(Qs == max_Q)[0])]

  def update(self, state, action, reward, next_state, next_actions):
    q_value = self.get_Q(state, action)
    next_q_values = np.array([self.get_Q(next_state, next_action) for next_action in next_actions])
    max_next_q_value = np.max(next_q_values) if len(next_q_values) > 0 else 0
    self.Q[(state, action)] = q_value + self.alpha * (reward + self.gamma * max_next_q_value - q_value)
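
As a quick numeric check of the rule used in `update()`, here is a sketch of two single steps with the class defaults. `q_update` is a hypothetical standalone helper, and the values 0.0 and 0.5 are made up for illustration.

```python
alpha, gamma = 0.1, 0.9  # defaults of the QLearning class


def q_update(q_value, reward, max_next_q_value):
    """One tabular Q-learning step (hypothetical standalone helper)."""
    return q_value + alpha * (reward + gamma * max_next_q_value - q_value)


# Terminal win: reward 1 and no next actions, so the max term is 0.
q_terminal = q_update(0.0, reward=1, max_next_q_value=0.0)
print(q_terminal)  # 0.1

# Earlier move: reward 0, but the best follow-up is already valued at 0.5.
q_earlier = q_update(0.0, reward=0, max_next_q_value=0.5)
print(round(q_earlier, 3))  # 0.045
```

The second step shows how value propagates backwards: a move with no immediate reward still gains value from the discounted estimate of what follows.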
  

In [112]:
Q1 = QLearning()
games = 50000
epsilon = np.linspace(1, 0.1, num=games, endpoint=True)

for i in tqdm(range(games), desc="Training", unit="game"):
  game = TicTacToe()
  Q1.set_epsilon(epsilon[i])
  while not game.is_over():
    state = game.state().copy()
    actions = game.next_actions()
    action = Q1.get_action(str(state), actions)
    game.make_move(action)
    
    if game.is_over():
      # The agent's move ended the game: terminal update (no next actions)
      next_state = game.state().copy()
      next_actions = game.next_actions()
      reward = game.reward(1)
      Q1.update(str(state), action, reward, str(next_state), next_actions)

    else:
      # The opponent (player -1) replies with a uniformly random move
      actions_2 = game.next_actions()
      action_2 = actions_2[np.random.choice(range(len(actions_2)))]
      game.make_move(action_2)

      # Reward from the agent's perspective: -1 if the reply won the game, 0 otherwise
      reward = game.reward(1)
      next_state = game.state().copy()
      next_actions = game.next_actions()
      Q1.update(str(state), action, reward, str(next_state), next_actions)


Training: 100%|██████████| 50000/50000 [01:32<00:00, 537.66game/s]


In [113]:
Q1.set_epsilon(0)
wins = 0
losses = 0
ties = 0 
games = 10000

for i in range(games):
  game = TicTacToe()
  while not game.is_over():
    if game.player == 1:
      state = game.state()
      actions = game.next_actions()
      action = Q1.get_action(str(state), actions)
      game.make_move(action)
    else:
      state = game.state()
      actions = game.next_actions()
      action = actions[np.random.choice(range(len(actions)))]
      game.make_move(action)
    
  if game.winner == 1:
    wins += 1
  elif game.winner == -1:
    losses += 1
  else:
    ties += 1

print("Wins: ", wins)
print("Losses: ", losses)
print("Ties: ", ties)
print("Win rate: ", wins/games)
print("Loss rate: ", losses/games)
print("Tie rate: ", ties/games)

Wins:  9955
Losses:  0
Ties:  45
Win rate:  0.9955
Loss rate:  0.0
Tie rate:  0.0045


The same approach can be used to train an agent that plays as player 2. Its results may differ slightly, because in tic-tac-toe the player who moves first has the advantage.