Copyright **`(c)`** 2023 Giovanni Squillero `<giovanni.squillero@polito.it>`  
[`https://github.com/squillero/computational-intelligence`](https://github.com/squillero/computational-intelligence)  
Free for personal or classroom use; see [`LICENSE.md`](https://github.com/squillero/computational-intelligence/blob/master/LICENSE.md) for details.  

In [93]:
from itertools import combinations
from collections import namedtuple, defaultdict
from random import choice
from copy import deepcopy
from SimBoard import Board
from tqdm.auto import tqdm

# Introduction

The Lab's objective is to develop an agent that plays Tic-Tac-Toe using Reinforcement Learning (RL). Using simple strategies and techniques, I built an agent that, after approximately 6 minutes of training (500,000 random games plus 300,000 policy games), wins 73% of its games and avoids losing in about 84% of them. The results are quite satisfactory given the agent's straightforward design and the relatively short training time. Note that I deliberately restricted the agent to always move second, which gives the opponent an advantage and therefore establishes a lower bound on the agent's performance.

---

# Methodology

To ease evaluation and reduce the number of configurations to consider, I exploited the board's symmetries (reflections and rotations) when assessing states. Players make their moves on a board, which is then translated into a canonical form: a single reference configuration to which all equivalent configurations are mapped. Evaluations are conducted exclusively on the canonical state, which makes state evaluation more efficient. This approach increased the win rate by several percentage points, proving to be a valuable technique for smarter state exploration.

Moreover, state evaluation, initially performed during a training phase using only random games, was extended to games played by an agent following a policy. The first 500,000 training games were random; an additional 300,000 games used both a random agent and an agent employing a policy. This fine-tuning increased the win rate by a few more percentage points.

For move selection, the agent adopts a greedy strategy, choosing the move that leads to the next configuration with the best estimated value. Deeper policies could be explored and might prove more effective, but time constraints prevented me from implementing such alternatives.
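
The canonical-form idea can be illustrated with a small sketch. The actual `SimBoard` implementation is not shown in this notebook, so everything below is an assumption made for illustration: the board is a 9-element tuple in row-major order, and the names `ROTATE_90`, `MIRROR`, `symmetries` and `canonical` are hypothetical. The canonical form is simply taken to be the smallest of the 8 symmetric variants.

```python
# Hypothetical sketch: boards are 9-tuples in row-major order
# 0 1 2
# 3 4 5
# 6 7 8
ROTATE_90 = (6, 3, 0, 7, 4, 1, 8, 5, 2)  # new[i] = old[ROTATE_90[i]], clockwise turn
MIRROR = (2, 1, 0, 5, 4, 3, 8, 7, 6)     # left-right reflection


def apply(board, perm):
    """Return the board rearranged according to an index permutation."""
    return tuple(board[i] for i in perm)


def symmetries(board):
    """Yield the 8 equivalent boards: 4 rotations, each plain and mirrored."""
    current = tuple(board)
    for _ in range(4):
        yield current
        yield apply(current, MIRROR)
        current = apply(current, ROTATE_90)


def canonical(board):
    """Pick one representative (the smallest tuple) among the 8 variants."""
    return min(symmetries(board))


def single_x(cell):
    """Board with a single X in the given cell (helper for the demo)."""
    return tuple('X' if i == cell else '.' for i in range(9))


# All four corner openings collapse onto the same canonical board.
corner_forms = {canonical(single_x(c)) for c in (0, 2, 6, 8)}
print(len(corner_forms))
```

With this mapping, a value table only needs one entry per equivalence class, so each game played updates the estimate shared by every rotated or mirrored version of a position.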

*P.S. +1 at the exam for committing the Lab on the 24th December!*



## Starting code

In [4]:
State = namedtuple('State', ['x', 'o'])
MAGIC = [2, 7, 6, 9, 5, 1, 4, 3, 8]

def print_board(pos):
    """Nicely prints the board"""
    for r in range(3):
        for c in range(3):
            i = r * 3 + c
            if MAGIC[i] in pos.x:
                print('X', end='')
            elif MAGIC[i] in pos.o:
                print('O', end='')
            else:
                print('.', end='')
        print()
    print()

def print_square():
    for r in range(3):
        for c in range(3):
            i = r * 3 + c
            print(MAGIC[i], end='')
        print()
    print()

def win(elements):
    """Checks is elements is winning"""
    return any(sum(c) == 15 for c in combinations(elements, 3))

def state_value(pos: State):
    """Evaluate state: +1 first player wins"""
    if win(pos.x):
        return 1
    elif win(pos.o):
        return -1
    else:
        return 0

def random_game():
    # List of states
    trajectory = list()
    state = State(set(), set())
    available = set(range(1, 9+1))
    while available:
        # First player chooses a move
        x = choice(list(available))
        # State of first player updates
        state.x.add(x)
        # Append the state, not only the move
        trajectory.append(deepcopy(state))
        available.remove(x)
        if win(state.x) or not available:
            break

        o = choice(list(available))
        state.o.add(o)
        trajectory.append(deepcopy(state))
        available.remove(o)
        if win(state.o):
            break
    return trajectory
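
The `win` check above works because `MAGIC` lays the numbers 1 to 9 out as a 3×3 magic square: every row, column and diagonal sums to 15, and no other triple of distinct cells does. The short check below verifies this; the `LINES` list is introduced here only for the illustration and is not part of the original code.

```python
from itertools import combinations

MAGIC = [2, 7, 6, 9, 5, 1, 4, 3, 8]

# The 8 winning lines as index triples into the row-major board.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]

# Magic-square numbers lying on each line.
line_sets = {frozenset(MAGIC[i] for i in line) for line in LINES}

# Every triple of distinct numbers in 1..9 that sums to 15.
sum15_sets = {
    frozenset(c) for c in combinations(range(1, 10), 3) if sum(c) == 15
}

print(line_sets == sum15_sets)  # True
```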

In [None]:
value_dictionary = defaultdict(float)
hit_state = defaultdict(int)
epsilon = 0.001

for steps in tqdm(range(1_000_000)):
    trajectory = random_game()
    final_reward = state_value(trajectory[-1])
    for state in trajectory:
        hashable_state = (frozenset(state.x), frozenset(state.o))
        hit_state[hashable_state] += 1
        # Move the estimate a small step toward the game's final reward
        value_dictionary[hashable_state] += epsilon * (
            final_reward - value_dictionary[hashable_state]
        )

sorted(value_dictionary.items(), key=lambda e: e[1], reverse=True)[:10]
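
Written out, the update applied in the loop above to every state visited during a game is a constant-step-size running average. The symbols $V$, $s$, $R$ and $\epsilon$ are shorthand introduced here for `value_dictionary`, `hashable_state`, `final_reward` and `epsilon`:

$$V(s) \leftarrow V(s) + \epsilon \, \bigl(R - V(s)\bigr)$$

With $\epsilon = 0.001$, each game moves the estimate 0.1% of the way toward the observed outcome, so $V(s)$ drifts slowly toward an exponentially weighted average of the rewards of the games passing through $s$.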

## Personal implementation

In [148]:
# Training over random games.
b = Board()
for _ in tqdm(range(500_000)):
    b.random_game(epsilon=0.001)

100%|██████████| 500000/500000 [02:59<00:00, 2779.98it/s]


In [149]:
# Tuning over policy games.
for _ in tqdm(range(300_000)):
    b.policy_game(agent='greedy', epsilon=0.001)

100%|██████████| 300000/300000 [02:27<00:00, 2029.38it/s]
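
The greedy agent used by `policy_game` lives in `SimBoard` and is not shown in this notebook. As a rough sketch of the idea only, assuming a value table keyed like the `value_dictionary` of the starting code (values from X's point of view, cells numbered with the magic square), a second player could pick its move as follows. The function name `greedy_move_for_o` and the toy values are hypothetical.

```python
def greedy_move_for_o(x_moves, o_moves, values):
    """Return the free cell whose resulting state has the lowest value.

    Values are from X's point of view (+1 means X wins), so the second
    player (O) prefers the lowest one. Unseen states count as 0.0.
    """
    available = set(range(1, 10)) - set(x_moves) - set(o_moves)

    def value_after(move):
        key = (frozenset(x_moves), frozenset(set(o_moves) | {move}))
        return values.get(key, 0.0)

    # sorted() makes tie-breaking deterministic (lowest cell number first)
    return min(sorted(available), key=value_after)


# Toy example: X holds 2 and 7 and threatens 6 (2 + 7 + 6 = 15).
# The only state with a learned value is the one where O blocks on 6.
toy_values = {(frozenset({2, 7}), frozenset({5, 6})): -0.4}
print(greedy_move_for_o({2, 7}, {5}, toy_values))  # 6
```

This looks only one move ahead, which matches the greedy strategy described in the Methodology; it does not search deeper lines of play.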


In [164]:
# Testing performance.
winners = []
for _ in tqdm(range(10_000)):
    b.policy_game(agent='greedy', epsilon=0)
    winners.append(b.reward)

100%|██████████| 10000/10000 [00:05<00:00, 1963.28it/s]


In [163]:
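# The agent plays second (O), so a reward of -1 is a win for the agent and +1 a loss.
# Dividing the counts by 100 turns them into percentages of the 10,000 test games.
print(f'Wins   : {winners.count(-1)/100}%\nDraws  : {winners.count(0)/100}%\nLosses : {winners.count(1)/100}%')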

Wins   : 73.36%
Draws  : 11.12%
Losses : 15.52%
