# LAB10

Use reinforcement learning to devise a tic-tac-toe player.

### Deadlines:

* Submission: [Dies Natalis Solis Invicti](https://en.wikipedia.org/wiki/Sol_Invictus)
* Reviews: [Befana](https://en.wikipedia.org/wiki/Befana)

Notes:

* You need to commit in order to be selected as a reviewer (ie. better to commit an empty work than not to commit)


## Work
This code was designed, programmed and tested by
* Lorenzo Bonannella 
* Giacomo Fantino
* Farisan Fekri
* Giacomo Cauda

### Comments on our work
In this laboratory we designed a RL approach to play tic-tac-toe. In particular, our work was focused on:
1. Creating an action-value function
2. Designing a hybrid Montecarlo approach: when a certain threshold `MAX_NUM_ACTIONS=3` is reached, a full-game simultation is applied

In [1]:
from itertools import combinations
from collections import namedtuple

#if the player has selected some squares such that with 3 squares the sum is 15 he won
#we have to define the squares in tic tac toe such that this function make sense
def win(squares):
    return any(sum(c) == 15 for c in combinations(squares, 3))

Position = namedtuple('Position', ['x', 'o'])

#this state value can become the evaluation function
def state_value(position : Position):
    if win(position.x):
        return 1
    elif win(position.o):
        return -1
    else:
        return 0

MAGIC = [2, 7, 6, 9, 5, 1, 4, 3, 8] #magic board that is used for check
def print_board(pos):
    for x in range(3):
        for y in range(3):
            idx = 3*x + y
            if MAGIC[idx] in pos.x:
                print(' X |', end='')
            elif MAGIC[idx] in pos.o:
                print(' O |', end='')
            else:
                print('   |', end='')
        print()
    print()

state = Position(x={1, 2, 5, 7}, o={3, 6, 9})
print_board(state)
state_value(state)

 X | X | O |
 O | X | X |
   | O |   |



0

## Agent1: Action-Value Function

The first step was to change the agent so that the computed dictionary is now an action value function.

In [2]:
#game with two random players
from copy import deepcopy
from random import choice


def random_game():
    state = Position(set(), set())
    available = set(range(1, 10))
    trajectory = []

    while True:
        #first player
        x = choice(list(available))
        trajectory.append((deepcopy(state), x)) #current state + chosen action
        state.x.add(x)
        available.remove(x)
        
        if win(state.x):
            break
        elif len(available) == 0:
            break
        
        #opponent player (doesn't consider the state)
        y = choice(list(available))
        state.o.add(y)
        available.remove(y)

        if win(state.o):
            break
        elif len(available) == 0:
            break
    
    return trajectory, state #sequence of move for computing the state value function

from collections import defaultdict

Q_func = defaultdict(float)
epsilon = 0.01

for step in range(4_000):
    trajectory, last_state = random_game()
    final_reward = state_value(last_state)
    
    for (state, action) in trajectory:
        #hash_state = (frozenset(state.x), frozenset(state.o))
        hash_state = (frozenset(state.x), frozenset(state.o), action)
        difference = (final_reward - Q_func[hash_state])
        Q_func[hash_state] = Q_func[hash_state] + epsilon*difference

sorted(Q_func.items(), key=lambda e : e[1], reverse = True)[:30]


[((frozenset(), frozenset(), 5), 0.47715282503048256),
 ((frozenset(), frozenset(), 2), 0.38014217911973874),
 ((frozenset(), frozenset(), 4), 0.2614545388480996),
 ((frozenset(), frozenset(), 8), 0.25332569107542413),
 ((frozenset(), frozenset(), 6), 0.2520724547838263),
 ((frozenset(), frozenset(), 3), 0.1977963254096396),
 ((frozenset(), frozenset(), 7), 0.15220787746652853),
 ((frozenset(), frozenset(), 9), 0.13993995679074497),
 ((frozenset({4}), frozenset({3}), 2), 0.12874122890512435),
 ((frozenset({1}), frozenset({8}), 5), 0.1215511972310217),
 ((frozenset(), frozenset(), 1), 0.12061554957746343),
 ((frozenset({1, 2, 4, 5}), frozenset({3, 6, 7, 8}), 9), 0.11361512828387071),
 ((frozenset({1, 2, 3, 5}), frozenset({4, 6, 7, 8}), 9), 0.11361512828387071),
 ((frozenset({2}), frozenset({3}), 4), 0.11325967004354746),
 ((frozenset({1, 2, 4, 7}), frozenset({3, 5, 8, 9}), 6), 0.10466174574128355),
 ((frozenset({5}), frozenset({3}), 2), 0.10466174574128355),
 ((frozenset({1, 3, 7, 9}), 

In [3]:
for i in range(1, 10):
    hash_state = (frozenset({}), frozenset({}), i)
    reward = Q_func[hash_state]
    print(f"action {i} reward {reward}")

action 1 reward 0.12061554957746343
action 2 reward 0.38014217911973874
action 3 reward 0.1977963254096396
action 4 reward 0.2614545388480996
action 5 reward 0.47715282503048256
action 6 reward 0.2520724547838263
action 7 reward 0.15220787746652853
action 8 reward 0.25332569107542413
action 9 reward 0.13993995679074497


In [None]:
def agent(Q_dict, state, available):
    current_reward = -1000
    action = -1
    for i in available:
        hash_state = (frozenset(state.x), frozenset(state.o), i)
        reward = Q_dict[hash_state]
        if reward > current_reward:
            current_reward = reward
            action = i
    return action

wins = 0
draws = 0
looses = 0

for _ in range(100_000):
    state = Position(set(), set())
    available = set(range(1, 10))

    while True:
        #first player
        x = agent(Q_func, state, available)
        state.x.add(x)
        available.remove(x)
        
        if win(state.x):
            wins += 1
            break
        elif len(available) == 0:
            draws += 1
            break
        
        #opponent player (doesn't consider the state)
        y = choice(list(available))

        state.o.add(y)
        available.remove(y)

        if win(state.o):
            looses += 1
            break
        elif len(available) == 0:
            draws += 1
            break
print(f'Wins {wins/(wins + looses + draws)}')
print(f'Draws {draws/(wins + looses + draws)}')
print(f'Looses {looses/(wins + looses + draws)}')

## Agent2: Hybrid Montecarlo simulation with full game simulation

As second agent we have created an hybrid of MonteCarlo and Full game: we start by exploring a certain path but after a certain level of depth we switch to exploitation and explore the entire subtree. The reward is the average of all possible terminal states and that's the value that will be propagated.
For nodes below the level we use a pessimistic strategy, assigning the worst reward to the node.

In [6]:
from collections import defaultdict

Q_func = defaultdict(float)
epsilon = 0.01

#game with two random players using a hybr
def full_game(state, available):
    turn_player = 'x' if len(state.x) == len(state.o) else 'o'

    if win(state.x):
        return [1]
    elif win(state.o):
        return [-1]
    elif len(available) == 0:
        return [0]

    general_list = []
    for act in available:
        new_available = deepcopy(available)
        new_state = deepcopy(state)
        if turn_player == 'x':
            new_state.x.add(act)
        else:
            new_state.o.add(act)
        new_available.remove(act)
        list_act = full_game(new_state, new_available)

        if turn_player == 'x':
            #we use min to consider the worst case so the agent will choose more safe actions
            hash_state = (frozenset(state.x), frozenset(state.o), action)
            Q_func[hash_state] = min(list_act)
        general_list.extend(list_act)

    return general_list

MAX_NUM_ACTIONS = 3
def random_game():
    state = Position(set(), set())
    available = set(range(1, 10))
    trajectory = []
    num_actions = 0

    while num_actions < MAX_NUM_ACTIONS:
        #first player
        x = choice(list(available))
        trajectory.append((deepcopy(state), x)) #current state + chosen action
        state.x.add(x)
        available.remove(x)
        
        if win(state.x):
            break
        elif len(available) == 0:
            break
        
        #opponent player (doesn't consider the state)
        y = choice(list(available))
        state.o.add(y)
        available.remove(y)

        if win(state.o):
            break
        elif len(available) == 0:
            break

        num_actions += 1
    
    if win(state.x):
        reward = 1
    elif win(state.o):
        reward = -1
    elif len(available) == 0:
        reward = 0
    else:
        #game not finished: exploitation of the subtree
        reward_list = full_game(state, available) #list of all rewards, one for each terminal state
        reward = sum(reward_list)/len(reward_list) #let's compute the average result
    return trajectory, reward

for step in range(100_000):
    trajectory, final_reward = random_game()
    
    for (state, action) in trajectory:
        hash_state = (frozenset(state.x), frozenset(state.o), action)
        difference = (final_reward - Q_func[hash_state])
        Q_func[hash_state] = Q_func[hash_state] + epsilon*difference


In [None]:
def agent(Q_dict, state, available):
    current_reward = -1000
    action = -1
    for i in available:
        hash_state = (frozenset(state.x), frozenset(state.o), i)
        reward = Q_dict[hash_state]
        if reward > current_reward:
            current_reward = reward
            action = i
    return action

wins = 0
draws = 0
looses = 0

for _ in range(100_000):
    state = Position(set(), set())
    available = set(range(1, 10))

    while True:
        #first player
        x = agent(Q_func, state, available)
        state.x.add(x)
        available.remove(x)
        
        if win(state.x):
            wins += 1
            break
        elif len(available) == 0:
            draws += 1
            break
        
        #opponent player (doesn't consider the state)
        y = choice(list(available))

        state.o.add(y)
        available.remove(y)

        if win(state.o):
            looses += 1
            break
        elif len(available) == 0:
            draws += 1
            break
print(f'Wins {wins/(wins + looses + draws)}')
print(f'Draws {draws/(wins + looses + draws)}')
print(f'Looses {looses/(wins + looses + draws)}')

## Experiments and Results

We have tried for each agent to change the number the iterations and then tried to play 100_000 games with a random player. Here the results (for each agent we present the number of wins and draws in percentage):

| Iterations | Agent1      | Agent2      |
|------------|-------------|-------------|
| 3_000      | (81.5, 7.5) | (82, 6)     |
| 4_000      | (83, 8.5)   | (85, 7)     |
| 5_000      | (86, 6)     | (88, 8.5)   |
| 10_000     | (90, 5.4)   | (90, 7)     |
| 20_000     | (95.3, 3.3) | (87, 8)     |
| 50_000     | (97, 2.5)   | (90, 8)     |
| 100_000    | (98.7, 1.3) | (93.4, 5.5) |
| 200_000    | (99, 1)     | (95, 4)     |
| 500_000    | (99, 1)     | (95.6, 4)   |

We have the following conclusions:
* Thanks to a better exploration of the tree the second agent can achive good results with a lower number of generations.
* The first agent can converge quickly to almost perfect performance (with 100_000 iterations) while the second agent is struggling. This may be due to how we treated nodes at lower level, or maybe the epsilon value is too low.