Copyright **`(c)`** 2023 Giovanni Squillero `<giovanni.squillero@polito.it>`  
[`https://github.com/squillero/computational-intelligence`](https://github.com/squillero/computational-intelligence)  
Free for personal or classroom use; see [`LICENSE.md`](https://github.com/squillero/computational-intelligence/blob/master/LICENSE.md) for details.  

# LAB10

Use reinforcement learning to devise a tic-tac-toe player.

### Deadlines:

* Submission: [Dies Natalis Solis Invicti](https://en.wikipedia.org/wiki/Sol_Invictus)
* Reviews: [Befana](https://en.wikipedia.org/wiki/Befana)

Notes:

* Reviews will be assigned on Monday, December 4
* You need to commit in order to be selected as a reviewer (i.e., better to commit an empty work than not to commit)

In [60]:
from itertools import combinations
from collections import namedtuple, defaultdict
from random import choice
from copy import deepcopy
import functools

from tqdm.auto import tqdm
import numpy as np

### Magic Boards

In [61]:
# The eight symmetric variants (rotations and reflections) of the tic-tac-toe magic square
MAGIC = [2, 7, 6, 9, 5, 1, 4, 3, 8]
MAGIC_90 = [6, 1, 8, 7, 5, 3, 2, 9, 4]
MAGIC_180 = [8, 3, 4, 1, 5, 9, 6, 7, 2]
MAGIC_270 = [4, 9, 2, 3, 5, 7, 8, 1, 6]
MAGIC_MIRROR = [4, 3, 8, 9, 5, 1, 2, 7, 6]
MAGIC_MIRROR_90 = [2, 9, 4, 7, 5, 3, 6, 1, 8]
MAGIC_MIRROR_180 = [6, 7, 2, 1, 5, 9, 8, 3, 4]
MAGIC_MIRROR_270 = [8, 1, 6, 3, 5, 7, 4, 9, 2]

MAGIC_BOARDS = [MAGIC, MAGIC_90, MAGIC_180, MAGIC_270, MAGIC_MIRROR, MAGIC_MIRROR_90, MAGIC_MIRROR_180, MAGIC_MIRROR_270]
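
The eight boards above are the rotations and reflections of a single 3x3 magic square, in which every row, column, and diagonal sums to 15. As a sanity check, the following minimal sketch regenerates them with NumPy and verifies both properties. The helper names `symmetries` and `is_magic` are illustrative and not part of the lab code.

```python
import numpy as np

MAGIC = [2, 7, 6, 9, 5, 1, 4, 3, 8]

def symmetries(board):
    """Return the 8 rotations/reflections of a flat 3x3 board as tuples."""
    grid = np.array(board).reshape(3, 3)
    variants = []
    for base in (grid, np.flipud(grid)):   # original and mirrored grid
        for k in range(4):                 # rotate by 0, 90, 180, 270 degrees
            variants.append(tuple(int(v) for v in np.rot90(base, k).flatten()))
    return variants

def is_magic(board):
    """True if every row, column, and diagonal of the board sums to 15."""
    g = np.array(board).reshape(3, 3)
    lines = list(g.sum(axis=0)) + list(g.sum(axis=1))
    lines += [np.trace(g), np.trace(np.fliplr(g))]
    return all(int(s) == 15 for s in lines)

generated = symmetries(MAGIC)
print(len(set(generated)))                  # number of distinct variants
print(all(is_magic(b) for b in generated))  # every variant is still magic
```

Generating the variants programmatically is one way to guard against transcription errors in the hand-written lists.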

In [62]:
State = namedtuple('State', ['x', 'o'])

In [63]:
'''
The 'compare' function is used to sort all equivalent configurations of a state, so that one of them can be chosen as the common configuration.
It compares two states as follows:
    - First it compares the elements of self.x one by one with the elements of other.x; if two elements are equal, it moves on to the next pair
    - If all elements of 'x' are equal, it does the same with 'o'
'''

def compare(self, other):
    
    for elem_self, elem_other in zip(self.x, other.x):
        if elem_self < elem_other:
            return -1
        elif elem_self > elem_other:
            return 1

    for elem_self, elem_other in zip(self.o, other.o):
        if elem_self < elem_other:
            return -1
        elif elem_self > elem_other:
            return 1
        
    return 0
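
`compare` walks the two sets in iteration order, which Python does not guarantee to be sorted. As a hedged alternative, assuming the same lexicographic intent, the sketch below builds an explicit sort key from sorted tuples. `canonical_key` is a hypothetical helper, not part of the original notebook.

```python
from collections import namedtuple

State = namedtuple('State', ['x', 'o'])

def canonical_key(state):
    """Order states by their sorted x elements first, then by their sorted o elements."""
    return (tuple(sorted(state.x)), tuple(sorted(state.o)))

candidates = [State({9, 2}, {5}), State({2, 7}, {5}), State({2, 7}, {4})]
ordered = sorted(candidates, key=canonical_key)
first = ordered[0]
print(first)
```

A plain key function avoids `functools.cmp_to_key` and gives the same ordering regardless of how the sets were built.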

In [64]:
def find_indexes(state: State):
    # Extract the positional indexes of the state's elements in the 'MAGIC' board
    state_indexes = State(set(), set())

    for val in state.x:
        index = MAGIC.index(val)
        state_indexes.x.add(index)

    for val in state.o:
        index = MAGIC.index(val)
        state_indexes.o.add(index)

    return state_indexes


def get_representation_by_index(state_indexes: State, magic):
    # Given the state's indexes in the 'MAGIC' board, return the representation
    # of the state in another magic-board variant, passed as parameter
    result = State(set(), set())

    for i in state_indexes.x:
        value = magic[i]
        result.x.add(value)

    for i in state_indexes.o:
        value = magic[i]
        result.o.add(value)

    return result


def get_equivalent_representations(state: State):
    # Given a state, return a list with its equivalent representations in all the magic boards
    representations = []
    state_indexes = find_indexes(state)

    for magic in MAGIC_BOARDS:
        representations.append(get_representation_by_index(state_indexes, magic))

    return representations
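
To make the mapping concrete, the sketch below follows a single cell across the eight boards: a corner can only map to a corner, an edge to an edge, and the center to itself. The function name `orbit` is illustrative and not part of the lab code.

```python
MAGIC_BOARDS = [
    [2, 7, 6, 9, 5, 1, 4, 3, 8],
    [6, 1, 8, 7, 5, 3, 2, 9, 4],
    [8, 3, 4, 1, 5, 9, 6, 7, 2],
    [4, 9, 2, 3, 5, 7, 8, 1, 6],
    [4, 3, 8, 9, 5, 1, 2, 7, 6],
    [2, 9, 4, 7, 5, 3, 6, 1, 8],
    [6, 7, 2, 1, 5, 9, 8, 3, 4],
    [8, 1, 6, 3, 5, 7, 4, 9, 2],
]

def orbit(value):
    """Values found, across all boards, at the position `value` has in the first board."""
    index = MAGIC_BOARDS[0].index(value)
    return {board[index] for board in MAGIC_BOARDS}

print(sorted(orbit(2)))  # a corner cell
print(sorted(orbit(7)))  # an edge cell
print(sorted(orbit(5)))  # the center cell
```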

### State Class

In [65]:
class CustomState(namedtuple('State', ['x', 'o'])):
    # For the hash we generate all equivalent configurations of a state, sort them with the compare function,
    # and use the sorted result, so that equivalent states always produce the same value

    def __eq__(self, other_state):
        # To check whether two representations are equivalent, we extract the indexes in the 'MAGIC' board
        # and then look those indexes up in all magic boards to build the equivalent representations.
        # If any of them matches other_state, the two states are the same

        state_indexes = find_indexes(self)
        for magic in MAGIC_BOARDS:
            representation = get_representation_by_index(state_indexes, magic)
            if (sorted(representation.x) == sorted(other_state.x) and sorted(representation.o) == sorted(other_state.o)):
                return True
            
        return False

    def __hash__(self):
        # We generate all equivalent representations of the state,
        # sort them with the 'compare' function,
        # and then apply the hash function to the string representation of the sorted list

        rappresentations = get_equivalent_representations(self)
        sorted_rappresentations = sorted(rappresentations, key=functools.cmp_to_key(compare))

        return hash(str(sorted_rappresentations))
    
    def unique_representation(self):
        # Convert the state to its unique representation (the first of the sorted equivalent representations)
        rappresentations = get_equivalent_representations(self)
        sorted_rappresentations = sorted(rappresentations, key=functools.cmp_to_key(compare))
            
        return CustomState(sorted_rappresentations[0].x, sorted_rappresentations[0].o)

print(CustomState(set([1]),set([4])).__hash__())

print(CustomState(set([1]),set([2])).__hash__())



-3553939072360903781
-3553939072360903781
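
The two identical hashes above reflect that `({1}, {4})` and `({1}, {2})` are mirror images of each other. The self-contained sketch below illustrates the same idea under different assumptions: it uses immutable sorted tuples instead of sets, and a hypothetical `canonical` helper, to show how symmetric states collapse onto a single dictionary key.

```python
MAGIC_BOARDS = [
    [2, 7, 6, 9, 5, 1, 4, 3, 8],
    [6, 1, 8, 7, 5, 3, 2, 9, 4],
    [8, 3, 4, 1, 5, 9, 6, 7, 2],
    [4, 9, 2, 3, 5, 7, 8, 1, 6],
    [4, 3, 8, 9, 5, 1, 2, 7, 6],
    [2, 9, 4, 7, 5, 3, 6, 1, 8],
    [6, 7, 2, 1, 5, 9, 8, 3, 4],
    [8, 1, 6, 3, 5, 7, 4, 9, 2],
]

def canonical(x, o):
    """Return the smallest (x, o) pair of sorted tuples among the 8 symmetric variants."""
    base = MAGIC_BOARDS[0]
    x_idx = [base.index(v) for v in x]
    o_idx = [base.index(v) for v in o]
    variants = [
        (tuple(sorted(b[i] for i in x_idx)), tuple(sorted(b[i] for i in o_idx)))
        for b in MAGIC_BOARDS
    ]
    return min(variants)

table = {}
table[canonical({1}, {4})] = 'first entry'
table[canonical({1}, {2})] = 'same key, so this overwrites the first entry'
print(len(table))
```

Because tuples are hashable and compare lexicographically, this variant needs neither a custom `__hash__` nor a comparison function.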


### Q-Learning (to be modified)
The rewards are:
- 1 for a win
- 0.75 if we block an adversary's win
- 0.5 if we set a trap

and vice versa, with a negative sign, for the adversary.

We exploit the symmetries of the tic-tac-toe board: instead of saving every possible configuration, we save only one configuration per group of equivalent states. To choose the common configuration, we generate all equivalent configurations of a state, sort them using the compare function as key, and take the first one (see `unique_representation`). In this way the dictionary shrinks from about 5k entries (to be verified) to about 1k entries (to be verified).
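
For reference, the tabular update that `q_learning` applies after each move can be written as a standalone function. This is a minimal sketch: the name `q_update` and its signature are illustrative, and `best_next` stands for the maximum of the next state's values after an `x` move, or the minimum after an `o` move.

```python
def q_update(q_sa, reward, best_next, learning_rate, discount_factor):
    """Blend the old estimate with the new target (reward + discounted next value)."""
    target = reward + discount_factor * best_next
    return (1 - learning_rate) * q_sa + learning_rate * target

# A winning move (reward 1) from an unexplored state, with an empty next state.
updated = q_update(q_sa=0.0, reward=1.0, best_next=0.0,
                   learning_rate=0.1, discount_factor=0.7)
print(updated)
```

With a learning rate of 0.1, a single win moves the estimate only one tenth of the way toward the target, which is why many episodes are needed before the values stabilize.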

In [66]:
def random_move(available):
    x = choice(list(available))
    return x

In [67]:
def win(elements):
    # Checks whether elements contains a winning triple (three values that sum to 15)
    return any(sum(c) == 15 for c in combinations(elements, 3))

def block_win_adv(adv_elements, action):
    # Checks whether our action blocks an adversary's win
    for c in combinations(adv_elements, 2):
        if 15 - sum(c) == action:
            return True
    return False
 
def trap_condition(user_elements, adv_elements):
    # Checks whether the user has created a double-trap condition for the adversary in the given state
    cnt = 0
    for c in combinations(user_elements, 2):
        val = 15 - sum(c)
        if val not in adv_elements and val > 0:
            cnt += 1
            if cnt >= 2:
                return True         
    return False
 
def state_value(state: State):
    # Evaluate state: +1 if x wins, -1 if o wins, 0 otherwise
    if win(state.x):
        return 1
    elif win(state.o):
        return -1
    else:
        return 0
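
Because the cells are numbered as in the magic square, three marks lie on a line exactly when they sum to 15. The short sketch below shows how the checks behave on a few hand-picked sets. It re-declares `win` and `block_win_adv` (the latter in a condensed but equivalent form) so that it can run on its own.

```python
from itertools import combinations

def win(elements):
    # True if any three of the player's cells sum to 15
    return any(sum(c) == 15 for c in combinations(elements, 3))

def block_win_adv(adv_elements, action):
    # True if `action` is the cell that would complete one of the adversary's pairs
    return any(15 - sum(c) == action for c in combinations(adv_elements, 2))

print(win({2, 7, 6}))               # top row of MAGIC
print(win({2, 5, 8, 1}))            # contains the diagonal 2-5-8
print(win({2, 7, 9, 1}))            # no three cells sum to 15
print(block_win_adv({2, 7}, 6))     # 6 completes the 2-7 pair
print(block_win_adv({2, 7}, 5))     # 5 does not
```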
    

In [68]:
def q_learning(steps, learning_rate, discount_factor):
    value_dictionary = {}

    for _ in tqdm(range(steps)):
        current_state = CustomState(set(), set())
        cnt = 0

        while current_state.x.union(current_state.o) != set(range(1, 10)) and state_value(current_state) == 0:
            next_state = deepcopy(current_state)
            current_state = current_state.unique_representation()
            next_state = next_state.unique_representation()
            action = random_move(set(range(1, 10)) - (current_state.x.union(current_state.o)))
            player = cnt % 2
            cnt += 1

            if player == 1:
                next_state.x.add(action)
                next_state = next_state.unique_representation()
                reward = state_value(next_state)

                if(reward == 0):
                    if block_win_adv(next_state.o, action):
                        reward = 0.75
                    elif trap_condition(next_state.x, next_state.o):
                        reward = 0.5

                if current_state not in value_dictionary:
                    value_dictionary[current_state] = {action: 0.}
                elif action not in value_dictionary[current_state]:
                    value_dictionary[current_state][action] = 0.

                if next_state not in value_dictionary:
                    value_dictionary[next_state] = {action: 0.}
                elif action not in value_dictionary[next_state]:
                    value_dictionary[next_state][action] = 0.

                value_dictionary[current_state][action] = ((1 - learning_rate) * value_dictionary[current_state][action] + 
                    learning_rate * (reward + discount_factor * max(value_dictionary[next_state].values())))
                current_state = deepcopy(next_state)

            else:  
                next_state.o.add(action)
                next_state = next_state.unique_representation()
                reward = state_value(next_state)

                if(reward == 0):
                    if block_win_adv(next_state.x, action):
                        reward = -0.75
                    elif trap_condition(next_state.o, next_state.x):
                        reward = -0.5

                if current_state not in value_dictionary:
                    value_dictionary[current_state] = {action: 0.}
                elif action not in value_dictionary[current_state]:
                    value_dictionary[current_state][action] = 0.

                if next_state not in value_dictionary:
                    value_dictionary[next_state] = {action: 0.}
                elif action not in value_dictionary[next_state]:
                    value_dictionary[next_state][action] = 0.

                value_dictionary[current_state][action] = ((1 - learning_rate) * value_dictionary[current_state][action] + 
                    learning_rate * (reward + discount_factor * min(value_dictionary[next_state].values())))
                current_state = deepcopy(next_state)

    return value_dictionary       

In [69]:
def stampa_dizionario(dizionario, livello=0):
    # Recursively print a (possibly nested) dictionary, indenting each nesting level
    spazi = "  " * livello
    for chiave, valore in dizionario.items():
        if isinstance(valore, dict):
            print(f"{spazi}{chiave}:")
            stampa_dizionario(valore, livello + 1)
        else:
            print(f"{spazi}{chiave}: {valore}")

### Tests

In [70]:
def play_against_random(value_dictionary, invert=False):
    current_state = CustomState(set(),set())

    while len(current_state.x.union(current_state.o)) < 9 and state_value(current_state) == 0:
        if invert:
            ## Player 1 (random agent)
            action = random_move(set(range(1, 10)) - (current_state.x.union(current_state.o)))
            current_state.o.add(action)
            current_state = current_state.unique_representation()

            if len(current_state.x.union(current_state.o)) == 9 or state_value(current_state) == -1:
                break

            ## Player 2 (RL agent)
            list_action = sorted(value_dictionary[current_state], key=value_dictionary[current_state].get)
            
            for action in list_action:
                if action not in (current_state.x.union(current_state.o)):
                    current_state.x.add(action)
                    break
        else:
            ## Player 1 (RL agent)
            list_action = sorted(value_dictionary[current_state], key=value_dictionary[current_state].get)
            
            for action in list_action:
                if action not in (current_state.x.union(current_state.o)):
                    current_state.x.add(action)
                    break
            
            if len(current_state.x.union(current_state.o)) == 9 or state_value(current_state) == 1:
                break
            
            ## Player 2 (random agent)
            action = random_move(set(range(1, 10)) - (current_state.x.union(current_state.o)))
            current_state.o.add(action)

        current_state = current_state.unique_representation()
    
    return state_value(current_state)
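
Note that `sorted(value_dictionary[current_state], key=value_dictionary[current_state].get)` lists actions from the lowest to the highest value, so the loop tries the lowest-valued free action first. The sketch below shows an explicit greedy selector under the assumption that higher values favor `x` and lower values favor `o`, as the reward signs suggest. `greedy_action` is a hypothetical helper, not part of the original notebook.

```python
def greedy_action(action_values, occupied, maximize=True):
    """Pick the free action with the highest (or lowest) stored value, or None if there is none."""
    candidates = {a: v for a, v in action_values.items() if a not in occupied}
    if not candidates:
        return None
    pick = max if maximize else min
    return pick(candidates, key=candidates.get)

values = {1: 0.2, 5: 0.9, 7: -0.3}
best_for_x = greedy_action(values, occupied={5})
best_for_o = greedy_action(values, occupied={5}, maximize=False)
print(best_for_x, best_for_o)
```

Returning `None` when no stored action is free makes the unexplored-state case explicit, so the caller can decide on a fallback such as a random move.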

### Results

In [71]:
steps = 5000
lr_params = [0.1, 0.3]
df_params = [0.5, 0.7]

print('Results:\n')
for lr in lr_params:
    for df in df_params:
        value_dictionary = q_learning(steps, lr, df)
        
        n_win = 0
        n_draw = 0
        n_lose = 0

        for _ in range(100):
            result = play_against_random(value_dictionary)
            if result == 1:
                n_win += 1
            elif result == -1:
                n_lose += 1
            else:
                n_draw += 1
        
        for _ in range(100):
            result = play_against_random(value_dictionary, invert=True)
            if result == 1:
                n_win += 1
            elif result == -1:
                n_lose += 1
            else:
                n_draw += 1
        
        win_rate = n_win / (n_win + n_draw + n_lose)
        print('\t- learning rate: {0:.2f}     discount factor: {1:.2f}     win rate: {2:.2%}'.format(lr, df, win_rate))
        print('n_lose: {0}    n_draw: {1}, n_win: {2}'.format(n_lose, n_draw, n_win))




Results:



	- learning rate: 0.10     discount factor: 0.50     win rate: 48.50%
n_lose: 96    n_draw: 7, n_win: 97


	- learning rate: 0.10     discount factor: 0.70     win rate: 51.00%
n_lose: 86    n_draw: 12, n_win: 102


	- learning rate: 0.30     discount factor: 0.50     win rate: 38.50%
n_lose: 91    n_draw: 32, n_win: 77


	- learning rate: 0.30     discount factor: 0.70     win rate: 48.50%
n_lose: 92    n_draw: 11, n_win: 97
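
The win/draw/loss bookkeeping repeated in these cells could be folded into a small helper. The following is a minimal sketch, where `tally` is a hypothetical name and the outcome encoding (+1, 0, -1) follows `state_value`. The sample input reproduces the counts of the first run reported above (97 wins, 7 draws, 96 losses).

```python
from collections import Counter

def tally(results):
    """Summarize a list of game outcomes (+1 win, 0 draw, -1 loss)."""
    counts = Counter(results)
    total = sum(counts.values())
    return {
        'n_win': counts[1],
        'n_draw': counts[0],
        'n_lose': counts[-1],
        'win_rate': counts[1] / total if total else 0.0,
    }

summary = tally([1] * 97 + [0] * 7 + [-1] * 96)
print('win rate: {0:.2%}'.format(summary['win_rate']))
```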


In [72]:
value_dictionary = q_learning(10000, 0.1, 0.7)
for i in range(2):
    n_win = 0
    n_draw = 0
    n_lose = 0
    print('Test {0}:'.format(i))
    for _ in range(100):
        if i == 0:
            result = play_against_random(value_dictionary)
        else:
            result = play_against_random(value_dictionary, invert=True)

        if result == 1:
            n_win += 1
        elif result == -1:
            n_lose += 1
        else:
            n_draw += 1
    print('Win: {0}\tLose: {1}\tDraw: {2}'.format(n_win, n_lose, n_draw))

Test 0:
Win: 63	Lose: 10	Draw: 27
Test 1:
Win: 6	Lose: 90	Draw: 4
