## The Tic-Tac-Toe environment

The [Tic-Tac-Toe environment](https://github.com/MauroLuzzatto/OpenAI-Gym-TicTacToe-Environment) is a simple game environment for training reinforcement learning agents.

In [5]:
from IPython.display import Image

Image(
    url="https://img.poki.com/cdn-cgi/image/quality=78,width=600,height=600,fit=cover,f=auto/85535e05d1f130b16751c8308cfbb19b.png",
    width=300,
)

In [6]:
# load the python modules
import time
import sys
import warnings

import gym
import numpy as np
from tqdm import tqdm
import gym_TicTacToe

from src.qagent import Qagent
from src.player import Player
from src.play_tictactoe import play_tictactoe

from src.utils import (
    create_state_dictionary,
    reshape_state,
    save_qtable,
)

# ignore warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")

In [7]:
# initialize the tictactoe environment
env = gym.envs.make("TTT-v0", small=-1, large=10)

In [8]:
# sample 10 random actions from the action space
[env.action_space.sample() for _ in range(10)]

[0, 5, 0, 5, 5, 6, 3, 2, 4, 1]

In [9]:
env.reset()
print(env.render())

╒═══╤═══╤═══╕
│ - │ - │ - │
├───┼───┼───┤
│ - │ - │ - │
├───┼───┼───┤
│ - │ - │ - │
╘═══╧═══╧═══╛


In [10]:
color = 1
action = 4

new_state, reward, done, _ = env.step((action, color))
print(new_state, reward, done)
print(env.render())

[[0 0 0]
 [0 1 0]
 [0 0 0]] -1 False
╒═══╤═══╤═══╕
│ - │ - │ - │
├───┼───┼───┤
│ - │ X │ - │
├───┼───┼───┤
│ - │ - │ - │
╘═══╧═══╧═══╛


In [11]:
state_dict = create_state_dictionary()
state_size = env.observation_space.n
action_size = env.action_space.n

Number of legal states: 8953
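
`create_state_dictionary` and `reshape_state` come from the repository's `src.utils` and are not shown in this notebook. To illustrate the general idea of turning a board into a single integer that can index a row of the Q-table, here is a minimal sketch that reads the nine cells as base-3 digits. The name `encode_board` is hypothetical, and this is not necessarily how the repository builds its dictionary.

```python
import numpy as np


def encode_board(board: np.ndarray) -> int:
    """Hypothetical helper: read the 9 cells (values 0, 1, 2) as base-3 digits."""
    cells = np.asarray(board, dtype=int).flatten()
    return sum(int(value) * 3**position for position, value in enumerate(cells))


empty = np.zeros((3, 3), dtype=int)
center_x = empty.copy()
center_x[1, 1] = 1  # same board as after the step with action = 4, color = 1

print(encode_board(empty), encode_board(center_x))
```

A plain base-3 code spans 3**9 = 19683 indices, which is more than the 8953 legal states reported above, so a lookup dictionary over legal boards keeps the Q-table smaller.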


In [12]:
# set training parameters
episodes = 90_000  # or 2 * 10**6 for a longer run
max_steps = 9

# flags for loading, saving, and testing the q-table
load = False
save = True
test = True

num_test_games = 1

In [13]:
learning_parameters = {"learning_rate": 1.0, "gamma": 0.9}
exploration_parameters = {
    "max_epsilon": 1.0,
    "min_epsilon": 0.0,
    "decay_rate": 0.000005,
}

name = f"qtable_{episodes}"
folder = "tables"
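
The three exploration parameters are consumed by `Qagent.update_epsilon`, whose code is not shown here. A common choice for such parameters is an exponential decay from `max_epsilon` towards `min_epsilon`; the sketch below assumes that form, and the function name `decayed_epsilon` is made up for illustration.

```python
import math


def decayed_epsilon(
    episode: int, max_epsilon: float, min_epsilon: float, decay_rate: float
) -> float:
    """Exponential decay from max_epsilon towards min_epsilon (assumed form)."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


# evaluate the assumed schedule with the parameters from the cell above
for episode in (0, 10_000, 90_000):
    print(episode, round(decayed_epsilon(episode, 1.0, 0.0, 0.000005), 2))
```

If the agent follows a schedule like this one, a small `decay_rate` keeps it exploring for most of a 90,000-episode run, so the number of episodes and the decay rate are worth tuning together.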

In [None]:
qagent = Qagent(state_size, action_size, learning_parameters, exploration_parameters)
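
The agent later picks moves through `qagent.get_action(state, action_space)`. The implementation lives in `src.qagent` and is not reproduced here; the following is only a sketch of epsilon-greedy selection restricted to the free cells, which is what the call signature suggests. `epsilon_greedy` and its `rng` argument are hypothetical names.

```python
import numpy as np


def epsilon_greedy(qtable, state, action_space, epsilon, rng):
    """Hypothetical stand-in for Qagent.get_action, restricted to free cells."""
    if rng.random() < epsilon:
        # explore: pick any of the still available actions
        return int(rng.choice(action_space))
    # exploit: best Q-value among the available actions only
    q_values = qtable[state, action_space]
    return int(action_space[np.argmax(q_values)])


rng = np.random.default_rng(0)
qtable = np.zeros((1, 9))
qtable[0, 7] = 1.0  # make cell 7 the most valuable move in state 0
free_cells = np.array([2, 5, 7])

print(epsilon_greedy(qtable, 0, free_cells, epsilon=0.0, rng=rng))
```

Indexing the Q-table with the remaining actions, rather than taking the argmax over all nine columns, avoids selecting a cell that is already occupied.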

In [14]:

def play(qagent: Qagent, player: Player, state: int, action_space: np.ndarray) -> tuple:
    """Play one move of the game for one player and update the Q-table.

    Args:
        qagent (Qagent): Q-learning agent that chooses the action
        player (Player): player whose turn it is
        state (int): index of the current state
        action_space (np.ndarray): array with all available actions

    Returns:
        tuple: state, action_space, done
    """
    action = qagent.get_action(state, action_space)

    # remove action from the action space
    action_space = action_space[action_space != action]

    new_state, reward, done, _ = env.step((action, player.color))
    new_state = state_dict[reshape_state(new_state)]

    qagent.qtable[state, action] = qagent.update_qtable(
        state, new_state, action, reward, done
    )
    # new state
    state = new_state
    player.add_reward(reward)
    return state, action_space, done
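
`play` delegates the learning step to `qagent.update_qtable(state, new_state, action, reward, done)`. That method is not shown in this notebook, so the block below is a sketch of the standard tabular Q-learning update using the same argument names, not a copy of the repository's code. `q_update` is a hypothetical name.

```python
import numpy as np


def q_update(qtable, state, new_state, action, reward, done, learning_rate, gamma):
    """Standard tabular Q-learning rule; returns the new value for Q[state, action]."""
    # no bootstrapping from the next state once the game is over
    future = 0.0 if done else gamma * np.max(qtable[new_state])
    target = reward + future
    return qtable[state, action] + learning_rate * (target - qtable[state, action])


qtable = np.zeros((2, 9))
qtable[1, 3] = 5.0  # best value reachable from the next state

print(q_update(qtable, 0, 1, 4, reward=-1, done=False, learning_rate=1.0, gamma=0.9))
```

With `learning_rate` set to 1.0, as in the training parameters above, this rule replaces the old value entirely with the new target.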

In [15]:

start_time = time.time()

player_1 = Player(color=1, episodes=episodes)
player_2 = Player(color=2, episodes=episodes)


for episode in tqdm(range(episodes)):
    state = env.reset()
    state = state_dict[reshape_state(state)]

    action_space = np.arange(9)

    player_1.reset_reward()
    player_2.reset_reward()

    # randomly choose which player starts the game (0 or 1)
    start = np.random.randint(2)

    for _step in range(start, max_steps + start):

        # alternate the moves of the players
        if _step % 2 == 0:
            state, action_space, done = play(qagent, player_1, state, action_space)
        else:
            state, action_space, done = play(qagent, player_2, state, action_space)

        if done:
            break

    # reduce epsilon for the exploration-exploitation tradeoff
    qagent.update_epsilon(episode)
    player_1.save_reward(episode)
    player_2.save_reward(episode)

    if episode % 10_000 == 0:

        sum_q_table = np.sum(qagent.qtable)
        time_passed = round((time.time() - start_time) / 60.0, 2)

        print(
            f"episode: {episode}, \
            epsilon: {round(qagent.epsilon, 2)}, \
            sum q-table: {sum_q_table}, \
            elapsed time [min]: {time_passed},  \
            done [%]: {episode / episodes * 100} \
            "
        )


  0%|          | 51/2000000 [00:00<2:12:20, 251.88it/s]

episode: 0,             epsilon: 1.0,             sum q-table: 2.0,             elapsed time [min]: 0.0,              done [%]: 0.0             


  1%|          | 10035/2000000 [01:12<1:36:42, 342.96it/s]

episode: 10000,             epsilon: 0.99,             sum q-table: 95595.12738000002,             elapsed time [min]: 1.21,              done [%]: 0.5             


  1%|          | 20054/2000000 [01:47<1:21:04, 407.05it/s]

episode: 20000,             epsilon: 0.98,             sum q-table: 134310.21514000001,             elapsed time [min]: 1.78,              done [%]: 1.0             


  2%|▏         | 30028/2000000 [02:38<2:50:22, 192.71it/s]

episode: 30000,             epsilon: 0.97,             sum q-table: 145288.93072,             elapsed time [min]: 2.64,              done [%]: 1.5             


  2%|▏         | 40025/2000000 [03:39<2:20:50, 231.94it/s]

episode: 40000,             epsilon: 0.96,             sum q-table: 149170.6773,             elapsed time [min]: 3.66,              done [%]: 2.0             


  3%|▎         | 50039/2000000 [04:25<2:23:13, 226.91it/s]

episode: 50000,             epsilon: 0.95,             sum q-table: 150339.05469999998,             elapsed time [min]: 4.42,              done [%]: 2.5             


  3%|▎         | 60004/2000000 [05:18<1:54:03, 283.47it/s]

episode: 60000,             epsilon: 0.94,             sum q-table: 151100.2896,             elapsed time [min]: 5.31,              done [%]: 3.0             


  4%|▎         | 70001/2000000 [06:41<9:41:31, 55.31it/s] 

episode: 70000,             epsilon: 0.93,             sum q-table: 151358.4228,             elapsed time [min]: 6.69,              done [%]: 3.5000000000000004             


  4%|▍         | 80014/2000000 [09:12<9:14:22, 57.72it/s] 

episode: 80000,             epsilon: 0.92,             sum q-table: 151251.0959,             elapsed time [min]: 9.21,              done [%]: 4.0             


  5%|▍         | 90031/2000000 [10:31<2:03:26, 257.86it/s]

episode: 90000,             epsilon: 0.91,             sum q-table: 151663.42260000002,             elapsed time [min]: 10.52,              done [%]: 4.5             


  5%|▌         | 100042/2000000 [11:27<1:56:42, 271.34it/s]

episode: 100000,             epsilon: 0.9,             sum q-table: 152165.9625,             elapsed time [min]: 11.45,              done [%]: 5.0             


  6%|▌         | 110006/2000000 [13:31<9:22:31, 56.00it/s] 

episode: 110000,             epsilon: 0.9,             sum q-table: 152031.2191,             elapsed time [min]: 13.52,              done [%]: 5.5             


  6%|▌         | 119538/2000000 [15:23<4:02:09, 129.42it/s] 


KeyboardInterrupt: 

In [None]:
qtable = qagent.get_qtable()
save_qtable(qtable, folder, name)
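
To reuse a trained table in a later session, it has to be read back from disk. The sketch below assumes that `save_qtable` writes a regular NumPy `.npy` file named `{name}.npy` inside `folder`; the printed message suggests the file name, but the exact location is an assumption. To stay runnable on its own, the example writes a stand-in table to a temporary directory first.

```python
import os
import tempfile

import numpy as np

name = "qtable_90000"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, f"{name}.npy")
    np.save(path, np.zeros((4, 9)))  # stand-in for the trained Q-table
    loaded_qtable = np.load(path)  # read the table back into memory

print(loaded_qtable.shape)
```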

qtable_90000.npy saved!


In [None]:
# test the algorithm with playing against it
play_tictactoe(env, qtable, max_steps, state_dict)

Human beginns
--------------------
╒═══╤═══╤═══╕
│ - │ - │ - │
├───┼───┼───┤
│ - │ - │ - │
├───┼───┼───┤
│ - │ - │ - │
╘═══╧═══╧═══╛
--------------------
Move Human
Action: 5
-1
--------------------
move Agent
Action: 0
╒═══╤═══╤═══╕
│ O │ - │ - │
├───┼───┼───┤
│ - │ - │ X │
├───┼───┼───┤
│ - │ - │ - │
╘═══╧═══╧═══╛
--------------------
Move Human
Action: 4
-1
--------------------
move Agent
Action: 1
╒═══╤═══╤═══╕
│ O │ O │ - │
├───┼───┼───┤
│ - │ X │ X │
├───┼───┼───┤
│ - │ - │ - │
╘═══╧═══╧═══╛
--------------------
Move Human
Action: 3
9
********************
Human won!
********************
╒═══╤═══╤═══╕
│ O │ O │ - │
├───┼───┼───┤
│ X │ X │ X │
├───┼───┼───┤
│ - │ - │ - │
╘═══╧═══╧═══╛



Agent beginns
--------------------
--------------------
move Agent
Action: 0
╒═══╤═══╤═══╕
│ O │ - │ - │
├───┼───┼───┤
│ - │ - │ - │
├───┼───┼───┤
│ - │ - │ - │
╘═══╧═══╧═══╛
--------------------
Move Human
Action: 4
-1
--------------------
move Agent
Action: 1
╒═══╤═══╤═══╕
│ O │ O │ - │
├───┼───┼

ValueError: invalid literal for int() with base 10: ''
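
The `ValueError` above is what `int()` raises for an empty string, which presumably happens when the move prompt is submitted without a number. One way to make the prompt more forgiving is to validate the text before converting it. `parse_move` below is a hypothetical helper, not part of `play_tictactoe`.

```python
from typing import Optional, Sequence


def parse_move(raw: str, available: Sequence[int]) -> Optional[int]:
    """Hypothetical helper: return a valid move or None instead of raising."""
    text = raw.strip()
    if not text.isdecimal():
        return None  # covers the empty string and non-numeric input
    move = int(text)
    return move if move in available else None


print(parse_move("", [0, 1, 2]), parse_move("2", [0, 1, 2]), parse_move("7", [0, 1, 2]))
```

A caller could keep asking for input while the helper returns `None`, so a stray Enter key no longer ends the game.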