In [119]:
]st

[32m[1m    Status[22m[39m `/mnt/E4E0A9C0E0A998F6/github/ReinforcementLearningAnIntroduction.jl/notebooks/Project.toml`
 [90m [02c1da58][39m[37m RLIntro v0.2.0 [`..`][39m
 [90m [158674fc][39m[37m ReinforcementLearning v0.4.0 [`../../ReinforcementLearning.jl`][39m
 [90m [25e41dd2][39m[37m ReinforcementLearningEnvironments v0.1.1[39m


In [120]:
using ReinforcementLearning, ReinforcementLearningEnvironments, RLIntro
using RLIntro.TicTacToe

env = TicTacToeEnv()

___
___
___
isdone = [false], winner = [nothing]


In [121]:
nstates, nactions = length(observation_space(env)), length(action_space(env))

(5478, 10)

If you are curious why there are `5478` states, see the discussion [here](https://math.stackexchange.com/questions/485752/tictactoe-state-space-choose-calculation/485852).
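The count can also be checked by brute force. The following is a minimal sketch in plain Julia that does not use `RLIntro.TicTacToe`; the names `WIN_LINES`, `has_won`, `is_terminal` and `count_reachable_states` are made up for this illustration. It walks every board reachable by legal play from the empty board, stopping as soon as a game is over.

```julia
# Board: a 9-tuple of 0 (empty), 1 (first player), 2 (second player).
const WIN_LINES = ((1, 2, 3), (4, 5, 6), (7, 8, 9),   # rows
                   (1, 4, 7), (2, 5, 8), (3, 6, 9),   # columns
                   (1, 5, 9), (3, 5, 7))              # diagonals

# True if player `p` occupies all three cells of some line.
has_won(board, p) = any(all(board[i] == p for i in line) for line in WIN_LINES)

# A game ends on a win for either player or when the board is full.
is_terminal(board) = has_won(board, 1) || has_won(board, 2) || all(!=(0), board)

function count_reachable_states()
    start = ntuple(_ -> 0, 9)
    seen = Set([start])
    frontier = [(start, 1)]                  # (board, player to move)
    while !isempty(frontier)
        board, player = pop!(frontier)
        is_terminal(board) && continue       # no moves after the game ends
        for i in 1:9
            board[i] == 0 || continue        # only empty cells are legal
            next = Base.setindex(board, player, i)
            if !(next in seen)
                push!(seen, next)
                push!(frontier, (next, 3 - player))
            end
        end
    end
    length(seen)
end

count_reachable_states()   # 5478, including the empty board
```

A board fixes whose turn it is (by counting the marks), so deduplicating on the board alone is enough.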

In [122]:
observe(env)

Observation{Float64,Bool,Int64,NamedTuple{(:legal_actions,),Tuple{Array{Bool,1}}}}(0.0, false, 4175, (legal_actions = Bool[1, 1, 1, 1, 1, 1, 1, 1, 1, 0],))

Now we'll use a Monte Carlo method to estimate the value of each state for each player. The idea is simple: if we have an accurate estimate for every state that can be reached by taking an action from the current observation, then we can just choose the action that leads to the state with the highest estimate.

Let's create a value approximator first (here we use the `TabularVApproximator` defined in `ReinforcementLearning.jl`):

In [123]:
V1 = TabularVApproximator(nstates)
V2 = TabularVApproximator(nstates)

TabularVApproximator([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

As you can see, all the estimates are initialized to `0.0` by default. Usually this is not a problem, but here we can choose a better starting point. For each state, we check whether it is a final state and set the initial estimate accordingly: `1.0` for a win, `0.0` for a loss, and `0.5` for a draw or for any state where the game is still in progress.

In [124]:
function init_V!(V, role)
    for i in 1:length(V.table)
        s = TicTacToe.ID2STATE[i]
        isdone, winner = TicTacToe.STATES_INFO[s]
        if isdone
            if winner === nothing
                V.table[i] = 0.5   # draw
            elseif winner === role
                V.table[i] = 1.0   # this player won
            else
                V.table[i] = 0.0   # the opponent won
            end
        else
            V.table[i] = 0.5       # game still in progress: no prior knowledge
        end
    end
    V
end

init_V! (generic function with 1 method)

In [125]:
init_V!(V1, TicTacToe.offensive)
init_V!(V2, TicTacToe.defensive)

TabularVApproximator([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5  …  0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])

Then we construct a `MonteCarloLearner` for each player. Here the `MonteCarloLearner` is just a wrapper around the approximator.

In [126]:
learner_1 = MonteCarloLearner(V1; α=0.1, kind=:EveryVisit)
learner_2 = MonteCarloLearner(V2; α=0.1, kind=:EveryVisit)

MonteCarloLearner{:EveryVisit,TabularVApproximator,CachedSampleAvg}(TabularVApproximator([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5  …  0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]), 1.0, 0.1, CachedSampleAvg(Dict{Any,Any}()))
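To make the role of `α=0.1` and `kind=:EveryVisit` more concrete, here is a minimal sketch of the textbook constant-α every-visit Monte Carlo update on a plain vector. It is not the implementation inside `ReinforcementLearning.jl`; the function name `mc_update!`, its arguments and the discount `γ = 1.0` are assumptions made for this illustration.

```julia
# Constant-α every-visit Monte Carlo update (illustrative, not the library code).
# `states[t]` is the state visited at step t and `rewards[t]` the reward that followed it.
function mc_update!(V::Vector{Float64}, states::Vector{Int}, rewards::Vector{Float64};
                    α = 0.1, γ = 1.0)
    G = 0.0
    # Walk the episode backwards so the return G can be accumulated incrementally.
    for t in length(states):-1:1
        G = rewards[t] + γ * G
        s = states[t]
        V[s] += α * (G - V[s])   # move the estimate a fraction α towards the return
    end
    V
end

V = fill(0.5, 4)
mc_update!(V, [1, 2, 3], [0.0, 0.0, 1.0])   # a three-step episode that ends with reward 1
```

With every estimate starting at `0.5` and a final reward of `1.0`, each visited state moves from `0.5` to `0.55`, while the unvisited state stays at `0.5`.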

Finally we will create the `MonteCarloAgent`. To create such an agent, we need to provide a `learner` and a `policy`. We already have the learners above. Now let's create a policy.

A policy is a mapping from states to actions. Since we already have value estimates for the states, a simple policy is to look up the estimates of all possible next states and select the action that leads to the best one. To keep exploring, the policy below picks a random legal action 1% of the time.

In [127]:
function create_policy(V, role)
    obs -> begin
        # Indices of the legal actions and the ID of the current state.
        legal_actions, state = findall(get_legal_actions(obs)), get_state(obs)
        # One successor state per legal action, in the same order.
        next_states = TicTacToe.get_next_states(TicTacToe.ID2STATE[state], role, legal_actions)
        next_state_estimations = [V(TicTacToe.STATE2ID[ns]) for ns in next_states]
        _, idx = findmax(next_state_estimations)
        # Explore with probability 0.01, otherwise act greedily.
        rand() < 0.01 ? rand(legal_actions) : legal_actions[idx]
    end
end

create_policy (generic function with 1 method)
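The selection rule used in `create_policy` is usually called ε-greedy. Stripped of the tic-tac-toe details, it can be sketched as follows; the function name `epsilon_greedy` and the explicit `rng` argument are choices made for this illustration and are not part of `ReinforcementLearning.jl`.

```julia
using Random

# Return the index of the highest estimate, except that with probability ε
# a uniformly random index is returned instead.
function epsilon_greedy(rng::AbstractRNG, estimates::Vector{Float64}; ε = 0.01)
    rand(rng) < ε ? rand(rng, eachindex(estimates)) : argmax(estimates)
end

rng = MersenneTwister(0)
epsilon_greedy(rng, [0.2, 0.9, 0.5]; ε = 0.0)   # ε = 0 means purely greedy: returns 2
```

Passing the random number generator explicitly makes the exploration reproducible, which is convenient when comparing training runs.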

In [128]:
π_1 = create_policy(V1, TicTacToe.offensive)
π_2 = create_policy(V2, TicTacToe.defensive)

#69 (generic function with 1 method)

In [129]:
agent_1 = MonteCarloAgent(TicTacToe.offensive, learner_1, π_1, episode_RTSA_buffer())
agent_2 = MonteCarloAgent(TicTacToe.defensive, learner_2, π_2, episode_RTSA_buffer())

MonteCarloAgent{MonteCarloLearner{:EveryVisit,TabularVApproximator,CachedSampleAvg},var"##69#71"{TabularVApproximator,RLIntro.TicTacToe.Defensive},EpisodeTurnBuffer{(:reward, :terminal, :state, :action),Tuple{Float64,Bool,Int64,Int64},NamedTuple{(:reward, :terminal, :state, :action),Tuple{Array{Float64,1},Array{Bool,1},Array{Int64,1},Array{Int64,1}}}},RLIntro.TicTacToe.Defensive}(O, MonteCarloLearner{:EveryVisit,TabularVApproximator,CachedSampleAvg}(TabularVApproximator([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5  …  0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]), 1.0, 0.1, CachedSampleAvg(Dict{Any,Any}())), var"##69#71"{TabularVApproximator,RLIntro.TicTacToe.Defensive}(TabularVApproximator([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5  …  0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]), O), NamedTuple{(:reward, :terminal, :state, :action),Tuple{Float64,Bool,Int64,Int64}}[])

In [139]:
train((agent_1, agent_2), env, StopAfterStep(1000000))
