# Reinforcement Learning Tutorial for POMDPs.jl

Inspired by CS234.

In [1]:
using POMDPs

## Install Reinforcement Learning algorithms

In [2]:
# Uncomment to install the solver package (only needed once):
# POMDPs.add("TabularTDLearning")

In [3]:
using TabularTDLearning, POMDPToolbox

## Problem overview

### Wall-E exploration


Consider a 1D gridworld with a different reward at each end. Wall-E must find the plant, but he might greedily join his lover instead...

![wall-e-mdp](initial_state.png)

The environment can be modeled as follows:

- 10 states: 1,2,3,4,5,6,7,8,9,10
- 2 actions: left, right
- an episode lasts until a reward is found (states 1 and 10 are terminal)
- there is a reward of +1 in state 1 (Eve) and +10 in state 10 (the plant)
- Wall-E starts in state 5 by default (the `start` field is configurable)


In [134]:
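mutable struct MarsExp <: MDP{Int64, Symbol}
    r_left::Float64   # reward for reaching state 1 (Eve)
    r_right::Float64  # reward for reaching state 10 (the plant)
    start::Int64      # initial state
    γ::Float64        # discount factor
    MarsExp(; r_left::Float64 = 1., r_right::Float64 = 10., start::Int64 = 5, γ::Float64 = 0.99) = new(r_left, r_right, start, γ)
end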

In [135]:
# Passing the solver type instead of a solver instance raises a MethodError:
@requirements_info QLearningSolver MarsExp()

LoadError: MethodError: no method matching requirements_info(::Type{TabularTDLearning.QLearningSolver}, ::MarsExp)
Closest candidates are:
  requirements_info(::Union{POMDPs.Simulator, POMDPs.Solver}, ::Union{POMDPs.MDP, POMDPs.POMDP}, ::Any...) at C:\Users\Maxime\.julia\v0.6\POMDPs\src\requirements_interface.jl:140

In [136]:
function POMDPs.states(mdp::MarsExp)
    return 1:1:10
end
POMDPs.state_index(mdp::MarsExp, s::Int64) = s
POMDPs.n_states(mdp::MarsExp) = 10

In [137]:
function POMDPs.actions(mdp::MarsExp)
    return [:left, :right]
end
POMDPs.action_index(mdp::MarsExp, a::Symbol) = a == :left ? 1 : 2
POMDPs.n_actions(mdp::MarsExp) = 2

In [177]:
# Deterministic transitions: move one cell left or right, clamped to the grid.
function POMDPs.generate_s(mdp::MarsExp, s::Int64, a::Symbol, rng::AbstractRNG)
    if a == :left
        return max(1, s-1)
    elseif a == :right
        return min(10, s+1)
    end
end
# The reward is collected on entering either end of the grid.
function POMDPs.reward(mdp::MarsExp, s::Int64, a::Symbol, sp::Int64)
    if sp == 1
        return mdp.r_left
    elseif sp == 10
        return mdp.r_right
    else
        return 0.0
    end
end
# Every episode starts from the configured start state.
function POMDPs.initial_state(mdp::MarsExp, rng::AbstractRNG)
    return mdp.start
end

In [178]:
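# Both ends of the grid are terminal.
function POMDPs.isterminal(mdp::MarsExp, s::Int64)
    return s == 1 || s == 10
end
# The solver also requires the discount factor.
POMDPs.discount(mdp::MarsExp) = mdp.γ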

## Solve with Q-learning

First, check that the MDP implements everything the Q-learning solver requires:

In [179]:
@requirements_info QLearningSolver(MarsExp()) MarsExp()


INFO: POMDPs.jl requirements for solve(::QLearningSolver, ::Union{POMDPs.MDP,POMDPs.POMDP}) and dependencies. ([✔] = implemented correctly; [X] = missing)

For solve(::QLearningSolver, ::Union{POMDPs.MDP,POMDPs.POMDP}):
  [✔] initial_state(::MarsExp, ::AbstractRNG)
  [✔] generate_sr(::MarsExp, ::Int64, ::Symbol, ::AbstractRNG)
  [✔] state_index(::MarsExp, ::Int64)
  [✔] action_index(::MarsExp, ::Symbol)
  [✔] discount(::MarsExp)



true

Then initialize the problem and the solver with the desired hyperparameters:

In [185]:
mdp = MarsExp(start=4)
solver = QLearningSolver(mdp, learning_rate=0.1, n_episodes=1000, max_episode_length=50, eval_every=50, n_eval_traj=100, 
                         exp_policy = EpsGreedyPolicy(mdp, 0.5))

TabularTDLearning.QLearningSolver(1000, 50, 0.1, POMDPToolbox.EpsGreedyPolicy(0.5, POMDPToolbox.ValuePolicy{Any}(MarsExp(1.0, 10.0, 4, 0.99), [0.0 0.0; 0.0 0.0; … ; 0.0 0.0; 0.0 0.0], Any[:left, :right]), POMDPToolbox.StochasticPolicy(MersenneTwister(UInt32[0xedad597d, 0x96c954c0, 0x5ac276a3, 0xc2ad1a2a], Base.dSFMT.DSFMT_state(Int32[-1522690906, 1073239195, 458614064, 1072943064, 658217498, 1073106907, 375654207, 1073190151, -807739292, 1073376892  …  388331957, 1073333209, 918934450, 1073169483, -44771095, -986091191, 85992859, -1875977699, 382, 0]), [1.52066, 1.23824, 1.3945, 1.47388, 1.65197, 1.22928, 1.4625, 1.54486, 1.9083, 1.51305  …  1.17703, 1.04775, 1.26268, 1.28971, 1.3476, 1.10172, 1.63457, 1.23743, 1.61031, 1.45417], 356), Symbol[:left, :right], MarsExp(1.0, 10.0, 4, 0.99), POMDPToolbox.VoidUpdater())), [0.0 0.0; 0.0 0.0; … ; 0.0 0.0; 0.0 0.0], 50, 100)
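
For reference, the textbook tabular Q-learning update is written out below. This assumes `QLearningSolver` follows the standard rule, with `learning_rate` in the role of $\alpha$ and the MDP's discount in the role of $\gamma$; the solver's source is the authority on the exact details.

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

Here $(s, a, r, s')$ is one observed transition, and the $\max$ term is conventionally zero when $s'$ is terminal. With the settings above, $\alpha = 0.1$ and $\gamma = 0.99$. Assuming `EpsGreedyPolicy(mdp, 0.5)` is a standard ε-greedy rule, the behavior policy picks a uniformly random action with probability 0.5 and the action with the highest current Q-value otherwise.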

We are now ready to solve for the optimal policy!

In [186]:
policy = solve(solver, mdp)

On Iteration 50, Returns: 1.0
On Iteration 100, Returns: 1.0
On Iteration 150, Returns: 1.0
On Iteration 200, Returns: 1.0
On Iteration 250, Returns: 1.0
On Iteration 300, Returns: 1.0
On Iteration 350, Returns: 1.0
On Iteration 400, Returns: 1.0
On Iteration 450, Returns: 1.0
On Iteration 500, Returns: 1.0
On Iteration 550, Returns: 1.0
On Iteration 600, Returns: 1.0
On Iteration 650, Returns: 1.0
On Iteration 700, Returns: 1.0
On Iteration 750, Returns: 1.0
On Iteration 800, Returns: 1.0
On Iteration 850, Returns: 1.0
On Iteration 900, Returns: 1.0
On Iteration 950, Returns: 1.0
On Iteration 1000, Returns: 1.0


POMDPToolbox.ValuePolicy{Any}(MarsExp(1.0, 10.0, 4, 0.99), [0.0 0.0; 1.0 1.0; … ; 0.0413447 0.0; 0.0 0.0], Any[:left, :right])

In [182]:
policy.value_table

10×2 Array{Float64,2}:
  0.0   0.0
  1.0  10.0
 10.0  10.0
 10.0  10.0
 10.0  10.0
 10.0  10.0
 10.0  10.0
 10.0  10.0
 10.0  10.0
  0.0   0.0
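
Each row of the table is a state and each column an action, in the order given by `action_index` (`:left`, then `:right`). As a minimal sketch, the greedy action per state can be read off as follows; `greedy_actions` is a hypothetical helper written for this tutorial, not part of POMDPToolbox, and the matrix is copied from the output above so that the snippet runs on its own.

```julia
# Hypothetical helper: greedy action for every state of a tabular Q-function.
# Rows index states; columns index actions in the order given by `actions`.
function greedy_actions(q_table, actions)
    # findmax returns (value, index); ties resolve to the first index.
    return [actions[findmax(q_table[s, :])[2]] for s in 1:size(q_table, 1)]
end

# Value table printed above (rows 1 and 10 are the terminal states).
q = [ 0.0  0.0
      1.0 10.0
     10.0 10.0
     10.0 10.0
     10.0 10.0
     10.0 10.0
     10.0 10.0
     10.0 10.0
     10.0 10.0
      0.0  0.0]

println(greedy_actions(q, [:left, :right]))
```

In the notebook, `greedy_actions(policy.value_table, [:left, :right])` would do the same on the learned table. Rows whose two entries are equal carry no preference, so the sketch falls back to the first action (`:left`) there.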

## Simulate

In [120]:
include("render.jl")



In [121]:
random_policy = RandomPolicy(mdp)

POMDPToolbox.RandomPolicy{MersenneTwister,MarsExp,POMDPToolbox.VoidUpdater}(MersenneTwister(UInt32[0xedad597d, 0x96c954c0, 0x5ac276a3, 0xc2ad1a2a], Base.dSFMT.DSFMT_state(Int32[-1061178416, 1073167218, -2057272038, 1073641417, 133238069, 1072917503, -1778145389, 1073286043, 1801412150, 1073439155  …  1162524402, 1073260592, -966052076, 1072766075, -1961213945, 802650271, -494256477, -1906918512, 382, 0]), [1.45201, 1.90424, 1.21387, 1.56533, 1.71135, 1.06647, 1.54372, 1.64552, 1.143, 1.27429  …  1.49677, 1.09577, 1.12017, 1.49667, 1.12516, 1.15541, 1.825, 1.09416, 1.54106, 1.06945], 186), MarsExp(1.0, 10.0, 5, 0.9), POMDPToolbox.VoidUpdater())

In [122]:
rng = MersenneTwister(3)
hist = HistoryRecorder(max_steps=1000)

hist = simulate(hist, mdp, random_policy, initial_state(mdp, rng))

POMDPToolbox.MDPHistory{Int64,Symbol}([5, 6, 7, 6, 5, 4, 5, 6, 5, 4, 5, 4, 3, 2, 1], Symbol[:right, :right, :left, :left, :left, :right, :right, :left, :left, :right, :left, :left, :left, :left], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0], 1.0, Nullable{Exception}(), Nullable{Any}())

In [116]:
random_perf = 0.
for i=1:100
    hist = HistoryRecorder(max_steps=1000)
    hist = simulate(hist, mdp, random_policy, initial_state(mdp, rng))
    random_perf += discounted_reward(hist)
end
println(random_perf/100)

3.79
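
The loop above averages `discounted_reward(hist)` over 100 episodes. Assuming that function computes the usual discounted sum $\sum_t \gamma^t r_t$ with the first reward undiscounted, a dependency-free sketch looks like this; `discounted_return` is a hypothetical name chosen here, not a POMDPToolbox function.

```julia
# Hypothetical helper: discounted return of one episode,
# with the first reward undiscounted (factor γ^0).
function discounted_return(rewards, γ)
    total = 0.0
    factor = 1.0
    for r in rewards
        total += factor * r
        factor *= γ
    end
    return total
end

# Reward sequence of the random-policy episode shown above:
# thirteen steps without reward, then +1 on reaching state 1.
rewards = vcat(zeros(13), 1.0)

println(discounted_return(rewards, 1.0))   # no discounting
println(discounted_return(rewards, 0.99))  # equals 0.99^13
```

The longer a random walk takes to reach a reward, the more that reward is shrunk by the discount, which is why wandering is penalized even when it eventually succeeds.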


## Some interesting experiments

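If you reduce exploration (a smaller ε in `EpsGreedyPolicy`), Q-learning can settle on a suboptimal policy: Wall-E finds Eve first and rarely wanders far enough to discover the larger reward at the plant.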

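If the discount factor is too low, you end up with a greedier policy: the distant reward at the plant is discounted so heavily that the nearby reward at Eve becomes worth more.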
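
Because transitions are deterministic, the effect of the discount can be worked out by hand. The sketch below assumes the first reward of an episode is undiscounted and uses the constructor defaults `r_left = 1` and `r_right = 10`; $G_{\text{left}}$ and $G_{\text{right}}$ are notation introduced here for the returns of always moving left or always moving right from a non-terminal state $s$.

$$
G_{\text{left}}(s) = \gamma^{\,s-2} \cdot 1, \qquad G_{\text{right}}(s) = \gamma^{\,9-s} \cdot 10
$$

Moving right is better exactly when $10\,\gamma^{\,11-2s} > 1$. For the start state $s = 4$ used in the Q-learning run, this reads $10\,\gamma^{3} > 1$, i.e. $\gamma > 10^{-1/3} \approx 0.46$. With $\gamma = 0.99$ the plant is the better goal; below roughly 0.46 the optimal policy itself heads for Eve.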