In [22]:
using POMDPs
using POMDPModels
using POMDPModelTools
using BeliefUpdaters
using POMDPPolicies
using POMDPSimulators
using QuickPOMDPs


# Problem 1

For Problem 1, you may wish to use a belief updater:

In [23]:
# You can construct a standard crying baby POMDP model like this
r_feed = -5.0
r_hungry = -10.0
p_become_hungry = 0.1
p_cry_when_hungry = 0.8
p_cry_when_not_hungry = 0.1
γ = 0.9
m = BabyPOMDP(r_feed, r_hungry,
              p_become_hungry,
              p_cry_when_hungry,
              p_cry_when_not_hungry,
              γ
             )

# States, actions, and observations are all represented by Bools:
# true means hungry (state), feed (action), or crying (observation).

# Belief updates then work as follows.
up = DiscreteUpdater(m)
b = initialize_belief(up, Deterministic(false)) # start certain that the baby is not hungry
showdistribution(b); println()
a = false; o = false # do not feed; observe no crying
b = update(up, b, a, o)
showdistribution(b); println()
b = update(up, b, a, o)
showdistribution(b); println()

         DiscreteBelief{BabyPOMDP,Bool} distribution
         ┌                                        ┐
   false ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 1.0
    true ┤ 0.0
         └                                        ┘
         DiscreteBelief{BabyPOMDP,Bool} distribution
         ┌                                        ┐
   false ┤■■■■■■■■■■■■■■■■■■■■ 0.9759036144578314
    true ┤ 0.02409638554216867
         └                                        ┘
         DiscreteBelief{BabyPOMDP,Bool} distribution
         ┌                                        ┐
   false ┤■■■■■■■■■■■■■■■■■■■■ 0.9701315984030756
    true ┤■ 0.029868401596924433
         └                                        ┘

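To see where these numbers come from, the same two updates can be reproduced by hand with a plain Bayes filter. This is only a sketch, not the `DiscreteUpdater` implementation: the helper `update_not_fed_quiet` is a hypothetical name, and it assumes that an unfed hungry baby stays hungry with probability 1, while a baby that is not hungry becomes hungry with probability `p_become_hungry`.

```julia
# Parameters copied from the BabyPOMDP constructor call above
p_become_hungry = 0.1
p_cry_when_hungry = 0.8
p_cry_when_not_hungry = 0.1

# One Bayes-filter step for a = false (do not feed) and o = false (no crying).
# Assumption: an unfed hungry baby stays hungry with probability 1.
function update_not_fed_quiet(b_hungry)
    # Prediction: probability of being hungry after the transition
    pred_hungry = b_hungry + (1 - b_hungry) * p_become_hungry
    # Correction: weight each state by the likelihood of observing no crying
    w_hungry = pred_hungry * (1 - p_cry_when_hungry)
    w_not_hungry = (1 - pred_hungry) * (1 - p_cry_when_not_hungry)
    return w_hungry / (w_hungry + w_not_hungry)
end

b1 = update_not_fed_quiet(0.0)  # start certain that the baby is not hungry
b2 = update_not_fed_quiet(b1)
println("P(hungry) after one step:  ", b1)  # about 0.0241
println("P(hungry) after two steps: ", b2)  # about 0.0299
```

Under these assumptions, the values agree with the `true` rows of the second and third distributions printed above.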
# Problem 2

For Problem 2, you should create a model of the problem using QuickPOMDPs.

The following is an example of the Tiger POMDP, modified so that the initial state distribution (and hence the initial belief) places the tiger behind the left door with certainty.

See the `DiscreteExplicitPOMDP` docstring for more information including information about terminal states.

Note that for a more compact representation, you may want to use [`QuickPOMDP`](https://juliapomdp.github.io/QuickPOMDPs.jl/stable/quick/) ([example](https://github.com/JuliaPOMDP/QuickPOMDPs.jl/blob/master/examples/lightdark.jl)) rather than `DiscreteExplicitPOMDP`, but this requires a little more knowledge of Julia.

In [31]:
S = [:left, :right]
A = [:left, :right, :listen]
O = [:left, :right]
γ = 0.95

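function T(s, a, sp)
    if a == :listen
        return s == sp # listening does not move the tiger
    else # a door is opened
        return 0.5 # the problem resets: the tiger is behind either door with equal probability
    end
end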

function Z(a, sp, o)
    if a == :listen
        if o == sp
            return 0.85
        else
            return 0.15
        end
    else
        return 0.5
    end
end

function R(s, a)
    if a == :listen
        return -1.0
    elseif s == a # the door hiding the tiger was opened
        return -100.0
    else # the other door was opened, escaping the tiger
        return 10.0
    end
end

b₀ = Deterministic(:left)

m = DiscreteExplicitPOMDP(S,A,O,T,Z,R,γ,b₀);
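When writing `T` and `Z` by hand, it is easy to end up with probabilities that do not sum to 1. The following is a minimal sanity check in plain Julia, not part of QuickPOMDPs. It restates `T` and `Z` compactly so that it runs on its own; in a session where they are already defined, those two lines can be skipped.

```julia
S = [:left, :right]
A = [:left, :right, :listen]
O = [:left, :right]

# Compact restatements of the T and Z functions defined above
T(s, a, sp) = a == :listen ? Float64(s == sp) : 0.5
Z(a, sp, o) = a == :listen ? (o == sp ? 0.85 : 0.15) : 0.5

# For every (s, a) pair, T must sum to 1 over the next states sp
@assert all(sum(T(s, a, sp) for sp in S) ≈ 1.0 for s in S, a in A)
# For every (a, sp) pair, Z must sum to 1 over the observations o
@assert all(sum(Z(a, sp, o) for o in O) ≈ 1.0 for a in A, sp in S)
println("T and Z are properly normalized")
```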

After creating the model, you can define policies. By default, a `FunctionPolicy` receives the previous observation as its input.

In [32]:
obs_based_policy = FunctionPolicy(
    function (o)
        if o == :left
            return :right
        else
            return :left
        end
    end
)

rsum = 0.0
N = 100_000
for i in 1:N
    sim = RolloutSimulator(max_steps=100)
    rsum += simulate(sim, m, obs_based_policy)
end
rsum/N

-949.3876979559959
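As a rough sanity check on this estimate, the expected discounted return can be approximated by hand. The sketch below rests on two assumptions that are not stated above: first, that the initial call to the policy receives something other than `:left`, so the first action is `:left`, which opens the tiger's door because `b₀` places the tiger on the left; second, that on every later step the tiger has just been reset and the observation carries no information, so the opened door hides the tiger with probability 0.5.

```julia
γ = 0.95
max_steps = 100

# Assumed: the first action opens the door that hides the tiger
first_step = -100.0
# Later steps: the tiger is behind the opened door with probability 0.5
per_step = 0.5 * (-100.0) + 0.5 * 10.0
# Discounted sum over steps 2 to max_steps (discount exponents 1 to max_steps - 1)
expected_return = first_step + per_step * sum(γ^t for t in 1:max_steps-1)
println(expected_return)  # roughly -949.7
```

If these assumptions hold, the hand calculation lands close to the simulated average, which suggests that the poor return comes from opening a door on every step without ever listening.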

You can also choose actions based on the belief by specifying a belief updater in the `simulate` function call; the policy function then receives the belief as its input. Use the POMDPs.jl [Distribution Interface](http://juliapomdp.github.io/POMDPs.jl/stable/interfaces/#Distributions-1) to interact with the belief:

In [35]:
belief_based_policy = FunctionPolicy(
    function (b)
        if pdf(b, :left) > 0.95
            return :right
        elseif pdf(b, :right) > 0.95
            return :left
        else
            return :listen
        end
    end
)

up = DiscreteUpdater(m);

In [34]:
rsum = 0.0
N = 100_000
for i in 1:N
    sim = RolloutSimulator(max_steps=100)
    rsum += simulate(sim, m, belief_based_policy, up)
end
rsum/N

28.35926829324021
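The 0.95 threshold in `belief_based_policy` determines how long the agent listens before opening a door. The sketch below applies Bayes' rule by hand, using the listening accuracy of 0.85 from `Z`; the helper `listen_left` is a hypothetical name, and the starting belief of 0.5 assumes a door has just been opened so that the tiger was reset.

```julia
# Posterior probability that the tiger is on the left after hearing it on the left,
# given the prior probability b_left (listening accuracy 0.85, taken from Z above)
listen_left(b_left) = 0.85 * b_left / (0.85 * b_left + 0.15 * (1 - b_left))

p0 = 0.5              # belief right after a reset
p1 = listen_left(p0)  # 0.85: below the 0.95 threshold, so the policy listens again
p2 = listen_left(p1)  # about 0.970: above the threshold, so the policy opens the right door
println(p1)
println(p2)
```

Under these assumptions, two agreeing observations are enough to cross the threshold, whereas a single observation is not. This is why the belief-based policy spends a few steps listening and then earns a positive return on average.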