## Defining a Heuristic Policy
In this section, we walk through how to define a POMDP policy of your own. For more details on POMDPs and their policies, please consult Chapter 6 of the DMU textbook [1]. We will define a simple heuristic policy that takes the action that maximises the expected single-step reward, given the current belief state, and compare it against a policy that chooses actions at random. See the documentation of [POMDPPolicies.jl](https://github.com/JuliaPOMDP/POMDPPolicies.jl) for more on the code structure of a policy object that is compatible with POMDPs.jl. We will use the explicit TigerPOMDP model; see [this notebook](https://github.com/JuliaPOMDP/POMDPExamples.jl/blob/master/notebooks/Defining-a-POMDP-with-the-Explicit-Interface.ipynb) for more on that.

[1] Kochenderfer, Mykel J. *Decision Making Under Uncertainty: Theory and Application*. MIT Press, 2015.

In [1]:
using POMDPs
using POMDPPolicies # For defining a policy
using POMDPModels # For the TigerPOMDP Model
using BeliefUpdaters # To use DiscreteUpdater

We will define a `HeuristicPolicy` type that stores only the POMDP instance and the list of valid actions, along with a convenience constructor that reads the actions from the POMDP.

In [2]:
struct HeuristicPolicy{P<:POMDP, A} <: Policy
    pomdp::P
    action_map::Vector{A}
end

In [3]:
function HeuristicPolicy(pomdp::POMDP)
    HeuristicPolicy(pomdp, actions(pomdp))
end

HeuristicPolicy

### Overriding the POMDPs.jl action function
Now we will define the `action` function, which specifies the behavior of the heuristic policy. It requires the belief state to be represented as a `DiscreteBelief`, i.e. a probability mass function over the individual states. For each action, it computes the expected single-step reward under the current belief, and it returns the action with the highest value; ties go to the action that appears first in `action_map`. Note that we must qualify the name as `POMDPs.action` so that we add a method to the `action` function of `POMDPs.jl` rather than define a new, unrelated function.
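
In symbols, the rule described above can be written as follows. The notation is ours and is only meant as a compact summary: $b(s)$ is the probability that the belief assigns to state $s$, $R(s, a)$ stands for the value returned by `reward(pomdp, s, a)`, and $\mathcal{S}$ and $\mathcal{A}$ are the state and action sets.

```latex
$$
a^{*}(b) = \arg\max_{a \in \mathcal{A}} \sum_{s \in \mathcal{S}} b(s)\, R(s, a)
$$
```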

In [4]:
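function POMDPs.action(p::HeuristicPolicy, b::DiscreteBelief)
    max_value = -Inf
    best_idx = 1
    # Iterate over the stored actions directly, so the loop always matches `action_map`
    for i in eachindex(p.action_map)
        a = p.action_map[i]
        # Expected single-step reward of `a`: sum over states of b(s) * R(s, a)
        action_val = 0.0
        for (bel, state) in zip(b.b, b.state_list)
            action_val += bel * reward(p.pomdp, state, a)
        end

        # Strict `>` keeps the earlier action when two values are tied
        if action_val > max_value
            best_idx = i
            max_value = action_val
        end
    end

    return p.action_map[best_idx]
end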
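
To make the arithmetic inside `action` concrete, here is a minimal, package-free sketch of the same computation on Tiger-like numbers. It assumes rewards of -1 for listening, -100 for opening the tiger's door, and +10 for opening the other door. The action labels, the reward matrix layout, and the helper names `expected_rewards` and `greedy_action` are hypothetical and exist only for this illustration; they are not part of POMDPModels.

```julia
# Assumed Tiger rewards for this sketch (not read from POMDPModels).
const R_LISTEN = -1.0
const R_FIND_TIGER = -100.0
const R_ESCAPE = 10.0

# Hypothetical action labels, used only in this illustration.
const ACTIONS = (:listen, :open_left, :open_right)

# Rows: actions in the order above; columns: states (tiger_left, tiger_right).
const REWARDS = [R_LISTEN     R_LISTEN;
                 R_FIND_TIGER R_ESCAPE;
                 R_ESCAPE     R_FIND_TIGER]

# Expected one-step reward of every action under the belief
# b = [P(tiger_left), P(tiger_right)].
expected_rewards(b) = REWARDS * b

# Greedy choice: the action with the largest expected one-step reward.
greedy_action(b) = ACTIONS[argmax(expected_rewards(b))]

for p in (0.5, 0.85, 0.97)
    b = [p, 1 - p]
    println("P(tiger_left) = ", p, " => ", greedy_action(b))
end
```

Under these assumed rewards, opening the door away from the likelier tiger location has expected reward 10p - 100(1 - p) = 110p - 100, which beats listening (-1) only when p > 0.9. The greedy rule therefore keeps listening until the belief is quite concentrated.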

### Benchmarking the Heuristic Policy
Note that unlike the other examples on using *solvers*, here we have already specified a *policy*. Therefore, we can evaluate the policy directly on a problem, in this case `TigerPOMDP` (defined in [POMDPModels](https://github.com/JuliaPOMDP/POMDPModels.jl)). We define the POMDP problem and create the policies based on it.

Since we only care about the discounted reward, we can use the rollout simulator defined in [POMDPSimulators](https://github.com/JuliaPOMDP/POMDPSimulators.jl). Check out this [notebook](https://github.com/JuliaPOMDP/POMDPExamples.jl/blob/master/notebooks/Running-Simulations.ipynb) for ways to use the other simulators as well. Finally, we compare the discounted rewards of the two policies and see that the heuristic policy does considerably better than the random one. Keep in mind that each value comes from a single rollout of at most 10 steps, so it is a noisy sample rather than an exact expectation.
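
As a reminder of what "discounted reward" means here, the sketch below accumulates a reward sequence with the usual geometric weighting. It is a standalone illustration under the standard definition, not the simulator's actual code; the function name `discounted_return` and the example episode are hypothetical, and the discount of 0.95 is an assumed value.

```julia
# Discounted return of a reward sequence r_1, r_2, ...:
# the sum over t of discount^(t-1) * r_t.
function discounted_return(rewards, discount)
    total = 0.0
    weight = 1.0            # discount^(t-1), starting at discount^0 = 1
    for r in rewards
        total += weight * r
        weight *= discount
    end
    return total
end

# Hypothetical episode: two steps with reward -1, then one with reward +10.
println(discounted_return([-1.0, -1.0, 10.0], 0.95))
```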

In [6]:
pomdp = TigerPOMDP()

TigerPOMDP(-1.0, -100.0, 10.0, 0.85, 0.95)

In [7]:
heur_pol = HeuristicPolicy(pomdp)

HeuristicPolicy{TigerPOMDP,Int64}(TigerPOMDP(-1.0, -100.0, 10.0, 0.85, 0.95), [0, 1, 2])

In [8]:
# Define a random policy as a benchmark
rand_policy = RandomPolicy(pomdp);

In [9]:
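using POMDPSimulators
# Each rollout runs for at most 10 steps; `simulate` returns the total discounted reward
rollout_sim = RolloutSimulator(max_steps=10);
# Despite the `history_` prefix, these variables hold scalar rewards, not histories
history_heur = simulate(rollout_sim, pomdp, heur_pol, DiscreteUpdater(pomdp));
history_rand = simulate(rollout_sim, pomdp, rand_policy, DiscreteUpdater(pomdp));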

In [10]:
@show history_heur;
@show history_rand;

history_heur = 9.583949041798828
history_rand = -207.38836272128898
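
Because a single rollout is random, a fairer comparison averages many of them. The sketch below shows one way to do that with only the standard library. The helper `estimate_return` and the stand-in `fake_rollout` are hypothetical names introduced for illustration, and `fake_rollout` is not a POMDP simulation; it only lets the block run on its own.

```julia
using Statistics  # mean and std from the standard library

# Call `run_once` (a zero-argument function returning one discounted reward)
# `n_runs` times; return the sample mean and its standard error.
function estimate_return(run_once; n_runs=100)
    returns = [run_once() for _ in 1:n_runs]
    return mean(returns), std(returns) / sqrt(n_runs)
end

# Stand-in for a real rollout so that the sketch is self-contained.
fake_rollout() = 10.0 * rand() - 5.0

avg, se = estimate_return(fake_rollout; n_runs=1000)
println("mean = ", avg, ", standard error = ", se)
```

In this notebook, `run_once` could be `() -> simulate(rollout_sim, pomdp, heur_pol, DiscreteUpdater(pomdp))`, which reuses the call from the cell above; the same closure with `rand_policy` would give the baseline estimate.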
