## Defining a Heuristic Policy
In this section, we walk through how to define a POMDP policy of your own. For
more details on POMDPs and their policies, please consult Chapter 6 of the DMU
textbook [1]. We will define a simple greedy policy that takes the action that
maximises the expected single-step reward given the current belief state, and
compare it against a policy that chooses actions at random. See the
documentation of
[POMDPPolicies.jl](https://github.com/JuliaPOMDP/POMDPPolicies.jl) for more on
the code structure of a policy object that is compatible with POMDPs.jl. We will
use the explicit TigerPOMDP model; see the
[Defining a POMDP with the Explicit Interface](http://localhost:8888/notebooks/POMDPExamples/notebooks/Defining-a-POMDP-with-the-Explicit-Interface.ipynb)
notebook for more on that.

[1] Kochenderfer, Mykel J. Decision Making Under Uncertainty: Theory and
Application. MIT Press, 2015.

In [8]:
using POMDPs
using POMDPPolicies # For defining a policy
using POMDPModels # For the TigerPOMDP Model
using BeliefUpdaters # To use DiscreteUpdater
using POMDPModelTools # For weighted_iterator

We will define a `GreedyPolicy` type that stores only the POMDP instance. The
set of valid actions is obtained from that instance via `actions` when needed.

In [9]:
struct GreedyPolicy{P<:POMDP} <: Policy
    pomdp::P
end

### Overriding the POMDPs.jl action function
Now we will define a new method for the `action` function, which specifies the
behavior of our policy. It expects the belief state to be represented as a
`DiscreteBelief`, i.e. a probability mass function over individual states. It
computes the expected single-step reward for each action, given the current
belief state, and chooses the action with the highest value. This is sometimes
called a "greedy" or "myopic" policy. Note that we must write `POMDPs.action`
to add a method to the `action` function of `POMDPs.jl`, rather than define a
new, unrelated function.
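
The selection rule described above can be written compactly as follows. The notation is introduced here for illustration and is not taken from the notebook: $b(s)$ is the probability the belief assigns to state $s$, $R(s, a)$ is the single-step reward for taking action $a$ in state $s$, and $\mathcal{S}$ and $\mathcal{A}$ are the state and action sets.

$$
\pi_{\text{greedy}}(b) = \arg\max_{a \in \mathcal{A}} \sum_{s \in \mathcal{S}} b(s)\, R(s, a)
$$

The inner sum is the expected single-step reward of action $a$ under belief $b$. The policy ignores all future rewards, which is why it is called myopic.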

In [10]:
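function POMDPs.action(p::GreedyPolicy, b)
    max_value = -Inf
    as = actions(p.pomdp)
    best_a = first(as)  # fallback in case no action beats -Inf
    for a in as
        # Expected single-step reward of `a` under belief `b`
        action_val = 0.0
        for (state, bel) in weighted_iterator(b)
            action_val += bel*reward(p.pomdp, state, a)
        end

        # Keep the best action seen so far (ties go to the earlier action)
        if action_val > max_value
            best_a = a
            max_value = action_val
        end
    end

    return best_a
end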
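
To see what this rule does on the tiger problem, the computation can be reproduced by hand without any packages. The following is a minimal sketch rather than the POMDPModels.jl implementation. The state and action symbols, the `Dict` belief representation, and the function names are hypothetical stand-ins. The reward values (-1 for listening, -100 for opening the tiger's door, +10 for opening the other door) are assumed to correspond to the first three fields printed by `TigerPOMDP()`.

```julia
# Illustrative stand-in for the tiger problem (hypothetical names,
# not the POMDPModels.jl representation).
const ACTIONS = (:listen, :open_left, :open_right)

# Single-step reward R(s, a): -1 for listening, -100 for opening the
# door with the tiger behind it, +10 for opening the other door.
function tiger_reward(s::Symbol, a::Symbol)
    a == :listen && return -1.0
    opened_tiger_door = (s == :tiger_left && a == :open_left) ||
                        (s == :tiger_right && a == :open_right)
    return opened_tiger_door ? -100.0 : 10.0
end

# Expected single-step reward of action `a` under belief `b`,
# where `b` maps each state to its probability.
expected_reward(b, a) = sum(p * tiger_reward(s, a) for (s, p) in b)

# Greedy choice: the action with the largest expected single-step reward.
function greedy_action(b)
    vals = [expected_reward(b, a) for a in ACTIONS]
    return ACTIONS[argmax(vals)]
end

uniform   = Dict(:tiger_left => 0.5,  :tiger_right => 0.5)
leaning   = Dict(:tiger_left => 0.85, :tiger_right => 0.15)
confident = Dict(:tiger_left => 0.97, :tiger_right => 0.03)

println(greedy_action(uniform))    # listen
println(greedy_action(leaning))    # listen
println(greedy_action(confident))  # open_right
```

With these rewards, opening the door believed to be safe has expected reward 10p - 100(1 - p), where p is the belief that the tiger is behind the other door. This exceeds the listening reward of -1 only when p > 0.9, so the greedy policy keeps listening until the belief is that concentrated.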

### Benchmarking the Policy
Note that, unlike the other examples that use *solvers*, here we have already
specified a *policy*. Therefore, we can evaluate the policy directly on a
problem, in this case `TigerPOMDP` (defined in
[POMDPModels](https://github.com/JuliaPOMDP/POMDPModels.jl)). We define the
POMDP problem and create the policies based on it.

Since we only care about the discounted reward, we can use the rollout
simulator defined in
[POMDPSimulators](https://github.com/JuliaPOMDP/POMDPSimulators.jl). Check out
this
[notebook](https://github.com/JuliaPOMDP/POMDPExamples.jl/blob/master/notebooks/Running-Simulations.ipynb)
for ways to use the other simulators as well. Finally, we can compare the
discounted rewards obtained in a single rollout of each policy and see that the
greedy policy does considerably better than the random one.
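
For reference, the discounted reward of an episode is the sum of the per-step rewards, each weighted by the discount factor raised to the step index. The sketch below shows that calculation in plain Julia, assuming the rollout simulator accumulates rewards in this standard way. The function name `discounted_return` and the three-step episode are hypothetical and are not the rollout recorded in this notebook.

```julia
# Discounted return of a reward sequence r_0, r_1, ..., r_T:
#   sum over t of discount^t * r_t
function discounted_return(rewards, discount)
    total = 0.0
    weight = 1.0  # discount^t, starting at t = 0
    for r in rewards
        total += weight * r
        weight *= discount
    end
    return total
end

# Hypothetical episode: listen twice (-1 each), then open the safe door (+10).
rewards = [-1.0, -1.0, 10.0]
@show discounted_return(rewards, 0.95)  # about 7.075
```

Because the returned value depends on the sampled observations, a single rollout is only one sample of a policy's performance. Averaging over several rollouts gives a steadier comparison.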

In [11]:
pomdp = TigerPOMDP()

TigerPOMDP(-1.0, -100.0, 10.0, 0.85, 0.95)

In [12]:
greedy_pol = GreedyPolicy(pomdp)

GreedyPolicy{TigerPOMDP}(TigerPOMDP(-1.0, -100.0, 10.0, 0.85, 0.95))

In [13]:
# Define a random policy as a benchmark
rand_policy = RandomPolicy(pomdp);

In [15]:
using POMDPSimulators
rollout_sim = RolloutSimulator(max_steps=10);
greedy_reward = simulate(rollout_sim, pomdp, greedy_pol, DiscreteUpdater(pomdp));
rand_reward = simulate(rollout_sim, pomdp, rand_policy, DiscreteUpdater(pomdp));

In [16]:
@show greedy_reward;
@show rand_reward;

greedy_reward = 7.867051041738277
rand_reward = -353.43638007562487
