
Interface for exploration policy #10

Open
MaximeBouton opened this issue Dec 12, 2019 · 9 comments

@MaximeBouton
Contributor

What would be a good interface for specifying the exploration policy?

It is implemented differently here and in DeepQLearning.jl.

  • What is implemented here: only a limited set of possible policies is allowed, e.g. EpsGreedyPolicy, and the internals of that policy are used to access the Q values. I think it is pretty bad: EpsGreedyPolicy should be agnostic to the type of policy used for the greedy part (right now it assumes a tabular policy, I think), and if we improve EpsGreedyPolicy then the code here will break.
  • In DeepQLearning.jl, the user must pass in a function f, and f(policy, env, obs, global_step, rng) will be called to return the action. I took inspiration from MCTS.jl for this. However, it is not super convenient to define a decaying epsilon schedule with this approach.
  • A suggestion is to use a function action(::ExplorationPolicy, current_policy, env, obs, rng). Dispatching on the type of ExplorationPolicy and having users implement their own types seems more Julian than passing a function. The method action is not super consistent with the rest of the POMDPs.jl interface, though, since it takes the current policy and the environment as input.
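A minimal sketch of the third option (all type and function names here are illustrative, not an existing API; `actions(env)` is assumed to return the discrete action set):

```julia
using Random

abstract type ExplorationPolicy end

# Hypothetical concrete exploration policy; users would define their own
# types like this and rely on dispatch rather than passing a function.
struct EpsGreedyExploration <: ExplorationPolicy
    eps::Float64
end

# Explore uniformly with probability eps, otherwise delegate to the
# current (greedy) policy.
function action(p::EpsGreedyExploration, current_policy, env, obs, rng::AbstractRNG)
    if rand(rng) < p.eps
        return rand(rng, actions(env))
    else
        return action(current_policy, obs)
    end
end
```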

Any thoughts?

@rejuvyesh
Member

Yeah. We also need an interface for custom decay schedules for eps.

@zsunberg
Member

Hmm... yes this is a good question.

I think the third option is reasonable. We might consider calling it ExplorationStrategy instead (an ExplorationStrategy might contain a Policy that it uses for exploration).

I think action is an acceptable name for the function. The meaning of action is just "select an action based on the inputs", so I don't think it clashes too badly with action(::Policy, s). Although it seems like the exploration strategy would often contain some state, like the position in the epsilon decay schedule, so maybe the name should have a ! in it.

We should also think about exactly what the arguments should be. Is env needed? What if we pass in the on-policy action instead of current_policy, i.e.

action(::ExplorationStrategy, on_policy_action, obs, rng)

We could also consider leaving out the on-policy action from the call altogether and say, if action returns nothing, use the on-policy action.
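A self-contained sketch of that "return nothing" convention (the `MaybeExplore` type and `explore` function are hypothetical names for illustration):

```julia
using Random

abstract type ExplorationStrategy end

# A strategy that explores with probability eps and otherwise returns
# nothing, signaling "use the on-policy action".
struct MaybeExplore <: ExplorationStrategy
    eps::Float64
    actions::Vector{Int}  # illustrative discrete action set
end

function explore(s::MaybeExplore, obs, rng::AbstractRNG)
    return rand(rng) < s.eps ? rand(rng, s.actions) : nothing
end

# Caller side: fall back to the on-policy action when nothing is returned.
s = MaybeExplore(0.1, [1, 2, 3])
a = explore(s, nothing, MersenneTwister(1))
if a === nothing
    a = 2  # placeholder for the on-policy action
end
```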

(note I might be saying some of the wrong words because I have less experience with RL)

@rejuvyesh
Member

We can't pass in just the action, because for certain exploration strategies, like softmax, one needs the Q values for all actions. Otherwise the idea sounds reasonable. Should that be another package then?

@zsunberg
Member

We can't pass in just the action, because for certain exploration strategies, like softmax, one needs the Q values for all actions.

Ah, I see - that makes sense

Should that be another package then?

Did you mean "Should that be in another package then?" or "Should we create another package?"

I think yes, it should be somewhere besides one of the learning packages, but I would hope to not create a new one. My philosophy on packages has changed a lot since we broke up POMDPToolbox. Now I think it would have been much better to create better documentation and perhaps use submodules than to have a bunch of small packages!

@MaximeBouton
Contributor Author

For now I think this could live in POMDPPolicies, I might take a shot at it next week.

@MaximeBouton
Contributor Author

Do we really want to have both action and action!?
I think it might be confusing, and I am not sure we really respect that convention now: if you have an MCTS policy, we modify internal fields of the policy object and still call action.

@MaximeBouton
Contributor Author

Suggestion:

abstract type AbstractSchedule end # define linear decay, exponential decay and more

# Concrete schedules implement this; it returns the new value of epsilon.
function update_value!(::AbstractSchedule, ::Real) end

mutable struct EpsGreedyPolicy{A} <: Policy
    eps::Real
    schedule::AbstractSchedule
    policy::Policy
    rng::AbstractRNG
    actions::Vector{A}
end

function action(p::EpsGreedyPolicy, s)
    p.eps = update_value!(p.schedule, p.eps) # update the value of epsilon according to the schedule
    if rand(p.rng) < p.eps
        return rand(p.rng, p.actions)
    else
        return action(p.policy, s)
    end
end
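A concrete schedule for this sketch could look like the following (LinearDecaySchedule and its fields are hypothetical names; it assumes update_value! returns the new epsilon rather than mutating it, since a Real is passed by value):

```julia
# Linear decay from `start` to `stop` over `steps` calls.
mutable struct LinearDecaySchedule <: AbstractSchedule
    start::Float64  # initial epsilon
    stop::Float64   # final epsilon
    steps::Int      # number of updates over which to decay
    count::Int      # how many updates have happened so far
end

# Return the new epsilon, linearly interpolated between start and stop.
function update_value!(s::LinearDecaySchedule, eps::Real)
    s.count += 1
    frac = min(s.count / s.steps, 1.0)
    return s.start + frac * (s.stop - s.start)
end
```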

@rejuvyesh
Member

Should update_value! have any restriction on the second argument?

@MaximeBouton
Copy link
Contributor Author

No! I will submit a proper PR next week, but this is just to give you an overview of the idea.
