
Interface for exploration policy #10

Open
MaximeBouton opened this issue Dec 12, 2019 · 9 comments

@MaximeBouton
Contributor

What would be a good interface for specifying the exploration policy?

It is implemented differently here and in DeepQLearning.jl.

  • What is implemented here: only a limited set of possible policies is allowed, e.g. EpsGreedyPolicy, and the internals of that policy are used to access the Q values. I think it is pretty bad: EpsGreedyPolicy should be agnostic to the type of policy used for the greedy part (right now it assumes a tabular policy, I think), and if we improve EpsGreedyPolicy then the code here will break.
  • In DeepQLearning.jl, the user must pass in a function f, and f(policy, env, obs, global_step, rng) will be called to return the action. I took inspiration from MCTS.jl for this. However, it is not super convenient to define a decaying epsilon schedule with this approach.
  • A suggestion is to use a function action(::ExplorationPolicy, current_policy, env, obs, rng). Dispatching on the type of ExplorationPolicy and having users implement their own types seems more Julian than passing a function. The method action is not super consistent with the rest of the POMDPs.jl interface, though, since it takes the current policy and the environment as input.
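A minimal sketch of the third option (all type and function names here are illustrative, not an existing API; `actions(env)` is assumed to return the discrete action set):

```julia
using Random

abstract type ExplorationPolicy end

# Hypothetical concrete exploration policy; users would define their own
# types like this and rely on dispatch rather than passing a function.
struct EpsGreedyExploration <: ExplorationPolicy
    eps::Float64
end

# Explore uniformly with probability eps, otherwise delegate to the
# current (greedy) policy.
function action(p::EpsGreedyExploration, current_policy, env, obs, rng::AbstractRNG)
    if rand(rng) < p.eps
        return rand(rng, actions(env))
    else
        return action(current_policy, obs)
    end
end
```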

Any thoughts?

@rejuvyesh
Member

Yeah. We also need an interface for custom decay schedules for eps.

@zsunberg
Member

Hmm... yes this is a good question.

I think the third option is reasonable. We might consider calling it ExplorationStrategy instead (an ExplorationStrategy might contain a Policy that it uses for exploration).

I think action is an acceptable name for the function. The meaning of action is just "select an action based on the inputs", so I don't think it clashes too badly with action(::Policy, s). Although it seems like the exploration strategy would often contain some state, like the position in the epsilon decay schedule, so maybe the name should have a ! in it.

We should also think about exactly what the arguments should be. Is env needed? What if we pass in the on-policy action instead of current_policy, i.e.

action(::ExplorationStrategy, on_policy_action, obs, rng)

We could also consider leaving out the on-policy action from the call altogether and say, if action returns nothing, use the on-policy action.
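A self-contained sketch of that "return nothing" convention (the `MaybeExplore` type and `explore` function are hypothetical names for illustration):

```julia
using Random

abstract type ExplorationStrategy end

# A strategy that explores with probability eps and otherwise returns
# nothing, signaling "use the on-policy action".
struct MaybeExplore <: ExplorationStrategy
    eps::Float64
    actions::Vector{Int}  # illustrative discrete action set
end

function explore(s::MaybeExplore, obs, rng::AbstractRNG)
    return rand(rng) < s.eps ? rand(rng, s.actions) : nothing
end

# Caller side: fall back to the on-policy action when nothing is returned.
s = MaybeExplore(0.1, [1, 2, 3])
a = explore(s, nothing, MersenneTwister(1))
if a === nothing
    a = 2  # placeholder for the on-policy action
end
```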

(note I might be saying some of the wrong words because I have less experience with RL)

@rejuvyesh
Member

We can't pass in just the action, because for certain exploration strategies, like softmax, one needs the Q values for all actions. Otherwise the idea sounds reasonable. Should that be another package then?

@zsunberg
Member

We can't pass in just the action, because for certain exploration strategies, like softmax, one needs the Q values for all actions.

Ah, I see - that makes sense

Should that be another package then?

Did you mean "Should that be in another package then?" or "Should we create another package?"

I think yes, it should be somewhere besides one of the learning packages, but I would hope to not create a new one. My philosophy on packages has changed a lot since we broke up POMDPToolbox. Now I think it would have been much better to create better documentation and perhaps use submodules than to have a bunch of small packages!

@MaximeBouton
Contributor Author

For now I think this could live in POMDPPolicies, I might take a shot at it next week.

@MaximeBouton
Contributor Author

Do we really want to have both action and action!?
I think it might be confusing, and I am not sure we really respect that convention now: if you have an MCTS policy, we modify internal fields of the policy object and still call action.

@MaximeBouton
Contributor Author

Suggestion:

abstract type AbstractSchedule end # define linear decay, exponential decay and more

# Concrete schedules implement this; it returns the new value of epsilon.
function update_value!(::AbstractSchedule, ::Real) end

mutable struct EpsGreedyPolicy{A} <: Policy
    eps::Real
    schedule::AbstractSchedule
    policy::Policy
    rng::AbstractRNG
    actions::Vector{A}
end

function action(p::EpsGreedyPolicy, s)
    p.eps = update_value!(p.schedule, p.eps) # update the value of epsilon according to the schedule
    if rand(p.rng) < p.eps
        return rand(p.rng, p.actions)
    else
        return action(p.policy, s)
    end
end
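A concrete schedule for this sketch could look like the following (LinearDecaySchedule and its fields are hypothetical names; it assumes update_value! returns the new epsilon rather than mutating it, since a Real is passed by value):

```julia
# Linear decay from `start` to `stop` over `steps` calls.
mutable struct LinearDecaySchedule <: AbstractSchedule
    start::Float64  # initial epsilon
    stop::Float64   # final epsilon
    steps::Int      # number of updates over which to decay
    count::Int      # how many updates have happened so far
end

# Return the new epsilon, linearly interpolated between start and stop.
function update_value!(s::LinearDecaySchedule, eps::Real)
    s.count += 1
    frac = min(s.count / s.steps, 1.0)
    return s.start + frac * (s.stop - s.start)
end
```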

@rejuvyesh
Member

Should update_value! have any restriction on the second argument?

@MaximeBouton
Copy link
Contributor Author

No! I will submit a proper PR next week, but this is just to give you an overview of the idea.
