
Exploration Policies #20

Merged
merged 10 commits into from Mar 19, 2020
Conversation

@MaximeBouton (Contributor) commented Feb 29, 2020

Introduces an interface for Exploration Policies:

See discussion here: JuliaPOMDP/TabularTDLearning.jl#10

  • abstract type ExplorationPolicy, an interface for sampling from an exploration policy: action(exploration_policy, on_policy, s)
  • abstract type ExplorationSchedule, an interface for updating the parameters of exploration policies via update_value(schedule, value); provides two schedules: LinearDecaySchedule and ConstantSchedule
  • bumps the version number to 0.3

Breaking changes:
EpsGreedyPolicy now requires an on_policy argument when sampling actions.
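
A minimal usage sketch of the proposed interface (here mdp, s, and on_policy are assumed to come from the surrounding solver code, and the EpsGreedyPolicy call shown is the default field-order constructor; any convenience constructor in the merged code may differ):

using POMDPs, POMDPPolicies, Random

schedule = LinearDecaySchedule(start_val=1.0, end_val=0.01, steps=10_000)  # anneal epsilon from 1.0 to 0.01
expl = EpsGreedyPolicy(1.0, schedule, MersenneTwister(1), actions(mdp))
a = action(expl, on_policy, s)  # breaking change: the on-policy is now passed explicitly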

TODOs:

  • write docs
  • update compat in TabularTDLearning.jl and DeepQLearning.jl
  • update TabularTDLearning.jl and DeepQLearning.jl

@coveralls commented Feb 29, 2020

Pull Request Test Coverage Report for Build 100

  • 14 of 15 (93.33%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+1.0%) to 90.265%

Changes Missing Coverage        Covered Lines   Changed/Added Lines   %
src/exploration_policies.jl     14              15                    93.33%

Totals Coverage Status
Change from base Build 94: 1.0%
Covered Lines: 204
Relevant Lines: 226

💛 - Coveralls

@rejuvyesh (Member)

Hmm. I'm not sure why ExplorationStrategy shouldn't just keep a reference to the on_policy in its struct?

@MaximeBouton (Contributor, Author)

I think construction might be an issue. The typical use case will be in DeepQLearning.jl or TabularTDLearning.jl where you would like to pass in the exploration policy but let the solver initialize the on_policy.

@rejuvyesh (Member) left a comment

Looks good to me other than that small change! Thanks for this!

src/exploration_policies.jl Outdated (resolved)
@MaximeBouton (Contributor, Author)

I set up the CompatHelper bot for DeepQLearning and TabularTDLearning. I won't update them right now; I am going to wait for the new version of POMDPPolicies to be merged, see whether it triggers the bot, and fix them then.

@zsunberg (Member) commented Mar 3, 2020

@MaximeBouton Sorry for the delay! I was traveling this weekend. I have some comments - will post them tomorrow. Thanks for adding this!

@zsunberg (Member) left a comment

Thanks for submitting this. I left a few inline comments, but the main thing is that we should consider a significant change:

Intuitively, when I think of a schedule, I think of a function from time (or, in this case, the number of calls to the policy) to a value, so I think we should make schedules just that: functions from n_steps to a value.

I also like the action(exp_policy, on_policy, s) interface, but I think we should add n_steps or just k as an explicit argument. This keeps everything static (the exploration policy isn't carrying any mutable state around) and explicit:

action(exp_policy, on_policy, k, s)

I don't see much downside to adding this additional argument if we are already adding the on_policy.

I have added inline comments showing how this would change the implementation of EpsGreedyPolicy and LinearDecaySchedule.

.travis.yml Outdated
@@ -2,7 +2,7 @@ language: julia

julia:
- 1.0
- 1.2
- 1.3
Member:

I think it should just be

- 1.0
- 1 # this will get the latest 1.x release

update_value(::ExplorationSchedule, value)
Returns an updated value according to the schedule.
"""
function update_value(::ExplorationSchedule, value) end
Member:

I think this should be

function update_value end

so the standard method error will be thrown.

Member:

Given my other comments, this function may cease to exist entirely though.

Comment on lines 19 to 37
"""
LinearDecaySchedule
A schedule that linearly decreases a value from `start_val` to `end_val` in `steps` steps.
if the value is greater or equal to `end_val`, it stays constant.

# Constructor

`LinearDecaySchedule(;start_val, end_val, steps)`
"""
@with_kw struct LinearDecaySchedule{R<:Real} <: ExplorationSchedule
start_val::R
end_val::R
steps::Int
end

function update_value(schedule::LinearDecaySchedule, value)
rate = (schedule.start_val - schedule.end_val) / schedule.steps
new_value = max(value - rate, schedule.end_val)
end
Member:

With my suggestions, this would be replaced with

@with_kw struct LinearDecaySchedule{R<:Real} <: Function
    start::R
    stop::R
    steps::Int
end

(schedule::LinearDecaySchedule)(k) = # code for interpolating 
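
One possible body for the interpolation, continuing from the struct above and matching the semantics of the original update_value (a sketch of the suggestion, not the code that was eventually merged):

function (schedule::LinearDecaySchedule)(k)
    # decay linearly per call, then clamp at `stop`
    rate = (schedule.start - schedule.stop) / schedule.steps
    return max(schedule.start - k*rate, schedule.stop)
end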

Comment on lines 68 to 73
mutable struct EpsGreedyPolicy{T<:Real, S<:ExplorationSchedule, R<:AbstractRNG, A} <: ExplorationPolicy
eps::T
schedule::S
rng::R
actions::A
end
Member:

If my suggestions are taken, this would be replaced with

Suggested change
mutable struct EpsGreedyPolicy{T<:Real, S<:ExplorationSchedule, R<:AbstractRNG, A} <: ExplorationPolicy
eps::T
schedule::S
rng::R
actions::A
end
struct EpsGreedyPolicy{T<:Union{Real,Function}, R<:AbstractRNG, A} <: ExplorationPolicy
eps::T
rng::R
actions::A
end

Then it could be constructed with e.g. EpsGreedyPolicy(m, LinearDecaySchedule(0.1, 0.01, 1000)) or EpsGreedyPolicy(m, 0.05) or EpsGreedyPolicy(m, k->0.05*0.9^(k/10)), all of which seem pretty clear.

Comment on lines 82 to 89
function POMDPs.action(p::EpsGreedyPolicy{T}, on_policy::Policy, s) where T<:Real
p.eps = update_value(p.schedule, p.eps)
if rand(p.rng) < p.eps
return rand(p.rng, p.actions)
else
return action(on_policy, s)
end
end
Member:

If my suggestions are taken, this would be

Suggested change
function POMDPs.action(p::EpsGreedyPolicy{T}, on_policy::Policy, s) where T<:Real
p.eps = update_value(p.schedule, p.eps)
if rand(p.rng) < p.eps
return rand(p.rng, p.actions)
else
return action(on_policy, s)
end
end
function POMDPs.action(p::EpsGreedyPolicy{T}, on_policy::Policy, k, s) where T<:Real
if rand(p.rng) < p.eps
return rand(p.rng, p.actions)
else
return action(on_policy, s)
end
end
function POMDPs.action(p::EpsGreedyPolicy{T}, on_policy::Policy, k, s) where T<:Function
if rand(p.rng) < p.eps(k)
return rand(p.rng, p.actions)
else
return action(on_policy, s)
end
end

@MaximeBouton (Contributor, Author)

@zsunberg I like your suggestions.
As I was trying to use this code in a project, I found that not having access to the time step was not very convenient.
Good call on the function-style schedule(k); I think it is nice.

@zsunberg (Member) commented Mar 4, 2020

I was thinking about this a little more. Instead of k as an argument, should we have it be a float/rational that represents the fraction of the way we are through training? i.e. schedule(0.75) should be the value of the parameter 3/4ths of the way through training.

I think this might lead to less confusion than using the number of steps because the number of steps is kind of ambiguous - is it the number of episodes or the number of (s, a, r, s') steps that have been taken?

Also, that would be fewer numbers that the user has to remember to change if they change the number of steps the algorithm uses.
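
For illustration, a schedule over the training fraction could just be an anonymous function (an example of the idea, not code from this PR):

eps = f -> 0.1*(1 - f) + 0.01*f   # anneal epsilon linearly from 0.1 down to 0.01 over training
eps(0.75)                         # epsilon three quarters of the way through training (0.0325)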

@MaximeBouton (Contributor, Author) commented Mar 11, 2020

The argument for using a fraction of training rather than a training step is valid (fewer parameters to change, less ambiguity).
My main concern is that it is not very standard and might confuse people more.

@MaximeBouton (Contributor, Author)

In my experiments with DQN I have always been logging the value of epsilon.
How would that work here? Would we need a function exploration_parameter(exploration_policy, k) that returns the parameter to log?

@MaximeBouton (Contributor, Author)

I implemented the changes suggested by @zsunberg.
Regarding the issue of logging epsilon or the temperature, the best I could think of right now is to have a function exploration_parameter(exploration_policy, k) that returns the scalar to log.

@zsunberg (Member)

Sorry for the delay - I am going to look over it this afternoon.

@zsunberg (Member) left a comment

Ok, I think this is almost good - see my two comments though.

I don't think I have any better ideas for logging than exploration_parameter, although maybe it should be more general, like info(p::ExplorationPolicy, k), and then it could return different variables in a namedtuple/dict for different policies, like loginfo(p::EpsGreedyPolicy, k) = (eps=p.eps(k),) and loginfo(p::SoftmaxPolicy, k) = (temperature=p.temperature(k),)?
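
As a rough sketch of how a solver's training loop could consume such a function (loginfo, expl_policy, and the surrounding loop variables are assumptions for illustration, not part of this PR):

eps_history = Float64[]
for k in 1:n_steps                           # n_steps, s, on_policy, expl_policy come from the solver
    a = action(expl_policy, on_policy, k, s)
    # ... take an environment step and update the learner ...
    info = loginfo(expl_policy, k)           # e.g. (eps = 0.05,) for an EpsGreedyPolicy
    push!(eps_history, info.eps)
end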

@@ -7,7 +7,7 @@ Abstract type for exploration schedule.
It is useful to define the schedule of a parameter of an exploration policy.
The effect of a schedule is defined by the `update_value` function.
"""
abstract type ExplorationSchedule end
abstract type ExplorationSchedule <: Function end
Member:

Do we need this abstract type? I don't see what purpose it serves and I am afraid someone will see it and think they need to use it. I think the schedule should just be a function, so people can write eps = k->max(0, 0.1*(10000-k)/10000) for instance.

Contributor Author:

Yep, not really needed here since we don't have an interface for schedules anymore.

@@ -7,7 +7,7 @@ Abstract type for exploration schedule.
It is useful to define the schedule of a parameter of an exploration policy.
The effect of a schedule is defined by the `update_value` function.
"""
abstract type ExplorationSchedule end
abstract type ExplorationSchedule <: Function end

"""
Member:

Aren't we getting rid of update_value?

Contributor Author:

I have not updated the docstrings yet ;)

@zsunberg (Member)

Sorry again for the additional delays!

@MaximeBouton (Contributor, Author)

Thanks Zach, I am going to go with loginfo returning a NamedTuple.

@MaximeBouton (Contributor, Author)

Let me know if you think it is good to merge now!

@zsunberg (Member)

OK, I think the code is good now!

One thing I have been thinking a lot about is how we can separate out the RL infrastructure from the POMDPs infrastructure. In my course, I really wanted to give the students an RLInterface environment without them knowing anything about the underlying MDP, but RLInterface and POMDPs are still highly coupled. This is a bigger issue and I don't think we have time to just solve it now, but it is something to think about. The relevance to this PR is that it seems like maybe this should go in a different place than POMDPPolicies, but I don't want to hold your other development back any more.

@zsunberg (Member) left a comment

Looks good code-wise. See my comment about bigger picture stuff.

@MaximeBouton (Contributor, Author)

Yes, I agree that there needs to be a discussion on separating POMDPs and RL; RLInterface is more of a wrapper around MDPs and POMDPs.
There is some effort going into developing ReinforcementLearning.jl, so maybe we should use that for RL...
Let's have this discussion somewhere else; Slack/issue/Discourse are all fine.

@MaximeBouton MaximeBouton merged commit 43cbb42 into master Mar 19, 2020