Exploration Policies #20
Conversation
Pull Request Test Coverage Report for Build 100
💛 - Coveralls
Hmm. I'm not sure why
I think construction might be an issue. The typical use case will be in
Looks good to me other than that small change! Thanks for this!
I set up the compathelper bot for DeepQLearning and TabularTDLearning. I won't update them right now; I'm going to wait for the new version of POMDPPolicies to be merged, see if it triggers the bot, and fix it then.
@MaximeBouton Sorry for the delay! I was traveling this weekend. I have some comments - will post them tomorrow. Thanks for adding this!
Thanks for submitting this. I left a few inline comments, but the main thing is that we should consider a significant change:

Intuitively, when I think of a schedule, I think of a function from time (or, in this case, the number of calls to the policy) to a value, so I think we should make schedules just that: `Function`s from `n_steps` to a value.
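For concreteness, a schedule under this proposal could be a plain function of the step count. A minimal sketch (my illustration; the specific decay from 0.1 to 0.01 over 10,000 steps is arbitrary):

```julia
# a schedule is just a callable mapping the step count to a parameter value:
# epsilon decays linearly from 0.1 to 0.01 over the first 10,000 steps
eps_schedule(n_steps) = max(0.01, 0.1 - (0.1 - 0.01) * n_steps / 10_000)
```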
I also like the `action(exp_policy, on_policy, s)` interface, but I think we should add `n_steps` or just `k` as an explicit argument. This keeps everything static (the exploration policy isn't carrying any mutable state around) and explicit:

`action(exp_policy, on_policy, k, s)`

I don't see much downside to adding this additional argument if we are already adding the `on_policy`.
I have added inline comments showing how this would change the implementation of `EpsGreedyPolicy` and `LinearDecaySchedule`.
`.travis.yml` (Outdated)

```diff
@@ -2,7 +2,7 @@ language: julia
 julia:
-  - 1.0
+  - 1.2
   - 1.3
```
I think it should just be

```yaml
  - 1.0
  - 1 # this will get the latest 1.x release
```
`src/exploration_policies.jl` (Outdated)

```julia
"""
    update_value(::ExplorationSchedule, value)

Returns an updated value according to the schedule.
"""
function update_value(::ExplorationSchedule, value) end
```
I think this should be

```julia
function update_value end
```

so the standard `MethodError` will be thrown.
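For context, this works because an empty generic function has no methods, so any call falls through to Julia's built-in error. A quick sketch:

```julia
# declares the generic function without adding any methods
function update_value end

update_value(0.1)
# ERROR: MethodError: no method matching update_value(::Float64)
```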
Given my other comments, this function may cease to exist entirely though.
`src/exploration_policies.jl` (Outdated)

```julia
"""
    LinearDecaySchedule

A schedule that linearly decreases a value from `start_val` to `end_val` in `steps` steps.
Once the value reaches `end_val`, it stays constant.

# Constructor

`LinearDecaySchedule(;start_val, end_val, steps)`
"""
@with_kw struct LinearDecaySchedule{R<:Real} <: ExplorationSchedule
    start_val::R
    end_val::R
    steps::Int
end

function update_value(schedule::LinearDecaySchedule, value)
    rate = (schedule.start_val - schedule.end_val) / schedule.steps
    new_value = max(value - rate, schedule.end_val)
end
```
With my suggestions, this would be replaced with

```julia
@with_kw struct LinearDecaySchedule{R<:Real} <: Function
    start::R
    stop::R
    steps::Int
end

(schedule::LinearDecaySchedule)(k) = # code for interpolating
```
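One way to fill in that placeholder (a minimal sketch, assuming the value should decay linearly and then hold constant at `stop` after `steps` calls):

```julia
# linearly interpolate from `start` to `stop` over `steps` calls,
# then hold constant at `stop`
function (schedule::LinearDecaySchedule)(k)
    rate = (schedule.start - schedule.stop) / schedule.steps
    return max(schedule.start - k * rate, schedule.stop)
end
```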
`src/exploration_policies.jl` (Outdated)

```julia
mutable struct EpsGreedyPolicy{T<:Real, S<:ExplorationSchedule, R<:AbstractRNG, A} <: ExplorationPolicy
    eps::T
    schedule::S
    rng::R
    actions::A
end
```
If my suggestions are taken, this would be replaced with

```diff
-mutable struct EpsGreedyPolicy{T<:Real, S<:ExplorationSchedule, R<:AbstractRNG, A} <: ExplorationPolicy
-    eps::T
-    schedule::S
-    rng::R
-    actions::A
-end
+struct EpsGreedyPolicy{T<:Union{Real,Function}, R<:AbstractRNG, A} <: ExplorationPolicy
+    eps::T
+    rng::R
+    actions::A
+end
```
Then it could be constructed with e.g. `EpsGreedyPolicy(m, LinearDecaySchedule(0.1, 0.01, 1000))` or `EpsGreedyPolicy(m, 0.05)` or `EpsGreedyPolicy(m, k->0.05*0.9^(k/10))`, all of which seem pretty clear.
`src/exploration_policies.jl` (Outdated)

```julia
function POMDPs.action(p::EpsGreedyPolicy{T}, on_policy::Policy, s) where T<:Real
    p.eps = update_value(p.schedule, p.eps)
    if rand(p.rng) < p.eps
        return rand(p.rng, p.actions)
    else
        return action(on_policy, s)
    end
end
```
If my suggestions are taken, this would be

```diff
-function POMDPs.action(p::EpsGreedyPolicy{T}, on_policy::Policy, s) where T<:Real
-    p.eps = update_value(p.schedule, p.eps)
-    if rand(p.rng) < p.eps
-        return rand(p.rng, p.actions)
-    else
-        return action(on_policy, s)
-    end
-end
+function POMDPs.action(p::EpsGreedyPolicy{T}, on_policy::Policy, k, s) where T<:Real
+    if rand(p.rng) < p.eps
+        return rand(p.rng, p.actions)
+    else
+        return action(on_policy, s)
+    end
+end
+function POMDPs.action(p::EpsGreedyPolicy{T}, on_policy::Policy, k, s) where T<:Function
+    if rand(p.rng) < p.eps(k)
+        return rand(p.rng, p.actions)
+    else
+        return action(on_policy, s)
+    end
+end
```
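A usage sketch of the proposed `k` argument (my illustration; `m`, `on_policy`, `s`, and `n_steps` are assumed to be provided by the surrounding training code):

```julia
p = EpsGreedyPolicy(m, k -> 0.05 * 0.9^(k / 10))
for k in 1:n_steps
    # epsilon is computed from k, so the policy carries no mutable state
    a = action(p, on_policy, k, s)
    # ... step the environment and update on_policy ...
end
```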
@zsunberg I like your suggestions.
I was thinking about this a little more. Instead of the number of steps, what if `k` were the fraction of training completed? I think this might lead to less confusion than using the number of steps because the number of steps is kind of ambiguous - is it the number of episodes or the number of (s, a, r, s') steps that have been taken? Also, that would be fewer numbers that the user has to remember to change if they change the number of steps the algorithm uses.
The argument for using a fraction of training rather than a training step is valid (fewer parameters to change, less ambiguity).
In my experiments with DQN, I have always been logging the value of epsilon.
I implemented the changes suggested by @zsunberg.
Sorry for the delay - I am going to look over it this afternoon.
Ok, I think this is almost good - see my two comments though.

I don't think I have any better ideas for logging than `exploration_parameter`, although maybe it should be more general, like `info(p::ExplorationPolicy, k)`, and then it could have different variables returned in a namedtuple/dict for different policies, like `loginfo(p::EpsGreedyPolicy, k) = (eps=p.eps(k),)` and `loginfo(p::SoftmaxPolicy, k) = (temperature=p.temperature(k),)`?
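A sketch of how that logging hook might be used (my illustration; `SoftmaxPolicy` and its `temperature` field come from the comment above, not from this PR's code):

```julia
loginfo(p::EpsGreedyPolicy, k) = (eps = p.eps(k),)
loginfo(p::SoftmaxPolicy, k) = (temperature = p.temperature(k),)

# a solver can then log whatever the policy reports, without knowing its type
for (name, val) in pairs(loginfo(policy, k))
    @info "exploration" name val
end
```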
src/exploration_policies.jl
Outdated
@@ -7,7 +7,7 @@ Abstract type for exploration schedule. | |||
It is useful to define the schedule of a parameter of an exploration policy. | |||
The effect of a schedule is defined by the `update_value` function. | |||
""" | |||
abstract type ExplorationSchedule end | |||
abstract type ExplorationSchedule <: Function end |
Do we need this abstract type? I don't see what purpose it serves and I am afraid someone will see it and think they need to use it. I think the schedule should just be a function, so people can write `eps = k->max(0, 0.1*(10000-k)/10000)` for instance.
yep, not really needed here since we don't have an interface for schedules anymore.
Aren't we getting rid of `update_value`?
I have not updated the docstrings yet ;)
Sorry again for the additional delays!
Thanks Zach, I am going to go with the
Let me know if you think it is good to merge now!
OK, I think the code is good now! One thing I have been thinking a lot about is how we can separate out the RL infrastructure from the POMDPs infrastructure. In my course, I really wanted to give the students an RLInterface environment without them knowing anything about the underlying MDP, but RLInterface and POMDPs are still highly coupled. This is a bigger issue and I don't think we have time to solve it now, but it is something to think about. The relevance to this PR is that it seems like maybe this should go in a different place than POMDPPolicies, but I don't want to hold your other development back any more.
Looks good code-wise. See my comment about bigger picture stuff.
Yes, I agree that there needs to be a discussion on separating POMDP and RL; RLInterface is more of a wrapper around MDPs and POMDPs.
Introduces an interface for Exploration Policies. See discussion here: JuliaPOMDP/TabularTDLearning.jl#10

- `ExplorationPolicy`: interface for sampling from an exploration policy: `action(exploration_policy, on_policy, s)`
- `ExplorationSchedule`: interface for updating the parameters of exploration policies: `update_value(schedule, value)`; provides two schedules, `LinearDecaySchedule` and `ConstantSchedule`

Breaking changes:

- `EpsGreedyPolicy` needs an `on_policy` argument when sampling actions.

TODOs: