
Exploration Policies #20

Merged
merged 10 commits into from Mar 19, 2020
Conversation

@MaximeBouton (Contributor) commented Feb 29, 2020

Introduces an interface for Exploration Policies:

See discussion here: JuliaPOMDP/TabularTDLearning.jl#10

  • abstract type ExplorationPolicy, an interface for sampling from an exploration policy: action(exploration_policy, on_policy, s)
  • abstract type ExplorationSchedule, an interface for updating the parameters of exploration policies via update_value(schedule, value); provides two schedules: LinearDecaySchedule and ConstantSchedule
  • bumps the version number to 0.3

Breaking changes:
EpsGreedyPolicy now requires an on_policy argument when sampling actions.
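
A minimal usage sketch of the proposed interface (here mdp, s, and on_policy are assumed to come from the surrounding solver code, and the EpsGreedyPolicy call shown is the default field-order constructor; any convenience constructor in the merged code may differ):

using POMDPs, POMDPPolicies, Random

schedule = LinearDecaySchedule(start_val=1.0, end_val=0.01, steps=10_000)  # anneal epsilon from 1.0 to 0.01
expl = EpsGreedyPolicy(1.0, schedule, MersenneTwister(1), actions(mdp))
a = action(expl, on_policy, s)  # breaking change: the on-policy is now passed explicitly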

TODOs:

  • write docs
  • update compat in TabularTDLearning.jl and DeepQLearning.jl
  • update TabularTDLearning.jl and DeepQLearning.jl

@coveralls commented Feb 29, 2020

Pull Request Test Coverage Report for Build 100

  • 14 of 15 (93.33%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+1.0%) to 90.265%

Changes Missing Coverage        Covered Lines   Changed/Added Lines   %
src/exploration_policies.jl     14              15                    93.33%

Totals Coverage Status
Change from base Build 94: 1.0%
Covered Lines: 204
Relevant Lines: 226

💛 - Coveralls

@rejuvyesh (Member)

Hmm. I'm not sure why ExplorationStrategy shouldn't just keep a reference to the on_policy in its struct?

@MaximeBouton (Contributor, Author)

I think construction might be an issue. The typical use case will be in DeepQLearning.jl or TabularTDLearning.jl where you would like to pass in the exploration policy but let the solver initialize the on_policy.

@rejuvyesh (Member) left a comment

Looks good to me other than that small change! Thanks for this!

src/exploration_policies.jl Outdated (resolved)
@MaximeBouton (Contributor, Author)

I set up the CompatHelper bot for DeepQLearning and TabularTDLearning. I won't update them right now; I am going to wait for the new version of POMDPPolicies to be merged, see whether it triggers the bot, and fix them then.

@zsunberg (Member) commented Mar 3, 2020

@MaximeBouton Sorry for the delay! I was traveling this weekend. I have some comments - will post them tomorrow. Thanks for adding this!

@zsunberg (Member) left a comment

Thanks for submitting this. I left a few inline comments, but the main thing is that we should consider a significant change:

Intuitively, when I think of a schedule, I think of a function from time (or, in this case, the number of calls to the policy) to a value, so I think we should make schedules just that: functions from n_steps to a value.

I also like the action(exp_policy, on_policy, s) interface, but I think we should add n_steps or just k as an explicit argument. This keeps everything static (the exploration policy isn't carrying any mutable state around) and explicit:

action(exp_policy, on_policy, k, s)

I don't see much downside to adding this additional argument if we are already adding the on_policy.

I have added inline comments showing how this would change the implementation of EpsGreedyPolicy and LinearDecaySchedule.

.travis.yml Outdated
@@ -2,7 +2,7 @@ language: julia

julia:
- 1.0
- 1.2
- 1.3
Member:

I think it should just be

- 1.0
- 1 # this will get the latest 1.x release

update_value(::ExplorationSchedule, value)
Returns an updated value according to the schedule.
"""
function update_value(::ExplorationSchedule, value) end
Member:

I think this should be

function update_value end

so the standard method error will be thrown.

Member:

Given my other comments, this function may cease to exist entirely though.

Comment on lines 19 to 37
"""
LinearDecaySchedule
A schedule that linearly decreases a value from `start_val` to `end_val` in `steps` steps.
if the value is greater or equal to `end_val`, it stays constant.

# Constructor

`LinearDecaySchedule(;start_val, end_val, steps)`
"""
@with_kw struct LinearDecaySchedule{R<:Real} <: ExplorationSchedule
start_val::R
end_val::R
steps::Int
end

function update_value(schedule::LinearDecaySchedule, value)
rate = (schedule.start_val - schedule.end_val) / schedule.steps
new_value = max(value - rate, schedule.end_val)
end
Member:

With my suggestions, this would be replaced with

@with_kw struct LinearDecaySchedule{R<:Real} <: Function
    start::R
    stop::R
    steps::Int
end

(schedule::LinearDecaySchedule)(k) = # code for interpolating 
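
One possible body for the interpolation, continuing from the struct above and matching the semantics of the original update_value (a sketch of the suggestion, not the code that was eventually merged):

function (schedule::LinearDecaySchedule)(k)
    # decay linearly per call, then clamp at `stop`
    rate = (schedule.start - schedule.stop) / schedule.steps
    return max(schedule.start - k*rate, schedule.stop)
end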

Comment on lines 68 to 73
mutable struct EpsGreedyPolicy{T<:Real, S<:ExplorationSchedule, R<:AbstractRNG, A} <: ExplorationPolicy
eps::T
schedule::S
rng::R
actions::A
end
Member:

If my suggestions are taken, this would be replaced with

Suggested change
mutable struct EpsGreedyPolicy{T<:Real, S<:ExplorationSchedule, R<:AbstractRNG, A} <: ExplorationPolicy
eps::T
schedule::S
rng::R
actions::A
end
struct EpsGreedyPolicy{T<:Union{Real,Function}, R<:AbstractRNG, A} <: ExplorationPolicy
eps::T
rng::R
actions::A
end

Then it could be constructed with e.g. EpsGreedyPolicy(m, LinearDecaySchedule(0.1, 0.01, 1000)) or EpsGreedyPolicy(m, 0.05) or EpsGreedyPolicy(m, k->0.05*0.9^(k/10)), all of which seem pretty clear.

Comment on lines 82 to 89
function POMDPs.action(p::EpsGreedyPolicy{T}, on_policy::Policy, s) where T<:Real
p.eps = update_value(p.schedule, p.eps)
if rand(p.rng) < p.eps
return rand(p.rng, p.actions)
else
return action(on_policy, s)
end
end
Member:

If my suggestions are taken, this would be

Suggested change
function POMDPs.action(p::EpsGreedyPolicy{T}, on_policy::Policy, s) where T<:Real
p.eps = update_value(p.schedule, p.eps)
if rand(p.rng) < p.eps
return rand(p.rng, p.actions)
else
return action(on_policy, s)
end
end
function POMDPs.action(p::EpsGreedyPolicy{T}, on_policy::Policy, k, s) where T<:Real
if rand(p.rng) < p.eps
return rand(p.rng, p.actions)
else
return action(on_policy, s)
end
end
function POMDPs.action(p::EpsGreedyPolicy{T}, on_policy::Policy, k, s) where T<:Function
if rand(p.rng) < p.eps(k)
return rand(p.rng, p.actions)
else
return action(on_policy, s)
end
end

@MaximeBouton (Contributor, Author)

@zsunberg I like your suggestions.
As I was trying to use this code in a project, I found that not having access to the time step was not very convenient.
Good call on the function-style schedule(k); I think it is nice.

@zsunberg (Member) commented Mar 4, 2020

I was thinking about this a little more. Instead of k as an argument, should we have it be a float/rational that represents the fraction of the way we are through training? i.e. schedule(0.75) should be the value of the parameter 3/4ths of the way through training.

I think this might lead to less confusion than using the number of steps because the number of steps is kind of ambiguous - is it the number of episodes or the number of (s, a, r, s') steps that have been taken?

Also, that would be fewer numbers that the user has to remember to change if they change the number of steps the algorithm uses.
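
For illustration, a schedule over the training fraction could just be an anonymous function (an example of the idea, not code from this PR):

eps = f -> 0.1*(1 - f) + 0.01*f   # anneal epsilon linearly from 0.1 down to 0.01 over training
eps(0.75)                         # epsilon three quarters of the way through training (0.0325)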

@MaximeBouton (Contributor, Author) commented Mar 11, 2020

The argument for using a fraction of training rather than a training step is valid (fewer parameters to change, less ambiguity).
My main concern is that it is not very standard and might confuse people more.

@MaximeBouton (Contributor, Author)

In my experiments with DQN I have always been logging the value of epsilon.
How would that work here? Would we need a function exploration_parameter(exploration_policy, k) that returns the parameter to log?

@MaximeBouton (Contributor, Author)

I implemented the changes suggested by @zsunberg.
Regarding the issue of logging epsilon or the temperature, the best I could think of right now is to have a function exploration_parameter(exploration_policy, k) that returns the scalar to log.

@zsunberg (Member)

Sorry for the delay - I am going to look over it this afternoon.

@zsunberg (Member) left a comment

Ok, I think this is almost good - see my two comments though.

I don't think I have any better ideas for logging than exploration_parameter, although maybe it should be more general, like info(p::ExplorationPolicy, k), and then it could return different variables in a namedtuple/dict for different policies, like loginfo(p::EpsGreedyPolicy, k) = (eps=p.eps(k),) and loginfo(p::SoftmaxPolicy, k) = (temperature=p.temperature(k),)?
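
As a rough sketch of how a solver's training loop could consume such a function (loginfo, expl_policy, and the surrounding loop variables are assumptions for illustration, not part of this PR):

eps_history = Float64[]
for k in 1:n_steps                           # n_steps, s, on_policy, expl_policy come from the solver
    a = action(expl_policy, on_policy, k, s)
    # ... take an environment step and update the learner ...
    info = loginfo(expl_policy, k)           # e.g. (eps = 0.05,) for an EpsGreedyPolicy
    push!(eps_history, info.eps)
end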

@@ -7,7 +7,7 @@ Abstract type for exploration schedule.
It is useful to define the schedule of a parameter of an exploration policy.
The effect of a schedule is defined by the `update_value` function.
"""
abstract type ExplorationSchedule end
abstract type ExplorationSchedule <: Function end
Member:

Do we need this abstract type? I don't see what purpose it serves and I am afraid someone will see it and think they need to use it. I think the schedule should just be a function, so people can write eps = k->max(0, 0.1*(10000-k)/10000) for instance.

Contributor Author:

Yep, not really needed here since we don't have an interface for schedules anymore.

@@ -7,7 +7,7 @@ Abstract type for exploration schedule.
It is useful to define the schedule of a parameter of an exploration policy.
The effect of a schedule is defined by the `update_value` function.
"""
abstract type ExplorationSchedule end
abstract type ExplorationSchedule <: Function end

"""
Member:

Aren't we getting rid of update_value?

Contributor Author:

I have not updated the docstrings yet ;)

@zsunberg (Member)

Sorry again for the additional delays!

@MaximeBouton (Contributor, Author)

Thanks Zach, I am going to go with loginfo returning a NamedTuple.

@MaximeBouton (Contributor, Author)

Let me know if you think it is good to merge now!

@zsunberg (Member)

OK, I think the code is good now!

One thing I have been thinking a lot about is how we can separate out the RL infrastructure from the POMDPs infrastructure. In my course, I really wanted to give the students an RLInterface environment without them knowing anything about the underlying MDP, but RLInterface and POMDPs are still highly coupled. This is a bigger issue and I don't think we have time to just solve it now, but it is something to think about. The relevance to this PR is that it seems like maybe this should go in a different place than POMDPPolicies, but I don't want to hold your other development back any more.

@zsunberg (Member) left a comment

Looks good code-wise. See my comment about bigger picture stuff.

@MaximeBouton (Contributor, Author)

Yes, I agree that there needs to be a discussion on separating POMDPs and RL; RLInterface is more of a wrapper around MDPs and POMDPs.
There is some effort going into developing ReinforcementLearning.jl, so maybe we should use that for RL...
Let's have this discussion somewhere else; Slack/issue/Discourse are all fine.

@MaximeBouton MaximeBouton merged commit 43cbb42 into master Mar 19, 2020