Reinforce.jl is an interface for Reinforcement Learning. It is intended to connect modular environments, policies, and solvers with a simple interface.
Packages which build on Reinforce:
- AtariAlgos: Environment which wraps Atari games using ArcadeLearningEnvironment
- OpenAIGym: Wrapper for OpenAI's python package: gym
New environments are created by subtyping `AbstractEnvironment` and implementing a few methods:

- `reset!(env) -> env`
- `actions(env, s) -> A`
- `step!(env, s, a) -> (r, s′)`
- `finished(env, s′) -> Bool`

and optional overrides:

- `state(env) -> s`
- `reward(env) -> r`

which map to `env.state` and `env.reward` respectively when unset.
- `ismdp(env) -> Bool`

An environment may be fully observable (MDP) or partially observable (POMDP). In the case of a partially observable environment, the state `s` is really an observation `o`. To maintain consistency, we call everything a state, and assume that an environment is free to maintain additional (unobserved) internal state. The `ismdp` query returns true when the environment is an MDP, and false otherwise.
- `maxsteps(env) -> Int`

The terminating condition of an episode is controlled by `maxsteps() || finished()`. Its default value is `0`, which indicates no step limit.

A minimal example environment for testing purposes is sketched below.
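The following sketch is illustrative only: the `CorridorEnv` type, its dynamics, and its reward scheme are assumptions made to demonstrate the interface and are not part of Reinforce. The agent starts at position 1 of a one-dimensional corridor and is rewarded for reaching position `N`:

```julia
using Reinforce

# A hypothetical 1-D corridor environment (not part of Reinforce):
# the agent starts at position 1 and must walk right to position N.
mutable struct CorridorEnv <: AbstractEnvironment
    N::Int            # goal position
    state::Int        # current position; the default state(env) returns env.state
    reward::Float64   # last reward;      the default reward(env) returns env.reward
end
CorridorEnv(N = 10) = CorridorEnv(N, 1, 0.0)

function Reinforce.reset!(env::CorridorEnv)
    env.state = 1
    env.reward = 0.0
    env
end

Reinforce.actions(env::CorridorEnv, s) = [-1, 1]   # step left or step right

function Reinforce.step!(env::CorridorEnv, s, a)
    env.state = clamp(s + a, 1, env.N)
    env.reward = env.state == env.N ? 1.0 : 0.0
    env.reward, env.state
end

Reinforce.finished(env::CorridorEnv, s′) = s′ == env.N

# optional overrides
Reinforce.ismdp(env::CorridorEnv) = true        # the position is fully observed
Reinforce.maxsteps(env::CorridorEnv) = 100      # cap an episode at 100 steps
```

Because the fields are named `state` and `reward`, the default `state(env)` and `reward(env)` methods need no custom overrides here.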
TODO: more details and examples
Agents/policies are created by subtyping `AbstractPolicy` and implementing `action`.
The built-in random policy is a short example:
```julia
struct RandomPolicy <: AbstractPolicy end
action(π::RandomPolicy, r, s, A) = rand(A)
```
Here `A` is the action space. The `action` method maps the last reward and current state to the next chosen action: `(r, s) -> a`.
An optional `reset!(π::AbstractPolicy) -> π` method can also be implemented; see the sketch below.
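As a sketch of how the two hooks fit together (the `StickyPolicy` type and its behaviour are invented for illustration, not part of Reinforce), a policy that commits to one random action per episode could look like:

```julia
using Reinforce

# Hypothetical policy that picks a random action at the start of each episode
# and repeats it until reset! is called (illustrative only).
mutable struct StickyPolicy <: AbstractPolicy
    a::Any   # action committed to for the current episode; nothing until chosen
end
StickyPolicy() = StickyPolicy(nothing)

function Reinforce.action(π::StickyPolicy, r, s, A)
    π.a === nothing && (π.a = rand(A))   # commit to a random action once
    π.a
end

function Reinforce.reset!(π::StickyPolicy)
    π.a = nothing   # forget the committed action between episodes
    π
end
```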
Iterate through episodes using the `Episode` iterator. A `(s, a, r, s′)` tuple is returned from each step of the episode:
```julia
ep = Episode(env, π)
for (s, a, r, s′) in ep
    # do some custom processing of the sars-tuple
end
R = ep.total_reward
T = ep.niter
```
There is also a convenience method, `run_episode`. The following is equivalent to the last example:
```julia
R = run_episode(env, π) do
    # anything you want... this section is called after each step
end
```
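Tying the pieces together, here is a hedged sketch that evaluates the built-in `RandomPolicy` on the hypothetical `CorridorEnv` from the environment example above; the episode count and use of `Statistics.mean` are illustrative assumptions:

```julia
using Reinforce, Statistics

# Evaluate the built-in RandomPolicy on the hypothetical CorridorEnv sketched earlier.
env = CorridorEnv()
π = RandomPolicy()

returns = map(1:100) do _
    reset!(env)                  # start each episode from the initial state
    run_episode(env, π) do
        # no per-step processing needed for this simple evaluation
    end
end
println("mean return over 100 episodes: ", mean(returns))
```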