Reinforcement learning environments for Torch7
Latest commit 14ed4ea Dec 15, 2016 @Kaixhin committed on GitHub Update installation instructions


Reinforcement learning environments for Torch7, inspired by RL-Glue [1] and conforming to the OpenAI Gym API [2]. Supported environments:

  • rlenvs.Acrobot [3]
  • rlenvs.Atari (Arcade Learning Environment)* [4]
  • rlenvs.Blackjack [5]
  • rlenvs.CartPole [6]
  • rlenvs.Catch [7]
  • rlenvs.CliffWalking [8]
  • rlenvs.DynaMaze [9]
  • rlenvs.GridWorld [10]
  • rlenvs.JacksCarRental [8]
  • rlenvs.Minecraft (Project Malmö)* [11]
  • rlenvs.MountainCar [12]
  • rlenvs.MultiArmedBandit [13, 14]
  • rlenvs.RandomWalk [15]
  • rlenvs.Taxi [16]
  • rlenvs.WindyWorld [8]
  • rlenvs.XOWorld [17]

Run th experiment.lua (or qlua experiment.lua) for a demo of a random agent playing Catch.

* Environments with other dependencies are installed only if those dependencies are available.


luarocks install

The old API can be installed with the v1 rockspec:

luarocks install

Atari Dependencies

luarocks install
luarocks install

Requires a supported Atari ROM to run.

Minecraft Dependencies

luarocks install luasocket

Requires Malmö (includes Minecraft), extracted with directory name MalmoPlatform. The shared libraries in its Torch_Examples directory should be added to LUA_CPATH, and the level schemas should be exported to MALMO_XSD_PATH. For example, if MalmoPlatform is in /home/username, add the following to the end of your ~/.bashrc:

export LUA_CPATH='/home/username/MalmoPlatform/Torch_Examples/?.so;'$LUA_CPATH
export MALMO_XSD_PATH=/home/username/MalmoPlatform

The Malmö client must be running for the environment to work.


To use an environment, require it and then create a new instance:

local MountainCar = require 'rlenvs.MountainCar'
local env = MountainCar()
local observation = env:start()


Note that the API is under development and may be subject to change.


A table of all environments available in rlenvs.

observation = env:start([opts])

Starts a new episode in the environment and returns the first observation. May take opts.
Note that environments must actually implement this as _start.

reward, observation, terminal, [actionTaken] = env:step(action)

Performs a step in the environment using action (which may be a list - see below), and returns the reward, the observation of the state transitioned to, and a terminal flag. Optionally provides actionTaken, if the environment provides supervision in the form of the actual action taken by the agent in spite of the provided action.
Note that environments must actually implement this as _step.
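The start/step cycle described above can be sketched as a simple episode loop. This is a minimal illustration only: makeCountdownEnv is a hypothetical stand-in environment (not part of rlenvs) that mimics the documented return order so the loop runs in plain Lua, and the 0-based action numbering is an assumption made for the example.

```lua
-- Hypothetical stand-in environment (not part of rlenvs) that follows the
-- documented call signatures, so the loop below runs without Torch7.
local function makeCountdownEnv(length)
  local env = {remaining = length}

  -- Begins an episode and returns the first observation
  function env:start()
    self.remaining = length
    return self.remaining
  end

  -- Returns reward, observation, terminal (in that order)
  function env:step(action)
    self.remaining = self.remaining - 1
    local terminal = self.remaining == 0
    local reward = terminal and 1 or 0
    return reward, self.remaining, terminal
  end

  return env
end

-- Runs one episode with a uniformly random policy over nActions actions
-- (actions are assumed to be numbered 0 .. nActions - 1 in this sketch)
local function runEpisode(env, nActions)
  local observation = env:start()
  local totalReward, steps, terminal = 0, 0, false
  while not terminal do
    local action = math.random(0, nActions - 1)
    local reward
    reward, observation, terminal = env:step(action)
    totalReward = totalReward + reward
    steps = steps + 1
  end
  return totalReward, steps
end

local totalReward, steps = runEpisode(makeCountdownEnv(5), 3)
print(totalReward, steps)
```

Note that the reward comes first in the returned values, so code ported from other toolkits may need its assignment order adjusted.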

stateSpace = env:getStateSpace()

Returns a state specification as a list with 3 elements:

Type     | Dimensionality                                              | Range
'int'    | 1 for a single value, or a table of dimensions for a Tensor | 2-element list with min and max values (inclusive)
'real'   | 1 for a single value, or a table of dimensions for a Tensor | 2-element list with min and max values (inclusive)
'string' | TODO                                                        | List of accepted strings

If several states are returned, stateSpace is itself a list of state specifications. Ranges may use nil if unknown.
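As an illustration of the three-element layout, the sketch below checks a single numeric value against a specification. conformsToSpec and the example spec are hypothetical, not part of rlenvs, and only the single-value case (dimensionality 1) is handled; Tensor-valued and string states are left out.

```lua
-- Checks a single numeric value against a {type, dimensionality, range} spec.
-- Only the scalar case (dimensionality 1) is handled in this sketch.
local function conformsToSpec(value, spec)
  local valueType, dim = spec[1], spec[2]
  local range = spec[3] or {}
  if dim ~= 1 or type(value) ~= 'number' then
    return false
  end
  if valueType == 'int' and value ~= math.floor(value) then
    return false
  end
  -- Bounds are inclusive; a nil bound means "unknown" and is not checked
  local min, max = range[1], range[2]
  if min ~= nil and value < min then return false end
  if max ~= nil and value > max then return false end
  return true
end

local spec = {'int', 1, {0, 4}}   -- hypothetical specification
print(conformsToSpec(3, spec))    -- true
print(conformsToSpec(7, spec))    -- false
```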

actionSpace = env:getActionSpace()

Returns an action specification, with the same structure as used for state specifications.
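One use of the action specification is to drive a random agent. The following is a sketch under the assumption that the action is a single value with a known range; sampleAction is a hypothetical helper, not an rlenvs function.

```lua
-- Samples a uniformly random single-valued action from an action specification
local function sampleAction(actionSpace)
  local actionType, dim, range = actionSpace[1], actionSpace[2], actionSpace[3]
  assert(dim == 1, 'this sketch only handles single-valued actions')
  local min, max = range[1], range[2]
  assert(min and max, 'cannot sample when the range is unknown')
  if actionType == 'int' then
    -- math.random(m, n) is inclusive at both ends
    return math.random(min, max)
  elseif actionType == 'real' then
    -- math.random() lies in [0, 1), so max itself is never returned
    return min + math.random() * (max - min)
  end
  error('unsupported action type: ' .. tostring(actionType))
end

-- Hypothetical specification with three discrete actions
local action = sampleAction({'int', 1, {0, 2}})
print(action)
```

In real use the specification would come from env:getActionSpace() rather than being written out by hand.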

minReward, maxReward = env:getRewardSpace()

Returns the minimum and maximum rewards produced by the environment. Values may be nil if unknown.
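Because either bound may be nil, code that uses the reward range should guard against missing values. A minimal sketch, where normaliseReward is a hypothetical helper and the bounds are written out by hand instead of being read from an environment:

```lua
-- Rescales a reward to [0, 1] when both bounds are known and distinct,
-- and otherwise returns it unchanged
local function normaliseReward(reward, minReward, maxReward)
  if minReward == nil or maxReward == nil or minReward == maxReward then
    return reward
  end
  return (reward - minReward) / (maxReward - minReward)
end

-- In real use: local minReward, maxReward = env:getRewardSpace()
print(normaliseReward(0, -1, 1))   -- 0.5
print(normaliseReward(3, nil, 10)) -- 3 (range unknown, so unchanged)
```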

The following are optional parts of the API.


Changes settings for a "training mode", analogous to neural network modules.


Changes settings for an "evaluation mode", analogous to neural network modules.

displaySpec = env:getDisplaySpec()

Returns an RGB display specification, with the same structure as used for state specifications. It therefore has the form {<int/real>, {3, <height>, <width>}, {<range>}}.

display = env:getDisplay()

Returns an RGB display tensor for visualising the state of the environment. Note that this may not be the same as the state provided for the agent.


Displays the environment using image. Requires the code to be run with qlua (rather than th) and getDisplay to be implemented by the environment.


Environments must inherit from Env and therefore implement the above methods (as well as a constructor). experiment.lua can be easily adapted for testing different environments. New environments should be added to rlenvs/init.lua, rocks/rlenvs-scm-1.rockspec, and be listed in this readme with an appropriate reference. For an example of a more complex environment that will only be installed if its optional dependencies are satisfied, see rlenvs/Atari.lua.
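The shape of a new environment can be sketched in plain Lua. The Env table below is a hypothetical stand-in for the library's base class, written with metatables so that the example runs without Torch7; it only illustrates the idea that the public start and step calls delegate to _start and _step. Corridor is an invented toy environment, and the real class machinery used by rlenvs is not shown here.

```lua
-- Hypothetical stand-in for the base class: public start/step delegate to
-- the _start/_step methods that a concrete environment supplies.
local Env = {}
Env.__index = Env

function Env:start(opts)
  return self:_start(opts)
end

function Env:step(action)
  return self:_step(action)
end

-- Toy environment: walk along positions 1..size; the episode ends at `size`
local Corridor = setmetatable({}, {__index = Env})
Corridor.__index = Corridor

function Corridor.new(size)
  return setmetatable({size = size, position = 1}, Corridor)
end

function Corridor:_start()
  self.position = 1
  return self.position
end

-- Action 0 moves left, action 1 moves right
function Corridor:_step(action)
  local delta = action == 1 and 1 or -1
  self.position = math.max(1, math.min(self.size, self.position + delta))
  local terminal = self.position == self.size
  local reward = terminal and 0 or -1
  return reward, self.position, terminal
end

function Corridor:getStateSpace()
  return {'int', 1, {1, self.size}}
end

function Corridor:getActionSpace()
  return {'int', 1, {0, 1}}
end

function Corridor:getRewardSpace()
  return -1, 0
end

local env = Corridor.new(3)
local observation = env:start()
local reward, nextObservation, terminal = env:step(1)
print(observation, reward, nextObservation, terminal)
```

For the conventions the library itself follows, including optional dependencies, rlenvs/Atari.lua remains the reference.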


[1] Tanner, B., & White, A. (2009). RL-Glue: Language-independent software for reinforcement-learning experiments. The Journal of Machine Learning Research, 10, 2133-2136.
[2] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
[3] DeJong, G., & Spong, M. W. (1994, June). Swinging up the acrobot: An example of intelligent control. In American Control Conference, 1994 (Vol. 2, pp. 2158-2162). IEEE.
[4] Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253-279.
[5] Pérez-Uribe, A., & Sanchez, E. (1998, May). Blackjack as a test bed for learning strategies in neural networks. In Neural Networks Proceedings, 1998. IEEE World Congress on Computational Intelligence. The 1998 IEEE International Joint Conference on (Vol. 3, pp. 2022-2027). IEEE.
[6] Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5), 834-846.
[7] Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. In Advances in Neural Information Processing Systems (pp. 2204-2212).
[8] Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction (Vol. 1, No. 1). Cambridge: MIT press.
[9] Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning (pp. 216-224).
[10] Boyan, J., & Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. Advances in Neural Information Processing Systems, 369-376.
[11] Johnson, M., Hofmann, K., Hutton, T., & Bignell, D. (2016). The Malmo platform for artificial intelligence experimentation. In International Joint Conference on Artificial Intelligence.
[12] Singh, S. P., & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1-3), 123-158.
[13] Robbins, H. (1985). Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers (pp. 169-177). Springer New York.
[14] Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. Journal of Applied probability, 287-298.
[15] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9-44.
[16] Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13, 227-303.
[17] Garnelo, M., Arulkumaran, K., & Shanahan, M. (2016). Towards Deep Symbolic Reinforcement Learning. In Workshop on Deep Reinforcement Learning, NIPS 2016.