Transfer learning: RL training using loaded reward model #81
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master      #81      +/-   ##
==========================================
+ Coverage   80.54%   81.13%   +0.58%
==========================================
  Files          48       49       +1
  Lines        2832     2920      +88
==========================================
+ Hits         2281     2369      +88
  Misses        551      551
Continue to review full report at Codecov.
Thanks for the PR. Requested changes and made some API suggestions for rewards.serialize.
src/imitation/rewards/serialize.py (outdated)

assert shaped in ["True", "False"]
shaped = shaped == "True"

# TODO(adam): leaks session
(One dumb way to solve this would be to make load_reward into a contextmanager itself, which closes the session automatically on exit.)
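For concreteness, a minimal sketch of that idea, assuming TF1-style sessions; _build_reward_fn is a hypothetical helper standing in for the actual loading logic:

import contextlib
import tensorflow as tf

@contextlib.contextmanager
def load_reward(reward_type, reward_path, venv):
  # Hypothetical context-manager variant: yields the reward callable,
  # then closes the Session it owns when the with-block exits.
  session = tf.Session()  # the resource that currently leaks
  try:
    with session.as_default():
      reward_fn = _build_reward_fn(reward_type, reward_path, venv)  # hypothetical helper
    yield reward_fn
  finally:
    session.close()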
I'd be OK with this, but it feels like just punting the problem elsewhere. At the current callsite in scripts/expert_demos.py:105, it seems like most of the code needs the reward model. Probably we'd want to deallocate the session when the RewardVecEnvWrapper gets close(d)?
Agree that reward_fn and its Session should be closed once venv = RewardVecEnvWrapper(venv, reward_fn) is no longer used (nearly at the end of the function).
(RewardVecEnvWrapper wouldn't make a good contextmanager because it isn't guaranteed to hold a Session (and wouldn't have access to reward_fn's Session anyway).)
Conditionally setting up a context from reward_fn seems hard to do via a vanilla context + a no-op context switch.
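For illustration, the "vanilla context + no-op context" shape being discussed might look like the sketch below (assuming the hypothetical contextmanager variant of load_reward from above; contextlib.nullcontext requires Python 3.7+). Note that everything using venv then has to nest inside the with-block either way:

import contextlib

if reward_type is not None:
  reward_ctx = load_reward(reward_type, reward_path, venv)  # assumed contextmanager variant
else:
  reward_ctx = contextlib.nullcontext(None)  # no-op stand-in

with reward_ctx as reward_fn:
  if reward_fn is not None:
    venv = RewardVecEnvWrapper(venv, reward_fn)
  ...  # the rest of the function body must now live inside this block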
imitation/src/imitation/scripts/expert_demos.py
Lines 99 to 110 in 2d1f29f
venv = util.make_vec_env(env_name, num_vec, seed=_seed,
                         parallel=parallel, log_dir=log_dir)
if reward_type is not None:
  reward_fn = load_reward(reward_type, reward_path, venv)
  venv = RewardVecEnvWrapper(venv, reward_fn)
  tf.logging.info(
      f"Wrapped env in reward {reward_type} from {reward_path}.")
vec_normalize = None
if normalize:
  venv = vec_normalize = VecNormalize(venv)
So this might be a good use case for contextlib.ExitStack.
Something like:
with ExitStack() as exit_stack:
  ...
  if reward_type is not None:
    reward_fn, resources = load_reward(...)  # type: Tuple[RewardFn, List["Implements .close()"]]
    for resource in resources:
      exit_stack.push(contextlib.closing(resource))
    venv = RewardVecEnvWrapper(venv, reward_fn)
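For reference (not from the PR): ExitStack.push registers a context manager's __exit__ without entering it, so pushing contextlib.closing(resource) schedules resource.close() for when the stack unwinds; exit_stack.callback(resource.close) would be an equivalent spelling. A self-contained illustration:

import contextlib

class FakeResource:
  # Stand-in for anything implementing .close(), e.g. a tf.Session.
  def close(self):
    print("closed")

with contextlib.ExitStack() as exit_stack:
  resource = FakeResource()
  exit_stack.push(contextlib.closing(resource))
  # ... use the resource here ...
# "closed" is printed at this point, even if the with-body raised.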
Thanks! I'd never seen ExitStack. I'm going to address this in a separate PR since I want to change things for policies/serialize.py as well.
Python type annotations for
LGTM
This PR creates a new registry for reward function loaders, and adds support to expert_demos to load any reward function supported in this registry, wrapping the environment.

Specifically, we actually create two registries: one for RewardNet objects, and one for callables which take obs-act-obs triples and return rewards. The latter is a more general interface, and is what is needed by expert_demos. The former, RewardNet, is more constrained (it would exclude e.g. DiscrimNetGAIL) but could be useful in cases where one needs a TensorFlow reward model (e.g. for fine-tuning/further training).
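A minimal sketch of the callable-based registry shape described above; the names (reward_fn_registry, register_reward_fn) and exact signatures are assumptions for illustration, not the PR's actual API:

from typing import Callable, Dict
import numpy as np

# A reward callable over batches of obs-act-next_obs triples,
# returning one reward per transition.
RewardFn = Callable[[np.ndarray, np.ndarray, np.ndarray], np.ndarray]
# A loader turns a saved-model path (plus the venv) into a RewardFn.
RewardFnLoader = Callable[..., RewardFn]

reward_fn_registry: Dict[str, RewardFnLoader] = {}

def register_reward_fn(key: str, loader: RewardFnLoader) -> None:
  # Register a loader under a string key.
  reward_fn_registry[key] = loader

def load_reward(reward_type: str, reward_path: str, venv) -> RewardFn:
  # Look up and invoke the registered loader; this is what
  # expert_demos would call before wrapping the environment.
  return reward_fn_registry[reward_type](reward_path, venv)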