Transfer learning: RL training using loaded reward model #81
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master      #81      +/-   ##
==========================================
+ Coverage   80.54%   81.13%   +0.58%
==========================================
  Files          48       49       +1
  Lines        2832     2920      +88
==========================================
+ Hits         2281     2369      +88
  Misses        551      551
Continue to review full report at Codecov.
Thanks for the PR. Requested changes and made some API suggestions for rewards.serialize.
src/imitation/rewards/serialize.py (outdated)

assert shaped in ["True", "False"]
shaped = shaped == "True"

# TODO(adam): leaks session
(One dumb way to solve this would be to make load_reward into a contextmanager itself, which closes the session automatically on exit.)
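For concreteness, a minimal sketch of that idea, assuming TF1-style sessions; _build_reward_fn is a hypothetical helper standing in for the actual loading logic:

import contextlib
import tensorflow as tf

@contextlib.contextmanager
def load_reward(reward_type, reward_path, venv):
  # Hypothetical context-manager variant: yields the reward callable,
  # then closes the Session it owns when the with-block exits.
  session = tf.Session()  # the resource that currently leaks
  try:
    with session.as_default():
      reward_fn = _build_reward_fn(reward_type, reward_path, venv)  # hypothetical helper
    yield reward_fn
  finally:
    session.close()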
I'd be OK with this, but it feels like just punting the problem elsewhere. At the current callsite in scripts/expert_demos.py:105, it seems like most of the code needs the reward model. Probably we'd want to deallocate the session when the RewardVecEnvWrapper gets close(d)?
Agree that reward_fn and its Session should be closed once venv = RewardVecEnvWrapper(venv, reward_fn) is no longer used (nearly at the end of the function).
(RewardVecEnvWrapper wouldn't make a good contextmanager because it isn't guaranteed to hold a Session (and wouldn't have access to reward_fn's Session anyway).)
Conditionally setting up a context from reward_fn seems hard to do via a vanilla context + a no-op context switch.
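For illustration, the "vanilla context + no-op context" shape being discussed might look like the sketch below (assuming the hypothetical contextmanager variant of load_reward from above; contextlib.nullcontext requires Python 3.7+). Note that everything using venv then has to nest inside the with-block either way:

import contextlib

if reward_type is not None:
  reward_ctx = load_reward(reward_type, reward_path, venv)  # assumed contextmanager variant
else:
  reward_ctx = contextlib.nullcontext(None)  # no-op stand-in

with reward_ctx as reward_fn:
  if reward_fn is not None:
    venv = RewardVecEnvWrapper(venv, reward_fn)
  ...  # the rest of the function body must now live inside this block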
imitation/src/imitation/scripts/expert_demos.py
Lines 99 to 110 in 2d1f29f
venv = util.make_vec_env(env_name, num_vec, seed=_seed,
                         parallel=parallel, log_dir=log_dir)
if reward_type is not None:
  reward_fn = load_reward(reward_type, reward_path, venv)
  venv = RewardVecEnvWrapper(venv, reward_fn)
  tf.logging.info(
      f"Wrapped env in reward {reward_type} from {reward_path}.")
vec_normalize = None
if normalize:
  venv = vec_normalize = VecNormalize(venv)
So this might be a good use case for contextlib.ExitStack.
Something like:
with ExitStack() as exit_stack:
  ...
  if reward_type is not None:
    reward_fn, resources = load_reward(...)  # type: Tuple[RewardFn, List["Implements .close()"]]
    for resource in resources:
      exit_stack.push(contextlib.closing(resource))
    venv = RewardVecEnvWrapper(venv, reward_fn)
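For reference (not from the PR): ExitStack.push registers a context manager's __exit__ without entering it, so pushing contextlib.closing(resource) schedules resource.close() for when the stack unwinds; exit_stack.callback(resource.close) would be an equivalent spelling. A self-contained illustration:

import contextlib

class FakeResource:
  # Stand-in for anything implementing .close(), e.g. a tf.Session.
  def close(self):
    print("closed")

with contextlib.ExitStack() as exit_stack:
  resource = FakeResource()
  exit_stack.push(contextlib.closing(resource))
  # ... use the resource here ...
# "closed" is printed at this point, even if the with-body raised.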
Thanks! I'd never seen ExitStack. I'm going to address this in a separate PR since I want to change things for policies/serialize.py as well.
Python type annotations for
LGTM
This PR creates a new registry for reward function loaders, and adds support to expert_demos to load any reward function supported in this registry, wrapping the environment.

Specifically, we actually create two registries: one for RewardNet objects, and one for callables which take obs-act-obs triples and return rewards. The latter is a more general interface, and is what is needed by expert_demos. The former, RewardNet, is more constrained (it would exclude e.g. DiscrimNetGAIL) but could be useful in cases where one needs a TensorFlow reward model (e.g. for fine-tuning/further training).
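A minimal sketch of the callable-based registry shape described above; the names (reward_fn_registry, register_reward_fn) and exact signatures are assumptions for illustration, not the PR's actual API:

from typing import Callable, Dict
import numpy as np

# A reward callable over batches of obs-act-next_obs triples,
# returning one reward per transition.
RewardFn = Callable[[np.ndarray, np.ndarray, np.ndarray], np.ndarray]
# A loader turns a saved-model path (plus the venv) into a RewardFn.
RewardFnLoader = Callable[..., RewardFn]

reward_fn_registry: Dict[str, RewardFnLoader] = {}

def register_reward_fn(key: str, loader: RewardFnLoader) -> None:
  # Register a loader under a string key.
  reward_fn_registry[key] = loader

def load_reward(reward_type: str, reward_path: str, venv) -> RewardFn:
  # Look up and invoke the registered loader; this is what
  # expert_demos would call before wrapping the environment.
  return reward_fn_registry[reward_type](reward_path, venv)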