### Disclaimer

Distribution authorized to U.S. Government agencies and their contractors. Other requests for this document shall be referred to the MIT Lincoln Laboratory Technology Office.

This material is based upon work supported by the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Under Secretary of Defense for Research and Engineering.

© 2019 Massachusetts Institute of Technology.

The software/firmware is provided to you on an As-Is basis.

Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.


### Treasure Hunt Challenge

This notebook demonstrates using [Stable Baselines](https://stable-baselines.readthedocs.io/en/master/) Proximal Policy Optimization to train a CNN-LSTM agent for the GOSEEK-Challenge. An agent must find as many treasures placed around a simulated environment as possible in the allotted time.

`tesse_gym` allows for interface customizations, some of which are demonstrated here. Specifically, this notebook contains an example of using combined RGB, segmentation, depth, and pose data as the agent's observation.

__Contents__
- [Configure Environment](#Configuration)
- [Define Model](#Define-the-Model)
- [Train Model](#Train-the-Model)
- [Visualize Results](#Visualize-Results)

In [2]:
from pathlib import Path

import numpy as np

from gym import spaces
from stable_baselines.common.policies import CnnLstmPolicy
from stable_baselines.common.vec_env import SubprocVecEnv, DummyVecEnv
from stable_baselines import PPO2

from tesse.msgs import *

from tesse_gym.tasks.goseek import GoSeekFullPerception
from tesse_gym import get_network_config

## Configuration

#### Set build path

In [None]:
filename = Path.home() / 'tess/builds/goseek/v0.0.2/goseek-0.0.2.x86_64'

#### Set environment parameters


In [None]:
n_environments = 2
total_timesteps = 5000000
scene_id = [1, 2, 4, 5]  # holdout scenes 3, 6 for validation
success_dist = 2
n_targets = [30, 30, 30, 30]
episode_length = 400
target_found_reward = 2
step_rate = 20


def make_unity_env(filename, num_env):
    """ Create a wrapped Unity environment. """

    def make_env(rank):
        def _thunk():
            env = GoSeekFullPerception(
                str(filename),
                network_config=get_network_config(worker_id=rank),
                n_targets=n_targets[rank],
                success_dist=success_dist,
                episode_length=episode_length,
                step_rate=step_rate,
                scene_id=scene_id[rank],
                target_found_reward=target_found_reward,
            )
            return env

        return _thunk

    return SubprocVecEnv([make_env(i) for i in range(num_env)])
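
Because `make_env` indexes `scene_id` and `n_targets` by worker rank, both lists need at least `n_environments` entries. A minimal sketch of a sanity check follows; `check_env_config` is a hypothetical helper for illustration and is not part of `tesse_gym`.

```python
def check_env_config(n_environments, scene_id, n_targets):
    """Raise ValueError if per-worker settings cannot be indexed by rank.

    Hypothetical helper for illustration; not part of tesse_gym.
    """
    for name, values in (("scene_id", scene_id), ("n_targets", n_targets)):
        if len(values) < n_environments:
            raise ValueError(
                f"`{name}` has {len(values)} entries but "
                f"{n_environments} environments were requested"
            )


# Values from the configuration cell: 2 workers, 4 candidate scenes.
check_env_config(2, [1, 2, 4, 5], [30, 30, 30, 30])
```

With `n_environments = 2`, only the first two entries of each list (scenes 1 and 2) are actually used.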

#### Launch environments

In [None]:
env = make_unity_env(filename, n_environments)

# Define the Model 
The following network assumes an observation of RGB, segmentation, and depth images along with the agent's relative pose. Images are processed using the Stable Baselines default CNN. The resulting feature vector is concatenated with the pose vector and fed into an LSTM (defined when we initialize PPO).
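
To get a feel for the tensor sizes involved, here is a rough worked calculation. It assumes the Stable Baselines `nature_cnn` consists of three unpadded convolutions (8x8 stride 4, 4x4 stride 2, 3x3 stride 1, the last with 64 filters) followed by a 512-unit dense layer. Those layer settings are an assumption and should be checked against the installed library version.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded ('VALID') convolution along one axis."""
    return (size - kernel) // stride + 1


# Assumed nature_cnn convolution layers as (kernel, stride) pairs.
layers = [(8, 4), (4, 2), (3, 1)]

h, w = 240, 320  # image height and width used in this notebook
for kernel, stride in layers:
    h, w = conv_out(h, kernel, stride), conv_out(w, kernel, stride)

flat = h * w * 64    # assumed 64 filters in the last convolution
cnn_features = 512   # assumed width of the dense layer after the convolutions
pose_dims = 3        # pose values appended to each observation
lstm_input = cnn_features + pose_dims

print(h, w, flat, lstm_input)  # 26 36 59904 515
```

Under these assumptions, the LSTM receives a 515-dimensional vector per step: 512 image features plus the 3 pose values.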

The OpenAI Gym [dictionary space](https://github.com/openai/gym/blob/master/gym/spaces/dict.py) is not supported, so we'll flatten the images into one vector and concatenate that with the pose.
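
The flattening step can be sketched with plain NumPy. The channel breakdown (3 RGB + 1 segmentation + 1 depth) is inferred from the description above and the 5-channel image shape used in this notebook, and `encode_observation` is a hypothetical helper, not the `tesse_gym` implementation.

```python
import numpy as np

H, W, C = 240, 320, 5  # assumed layout: RGB (3) + segmentation (1) + depth (1)
POSE_DIMS = 3


def encode_observation(imgs, pose):
    """Flatten each image stack and append the pose (hypothetical helper)."""
    n = imgs.shape[0]
    return np.concatenate((imgs.reshape(n, -1), pose), axis=-1)


imgs = np.random.rand(2, H, W, C).astype(np.float32)
pose = np.random.rand(2, POSE_DIMS).astype(np.float32)

obs = encode_observation(imgs, pose)
print(obs.shape)  # (2, 384003), i.e. 240 * 320 * 5 + 3 values per row

# Decoding reverses the two steps: slice off the pose, then reshape the rest.
decoded_imgs = obs[:, :-POSE_DIMS].reshape(-1, H, W, C)
decoded_pose = obs[:, -POSE_DIMS:]
```

The round trip is lossless, so the network can recover the image tensor from the flat vector before applying the CNN.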

In [None]:
import tensorflow as tf
from stable_baselines.common.policies import nature_cnn

In [None]:
def decode_observations(observation, img_shape=(-1, 240, 320, 5)):
    """ Decode observation vector into images and pose.
    
    Args:
        observation (np.ndarray or tf.Tensor): Batch of flattened observations with
            shape (N, H * W * C + 3); the last 3 values of each row are the pose.
        img_shape (Tuple[int]): Shape of the stacked images, (N, H, W, C).
    
    Returns:
        Tuple[np.ndarray, np.ndarray]: Images and pose tensors.
    """
    if isinstance(observation, np.ndarray):
        imgs = observation[:, :-3].reshape(img_shape)
    elif isinstance(observation, tf.Tensor):
        imgs = tf.reshape(observation[:, :-3], img_shape)
    else:
        raise ValueError(
            f"Expected type `np.ndarray` or `tf.Tensor`, got: {type(observation)}"
        )

    pose = observation[:, -3:]

    return imgs, pose

In [None]:
def image_and_pose_network(observation, **kwargs):
    """ Network to process image and pose data.
    
    Use the stable baselines nature_cnn to process images. The resulting
    feature vector is then combined with the pose estimate and given to an
    LSTM (LSTM not defined here).
    
    Args:
        observation (tf.Tensor): Batch of flattened observations containing
            image and pose data.
        
    Returns:
        tf.Tensor: Feature vector. 
    """
    imgs, pose = decode_observations(observation)
    image_features = nature_cnn(imgs)
    return tf.concat((image_features, pose), axis=-1)

In [None]:
policy_kwargs = {'cnn_extractor': image_and_pose_network}

In [None]:
model = PPO2(
    CnnLstmPolicy,
    env,
    verbose=1,
    tensorboard_log="./tensorboard/",
    nminibatches=2,
    gamma=0.995,
    learning_rate=0.00025,
    policy_kwargs=policy_kwargs,
)

# Train the Model

#### Define logging directory and callback function to save checkpoints
The callback saves an intermediate checkpoint every 1000 updates.
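
As a rough estimate of how many checkpoints to expect, the sketch below assumes PPO2 numbers its updates from 1 and performs `total_timesteps // (n_envs * n_steps)` of them, with the default `n_steps=128`. Both points are assumptions about the library's internals rather than something this notebook sets.

```python
n_envs = 2                   # n_environments in this notebook
n_steps = 128                # assumed PPO2 default rollout length per environment
total_timesteps = 5_000_000
save_every = 1000            # interval used by the checkpoint callback

n_batch = n_envs * n_steps               # timesteps consumed per update
n_updates = total_timesteps // n_batch   # 19531 under these assumptions

checkpoint_updates = [u for u in range(1, n_updates + 1) if u % save_every == 0]
names = [f"{u:09d}.pkl" for u in checkpoint_updates]

print(n_updates, len(names), names[0], names[-1])
# 19531 19 000001000.pkl 000019000.pkl
```

Changing `n_steps` or the number of environments changes the number of timesteps between checkpoints accordingly.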

In [None]:
log_dir = Path("results/goseek-ppo")
log_dir.mkdir(parents=True, exist_ok=True)

def save_checkpoint_callback(local_vars, global_vars):
    total_updates = local_vars["update"]
    if total_updates % 1000 == 0:
        local_vars["self"].save(str(log_dir / f"{total_updates:09d}.pkl"))

In [None]:
model.learn(total_timesteps=total_timesteps, callback=save_checkpoint_callback)

# Visualize Results

In [None]:
%matplotlib notebook
import matplotlib.pyplot as plt

In [None]:
MODEL_PATH = ''
assert MODEL_PATH, "Must give a model path!"

model = PPO2.load(str(MODEL_PATH))
n_train_envs = model.act_model.initial_state.shape[0]

In [None]:
obs = env.reset()
imgs, pose = decode_observations(obs)
lstm_state = None

assert (
    n_train_envs % obs.shape[0] == 0
), "The number of training environments must be a multiple of the number of visualization environments"

In [None]:
fig, ax = plt.subplots(1, 2)
ax[0].imshow(imgs[0, ..., :3])
ax[1].imshow(imgs[0, ..., 3])

In [None]:
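max_steps = episode_length  # visualize one full episode
n_repeat = n_train_envs // obs.shape[0]  # copies of each observation needed to fill the training batch
fig, ax = plt.subplots()

for i in range(max_steps):
    actions, lstm_state = model.predict(
        np.repeat(obs, n_repeat, 0),
        state=lstm_state,
        deterministic=False,
    )

    # np.repeat places the copies of each observation side by side,
    # so every `n_repeat`-th action belongs to a distinct environment
    action = actions[::n_repeat]
    obs, reward, done, _ = env.step(action)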

    plt.cla()
    imgs, pose = decode_observations(obs)

    # display RGB image
    ax.imshow((255 * imgs[0, ..., :3]).astype(np.uint8))
    fig.canvas.draw()

obs = env.reset()
imgs, pose = decode_observations(obs)
lstm_state = None
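
A recurrent PPO2 model expects as many observations at prediction time as it was trained with, which is why the loop above repeats each observation before calling `model.predict`. The NumPy-only sketch below uses a stand-in for the model's output and made-up environment counts to show how the repeated rows line up with the environments.

```python
import numpy as np

n_train_envs = 6                    # illustrative value, not from this notebook
obs = np.array([[10.0], [20.0]])    # two visualization environments
repeat = n_train_envs // obs.shape[0]

# np.repeat keeps copies of the same observation adjacent:
# rows 0-2 come from environment 0, rows 3-5 from environment 1.
tiled = np.repeat(obs, repeat, axis=0)

# Stand-in for model.predict: one "action" per input row.
actions = np.arange(tiled.shape[0])

# Taking every `repeat`-th entry yields one action per environment.
per_env = actions[::repeat]
print(tiled[:, 0], per_env)  # [10. 10. 10. 20. 20. 20.] [0 3]
```

The stride has to equal the repeat factor. It matches the number of visualization environments only when the two values happen to coincide.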