## Our challenge: Automated Parking System

We consider the **parking-v0** task of the [highway-env](https://github.com/eleurent/highway-env) environment. It is a **goal-conditioned continuous control** task in which an agent **drives a car** by controlling the gas pedal and steering angle, and must **park in a given location** with the appropriate heading.

This MDP has several properties that justify using model-based methods:
* The policy/value function depends heavily on the goal, which adds significant complexity to a model-free learning process, whereas the dynamics are completely independent of the goal and can therefore be simpler to learn.
* In an industrial application, we can reasonably expect that, for safety reasons, the planned trajectory must be known in advance, before execution.

### Warming up
We start with a few useful installs and imports:

In [1]:
# Install environment and visualization dependencies 
!pip install highway-env
!pip install gym pyvirtualdisplay
!apt-get update
!apt-get install -y xvfb python-opengl ffmpeg

# Environment
import gym
import highway_env

# Models and computation
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from collections import namedtuple
# torch.set_default_tensor_type("torch.cuda.FloatTensor")

# Visualization
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
from tqdm.notebook import trange
from IPython import display as ipythondisplay
from pyvirtualdisplay import Display
from gym.wrappers import Monitor
import base64

# IO
from pathlib import Path


We also define a simple helper function for visualization of episodes:

In [2]:
display = Display(visible=0, size=(1400, 900))
display.start()

def show_video(path):
    html = []
    for mp4 in Path(path).glob("*.mp4"):
        video_b64 = base64.b64encode(mp4.read_bytes())
        html.append('''<video alt="{}" autoplay 
                      loop controls style="height: 400px;">
                      <source src="data:video/mp4;base64,{}" type="video/mp4" />
                 </video>'''.format(mp4, video_b64.decode('ascii')))
    ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

### Let's try it!

Make the environment, and run an episode with random actions:

In [3]:
env = gym.make("parking-v0")
env = Monitor(env, './video', force=True, video_callable=lambda episode: True)
env.reset()
done = False
while not done:
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
env.close()
show_video('./video')

The environment is a `GoalEnv`, which means the agent receives a dictionary containing both the current `observation` and the `desired_goal` that conditions its policy.

In [4]:
print("Observation format:", obs)

Observation format: {'observation': array([0.06141458, 0.02539331, 0.25540411, 0.1294416 , 0.89198387,
       0.45206722]), 'achieved_goal': array([0.06141458, 0.02539331, 0.25540411, 0.1294416 , 0.89198387,
       0.45206722]), 'desired_goal': array([ 2.600000e-01, -1.400000e-01,  0.000000e+00,  0.000000e+00,
        6.123234e-17, -1.000000e+00])}


There is also an `achieved_goal`, which is not useful here: it only matters when the state and goal spaces differ, in which case it is the projection of the observation onto the goal space.
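Because the goal lives in the same space as the observation in this task, the two entries of the dictionary can be compared element by element. The snippet below is a minimal sketch of that idea, using the values printed above. The helper name `goal_distance` and the choice of an unweighted L1 distance are assumptions made for illustration; this is not the reward function used by the environment.

```python
def goal_distance(obs):
    """Unweighted L1 distance between the current state and the desired goal.

    Hypothetical helper for illustration only: it relies on the state and
    goal spaces being identical, as they are in the observation printed above.
    """
    return sum(abs(float(s) - float(g))
               for s, g in zip(obs["observation"], obs["desired_goal"]))


# Values copied from the observation printed above
obs_example = {
    "observation": [0.06141458, 0.02539331, 0.25540411,
                    0.1294416, 0.89198387, 0.45206722],
    "desired_goal": [0.26, -0.14, 0.0, 0.0, 6.123234e-17, -1.0],
}
print("Distance to goal:", goal_distance(obs_example))
```

The function accepts plain lists as well as NumPy arrays, so it can be called directly on the `obs` dictionary returned by `env.step`.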

Alright! We are now ready to apply the model-based reinforcement learning paradigm.

## Experience collection
First, we interact randomly with the environment to produce a batch of experiences:

$$D = \{(s_t, a_t, s_{t+1})\}_{t\in[1,N]}$$

In [5]:
Transition = namedtuple('Transition', ['state', 'action', 'next_state'])

def collect_interaction_data(env, size=1000, action_repeat=2):
    data, done = [], True
    for _ in trange(size, desc="Collecting interaction data"):
        action = env.action_space.sample()
        for _ in range(action_repeat):
            previous_obs = env.reset() if done else obs
            obs, reward, done, info = env.step(action)
            data.append(Transition(torch.Tensor(previous_obs["observation"]),
                                   torch.Tensor(action),
                                   torch.Tensor(obs["observation"])))
    return data

data = collect_interaction_data(env)
print("Sample transition:", data[0])


Sample transition: Transition(state=tensor([ 0.0000,  0.0000,  0.0000, -0.0000,  0.7845, -0.6201]), action=tensor([-0.6474,  0.5144]), next_state=tensor([-3.8691e-04,  1.9127e-04, -1.0129e-01,  8.0659e-02,  7.8227e-01,
        -6.2294e-01]))
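The collected data is a list of per-step `Transition` tuples, whereas a learning step usually wants one column per field. As a rough sketch of that conversion, the hypothetical helper `batch_transitions` below transposes the list into a single `Transition` of columns. It uses plain Python lists and toy values so that it runs on its own; with the tensors collected above, each column would typically be wrapped in `torch.stack`.

```python
from collections import namedtuple

Transition = namedtuple("Transition", ["state", "action", "next_state"])


def batch_transitions(transitions):
    """Transpose a list of Transition tuples into one Transition of columns."""
    return Transition(*(list(column) for column in zip(*transitions)))


# Toy data standing in for the collected experiences (values are made up)
toy_data = [
    Transition(state=[0.0, 0.0], action=[0.5], next_state=[0.1, 0.0]),
    Transition(state=[0.1, 0.0], action=[-0.2], next_state=[0.1, 0.1]),
    Transition(state=[0.1, 0.1], action=[0.0], next_state=[0.2, 0.1]),
]
batch = batch_transitions(toy_data)
print("Batched states:", batch.state)
print("Batched actions:", batch.action)
```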


# Buffers

## ReplayBuffer
Note: there are two versions, `ReplayBuffer` and `DictReplayBuffer`, depending on whether the observations are plain arrays or dictionaries.

## HERBuffer
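
The training cell in the next section configures this buffer with `n_sampled_goal=4` and `goal_selection_strategy='future'`. The sketch below illustrates what that relabeling idea amounts to, under simplifying assumptions: for each transition, a few virtual copies are created whose desired goal is replaced by a goal actually achieved at the same or a later step of the same episode. The function `relabel_with_future_goals` and the toy episode are hypothetical; this is not the Stable-Baselines3 implementation, and reward recomputation is left out.

```python
import random


def relabel_with_future_goals(episode, n_sampled_goal=4, seed=0):
    """Create virtual transitions with goals sampled from the episode's future.

    `episode` is a list of dicts with 'achieved_goal' and 'desired_goal' keys.
    Simplified illustration: rewards are not recomputed here.
    """
    rng = random.Random(seed)
    relabeled = []
    for t, transition in enumerate(episode):
        for _ in range(n_sampled_goal):
            # Pick a step at or after t within the same episode
            future_t = rng.randrange(t, len(episode))
            virtual = dict(transition)  # shallow copy, original is untouched
            virtual["desired_goal"] = episode[future_t]["achieved_goal"]
            relabeled.append(virtual)
    return relabeled


# Toy one-dimensional episode: the achieved goal advances by 1 at every step
toy_episode = [{"achieved_goal": (float(t),), "desired_goal": (99.0,)}
               for t in range(5)]
virtual_transitions = relabel_with_future_goals(toy_episode)
print(len(virtual_transitions))  # 5 steps x 4 sampled goals each
```

Every virtual transition now targets a goal that the agent did reach, which gives the learner successful examples even when the original goal was never attained.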

# Training
Try with both buffers

Check DDPG, update to TD3  
TQC?

In [None]:
!pip install stable-baselines3
!pip install sb3-contrib

In [16]:
# Agent
from stable_baselines3 import HerReplayBuffer, SAC
from stable_baselines3 import DDPG
from stable_baselines3.common.buffers import ReplayBuffer, DictReplayBuffer
from sb3_contrib import TQC


env = gym.make("parking-v0")
her_kwargs = dict(n_sampled_goal=4, goal_selection_strategy='future', 
                  online_sampling=True, max_episode_length=100)

# Alternative to the DDPG agent below: TQC (which can itself be replaced with SAC)
# model = TQC('MultiInputPolicy', env, replay_buffer_class=HerReplayBuffer,
#             replay_buffer_kwargs=her_kwargs, verbose=1, buffer_size=int(1e6),
#             learning_rate=1e-3,
#             gamma=0.95, batch_size=1024, tau=0.05,
#             policy_kwargs=dict(net_arch=[512, 512, 512]))

model = DDPG(policy='MultiInputPolicy',
             env=env,
             verbose=1,
             replay_buffer_class=HerReplayBuffer,
             replay_buffer_kwargs=her_kwargs)

model.learn(int(5e4))

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | -67.3    |
|    success rate    | 0        |
| time/              |          |
|    episodes        | 4        |
|    fps             | 48       |
|    time_elapsed    | 8        |
|    total timesteps | 400      |
| train/             |          |
|    actor_loss      | 0.958    |
|    critic_loss     | 0.032    |
|    learning_rate   | 0.001    |
|    n_updates       | 200      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | -69      |
|    success rate    | 0        |
| time/              |          |
|    episodes        | 8        |
|    fps             | 40       |
|    time_elapsed    | 19       |
|    total timesteps | 800      |
| train/             |

<stable_baselines3.ddpg.ddpg.DDPG at 0x7f927f5eb390>

# Test the policy

In [18]:
# import os
# os.environ["SDL_VIDEODRIVER"] = "dummy"

In [20]:
env = gym.make("parking-v0")
env = Monitor(env, './video', force=True, video_callable=lambda episode: True)
for episode in trange(3, desc="Test episodes"):
    obs, done = env.reset(), False
    env.unwrapped.automatic_rendering_callback = env.video_recorder.capture_frame
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
env.close()
show_video('./video')
