## Our challenge: Automated Parking System

We consider the **parking-v0** task of the [highway-env](https://github.com/eleurent/highway-env) environment. It is a **goal-conditioned continuous control** task where an agent **drives a car** by controlling the gas pedal and steering angle and must **park in a given location** with the appropriate heading.

This MDP has several properties which justify using model-based methods:
* The policy/value is highly dependent on the goal, which adds a significant level of complexity to a model-free learning process, whereas the dynamics are completely independent of the goal and hence can be simpler to learn.
* In an industrial application, we can reasonably expect that safety concerns will require the planned trajectory to be known in advance, before execution.

### Warming up
We start with a few useful installs and imports:

In [None]:
# Install environment and visualization dependencies 
!pip install highway-env
!pip install gym pyvirtualdisplay
!apt-get update
!apt-get install -y xvfb python-opengl ffmpeg

# Environment
import gym
import highway_env

# Models and computation
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from collections import namedtuple
# torch.set_default_tensor_type("torch.cuda.FloatTensor")

# Visualization
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
from tqdm.notebook import trange
from IPython import display as ipythondisplay
from pyvirtualdisplay import Display
from gym.wrappers import Monitor
import base64

# IO
from pathlib import Path

We also define a simple helper function for visualization of episodes:

In [None]:
display = Display(visible=0, size=(1400, 900))
display.start()

def show_video(path):
    html = []
    for mp4 in Path(path).glob("*.mp4"):
        video_b64 = base64.b64encode(mp4.read_bytes())
        html.append('''<video alt="{}" autoplay 
                      loop controls style="height: 400px;">
                      <source src="data:video/mp4;base64,{}" type="video/mp4" />
                 </video>'''.format(mp4, video_b64.decode('ascii')))
    ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

### Let's try it!

Make the environment, and run an episode with random actions:

In [None]:
env = gym.make("parking-v0")
env = Monitor(env, './video', force=True, video_callable=lambda episode: True)
env.reset()
done = False
while not done:
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
env.close()
show_video('./video')

The environment is a `GoalEnv`, which means the agent receives a dictionary containing both the current `observation` and the `desired_goal` that conditions its policy.

In [None]:
print("Observation format:", obs)

In [None]:
env.observation_space

In [None]:
env.action_space

There is also an `achieved_goal`, a projection of the observation onto the goal space. It matters mainly when the state and goal spaces differ, and the HER buffer defined later in this notebook reads it when relabeling goals.
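
To make the role of the dictionary keys concrete, here is a minimal sketch of a goal-conditioned reward computed from `achieved_goal` and `desired_goal`. The weighted-distance form, the weights, the exponent `p` and the helper name `goal_reward` are illustrative assumptions, not values read from highway-env.

```python
import numpy as np


def goal_reward(achieved_goal, desired_goal, weights, p=0.5):
    """Hypothetical reward: negative weighted L1 distance to the goal, raised to the power p."""
    achieved_goal = np.asarray(achieved_goal, dtype=float)
    desired_goal = np.asarray(desired_goal, dtype=float)
    distance = np.dot(np.abs(achieved_goal - desired_goal), weights)
    return -np.power(distance, p)


# Toy 2-D goal (x, y); the real goal space may contain more features.
weights = np.array([1.0, 1.0])
toy_obs = {
    "observation": np.array([0.3, 0.4]),
    "achieved_goal": np.array([0.3, 0.4]),
    "desired_goal": np.array([0.0, 0.0]),
}

r_far = goal_reward(toy_obs["achieved_goal"], toy_obs["desired_goal"], weights)
r_at_goal = goal_reward(toy_obs["desired_goal"], toy_obs["desired_goal"], weights)
```

Because the reward depends only on the pair (achieved goal, desired goal), it can be recomputed for any substituted goal, which is the property goal relabeling relies on.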

# Modeling
Train the agent with both replay buffers defined below: the plain dict replay buffer and the HER buffer.

In [None]:
!pip install stable-baselines3
!pip install sb3-contrib

## Buffers

### Dict replay buffer

In [None]:
from typing import Optional
from stable_baselines3.common.vec_env import VecEnv, VecNormalize

from actor_critic_highway.dict_buffer import DictReplayBufferBase, DictReplayBufferSamples


class DictReplayBuffer(DictReplayBufferBase):

    def sample(self, batch_size: int, env: Optional[VecNormalize] = None) -> DictReplayBufferSamples:
        """
        Sample elements from the replay buffer.
        :param batch_size: Number of elements to sample
        :param env: associated gym VecEnv
            to normalize the observations/rewards when sampling
        :return:
        """
        ### YOUR CODE HERE ###
        batch_inds = []  # <-- Get random indexes
        ######################
        return self._get_samples(batch_inds, env=env)
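
The gap above asks for random indexes. As a minimal standalone sketch, uniform sampling over the filled part of a circular buffer could look as follows. The names `pos`, `full` and `buffer_size` mirror attributes that Stable-Baselines3 buffers expose; whether `DictReplayBufferBase` tracks them the same way is an assumption, and `sample_batch_indices` is a hypothetical helper.

```python
import numpy as np


def sample_batch_indices(pos, buffer_size, full, batch_size, rng):
    """Draw uniform random indices over the part of a circular buffer that holds data."""
    # Once the buffer has wrapped around, every slot is valid;
    # before that, only slots [0, pos) have been written.
    upper_bound = buffer_size if full else pos
    if upper_bound == 0:
        raise ValueError("Cannot sample from an empty buffer")
    return rng.integers(0, upper_bound, size=batch_size)


rng = np.random.default_rng(0)
batch_inds = sample_batch_indices(pos=50, buffer_size=1000, full=False, batch_size=8, rng=rng)
```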

### HER

In [None]:
from actor_critic_highway.her_buffers import HerReplayBufferBase


# For convenience:
# that way, we can use a string to select a strategy
GOAL_STRATEGY_MAPPING = {
    "future": 0,
    "final": 1,
    "episode": 2,
}


class HerReplayBuffer(HerReplayBufferBase):

    def sample(
            self,
            batch_size: int,
            env: Optional[VecNormalize],
    ) -> DictReplayBufferSamples:
        """
        Sample function for online sampling of HER transition,
        this replaces the "regular" replay buffer ``sample()``
        method in the ``train()`` function.
        :param batch_size: Number of elements to sample
        :param env: Associated gym VecEnv
            to normalize the observations/rewards when sampling
        :return: Samples.
        """
        # TODO: for student
        # If online sampling, use self._sample_transitions();
        # if offline, use self.replay_buffer.sample():
        return None

    def _sample_her_transitions(self) -> None:
        """
        Sample additional goals and store new transitions in replay buffer
        when using offline sampling.
        """

        # Sample goals to create virtual transitions for the last episode.
        # TODO: for student
        observations, next_observations, actions, rewards = self._sample_offline(n_sampled_goal=self.n_sampled_goal)

        if len(observations) > 0:
            for i in range(len(observations["observation"])):
                ### YOUR CODE HERE ###
                # Store virtual transitions in the replay buffer
                # use self.replay_buffer.add() method:
                pass
                ######################

    def sample_goals(
            self,
            episode_indices: np.ndarray,
            her_indices: np.ndarray,
            transitions_indices: np.ndarray,
    ) -> np.ndarray:
        """
        Sample goals based on goal_selection_strategy.
        This is a vectorized (fast) version.
        :param episode_indices: Episode indices to use.
        :param her_indices: HER indices.
        :param transitions_indices: Transition indices to use.
        :return: Return sampled goals.
        """
        her_episode_indices = episode_indices[her_indices]

        if self.goal_selection_strategy == 1:
            ### YOUR CODE HERE ###
            transitions_indices = []  # <-- Get index of the final state of the current episode
            ######################

        elif self.goal_selection_strategy == 0:
            ### YOUR CODE HERE ###
            transitions_indices = []  # <-- Get index of a random state that happened after the current one
            ######################

        elif self.goal_selection_strategy == 2:
            ### YOUR CODE HERE ###
            transitions_indices = []  # <-- Get random index of the current episode
            ######################

        else:
            raise ValueError(f"Strategy {self.goal_selection_strategy} for sampling goals not supported!")

        return self._buffer["achieved_goal"][her_episode_indices, transitions_indices]
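
The three goal selection strategies differ only in which time step of the episode supplies the substitute goal. The sketch below illustrates that index arithmetic on toy arrays. It is a standalone illustration, not the buffer's implementation: `sample_goal_indices` is a hypothetical helper, and it assumes that for `"future"` the current step is never the last step of its episode.

```python
import numpy as np


def sample_goal_indices(strategy, episode_lengths, transition_indices, rng):
    """Return, for each transition, the time step whose achieved goal becomes the new goal."""
    episode_lengths = np.asarray(episode_lengths)
    transition_indices = np.asarray(transition_indices)
    if strategy == "final":
        # Last step of the episode each transition belongs to
        return episode_lengths - 1
    if strategy == "future":
        # Uniform over the steps strictly after the current one (upper bound is exclusive)
        return rng.integers(transition_indices + 1, episode_lengths)
    if strategy == "episode":
        # Uniform over all steps of the episode
        return rng.integers(0, episode_lengths)
    raise ValueError(f"Unknown strategy: {strategy}")


rng = np.random.default_rng(0)
lengths = np.array([5, 5, 3])   # length of the episode of each sampled transition
current = np.array([0, 3, 1])   # time step of each sampled transition

final_idx = sample_goal_indices("final", lengths, current, rng)
future_idx = sample_goal_indices("future", lengths, current, rng)
episode_idx = sample_goal_indices("episode", lengths, current, rng)
```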


## Models

### TD3

In [None]:
from typing import Optional

import torch as th  # `th` is used inside train() below
from actor_critic_highway.td3 import TD3_base
from stable_baselines3.common.utils import polyak_update
from stable_baselines3.common.vec_env import VecEnv


class TD3(TD3_base):

    def train(self, gradient_steps: int, batch_size: int = 100) -> None:
        # Switch to train mode (this affects batch norm / dropout)
        self.policy.set_training_mode(True)

        # Update learning rate according to lr schedule
        self._update_learning_rate([self.actor.optimizer, self.critic.optimizer])

        actor_losses, critic_losses = [], []

        for _ in range(gradient_steps):

            self._n_updates += 1
            # Sample replay buffer
            replay_data = self.replay_buffer.sample(batch_size, env=self._vec_normalize_env)

            with th.no_grad():
                # Select action according to policy and add clipped noise:

                # 1. Generate noise with shape == replay_data.actions.shape
                # 2. Clip noise with a constant self.target_noise_clip
                # 3. Add noise to the action tensor:
                # YOUR CODE HERE:
                #################
                noise = None
                next_actions = None

                # 1. Get next Q values from critic network (min over all critics targets)
                # 2. Update Q values for target network using specific formula
                # YOUR CODE HERE:
                #################
                pass  # remove after completing gaps

            # Get current Q-values estimates for each critic network
            current_q_values = self.critic(replay_data.observations, replay_data.actions)

            # 1. Compute MSE loss between current and target Q values:
            # YOUR CODE HERE:
            #################
            critic_loss = None

            critic_losses.append(critic_loss.item())

            # Optimize the critics
            self.critic.optimizer.zero_grad()
            critic_loss.backward()
            self.critic.optimizer.step()

            # Delayed policy updates
            if self._n_updates % self.policy_delay == 0:

                # 1. Compute actor loss using critic network:
                # YOUR CODE HERE:
                #################
                actor_loss: th.Tensor = 0  # Replace it
                # Keep track of the actor loss so that it gets logged below
                actor_losses.append(actor_loss.item())

                # Optimize the actor
                self.actor.optimizer.zero_grad()
                actor_loss.backward()
                self.actor.optimizer.step()

                polyak_update(self.critic.parameters(), self.critic_target.parameters(), self.tau)
                polyak_update(self.actor.parameters(), self.actor_target.parameters(), self.tau)

        self.logger.record("train/n_updates", self._n_updates, exclude="tensorboard")
        if len(actor_losses) > 0:
            self.logger.record("train/actor_loss", np.mean(actor_losses))
        self.logger.record("train/critic_loss", np.mean(critic_losses))
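
The two gaps inside the `no_grad` block correspond to target policy smoothing and the clipped double-Q target. The following NumPy sketch shows the arithmetic on toy arrays. It is an illustration under stated assumptions rather than the expected solution: the helper names, the action bounds of [-1, 1] and the `(n_critics, batch)` layout are hypothetical, and the notebook itself works with torch tensors.

```python
import numpy as np


def smoothed_target_actions(target_actions, noise_std, noise_clip, rng, low=-1.0, high=1.0):
    """Add clipped Gaussian noise to the target policy's actions, then clip to the action bounds."""
    noise = rng.normal(0.0, noise_std, size=target_actions.shape)
    noise = np.clip(noise, -noise_clip, noise_clip)
    return np.clip(target_actions + noise, low, high)


def td3_target(rewards, dones, next_q_per_critic, gamma):
    """Clipped double-Q target: r + (1 - done) * gamma * min_i Q_i(s', a')."""
    min_next_q = np.min(next_q_per_critic, axis=0)
    return rewards + (1.0 - dones) * gamma * min_next_q


rng = np.random.default_rng(0)
next_actions = smoothed_target_actions(np.zeros((4, 2)), noise_std=0.2, noise_clip=0.5, rng=rng)

rewards = np.array([-1.0, -0.5])
dones = np.array([0.0, 1.0])
next_q = np.array([[2.0, 3.0],    # target critic 1
                   [1.0, 4.0]])   # target critic 2
targets = td3_target(rewards, dones, next_q, gamma=0.99)
```

With these targets, the critic loss is typically the sum over critics of the mean squared error between each critic's estimate and the target, and the actor loss is typically the negated mean of the first critic's value at the actor's own action.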

### DDPG

In [None]:
from typing import Any, Dict, Optional, Tuple, Type, Union

from stable_baselines3.td3.policies import TD3Policy
import torch as th

from stable_baselines3.common.noise import ActionNoise
from stable_baselines3.common.type_aliases import GymEnv, MaybeCallback, Schedule


class DDPG(TD3):
    """
    Deep Deterministic Policy Gradient (DDPG).
    Deterministic Policy Gradient: http://proceedings.mlr.press/v32/silver14.pdf
    DDPG Paper: https://arxiv.org/abs/1509.02971
    Introduction to DDPG: https://spinningup.openai.com/en/latest/algorithms/ddpg.html
    Note: we treat DDPG as a special case of its successor TD3.
    :param policy: The policy model to use (MlpPolicy, CnnPolicy, ...)
    :param env: The environment to learn from (if registered in Gym, can be str)
    :param learning_rate: learning rate for adam optimizer,
        the same learning rate will be used for all networks (Q-Values, Actor and Value function)
        it can be a function of the current progress remaining (from 1 to 0)
    :param buffer_size: size of the replay buffer
    :param learning_starts: how many steps of the model to collect transitions for before learning starts
    :param batch_size: Minibatch size for each gradient update
    :param tau: the soft update coefficient ("Polyak update", between 0 and 1)
    :param gamma: the discount factor
    :param train_freq: Update the model every ``train_freq`` steps. Alternatively pass a tuple of frequency and unit
        like ``(5, "step")`` or ``(2, "episode")``.
    :param gradient_steps: How many gradient steps to do after each rollout (see ``train_freq``)
        Set to ``-1`` means to do as many gradient steps as steps done in the environment
        during the rollout.
    :param action_noise: the action noise type (None by default), this can help
        for hard exploration problem. Cf common.noise for the different action noise type.
    :param replay_buffer_class: Replay buffer class to use (for instance ``HerReplayBuffer``).
        If ``None``, it will be automatically selected.
    :param replay_buffer_kwargs: Keyword arguments to pass to the replay buffer on creation.
    :param optimize_memory_usage: Enable a memory efficient variant of the replay buffer
        at a cost of more complexity.
        See https://github.com/DLR-RM/stable-baselines3/issues/37#issuecomment-637501195
    :param create_eval_env: Whether to create a second environment that will be
        used for evaluating the agent periodically. (Only available when passing string for the environment)
    :param policy_kwargs: additional arguments to be passed to the policy on creation
    :param verbose: the verbosity level: 0 no output, 1 info, 2 debug
    :param seed: Seed for the pseudo random generators
    :param device: Device (cpu, cuda, ...) on which the code should be run.
        Setting it to auto, the code will be run on the GPU if possible.
    :param _init_setup_model: Whether or not to build the network at the creation of the instance
    """

    def __init__(
        self,
        policy: Union[str, Type[TD3Policy]],
        env: Union[GymEnv, str],
        learning_rate: Union[float, Schedule] = 1e-3,
        buffer_size: int = 1000000,  # 1e6
        learning_starts: int = 100,
        batch_size: int = 100,
        tau: float = 0.005,
        gamma: float = 0.99,
        train_freq: Union[int, Tuple[int, str]] = (1, "episode"),
        gradient_steps: int = -1,
        action_noise: Optional[ActionNoise] = None,
        replay_buffer_class: Optional[DictReplayBufferBase] = None,
        replay_buffer_kwargs: Optional[Dict[str, Any]] = None,
        optimize_memory_usage: bool = False,
        tensorboard_log: Optional[str] = None,
        create_eval_env: bool = False,
        policy_kwargs: Optional[Dict[str, Any]] = None,
        verbose: int = 0,
        seed: Optional[int] = None,
        device: Union[th.device, str] = "auto",
        _init_setup_model: bool = True,
    ):

        super(DDPG, self).__init__(
            policy=policy,
            env=env,
            learning_rate=learning_rate,
            buffer_size=buffer_size,
            learning_starts=learning_starts,
            batch_size=batch_size,
            tau=tau,
            gamma=gamma,
            train_freq=train_freq,
            gradient_steps=gradient_steps,
            action_noise=action_noise,
            replay_buffer_class=replay_buffer_class,
            replay_buffer_kwargs=replay_buffer_kwargs,
            policy_kwargs=policy_kwargs,
            tensorboard_log=tensorboard_log,
            verbose=verbose,
            device=device,
            create_eval_env=create_eval_env,
            seed=seed,
            optimize_memory_usage=optimize_memory_usage,
            # Remove all tricks from TD3 to obtain DDPG:
            # we still need to specify target_policy_noise > 0 to avoid errors
            policy_delay=1,
            target_noise_clip=0.0,
            target_policy_noise=0.1,
            _init_setup_model=False,
        )

        # Use only one critic
        if "n_critics" not in self.policy_kwargs:
            self.policy_kwargs["n_critics"] = 1

        if _init_setup_model:
            self._setup_model()

    def learn(
        self,
        total_timesteps: int,
        callback: MaybeCallback = None,
        log_interval: int = 4,
        eval_env: Optional[GymEnv] = None,
        eval_freq: int = -1,
        n_eval_episodes: int = 5,
        tb_log_name: str = "DDPG",
        eval_log_path: Optional[str] = None,
        reset_num_timesteps: bool = True,
    ):

        return super(DDPG, self).learn(
            total_timesteps=total_timesteps,
            callback=callback,
            log_interval=log_interval,
            eval_env=eval_env,
            eval_freq=eval_freq,
            n_eval_episodes=n_eval_episodes,
            tb_log_name=tb_log_name,
            eval_log_path=eval_log_path,
            reset_num_timesteps=reset_num_timesteps,
        )
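
The constructor above reduces TD3 to DDPG through three settings: `policy_delay=1`, `target_noise_clip=0.0` and a single critic. The short sketch below checks the last two on toy arrays. It only illustrates the arithmetic and assumes the target noise is clipped to `[-target_noise_clip, target_noise_clip]`, as the comments in the `train()` exercise describe.

```python
import numpy as np

rng = np.random.default_rng(0)
target_policy_noise, target_noise_clip = 0.1, 0.0

# Noise is sampled with a positive standard deviation...
noise = rng.normal(0.0, target_policy_noise, size=(5, 2))
# ...but clipping to [-0, 0] removes it entirely, so there is no target smoothing.
clipped_noise = np.clip(noise, -target_noise_clip, target_noise_clip)

# With a single critic, the minimum over critics is that critic's own estimate.
single_critic_q = np.array([[0.3, -1.2, 0.8]])  # shape (n_critics=1, batch=3)
min_q = single_critic_q.min(axis=0)
```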

## Train

In [None]:
# Agent
from stable_baselines3.td3.policies import MultiInputPolicy
from actor_critic_highway.dict_buffer import DictReplayBuffer
from actor_critic_highway.her_buffers import HerReplayBuffer


env = gym.make("parking-v0")
her_kwargs = dict(
    n_sampled_goal=4,
    goal_selection_strategy='future',
    online_sampling=True,
    max_episode_length=100,
)

model = TD3(
    policy=MultiInputPolicy,
    env=env,
    verbose=1,
    # replay_buffer_class=DictReplayBuffer,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=her_kwargs,
    policy_kwargs=dict(),
)

model.learn(int(1000))

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | -59.1    |
|    success rate    | 0        |
| time/              |          |
|    episodes        | 4        |
|    fps             | 50       |
|    time_elapsed    | 7        |
|    total timesteps | 400      |
| train/             |          |
|    actor_loss      | 0.505    |
|    critic_loss     | 0.00998  |
|    learning_rate   | 0.001    |
|    n_updates       | 200      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | -86.4    |
|    success rate    | 0        |
| time/              |          |
|    episodes        | 8        |
|    fps             | 44       |
|    time_elapsed    | 18       |
|    total timesteps | 800      |
| train/             

<__main__.TD3 at 0x7f2d2191fad0>
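
The `n_sampled_goal=4` setting controls how many relabeled transitions accompany each real one. Assuming the base class follows the Stable-Baselines3 convention of `her_ratio = 1 - 1 / (n_sampled_goal + 1)` (an assumption, since `HerReplayBufferBase` is not shown here), the share of relabeled samples in a batch can be sketched as:

```python
def her_ratio(n_sampled_goal):
    """Fraction of a batch that is relabeled, for n_sampled_goal virtual goals per real transition."""
    return 1.0 - 1.0 / (n_sampled_goal + 1)


ratio = her_ratio(4)
n_relabeled = round(ratio * 100)  # relabeled samples in a batch of 100
```

Under that assumption, four virtual transitions per real one means 80% of each batch uses a substituted goal.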

In [None]:
model = DDPG(
    policy=MultiInputPolicy,
    env=env,
    verbose=1,
    replay_buffer_class=DictReplayBuffer,
    # replay_buffer_class=HerReplayBuffer,
    # replay_buffer_kwargs=her_kwargs,
    policy_kwargs=dict(),
)

model.learn(int(5e4))

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
<class '__main__.DictReplayBuffer'>
<__main__.DictReplayBuffer object at 0x7f1a174d8b50>
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | -79.9    |
|    success rate    | 0        |
| time/              |          |
|    episodes        | 4        |
|    fps             | 50       |
|    time_elapsed    | 7        |
|    total timesteps | 400      |
| train/             |          |
|    actor_loss      | 1.74     |
|    critic_loss     | 0.0931   |
|    learning_rate   | 0.001    |
|    n_updates       | 200      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | -65.2    |
|    success rate    | 0        |
| time/              |          |
|    episodes        | 8        |
|    fps             | 43       |


<__main__.DDPG at 0x7f1a18df5f90>

# Test the policy

In [None]:
import os
os.environ["SDL_VIDEODRIVER"] = "dummy"

In [None]:
env = gym.make("parking-v0")
env = Monitor(env, './video', force=True, video_callable=lambda episode: True)
for episode in trange(3, desc="Test episodes"):
    obs, done = env.reset(), False
    env.unwrapped.automatic_rendering_callback = env.video_recorder.capture_frame
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
        
env.close()
show_video('./video')