Notebooks created by Oliver Hausdörfer, Jonathan Külz, Hannah Markgraf.

**How to use this notebook:** You can upload this .ipynb file to Google Colab for editing and running (go to [colab.research.google.com](https://colab.research.google.com) -> File -> Open Notebook -> Upload).

**How to submit this notebook:** After finishing the HandsOn on Google Colab, share the uploaded notebook (click Share -> Anyone with the link) and send the link to the <font color='red'>completed notebook with cell outputs</font> to the course supervisor oliver.hausdoerfer@tum.de. Include your name and student ID in this e-mail.

**All tasks that need to be completed by you are marked with:** "👉 TODO: ..."

# HandsOn 2: Your second Deep Reinforcement Learning Agent 🤖

In this notebook, you'll train your second Deep Reinforcement Learning agent - a [CartPole](https://www.gymlibrary.dev/environments/classic_control/cart_pole/) that will learn to balance an inverted pendulum. For the project, we again use [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/) and [Gymnasium](https://gymnasium.farama.org/).




## Objectives of this Hands-On

At the end of the notebook, you will:

- Be able to use [Gymnasium](https://gymnasium.farama.org/index.html), the environment library (the maintained successor of OpenAI Gym).
- Be able to use [Stable Baselines 3](https://stable-baselines3.readthedocs.io/en/master/) (SB3), the deep reinforcement learning library.




## Prerequisites

- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type > Hardware Accelerator > GPU`. **Important**: You need to reconnect your Colab notebook afterward!

In [None]:
%load_ext tensorboard

!apt-get update && apt-get install -y ffmpeg freeglut3-dev xvfb  # For visualization. If you are working on your own machine, run these in your terminal with sudo
!pip install -q "stable-baselines3"
!pip install -q "swig"
!pip install -q "gymnasium[box2d]"

In [None]:
import stable_baselines3
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv
import gymnasium as gym
import numpy as np
import torch

import os
import base64
from pathlib import Path
from IPython import display as ipythondisplay

# for local usage
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

print(f"{torch.__version__=}")
print(f"{stable_baselines3.__version__=}")
print(f"{gym.__version__=}")
if not torch.cuda.is_available():
    print("Did you activate the GPU? It doesn't seem to be available")

### Reinforcement Learning Agent 🤖


👉 TODO: Familiarize yourself with the CartPole-v1 environment. Then, fill out the following code cell.

In [None]:
env = gym.make("CartPole-v1")
env.reset()

print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", '''👉 TODO: insert line of code to print action space shape''')
print("Action Space Sample", '''👉 TODO: insert line of code to randomly sample from action space''')


print("_____OBSERVATION SPACE_____ \n")
print("Observation Space Shape", '''👉 TODO: insert line of code to print observation space shape''')
print("Sample observation", '''👉 TODO: insert line of code to get a random observation''')

Similar to our last notebook, we first create the agent-environment interaction loop by sampling random actions.

In [None]:
# Reset the environment
observation, info = env.reset()

# Move 20 timesteps
for _ in range(20):
  # Take a random action
  action = env.action_space.sample()
  print("Action taken:", action)

  # Execute this action in the environment and get information
  observation, reward, terminated, truncated, info = env.step(action)

  if terminated or truncated:
      # Reset the environment
      print("Environment is reset")
      observation, info = env.reset()

env.close()

**Hint: Vectorized Environments**

In practice, when you implement your own DRL frameworks, you should aim to use [vectorized environments](https://stable-baselines3.readthedocs.io/en/master/common/env_util.html#stable_baselines3.common.env_util.make_vec_env). They stack multiple independent environments into a single environment, which diversifies experience collection and can improve training speed through parallelization. SB3 implements [vectorized environments](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#). 👉 TODO: Make sure you understand the different types of vectorized environments provided by SB3: [SubprocVecEnv](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#stable_baselines3.common.vec_env.SubprocVecEnv) and [DummyVecEnv](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#dummyvecenv).

In [None]:
# SB3 provides a convenience function to create a vectorized environment
env = make_vec_env('CartPole-v1', n_envs=16)

In [None]:
# 👉 TODO: briefly explain what type of vectorized environment (Subproc or Dummy) you would use to speed up simulation on a multi-core cpu machine. (You don't need to go into very technical details.)
print("I would use ..., because ...")

Now we train our agent using [PPO](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#example). PPO is an actor-critic algorithm and one of the state-of-the-art (SOTA) Deep Reinforcement Learning algorithms. Actor-critic means that it learns a policy function and a value function simultaneously. The policy learns which action to take in each state (s -> a), and the value function estimates the expected return of each state (s -> value). During training, the agent has to explore enough of the state space to discover which actions lead to high reward. PPO adds some tweaks that make training more stable, such as clipping the policy objective so that a single update cannot move the policy too far.
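
To make the clipping idea more concrete, here is a small sketch of the clipped surrogate objective for a single sample. It is a simplified illustration, not SB3's actual implementation: the function name `clipped_surrogate` is ours, `ratio` stands for the probability ratio between the new and the old policy, and `clip_range=0.2` is assumed because it is the default in SB3's PPO.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Clipped PPO objective for one sample (to be maximized).

    ratio:     pi_new(a|s) / pi_old(a|s)
    advantage: estimated advantage of action a in state s
    """
    # Restrict the ratio to the interval [1 - clip_range, 1 + clip_range]
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Take the more pessimistic of the unclipped and the clipped term
    return min(ratio * advantage, clipped_ratio * advantage)


# A large policy change on a good action earns no extra credit beyond the clip limit,
# and a large change on a bad action is not rewarded either.
for ratio, advantage in [(1.5, 1.0), (0.5, -1.0), (1.1, 2.0)]:
    print(f"ratio={ratio}, advantage={advantage}, objective={clipped_surrogate(ratio, advantage):.2f}")
```

Once the ratio leaves the clipping interval in the direction that would increase the objective, the gradient with respect to the policy vanishes, which is what limits the size of each update.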

We already created our environment above. Now we need to set up the DRL model we want to use.

In [None]:
import torch as th
# 👉 TODO: Initialize the PPO agent. Use the following parameters:
    # policy = 'MlpPolicy',
    # env = env,
    # n_steps = 1024,
    # batch_size = 64,
    # n_epochs = 4,
    # gamma = 0.99,
    # gae_lambda = 0.98,
    # ent_coef = 0.01
# Documentation: https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html
# Additionally, define a custom policy and value network using 2 hidden layers, 32 neurons and ReLU as activation.
# The policy and value network should be independent networks.
# Documentation: https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html
# Requires ~10 LOC










We inspect the network layout to check that it is initialized correctly.

In [None]:
model.policy.mlp_extractor

Lastly, we simply train the agent.

👉 TODO: Try different numbers of agent-environment interactions for learning (```total_timesteps```) in the range of 10_000 to 200_000 and see how the achieved reward changes using the ```evaluate_policy``` function given below.
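
When choosing `total_timesteps`, it helps to know how it relates to the PPO settings from above. The following back-of-the-envelope calculation is a sketch based on our understanding of SB3's on-policy training loop, which always collects complete rollouts of `n_steps * n_envs` transitions; the helper name `timesteps_actually_collected` is ours.

```python
import math

# Settings used in this notebook
n_envs = 16       # from make_vec_env
n_steps = 1024    # steps per environment and rollout
batch_size = 64
n_epochs = 4

# Transitions collected before each policy update
rollout_size = n_envs * n_steps
# Minibatch gradient steps performed on each rollout
minibatches_per_epoch = rollout_size // batch_size
gradient_steps_per_rollout = minibatches_per_epoch * n_epochs

def timesteps_actually_collected(total_timesteps):
    """Round total_timesteps up to a whole number of rollouts."""
    n_rollouts = math.ceil(total_timesteps / rollout_size)
    return n_rollouts * rollout_size

print(f"{rollout_size=}, {gradient_steps_per_rollout=}")
for total in (10_000, 50_000, 200_000):
    print(total, "->", timesteps_actually_collected(total))
```

Under these assumptions, `total_timesteps=10_000` corresponds to a single rollout and therefore a single round of policy updates, which may explain why short runs show little improvement.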

In [None]:
model.learn(total_timesteps=10_000)

---

**Evaluating the agent**

Stable-Baselines3 provides a method to evaluate the agent: `evaluate_policy`. [Under the hood](https://stable-baselines3.readthedocs.io/en/master/_modules/stable_baselines3/common/evaluation.html#evaluate_policy) this method simply implements an environment-interaction-loop. When you evaluate your agent, you should not use your training environment but create an evaluation environment.

In [None]:
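from stable_baselines3.common.evaluation import evaluate_policy

# Use a separate environment for evaluation; wrapping it in Monitor (imported above)
# lets SB3 read the unmodified episode rewards and lengths
eval_env = Monitor(gym.make("CartPole-v1"))
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)
print(f"mean_reward: {mean_reward:.2f} +/- {std_reward:.2f}")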

Finally, we prepare a video of the trained agent.

In [None]:
# Set up fake display; otherwise rendering will fail
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

import base64
from pathlib import Path

from IPython import display as ipythondisplay

from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv



def show_videos(video_path="", prefix=""):
    """
    Taken from https://github.com/eleurent/highway-env

    :param video_path: (str) Path to the folder containing videos
    :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
    """
    html = []
    for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
        video_b64 = base64.b64encode(mp4.read_bytes())
        html.append(
            """<video alt="{}" autoplay
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>""".format(
                mp4, video_b64.decode("ascii")
            )
        )
    ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))



def record_video(env_id, model, video_length=500, prefix="", video_folder="videos/"):
    """
    :param env_id: (str)
    :param model: (RL model)
    :param video_length: (int)
    :param prefix: (str)
    :param video_folder: (str)
    """
    # Create the environment given by env_id with off-screen rendering
    eval_env = DummyVecEnv([lambda: gym.make(env_id, render_mode="rgb_array")])
    # Start the video at step=0 and record video_length steps
    eval_env = VecVideoRecorder(
        eval_env,
        video_folder=video_folder,
        record_video_trigger=lambda step: step == 0,
        video_length=video_length,
        name_prefix=prefix,
    )

    obs = eval_env.reset()
    for _ in range(video_length):
        action, _ = model.predict(obs)
        obs, _, _, _ = eval_env.step(action)

    # Close the video recorder
    eval_env.close()

In [None]:
record_video("CartPole-v1", model, video_length=500, prefix="ppo-cartpole")
show_videos("videos", prefix="ppo")

**Important note**

Usually, crafting a reward function is a central task when solving DRL problems. For the Gymnasium environments we use, reward functions are already provided and we don't need to worry about them. For your own projects, the design of the reward might be critical. Before creating a reward function, it makes sense to look at similar problems in the literature and use their rewards as a starting point.

👉 TODO: Look at [this](https://www.nature.com/articles/s41598-023-38259-7) paper and how they design a reward function for a quadruped to get an idea of the complexity of the reward signal for a more complicated problem.

# HandsOn 2: Finished 🤖

**How to submit this notebook:** After finishing the HandsOn on Google Colab, share the uploaded notebook (click Share -> Anyone with the link) and send the link to the <font color='red'>completed notebook with cell outputs</font> to the course supervisor oliver.hausdoerfer@tum.de. Include your name and student ID in this e-mail.