Adapted from: [this Colab notebook](https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/rl-baselines-zoo.ipynb#scrollTo=oKOjFuwK9HI0)

In [None]:
!pip install cmake swig
!pip install gymnasium[Box2D]==0.28.1
!pip install rl_zoo3==2.0.0

Collecting swig
  Downloading swig-4.1.1-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.8 MB)
Installing collected packages: swig
Successfully installed swig-4.1.1
Collecting gymnasium[Box2D]
  Downloading gymnasium-0.29.0-py3-none-any.whl (953 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium[Box2D])
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Collecting box2d-py==2.3.5 (from gymnasium[Box2D])
  Downloading box2d-py-2.3.5.tar.gz (374 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: box2d-py
  Building wheel for box2d-py (setup.py

We have previously studied how value-based and policy-based methods work, along with the pros and cons of each. With policy-gradient methods (e.g., the REINFORCE algorithm), you have probably noticed that training is not very stable: the return estimates have high variance, so the loss fluctuates a lot. This is where **A2C** comes in!


**A2C** stands for Advantage Actor-Critic. It is an RL algorithm that combines policy-based and value-based methods!

How does A2C work in general?

When we train the agent, we update the parameters of two components:
  - An *Actor*, which performs the action(s) (policy-based)
  - A *Critic*, which evaluates the quality of the action(s) performed by the *Actor* (value-based)
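
To make the two roles concrete, here is a minimal tabular sketch. It assumes a toy problem with a handful of discrete states and actions (the sizes and all names below are illustrative, not taken from LunarLander or Stable-Baselines3), a softmax policy for the *Actor*, and a lookup table for the *Critic*.

```python
import math
import random

N_STATES, N_ACTIONS = 4, 2  # hypothetical toy sizes, not LunarLander's

# Actor parameters (theta): one preference per (state, action) pair.
theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
# Critic parameters (w): one action-value estimate per (state, action) pair.
w = [[0.0] * N_ACTIONS for _ in range(N_STATES)]


def policy(state):
    """Actor: softmax over the preferences of `state`, giving action probabilities."""
    prefs = theta[state]
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def act(state, rng=random):
    """Actor: sample an action from the current policy."""
    return rng.choices(range(N_ACTIONS), weights=policy(state))[0]


def q_hat(state, action):
    """Critic: estimated quality of taking `action` in `state`."""
    return w[state][action]
```

In a real A2C agent both tables would be neural networks, but the division of labour is the same: `theta` decides what to do, `w` judges how good that was.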

### Step 1

![Huggingface RL](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step1.jpg)

- The current state $S_t$ from the *Environment* is fed to the *Actor*, which performs the action $A_t$.

### Step 2

![Huggingface RL](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step2.jpg)

- The state $S_t$ and the action $A_t$ are then passed to the *Critic*, which computes the Q-value of taking that action in that state (do you still remember the Q-value?).

### Step 3

![](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step3.jpg)

- After the action $A_t$ is performed, the *Environment* returns the next state $S_{t+1}$ and the reward $R_{t+1}$.

### Step 4

![](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step4.jpg)

- The *Critic* has a say in how much the policy parameters change: the *Actor*'s update is scaled by the action-value estimate $\hat{q}_w(s,a)$.
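
Written out, the policy update takes the following form. The notation is the standard actor-critic one and is assumed to match the figure: $\theta$ are the *Actor*'s parameters, $\pi_\theta$ is the policy, and $\alpha$ is the *Actor*'s learning rate.

$$\Delta\theta = \alpha \, \nabla_\theta \log \pi_\theta(s, a) \; \hat{q}_w(s, a)$$

A large positive estimate $\hat{q}_w(s,a)$ strongly increases the probability of choosing $a$ in $s$, while a negative estimate decreases it.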

### Step 5

![](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step5.jpg)

- Here is how the *Critic* updates its parameters (**don't worry** if this looks too complicated!).
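
Steps 4 and 5 can be sketched together for a tabular toy problem. This is only an illustration under stated assumptions: a softmax *Actor*, a lookup-table *Critic* updated from the temporal-difference error $R_{t+1} + \gamma\,\hat{q}_w(S_{t+1}, A_{t+1}) - \hat{q}_w(S_t, A_t)$, and made-up values for the learning rates and discount factor. None of the names come from Stable-Baselines3.

```python
import math

ALPHA, BETA, GAMMA = 0.1, 0.5, 0.99  # hypothetical actor lr, critic lr, discount


def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, w, s, a, r, s_next, a_next, done):
    """Apply one actor update (Step 4) and one critic update (Step 5) in place."""
    q_sa = w[s][a]

    # Step 4: actor update, scaled by the critic's estimate q_hat(s, a).
    # For a softmax policy, d log pi(a|s) / d theta[s][b] = 1[b == a] - pi(b|s).
    probs = softmax(theta[s])
    for b in range(len(probs)):
        grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += ALPHA * grad_log_pi * q_sa

    # Step 5: critic update from the TD error.
    # A tabular critic has gradient 1 for the visited (s, a) entry.
    target = r if done else r + GAMMA * w[s_next][a_next]
    td_error = target - q_sa
    w[s][a] += BETA * td_error
    return td_error


# Two states, two actions; the critic starts with some arbitrary estimates.
theta = [[0.0, 0.0], [0.0, 0.0]]
w = [[1.0, 0.0], [2.0, 0.0]]
td = actor_critic_step(theta, w, s=0, a=0, r=1.0, s_next=1, a_next=0, done=False)
print(theta[0], w[0], td)
```

A2C itself scales the *Actor*'s update by an estimate of the advantage (how much better the action was than the state's average value) rather than by the raw Q-value, which reduces variance further.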

We will use A2C to train an agent for [Lunar Lander](https://www.gymlibrary.dev/environments/box2d/lunar_lander/)! Although we could use the [A2C](https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html) implementation in Stable-Baselines3 directly, as usual, there is a more convenient way: [rl-baselines3-zoo](https://github.com/DLR-RM/rl-baselines3-zoo), a training framework for Stable-Baselines3 agents!

![Lunar Lander](https://gymnasium.farama.org/_images/lunar_lander.gif)

The following code trains the agent to play **LunarLander-v2** for 100 timesteps (using the predefined hyperparameters). Feel free to train it for longer if you want!

In [None]:
!python -m rl_zoo3.train --algo a2c --env LunarLander-v2 --n-timesteps 100 --progress

2023-08-09 20:13:05.165159: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Seed: 3689662668
Loading hyperparameters from: /usr/local/lib/python3.10/dist-packages/rl_zoo3/hyperparams/a2c.yml
Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('ent_coef', 1e-05),
             ('gamma', 0.995),
             ('learning_rate', 'lin_0.00083'),
             ('n_envs', 8),
             ('n_steps', 5),
             ('n_timesteps', 200000.0),
             ('policy', 'MlpPolicy')])
Using 8 environments
Overwriting n_timesteps with n=100
Creating test environment
Using cuda device
Log path: logs/a2c/LunarLander-v2_1
 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 120/100  [ 0:00:03 < 0:00:

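In the hyperparameters printed above, `learning_rate` is given as `lin_0.00083`. In rl_zoo3, the `lin_` prefix denotes a learning rate that decays linearly from the given value towards zero as training progresses. A minimal sketch of that idea, assuming the schedule receives the fraction of training still remaining (the helper below is written out for illustration rather than imported from the library):

```python
from typing import Callable


def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Return a function mapping remaining progress (1.0 -> 0.0) to a learning rate."""

    def schedule(progress_remaining: float) -> float:
        return progress_remaining * initial_value

    return schedule


lr = linear_schedule(0.00083)
for progress_remaining in (1.0, 0.5, 0.0):
    print(progress_remaining, lr(progress_remaining))
```
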
Next, we will record and play a video of our trained agent running for **200** steps!

In [None]:
# Set up display; otherwise rendering will fail
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

In [None]:
!python -m rl_zoo3.record_video --algo a2c --env LunarLander-v2  -f logs/ --exp-id 0  -n 200 -o logs/videos

2023-08-09 20:13:37.497168: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Loading latest experiment, id=1
Loading logs/a2c/LunarLander-v2_1/LunarLander-v2.zip
Loading logs/a2c/LunarLander-v2_1/LunarLander-v2.zip
Saving video to /content/logs/videos/final-model-a2c-LunarLander-v2-step-0-to-step-200.mp4
Moviepy - Building video /content/logs/videos/final-model-a2c-LunarLander-v2-step-0-to-step-200.mp4.
Moviepy - Writing video /content/logs/videos/final-model-a2c-LunarLander-v2-step-0-to-step-200.mp4

Moviepy - Done !
Moviepy - video ready /content/logs/videos/final-model-a2c-LunarLander-v2-step-0-to-step-200.mp4


In [None]:
import base64
from pathlib import Path

from IPython import display as ipythondisplay


def show_videos(video_path="", prefix=""):
    """
    Taken from https://github.com/eleurent/highway-env

    :param video_path: (str) Path to the folder containing videos
    :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
    """
    html = []
    for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
        video_b64 = base64.b64encode(mp4.read_bytes())
        html.append(
            """<video alt="{}" autoplay
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>""".format(
                mp4, video_b64.decode("ascii")
            )
        )
    ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

In [None]:
show_videos(video_path='logs/videos', prefix='')

Using [Optuna](https://optuna.org/), we can also tune the **hyperparameters** of our model to improve the training results!
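
During tuning, each trial trains an agent with a different set of hyperparameters, and unpromising trials can be stopped early (pruned). The log of the command below reports `Pruner: median`. The following is a simplified sketch of that rule, not Optuna's actual implementation; the function name and the reward values are made up for illustration.

```python
from statistics import median


def should_prune(current_reward, rewards_of_earlier_trials):
    """Return True if this trial is doing worse than the median earlier trial.

    Both arguments are mean evaluation rewards measured at the same
    intermediate checkpoint (higher is better).
    """
    if not rewards_of_earlier_trials:
        return False  # nothing to compare against yet
    return current_reward < median(rewards_of_earlier_trials)


earlier = [-250.0, -180.0, -90.0]  # made-up rewards, for illustration only
print(should_prune(-300.0, earlier))
print(should_prune(-100.0, earlier))
```

Pruning saves compute by spending most of the training budget on hyperparameter sets that already look promising at the intermediate evaluation.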

In [None]:
!python -m rl_zoo3.train --algo a2c --env LunarLander-v2 --n-timesteps 1 --progress -optimize --n-jobs 3 --verbose 1

2023-08-04 21:44:22.529563: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Seed: 598707094
Loading hyperparameters from: /usr/local/lib/python3.10/dist-packages/rl_zoo3/hyperparams/a2c.yml
Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('ent_coef', 1e-05),
             ('gamma', 0.995),
             ('learning_rate', 'lin_0.00083'),
             ('n_envs', 8),
             ('n_steps', 5),
             ('n_timesteps', 200000.0),
             ('policy', 'MlpPolicy')])
Using 8 environments
Overwriting n_timesteps with n=100
Doing 1 intermediate evaluations for pruning based on the number of timesteps. (1 evaluation every 100k timesteps)
Optimizing hyperparameters
Sampler: tpe - Pruner: median
[32m[I 202