<div class="alert alert-block alert-info">
    <h1>A practical introduction to policy search in RL</h1><br>
    <span>Algerian AI Summer University, August 2021</span>
</div>

### Outline
- Part I: Introduction to the RL paradigm and getting familiar with `gym`
- Part II: Implementing Policy Gradient

# 1. Introduction to the RL paradigm

## Imports and useful functions

First, install the required libraries if you have not done so already.

In [None]:
!pip install torch==1.9.0 torchvision pyvirtualdisplay gym numpy pandas python-box plotly tqdm
!sudo apt-get install xvfb

Gym provides RL environments behind a simple interface.

In [8]:
import gym
from gym.wrappers import Monitor

A utility function to display short videos of the environments

In [9]:
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay
from pathlib import Path
import base64

def show_video(directory):
    html = []
    for mp4 in Path(directory).glob("*.mp4"):
        video_b64 = base64.b64encode(mp4.read_bytes())
        html.append('''<video alt="{}" autoplay 
                      loop controls style="height: 400px;">
                      <source src="data:video/mp4;base64,{}" type="video/mp4" />
                 </video>'''.format(mp4, video_b64.decode('ascii')))
    ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

In [10]:
import tempfile

def show_episode(env, policy):
    # Record one episode into a temporary directory and display the video.
    with tempfile.TemporaryDirectory() as logs_dir:
        env = Monitor(env, logs_dir, force=True, video_callable=lambda episode: True)
        interact(env, policy)  # `interact` must be defined before this function is called
        env.close()  # close the recorder so the video file is complete before it is read
        show_video(logs_dir)  # reuse the helper defined above

In [11]:
display = Display(visible=0, size=(1400, 900))
display.start();

Set the random seeds to make random outputs reproducible. This is mandatory when debugging and developing algorithms, but proscribed when running experiments, where results should hold across several seeds.

In [12]:
import numpy as np

seed = 1337
np.random.seed(seed=seed)  # seeds NumPy's global random number generator
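
To see what seeding buys you, here is a minimal sketch. It uses Python's built-in `random` module as a stand-in for NumPy's generator (an assumption made only to keep the example dependency-free; `np.random.seed` behaves analogously): re-seeding with the same value replays the same sequence of draws.

```python
import random

seed = 1337

# First run: seed, then draw three numbers.
random.seed(seed)
first_run = [random.random() for _ in range(3)]

# Second run: same seed, so the generator restarts from the same state.
random.seed(seed)
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # True
```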

## The gym library

Please consult the online documentation of [OpenAI's Gym library](https://gym.openai.com/envs/) if needed.

Let's start with a simple environment, the `CartPole` environment. 

In [13]:
env = gym.make("CartPole-v1")  # creates the environment with a specific version for reproducibility.

In [14]:
env.reset()  # returns the first observation

array([ 0.01874202, -0.00183776, -0.00499034, -0.01881972])

These four values correspond to:

| Index | Observation          | Min      | Max     |
| ----- | -------------------- | -------- | ------- |
| 0    | Cart Position        | -2.4     | 2.4     |
| 1    | Cart Velocity        | -Inf     | Inf     |
| 2    | Pole Angle           | ~ -41.8° | ~ 41.8° |
| 3    | Pole Velocity At Tip | -Inf     | Inf     |

Source: https://github.com/openai/gym/wiki/CartPole-v0
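
As a small illustration of the table, the sketch below attaches names to the four entries of the observation printed above. The field names are illustrative and not part of the gym API; the only assumption about the environment is that the pole angle is expressed in radians.

```python
import math
from collections import namedtuple

# Illustrative field names (not part of gym), in the order given by the table.
CartPoleObs = namedtuple(
    "CartPoleObs",
    ["cart_position", "cart_velocity", "pole_angle", "pole_tip_velocity"],
)

# The first observation returned by env.reset() above.
raw_obs = [0.01874202, -0.00183776, -0.00499034, -0.01881972]
obs = CartPoleObs(*raw_obs)

# The angle is stored in radians; convert it to degrees for readability.
pole_angle_deg = math.degrees(obs.pole_angle)
print(obs.cart_position, round(pole_angle_deg, 3))
```

Here the pole starts almost perfectly upright, tilted by less than a third of a degree.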

One can also look up the type and bounds of the observation variables:

In [15]:
env.observation_space

Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)
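
The huge number in this output is not arbitrary: it is the largest finite `float32`, which stands in for the unbounded velocities (`-Inf`, `Inf`) of the table. The sketch below rebuilds that value from its bit pattern using only the standard library. Note that the printed form shows one low and one high for the whole box; per-dimension bounds can be read from `env.observation_space.low` and `env.observation_space.high`.

```python
import struct

# 0x7F7FFFFF is the bit pattern of the largest finite float32
# (written here as little-endian bytes).
float32_max = struct.unpack("<f", bytes.fromhex("ffff7f7f"))[0]

print(float32_max)  # 3.4028234663852886e+38
```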

Similarly, for the actions...

In [16]:
env.action_space

Discrete(2)

In [17]:
action = 0

In [18]:
env.step(action)

(array([ 0.01870527, -0.19688779, -0.00536674,  0.27228453]), 1.0, False, {})

This returns the next observation, the reward, the `done` boolean flag, and some extra (optional) `info`.

Notice how the same action results in a different observation, because the state of the environment has changed in between.

In [19]:
env.step(action)

(array([ 1.47675138e-02, -3.91932755e-01,  7.89530596e-05,  5.63269944e-01]),
 1.0,
 False,
 {})
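
The three observations printed so far are enough to check how the state evolves. The sketch below assumes what gym's CartPole implementation does, to the best of my knowledge: a simple Euler update with a time step of 0.02 s, so that each position is the previous position plus the time step times the previous velocity. The helper `predict_positions` is hypothetical and written only for this check.

```python
TAU = 0.02  # assumed integration time step of CartPole, in seconds

# Observations copied from the outputs above: [x, x_dot, theta, theta_dot].
obs0 = [0.01874202, -0.00183776, -0.00499034, -0.01881972]
obs1 = [0.01870527, -0.19688779, -0.00536674, 0.27228453]
obs2 = [1.47675138e-02, -3.91932755e-01, 7.89530596e-05, 5.63269944e-01]

def predict_positions(obs, tau=TAU):
    """Euler step for the two position-like variables (cart position, pole angle)."""
    x, x_dot, theta, theta_dot = obs
    return x + tau * x_dot, theta + tau * theta_dot

checks = []
for before, after in [(obs0, obs1), (obs1, obs2)]:
    pred_x, pred_theta = predict_positions(before)
    # Compare the prediction with what the environment actually returned.
    checks.append(abs(pred_x - after[0]) < 1e-7 and abs(pred_theta - after[2]) < 1e-7)

print(checks)  # [True, True]
```

The velocities, by contrast, depend on the applied force: with action `0` (push the cart to the left), the cart velocity becomes more negative at every step, from about -0.002 to -0.197 to -0.392.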

And then we close the env.

In [20]:
env.close()

Let's run a full episode and log it.

In [21]:
env = gym.make("CartPole-v1")
logs_dir = "./Results"
# `Monitor` will log the episodes and produce a video.
env = Monitor(env, logs_dir, force=True, video_callable=lambda episode: True)

In [22]:
done = False  # terminal condition
obs = env.reset()  # initial state
while not done:  # the RL loop
    action = env.action_space.sample()  # random action
    obs, reward, done, info = env.step(action)  # the interaction with the environment
env.close()
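
The loop above discards the rewards. A minimal sketch of the same loop that also accumulates the episode return and length is given below. To keep it runnable without gym, it uses a hypothetical stand-in environment, `CountdownEnv`, which only mimics the `reset()` / `step()` interface used in this notebook (4-tuple return); `run_episode` is likewise a name introduced here for illustration.

```python
import random

class CountdownEnv:
    """Hypothetical toy environment: reward 1.0 per step, ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the observation is simply the step counter

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done, {}  # obs, reward, done, info

def run_episode(env, policy):
    """Run one episode and return (sum of rewards, number of steps)."""
    obs = env.reset()
    done = False
    total_reward, steps = 0.0, 0
    while not done:
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps

def random_policy(obs):
    return random.choice([0, 1])  # ignores the observation

episode_return, episode_length = run_episode(CountdownEnv(horizon=5), random_policy)
print(episode_return, episode_length)  # 5.0 5
```

With the gym version used here, the same `run_episode` function should work unchanged on `env`, with `lambda obs: env.action_space.sample()` as the random policy.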

In [23]:
show_video(logs_dir)

In [24]:
ls Results/

openaigym.episode_batch.0.9058.stats.json
openaigym.manifest.0.9058.manifest.json
openaigym.video.0.9058.video000000.meta.json
[0m[01;35mopenaigym.video.0.9058.video000000.mp4[0m


Let's try another classic. 

In [25]:
env = gym.make('Pendulum-v0')
logs_dir = "./Results"
env = Monitor(env, logs_dir, force=True, video_callable=lambda episode: True)

In [26]:
done = False  # terminal condition
obs = env.reset()  # initial state
while not done:  # the RL loop
    action = env.action_space.sample()  # random action
    obs, reward, done, info = env.step(action)  # the interaction with the environment
env.close()
show_video(logs_dir)

Another classical environment. 

In [27]:
env = gym.make('Acrobot-v1')
logs_dir = "./Results"
env = Monitor(env, logs_dir, force=True, video_callable=lambda episode: True)

In [28]:
done = False  # terminal condition
obs = env.reset()  # initial state
while not done:  # the RL loop
    action = env.action_space.sample()  # random action
    obs, reward, done, info = env.step(action)  # the interaction with the environment
env.close()
show_video(logs_dir)

## What are the observation and action spaces of these environments?

Table of the environment specs: https://github.com/openai/gym/wiki/Table-of-environments

- CartPole specifications: https://github.com/openai/gym/wiki/CartPole-v0
- Pendulum specifications: https://github.com/openai/gym/wiki/Pendulum-v0
- Acrobot specifications: https://gym.openai.com/envs/Acrobot-v1/

In [29]:
gym.envs.classic_control.AcrobotEnv?

In [30]:
env  #.env.env

<Monitor<TimeLimit<AcrobotEnv<Acrobot-v1>>>>

In [31]:
env.observation_space

Box(-28.274333953857422, 28.274333953857422, (6,), float32)

In [32]:
env.action_space

Discrete(3)
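
The bound `28.274...` printed for the Acrobot observation space can be traced back to the environment's velocity limits. The sketch below assumes the limits used in gym's Acrobot implementation, as far as I know: 4π rad/s for the first joint and 9π rad/s for the second, stored as `float32`. It also assumes that the three discrete actions map to torques of -1, 0 and +1 on the actuated joint. The helper `to_float32` is hypothetical.

```python
import math
import struct

def to_float32(x):
    """Round a Python float to the nearest float32, returned as a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Assumed angular velocity limits of the two joints, in rad/s.
max_vel_1 = 4 * math.pi
max_vel_2 = 9 * math.pi

# The bound printed by env.observation_space above.
printed_bound = 28.274333953857422
bound_matches = abs(to_float32(max_vel_2) - printed_bound) < 1e-9
print(bound_matches)  # True

# Assumed mapping from the Discrete(3) action index to the applied torque.
action_to_torque = {0: -1.0, 1: 0.0, 2: 1.0}
print(action_to_torque[2])
```

The printed bound is therefore the largest limit over all six observation dimensions; the four sine and cosine entries stay within [-1, 1].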