<a href="https://colab.research.google.com/github/RL-Starterpack/rl-starterpack/blob/main/exercises/AC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RL Tutorial - **AC Exercise**

## Setup

In [None]:
#@title Run this cell to clone the RL tutorial repository and install it
try:
  import rl_starterpack
  print('RL-Starterpack repo successfully installed!')
except ImportError:
  print('Cloning RL-Starterpack package...')

  !git clone https://github.com/RL-Starterpack/rl-starterpack.git
  print('Installing RL-Starterpack package...')
  !pip install -e rl-starterpack &> /dev/null
  print('\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
  print('Please restart the runtime to use the newly installed package!')
  print('Runtime > Restart Runtime')
  print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')

In [None]:
#@title Run this cell to install additional dependencies (will take ~30s)
!pip install torchviz > /dev/null
!pip install gym pyvirtualdisplay > /dev/null
!apt-get remove -y ffmpeg > /dev/null # Removed due to restrictive license; -y avoids an interactive prompt
!apt-get install -y xvfb python-opengl > /dev/null

In [None]:
#@title Run this cell to import the required libraries
try:
    from rl_starterpack import AC, OpenAIGym, PG, experiment, vis_utils
except ImportError:
    print('Please run the first cell! If you already ran it, make sure '
          'to restart the runtime after the package is installed.')
    raise

import gym
from itertools import chain
import numpy as np
import pandas as pd
import torch
import torchviz
from tqdm.auto import tqdm
%matplotlib inline
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay

# Set up a virtual display to show video renderings
if 'display' not in globals():
    display = Display(visible=0, size=(1400, 900))
    display.start()

You've seen most of these steps before, so we'll go through them quickly and focus on the different actor-critic configurations.

## CartPole

Set up the CartPole environment:

In [None]:
env = OpenAIGym(level='CartPole', max_timesteps=300)
num_episodes = 500

Actor and critic network constructors:

In [None]:
hidden_size = 16

def actor_fn():
    return torch.nn.Sequential(
        torch.nn.Linear(in_features=env.state_space['shape'][0], out_features=hidden_size),
        torch.nn.Tanh(),
        torch.nn.Linear(in_features=hidden_size, out_features=env.action_space['num_values'])
    )

def critic_fn():
    return torch.nn.Sequential(
        torch.nn.Linear(in_features=env.state_space['shape'][0], out_features=hidden_size),
        torch.nn.Tanh(),
        torch.nn.Linear(in_features=hidden_size, out_features=1)
    )
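
The actor's final layer produces one unnormalized score (logit) per action. As a minimal sketch, assuming the agent turns these logits into a categorical policy with a softmax (an assumption about `AC`, not something shown in this notebook), the conversion looks like this in plain Python. `softmax` is a hypothetical helper, not part of `rl_starterpack`.

```python
import math

def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability; the result is unchanged.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two logits, as for CartPole's two actions (push left / push right).
action_probs = softmax([1.0, 0.0])
print(action_probs)
```

The action with the larger logit receives the larger probability, and equal logits give a uniform policy.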

### Q actor-critic using estimated returns

Agent configuration, without any optional extensions:

In [None]:
agent = AC(
    state_space=env.state_space, action_space=env.action_space,
    actor_fn=actor_fn, actor_learning_rate=1e-3,
    critic_fn=critic_fn, critic_learning_rate=3e-3,
    discount=0.95
)
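
For intuition about what `discount=0.95` does, the following sketch computes discounted returns for a short reward sequence in plain Python. It is an illustration only and makes no claim about how `AC` builds its targets internally; `discounted_returns` is a hypothetical helper.

```python
def discounted_returns(rewards, discount):
    """Return G_t = r_t + discount * G_{t+1} for every timestep t."""
    result = []
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for reward in reversed(rewards):
        running = reward + discount * running
        result.append(running)
    result.reverse()
    return result

# Three steps with reward 1.0 each, as CartPole gives per surviving step.
example_returns = discounted_returns([1.0, 1.0, 1.0], discount=0.95)
print(example_returns)
```

Earlier timesteps accumulate larger returns because more discounted future reward lies ahead of them.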

Training loop:

In [None]:
returns = list()
actor_loss = list()
critic_loss = list()

pbar = tqdm(range(num_episodes))
pbar.set_postfix({'return': 'n/a'})
for _ in pbar:
    returns.append(0.0)

    state = env.reset()
    terminal = 0
    while terminal == 0:
        action = agent.act(state)
        next_state, reward, terminal = env.step(action)
        updated = agent.observe(state, action, reward, terminal, next_state)
        state = next_state
        returns[-1] += reward
        if updated:
            actor_loss.append(agent.last_actor_loss_value)
            critic_loss.append(agent.last_critic_loss_value)

    pbar.set_postfix({'return': '{:.2f}'.format(returns[-1])})

Plot of episode returns: occasionally successful, but generally unstable

In [None]:
vis_utils.draw_returns_chart(returns)

Plot of actor loss: consistently high, because the loss is weighted by the unnormalized cumulative return

In [None]:
vis_utils.draw_loss_chart(actor_loss)
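
To illustrate why weighting by the return keeps the actor loss high, here is a minimal sketch of a policy-gradient style loss, assuming the common form "negative log-probability of the taken action times a weight". The helper name and the numbers are hypothetical, and `AC` may compute its loss differently.

```python
import math

def policy_gradient_loss(taken_action_probs, weights):
    """Mean of -log(pi(a|s)) * weight over a batch of taken actions."""
    terms = [-math.log(p) * w for p, w in zip(taken_action_probs, weights)]
    return sum(terms) / len(terms)

probs = [0.5, 0.5, 0.5]            # probability the policy gave each taken action
raw_returns = [20.0, 15.0, 10.0]   # unnormalized returns, all positive

raw_loss = policy_gradient_loss(probs, raw_returns)
unit_loss = policy_gradient_loss(probs, [1.0, 1.0, 1.0])
print(raw_loss, unit_loss)
```

With all-positive weights every term has the same sign, so the loss scales with the size of the returns rather than settling near zero.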

Plot of critic loss: converging, occasionally increasing

In [None]:
vis_utils.draw_loss_chart(critic_loss)
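
As a rough sketch of what a critic loss measures, assuming a mean squared error between predicted state values and return targets (an assumption, since the loss used by `AC` is not shown here), the example below compares an untrained critic with a nearly fitted one. All names and numbers are hypothetical.

```python
def critic_mse(predicted_values, targets):
    """Mean squared error between value predictions and their targets."""
    errors = [(v - t) ** 2 for v, t in zip(predicted_values, targets)]
    return sum(errors) / len(errors)

targets = [2.8525, 1.95, 1.0]                     # e.g. discounted returns
early_loss = critic_mse([0.0, 0.0, 0.0], targets)  # untrained critic
late_loss = critic_mse([2.8, 2.0, 1.0], targets)   # nearly fitted critic
print(early_loss, late_loss)
```

Because the targets shift whenever the policy changes, such a loss can rise again temporarily even while it trends downward overall.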

Visualization:

In [None]:
vis_utils.show_episode_as_gif(ipythondisplay, agent, env)

### Q actor-critic using normalized estimated returns

Agent configuration, this time with `normalize_returns` set:

In [None]:
agent = AC(
    state_space=env.state_space, action_space=env.action_space,
    actor_fn=actor_fn, actor_learning_rate=1e-3,
    critic_fn=critic_fn, critic_learning_rate=3e-3,
    discount=0.95, normalize_returns=True
)
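
A minimal sketch of what return normalization typically means: shift the returns to zero mean and scale them to unit standard deviation. Whether `normalize_returns` uses exactly this formula (population standard deviation, this `eps`) is an assumption; `normalize` is a hypothetical helper.

```python
import math

def normalize(values, eps=1e-8):
    """Shift to zero mean and scale to (approximately) unit standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    # eps guards against division by zero when all values are equal.
    return [(v - mean) / (std + eps) for v in values]

normalized = normalize([20.0, 15.0, 10.0])
print(normalized)
```

After normalization the weights are centered on zero, so above-average outcomes are reinforced and below-average ones are discouraged, regardless of the absolute reward scale.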

In [None]:
returns = list()
actor_loss = list()
critic_loss = list()

# Training loop
pbar = tqdm(range(num_episodes))
pbar.set_postfix({'return': 'n/a'})
for _ in pbar:
    returns.append(0.0)

    # Episode loop
    state = env.reset()
    terminal = 0
    while terminal == 0:
        action = agent.act(state)
        next_state, reward, terminal = env.step(action)
        updated = agent.observe(state, action, reward, terminal, next_state)
        state = next_state
        returns[-1] += reward
        if updated:
            actor_loss.append(agent.last_actor_loss_value)
            critic_loss.append(agent.last_critic_loss_value)

    pbar.set_postfix({'return': '{:.2f}'.format(returns[-1])})

Plot of episode returns: consistently improving

In [None]:
vis_utils.draw_returns_chart(returns)

Plot of actor loss: consistently around zero

In [None]:
vis_utils.draw_loss_chart(actor_loss)

Plot of critic loss: converging to zero

In [None]:
vis_utils.draw_loss_chart(critic_loss)

It is interesting to compare this to the performance of our policy-gradient algorithm with normalized returns:

In [None]:
agent = PG(
    state_space=env.state_space, action_space=env.action_space,
    network_fn=actor_fn, learning_rate=1e-3,
    discount=0.95, normalize_returns=True
)

# Training loop
returns = list()
pbar = tqdm(range(num_episodes))
pbar.set_postfix({'return': 'n/a'})
for _ in pbar:
    returns.append(0.0)

    # Episode loop
    state = env.reset()
    terminal = 0
    while terminal == 0:
        action = agent.act(state)
        next_state, reward, terminal = env.step(action)
        agent.observe(state, action, reward, terminal, next_state)
        state = next_state
        returns[-1] += reward

    pbar.set_postfix({'return': '{:.2f}'.format(returns[-1])})

vis_utils.draw_returns_chart(returns)
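
Episode returns are noisy, so comparing two runs by eye is easier on a smoothed curve. The sketch below is one simple way to do that with a trailing moving average; `moving_average` and the window size are hypothetical and not part of `vis_utils`.

```python
def moving_average(values, window):
    """Trailing mean over up to `window` most recent values at each position."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

smoothed_example = moving_average([1.0, 2.0, 3.0, 4.0, 5.0], window=3)
print(smoothed_example)  # [1.0, 1.5, 2.0, 3.0, 4.0]
```

Applying the same window to the returns of both agents puts their learning curves on a comparable footing.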

Visualization:

In [None]:
vis_utils.show_episode_as_gif(ipythondisplay, agent, env)

### TD actor-critic using estimated advantage

In [None]:
agent = AC(
    state_space=env.state_space, action_space=env.action_space,
    actor_fn=actor_fn, actor_learning_rate=1e-3,
    critic_fn=critic_fn, critic_learning_rate=3e-3,
    discount=0.95, compute_advantage=True
)
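
With `compute_advantage=True` the actor is weighted by an advantage estimate instead of a return. A minimal sketch of the one-step TD advantage, assuming the standard form `reward + discount * V(next_state) - V(state)` with no bootstrapping at a terminal state; the helper and the boolean `terminal` flag are hypothetical and may differ from what `AC` does internally.

```python
def td_advantage(reward, value, next_value, terminal, discount):
    """One-step TD error, used as an estimate of the advantage."""
    # Do not bootstrap from the next state once the episode has ended.
    bootstrap = 0.0 if terminal else discount * next_value
    return reward + bootstrap - value

adv_mid = td_advantage(reward=1.0, value=10.0, next_value=10.0,
                       terminal=False, discount=0.95)
adv_end = td_advantage(reward=1.0, value=10.0, next_value=10.0,
                       terminal=True, discount=0.95)
print(adv_mid, adv_end)
```

The advantage is small when the critic already predicts the outcome well and strongly negative when an episode ends earlier than the critic expected.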

In [None]:
returns = list()
actor_loss = list()
critic_loss = list()

# Training loop
pbar = tqdm(range(num_episodes))
pbar.set_postfix({'return': 'n/a'})
for _ in pbar:
    returns.append(0.0)

    # Episode loop
    state = env.reset()
    terminal = 0
    while terminal == 0:
        action = agent.act(state)
        next_state, reward, terminal = env.step(action)
        updated = agent.observe(state, action, reward, terminal, next_state)
        state = next_state
        returns[-1] += reward
        if updated:
            actor_loss.append(agent.last_actor_loss_value)
            critic_loss.append(agent.last_critic_loss_value)

    pbar.set_postfix({'return': '{:.2f}'.format(returns[-1])})

Plot of episode returns: improving, occasionally unstable

In [None]:
vis_utils.draw_returns_chart(returns)

Plot of actor loss: converging to zero very quickly

In [None]:
vis_utils.draw_loss_chart(actor_loss)

Plot of critic loss: converging to zero

In [None]:
vis_utils.draw_loss_chart(critic_loss)

Visualization:

In [None]:
vis_utils.show_episode_as_gif(ipythondisplay, agent, env)

## FrozenLake

In [None]:
# reward_threshold: 0.78, optimum: 0.8196, max_timesteps: 100
env = OpenAIGym(level='FrozenLake', max_timesteps=100)
num_episodes = 1000

In [None]:
def reward_shaping_fn(reward, terminal, state):
    if terminal == 1 and reward == 0.0:
        return -1.0, terminal
    elif terminal == 2 and reward == 0.0:
        return -0.5, terminal
    else:
        return reward, terminal
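
To see what `reward_shaping_fn` does, the sketch below applies it to a few hand-written transitions. The sample values are hypothetical, and reading `terminal == 1` as a true terminal state and `terminal == 2` as a cutoff at `max_timesteps` is an assumption about the `OpenAIGym` wrapper rather than something this notebook states.

```python
def reward_shaping_fn(reward, terminal, state):
    # Copied from the cell above so that this sketch runs on its own.
    if terminal == 1 and reward == 0.0:
        return -1.0, terminal
    elif terminal == 2 and reward == 0.0:
        return -0.5, terminal
    else:
        return reward, terminal

# Hypothetical (reward, terminal, state) samples.
samples = [
    (0.0, 0, 1),   # ordinary step: unchanged
    (0.0, 1, 5),   # terminal == 1 without reward: penalized with -1.0
    (0.0, 2, 3),   # terminal == 2 without reward: penalized with -0.5
    (1.0, 1, 15),  # terminal == 1 with reward: unchanged
]
shaped_rewards = [reward_shaping_fn(r, t, s)[0] for r, t, s in samples]
print(shaped_rewards)  # [0.0, -1.0, -0.5, 1.0]
```

The function only changes zero-reward episode endings, which gives the agent a learning signal in an environment whose reward is otherwise almost always zero.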

In [None]:
hidden_size = 16

def actor_fn():
    return torch.nn.Sequential(
        torch.nn.Embedding(num_embeddings=env.state_space['num_values'], embedding_dim=hidden_size),
        torch.nn.Tanh(),
        torch.nn.Linear(in_features=hidden_size, out_features=env.action_space['num_values']),
    )

def critic_fn():
    return torch.nn.Sequential(
        torch.nn.Embedding(num_embeddings=env.state_space['num_values'], embedding_dim=hidden_size),
        torch.nn.Tanh(),
        torch.nn.Linear(in_features=hidden_size, out_features=1),
    )
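
FrozenLake states are discrete indices, which is why the networks start with an embedding layer instead of a linear layer. An embedding is a lookup table with one row per state, equivalent to one-hot encoding the index and multiplying by a weight matrix without bias. The plain-Python sketch below demonstrates that equivalence on a tiny, hypothetical table.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# Hypothetical 3-state, 2-dimensional embedding table (one row per state).
table = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

state_index = 2
looked_up = table[state_index]  # what an embedding lookup returns

# Same result via one-hot encoding times the table (a bias-free linear map).
encoded = one_hot(state_index, len(table))
projected = [
    sum(encoded[i] * table[i][j] for i in range(len(table)))
    for j in range(len(table[0]))
]
print(looked_up, projected)
```

The lookup avoids materializing the one-hot vector, which matters little for a small grid but is the idiomatic way to feed discrete states into a network.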

### Q actor-critic using estimated returns

In [None]:
agent = AC(
    state_space=env.state_space, action_space=env.action_space,
    actor_fn=actor_fn, actor_learning_rate=3e-3,
    critic_fn=critic_fn, critic_learning_rate=3e-3,
    discount=0.95
)

In [None]:
returns = list()
actor_loss = list()
critic_loss = list()

# Training loop
pbar = tqdm(range(num_episodes))
pbar.set_postfix({'return': 'n/a'})
for _ in pbar:
    returns.append(0.0)

    # Episode loop
    state = env.reset()
    terminal = 0
    while terminal == 0:
        action = agent.act(state)
        next_state, reward, terminal = env.step(action)
        updated = agent.observe(state, action, reward, terminal, next_state)
        state = next_state
        returns[-1] += reward
        if updated:
            actor_loss.append(agent.last_actor_loss_value)
            critic_loss.append(agent.last_critic_loss_value)

    pbar.set_postfix({'return': '{:.2f}'.format(returns[-1])})

In [None]:
vis_utils.draw_returns_chart(returns)

In [None]:
vis_utils.draw_loss_chart(actor_loss)

In [None]:
vis_utils.draw_loss_chart(critic_loss)

### Q actor-critic using normalized estimated returns

In [None]:
agent = AC(
    state_space=env.state_space, action_space=env.action_space,
    actor_fn=actor_fn, actor_learning_rate=1e-3,
    critic_fn=critic_fn, critic_learning_rate=3e-3,
    discount=0.95, normalize_returns=True
)

In [None]:
returns = list()
actor_loss = list()
critic_loss = list()

# Training loop
pbar = tqdm(range(num_episodes))
pbar.set_postfix({'return': 'n/a'})
for _ in pbar:
    returns.append(0.0)

    # Episode loop
    state = env.reset()
    terminal = 0
    while terminal == 0:
        action = agent.act(state)
        next_state, reward, terminal = env.step(action)
        updated = agent.observe(state, action, reward, terminal, next_state)
        state = next_state
        returns[-1] += reward
        if updated:
            actor_loss.append(agent.last_actor_loss_value)
            critic_loss.append(agent.last_critic_loss_value)

    pbar.set_postfix({'return': '{:.2f}'.format(returns[-1])})

In [None]:
vis_utils.draw_returns_chart(returns)

In [None]:
vis_utils.draw_loss_chart(actor_loss)

In [None]:
vis_utils.draw_loss_chart(critic_loss)

### TD actor-critic using estimated advantage

In [None]:
agent = AC(
    state_space=env.state_space, action_space=env.action_space,
    actor_fn=actor_fn, actor_learning_rate=1e-3,
    critic_fn=critic_fn, critic_learning_rate=3e-3,
    discount=0.95, compute_advantage=True
)

In [None]:
returns = list()
actor_loss = list()
critic_loss = list()

# Training loop
pbar = tqdm(range(num_episodes))
pbar.set_postfix({'return': 'n/a'})
for _ in pbar:
    returns.append(0.0)

    # Episode loop
    state = env.reset()
    terminal = 0
    while terminal == 0:
        action = agent.act(state)
        next_state, reward, terminal = env.step(action)
        updated = agent.observe(state, action, reward, terminal, next_state)
        state = next_state
        returns[-1] += reward
        if updated:
            actor_loss.append(agent.last_actor_loss_value)
            critic_loss.append(agent.last_critic_loss_value)

    pbar.set_postfix({'return': '{:.2f}'.format(returns[-1])})

In [None]:
vis_utils.draw_returns_chart(returns)

In [None]:
vis_utils.draw_loss_chart(actor_loss)

In [None]:
vis_utils.draw_loss_chart(critic_loss)