# PFRL Quickstart Guide

This is a quickstart guide for users who just want to try PFRL for the first time.

If you have not yet installed PFRL, run the command below to install it:
```
pip install pfrl
```

If you have already installed PFRL, let's begin!

First, import the necessary modules. The module name of PFRL is `pfrl`. The cell below also installs and imports `torch`, `gym`, and `numpy`, along with the packages used to render the environment in a notebook, since they are used later.

In [None]:
# Install the display packages needed for headless rendering
!apt update && apt install -y xvfb python-opengl ffmpeg
# Install torch and plotting packages
!pip install torchvision matplotlib seaborn pandas numpy
# Install gym and the physics engine for Box2D environments
!pip install gym box2d-py

# Install the wrapper used to visualize the environment in a notebook
!pip install gym-notebook-wrapper
!pip install pyvirtualdisplay
import pyvirtualdisplay
disp = pyvirtualdisplay.Display()
disp.start()  # Start Xvfb and set the "DISPLAY" environment variable.
!pip install pfrl
import pfrl
import torch
import torch.nn
import gym
import numpy

import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import random, os.path, math, glob, csv, base64, itertools, sys
from gym.wrappers import Monitor
import gnwrapper

PFRL can be applied to any problem that is modeled as an "environment". [OpenAI Gym](https://github.com/openai/gym) provides a variety of benchmark environments and defines a common interface among them. PFRL uses a subset of that interface. Specifically, an environment must define its observation space and action space and have at least two methods: `reset` and `step`.

- `env.reset` will reset the environment to the initial state and return the initial observation.
- `env.step` will execute a given action, move to the next state and return four values:
  - a next observation
  - a scalar reward
  - a boolean value indicating whether the current state is terminal or not
  - additional information
- `env.render` will render the current state. (optional)
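
To make the interface concrete, here is a minimal sketch of an environment written from scratch. `LineWalkEnv` is a hypothetical toy class invented for this illustration, not part of Gym or PFRL. It only shows the shape of `reset` and `step`, and it omits the observation space and action space definitions that a real environment would also need.

```python
class LineWalkEnv:
    """Hypothetical toy environment: walk along a line until |position| == goal."""

    def __init__(self, goal=3):
        self.goal = goal
        self.pos = 0

    def reset(self):
        # Go back to the initial state and return the initial observation.
        self.pos = 0
        return [float(self.pos)]

    def step(self, action):
        # Action 1 moves right, any other action moves left.
        self.pos += 1 if action == 1 else -1
        # The episode is terminal once either end of the line is reached.
        done = abs(self.pos) >= self.goal
        # Only reaching the right end is rewarded.
        reward = 1.0 if self.pos >= self.goal else 0.0
        info = {}
        return [float(self.pos)], reward, done, info


toy_env = LineWalkEnv()
toy_obs = toy_env.reset()
toy_done = False
while not toy_done:
    # Always move right, so the episode ends after three steps.
    toy_obs, toy_reward, toy_done, toy_info = toy_env.step(1)
print("observation:", toy_obs, "reward:", toy_reward, "done:", toy_done)
```

The four values returned by `step` are in the same order as the list above: next observation, reward, terminal flag, and additional information.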

Let's try `CartPole-v0`, which is a classic control problem. You can see below that its observation space consists of four real numbers while its action space consists of two discrete actions.

In [None]:
env = gym.make('CartPole-v0')
env.seed(0)
env = gnwrapper.Monitor(env, directory="./train", force=True, video_callable=lambda num: num % 50 == 0)  # Starts Xvfb; force=True overwrites existing saved videos
print('observation space:', env.observation_space)
print('action space:', env.action_space)

obs = env.reset()
print('initial observation:', obs)

action = env.action_space.sample()
obs, r, done, info = env.step(action)
print('next observation:', obs)
print('reward:', r)
print('done:', done)
print('info:', info)

# Render the current state of the environment
env.render()

observation space: Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)
action space: Discrete(2)
initial observation: [-0.04456399  0.04653909  0.01326909 -0.02099827]
next observation: [-0.04363321 -0.14877061  0.01284913  0.2758415 ]
reward: 1.0
done: False
info: {}


True

Now you have defined your environment. Next, you need to define an agent, which will learn through interactions with the environment.

PFRL provides various agents, each of which implements a deep reinforcement learning algorithm.

Let's try using the DoubleDQN algorithm (https://arxiv.org/abs/1509.06461), which is implemented by `pfrl.agents.DoubleDQN`. This algorithm trains a Q-function that receives an observation and returns an expected future return for each action the agent can take. In PFRL, you can define your Q-function as `torch.nn.Module` as below. Note that the outputs are wrapped by `pfrl.action_value.DiscreteActionValue`. By wrapping the outputs of Q-functions, PFRL can support not only discrete-action Q-functions like this but also continuous-action Q-functions (via [Normalized Advantage Functions](https://arxiv.org/abs/1603.00748)) in the same way.

In [None]:
class QFunction(torch.nn.Module):

    def __init__(self, obs_size, n_actions):
        super().__init__()
        self.l1 = torch.nn.Linear(obs_size, 50)
        self.l2 = torch.nn.Linear(50, 50)
        self.l3 = torch.nn.Linear(50, n_actions)

    def forward(self, x):
        h = x
        h = torch.nn.functional.relu(self.l1(h))
        h = torch.nn.functional.relu(self.l2(h))
        h = self.l3(h)
        return pfrl.action_value.DiscreteActionValue(h)

obs_size = env.observation_space.low.size
n_actions = env.action_space.n
q_func = QFunction(obs_size, n_actions)
print(q_func)

QFunction(
  (l1): Linear(in_features=4, out_features=50, bias=True)
  (l2): Linear(in_features=50, out_features=50, bias=True)
  (l3): Linear(in_features=50, out_features=2, bias=True)
)


It is also possible to define the same model using `torch.nn.Sequential`. `pfrl.q_functions.DiscreteActionValueHead` is just a `torch.nn.Module` that wraps its input in a `pfrl.action_value.DiscreteActionValue`.

In [None]:
q_func2 = torch.nn.Sequential(
    torch.nn.Linear(obs_size, 50),
    torch.nn.ReLU(),
    torch.nn.Linear(50, 50),
    torch.nn.ReLU(),
    torch.nn.Linear(50, n_actions),
    pfrl.q_functions.DiscreteActionValueHead(),
)
print(q_func2)

Sequential(
  (0): Linear(in_features=4, out_features=50, bias=True)
  (1): ReLU()
  (2): Linear(in_features=50, out_features=50, bias=True)
  (3): ReLU()
  (4): Linear(in_features=50, out_features=2, bias=True)
  (5): DiscreteActionValueHead()
)


As usual in PyTorch, `torch.optim.Optimizer` is used to optimize a model.

In [None]:
# Use Adam to optimize q_func. eps=1e-2 is for stability.
optimizer = torch.optim.Adam(q_func.parameters(), eps=1e-2)
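
Why might a larger `eps` help stability? One plausible reading is that `eps` sits in the denominator of Adam's update, so a larger value shrinks the step when gradients are tiny. The sketch below is a hand-written scalar version of the first Adam step with Adam's usual default hyperparameters. `adam_first_step` is a hypothetical helper written for this illustration, not a PyTorch function.

```python
import math


def adam_first_step(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update for one scalar parameter (illustrative only)."""
    m = (1 - beta1) * grad        # first-moment estimate after one step
    v = (1 - beta2) * grad ** 2   # second-moment estimate after one step
    m_hat = m / (1 - beta1)       # bias correction at step 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


tiny_grad = 1e-4
step_default_eps = adam_first_step(tiny_grad, eps=1e-8)
step_large_eps = adam_first_step(tiny_grad, eps=1e-2)
print("step with eps=1e-8:", step_default_eps)
print("step with eps=1e-2:", step_large_eps)
print("ratio:", step_default_eps / step_large_eps)
```

With the default `eps`, even a very small gradient produces a step close to the full learning rate, while `eps=1e-2` makes that step roughly a hundred times smaller in this example.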

To create a DoubleDQN agent with the Q-function and optimizer, you need to specify a few more parameters and configuration options.

In [None]:
# Set the discount factor that discounts future rewards.
gamma = 0.9

# Use epsilon-greedy for exploration
explorer = pfrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.3, random_action_func=env.action_space.sample)

# DQN uses Experience Replay.
# Specify a replay buffer and its capacity.
replay_buffer = pfrl.replay_buffers.ReplayBuffer(capacity=10 ** 6)

# Since observations from CartPole-v0 are numpy.float64 while
# PyTorch only accepts numpy.float32 by default, specify
# a converter as a feature extractor function phi.
phi = lambda x: x.astype(numpy.float32, copy=False)

# Set the device id to use GPU. To use CPU only, set it to -1.
gpu = -1

# Now create an agent that will interact with the environment.
agent = pfrl.agents.DoubleDQN(
    q_func,
    optimizer,
    replay_buffer,
    gamma,
    explorer,
    replay_start_size=500,
    update_interval=1,
    target_update_interval=100,
    phi=phi,
    gpu=gpu,
)
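
To illustrate what the epsilon-greedy explorer configured above does conceptually, here is a small pure-Python sketch. It is not PFRL's implementation: `epsilon_greedy` is a hypothetical helper, and the Q-values are made up for the example.

```python
import random


def epsilon_greedy(q_values, epsilon, random_action_func, rng=random):
    """Return a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return random_action_func()
    # Greedy action: the index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
made_up_q_values = [0.2, 1.5]  # hypothetical Q-values for two actions
chosen = [
    epsilon_greedy(made_up_q_values, 0.3, lambda: rng.randrange(2), rng)
    for _ in range(10_000)
]
greedy_fraction = chosen.count(1) / len(chosen)
print("fraction of greedy actions:", greedy_fraction)
```

With `epsilon=0.3` and two actions, the greedy action is expected about 85% of the time: 70% from greedy choices plus half of the 30% random ones.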

Now you have an agent and an environment. It's time to start reinforcement learning!

During training, two methods of `agent` must be called: `agent.act` and `agent.observe`. `agent.act(obs)` takes the current observation as input and returns an exploratory action. Once the returned action is processed in the env, `agent.observe(obs, reward, done, reset)` then observes the consequences:
- `obs`: next observation.
- `reward`: an immediate reward.
- `done`: a boolean value set to True if it reached a terminal state.
- `reset`: a boolean value set to True if an episode is interrupted at a non-terminal state, typically by a time limit.
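
The distinction between `done` and `reset` matters for value estimates. The following sketch shows the usual reasoning with a one-step target. It is an illustration under stated assumptions, not PFRL's internal code: `td_target` is a hypothetical helper and the numbers are made up.

```python
def td_target(reward, next_value, done, gamma=0.9):
    """One-step target that bootstraps unless the next state is terminal."""
    if done:
        # Terminal state: there is no future reward to add.
        return reward
    # Non-terminal state: add the discounted value of the next state.
    return reward + gamma * next_value


# The episode ended in a terminal state (done=True).
terminal_target = td_target(1.0, next_value=8.0, done=True)
# The episode was cut off by a time limit (reset=True, done=False):
# the state is still non-terminal, so the next value is still included.
truncated_target = td_target(1.0, next_value=8.0, done=False)
print("terminal:", terminal_target, "truncated:", truncated_target)
```

Treating a time-limit cutoff as terminal would make the agent underestimate the value of states it could have continued from, which is why the two flags are passed separately.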

Optionally, you can get training statistics of the agent via `agent.get_statistics`.

In [None]:
n_episodes = 300
max_episode_len = 200
for i in range(1, n_episodes + 1):
    obs = env.reset()
    R = 0  # return (sum of rewards)
    t = 0  # time step
    while True:
        # Render the current state; comment this out to skip rendering
        env.render()
        action = agent.act(obs)
        obs, reward, done, _ = env.step(action)
        R += reward
        t += 1
        reset = t == max_episode_len
        agent.observe(obs, reward, done, reset)
        if done or reset:
            break
    if i % 10 == 0:
        print('episode:', i, 'R:', R)
    if i % 50 == 0:
        print('statistics:', agent.get_statistics())
print('Finished.')

episode: 10 R: 11.0
episode: 20 R: 17.0
episode: 30 R: 9.0
episode: 40 R: 13.0
episode: 50 R: 12.0
statistics: [('average_q', 1.0619414), ('average_loss', 0.16395461067557335), ('cumulative_steps', 574), ('n_updates', 75), ('rlen', 574)]
episode: 60 R: 12.0
episode: 70 R: 15.0
episode: 80 R: 22.0
episode: 90 R: 13.0
episode: 100 R: 12.0
statistics: [('average_q', 5.1343384), ('average_loss', 0.21474976122844963), ('cumulative_steps', 1284), ('n_updates', 785), ('rlen', 1284)]
episode: 110 R: 46.0
episode: 120 R: 37.0
episode: 130 R: 64.0
episode: 140 R: 107.0
episode: 150 R: 110.0
statistics: [('average_q', 9.557208), ('average_loss', 0.17757614892208948), ('cumulative_steps', 4576), ('n_updates', 4077), ('rlen', 4576)]
episode: 160 R: 200.0
episode: 170 R: 197.0
episode: 180 R: 168.0
episode: 190 R: 200.0
episode: 200 R: 180.0
statistics: [('average_q', 10.098523), ('average_loss', 0.09004098737146705), ('cumulative_steps', 12841), ('n_updates', 12342), ('rlen', 12841)]
episode: 210 R
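
The `average_q` values in the log above settle near 10. A quick back-of-the-envelope check suggests why: CartPole gives a reward of 1 per step, and with `gamma = 0.9` the discounted sum of such rewards approaches 1 / (1 - 0.9) = 10. The helper below is a hypothetical function written for this sketch.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where each later reward is discounted by gamma per step."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


discount = 0.9
long_episode = discounted_return([1.0] * 200, discount)  # 200 steps of reward 1
limit = 1.0 / (1.0 - discount)                           # infinite-horizon limit
print("200-step discounted return:", long_episode)
print("infinite-horizon limit:", limit)
```

The undiscounted return `R` printed in the log can reach 200, but the Q-function estimates the discounted return, so the two numbers are not directly comparable.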

Now you have finished training the DoubleDQN agent for 300 episodes. How good is the agent now? You can evaluate it inside a `with agent.eval_mode()` block, in which exploration such as epsilon-greedy is no longer used.

In [None]:
with agent.eval_mode():
    for i in range(10):
        obs = env.reset()
        R = 0
        t = 0
        while True:
            # Uncomment to watch the behavior in a GUI window
            # env.render()
            action = agent.act(obs)
            obs, r, done, _ = env.step(action)
            R += r
            t += 1
            reset = t == 200
            agent.observe(obs, r, done, reset)
            if done or reset:
                break
        print('evaluation episode:', i, 'R:', R)

evaluation episode: 0 R: 200.0
evaluation episode: 1 R: 200.0
evaluation episode: 2 R: 198.0
evaluation episode: 3 R: 200.0
evaluation episode: 4 R: 200.0
evaluation episode: 5 R: 194.0
evaluation episode: 6 R: 200.0
evaluation episode: 7 R: 200.0
evaluation episode: 8 R: 177.0
evaluation episode: 9 R: 200.0
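
One simple way to summarize an evaluation run is to aggregate the per-episode returns. The snippet below is only a sketch: it uses the ten returns printed above, copied by hand into a list, and the standard-library `statistics` module.

```python
import statistics

# Returns copied from the ten evaluation episodes printed above.
eval_returns = [200.0, 200.0, 198.0, 200.0, 200.0,
                194.0, 200.0, 200.0, 177.0, 200.0]

mean_return = statistics.mean(eval_returns)
worst_return = min(eval_returns)
episodes_at_cap = sum(r == 200.0 for r in eval_returns)

print("mean return:", mean_return)
print("worst return:", worst_return)
print("episodes reaching 200:", episodes_at_cap)
```

Looking at the mean together with the worst episode gives a fuller picture than any single episode.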


For reference, the maximum achievable return in `CartPole-v0` is 200. If the agent did not reach 200, it was unlucky! You can train the agent longer by running the training loop again.

If the results are good enough, the only remaining task is to save the agent so that you can reuse it. Simply call `agent.save` to save the agent and `agent.load` to load a saved one.

In [None]:
# Display the videos recorded by the Monitor wrapper
env.display()
# Save an agent to the 'agent' directory
agent.save('agent')

# Uncomment to load an agent from the 'agent' directory
# agent.load('agent')

'openaigym.video.0.59.video000000.mp4'

'openaigym.video.0.59.video000050.mp4'

'openaigym.video.0.59.video000100.mp4'

'openaigym.video.0.59.video000150.mp4'

'openaigym.video.0.59.video000200.mp4'

'openaigym.video.0.59.video000250.mp4'

'openaigym.video.0.59.video000300.mp4'

RL completed!

But writing code like this every time you use RL might be tedious, so PFRL provides utility functions that do these things for you.

In [None]:
# Set up the logger to print info messages for understandability.
import logging
import sys
logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

pfrl.experiments.train_agent_with_evaluation(
    agent,
    env,
    steps=2000,                 # Train the agent for 2000 steps
    eval_n_steps=None,          # Evaluate by number of episodes, not time steps
    eval_n_episodes=10,         # Sample 10 episodes for each evaluation
    train_max_episode_len=200,  # Maximum length of each episode
    eval_interval=1000,         # Evaluate the agent every 1000 steps
    outdir='result',            # Save everything to the 'result' directory
)

outdir:result step:183 episode:0 R:183.0
statistics:[('average_q', 10.135443), ('average_loss', 0.050023208315251394), ('cumulative_steps', 24184), ('n_updates', 23685), ('rlen', 24184)]
outdir:result step:379 episode:1 R:196.0
statistics:[('average_q', 10.094383), ('average_loss', 0.05948881300224457), ('cumulative_steps', 24380), ('n_updates', 23881), ('rlen', 24380)]
outdir:result step:544 episode:2 R:165.0
statistics:[('average_q', 10.024766), ('average_loss', 0.053636776396306235), ('cumulative_steps', 24545), ('n_updates', 24046), ('rlen', 24545)]
outdir:result step:646 episode:3 R:102.0
statistics:[('average_q', 9.990319), ('average_loss', 0.05371900959813502), ('cumulative_steps', 24647), ('n_updates', 24148), ('rlen', 24647)]
outdir:result step:784 episode:4 R:138.0
statistics:[('average_q', 9.876981), ('average_loss', 0.06340783993829974), ('cumulative_steps', 24785), ('n_updates', 24286), ('rlen', 24785)]
outdir:result step:925 episode:5 R:141.0
statistics:[('average_q', 9.9

(<pfrl.agents.double_dqn.DoubleDQN at 0x7f63e4334550>,
 [{'average_loss': 0.06374016008689068,
   'average_q': 10.000138,
   'cumulative_steps': 25058,
   'eval_score': 131.2,
   'n_updates': 24559,
   'rlen': 25058},
  {'average_loss': 0.056886804039822894,
   'average_q': 9.949673,
   'cumulative_steps': 26001,
   'eval_score': 147.0,
   'n_updates': 25502,
   'rlen': 26001}])

That's all for the PFRL quickstart guide. To learn more about PFRL, look into the `examples` directory and read and run the examples. Thank you!