# Reinforcement
This notebook focuses on the DQNAgent and the PGAgent in the reinforcement module of PAI-Utils.

## Import Packages

In [1]:
import gym
import numpy as np
import tensorflow as tf
from tensorflow import keras

from paiutils import neural_network as nn
from paiutils import reinforcement as rl


# List the available GPUs and, if any are found, enable memory growth
gpus = tf.config.list_physical_devices('GPU')
print(gpus)
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## Create Environment
We will use the CartPole-v0 environment. For more information on this environment, see the [CartPole-v0 wiki page](https://github.com/openai/gym/wiki/CartPole-v0).

In [2]:
genv = gym.make('CartPole-v0')
max_steps = genv.spec.max_episode_steps
print(max_steps)
print(genv.observation_space, genv.action_space)

env = rl.GymWrapper(genv)

200
Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32) Discrete(2)


## DQNAgent

### Create Q-Network

In [3]:
x0 = keras.layers.Input(shape=env.state_shape)
x = nn.dense(32)(x0)
# Two parallel streams feed the dueling output layer
x1 = nn.dense(16)(x)
x2 = nn.dense(16)(x)
# Non-dueling alternative:
# outputs = keras.layers.Dense(env.action_size)(x)
outputs = rl.DQNAgent.get_dueling_output_layer(
    env.action_size, dueling_type='avg'
)(x1, x2)
qmodel = keras.Model(inputs=x0,
                     outputs=outputs)
qmodel.compile(optimizer=keras.optimizers.Adam(.001),
               loss='mse')
qmodel.summary()

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 4)]          0                                            
__________________________________________________________________________________________________
dense (Dense)                   (None, 32)           160         input_1[0][0]                    
__________________________________________________________________________________________________
batch_normalization (BatchNorma (None, 32)           128         dense[0][0]                      
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 16)           528         batch_normalization[0][0]        
_______________________________________________________________________________________
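
The dueling head used above can be illustrated outside of Keras. The following is a minimal NumPy sketch, assuming that `dueling_type='avg'` refers to the common mean-subtracted aggregation Q(s, a) = V(s) + A(s, a) - mean(A(s, ·)); the function name `dueling_q_values` is hypothetical and is not part of PAI-Utils, whose layer may be implemented differently.

```python
import numpy as np


def dueling_q_values(value, advantages):
    """Combine V(s) and A(s, a) with the mean-subtracted ('avg') rule.

    Hypothetical helper for illustration; not the PAI-Utils implementation.
    """
    value = np.asarray(value, dtype=np.float64)            # shape (batch, 1)
    advantages = np.asarray(advantages, dtype=np.float64)  # shape (batch, n_actions)
    # Subtracting the mean advantage keeps V and A identifiable
    return value + advantages - advantages.mean(axis=1, keepdims=True)


q = dueling_q_values([[1.0]], [[2.0, 4.0]])
print(q)  # [[0. 2.]]
```

Subtracting the mean leaves the ranking of the actions unchanged, so the greedy action is the same as the one with the largest advantage.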

### Create Agent

In [4]:
policy = rl.StochasticPolicy(
    rl.GreedyPolicy(),
    rl.ExponentialDecay(.5, .1, .01, step_every_call=False),
    0, env.action_size
)
discounted_rate = .99
agent = rl.DQNAgent(
    policy, qmodel, discounted_rate,
    enable_target=False, enable_double=False,
    enable_per=False
)
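
The policy above wraps a greedy policy with a decaying exploration rate. As a rough illustration only, the sketch below implements epsilon-greedy action selection with an exponentially decaying epsilon. Reading the three numbers passed to `rl.ExponentialDecay` as an initial value, a decay rate, and a floor is an assumption, as is the exact decay formula; `exponential_decay` and `epsilon_greedy` are hypothetical names, not PAI-Utils functions.

```python
import math
import random


def exponential_decay(initial, rate, minimum, step):
    """Assumed schedule: initial * exp(-rate * step), floored at minimum."""
    return max(minimum, initial * math.exp(-rate * step))


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Exploration rate after a few (assumed per-episode) decay steps
for episode in (0, 10, 50):
    print(episode, exponential_decay(0.5, 0.1, 0.01, episode))
```

Under these assumptions the agent starts by exploring on half of its actions and settles at a small residual exploration rate.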

### Train the Agent

In [5]:
# Warmup: fill the memory by playing random episodes
agent.set_playing_data(memorizing=True, verbose=True)
env.play_episodes(agent, 8, max_steps, random=True,
                  verbose=True, episode_verbose=False,
                  render=False)

agent.set_playing_data(
    training=True, memorizing=True,
    learns_in_episode=False, batch_size=16,
    mini_batch=0, epochs=1, repeat=50,
    target_update_interval=1, tau=1.0,
    verbose=False
)
save_dir = ''
num_episodes = 6
for ndx in range(1):
    print(f'Save Loop: {ndx}')
    result = env.play_episodes(
        agent, num_episodes, max_steps,
        verbose=True, episode_verbose=False,
        render=False
    )
    agent.save(save_dir, note=f'DQN_{ndx}_{result}')

Time: 17:02:39 - Episode: 1 - Steps: 33 - Total Reward: 33.0 - Best Total Reward: 33.0 - Average Total Reward: 33.0 - Memory Size: 33
Time: 17:02:39 - Episode: 2 - Steps: 24 - Total Reward: 24.0 - Best Total Reward: 33.0 - Average Total Reward: 28.5 - Memory Size: 57
Time: 17:02:39 - Episode: 3 - Steps: 33 - Total Reward: 33.0 - Best Total Reward: 33.0 - Average Total Reward: 30.0 - Memory Size: 90
Time: 17:02:39 - Episode: 4 - Steps: 13 - Total Reward: 13.0 - Best Total Reward: 33.0 - Average Total Reward: 25.75 - Memory Size: 103
Time: 17:02:39 - Episode: 5 - Steps: 31 - Total Reward: 31.0 - Best Total Reward: 33.0 - Average Total Reward: 26.8 - Memory Size: 134
Time: 17:02:39 - Episode: 6 - Steps: 18 - Total Reward: 18.0 - Best Total Reward: 33.0 - Average Total Reward: 25.333333333333332 - Memory Size: 152
Time: 17:02:39 - Episode: 7 - Steps: 17 - Total Reward: 17.0 - Best Total Reward: 33.0 - Average Total Reward: 24.142857142857142 - Memory Size: 169
Time: 17:02:39 - Episode: 8 -
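
For readers new to DQN, the regression targets fitted during training can be sketched as follows. This is the standard one-step Q-learning target without a target network or double DQN, which matches the `enable_target=False, enable_double=False` settings only in spirit; the helper `td_targets` is hypothetical and does not reflect how PAI-Utils computes its targets internally.

```python
import numpy as np


def td_targets(rewards, next_q_values, terminals, discounted_rate=0.99):
    """One-step targets: r + gamma * max_a Q(s', a), with no bootstrap at terminal states."""
    rewards = np.asarray(rewards, dtype=np.float64)
    next_q_values = np.asarray(next_q_values, dtype=np.float64)
    # 1.0 for transitions that continue, 0.0 for transitions that ended the episode
    not_done = 1.0 - np.asarray(terminals, dtype=np.float64)
    return rewards + discounted_rate * not_done * next_q_values.max(axis=1)


targets = td_targets(
    rewards=[1.0, 1.0],
    next_q_values=[[2.0, 3.0], [5.0, 4.0]],
    terminals=[False, True],
)
print(targets)
```

The second transition ends the episode, so its target is just the immediate reward.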

### Test the Agent

In [6]:
agent.set_playing_data(training=False,
                       memorizing=False)
step, total_reward = env.play_episode(
    agent, max_steps,
    verbose=True, render=False
)
print(total_reward)

Step: 1 - Reward: 1.0 - Action: 0
Step: 2 - Reward: 1.0 - Action: 1
Step: 3 - Reward: 1.0 - Action: 1
Step: 4 - Reward: 1.0 - Action: 0
Step: 5 - Reward: 1.0 - Action: 0
Step: 6 - Reward: 1.0 - Action: 1
Step: 7 - Reward: 1.0 - Action: 0
Step: 8 - Reward: 1.0 - Action: 1
Step: 9 - Reward: 1.0 - Action: 1
Step: 10 - Reward: 1.0 - Action: 0
Step: 11 - Reward: 1.0 - Action: 0
Step: 12 - Reward: 1.0 - Action: 1
Step: 13 - Reward: 1.0 - Action: 0
Step: 14 - Reward: 1.0 - Action: 1
Step: 15 - Reward: 1.0 - Action: 1
Step: 16 - Reward: 1.0 - Action: 0
Step: 17 - Reward: 1.0 - Action: 0
Step: 18 - Reward: 1.0 - Action: 1
Step: 19 - Reward: 1.0 - Action: 0
Step: 20 - Reward: 1.0 - Action: 1
Step: 21 - Reward: 1.0 - Action: 0
Step: 22 - Reward: 1.0 - Action: 1
Step: 23 - Reward: 1.0 - Action: 1
Step: 24 - Reward: 1.0 - Action: 0
Step: 25 - Reward: 1.0 - Action: 0
Step: 26 - Reward: 1.0 - Action: 1
Step: 27 - Reward: 1.0 - Action: 0
Step: 28 - Reward: 1.0 - Action: 1
Step: 29 - Reward: 1.0 - Acti

#### Solved?
The CartPole environment is considered solved when the average reward is at least 195.0 over 100 consecutive episodes.
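
As a small, library-independent illustration of this criterion, the sketch below checks a plain list of episode rewards. The function name `is_solved` and its parameters are hypothetical; the notebook itself relies on the value returned by `env.play_episodes`.

```python
def is_solved(episode_rewards, threshold=195.0, window=100):
    """Return True if the last `window` episodes average at least `threshold`."""
    if len(episode_rewards) < window:
        # Not enough consecutive episodes to judge yet
        return False
    recent = episode_rewards[-window:]
    return sum(recent) / window >= threshold


print(is_solved([200.0] * 100))
print(is_solved([200.0] * 99))
```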

In [7]:
num_episodes = 100
agent.set_playing_data(training=False,
                       memorizing=False)
result = env.play_episodes(
    agent, num_episodes, max_steps,
    verbose=True, episode_verbose=False,
    render=False
)
print(f'Solved: {result >= 195}')

Time: 17:03:25 - Episode: 1 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 763
Time: 17:03:25 - Episode: 2 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 763
Time: 17:03:25 - Episode: 3 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 763
Time: 17:03:26 - Episode: 4 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 763
Time: 17:03:26 - Episode: 5 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 763
Time: 17:03:27 - Episode: 6 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 763
Time: 17:03:27 - Episode: 7 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 763
Time: 17:03:28 - Episode: 8

Time: 17:03:51 - Episode: 58 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 199.06896551724137 - Memory Size: 763
Time: 17:03:51 - Episode: 59 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 199.08474576271186 - Memory Size: 763
Time: 17:03:52 - Episode: 60 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 199.1 - Memory Size: 763
Time: 17:03:52 - Episode: 61 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 199.11475409836066 - Memory Size: 763
Time: 17:03:53 - Episode: 62 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 199.1290322580645 - Memory Size: 763
Time: 17:03:53 - Episode: 63 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 199.14285714285714 - Memory Size: 763
Time: 17:03:54 - Episode: 64 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Ave

## PGAgent

### Create Actor Model

In [67]:
inputs = keras.layers.Input(shape=env.state_shape)
x = nn.dense(32)(inputs)
x = nn.dense(32)(x)
outputs = nn.dense(env.action_size,
                   activation='softmax',
                   batch_norm=False)(x)

amodel = keras.Model(inputs=inputs,
                     outputs=outputs)
amodel.compile(optimizer=keras.optimizers.Adam(.01),
               loss='mse')
amodel.summary()

Model: "functional_33"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_17 (InputLayer)        [(None, 4)]               0         
_________________________________________________________________
dense_48 (Dense)             (None, 32)                160       
_________________________________________________________________
batch_normalization_32 (Batc (None, 32)                128       
_________________________________________________________________
dense_49 (Dense)             (None, 32)                1056      
_________________________________________________________________
batch_normalization_33 (Batc (None, 32)                128       
_________________________________________________________________
dense_50 (Dense)             (None, 2)                 66        
Total params: 1,538
Trainable params: 1,410
Non-trainable params: 128
_________________________________________________

### Create Agent

In [68]:
discounted_rate = .99
agent = rl.PGAgent(
    amodel, discounted_rate
)
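
The `discounted_rate` passed to the agent weights future rewards. As a minimal sketch of the idea, assuming the usual REINFORCE-style return G_t = r_t + gamma * G_{t+1}, the discounted returns of one episode can be computed as below. The helper `discounted_returns` is hypothetical, and PAI-Utils may normalize or otherwise post-process these values differently.

```python
import numpy as np


def discounted_returns(rewards, discounted_rate=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} by sweeping the episode backwards."""
    rewards = np.asarray(rewards, dtype=np.float64)
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + discounted_rate * running
        returns[t] = running
    return returns


# CartPole gives a reward of 1.0 per step, so earlier steps earn larger returns
print(discounted_returns([1.0, 1.0, 1.0]))
```

In a policy gradient update, these returns scale how strongly each sampled action's log-probability is pushed up.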

### Train the Agent

In [69]:
# No warmup needed
#agent.set_playing_data(memorizing=True, verbose=True)
#env.play_episodes(agent, 1, max_steps, random=True,
#                  verbose=True, episode_verbose=False,
#                  render=False)

agent.set_playing_data(
    training=True, memorizing=True,
    batch_size=16, mini_batch=0,
    epochs=5, repeat=1,
    entropy_coef=0,
    verbose=False
)
save_dir = ''
num_episodes = 30
for ndx in range(1):
    print(f'Save Loop: {ndx}')
    result = env.play_episodes(
        agent, num_episodes, max_steps,
        verbose=True, episode_verbose=False,
        render=False
    )
    agent.save(save_dir, note=f'PG_{ndx}_{result}')

Save Loop: 0
Time: 20:18:00 - Episode: 1 - Steps: 13 - Total Reward: 13.0 - Best Total Reward: 13.0 - Average Total Reward: 13.0 - Memory Size: 13
Time: 20:18:00 - Episode: 2 - Steps: 26 - Total Reward: 26.0 - Best Total Reward: 26.0 - Average Total Reward: 19.5 - Memory Size: 39
Time: 20:18:00 - Episode: 3 - Steps: 27 - Total Reward: 27.0 - Best Total Reward: 27.0 - Average Total Reward: 22.0 - Memory Size: 66
Time: 20:18:00 - Episode: 4 - Steps: 10 - Total Reward: 10.0 - Best Total Reward: 27.0 - Average Total Reward: 19.0 - Memory Size: 76
Time: 20:18:00 - Episode: 5 - Steps: 8 - Total Reward: 8.0 - Best Total Reward: 27.0 - Average Total Reward: 16.8 - Memory Size: 84
Time: 20:18:00 - Episode: 6 - Steps: 10 - Total Reward: 10.0 - Best Total Reward: 27.0 - Average Total Reward: 15.666666666666666 - Memory Size: 94
Time: 20:18:01 - Episode: 7 - Steps: 46 - Total Reward: 46.0 - Best Total Reward: 46.0 - Average Total Reward: 20.0 - Memory Size: 140
Time: 20:18:01 - Episode: 8 - Steps:

### Test the Agent

In [70]:
agent.set_playing_data(training=False,
                       memorizing=False)
step, total_reward = env.play_episode(
    agent, max_steps,
    verbose=True, render=False
)
print(total_reward)

Step: 1 - Reward: 1.0 - Action: 1
Step: 2 - Reward: 1.0 - Action: 0
Step: 3 - Reward: 1.0 - Action: 1
Step: 4 - Reward: 1.0 - Action: 0
Step: 5 - Reward: 1.0 - Action: 0
Step: 6 - Reward: 1.0 - Action: 1
Step: 7 - Reward: 1.0 - Action: 0
Step: 8 - Reward: 1.0 - Action: 1
Step: 9 - Reward: 1.0 - Action: 0
Step: 10 - Reward: 1.0 - Action: 1
Step: 11 - Reward: 1.0 - Action: 0
Step: 12 - Reward: 1.0 - Action: 1
Step: 13 - Reward: 1.0 - Action: 0
Step: 14 - Reward: 1.0 - Action: 1
Step: 15 - Reward: 1.0 - Action: 0
Step: 16 - Reward: 1.0 - Action: 1
Step: 17 - Reward: 1.0 - Action: 0
Step: 18 - Reward: 1.0 - Action: 1
Step: 19 - Reward: 1.0 - Action: 0
Step: 20 - Reward: 1.0 - Action: 1
Step: 21 - Reward: 1.0 - Action: 0
Step: 22 - Reward: 1.0 - Action: 1
Step: 23 - Reward: 1.0 - Action: 0
Step: 24 - Reward: 1.0 - Action: 0
Step: 25 - Reward: 1.0 - Action: 1
Step: 26 - Reward: 1.0 - Action: 0
Step: 27 - Reward: 1.0 - Action: 1
Step: 28 - Reward: 1.0 - Action: 0
Step: 29 - Reward: 1.0 - Acti

#### Solved?
As previously mentioned, the CartPole environment is considered solved when the average reward is at least 195.0 over 100 consecutive episodes.

In [71]:
num_episodes = 100
agent.set_playing_data(training=False,
                       memorizing=False)
result = env.play_episodes(
    agent, num_episodes, max_steps,
    verbose=True, episode_verbose=False,
    render=False
)
print(f'Solved: {result >= 195}')

Time: 20:18:35 - Episode: 1 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 3147
Time: 20:18:35 - Episode: 2 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 3147
Time: 20:18:35 - Episode: 3 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 3147
Time: 20:18:35 - Episode: 4 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 3147
Time: 20:18:36 - Episode: 5 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 3147
Time: 20:18:36 - Episode: 6 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 3147
Time: 20:18:36 - Episode: 7 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 3147
Time: 20:18:37 - Epi

Time: 20:18:51 - Episode: 60 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 3147
Time: 20:18:51 - Episode: 61 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 3147
Time: 20:18:51 - Episode: 62 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 3147
Time: 20:18:51 - Episode: 63 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 3147
Time: 20:18:52 - Episode: 64 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 3147
Time: 20:18:52 - Episode: 65 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 3147
Time: 20:18:52 - Episode: 66 - Steps: 200 - Total Reward: 200.0 - Best Total Reward: 200.0 - Average Total Reward: 200.0 - Memory Size: 3147
Time: 20:18:5

## Gameplay
[![Gameplay of the DQN and PG Agents](http://img.youtube.com/vi/0d3U2tkhEkM/0.jpg)](https://www.youtube.com/watch?v=0d3U2tkhEkM)