## Apress - Industrialized Machine Learning Examples

Andreas Francois Vermeulen
2019

### This is an example add-on to a book and needs to be accepted as part of that copyright.

## Chapter-009-015-Q-Learn-01

### Install keras-rl library

In [1]:
!pip install keras-rl

Collecting keras-rl
  Downloading https://files.pythonhosted.org/packages/ab/87/4b57eff8e4bd834cea0a75cd6c58198c9e42be29b600db9c14fafa72ec07/keras-rl-0.4.2.tar.gz (40kB)
Collecting keras>=2.0.7 (from keras-rl)
  Downloading https://files.pythonhosted.org/packages/5e/10/aa32dad071ce52b5502266b5c659451cfd6ffcbf14e6c8c4f16c0ff5aaab/Keras-2.2.4-py2.py3-none-any.whl (312kB)
Building wheels for collected packages: keras-rl
  Building wheel for keras-rl (setup.py): started
  Building wheel for keras-rl (setup.py): finished with status 'done'
  Stored in directory: C:\Users\AndreVermeulen\AppData\Local\pip\Cache\wheels\7d\4d\84\9254c9f2e8f51865cb0dac8e79da85330c735551d31f73c894
Successfully built keras-rl
Installing collected packages: keras, keras-rl
Successfully installed keras-2.2.4 keras-rl-0.4.2


In [2]:
!pip install pyglet

Collecting pyglet
  Downloading https://files.pythonhosted.org/packages/1c/fc/dad5eaaab68f0c21e2f906a94ddb98175662cc5a654eee404d59554ce0fa/pyglet-1.3.2-py2.py3-none-any.whl (1.0MB)
Installing collected packages: pyglet
Successfully installed pyglet-1.3.2


### Install h5py

In [3]:
!pip install h5py



### Install dependencies for CartPole environment

In [4]:
!pip install gym

Collecting gym
  Downloading https://files.pythonhosted.org/packages/7b/57/e2fc4123ff2b4e3d61ae9b3d08c6878aecf2d5ec69b585ed53bc2400607f/gym-0.12.1.tar.gz (1.5MB)
Building wheels for collected packages: gym
  Building wheel for gym (setup.py): started
  Building wheel for gym (setup.py): finished with status 'done'
  Stored in directory: C:\Users\AndreVermeulen\AppData\Local\pip\Cache\wheels\57\b0\13\4153e1acab826fbe612c95b1336a63a3fa6416902a8d74a1b7
Successfully built gym
Installing collected packages: gym
Successfully installed gym-0.12.1


## You are ready to perform Q-Learning

In [5]:
%matplotlib inline
import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

Using TensorFlow backend.


Set the name of the Gym environment you want to solve.

In [6]:
ENV_NAME = 'CartPole-v0'

Get the environment, seed it so that runs are repeatable, and extract the number of actions available in the CartPole problem.

In [7]:
env = gym.make(ENV_NAME)
np.random.seed(20)
env.seed(20)
nb_actions = env.action_space.n

Create a neural network model with a single hidden layer.

In [8]:
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))

Instructions for updating:
Colocations handled automatically by placer.


In [9]:
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                80        
_________________________________________________________________
activation_1 (Activation)    (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 34        
_________________________________________________________________
activation_2 (Activation)    (None, 2)                 0         
Total params: 114
Trainable params: 114
Non-trainable params: 0
_________________________________________________________________
None
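The parameter counts in the summary can be checked by hand. The short sketch below assumes the usual rule for a fully connected layer (one weight per input-output pair plus one bias per unit); the helper name `dense_params` is made up for this illustration and is not part of Keras.

```python
# Parameter count of a fully connected (Dense) layer: one weight per
# input-output pair plus one bias per output unit.
def dense_params(n_inputs, n_units):
    return n_inputs * n_units + n_units

# Sizes read from the summary: 4 observation values, 16 hidden units, 2 actions.
hidden = dense_params(4, 16)
output = dense_params(16, 2)
total = hidden + output

print(hidden, output, total)  # 80 34 114
```

The Flatten and Activation layers hold no weights, which is why they show 0 parameters.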


Next, you configure and compile the agent. Use an epsilon-greedy policy to choose actions, and use sequential memory because you must store the actions the cart performed and the reward it received for each action.

In [10]:
policy = EpsGreedyQPolicy()

memory = SequentialMemory(limit=50000, 
                          window_length=1
                         )

dqn = DQNAgent(model=model, 
               nb_actions=nb_actions, 
               memory=memory, 
               nb_steps_warmup=1000, 
               target_model_update=1e-2, 
               policy=policy,
               enable_dueling_network=False,
               dueling_type='avg'
              )

dqn.compile(Adam(lr=1e-3), 
            metrics=['mae']
           )
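To make the two ideas in this cell concrete, here is a minimal plain-Python sketch of epsilon-greedy action selection and a bounded replay memory. It is not the keras-rl implementation: the names `epsilon_greedy` and `ReplayMemory`, the `eps` value of 0.1, and the tiny `limit` of 3 are all assumptions chosen for illustration.

```python
import random
from collections import deque

def epsilon_greedy(q_values, eps=0.1, rng=random):
    """With probability eps pick a random action, otherwise the best-valued one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

class ReplayMemory:
    """Bounded buffer of (state, action, reward, next_state, done) tuples."""
    def __init__(self, limit):
        # A deque with maxlen drops the oldest entry once the limit is reached.
        self.buffer = deque(maxlen=limit)

    def append(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        # Draw a random mini-batch of stored transitions without replacement.
        return rng.sample(list(self.buffer), batch_size)

rng = random.Random(20)
memory = ReplayMemory(limit=3)
for step in range(5):
    action = epsilon_greedy([0.2, 0.7], eps=0.1, rng=rng)
    memory.append((step, action, 1.0, step + 1, False))

print(len(memory.buffer))                   # 3: only the newest transitions remain
print(epsilon_greedy([0.2, 0.7], eps=0.0))  # 1: a purely greedy choice
```

Random exploration lets the agent try actions its current Q-values rate poorly, and sampling from stored transitions lets it learn from past experience more than once.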

Time to perform the training process.

In [11]:
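try:
  # Render the environment while training, if a display is available.
  dqn.fit(env, nb_steps=5000, visualize=True, verbose=2)
except Exception:
  # Rendering failed (for example, no display), so train without it.
  dqn.fit(env, nb_steps=5000, visualize=False, verbose=2)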

Training for 5000 steps ...
Instructions for updating:
Use tf.cast instead.
   27/5000: episode: 1, duration: 6.879s, episode steps: 27, steps per second: 4, episode reward: 27.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.259 [0.000, 1.000], mean observation: 0.024 [-2.483, 3.517], loss: 0.488774, mean_absolute_error: 0.559568, mean_q: 0.032028
   39/5000: episode: 2, duration: 0.392s, episode steps: 12, steps per second: 31, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.083 [0.000, 1.000], mean observation: 0.109 [-1.985, 3.012], loss: 0.425160, mean_absolute_error: 0.618904, mean_q: 0.251136
   47/5000: episode: 3, duration: 0.266s, episode steps: 8, steps per second: 30, episode reward: 8.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.000 [0.000, 0.000], mean observation: 0.153 [-1.585, 2.572], loss: 0.368349, mean_absolute_error: 0.598510, mean_q: 0.366937
   57/5000: episode: 4, duration: 0.316s, episode steps: 10, steps per second: 32, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.000 [0.000, 0.000], mean observation: 0.129 [-1.943, 2.996], loss: 0.318007, mean_absolute_error: 0.579923, mean_q: 0.488692
   67/5000: episode: 5, duration: 0.316s, episode steps: 10, steps per second: 32, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0

  326/5000: episode: 32, duration: 0.299s, episode steps: 9, steps per second: 30, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.000 [0.000, 0.000], mean observation: 0.150 [-1.810, 2.834], loss: 0.315337, mean_absolute_error: 1.057048, mean_q: 2.423726
  336/5000: episode: 33, duration: 0.333s, episode steps: 10, steps per second: 30, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.100 [0.000, 1.000], mean observation: 0.124 [-1.603, 2.504], loss: 0.459477, mean_absolute_error: 1.161498, mean_q: 2.567931
  346/5000: episode: 34, duration: 0.316s, episode steps: 10, steps per second: 32, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.100 [0.000, 1.000], mean observation: 0.131 [-1.783, 2.718], loss: 0.443892, mean_absolute_error: 1.210896, mean_q: 2.607481
  354/5000: episode: 35, duration: 0.268s, episode steps: 8, steps per second: 30, episode reward: 8.000, mean reward: 1.000 [1.000, 1.000], mean action:

  628/5000: episode: 61, duration: 0.365s, episode steps: 11, steps per second: 30, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.182 [0.000, 1.000], mean observation: 0.130 [-1.563, 2.453], loss: 0.281247, mean_absolute_error: 1.779749, mean_q: 3.984269
  637/5000: episode: 62, duration: 0.299s, episode steps: 9, steps per second: 30, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.111 [0.000, 1.000], mean observation: 0.146 [-1.605, 2.494], loss: 0.350577, mean_absolute_error: 1.821586, mean_q: 4.126432
  647/5000: episode: 63, duration: 0.332s, episode steps: 10, steps per second: 30, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.100 [0.000, 1.000], mean observation: 0.141 [-1.530, 2.549], loss: 0.334946, mean_absolute_error: 1.839072, mean_q: 4.064948
  659/5000: episode: 64, duration: 0.383s, episode steps: 12, steps per second: 31, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean actio

  926/5000: episode: 91, duration: 0.366s, episode steps: 11, steps per second: 30, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.182 [0.000, 1.000], mean observation: 0.131 [-1.605, 2.504], loss: 0.221861, mean_absolute_error: 2.394267, mean_q: 5.227825
  940/5000: episode: 92, duration: 0.466s, episode steps: 14, steps per second: 30, episode reward: 14.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.214 [0.000, 1.000], mean observation: 0.076 [-1.767, 2.639], loss: 0.234456, mean_absolute_error: 2.383928, mean_q: 5.185652
  950/5000: episode: 93, duration: 0.333s, episode steps: 10, steps per second: 30, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.100 [0.000, 1.000], mean observation: 0.153 [-1.743, 2.734], loss: 0.209340, mean_absolute_error: 2.435891, mean_q: 5.386909
  960/5000: episode: 94, duration: 0.332s, episode steps: 10, steps per second: 30, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean act

 1206/5000: episode: 120, duration: 0.353s, episode steps: 11, steps per second: 31, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.273 [0.000, 1.000], mean observation: 0.125 [-1.343, 2.076], loss: 0.086425, mean_absolute_error: 2.800220, mean_q: 5.657528
 1215/5000: episode: 121, duration: 0.296s, episode steps: 9, steps per second: 30, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.222 [0.000, 1.000], mean observation: 0.131 [-1.348, 2.119], loss: 0.091371, mean_absolute_error: 2.925205, mean_q: 5.917897
 1226/5000: episode: 122, duration: 0.334s, episode steps: 11, steps per second: 33, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.273 [0.000, 1.000], mean observation: 0.143 [-0.946, 1.816], loss: 0.078870, mean_absolute_error: 2.886313, mean_q: 5.823234
 1239/5000: episode: 123, duration: 0.432s, episode steps: 13, steps per second: 30, episode reward: 13.000, mean reward: 1.000 [1.000, 1.000], mean a

 1578/5000: episode: 149, duration: 0.799s, episode steps: 24, steps per second: 30, episode reward: 24.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.542 [0.000, 1.000], mean observation: -0.035 [-1.204, 0.637], loss: 0.934159, mean_absolute_error: 3.897507, mean_q: 7.342466
 1592/5000: episode: 150, duration: 0.466s, episode steps: 14, steps per second: 30, episode reward: 14.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.643 [0.000, 1.000], mean observation: -0.056 [-1.592, 1.027], loss: 0.907176, mean_absolute_error: 3.992537, mean_q: 7.586948
 1602/5000: episode: 151, duration: 0.316s, episode steps: 10, steps per second: 32, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.700 [0.000, 1.000], mean observation: -0.126 [-1.632, 0.947], loss: 1.038503, mean_absolute_error: 3.974708, mean_q: 7.551618
 1611/5000: episode: 152, duration: 0.299s, episode steps: 9, steps per second: 30, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mea

 1894/5000: episode: 178, duration: 0.282s, episode steps: 9, steps per second: 32, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.889 [0.000, 1.000], mean observation: -0.144 [-2.253, 1.369], loss: 1.988425, mean_absolute_error: 4.427250, mean_q: 8.125143
 1907/5000: episode: 179, duration: 0.416s, episode steps: 13, steps per second: 31, episode reward: 13.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.846 [0.000, 1.000], mean observation: -0.090 [-2.728, 1.804], loss: 1.714531, mean_absolute_error: 4.638865, mean_q: 8.543249
 1917/5000: episode: 180, duration: 0.333s, episode steps: 10, steps per second: 30, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.900 [0.000, 1.000], mean observation: -0.132 [-2.602, 1.617], loss: 1.580971, mean_absolute_error: 4.708157, mean_q: 8.626022
 1926/5000: episode: 181, duration: 0.298s, episode steps: 9, steps per second: 30, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean 

 2686/5000: episode: 207, duration: 4.335s, episode steps: 135, steps per second: 31, episode reward: 135.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.496 [0.000, 1.000], mean observation: -0.137 [-0.759, 0.594], loss: 1.271002, mean_absolute_error: 5.550431, mean_q: 10.531247
 2788/5000: episode: 208, duration: 3.384s, episode steps: 102, steps per second: 30, episode reward: 102.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.510 [0.000, 1.000], mean observation: -0.278 [-1.536, 0.859], loss: 1.552405, mean_absolute_error: 5.714554, mean_q: 10.818386
 2824/5000: episode: 209, duration: 1.183s, episode steps: 36, steps per second: 30, episode reward: 36.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.472 [0.000, 1.000], mean observation: -0.112 [-0.674, 0.190], loss: 1.674960, mean_absolute_error: 5.868788, mean_q: 11.099743
 2854/5000: episode: 210, duration: 0.999s, episode steps: 30, steps per second: 30, episode reward: 30.000, mean reward: 1.000 [1.000, 1.

 4475/5000: episode: 236, duration: 2.502s, episode steps: 76, steps per second: 30, episode reward: 76.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.474 [0.000, 1.000], mean observation: -0.102 [-0.856, 0.245], loss: 2.155948, mean_absolute_error: 8.938640, mean_q: 17.464560
 4551/5000: episode: 237, duration: 2.500s, episode steps: 76, steps per second: 30, episode reward: 76.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.474 [0.000, 1.000], mean observation: -0.117 [-0.766, 0.207], loss: 2.764911, mean_absolute_error: 9.155194, mean_q: 17.809622
 4618/5000: episode: 238, duration: 2.184s, episode steps: 67, steps per second: 31, episode reward: 67.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.463 [0.000, 1.000], mean observation: -0.113 [-0.855, 0.244], loss: 2.674924, mean_absolute_error: 9.253394, mean_q: 18.076935
 4759/5000: episode: 239, duration: 4.652s, episode steps: 141, steps per second: 30, episode reward: 141.000, mean reward: 1.000 [1.000, 1.00
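The `mean_q` column in the log rises as training proceeds, because every step earns a reward of 1 and the agent learns to expect longer episodes. The sketch below shows, in plain Python, the kind of target a Q-learning agent regresses toward and a soft target-network update. The function names are hypothetical, the discount `gamma=0.99` is an assumed value, and reading `target_model_update=1e-2` as a soft-update rate `tau` is an assumption about how the library treats values below 1.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """Q-learning target: r + gamma * max_a' Q(s', a'), or just r at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def soft_update(target_weights, online_weights, tau=1e-2):
    """Move each target weight a fraction tau toward the online weight."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_weights, online_weights)]

# One non-terminal and one terminal transition with reward 1.
y = td_target(reward=1.0, next_q_values=[5.0, 6.0], done=False)
y_terminal = td_target(reward=1.0, next_q_values=[5.0, 6.0], done=True)

# The target weights drift slowly toward the online weights.
new_target = soft_update([0.0, 1.0], [1.0, 1.0])

print(y, y_terminal, new_target)
```

The loss reported in the log measures how far the network's predicted Q-values are from targets of this form.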

In [12]:
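try:
  # Render the environment while testing, if a display is available.
  dqn.test(env, nb_episodes=5, visualize=True, verbose=2)
except Exception:
  # Rendering failed (for example, no display), so test without it.
  dqn.test(env, nb_episodes=5, visualize=False, verbose=2)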

Testing for 5 episodes ...
Episode 1: reward: 68.000, steps: 68
Episode 2: reward: 52.000, steps: 52
Episode 3: reward: 49.000, steps: 49
Episode 4: reward: 54.000, steps: 54
Episode 5: reward: 128.000, steps: 128
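A quick way to read these results is to summarize the five episode rewards. The snippet below copies the values from the output above and is only a sketch of how you might do that.

```python
from statistics import mean

# Episode rewards copied from the test output above.
rewards = [68.0, 52.0, 49.0, 54.0, 128.0]

average = mean(rewards)
print(f"mean: {average:.1f}, min: {min(rewards):.0f}, max: {max(rewards):.0f}")
# mean: 70.2, min: 49, max: 128
```

Because CartPole gives a reward of 1 for every step the pole stays up, the reward equals the episode length, so the agent balanced the pole for about 70 steps on average after 5000 training steps.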


## Done

In [13]:
import datetime
now = datetime.datetime.now()
print('Done!',str(now))

Done! 2019-04-24 22:01:56.467509


You can now test the reinforcement learning model.