# Pipple Lecture #12 - Reinforcement Learning
By now, you have seen quite a lot of material on Reinforcement Learning. In this notebook, you get the chance to program your own Deep Reinforcement Learning (DRL) model. Or at least to tune its parameters. The game environment, state transitions, reward calculations and training of the model have already been prepared for you. Your job is to focus on one task and one task only: keep your pole as upright as possible!

During the lecture, we could not discuss all elements of a DRL model, as there are many aspects that can be tuned to perfection (or far from it). Some additional explanation is given in the notebook where necessary, but don't be shy to ask for more!

Let's get started. First, import the necessary modules (and suppress some unwanted warnings). The 'gym' package is imported to create a Cart Pole environment for you to play with. 'keras' provides the neural network, while 'keras-rl' (imported as 'rl') contains a whole bunch of interesting Reinforcement Learning functions.

In [1]:
import numpy as np
import gym
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

Using TensorFlow backend.


Then, set the relevant variables. Get the environment and extract the number of actions available in the CartPole problem. The seed settings can be useful for comparing your results across different runs. However, both the neural network and the RL framework itself still contain a high level of randomness, which may make comparing distinct runs difficult. Keep this in mind when trying different parameter settings.

In [2]:
env = gym.make('CartPole-v0')
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

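As a minimal sketch of what a seed does, the snippet below uses only NumPy. The helper `sample_actions` is hypothetical and not part of gym or keras-rl; it only illustrates that the same seed reproduces the same pseudo-random sequence, which is the property the seed settings above rely on.

```python
import numpy as np

def sample_actions(seed, n=20):
    """Hypothetical helper: seed NumPy's global generator, then draw n random actions (0 or 1)."""
    np.random.seed(seed)
    return np.random.randint(0, 2, size=n).tolist()

run_a = sample_actions(123)
run_b = sample_actions(123)  # same seed as run_a
run_c = sample_actions(456)  # different seed, so most likely a different sequence

print(run_a == run_b)  # True: identical seeds give identical draws
print(run_c)
```

Other sources of randomness (for example the initial network weights) are not covered by this sketch, which is one reason runs can still differ.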
Next, build a neural network model. Initially, it is set to a simple feed-forward neural net with a single hidden layer of 4 nodes (hint: this is probably quite low). Try different settings yourself to find your optimal set-up! Unfortunately, to this day there are no clear rules for choosing how many layers or nodes to use. Google may give you some ideas, but most decisions still follow the famous method of trial and error.

Try tuning the number of hidden layers, the number of nodes per hidden layer, and the type of activation functions in the hidden and output layers. Use 'print(model.summary())' to get an overview of the complexity of your model.

In [10]:
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(4))
model.add(Activation('relu'))
#model.add(Dense(4))
#model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
print(model.summary())

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_3 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_6 (Dense)              (None, 4)                 20        
_________________________________________________________________
activation_6 (Activation)    (None, 4)                 0         
_________________________________________________________________
dense_7 (Dense)              (None, 2)                 10        
_________________________________________________________________
activation_7 (Activation)    (None, 2)                 0         
Total params: 30
Trainable params: 30
Non-trainable params: 0
_________________________________________________________________
None
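The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The sketch below reproduces the numbers above; `dense_params` is a hypothetical helper written for this illustration, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters in a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units

n_obs = 4      # size of the flattened observation, as in the Flatten output shape above
n_hidden = 4   # nodes in the hidden layer
n_actions = 2  # output nodes, one per action

hidden = dense_params(n_obs, n_hidden)      # 4*4 + 4 = 20
output = dense_params(n_hidden, n_actions)  # 4*2 + 2 = 10
total = hidden + output

print(hidden, output, total)  # 20 10 30
```

Changing `n_hidden` here gives a quick feel for how fast the model grows when you add nodes.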


Now, configure and compile your agent. The memory is set to Sequential Memory, which stores the results of performed actions and the obtained rewards. Try using different types of action-selection policies, memory sizes, learning rates, training steps, or whatever else you can think of. Settings you can tune:

* policy: the way in which actions are selected over time. This RL concept is very important, as it governs the trade-off between exploring unknown parts of the environment and exploiting known information. (Possible policies: EpsGreedyQPolicy, LinearAnnealedPolicy, SoftmaxPolicy, GreedyQPolicy, BoltzmannQPolicy, MaxBoltzmannQPolicy, BoltzmannGumbelQPolicy)
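To make the exploration/exploitation trade-off concrete, here is a minimal sketch of epsilon-greedy action selection in plain NumPy. It illustrates the general idea and is not the keras-rl implementation; the function name `eps_greedy_action`, the example Q-values and the value `eps=0.1` are assumptions chosen for illustration.

```python
import numpy as np

def eps_greedy_action(q_values, eps, rng):
    """With probability eps pick a random action (explore), otherwise the best-valued one (exploit)."""
    if rng.uniform() < eps:
        return int(rng.randint(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))             # exploit: action with the highest Q-value

rng = np.random.RandomState(0)
q_values = np.array([0.2, 1.5])  # made-up Q-values for the two CartPole actions

actions = [eps_greedy_action(q_values, eps=0.1, rng=rng) for _ in range(1000)]
greedy_fraction = sum(a == 1 for a in actions) / len(actions)

print(greedy_fraction)  # close to 0.95: exploiting, plus half of the random picks
```

A larger `eps` means more exploration; with `eps=0` the choice is always the greedy one.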

In [11]:
policy = EpsGreedyQPolicy()
memory = SequentialMemory(limit=20000, window_length=1)
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10, target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])

Training for 5000 steps ...
   97/5000: episode: 1, duration: 1.308s, episode steps: 97, steps per second: 74, episode reward: 97.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.485 [0.000, 1.000], mean observation: -0.118 [-0.757, 0.370], loss: 0.443661, mean_absolute_error: 0.495302, mean_q: 0.054613
  151/5000: episode: 2, duration: 0.133s, episode steps: 54, steps per second: 406, episode reward: 54.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.537 [0.000, 1.000], mean observation: 0.132 [-0.237, 0.692], loss: 0.377329, mean_absolute_error: 0.513735, mean_q: 0.187438
  217/5000: episode: 3, duration: 0.177s, episode steps: 66, steps per second: 374, episode reward: 66.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.530 [0.000, 1.000], mean observation: 0.153 [-0.210, 0.907], loss: 0.351174, mean_absolute_error: 0.549699, mean_q: 0.297660
  288/5000: episode: 4, duration: 0.181s, episode steps: 71, steps per second: 392, episode reward: 71.000, mean reward: 1.

  818/5000: episode: 36, duration: 0.031s, episode steps: 10, steps per second: 323, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.134 [-3.033, 1.914], loss: 0.280539, mean_absolute_error: 2.027750, mean_q: 3.787840
  827/5000: episode: 37, duration: 0.050s, episode steps: 9, steps per second: 181, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.171 [-2.884, 1.742], loss: 0.445651, mean_absolute_error: 2.134821, mean_q: 3.904449
  838/5000: episode: 38, duration: 0.045s, episode steps: 11, steps per second: 247, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.117 [-3.298, 2.182], loss: 0.407343, mean_absolute_error: 2.168484, mean_q: 4.011133
  849/5000: episode: 39, duration: 0.040s, episode steps: 11, steps per second: 278, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mea

 1154/5000: episode: 71, duration: 0.040s, episode steps: 8, steps per second: 200, episode reward: 8.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.151 [-2.580, 1.613], loss: 0.699010, mean_absolute_error: 3.112987, mean_q: 5.978102
 1165/5000: episode: 72, duration: 0.036s, episode steps: 11, steps per second: 307, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.909 [0.000, 1.000], mean observation: -0.114 [-2.746, 1.752], loss: 0.775495, mean_absolute_error: 3.199697, mean_q: 6.012989
 1173/5000: episode: 73, duration: 0.021s, episode steps: 8, steps per second: 388, episode reward: 8.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.164 [-2.579, 1.552], loss: 0.591940, mean_absolute_error: 3.205439, mean_q: 6.047392
 1183/5000: episode: 74, duration: 0.027s, episode steps: 10, steps per second: 372, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean 

 1454/5000: episode: 102, duration: 0.034s, episode steps: 10, steps per second: 298, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.800 [0.000, 1.000], mean observation: -0.118 [-2.392, 1.557], loss: 0.716644, mean_absolute_error: 3.755096, mean_q: 7.029856
 1467/5000: episode: 103, duration: 0.036s, episode steps: 13, steps per second: 358, episode reward: 13.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.692 [0.000, 1.000], mean observation: -0.106 [-1.901, 1.158], loss: 0.646353, mean_absolute_error: 3.723486, mean_q: 6.975456
 1478/5000: episode: 104, duration: 0.029s, episode steps: 11, steps per second: 376, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.727 [0.000, 1.000], mean observation: -0.108 [-2.367, 1.600], loss: 0.557665, mean_absolute_error: 3.719927, mean_q: 7.020200
 1491/5000: episode: 105, duration: 0.036s, episode steps: 13, steps per second: 363, episode reward: 13.000, mean reward: 1.000 [1.000, 1.000

 2022/5000: episode: 136, duration: 0.038s, episode steps: 9, steps per second: 238, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.222 [0.000, 1.000], mean observation: 0.155 [-1.147, 1.927], loss: 0.986024, mean_absolute_error: 4.802388, mean_q: 9.050751
 2034/5000: episode: 137, duration: 0.034s, episode steps: 12, steps per second: 352, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.250 [0.000, 1.000], mean observation: 0.100 [-1.383, 2.147], loss: 1.059962, mean_absolute_error: 4.775156, mean_q: 8.999541
 2045/5000: episode: 138, duration: 0.030s, episode steps: 11, steps per second: 369, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.182 [0.000, 1.000], mean observation: 0.109 [-1.415, 2.192], loss: 0.912245, mean_absolute_error: 4.867273, mean_q: 9.215308
 2054/5000: episode: 139, duration: 0.034s, episode steps: 9, steps per second: 264, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean

 2322/5000: episode: 165, duration: 0.034s, episode steps: 11, steps per second: 327, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.091 [0.000, 1.000], mean observation: 0.132 [-1.733, 2.768], loss: 1.826614, mean_absolute_error: 5.622189, mean_q: 10.496082
 2330/5000: episode: 166, duration: 0.029s, episode steps: 8, steps per second: 280, episode reward: 8.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.125 [0.000, 1.000], mean observation: 0.131 [-1.390, 2.169], loss: 2.751743, mean_absolute_error: 5.686408, mean_q: 10.482021
 2340/5000: episode: 167, duration: 0.027s, episode steps: 10, steps per second: 371, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.200 [0.000, 1.000], mean observation: 0.120 [-1.383, 2.173], loss: 3.520200, mean_absolute_error: 5.809319, mean_q: 10.588628
 2349/5000: episode: 168, duration: 0.028s, episode steps: 9, steps per second: 323, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], m

 2693/5000: episode: 199, duration: 0.034s, episode steps: 9, steps per second: 267, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.222 [0.000, 1.000], mean observation: 0.132 [-1.157, 1.940], loss: 3.408504, mean_absolute_error: 6.297012, mean_q: 11.509092
 2702/5000: episode: 200, duration: 0.027s, episode steps: 9, steps per second: 337, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.111 [0.000, 1.000], mean observation: 0.135 [-1.423, 2.274], loss: 3.537980, mean_absolute_error: 6.351237, mean_q: 11.700573
 2712/5000: episode: 201, duration: 0.027s, episode steps: 10, steps per second: 364, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.200 [0.000, 1.000], mean observation: 0.145 [-1.345, 2.169], loss: 2.966020, mean_absolute_error: 6.416731, mean_q: 11.887039
 2722/5000: episode: 202, duration: 0.028s, episode steps: 10, steps per second: 360, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], m

 3068/5000: episode: 229, duration: 0.113s, episode steps: 21, steps per second: 186, episode reward: 21.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.476 [0.000, 1.000], mean observation: 0.086 [-0.585, 0.874], loss: 3.930601, mean_absolute_error: 6.442440, mean_q: 11.671156
 3095/5000: episode: 230, duration: 0.080s, episode steps: 27, steps per second: 339, episode reward: 27.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.519 [0.000, 1.000], mean observation: 0.096 [-0.429, 0.676], loss: 2.869599, mean_absolute_error: 6.456454, mean_q: 11.815352
 3123/5000: episode: 231, duration: 0.075s, episode steps: 28, steps per second: 373, episode reward: 28.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.500 [0.000, 1.000], mean observation: 0.083 [-0.225, 0.970], loss: 3.479470, mean_absolute_error: 6.519836, mean_q: 11.895141
 3191/5000: episode: 232, duration: 0.191s, episode steps: 68, steps per second: 356, episode reward: 68.000, mean reward: 1.000 [1.000, 1.000

 4398/5000: episode: 258, duration: 0.102s, episode steps: 39, steps per second: 384, episode reward: 39.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.462 [0.000, 1.000], mean observation: -0.106 [-0.909, 0.565], loss: 2.532679, mean_absolute_error: 7.364540, mean_q: 13.858647
 4429/5000: episode: 259, duration: 0.088s, episode steps: 31, steps per second: 354, episode reward: 31.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.452 [0.000, 1.000], mean observation: -0.121 [-0.832, 0.405], loss: 2.437815, mean_absolute_error: 7.377513, mean_q: 13.931731
 4473/5000: episode: 260, duration: 0.137s, episode steps: 44, steps per second: 321, episode reward: 44.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.477 [0.000, 1.000], mean observation: -0.093 [-0.706, 0.225], loss: 3.513305, mean_absolute_error: 7.448384, mean_q: 13.881970
 4500/5000: episode: 261, duration: 0.086s, episode steps: 27, steps per second: 314, episode reward: 27.000, mean reward: 1.000 [1.000, 1.

<keras.callbacks.History at 0x1c9ae8732b0>

Now it's time to learn something! There are four settings you can consider changing; however, only one of them has an effect on your training performance:

* nb_steps: the larger, the more time your bot gets for trying to find a good strategy, but the longer you'll have to wait.
* verbose: printing running status. 0 for no logging, 1 for interval logging, 2 for episode logging
* visualize: you can visualize the training for show, but this mostly slows down training
* log_interval: if verbose=1, the number of steps that are considered to be an interval

In [None]:
dqn.fit(env, nb_steps=5000, verbose=2, visualize=False, log_interval=10000)

Run the code below to test your DRL model. The larger the reward and number of steps per episode, the better your model performs. Running about 10 episodes will give you a good overall picture.

NOTE: Don't close the graph window after or while running it. This will reset the kernel, forcing you to re-run everything. You can simply re-run the code below each time instead.

In [12]:
dqn.test(env, nb_episodes=10, visualize=True)

Testing for 10 episodes ...
Episode 1: reward: 46.000, steps: 46
Episode 2: reward: 29.000, steps: 29
Episode 3: reward: 45.000, steps: 45
Episode 4: reward: 47.000, steps: 47
Episode 5: reward: 50.000, steps: 50
Episode 6: reward: 33.000, steps: 33
Episode 7: reward: 29.000, steps: 29
Episode 8: reward: 43.000, steps: 43
Episode 9: reward: 39.000, steps: 39
Episode 10: reward: 41.000, steps: 41


<keras.callbacks.History at 0x1c928b77198>
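To turn the ten test episodes into one overall number, you can average the rewards. The sketch below simply copies the values printed above into a list, so it runs without the agent; reading them from the returned History object instead (for example via `history.history['episode_reward']`) is an assumption about that object and is not used here.

```python
# Episode rewards copied from the test output above
rewards = [46, 29, 45, 47, 50, 33, 29, 43, 39, 41]

mean_reward = sum(rewards) / len(rewards)
worst, best = min(rewards), max(rewards)

print(f"mean: {mean_reward:.1f}, min: {worst}, max: {best}")
# mean: 40.2, min: 29, max: 50
```

Comparing this mean across different parameter settings is more informative than comparing single episodes, given the randomness between runs.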