## <center>Keras-RL DQN Model</center>


In this notebook we create our first Reinforcement Learning agent with keras-RL2, using the *CartPole* environment as an example.

In [1]:
import time  # to reduce the game speed when playing manually

import gym  # Contains the game we want to play

# import necessary blocks from keras to build the Deep Learning backbone of our agent
from tensorflow.keras.models import Sequential  
from tensorflow.keras.layers import Dense, Activation, InputLayer
from tensorflow.keras.layers import Flatten
from tensorflow.keras.optimizers import Adam  # Adam optimizer

# Now the keras-rl2 agent. Don't get confused: the package is imported as rl, not keras-rl
from rl.agents.dqn import DQNAgent

### a. Environment set up

In [2]:
def recall():
    env = gym.make('CartPole-v1')
    return env

env = recall()
env.reset()

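# Take a few random actions to check that the environment works
for _ in range(9):
    env.render(mode="human")
    random_action = env.action_space.sample()
    # classic gym API: step returns (observation, reward, done, info)
    _, _, done, _ = env.step(random_action)
    if done:  # the pole fell over, so start a new episode
        env.reset()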

env.close()

In [3]:
num_actions = env.action_space.n
num_observations = env.observation_space.shape[0]
print(f"There are {num_actions} possible actions and {num_observations} observations")

There are 2 possible actions and 4 observations


### b. DQN agent set up
The DQN agent in keras-RL2 needs the following parameters:

**1. Model**

The model is the neural network (ANN) that estimates the Q-values. Here we use the same architecture as the one implemented in the Manual DQN notebook.

In [4]:
model = Sequential()
model.add(InputLayer(input_shape=(1, num_observations)))
model.add(Flatten())

model.add(Dense(32))
model.add(Activation('relu'))

model.add(Dense(32))
model.add(Activation('relu'))

model.add(Dense(32))
model.add(Activation('relu'))

model.add(Dense(num_actions))
model.add(Activation('linear'))

print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten (Flatten)           (None, 4)                 0         
                                                                 
 dense (Dense)               (None, 32)                160       
                                                                 
 activation (Activation)     (None, 32)                0         
                                                                 
 dense_1 (Dense)             (None, 32)                1056      
                                                                 
 activation_1 (Activation)   (None, 32)                0         
                                                                 
 dense_2 (Dense)             (None, 32)                1056      
                                                                 
 activation_2 (Activation)   (None, 32)                0
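
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The sketch below reproduces the arithmetic in plain Python. The helper name `dense_params` is made up for illustration, and the value for the output layer comes from the formula, not from the printed summary.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Sizes taken from the notebook: 4 observations, three hidden layers of 32, 2 actions
num_observations, hidden, num_actions = 4, 32, 2
layer_sizes = [num_observations, hidden, hidden, hidden, num_actions]

# One entry per Dense layer, in the order they were added to the model
param_counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(param_counts)

print(param_counts)  # [160, 1056, 1056, 66]
print(total_params)  # 2338
```

The Flatten and Activation layers have no trainable parameters, which is why they show 0.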

**2. nb_actions**

Number of actions, already defined above as *num_actions*.

**3. memory**

The experience replay memory. You can choose between *SequentialMemory()* and *EpisodeParameterMemory()*; the latter is only used by one RL agent, called *CEM*. *SequentialMemory()* stores the observed transitions in an optimized circular buffer.

Here we initialize the circular buffer with a limit of 20000 and a window length of 1. The window length is the number of consecutive observations that are stacked to form one state. This will be demonstrated in the next lecture, when we start dealing with images.
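
To make the window length concrete, here is a minimal sketch of how consecutive observations could be stacked into states. It is plain Python, not the keras-RL2 implementation. The function name `stack_window` and the zero padding at the start of an episode are assumptions made for illustration.

```python
from collections import deque

def stack_window(observations, window_length):
    """Return one state per observation; each state holds the last
    `window_length` observations, zero-padded at the episode start."""
    window = deque(maxlen=window_length)
    states = []
    for obs in observations:
        window.append(obs)
        missing = window_length - len(window)
        padding = [[0.0] * len(obs) for _ in range(missing)]
        states.append(padding + list(window))
    return states

observations = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# window_length=1: every state is a single observation with an extra leading axis
states_w1 = stack_window(observations, window_length=1)
# window_length=2: every state pairs an observation with its predecessor
states_w2 = stack_window(observations, window_length=2)
```

With a window length of 1, each state has shape `(1, num_observations)`, which is why the model starts with `InputLayer(input_shape=(1, num_observations))` followed by `Flatten()`.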


In [5]:
from rl.memory import SequentialMemory  
memory = SequentialMemory(limit=20000, window_length=1)
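
The idea of a circular buffer with a limit can be sketched with the standard library. This is only an illustration of the concept, assuming a simplified transition tuple. `TinyReplayBuffer` is a hypothetical class, not part of keras-RL2.

```python
import random
from collections import deque

class TinyReplayBuffer:
    """Keeps at most `limit` transitions; the oldest ones are overwritten."""

    def __init__(self, limit):
        self.buffer = deque(maxlen=limit)

    def append(self, observation, action, reward, terminal):
        self.buffer.append((observation, action, reward, terminal))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = TinyReplayBuffer(limit=3)
for step in range(5):
    buf.append(observation=step, action=0, reward=1.0, terminal=False)

# Only the three most recent transitions survive
stored_observations = [transition[0] for transition in buf.buffer]
batch = buf.sample(batch_size=2)
```

Sampling random batches from such a buffer, instead of training on consecutive steps, reduces the correlation between the training samples.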

**4. nb_steps_warmup**

Number of initial steps during which the agent only collects experience and does not train. It is used to fill the memory.

**5. target_model_update**

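How often the target model is updated. In keras-RL2, a value of 1 or more means a hard update (a full copy of the weights) every that many steps, while a value below 1 is used as the blending factor of a soft update.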
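
A target model can be refreshed in two ways: a hard copy of the online weights every N steps, or a soft blend towards them at every step. The following is a minimal sketch in plain Python, assuming the weights are flat lists of floats. The function names are hypothetical and only illustrate the two rules.

```python
def hard_update(target_weights, online_weights):
    """Replace the target weights with a copy of the online weights."""
    return list(online_weights)

def soft_update(target_weights, online_weights, tau):
    """Move the target weights a fraction `tau` towards the online weights."""
    return [tau * w + (1.0 - tau) * t
            for t, w in zip(target_weights, online_weights)]

def is_hard_update_step(step, interval):
    """True on every `interval`-th step, e.g. interval=100 in this notebook."""
    return step % interval == 0

target = [0.0, 0.0]
online = [1.0, 2.0]

hard_result = hard_update(target, online)           # [1.0, 2.0]
soft_result = soft_update(target, online, tau=0.5)  # [0.5, 1.0]
```

Keeping the target model fixed between updates gives the Q-learning targets a stable reference, which tends to make training less erratic.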

**6. Action Selection policy**

There are many policies to choose from. Some of them, like *LinearAnnealedPolicy()*, are referred to as outer policies and take an inner policy such as *SoftmaxPolicy()*, *EpsGreedyQPolicy()*, *GreedyQPolicy()*, *MaxBoltzmannQPolicy()* or *BoltzmannGumbelQPolicy()*.

In [6]:
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy

policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), # inner policy
                              attr='eps', # attribute 
                              value_max=1.0, # max value of the attribute
                              value_min=0.001, # min value of the attribute 
                              value_test=0.0005, # small value to test the model --> exploitation
                              nb_steps=200000) 
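
To see what this policy does, the sketch below computes a linearly annealed epsilon and uses it for epsilon-greedy action selection. It is plain Python under the parameter values of the cell above, not the keras-RL2 source. The names `annealed_eps` and `eps_greedy_action` are made up for illustration.

```python
import random

def annealed_eps(step, value_max=1.0, value_min=0.001, nb_steps=200000):
    """Decrease epsilon linearly from value_max to value_min over nb_steps,
    then keep it at value_min."""
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, slope * step + value_max)

def eps_greedy_action(q_values, eps):
    """With probability eps pick a random action, otherwise the best one."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

eps_start = annealed_eps(0)          # 1.0, almost purely random actions
eps_middle = annealed_eps(100000)    # about 0.5005
eps_late = annealed_eps(1000000)     # 0.001, the floor after annealing ends

# With eps=0.0 the choice is always greedy: index of the largest Q-value
greedy_action = eps_greedy_action([0.1, 0.9], eps=0.0)
```

Under these settings, epsilon halfway through the first 10000 steps is about 0.975, which agrees with the `mean_eps` reported for the first training interval further down.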

Now we create the DQN agent from the defined model (**model**), the number of possible actions (**nb_actions**; left and right in this case), the circular buffer (**memory**), the burn-in or warm-up phase (**10** steps), how often the target model gets updated (every **100** steps) and the policy (**policy**).


In [7]:
dqn = DQNAgent(model=model, 
               nb_actions=num_actions, 
               memory=memory, 
               nb_steps_warmup=10,
               target_model_update=100, 
               policy=policy)

# Compilation
dqn.compile(Adam(learning_rate=0.0001), metrics=['mae']) 

# Now we run the training for 200000 steps. Set visualize=True if you want to watch your model learning.
# Keep in mind that this increases the running time

dqn.fit(env, nb_steps=200000, visualize=False, verbose=1)

Training for 200000 steps ...
Interval 1 (0 steps performed)
    1/10000 [..............................] - ETA: 8:12 - reward: 1.0000

421 episodes - episode_reward: 23.717 [8.000, 121.000] - loss: 8.657 - mae: 13.803 - mean_q: 26.795 - mean_eps: 0.975

Interval 2 (10000 steps performed)
368 episodes - episode_reward: 27.196 [8.000, 102.000] - loss: 35.264 - mae: 42.799 - mean_q: 87.094 - mean_eps: 0.925

Interval 3 (20000 steps performed)
328 episodes - episode_reward: 30.485 [8.000, 115.000] - loss: 139.249 - mae: 109.835 - mean_q: 226.613 - mean_eps: 0.875

Interval 4 (30000 steps performed)
284 episodes - episode_reward: 35.070 [9.000, 131.000] - loss: 506.604 - mae: 229.148 - mean_q: 473.489 - mean_eps: 0.825

Interval 5 (40000 steps performed)
238 episodes - episode_reward: 42.139 [8.000, 188.000] - loss: 1307.097 - mae: 398.757 - mean_q: 822.530 - mean_eps: 0.775

Interval 6 (50000 steps performed)
193 episodes - episode_reward: 51.653 [11.000, 216.000] - loss: 3026.643 - mae: 610.585 - mean_q: 1256.975 - mean_eps: 0.725

Interval 7 (60000 steps performed)
161 episodes - episode_reward: 62.410 [12.000, 250.000]

<keras.callbacks.History at 0x1b6ea0a7f10>

In [9]:
# Finally, evaluate our algorithm for 5 episodes.
dqn.test(env, nb_episodes=5, visualize=True)
env.close()

Testing for 5 episodes ...
Episode 1: reward: 220.000, steps: 220
Episode 2: reward: 200.000, steps: 200
Episode 3: reward: 208.000, steps: 208
Episode 4: reward: 239.000, steps: 239
Episode 5: reward: 283.000, steps: 283
