## <center>Manually Creating a DQN Model </center>


## Deep-Q-Learning
In this notebook we will create our first deep reinforcement learning model, a Deep Q-Network (DQN).
We again use a simple environment from OpenAI Gym. However, you will soon see how much we gain by switching from standard Q-Learning to Deep Q-Learning.

In this notebook we again take a look at the CartPole problem (https://gym.openai.com/envs/CartPole-v1/).



In [1]:
from collections import deque
import random
import time

import numpy as np
import gym
from tensorflow.keras.models import Sequential  # To compose multiple Layers
from tensorflow.keras.layers import Dense, InputLayer  # Fully-Connected layer
from tensorflow.keras.layers import Activation  # Activation functions
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import clone_model

In [2]:
def recall(mode=None):
    env = gym.make('CartPole-v1', render_mode=mode)

    return env

Remember, the goal of the CartPole challenge is to keep the pole balanced upright.

In [3]:
env = recall("human")
env.reset()  # reset the environment to the initial state
for _ in range(9):  # play a few steps with random actions
    env.render()  # render the current game state on your screen
    random_action = env.action_space.sample()  # choose a random action
    env.step(random_action)  # execute that action
env.close()  # close the environment

### The Artificial Neural Network
To build our network, we first need to find out how many actions and observations our environment has.

In [4]:
num_actions = env.action_space.n
num_observations = env.observation_space.shape[0]  # You can use this command to get the number of observations
print(f"There are {num_actions} possible actions and {num_observations} observations")

There are 2 possible actions and 4 observations


So our network needs an input dimension of 4 and an output dimension of 2.
In between we are free to choose.

Let's say we want to use three fully connected layers on top of the input:


1. The first layer has 16 neurons
2. The second layer has 32 neurons
3. The third layer (output layer) has 2 neurons

This yields 690 parameters (weights plus biases of each layer):
$$ (4 \cdot 16 + 16) + (16 \cdot 32 + 32) + (32 \cdot 2 + 2) = 80 + 544 + 66 = 690 $$
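
The count can be double-checked without Keras. The snippet below is a minimal sketch; `dense_params` is a helper name introduced here for illustration and is not part of the notebook.

```python
def dense_params(n_inputs, n_neurons):
    """Weights plus one bias per neuron of a fully connected layer."""
    return n_inputs * n_neurons + n_neurons

layer_sizes = [4, 16, 32, 2]  # input dimension followed by the three Dense layers
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)  # [80, 544, 66] 690
```

The per-layer numbers should match the `Param #` column of the model summary.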

In [5]:
model = Sequential()

model.add(InputLayer(input_shape=(1, num_observations)))
model.add(Dense(16,))
model.add(Activation('relu'))

model.add(Dense(32))
model.add(Activation('relu'))


model.add(Dense(num_actions))
model.add(Activation('linear'))  # linear output: one raw Q-value estimate per action

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 1, 16)             80        
                                                                 
 activation (Activation)     (None, 1, 16)             0         
                                                                 
 dense_1 (Dense)             (None, 1, 32)             544       
                                                                 
 activation_1 (Activation)   (None, 1, 32)             0         
                                                                 
 dense_2 (Dense)             (None, 1, 2)              66        
                                                                 
 activation_2 (Activation)   (None, 1, 2)              0         
                                                                 
Total params: 690
Trainable params: 690
Non-trainable pa

Now we have a model that takes an observation as input and outputs one value (a Q-value estimate) for each action.
The higher the value, the better the network estimates that action to be for the current observation.
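
To make this concrete, here is a small sketch with made-up output values (they are not taken from the network above). The array has the shape `(1, 1, 2)` that the model returns for a single observation.

```python
import numpy as np

# Hypothetical network output for one observation: one value per action
prediction = np.array([[[0.3, 1.7]]])  # shape (batch, 1, num_actions)

# argmax over the flattened array works because the batch holds one observation
best_action = int(np.argmax(prediction))
print(best_action)  # 1
```

For a batch with more than one observation, the argmax would have to be taken along the last axis instead.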

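As stated in the lecture, Deep-Q-Learning works better when using a target network.
So let's copy the network above. Note that *clone_model* only copies the architecture and initializes fresh weights, so we also copy the weights explicitly.

In [6]:
target_model = clone_model(model)
target_model.set_weights(model.get_weights())  # start with identical weights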

### Hyperparameters and Update Function

In [7]:
EPOCHS = 300

epsilon = 1.0
EPSILON_REDUCE = 0.995  # is multiplied with epsilon each epoch to reduce it
GAMMA = 0.95
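
To get a feeling for the decay schedule, the following sketch computes epsilon after a given number of epochs. `epsilon_after` is a hypothetical helper used only for this illustration.

```python
EPSILON_REDUCE = 0.995

def epsilon_after(num_epochs, start=1.0, factor=EPSILON_REDUCE):
    """Epsilon after it has been multiplied by `factor` num_epochs times."""
    return start * factor ** num_epochs

for n in (1, 100, 300):
    print(n, round(epsilon_after(n), 4))
```

With these values epsilon is still above 0.2 after all 300 epochs, so a noticeable share of the actions stays random throughout training.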

Let us use the epsilon-greedy action selection function once again:

In [8]:
def action_selection(model, epsilon, observation):
    random_number = np.random.random()
    if random_number > epsilon:
        # Exploit: predict the Q-values for the observation
        prediction = model.predict(observation.reshape(-1, 1, 4), verbose=0)
        action = np.argmax(prediction)  # Choose the action with the higher value
    else:
        action = np.random.randint(0, num_actions)  # Explore: use a random action
    return action
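
The selection rule itself can be tried out without a Keras model. The sketch below works on a plain array of made-up Q-values; `epsilon_greedy` and the seeded generator are assumptions for this illustration only.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-valued one."""
    if rng.random() > epsilon:
        return int(np.argmax(q_values))      # exploit
    return int(rng.integers(len(q_values)))  # explore

rng = np.random.default_rng(seed=0)
q_values = np.array([0.2, 0.9])  # made-up Q-values for the two CartPole actions

greedy = [epsilon_greedy(q_values, 0.0, rng) for _ in range(100)]
explore = [epsilon_greedy(q_values, 1.0, rng) for _ in range(100)]
print(set(greedy), set(explore))
```

With `epsilon=0.0` the better-valued action is always chosen, while `epsilon=1.0` ignores the Q-values completely.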

As shown in the lecture, we need a replay buffer.
We can use the **deque** data structure for this, which already implements the circular behavior.

The *maxlen* argument specifies how many elements the buffer can hold before it starts discarding the oldest ones.

The following cell shows an example usage of the deque. In the first example all values fit into the deque, so nothing is overwritten.

In the second example, the deque is printed in each iteration. It can hold all values in the first five iterations, but then it has to drop the oldest value to make room for each new one.

In [9]:
### deque examples
deque_1 = deque(maxlen=5)
for i in range(5):  # all values fit into the deque, no overwriting
    deque_1.append(i)
print(deque_1)
print("---------------------")
deque_2 = deque(maxlen=5)

# after the first 5 values are stored, it needs to overwrite the oldest value to store the new one
for i in range(10):  
    deque_2.append(i)
    print(deque_2)

deque([0, 1, 2, 3, 4], maxlen=5)
---------------------
deque([0], maxlen=5)
deque([0, 1], maxlen=5)
deque([0, 1, 2], maxlen=5)
deque([0, 1, 2, 3], maxlen=5)
deque([0, 1, 2, 3, 4], maxlen=5)
deque([1, 2, 3, 4, 5], maxlen=5)
deque([2, 3, 4, 5, 6], maxlen=5)
deque([3, 4, 5, 6, 7], maxlen=5)
deque([4, 5, 6, 7, 8], maxlen=5)
deque([5, 6, 7, 8, 9], maxlen=5)


Let's say we allow our replay buffer a maximum size of 20000

In [10]:
replay_buffer = deque(maxlen=20000)
update_target_model = 10  # copy the weights to the target network every 10 epochs
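
Drawing a training batch from such a buffer might look like the following sketch. The capacity of 100 and the dummy entries are chosen for illustration and stand in for real transitions.

```python
from collections import deque
import random

buffer = deque(maxlen=100)           # small capacity, for illustration only
for step in range(250):              # store more entries than fit
    buffer.append((step, step % 2))  # stand-in for (state, action, ...)

batch = random.sample(buffer, 32)    # 32 distinct entries, without replacement
print(len(buffer), len(batch), min(entry[0] for entry in buffer))  # 100 32 150
```

Only the 100 most recent entries survive, and sampling at random breaks the temporal order in which they were collected.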

As mentioned in the lecture, experience replay is crucial for Deep Q-Learning. <br />
The following cell implements one version of the replay algorithm. <br />
It uses the zip function paired with the * (Unpacking Argument Lists) operator to create batches from the samples for efficient prediction and training.<br />
Applied to the unpacked samples, zip groups the i-th entries of all samples into one tuple. <br />
It might look confusing, but the following example should clarify it.

In [11]:
test_tuple = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
zipped_list = list(zip(*test_tuple))
a, b, c = zipped_list
print(a, b, c)

(1, 4, 7) (2, 5, 8) (3, 6, 9)
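
Applied to replay samples, the same trick turns a list of transitions into one batch per field. The transitions below are made up and only mimic the `(state, action, reward, new_state, done)` layout used in this notebook.

```python
import numpy as np

# Three made-up transitions: (state, action, reward, new_state, done)
samples = [
    (np.zeros((1, 4)), 0, 1.0, np.ones((1, 4)), False),
    (np.ones((1, 4)), 1, 1.0, np.zeros((1, 4)), False),
    (np.zeros((1, 4)), 1, 1.0, np.ones((1, 4)), True),
]

states, actions, rewards, new_states, dones = zip(*samples)
state_batch = np.array(states)  # stacks the (1, 4) states into one batch
print(state_batch.shape, actions, dones)  # (3, 1, 4) (0, 1, 1) (False, False, True)
```

The resulting shape `(3, 1, 4)` is what a network with input shape `(1, 4)` expects for a batch of three states.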


Now it's time to write the replay function

In [12]:
def replay(replay_buffer, batch_size, model, target_model):

    # As long as the buffer does not hold enough elements, we do nothing
    if len(replay_buffer) < batch_size:
        return

    # Take a random sample of size batch_size from the buffer
    samples = random.sample(replay_buffer, batch_size)

    # Stores the training targets for the sampled states
    target_batch = []

    # Efficient way to handle the sample by using the zip functionality
    zipped_samples = list(zip(*samples))
    states, actions, rewards, new_states, dones = zipped_samples

    # Current Q-value estimates of the online model for all sampled states;
    # only the entry of the action actually taken is overwritten below
    targets = model.predict(np.array(states), verbose=0)

    # Q-values of the target network for all new states (used for bootstrapping)
    q_values = target_model.predict(np.array(new_states), verbose=0)

    # Now we loop over all predicted values to compute the actual targets
    for i in range(batch_size):

        # Take the maximum Q-value of the new state for each sample
        q_value = max(q_values[i][0])

        # Copy the i-th target in order to update it according to the formula
        target = targets[i].copy()
        if dones[i]:
            target[0][actions[i]] = rewards[i]
        else:
            target[0][actions[i]] = rewards[i] + q_value * GAMMA
        target_batch.append(target)

    # Fit the model on the states and the updated targets for 1 epoch
    model.fit(np.array(states), np.array(target_batch), epochs=1, verbose=0)
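
The target update inside the loop can also be written without a Python loop. The following NumPy sketch uses made-up Q-values in place of network predictions and 2-D arrays of shape `(batch, num_actions)` instead of the `(batch, 1, num_actions)` arrays above. `compute_targets` is a name introduced here for illustration.

```python
import numpy as np

GAMMA = 0.95

def compute_targets(q_current, q_next, actions, rewards, dones):
    """Overwrite the taken action's value with reward (+ discounted best next value)."""
    targets = q_current.copy()
    best_next = q_next.max(axis=1)
    # Finished episodes have no future reward, so the bootstrap term is dropped
    updated = rewards + GAMMA * best_next * (1.0 - dones)
    targets[np.arange(len(actions)), actions] = updated
    return targets

q_current = np.array([[0.5, 0.1], [0.2, 0.4]])
q_next = np.array([[1.0, 2.0], [3.0, 0.0]])
actions = np.array([0, 1])
rewards = np.array([1.0, 1.0])
dones = np.array([0.0, 1.0])  # the second transition ended the episode

targets = compute_targets(q_current, q_next, actions, rewards, dones)
print(targets)
```

The first row receives `1.0 + 0.95 * 2.0` for action 0, the second row receives only the reward for action 1, and the values of the actions that were not taken stay unchanged.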


We need to update our target network every once in a while. <br />
Keras provides the *set_weights()* and *get_weights()* methods, which do the work for us, so we only need to check whether we have hit an update epoch.

In [13]:
def update_model_handler(epoch, update_target_model, model, target_model):
    if epoch > 0 and epoch % update_target_model == 0:
        target_model.set_weights(model.get_weights())
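
To see which epochs trigger a copy, the condition can be tried in isolation. `is_update_epoch` is a hypothetical helper, not part of the notebook.

```python
update_target_model = 10

def is_update_epoch(epoch, interval=update_target_model):
    """Same condition as in update_model_handler, without the Keras models."""
    return epoch > 0 and epoch % interval == 0

update_epochs = [epoch for epoch in range(35) if is_update_epoch(epoch)]
print(update_epochs)  # [10, 20, 30]
```

Epoch 0 is skipped on purpose, so the first copy happens after ten epochs of training.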


# Part 4: Training the Model

Now it is time to write the training loop! <br />
First we compile the model

In [14]:
model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))

In [15]:
best_so_far = 0
env = recall()  # create the environment once and reuse it for every epoch
for epoch in range(EPOCHS):
    observation, _ = env.reset()  # reset() returns (observation, info); take the initial state
    
    # Keras expects the input to be of shape [1, X] thus we have to reshape
    observation = observation.reshape([1, 4])  
    done = False  
    
    points = 0
    while not done:  # as long as the current episode is active
        
        # Select action acc. to strategy
        action = action_selection(model, epsilon, observation)
        
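        # Perform action and get next state; gym reports termination (pole fell or
        # cart left the track) and truncation (time limit reached) separately
        next_observation, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        next_observation = next_observation.reshape([1, 4])  # Reshape!!
        # Update the replay buffer; only a real termination should stop bootstrapping
        replay_buffer.append((observation, action, reward, next_observation, terminated))
        observation = next_observation  # update the observation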
        points+=1

        # Most important step! Training the model by replaying
        replay(replay_buffer, 32, model, target_model)

    
    epsilon *= EPSILON_REDUCE  # Reduce epsilon
    
    # Check if we need to update the target model
    update_model_handler(epoch, update_target_model, model, target_model)
    
    if points > best_so_far:
        best_so_far = points

    if epoch % 10 == 0:
        print(f"{epoch}: Points reached: {points} - epsilon: {epsilon} - Best: {best_so_far}")


0: Points reached: 35 - epsilon: 0.995 - Best: 35
10: Points reached: 26 - epsilon: 0.946354579813443 - Best: 44
20: Points reached: 37 - epsilon: 0.9000874278732445 - Best: 44
30: Points reached: 54 - epsilon: 0.8560822709551227 - Best: 54
40: Points reached: 19 - epsilon: 0.8142285204175609 - Best: 54
50: Points reached: 25 - epsilon: 0.7744209942832988 - Best: 55
60: Points reached: 32 - epsilon: 0.736559652908221 - Best: 60
70: Points reached: 21 - epsilon: 0.7005493475733617 - Best: 138
80: Points reached: 49 - epsilon: 0.6662995813682115 - Best: 138
90: Points reached: 22 - epsilon: 0.6337242817644086 - Best: 138
100: Points reached: 51 - epsilon: 0.6027415843082742 - Best: 143
110: Points reached: 171 - epsilon: 0.5732736268885887 - Best: 184
120: Points reached: 86 - epsilon: 0.5452463540625918 - Best: 184
130: Points reached: 74 - epsilon: 0.5185893309484582 - Best: 184
140: Points reached: 28 - epsilon: 0.4932355662165453 - Best: 184
150: Points reached: 127 - epsilon: 0.4691

KeyboardInterrupt: 

In [16]:
epoch

193

# Part 5: Using Trained Model

In [19]:
env = recall("human")
observation = env.reset()
observation = observation[0]
for counter in range(300):
    env.render()
    
    # Choose the greedy action according to the trained model
    action = np.argmax(model.predict(observation.reshape([-1, 1, 4]), verbose=0))
    
    # Perform the action
    observation, reward, terminated, truncated, info = env.step(action)
    
    if terminated or truncated:
        print("done")
        break
    time.sleep(0.1)

env.close()

done
