
MTR -- 20200621


### Mathematical Setting


## Toy Example: CartPole game - balance the pole



In [None]:
import os
# import time
import random
import gym # https://gym.openai.com
import numpy as np
from collections import deque # special list: you can add things in the front or the end of the list
from keras.models import Sequential # we use a sequential model to approximate the Q
from keras.layers import Dense # we use only dense layers in the NN
from keras.optimizers import Adam # with stochastic gradient descent

### Set (hyper-)parameters

Set up the OpenAI Gym environment:

In [None]:
env = gym.make('CartPole-v0') # remember: max length for version v0 is 200 timeframes
print(env.observation_space)
print(env.action_space)

Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
Discrete(2)




Read out the shape of the state vector:

In [None]:
state_size = env.observation_space.shape[0]
state_size

4

Read out the shape of the action vector:

In [None]:
action_size = env.action_space.n
action_size

2

Set the batch size for the SGD:

In [None]:
batch_size = 64 # size of batches for stochastic gradient descent

In [None]:
n_episodes = 1001   # we play a number of episodes and will randomly
                    # remember some of the things which happened within
                    # each episode; we use this memory to train our network
                    # see Idea 1

In [None]:
output_dir = 'model_output/cartpole'

In [None]:
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

##### Idea 1. (not mine of course ;-) experience replay

=> Introduce self.memory
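The idea can be sketched in a few lines of plain Python, assuming a bounded `deque` as the memory. The helper names `remember` and `sample_minibatch` and the tiny `maxlen=5` are illustrative choices for this sketch only, not part of any library.

```python
import random
from collections import deque

# Bounded memory: once it is full, appending drops the oldest transition.
memory = deque(maxlen=5)  # tiny capacity, chosen only for illustration

def remember(state, action, reward, next_state, done):
    """Store one transition (s_t, a_t, r_t, s_{t+1}, done)."""
    memory.append((state, action, reward, next_state, done))

def sample_minibatch(batch_size):
    """Draw batch_size distinct transitions uniformly at random."""
    return random.sample(list(memory), batch_size)

# Store 8 dummy transitions; only the 5 most recent ones are kept.
for t in range(8):
    remember(t, 0, 1.0, t + 1, False)

batch = sample_minibatch(3)
print(len(memory), len(batch))
```

Sampling at random from the memory, rather than training on consecutive steps, breaks up the strong correlation between successive transitions of one episode.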

##### Idea 2. (not mine of course ;-) Exploitation vs. Exploration tradeoff
If we only exploit the best actions we already know, we might never discover something new that is helpful or better.
Exploration is especially important when the environment is evolving.
Therefore, define an exploration rate $\epsilon$ that decreases over time, by a decay rate, down to a lower bound.
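As a rough illustration of such a schedule, the sketch below applies a multiplicative decay until the lower bound is reached. The values 1.0, 0.9975 and 0.01 match the defaults used by the agent in this notebook; the function name `decay` is a hypothetical helper introduced here.

```python
epsilon = 1.0           # start by exploring (almost) always
epsilon_decay = 0.9975  # multiplicative decay per update
epsilon_min = 0.01      # lower bound for the exploration rate

def decay(eps):
    """Shrink eps by the decay rate while it is still above the lower bound."""
    if eps > epsilon_min:
        eps *= epsilon_decay
    return eps

steps = 0
while epsilon > epsilon_min:
    epsilon = decay(epsilon)
    steps += 1

print(steps, epsilon)
```

With these values it takes about 1,840 updates for $\epsilon$ to fall to the lower bound, since $\ln(0.01)/\ln(0.9975) \approx 1840$. Because the check happens before the multiplication, the final value ends up marginally below `epsilon_min`.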

---
**Algorithm 1** Deep Q-Learning with Experience Replay
1. Initialize replay memory with certain length
2. Initialize Q-function approximator as a neural network
3. **for** episode in 1 to $N$ **do**

 A. Initialize current state $s_1$ from the environment

 B. **for** timeStep = 1 to $T$ **do**
  
  a. With probability *epsilon* select random action $a_t$
  
  b. otherwise select the action $a_t$ with the highest Q-value predicted by the current model
  
  c. Execute action $a_t$ and observe reward $r_t$ and next state $s_{t+1}$
  
  d. Store transition $(s_t, a_t, r_t, s_{t+1})$ in memory
  
  e. Set $s_t = s_{t+1}$
  
  f. Sample random minibatch of transitions $(s_i, a_i, r_i, s_{i+1})$ from memory
  
  g. Set
$$y_i = \left\{
    \begin{array}{ll}
        r_i & \text{if the episode has ended} \\
        r_i + \gamma\max_{a^{\prime}}Q(s_{i+1}, a^{\prime};\theta) & \text{if the episode has not ended} \\
    \end{array}
\right.$$

   h. Perform a gradient descent step on $(y_i - Q(s_i, a_i; \theta))^2$

 C. **end for**

4. **end for**
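Step g of the algorithm can be sketched for a single transition as follows. This is a minimal illustration in plain Python: `td_target` is a hypothetical helper, and `next_q_values` stands in for the network's Q estimates $Q(s_{i+1}, \cdot\,; \theta)$, which are simply made-up numbers here.

```python
def td_target(reward, next_q_values, done, gamma=0.95):
    """Return the target y_i for one transition.

    next_q_values: Q estimates for every action in the next state
    (assumed input; in practice produced by the network).
    """
    if done:
        # Episode has ended: no future reward to bootstrap from.
        return reward
    # Episode continues: reward plus discounted best next Q-value.
    return reward + gamma * max(next_q_values)

y_running = td_target(1.0, [0.5, 2.0], done=False)    # 1.0 + 0.95 * 2.0
y_terminal = td_target(-10.0, [0.5, 2.0], done=True)  # just the reward
print(y_running, y_terminal)
```

Step h then moves $Q(s_i, a_i; \theta)$ toward this target by one gradient step on the squared difference.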


### Define agent

In [None]:
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size

        # 1. initialize replay memory
        self.memory = deque(maxlen = 2000)

        self.gamma = 0.95 # discount factor: how much future rewards are worth to us

        self.epsilon = 1.0 # exploration rate
        self.epsilon_decay = 0.9975 # exploration rate decreases over time
        self.epsilon_min = 0.01 # lower bound for the exploration rate

        self.learning_rate = 0.001 # step size for SGD optimizer

        # 2. Initialize Q-function approximator as a neural network
        self.model = self._build_model()

    def _build_model(self): # private function to set up model
        model = Sequential()
        model.add(Dense(24, # 24 neurons
                  input_dim = self.state_size, # the input is the state vector
                  activation='relu')) # activation function: ReLU
        model.add(Dense(24, activation='relu' ))
        model.add(Dense(self.action_size, activation='linear')) # linear activation because estimates should
                                                                # lead directly to actions
                                                                # -> take the max for the 'best' action

        model.compile(loss='mse', optimizer=Adam(learning_rate=self.learning_rate)) # `lr` is deprecated in newer Keras versions

        return model

    def remember(self, state, action, reward, next_state, done): # done indicates whether the episode has ended
        self.memory.append((state, action, reward, next_state, done))

    # 3.B.a + b
    def act(self, state):
        if np.random.rand() <= self.epsilon: # decide randomly if explore or exploit according to exploration rate epsilon
            return random.randrange(self.action_size) # case 1: exploration

        act_values = self.model.predict(state) # case 2: exploit - choose the best action predicted by current model
        return np.argmax(act_values[0])

    def replay(self, batch_size):

        minibatch = random.sample(self.memory, batch_size) # sample randomly from memory

        for state, action, reward, next_state, done in minibatch:
            target = reward
            # 3.B.g  Set target y_i
            if not done: # episode is not over yet
                target = (reward + self.gamma*np.amax(self.model.predict(next_state)[0]))   # feed forward:
                                                                                            # predict future rewards
                                                                                            # of all possible actions
                                                                                            # and choose the action with the
                                                                                            # maximum predicted reward
            target_f = self.model.predict(state)
            target_f[0][action] = target  # update Q value of current state in the model

            # 3.B.h Perform a gradient descent step
            self.model.fit(state, target_f, epochs=1, verbose=0) # train the model for only one epoch since we have
                                                                 # information only for one single moment
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay # as long as the minimum is not reached, decrease the exploration rate

    def load(self, name):
        self.model.load_weights(name)

    def save(self, name):
        self.model.save_weights(name)

Start the Agent:

In [None]:
agent = DQNAgent(state_size, action_size)
agent.model.summary()



Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 24)                120       
                                                                 
 dense_1 (Dense)             (None, 24)                600       
                                                                 
 dense_2 (Dense)             (None, 2)                 50        
                                                                 
Total params: 770 (3.01 KB)
Trainable params: 770 (3.01 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
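The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A small sketch of that arithmetic, where `dense_params` is a helper defined here only for illustration:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# state_size, two hidden layers of 24 units, action_size
layer_sizes = [4, 24, 24, 2]
counts = [dense_params(i, o) for i, o in zip(layer_sizes, layer_sizes[1:])]

print(counts, sum(counts))  # prints [120, 600, 50] 770
```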


Let the agent work:

In [None]:
done = False


for e in range(n_episodes):
    state = env.reset()

    # 3.A Initialize current state 𝑠1 from the environment
    state = np.reshape(state, [1, state_size]) # e.g. [[-0.04211199  0.0074463   0.01872044 -0.01805317]]

    for t_ in range(201):

        # env.render() # you can watch the agent train by uncommenting this line
                       # does not work on kaggle-server without further adjustments

        # 3.B.a + b select action
        action = agent.act(state) # 0 or 1; left or right

        # 3.B.c Execute action a_t and observe reward r_t and next state s_{t+1}
        next_state, reward, done, _ = env.step(action) # feed action to environment and receive the next state,
                                                       # the reward and if the episode has ended or not
        reward = reward if not done else -10 # if we die, penalize with -10!

        next_state = np.reshape(next_state, [1, state_size])

        # 3.B.d Store transition (s_t,a_t,r_t,s_{t+1}) in memory
        agent.remember(state, action, reward, next_state, done)
        # 3.B.e set s_t = s_{t+1}
        state = next_state # update current state

        if done:
            # remember: in CartPole the score is time surviving
            print("episode: {}/{}, score: {}, eps: {:.3}".format(e, n_episodes, t_, agent.epsilon))
            break # if done leave loop

    # 3.B.f Sample random minibatch of transitions (s_i, a_i, r_i, s_{i+1}) from memory
    # (note: this implementation replays once per episode, after the time-step loop)
    if len(agent.memory) > batch_size: # train network only after certain amount of experience: here batch size
        agent.replay(batch_size)

    if e % 50 == 0: # save model parameters every 50 episodes
        agent.save(os.path.join(output_dir, "weights_" + '{:04d}'.format(e) + ".hdf5")) # join so the file lands inside output_dir

Streaming output truncated to the last 5000 lines.
episode: 198/1001, score: 34, eps: 0.612
episode: 199/1001, score: 101, eps: 0.611
episode: 200/1001, score: 40, eps: 0.609
episode: 201/1001, score: 12, eps: 0.608
episode: 202/1001, score: 27, eps: 0.606
episode: 203/1001, score: 149, eps: 0.605
episode: 204/1001, score: 62, eps: 0.603
episode: 205/1001, score: 54, eps: 0.602
episode: 206/1001, score: 38, eps: 0.6
episode: 207/1001, score: 147, eps: 0.599
episode: 208/1001, score: 105, eps: 0.597
episode: 209/1001, score: 141, eps: 0.596
episode: 210/1001, score: 89, eps: 0.594
episode: 211/1001, score: 88, eps: 0.593
episode: 212/1001, score: 115, eps: 0.591
episode: 213/1001, score: 68, eps: 0.59
episode: 214/1001, score: 129, eps: 0.588
episode: 215/1001, score: 94, eps: 0.587
episode: 216/1001, score: 39, eps: 0.585
episode: 217/1001, score: 47, eps: 0.584
episode: 218/1001, score: 30, eps: 0.582
episode: 219/1001, score: 13, eps: 0.581
episode: 220/1001, score: 146

In [None]:
import os

# ... (rest of the code)

if e % 50 == 0: # save model parameters every 50 episodes
    # Create the output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    agent.save(os.path.join(output_dir, "weights_" + '{:04d}'.format(e) + ".hdf5"))

In [None]:
agent = DQNAgent(state_size, action_size)
agent.model.summary()



Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_3 (Dense)             (None, 24)                120       
                                                                 
 dense_4 (Dense)             (None, 24)                600       
                                                                 
 dense_5 (Dense)             (None, 2)                 50        
                                                                 
Total params: 770 (3.01 KB)
Trainable params: 770 (3.01 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
