# Deep Q-Network : actualizations

## 1. DQN : basics

### Experience replay

**Replay memory** stores transition samples $(s, a, r, s')$.

**Advantages** 

+ **More efficient use of previous experience**: each sample can be learned from multiple times, which matters especially when gaining real-world experience is costly. Q-learning updates are incremental and converge slowly, so multiple passes over the same data are beneficial, especially when the immediate outcomes (reward, next state) have low variance for a given state-action pair.

+ **Better convergence behavior** (**stability**) when training a function approximator. This is partly because randomly sampled transitions are closer to the i.i.d. data assumed in most supervised learning convergence proofs.

**Disadvantage**

+ It is harder to use **multi-step learning algorithms**, such as $Q(\lambda)$, which can be tuned to give better learning curves by balancing bias (due to bootstrapping) against variance (due to delays and randomness in long-term outcomes). Combining multi-step returns with experience replay is one of the extensions explored in the **Rainbow** paper, *Rainbow: Combining Improvements in Deep Reinforcement Learning*.
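
To make the bias/variance trade-off concrete, here is a minimal sketch of an n-step target of the kind multi-step methods use. The function name `n_step_target` and all numbers are illustrative assumptions, not taken from the Rainbow paper: a longer reward sequence relies less on the bootstrapped estimate (less bias) but sums more random rewards (more variance).

```python
def n_step_target(rewards, bootstrap_value, gamma):
    """Return r_0 + gamma*r_1 + ... + gamma^(n-1)*r_(n-1) + gamma^n * bootstrap_value."""
    target = bootstrap_value
    # Fold the rewards in from the last one to the first one
    for r in reversed(rewards):
        target = r + gamma * target
    return target


# Hypothetical values: reward 1.0 per step, bootstrapped estimate 4.0
one_step = n_step_target([1.0], bootstrap_value=4.0, gamma=0.5)
three_step = n_step_target([1.0, 1.0, 1.0], bootstrap_value=4.0, gamma=0.5)
print(one_step, three_step)  # 3.0 2.25
```

With one step the bootstrapped value is weighted by $\gamma$, with three steps only by $\gamma^3$, so an inaccurate estimate affects the target less.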


### Target network

**DQN loss function**

$$
MSE = \big( R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a', \theta) - Q(S_t, A_t, \theta) \big)^2
$$

It is based on the **Bellman error**, defined for an approximate state-value function $v_w$ at state $s$ as

$$
\begin{aligned}
\bar{\delta}_w (s) & \overset{\text{def}}{=} \Big( \sum_a \pi (a \mid s) \sum_{s', r} p (s', r \mid s, a) [r + \gamma v_w (s')]  \Big)  - v_w (s) \\[10pt]
&= \mathbb{E} \big[ R_{t+1} + \gamma v_w (S_{t+1}) - v_w (S_t) \mid S_t = s, A_t \sim \pi \big] \\[15pt]
\overline{BE}(w) & \overset{\text{def}}{=} \| \bar{\delta}_w \|^2_\mu
\end{aligned}
$$
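
As a worked example of these definitions, the sketch below evaluates the Bellman error on a hypothetical two-state problem. The transition table, value estimates and weighting $\mu$ are made-up numbers, and the policy is assumed to be already folded into the transition probabilities. The $\mu$-weighted squared norm is taken to be $\sum_s \mu(s)\, \bar{\delta}_w(s)^2$.

```python
# transitions[s] = list of (probability, reward, next_state) under the policy
transitions = {
    0: [(1.0, 1.0, 1)],
    1: [(0.5, 0.0, 0), (0.5, 2.0, 1)],
}
gamma = 0.5
v = {0: 1.0, 1: 2.0}    # current value estimates v_w(s)
mu = {0: 0.5, 1: 0.5}   # state weighting


def bellman_error(s):
    """Expected one-step target minus the current estimate at state s."""
    expected = sum(p * (r + gamma * v[s2]) for p, r, s2 in transitions[s])
    return expected - v[s]


delta = {s: bellman_error(s) for s in transitions}
mean_squared_be = sum(mu[s] * delta[s] ** 2 for s in transitions)
print(delta, mean_squared_be)  # {0: 1.0, 1: -0.25} 0.53125
```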

**DQN loss function** using the **target network**

$$
MSE = \big( R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a', \theta^-) - Q(S_t, A_t, \theta) \big)^2
$$

+ The **online network**, which is trained at every step, has the parameters $\theta$.
+ The network that provides the **ground truth** (the **target network**) has the parameters $\theta^-$, which are held fixed for a given period and then refreshed from $\theta$.
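
The relation between the two parameter sets can be sketched with a toy agent whose "networks" are plain lists of weights. The class `TinyAgent` and its methods are hypothetical stand-ins for illustration, not part of the original code.

```python
import copy


class TinyAgent:
    """Toy agent whose 'networks' are plain lists of weights (illustration only)."""

    def __init__(self):
        self.weights = [0.0, 0.0]                          # online parameters (theta)
        self.target_weights = copy.deepcopy(self.weights)  # target parameters (theta^-)

    def gradient_step(self):
        # Stand-in for one training update of the online network
        self.weights = [w + 1.0 for w in self.weights]

    def update_target(self):
        # Hard update: copy theta into theta^-
        self.target_weights = copy.deepcopy(self.weights)


agent = TinyAgent()
for _ in range(3):
    agent.gradient_step()
frozen = list(agent.target_weights)  # unchanged while the online network trains
agent.update_target()
synced = list(agent.target_weights)  # equal to the online weights after the update
print(frozen, synced)  # [0.0, 0.0] [3.0, 3.0]
```

With Keras models, the same hard update is commonly written as `target_model.set_weights(model.get_weights())`.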

## 2. Problem formulation

**Goal** : **keep the pole vertical**, so that the pole does not tip over **AND** the cart does not leave the screen.

**State** : $
\Biggl[ \begin{matrix}
\mathbf{x} \\ \dot{\mathbf{x}} \\ \theta \\ \dot{\theta} \end{matrix} \Biggr]$, where $\mathbf{x}$ : cart position, $\dot{\mathbf{x}}$ : cart velocity, $\theta$ : pole angle, $\dot{\theta}$ : pole angular velocity (here $\theta$ is the angle, not the network parameters of Section 1)

## 3. Algorithm

### Overview

1. Choose an **action** from observing the **state**

2. **Take a time step** in the environment with the chosen action

3. Get the **next state** and the **reward** from the environment

4. **Save the sample** $(s, a, r, s')$ in the **replay memory**

5. **Learn** with the **randomly chosen samples** from the replay memory

6. **Update the target network** at the end of every episode
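
The six steps above can be sketched as a loop. To keep the sketch runnable without gym or Keras, `ToyEnv` and `StubAgent` are hypothetical stand-ins: the environment simply ends every episode after 5 steps, and the agent only counts how often it would train and update. The method name `update_target_model` is an assumption.

```python
import random
from collections import deque


class ToyEnv:
    """Hypothetical stand-in for the environment: an episode ends after 5 steps."""

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        return float(self.t), 1.0, self.t >= 5  # next_state, reward, done


class StubAgent:
    """Hypothetical agent that records samples and counts training calls."""

    def __init__(self):
        self.memory = deque(maxlen=2000)
        self.batch_size = 4
        self.train_calls = 0
        self.target_updates = 0

    def get_action(self, state):
        return random.randrange(2)

    def append_sample(self, *sample):
        self.memory.append(sample)

    def train_model(self):
        random.sample(self.memory, self.batch_size)  # a real agent would fit on this batch
        self.train_calls += 1

    def update_target_model(self):
        self.target_updates += 1


env, agent = ToyEnv(), StubAgent()
for episode in range(3):
    state = env.reset()
    done = False
    while not done:
        action = agent.get_action(state)                              # step 1
        next_state, reward, done = env.step(action)                   # steps 2 and 3
        agent.append_sample(state, action, reward, next_state, done)  # step 4
        if len(agent.memory) >= agent.batch_size:
            agent.train_model()                                       # step 5
        state = next_state
    agent.update_target_model()                                       # step 6
```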

#### Create an agent & load the env

```python
    import gym
    from gym import wrappers

    if __name__ == "__main__":

        env = gym.make('CartPole-v1')
        # Record a video every 10th episode (wrappers.Monitor is part of older gym releases)
        env = wrappers.Monitor(env, "./cartpole-experiment/", force=True,
                               video_callable=lambda episode_id: episode_id % 10 == 0)

        state_size = env.observation_space.shape[0]  # 4
        action_size = env.action_space.n             # 2
```

#### Build a neural network

Create an ANN that takes the **state as input** and outputs the **Q-function** (one Q-value per action)

```python
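    # Assuming standalone Keras; use the tensorflow.keras equivalents if needed
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.optimizers import Adam

    # Method of the agent class
    def build_model(self):
        model = Sequential()
        # Two hidden layers with 24 units each
        model.add(Dense(24, input_dim=self.state_size, activation='relu',
                        kernel_initializer='he_uniform'))
        model.add(Dense(24, activation='relu',
                        kernel_initializer='he_uniform'))
        # Linear output layer: one Q-value per action
        model.add(Dense(self.action_size, activation='linear',
                        kernel_initializer='he_uniform'))
        model.summary()
        # Newer Keras versions name this argument learning_rate instead of lr
        model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
        return model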
```

Layer (type) | Output Shape | Param #
-|-|-
Dense1 | 24 | 120
Dense2 | 24 | 600
Dense3 | 2  | 50
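
The parameter counts in the table can be checked by hand: a dense layer with $n_{in}$ inputs and $n_{out}$ units has $n_{in} \cdot n_{out}$ weights plus $n_{out}$ biases. A small sketch (the helper name `dense_params` is made up for illustration):

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# state size 4 -> 24 -> 24 -> 2 actions
layer_sizes = [4, 24, 24, 2]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))  # [120, 600, 50] 770
```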

#### Take a step

**get_action**(state) : $\varepsilon$-greedy

```python
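    # Method of the agent class: epsilon-greedy action selection
    def get_action(self, state):
        if np.random.rand() <= self.epsilon:
            # Explore: pick a random action
            return random.randrange(self.action_size)
        else:
            # Exploit: pick the action with the highest predicted Q-value
            q_value = self.model.predict(state)
            return np.argmax(q_value[0])

    # Inside train_model: decay epsilon until it reaches epsilon_min
    if self.epsilon > self.epsilon_min:
        self.epsilon *= self.epsilon_decay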
```
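
To get a feeling for how long exploration lasts under this multiplicative decay, the sketch below counts the decay steps needed to reach the floor. The values `epsilon=1.0`, `epsilon_min=0.01` and `epsilon_decay=0.999` are assumptions chosen for illustration; the notes do not state the actual hyperparameters.

```python
import math


def steps_until_min(epsilon, epsilon_min, epsilon_decay):
    """Count the decay steps until epsilon drops to epsilon_min or below."""
    steps = 0
    while epsilon > epsilon_min:
        epsilon *= epsilon_decay
        steps += 1
    return steps


steps = steps_until_min(epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.999)
# Closed form for a start value of 1.0: smallest n with decay**n <= epsilon_min
closed_form = math.ceil(math.log(0.01) / math.log(0.999))
print(steps, closed_form)
```

Under these assumed values the agent keeps exploring for roughly 4,600 training updates before epsilon settles at its minimum.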

#### Save samples in the replay memory

```python
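    from collections import deque

    # In the agent's constructor: keep at most the 2000 most recent samples
    self.memory = deque(maxlen=2000)

    # Method of the agent class: store one transition, including the done flag
    def append_sample(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))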
```
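
A `deque` with `maxlen` silently discards the oldest entries once it is full, which is what turns it into a sliding window over recent experience. A small demonstration with a capacity of 3 instead of 2000 (the transitions are dummy values):

```python
from collections import deque

memory = deque(maxlen=3)
for i in range(5):
    # (state, action, reward, next_state, done) with dummy values
    memory.append((i, 0, 1.0, i + 1, False))

stored_states = [sample[0] for sample in memory]
print(len(memory), stored_states)  # 3 [2, 3, 4]
```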

#### Train the model

```python
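    # In the main loop: start training once enough samples have been collected
    if len(agent.memory) >= agent.train_start:
        agent.train_model()

    # Method of the agent class
    def train_model(self):
        # Decay the exploration rate until it reaches epsilon_min
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

        # Draw a random mini-batch from the replay memory
        mini_batch = random.sample(self.memory, self.batch_size)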

        states = np.zeros((self.batch_size, self.state_size))
        next_states = np.zeros((self.batch_size, self.state_size))
        actions, rewards, dones = [], [], []

        for i in range(self.batch_size):
            states[i] = mini_batch[i][0]
            actions.append(mini_batch[i][1])
            rewards.append(mini_batch[i][2])
            next_states[i] = mini_batch[i][3]
            dones.append(mini_batch[i][4])
            
        # Q-values of the online model for the current states;
        # only the entries of the taken actions are overwritten below
        target = self.model.predict(states)
        # Q-values of the target network for the next states
        target_val = self.target_model.predict(next_states)

        for i in range(self.batch_size):
            if dones[i]:
                target[i][actions[i]] = rewards[i]
            else:
                target[i][actions[i]] = rewards[i] + self.discount_factor * (
                    np.amax(target_val[i]))

        self.model.fit(states, target, batch_size=self.batch_size,
                       epochs=1, verbose=0)
```
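
The per-sample loop that fills `target` can also be written in vectorized form. The sketch below is an alternative formulation under the same update rule, not the code of the original agent; the function name `dqn_targets` and the example arrays are hypothetical.

```python
import numpy as np


def dqn_targets(q_current, q_next_target, actions, rewards, dones, gamma):
    """Overwrite the Q-value of each taken action with its TD target."""
    targets = q_current.copy()
    dones = np.asarray(dones, dtype=bool)
    # Terminal transitions get no bootstrapped term
    bootstrap = np.where(dones, 0.0, gamma * q_next_target.max(axis=1))
    rows = np.arange(len(actions))
    targets[rows, np.asarray(actions)] = np.asarray(rewards) + bootstrap
    return targets


q_current = np.array([[1.0, 2.0], [3.0, 4.0]])      # online network, current states
q_next_target = np.array([[5.0, 6.0], [7.0, 8.0]])  # target network, next states
targets = dqn_targets(q_current, q_next_target,
                      actions=[0, 1], rewards=[1.0, 1.0],
                      dones=[False, True], gamma=0.5)
print(targets)  # rows: [4. 2.] and [3. 1.]
```

Only the entries of the taken actions change, so the squared error for all other actions is zero and they contribute no gradient.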