# Applying A3C to Atari Breakout

## 1. (Review) A2C : Actor-critic



![](https://cdn-images-1.medium.com/max/1024/1*-GfRVLWhcuSYhG25rN0IbA.png)

The **actor** learns policies and the **critic** learns about whatever policy is currently being followed by the actor in order to **criticize** the actor's action choices.

With only a policy network (**REINFORCE**), the update is

$$
\theta_{t+1} \approx \theta_t + \alpha [ \nabla_\theta \log \pi_\theta (a | s) G_t]
$$

With both a policy network and a value network (**actor-critic**), the return $G_t$ is replaced by the critic's estimate $Q_w(s,a)$:

$$
\theta_{t+1} \approx \theta_t + \alpha [ \nabla_\theta \log \pi_\theta (a | s) Q_w(s,a)]
$$

This update has high variance, because the loss is the product of the policy network's cross-entropy and the value network's Q-function:

$$
\text{Loss} = \text{cross-entropy (policy network)} \times \text{Q-function (value network)}
$$

To reduce the variance, we subtract a **baseline** that does not depend on the action, and use the **value function** as that baseline. The value function is approximated with its own parameters, $v$. We define the **advantage function**,

$$
A(S_t, A_t) = Q_w (S_t, A_t) - V_v (S_t)
$$

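Since $Q_w (S_t, A_t)$ can be estimated by $R_{t+1} + \gamma V_v (S_{t+1})$, we only need to approximate the value function, and the advantage becomes the TD error: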

$$
\delta_v = R_{t+1} + \gamma V_v (S_{t+1}) - V_v (S_t)
$$

This gives the **update rule of actor-critic** using the **advantage function**,

$$
\theta_{t+1} \approx \theta_t + \alpha [\nabla_\theta \log \pi_\theta (a|s) \delta_v ]
$$
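
As a rough illustration of this update, the sketch below applies one step to a softmax policy over three actions. It assumes the policy is parameterized directly by its logits (no state input), and the step size, chosen action and TD error are hypothetical values picked for the example.

```python
import numpy as np


def softmax(logits):
    z = logits - np.max(logits)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)


theta = np.zeros(3)                  # logits of a 3-action softmax policy
alpha, action, delta = 0.1, 2, 1.04  # hypothetical step size, action, TD error

pi = softmax(theta)
# For a softmax policy, d log pi(a) / d theta = one_hot(a) - pi
grad_log_pi = np.eye(3)[action] - pi
theta = theta + alpha * grad_log_pi * delta

print(softmax(theta))  # the probability of the chosen action has increased
```

A positive TD error raises the probability of the action that was taken; a negative one would lower it.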

To update (approximate) the value function (network), we minimize

$$
MSE = (R_{t+1} + \gamma V_v (S_{t+1}) - V_v (S_t) )^2
$$
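
As a small worked example of the TD error and the squared-error loss, the sketch below evaluates both for a single transition. The function name `td_error`, the numbers, and the handling of terminal states (no bootstrapping when `done` is true) are assumptions made for illustration.

```python
def td_error(reward, value_next, value_now, gamma=0.99, done=False):
    """One-step TD error: R_{t+1} + gamma * V(S_{t+1}) - V(S_t)."""
    # At a terminal state there is nothing to bootstrap from.
    target = reward if done else reward + gamma * value_next
    return target - value_now


# Hypothetical transition: reward 1, V(S_t) = 0.5, V(S_{t+1}) = 0.6
delta = td_error(reward=1.0, value_next=0.6, value_now=0.5, gamma=0.9)
critic_loss = delta ** 2

print(delta)        # advantage estimate that scales the actor's gradient
print(critic_loss)  # squared TD error minimized by the critic
```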

## 2. Limitations of DQN

![](https://dnddnjs.gitbooks.io/rl/content/dqn16.png)

DQN relies on a **replay memory** of transitions $(s, a, r, s')$.

**Advantages**

+ reduces the correlation between consecutive samples, which mitigates the **non-stationary** aspect of the training data

**Disadvantages**

+ requires a **large memory**
+ requires **off-policy** learning, since the stored samples were generated by older policies
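
To make the trade-off concrete, here is a minimal sketch of a replay memory. The class name `ReplayMemory` and its methods are hypothetical and are not taken from any DQN implementation; the states are plain integers instead of frames.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (s, a, r, s') transitions (illustrative only)."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition when it is full.
        self.buffer = deque(maxlen=capacity)

    def append(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.append(state=t, action=0, reward=0.0, next_state=t + 1)

batch = memory.sample(batch_size=2)
print(len(memory), batch)
```

With image states the buffer has to hold many frames, which is where the memory cost comes from.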

## 3. A3C

![](https://t1.daumcdn.net/cfile/tistory/2225DE4C58A334B62D)

An asynchronous variant of the actor-critic algorithm, introduced in **_Asynchronous Methods for Deep Reinforcement Learning_** (DeepMind, 2016).

To address **non-stationarity** (i.e. to stabilize training), A3C uses several sampling agents, called **actor-learners**. Each actor-learner interacts with its own copy of the environment, so the samples are less correlated.

Each actor-learner updates the global network with the samples it collected over a fixed number of time steps, without waiting for the other actor-learners. This process is inherently **asynchronous**.

### 3.1. Overview

1. Create the **global network** and several **environments + actor-learners**
2. Each actor-learner **samples** from its environment according to its own model for a fixed number of time steps
3. Each actor-learner **updates** the global network with its samples
4. Each actor-learner **updates** itself from the global network
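
The four steps above can be sketched on a toy problem in which a single shared weight stands in for the global network. Everything here is an assumption for illustration: the `Worker` class, the quadratic loss, the noisy samples that play the role of an environment, and the lock that serializes writes to the shared weight.

```python
import threading

import numpy as np

# Hypothetical toy problem: every worker pushes a shared weight toward 3.0
TARGET = 3.0
global_w = np.zeros(1)      # step 1: stands in for the global network
lock = threading.Lock()     # serializes writes to the global weight


class Worker(threading.Thread):
    def __init__(self, seed, updates=200, t_max=5, lr=0.05):
        super().__init__()
        self.rng = np.random.default_rng(seed)  # each worker's own "environment"
        self.updates, self.t_max, self.lr = updates, t_max, lr
        self.local_w = global_w.copy()          # local copy of the global weight

    def run(self):
        for _ in range(self.updates):
            # Step 2: collect t_max noisy samples using the local weight
            samples = TARGET + self.rng.normal(0.0, 0.1, size=self.t_max)
            grad = np.mean(2.0 * (self.local_w - samples))
            with lock:
                # Step 3: apply the local gradient to the global weight
                global_w[:] -= self.lr * grad
                # Step 4: refresh the local copy from the global weight
                self.local_w = global_w.copy()


workers = [Worker(seed=i) for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(global_w)  # close to TARGET, up to sampling noise
```

Between two of its own updates a worker may compute a gradient from a slightly stale local copy, which is the price of not waiting for the other workers.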

### 3.2. Multithreading

Multithreading is the ability of a CPU to execute **multiple threads** concurrently.

```python
import threading

class Agent(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        for i in range(10):
            print(i)

agents = [Agent() for i in range(3)]
for agent in agents:
    agent.start()
```

Output (the threads interleave, so the exact layout varies between runs):

```
000


111


22
2
3
3
3
4
4
4
5
5
5
6
6
6
7
7
7
8
8
8
9
9
9
```



### 3.3. Algorithm

#### Multithreading and Generate A3C Agent class

The **train()** function creates one **Agent** instance per thread.

```python
import threading
import time

import gym

# episode counter shared by all threads
episode = 0
EPISODES = 8000000
env_name = "BreakoutDeterministic-v4"

class A3CAgent:
    def __init__(self, action_size):
        self.action_size = action_size
        # number of actor-learner threads
        self.threads = 8

    def train(self):
        # outline only: the full train() below creates self.threads Agents
        agents = []
        for agent in agents:
            time.sleep(1)
            agent.start()

    def build_model(self):
        pass

class Agent(threading.Thread):
    def __init__(self, action_size, state_size, model, sess,
                 optimizer, discount_factor, summary_ops):
        threading.Thread.__init__(self)

        # period of model updating
        self.t_max = 20
        self.t = 0

    def run(self):
        global episode
        env = gym.make(env_name)

        step = 0
        done = False

        # outline only: the full run() below steps the environment,
        # which sets done when the episode ends
        while not done:
            step += 1
            self.t += 1

if __name__ == "__main__":
    global_agent = A3CAgent(action_size=3)
    global_agent.train()
```

#### Build the models for the actors and the critics

Create a neural network that takes the **state as input** and outputs the **policy** over actions (actor) or the **state value** (critic).

```python
def build_model(self):
    input = Input(shape=self.state_size)
    conv = Conv2D(16, (8, 8), strides=(4, 4), activation='relu')(input)
    conv = Conv2D(32, (4, 4), strides=(2, 2), activation='relu')(conv)
    conv = Flatten()(conv)
    
    # shared layer feeding the policy (actor) and value (critic) heads
    fc = Dense(256, activation='relu')(conv)

    policy = Dense(self.action_size, activation='softmax')(fc)
    value = Dense(1, activation='linear')(fc)

    actor = Model(inputs=input, outputs=policy)
    critic = Model(inputs=input, outputs=value)

    # build the predict functions up front to avoid Keras errors
    # when predicting from multiple threads
    actor._make_predict_function()
    critic._make_predict_function()

    actor.summary()
    critic.summary()

    return actor, critic
```

+ Actor 

Layer (type) | Output Shape | Param #
-|-|-
Input  | (84, 84, 4) | 0
Conv2D | (20, 20, 16)| 4112
Conv2D | (9, 9, 32)  | 8224
Flatten| 2592 | 0
Dense  | 256| 663808
Dense  | 3  | 771

+ Critic

Layer (type) | Output Shape | Param #
-|-|-
Input  | (84, 84, 4) | 0
Conv2D | (20, 20, 16)| 4112
Conv2D | (9, 9, 32)  | 8224
Flatten| 2592 | 0
Dense  | 256| 663808
Dense  | 1  | 257
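
The output shapes and parameter counts of these layers can be recomputed by hand. The sketch below does so with plain arithmetic, assuming convolutions without padding (the Keras default) and one bias per filter or unit; the helper names are hypothetical.

```python
def conv2d_out(size, kernel, stride):
    """Output width/height of a convolution without padding."""
    return (size - kernel) // stride + 1


def conv2d_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel[0] * kernel[1] * in_channels * filters + filters


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units


side1 = conv2d_out(84, 8, 4)     # side length after the first Conv2D
side2 = conv2d_out(side1, 4, 2)  # side length after the second Conv2D
flat = side2 * side2 * 32        # Flatten output size

print(side1, side2, flat)                          # 20 9 2592
print(conv2d_params((8, 8), 4, 16))                # 4112
print(conv2d_params((4, 4), 16, 32))               # 8224
print(dense_params(flat, 256))                     # 663808
print(dense_params(256, 3), dense_params(256, 1))  # 771 257
```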

#### Build the local network

```python
def build_local_model(self):
    input = Input(shape=self.state_size)
    conv = Conv2D(16, (8, 8), strides=(4, 4), activation='relu')(input)
    conv = Conv2D(32, (4, 4), strides=(2, 2), activation='relu')(conv)
    conv = Flatten()(conv)
    fc = Dense(256, activation='relu')(conv)
    policy = Dense(self.action_size, activation='softmax')(fc)
    value = Dense(1, activation='linear')(fc)

    local_actor = Model(inputs=input, outputs=policy)
    local_critic = Model(inputs=input, outputs=value)

    local_actor._make_predict_function()
    local_critic._make_predict_function()

    local_actor.set_weights(self.actor.get_weights())
    local_critic.set_weights(self.critic.get_weights())

    local_actor.summary()
    local_critic.summary()

    return local_actor, local_critic
```

#### Train by making threads

```python
def train(self):
    # Create one Agent (actor-learner) per thread
    agents = [Agent(self.action_size, self.state_size,
                    [self.actor, self.critic], self.sess,
                    self.optimizer, self.discount_factor,
                    [self.summary_op, self.summary_placeholders,
                     self.update_ops, self.summary_writer])
              for _ in range(self.threads)]

    # Start each thread
    for agent in agents:
        time.sleep(1)
        agent.start()

    # Save model every 10 minutes
    while True:
        time.sleep(60 * 10)
        self.save_model("./save_model/breakout_a3c")
```

#### Run the actor-learner

1. Choose an action according to the local network of the actor-learner
2. Receive the next state and reward from the environment
3. Save the samples
4. Repeat steps 1-3 until the episode ends or t_max time steps have passed
5. Send the saved samples to the global network
6. Global network updates itself with the samples from the local network
7. Update the actor-learner with the updated global network

The main part of **run()**

```python
def run(self):
    global episode
    env = gym.make(env_name)

    step = 0

    while episode < EPISODES:
        done = False
        dead = False

        score, start_life = 0, 5
        observe = env.reset()
        next_observe = observe

        # take action 1 (FIRE) for a random 1 to 30 steps
        # so that episodes start from varied states
        for _ in range(random.randint(1, 30)):
            observe = next_observe
            next_observe, _, _, _ = env.step(1)

        state = pre_processing(next_observe, observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))

        while not done:
            step += 1
            self.t += 1
            observe = next_observe
            action, policy = self.get_action(history)

            # map the 3 network outputs to ALE actions
            # 1: FIRE, 2: RIGHT, 3: LEFT
            if action == 0:
                real_action = 1
            elif action == 1:
                real_action = 2
            else:
                real_action = 3

            # if dead, shoot to restart
            if dead:
                action = 0
                real_action = 1
                dead = False

            # take one step with chosen action
            next_observe, reward, done, info = env.step(real_action)

            # pre-process the state at each timestep
            next_state = pre_processing(next_observe, observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3],
                                     axis=3)

            # accumulate the max action probability for logging
            self.avg_p_max += np.amax(self.actor.predict(
                np.float32(history / 255.)))

            if start_life > info['ale.lives']:
                dead = True
                start_life = info['ale.lives']

            score += reward
            reward = np.clip(reward, -1., 1.)

            # save the sample
            self.append_sample(history, action, reward)

            if dead:
                history = np.stack((next_state, next_state,
                                    next_state, next_state), axis=2)
                history = np.reshape([history], (1, 84, 84, 4))
            else:
                history = next_history

            # train when t_max steps are collected or the episode is done
            if self.t >= self.t_max or done:
                self.train_model(done)
                self.update_local_model()
                self.t = 0

            if done:
                # log the episode statistics once the episode ends
                episode += 1
                print("episode:", episode, "  score:", score, "  step:",
                      step)

                stats = [score, self.avg_p_max / float(step),
                         step]
                for i in range(len(stats)):
                    self.sess.run(self.update_ops[i], feed_dict={
                        self.summary_placeholders[i]: float(stats[i])
                    })
                summary_str = self.sess.run(self.summary_op)
                self.summary_writer.add_summary(summary_str, episode + 1)
                self.avg_p_max = 0
                self.avg_loss = 0
                step = 0
```
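
The `history` update in the loop keeps the four most recent frames, newest first. The sketch below reproduces only that stacking step with dummy arrays; the constant-valued frames are hypothetical stand-ins for pre-processed 84x84 images.

```python
import numpy as np

# Hypothetical stand-ins for pre-processed frames
history = np.zeros((1, 84, 84, 4), dtype=np.uint8)         # four older frames
next_state = np.full((1, 84, 84, 1), 255, dtype=np.uint8)  # newest frame

# Put the newest frame in channel 0 and drop the oldest one (channel 3)
next_history = np.append(next_state, history[:, :, :, :3], axis=3)

print(next_history.shape)        # (1, 84, 84, 4)
print(next_history[0, 0, 0, :])  # newest frame first: 255, then three zeros
```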

#### Set the optimizer to update the actor

```python
def actor_optimizer(self):
    action = K.placeholder(shape=[None, self.action_size])
    advantages = K.placeholder(shape=[None, ])

    policy = self.actor.output

    # Policy cross-entropy loss function
    action_prob = K.sum(action * policy, axis=1)
    cross_entropy = K.log(action_prob + 1e-10) * advantages
    cross_entropy = -K.sum(cross_entropy)

    # Negative entropy term: minimizing it raises the policy entropy,
    # which encourages exploration
    entropy = K.sum(policy * K.log(policy + 1e-10), axis=1)
    entropy = K.sum(entropy)

    # Final loss = sum of two losses
    loss = cross_entropy + 0.01 * entropy

    optimizer = RMSprop(lr=self.actor_lr, rho=0.99, epsilon=0.01)
    updates = optimizer.get_updates(self.actor.trainable_weights, [],loss)
    train = K.function([self.actor.input, action, advantages],
                       [loss], updates=updates)
    return train
```

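The loss includes an **entropy** term. The code adds the negative entropy $\sum_i p_i \log p_i$ to the loss, so minimizing the loss maximizes the entropy. Entropy is highest when the policy is **uniform** over the actions, so this term keeps the actor-learner from committing to a single action too early.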

$$
Entropy = - \sum_i p_i \log p_i
$$
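
To see how the entropy formula behaves, the sketch below evaluates it for a uniform and a nearly deterministic policy over three actions. The function name and the two example policies are hypothetical; the small `eps` mirrors the `1e-10` used in the optimizer code.

```python
import numpy as np


def entropy(p, eps=1e-10):
    """Shannon entropy -sum_i p_i log p_i of a probability vector."""
    p = np.asarray(p, dtype=np.float64)
    return float(-np.sum(p * np.log(p + eps)))


uniform = [1 / 3, 1 / 3, 1 / 3]  # equal probability for 3 actions
peaked = [0.98, 0.01, 0.01]      # almost deterministic policy

print(entropy(uniform))  # about log(3), the maximum for 3 actions
print(entropy(peaked))   # much smaller
```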

#### Calculate the k-time step advantage function

The advantage function used in actor-critic is the **one-step temporal difference error**.

$$
Advantage = R_{t+1} + \gamma V_v (S_{t+1}) - V_v (S_t)
$$

The **k-step advantage function** estimates the advantage from the rewards of the next k time steps instead of a single step. Bootstrapping from the value function after k time steps is called **multi-step TD learning**, which lies midway between one-step TD methods (such as SARSA) and Monte Carlo.

$$
Advantage = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^k V_v (S_{t+k}) - V_v (S_t)
$$

```python
def discounted_prediction(self, rewards, done):
    discounted_prediction = np.zeros_like(rewards, dtype=np.float32)
    running_add = 0

    # if the episode is not over, bootstrap from the critic's estimate
    if not done:
        running_add = self.critic.predict(np.float32(
            self.states[-1] / 255.))[0][0]

    # accumulate the discounted return backwards in time
    for t in reversed(range(0, len(rewards))):
        running_add = running_add * self.discount_factor + rewards[t]
        discounted_prediction[t] = running_add
    return discounted_prediction
```
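
The backward recursion is easier to follow on a tiny example. The standalone function below is a sketch of the same computation with the critic's prediction passed in as a number; the name `discounted_returns`, the rewards, the discount factor and the bootstrap value are all hypothetical.

```python
import numpy as np


def discounted_returns(rewards, bootstrap_value, gamma):
    """k-step returns, computed backwards from the bootstrap value."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running_add = bootstrap_value  # 0 if the episode ended, else V(S_{t+k})
    for t in reversed(range(len(rewards))):
        running_add = rewards[t] + gamma * running_add
        returns[t] = running_add
    return returns


# Hypothetical 3-step rollout with gamma = 0.5 and V(S_{t+3}) = 4
print(discounted_returns([1.0, 0.0, 2.0], bootstrap_value=4.0, gamma=0.5))
# [2. 2. 4.]
```

The first entry equals $1 + 0.5 \cdot 0 + 0.5^2 \cdot 2 + 0.5^3 \cdot 4 = 2$, which matches the k-step formula with $k = 3$.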

#### Set the optimizer to update the critic

```python
def critic_optimizer(self):
    discounted_prediction = K.placeholder(shape=(None,))

    # flatten (batch, 1) to (batch,) so the subtraction is element-wise
    value = K.flatten(self.critic.output)

    # set loss as the square of [return - value]
    loss = K.mean(K.square(discounted_prediction - value))

    optimizer = RMSprop(lr=self.critic_lr, rho=0.99, epsilon=0.01)
    updates = optimizer.get_updates(self.critic.trainable_weights, [],loss)
    train = K.function([self.critic.input, discounted_prediction],
                       [loss], updates=updates)
    return train
```

$$
Loss = \left( R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^k V_v (S_{t+k}) - V_v (S_t) \right)^2
$$
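
When a squared-error loss like this is computed on a batch, the shapes of the two operands matter: a vector of returns with shape `(N,)` and a `Dense(1)` output with shape `(N, 1)` broadcast to an `(N, N)` matrix instead of subtracting element by element. The NumPy sketch below shows the effect with hypothetical numbers; the same broadcasting rule applies to backend tensors.

```python
import numpy as np

returns = np.array([2.0, 2.0, 4.0])       # shape (3,)
values = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1), like a Dense(1) output

broadcast_diff = returns - values                # broadcasts to shape (3, 3)
elementwise_diff = returns - values.reshape(-1)  # shape (3,)

print(broadcast_diff.shape, elementwise_diff.shape)
print(np.mean(np.square(elementwise_diff)))  # mean squared error over 3 samples
```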

#### Update the global network

```python
def train_model(self, done):
    discounted_prediction = self.discounted_prediction(self.rewards, done)

    states = np.zeros((len(self.states), 84, 84, 4))
    for i in range(len(self.states)):
        states[i] = self.states[i]

    states = np.float32(states / 255.)

    values = self.critic.predict(states)
    values = np.reshape(values, len(values))

    advantages = discounted_prediction - values

    self.optimizer[0]([states, self.actions, advantages])
    self.optimizer[1]([states, discounted_prediction])
    self.states, self.actions, self.rewards = [], [], []
```

#### Update the local network

```python
def update_local_model(self):
    self.local_actor.set_weights(self.actor.get_weights())
    self.local_critic.set_weights(self.critic.get_weights())
```