_Reinforcement Learning_ (RL) is currently one of the most popular fields of ML. It's been around since the 1950s, but it never made many headlines until 2013, when a startup called DeepMind demonstrated a system that could learn to play just about any Atari game from scratch! Another feat was achieved in March 2016, when their system AlphaGo beat Lee Sedol, one of the world's top Go players.

They did it using some important techniques that will be covered later in this chapter: _policy gradients_, _deep Q-networks_ (DQN), and _Markov decision processes_ (MDP).

## Learning to Optimize Rewards

In RL, a software _agent_ makes _observations_ and takes _actions_ within an _environment_, and in return it receives _rewards_. In short, its objective is to learn to act in a way that maximizes its expected long-term rewards.

Some examples are:
- The agent controlling a walking robot
- The agent controlling Ms. Pac-Man
- The agent playing Go
- The agent controlling a thermostat
- The agent observing stock prices and deciding what to buy or sell

Note that rewards don't have to be positive! In a maze, for example, the agent can get a negative reward at every time step, so it had better find the exit as quickly as possible.
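To make the maze example concrete, here is a minimal sketch. The `CorridorMaze` class, its `step()` signature, and the -1 reward per step are all assumptions made up for illustration; none of this is part of Gym.

```python
import random

class CorridorMaze:
    """Hypothetical 1-D maze: start at cell 0, the exit is at cell `length`."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def step(self, action):
        # action is -1 (move left) or +1 (move right); walls clamp the position
        self.position = max(0, min(self.length, self.position + action))
        done = self.position == self.length
        reward = -1.0  # every step costs 1, so faster exits score higher
        return self.position, reward, done

def run_episode(env, policy, max_steps=1000):
    total_reward, obs = 0.0, env.position
    for _ in range(max_steps):
        obs, reward, done = env.step(policy(obs))
        total_reward += reward
        if done:
            break
    return total_reward

random.seed(0)
random_total = run_episode(CorridorMaze(), lambda obs: random.choice([-1, 1]))
direct_total = run_episode(CorridorMaze(), lambda obs: 1)
print(random_total, direct_total)
```

Every total is negative here, yet the agent can still rank policies: the one that walks straight to the exit loses the least.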

## Policy Search

The algorithm the software agent uses to determine its actions is called its _policy_. The policy can be any algorithm you can think of, and it doesn't even have to be deterministic.

Assume you have a robot vacuum cleaner whose reward is the amount of dust it picks up in 30 minutes. Its policy could be to move forward with probability _p_ every second, or to randomly rotate left or right with probability 1 - _p_. The rotation angle would be a random angle between -_r_ and +_r_. Since this policy involves randomness, it's called a _stochastic policy_.

How would you train this robot? There are only two _policy parameters_ to tweak: the probability _p_ and the angle range _r_. One algo is to try out many different values for them and pick the combination that performs best. This is what's called _policy search_, here using a brute-force approach. However, when the _policy space_ is too large, finding a good set of parameters this way is like looking for a needle in a haystack.
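A brute-force policy search over _p_ and _r_ could look roughly like this. The `simulate_dust()` function is a made-up stand-in for actually running the robot, and its peak at p = 0.8, r = 60 is an arbitrary assumption.

```python
def simulate_dust(p, r):
    """Hypothetical stand-in for a 30-minute run: returns a made-up score.

    By construction the score peaks at p=0.8 and r=60 degrees; a real
    evaluation would run the vacuum cleaner (or a simulator) instead.
    """
    return 1.0 - (p - 0.8) ** 2 - ((r - 60) / 180) ** 2

# Try every combination on a coarse grid of policy parameters
candidates = [(p / 10, r) for p in range(1, 10) for r in range(15, 181, 15)]
best_p, best_r = max(candidates, key=lambda params: simulate_dust(*params))
print(best_p, best_r)
```

With only two parameters the grid has 108 points; the number of combinations grows exponentially with the number of parameters, which is why this approach breaks down for large policy spaces.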

Another way to explore this is to use _genetic algorithms_. You can create 100 policies and test them. Then kill off the 80 worst and make the 20 survivors produce 4 offspring each. An offspring is a copy of the parent with some random variation. This continues until a good policy is found.
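Here is a minimal sketch of that loop. The `fitness()` function is a hypothetical reward (a real one would run the robot), the mutation sizes are arbitrary, and keeping the 20 survivors alongside their 80 offspring is an assumption made so that the population stays at 100.

```python
import random

def fitness(policy):
    """Hypothetical reward: higher when (p, r) is close to (0.8, 60)."""
    p, r = policy
    return -((p - 0.8) ** 2) - ((r - 60) / 180) ** 2

def mutate(policy, rng):
    """Copy of the parent with some random variation, clamped to valid ranges."""
    p, r = policy
    new_p = min(1.0, max(0.0, p + rng.gauss(0, 0.05)))
    new_r = min(180.0, max(0.0, r + rng.gauss(0, 5)))
    return (new_p, new_r)

rng = random.Random(42)
population = [(rng.random(), rng.uniform(0, 180)) for _ in range(100)]
initial_best = max(fitness(policy) for policy in population)

for generation in range(30):
    # Kill off the 80 worst policies
    survivors = sorted(population, key=fitness, reverse=True)[:20]
    # Each survivor produces 4 offspring
    offspring = [mutate(parent, rng) for parent in survivors for _ in range(4)]
    population = survivors + offspring

best = max(population, key=fitness)
print(best, fitness(best))
```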

Yet another approach is to use optimization techniques! By evaluating the gradients of the rewards with regard to the policy params, you can tweak the params by following the gradients toward higher rewards (_gradient ascent_). This approach is called _policy gradients_ (PG).

Now, let's move on to creating an environment.

## Introduction to OpenAI Gym

One of the greatest challenges of RL is that you need a working environment to train the agent. If you want an agent to play an Atari game, you need an Atari emulator. If you want to program a walking robot, the environment is the real world, but training in the real world has its limits: you can't undo mistakes or speed up time. So you generally need a _simulated environment_, at least to bootstrap training.

Here's a quick look at Gym:

In [2]:
import gym

env = gym.make("CartPole-v0")
print(env)

obs = env.reset()
print(obs)

env.render()

<TimeLimit<CartPoleEnv<CartPole-v0>>>
[ 0.00440016 -0.01125002  0.01610391 -0.01434425]


True

If you want `render()` to return the rendered image as a NumPy array, you can set the **mode** param to **rgb_array**:

In [3]:
img = env.render(mode="rgb_array")
img.shape

(400, 600, 3)

Let's ask the env what actions are possible...

In [4]:
env.action_space

Discrete(2)

There are two discrete actions, 0 and 1, which represent accelerating left or right respectively.

Let's try accelerating to the right.

In [5]:
action = 1
obs, reward, done, info = env.step(action)
print(obs)
print(reward)
print(done)
print(info)

[ 0.00417516  0.18363732  0.01581702 -0.30190301]
1.0
False
{}


The `step()` method executes the given action and returns four values:

obs
> The new observation/state.

reward
> In this environment, you get a reward of 1.0 every step

done
> This will be true when the _episode_ is over.

info
> A dictionary that may provide extra debug information. It should not be used for training.

Let's hardcode a simple policy that accelerates left when the pole is leaning left and right when it leans right. We'll run this and see the average reward it gets over 500 episodes:

In [5]:
def basic_policy(obs):
    angle = obs[2]
    return 0 if angle < 0 else 1

totals = []
for episode in range(500):
    episode_rewards = 0
    obs = env.reset()
    for step in range(1000):
        action = basic_policy(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)

Let's look at the results:

In [6]:
import numpy as np

print(np.mean(totals), np.std(totals), np.min(totals), np.max(totals))

42.35 8.87780941448959 25.0 68.0


It wasn't able to hold itself up for longer than 68 steps! Big OOF.

## Neural Network Policies

We're going to create a neural network policy. It will take an observation as input and output a probability for each action, and then we'll select an action randomly according to those probabilities. Since there are only two actions, the network will output the probability _p_ of action 0 (left), and the probability of action 1 (right) will be 1 - _p_.

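A good question to ask here is why we pick a random action based on the probabilities rather than just the highest-scoring action. This lets the agent find the right balance between _exploring_ new actions and _exploiting_ the actions known to work well. Think of the restaurant problem: if you always order the dish you liked on your first visit, you'll never find out whether another dish is even better.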

Do note that in this environment, past actions and observations can safely be ignored, since each observation contains the environment's full state. The CartPole problem is about as simple as can be! Here's the code for this policy:

In [7]:
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

# 1. Specify the neural network architecture
n_inputs = 4
n_hidden = 4
n_outputs = 1
initializer = tf.contrib.layers.variance_scaling_initializer()

# 2. Build the neural network
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden = fully_connected(X, n_hidden, activation_fn=tf.nn.elu,
                        weights_initializer=initializer)
logits = fully_connected(hidden, n_outputs, activation_fn=None,
                        weights_initializer=initializer)
outputs = tf.nn.sigmoid(logits)

# 3. Select a random action based on the estimated probabilities
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)
init = tf.global_variables_initializer()

A brief run-through of the code above:

1. We define the architecture: 4 inputs (the CartPole observation has 4 values), 4 hidden units, and a single output (the probability of going left).

2. Next we build the neural network. It's a vanilla MLP. Note the output uses the logistic (sigmoid) activation function to produce a probability from 0.0 to 1.0. If there were more than two actions, there would be one output neuron per action and softmax activation would be used instead.

3. Lastly, the `multinomial()` function is used to pick a random action.
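To see what step 3 amounts to without TensorFlow, here is a plain-Python sketch of sampling an action from the network's output probability. The `sample_action()` helper and the fixed value 0.7 are illustrative assumptions.

```python
import random

def sample_action(p_left, rng=random):
    """Return 0 (left) with probability p_left, otherwise 1 (right)."""
    return 0 if rng.random() < p_left else 1

rng = random.Random(0)
actions = [sample_action(0.7, rng) for _ in range(10_000)]
left_fraction = actions.count(0) / len(actions)
print(left_fraction)
```

Over many draws the agent goes left about 70% of the time, but it still tries going right now and then, which is what keeps it exploring.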

Now we have a neural network policy... but how do we train it?

## Evaluating Actions: The Credit Assignment Problem

If we knew the best action at each step, we could just train the network by minimizing the cross entropy between the estimated and target probabilities. That would be supervised learning, however. In RL the only guidance the agent gets is through rewards, and those are typically sparse and delayed. For example, if an agent balances a pole for 100 steps, how can it know which of the 100 actions were good and which were bad? All it knows is that the pole fell after the last action, but that action is not necessarily what caused the fall. This is called the _credit assignment problem_: when an agent gets a reward, it's hard to know which actions should be credited or blamed for it.

The common strategy for this is to evaluate an action based on the sum of all the rewards that come after it, applying a _discount rate r_ at each step. Discount rates are typically 0.95 or 0.99. With a rate of 0.95, rewards 13 steps into the future count for roughly half as much as immediate rewards (0.95^13 ≈ 0.5), while with a rate of 0.99 it takes 69 steps (0.99^69 ≈ 0.5).
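The halving figures are easy to check. In this small sketch, `steps_to_halve()` is a helper name introduced here for illustration:

```python
import math

def steps_to_halve(discount_rate):
    """Number of steps n such that discount_rate ** n == 0.5."""
    return math.log(0.5) / math.log(discount_rate)

for rate, steps in ((0.95, 13), (0.99, 69)):
    # Exact halving point, then the weight of a reward `steps` steps away
    print(rate, round(steps_to_halve(rate), 2), round(rate ** steps, 3))
```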

Using this model, on average good actions will get a better score than bad ones. To get fairly reliable scores, we must run many episodes and normalize all the action scores. Now that we can evaluate each action, we can train our first agent using policy gradients.

## Policy Gradients

PG algos optimize the policy's params by following the gradients toward higher rewards. One popular class of PG algos, called _REINFORCE algorithms_, was introduced in 1992 by Ronald Williams. Here's a common variant:

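1. First, let the neural network policy play the game several times. At each step, compute the gradients that would make the chosen action even more likely, but don't apply them yet.

2. Once you've run several episodes, compute each action's score (using the method described in the previous section).

3. If an action's score is positive, the action was good, so apply the gradients computed earlier to make it even more likely in the future. If the score is negative, apply the opposite gradients to make the action less likely. In practice, this just means multiplying each gradient vector by the corresponding action's score.

4. Finally, compute the mean of all the resulting gradient vectors and use it to perform a Gradient Descent step.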

Let's implement this algo to train our neural network policy. Let's start by adding the target probability, the cost function, and the training operation.

In [8]:
y = 1.0 - tf.to_float(action)

Now that we have a target proba, we can define a cost function and compute the gradients.

In [9]:
learning_rate = 0.01

cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y,
                                                       logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(cross_entropy)

Note that we call `compute_gradients()` and not `minimize()`. This is because we want to tweak the gradients before we apply them.

Let's put all the gradients in a list for ease of use.

In [10]:
gradients = [grad for grad, variable in grads_and_vars]

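Next, we need a way to feed the tweaked gradients back to the optimizer. We create one placeholder per gradient vector, plus a training op that applies whatever gradients we feed in: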

In [11]:
gradient_placeholders = []
grads_and_vars_feed = []

for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))
    
training_op = optimizer.apply_gradients(grads_and_vars_feed)

And an initializer and saver!

In [12]:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

On to the execution phase! We need a couple of funcs to compute the total discounted rewards, given the raw rewards, and to normalize the results across multiple episodes:

In [13]:
def discount_rewards(rewards, discount_rate):
    discounted_rewards = np.empty(len(rewards))
    cumulative_rewards = 0
    for step in reversed(range(len(rewards))):
        cumulative_rewards = rewards[step] + cumulative_rewards * discount_rate
        discounted_rewards[step] = cumulative_rewards
    return discounted_rewards

def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted_rewards = [discount_rewards(rewards, discount_rate)
                             for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean)/reward_std
           for discounted_rewards in all_discounted_rewards]

And now to check that it works!

In [14]:
res = discount_rewards([10, 0, -50], discount_rate=0.8)
print(res)

[-22. -40. -50.]


In [15]:
res = discount_and_normalize_rewards([[10, 0, -50], [10, 20]], discount_rate=0.8)
print(res)

[array([-0.28435071, -0.86597718, -1.18910299]), array([1.26665318, 1.0727777 ])]


After verifying the two functions, all we have to do is train the policy!

In [16]:
n_iterations = 250
n_max_steps = 1000
n_games_per_update = 10
save_iterations = 10
discount_rate = 0.95

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        all_rewards = []
        all_gradients = []
        for game in range(n_games_per_update):
            current_rewards = []
            current_gradients = []
            obs = env.reset()
            for step in range(n_max_steps):
                action_val, gradients_val = sess.run(
                [action, gradients],
                feed_dict={X: obs.reshape(1, n_inputs)})
                obs, reward, done, info = env.step(action_val[0][0])
                current_rewards.append(reward)
                current_gradients.append(gradients_val)
                env.render() #fun
                if done:
                    break
            all_rewards.append(current_rewards)
            all_gradients.append(current_gradients)
        
        all_rewards = discount_and_normalize_rewards(all_rewards, discount_rate)
        feed_dict = {}
        for var_index, grad_placeholder in enumerate(gradient_placeholders):
            mean_gradients = np.mean(
            [reward * all_gradients[game_index][step][var_index]
            for game_index, rewards in enumerate(all_rewards)
            for step, reward in enumerate(rewards)],
            axis=0)
            feed_dict[grad_placeholder] = mean_gradients
        sess.run(training_op, feed_dict=feed_dict)
        if iteration % save_iterations == 0:
            saver.save(sess, "./my_policy_net_pg.ckpt")


Despite the relative simplicity, this algo is quite powerful! In fact, AlphaGo was based on a similar PG algo (plus _Monte Carlo Tree Search_, which you should look up!).

We'll now look at another popular family of algos. Whereas PG algos directly try to optimize the policy to increase rewards, these algos are less direct: the agent learns to estimate the expected sum of discounted future rewards for each state (or for each action in each state), and then uses this knowledge to decide how to act. To understand these algos, we need to understand _Markov decision processes_ (MDP).

## Markov Decision Processes

In the early 20th century, Andrey Markov studied stochastic processes with no memory, called _Markov chains_. Such a process has a fixed number of states, and it randomly evolves from one state to another at each step. The probability of evolving from state _s_ to state _s'_ is fixed, and it depends only on the pair (_s_, _s'_), not on past states.

Markov decision processes resemble Markov chains but with a twist: at each step the agent can choose one of several possible actions, and the transition probabilities depend on the chosen action. Moreover, some state transitions return a reward (positive or negative), and the agent's goal is to find a policy that maximizes rewards over time.

Refer to book for more examples.

Bellman found a way to estimate the _optimal state value_ of any state _s_: the sum of all discounted future rewards the agent can expect on average after reaching state _s_, assuming it acts optimally.

Refer to book for equation.

This leads to an algo that can estimate the optimal state value of every state: you first initialize all the state value estimates to zero, and then you iteratively update them using the _Value Iteration_ algo. Given enough time, the estimates are guaranteed to converge to the optimal state values.
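For reference, the Bellman optimality equation and the Value Iteration update are usually written as follows. The notation is an assumption chosen to match the code in these notes: $T(s, a, s')$ is the transition probability, $R(s, a, s')$ the reward, and $\gamma$ the discount rate.

```latex
V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, V^*(s') \right]
\quad \text{for all } s

V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, V_k(s') \right]
\quad \text{for all } s
```

The second line is just the first one turned into an update rule: the estimate at iteration $k+1$ is computed from the estimates at iteration $k$.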

Knowing the optimal state values is great for evaluating a policy, but it doesn't tell the agent explicitly what to do. Bellman found a very similar algo to estimate the optimal _state-action values_, generally called _Q-Values_. The optimal Q-Value of the state-action pair (_s_, _a_) is the sum of discounted future rewards the agent can expect on average after it reaches state _s_ and chooses action _a_, assuming it acts optimally after that action.
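The corresponding Q-Value Iteration update is usually written like this, using the same assumed notation as the code in these notes ($T$ for transition probabilities, $R$ for rewards, $\gamma$ for the discount rate):

```latex
Q_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, \max_{a'} Q_k(s', a') \right]
\quad \text{for all } (s, a)

\pi^*(s) = \operatorname*{argmax}_a \, Q^*(s, a)
```

The second line says that the optimal policy $\pi^*$ simply picks the action with the highest optimal Q-Value in each state.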

Once the optimal Q-Values have been found, defining the optimal policy is trivial: when the agent is in state _s_, it should choose the action with the highest Q-Value for that state. We're going to apply this to the MDP shown in the book. First, to define the MDP:

In [17]:
nan = np.nan
T = np.array([
    [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], [nan, nan, nan], [0.0, 0.0, 1.0]],
    [[nan, nan, nan], [0.8, 0.1, 0.1], [nan, nan, nan]]
])
R = np.array([
    [[10.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[10.0, 0.0, 0.0], [nan, nan ,nan], [0.0, 0.0, -50.0]],
    [[nan, nan, nan], [40.0, 0.0, 0.0], [nan, nan, nan]],
])
possible_actions = [[0, 1, 2], [0, 2], [1]]

Now let's run the Q-Value Iteration algo:

In [18]:
Q = np.full((3, 3), -np.inf)
for state, actions in enumerate(possible_actions):
    Q[state, actions] = 0.0
    
learning_rate = 0.01
discount_rate = 0.95
n_iterations = 100

for iteration in range(n_iterations):
    Q_prev = Q.copy()
    for s in range(3):
        for a in possible_actions[s]:
            Q[s, a] = np.sum([
                T[s, a, sp] * (R[s, a, sp] + discount_rate * np.max(Q_prev[sp]))
                for sp in range(3)
            ])

The Q-Values:

In [None]:
print(Q)
print(np.argmax(Q, axis=1))

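This gives us the optimal policy for this MDP with a discount rate of 0.95. If the rate were 0.9, the best action in state s1 would become a0 (stay put): the lower the discount rate, the more the agent values the present over the future, so the prospect of future rewards is no longer worth the immediate -50 penalty.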
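The effect of the discount rate can be checked with a plain-Python version of Q-Value Iteration on the same MDP. This is a sketch: the `optimal_policy()` helper is introduced here for illustration, impossible actions are marked with `None` instead of `nan`, and 1,000 iterations are used (more than the 100 above) to be safely converged.

```python
T = [  # T[s][a][sp]: transition probabilities; None marks impossible actions
    [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], None, [0.0, 0.0, 1.0]],
    [None, [0.8, 0.1, 0.1], None],
]
R = [  # R[s][a][sp]: rewards
    [[10.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[10.0, 0.0, 0.0], None, [0.0, 0.0, -50.0]],
    [None, [40.0, 0.0, 0.0], None],
]
possible_actions = [[0, 1, 2], [0, 2], [1]]

def optimal_policy(discount_rate, n_iterations=1000):
    """Run Q-Value Iteration and return the best action for each state."""
    Q = [[0.0 if a in possible_actions[s] else float("-inf") for a in range(3)]
         for s in range(3)]
    for _ in range(n_iterations):
        Q_prev = [row[:] for row in Q]
        for s in range(3):
            for a in possible_actions[s]:
                Q[s][a] = sum(
                    T[s][a][sp] * (R[s][a][sp] + discount_rate * max(Q_prev[sp]))
                    for sp in range(3))
    return [row.index(max(row)) for row in Q]

print(optimal_policy(0.95))
print(optimal_policy(0.90))
```

The two printed lists differ only in the entry for state s1, which is the switch from a2 to a0 described above.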

## Temporal Difference Learning and Q-Learning

RL problems with discrete actions can often be modeled as MDPs, but the agent initially has no idea what the transition probabilities or the rewards are. It must experience each state and each transition at least once to know the rewards, and multiple times to get a reasonable estimate of the transition probabilities.

The _Temporal Difference Learning_ (TD Learning) algorithm is similar to the Value Iteration algo, but tweaked to take into account that the agent has only partial knowledge of the MDP. We assume that the agent initially knows only the possible states and actions. The agent uses an _exploration policy_ to explore the MDP, and it updates the estimates of the state values based on the transitions and rewards it actually observes.

> A quick side note: TD Learning is much like Stochastic Gradient Descent in that it handles one sample at a time. It can only truly converge if you gradually reduce the learning rate.

For each state _s_, this algo keeps track of a running average of the immediate rewards the agent gets upon leaving that state, plus the rewards it expects to get later (assuming it acts optimally).
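The TD Learning update is commonly written as below. The symbols are assumptions for illustration: $\alpha$ is the learning rate, $\gamma$ the discount rate, $r$ the reward observed on the transition from $s$ to $s'$.

```latex
V_{k+1}(s) \leftarrow (1 - \alpha) \, V_k(s) + \alpha \left( r + \gamma \, V_k(s') \right)
```

The term $r + \gamma \, V_k(s')$ is the new sample, and $\alpha$ controls how much weight it gets against the running average.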

The Q-Learning algo is an adaptation of the Q-Value Iteration algo to the situation where the transition probabilities and rewards are initially unknown.

For each state-action pair (_s_, _a_), this algo keeps track of a running average of the rewards the agent gets upon leaving state _s_ with action _a_, plus the rewards it expects to get later. Since the target policy would act optimally, we take the max of the Q-Value estimates for the next state.
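Written out, the Q-Learning update usually takes this form, with $\alpha$ the learning rate and $\gamma$ the discount rate (notation assumed here for illustration):

```latex
Q_{k+1}(s, a) \leftarrow (1 - \alpha) \, Q_k(s, a) + \alpha \left( r + \gamma \, \max_{a'} Q_k(s', a') \right)
```

Note where $\alpha$ sits: the old estimate is weighted by $1 - \alpha$ and the new sample by $\alpha$, so a decaying learning rate makes each new sample count less over time.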

Next, let's implement Q-Learning:

In [19]:
import numpy.random as rnd

learning_rate0 = 0.05
learning_rate_decay = 0.1
n_iterations = 20000

s = 0 # Start in state 0

Q = np.full((3, 3), -np.inf)  # -inf for impossible actions
for state, actions in enumerate(possible_actions):
    Q[state, actions] = 0.0 # Initial value = 0.0 for all possible actions
    
for iteration in range(n_iterations):
    a = rnd.choice(possible_actions[s]) # choose an action (randomly)
    sp = rnd.choice(range(3), p=T[s,a]) # pick next state using T[s,a]
    reward = R[s, a, sp]
    learning_rate = learning_rate0 / (1 + iteration * learning_rate_decay)
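    # running average: keep (1 - learning_rate) of the old estimate,
    # move learning_rate of the way toward the new sample
    Q[s, a] = (1 - learning_rate) * Q[s, a] + learning_rate * (
        reward + discount_rate * np.max(Q[sp])
    )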
    s = sp # Move to the next state

Given enough iterations, this would converge to the optimal Q-Values. This is called an _off-policy_ algorithm because the policy trained is not the one being executed.

## Exploration Policies

Q-Learning can work only if the exploration policy explores the MDP thoroughly enough. While a purely random policy is guaranteed to eventually visit every state and every transition many times, it may take an extremely long time to do so. A better option is the _ε-greedy policy_: at each step it acts randomly with probability ε, or greedily (choosing the action with the highest Q-Value) with probability 1-ε. The advantage of this policy is that the agent spends more and more time exploring the interesting parts of the MDP as the Q-Value estimates improve, while still spending some time visiting unknown regions. It's quite common to start with a high value for ε and then gradually reduce it (e.g., from 1.0 down to 0.05).

Or, rather than relying on chance for exploration, you can encourage the exploration policy to try actions it hasn't tried much before. This can be implemented by a bonus added to the Q-Value estimates!
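One simple form of such a bonus is to add a term that shrinks as an action gets tried more often. In this sketch the function names, the `curiosity` hyperparameter, and the `1 / (1 + n)` shape of the bonus are illustrative assumptions:

```python
def exploration_value(q_value, visit_count, curiosity=1.0):
    """Q-Value estimate plus a bonus that shrinks as the action is tried more."""
    return q_value + curiosity / (1 + visit_count)

def pick_action(q_values, visit_counts, curiosity=1.0):
    """Choose the action with the highest bonus-adjusted Q-Value."""
    scores = [exploration_value(q, n, curiosity)
              for q, n in zip(q_values, visit_counts)]
    return scores.index(max(scores))

# Action 1 has a lower Q-Value but has never been tried, so it gets picked
print(pick_action([1.0, 0.8], [50, 0]))
# Once both actions are well explored, the higher Q-Value wins
print(pick_action([1.0, 0.8], [50, 50]))
```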

## Approximate Q-Learning

One of the great problems with Q-Learning is that it doesn't scale well to large (or even medium) MDPs with many states and actions. For example, Ms. Pac-Man has 250 pellets, each of which can be either present or already eaten, making a total of 2^250 (roughly 10^75) possible states for the pellets alone! There's no way to keep track of an estimate for every single Q-Value.
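Python's arbitrary-precision integers make it easy to get a feel for the size of that number (a quick sketch; the variable names are introduced here):

```python
import math

n_pellets = 250
n_pellet_states = 2 ** n_pellets  # each pellet is either present or eaten
print(f"{n_pellet_states:.2e}")
print(len(str(n_pellet_states)), "digits")
print(math.log10(n_pellet_states))
```

A table with one row per state is clearly out of the question, which is what motivates approximating the Q-Values instead.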

The solution is to approximate the Q-Values using a manageable number of parameters. This is called _Approximate Q-Learning_. DeepMind showed that using DNNs can work rather well, especially for complex problems. A DNN used to estimate Q-Values is called a _deep Q-network_ (DQN), and using a DQN for Approximate Q-Learning is called _Deep Q-Learning_.

We'll now use Deep Q-Learning to train an agent to play Ms. Pac-Man. The code can be tweaked to play most Atari games, but it works best on action games.

## Learning to Play Ms. Pac-Man Using Deep Q-Learning

First, we make the Ms. Pac-Man environment:

In [1]:
import gym

env = gym.make("MsPacman-v0")
obs = env.reset()
print(obs.shape) # [height, width, channels]
print(env.action_space) # Valid Actions

(210, 160, 3)
Discrete(9)


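As you can see, there are nine discrete actions, which correspond to the nine possible positions of the joystick (center, up, down, left, right, and the four diagonals). The observations are just RGB screenshots of the game. They're a bit large, so we'll crop them, scale them down, and convert them to greyscale...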

In [2]:
import numpy as np

mspacman_color = np.array([210, 164, 74]).mean()

def preprocess_observation(obs):
    img = obs[1:176:2, ::2] # crop and downsize
    img = img.mean(axis=2) # to greyscale
    img[img==mspacman_color] = 0 # improve contrast
    img = (img - 128) / 128 # normalize to roughly -1.0 to 1.0
    return img.reshape(88, 80, 1)

Next we create the DQN. It could just take a state-action pair (_s_, _a_) as input and output an estimate of the corresponding Q-Value, but since the actions are discrete, it's more convenient to use a neural network that takes only a state _s_ as input and outputs one Q-Value estimate per action. The DQN will be composed of three convolutional layers, followed by two fully connected layers, including the output layer.

The training algo we use requires two DQNs with the same architecture: one to drive Ms. Pac-Man during training (the _actor_), the other to watch the actor and learn from its trials and errors (the _critic_). At regular intervals, the critic will be copied to the actor. Since we need two DQNs, we'll create a function to build them:

In [3]:
import tensorflow as tf
from tensorflow.contrib.layers import convolution2d, fully_connected

learning_rate = 0.01
input_height = 88
input_width = 80
input_channels = 1
conv_n_maps = [32, 64, 64]
conv_kernel_sizes = [(8,8), (4,4), (3,3)]
conv_strides = [4, 2, 1]
conv_paddings = ["SAME"]*3
conv_activation = [tf.nn.relu]*3
n_hidden_in = 64 * 11 * 10 # conv3 has 64 maps of 11x10 each
n_hidden = 512
hidden_activation = tf.nn.relu
n_outputs = env.action_space.n # 9 discrete actions!
initializer = tf.contrib.layers.variance_scaling_initializer()

def q_network(X_state, scope):
    prev_layer = X_state
    conv_layers = []
    with tf.variable_scope(scope) as scope:
        for n_maps, kernel_size, stride, padding, activation in zip(
                conv_n_maps, conv_kernel_sizes, conv_strides,
                conv_paddings, conv_activation):
            prev_layer = convolution2d(
                prev_layer, num_outputs=n_maps, kernel_size=kernel_size,
                stride=stride, padding=padding, activation_fn=activation,
                weights_initializer=initializer)
            conv_layers.append(prev_layer)
        last_conv_layer_flat = tf.reshape(prev_layer, shape=[-1, n_hidden_in])
        hidden = fully_connected(
            last_conv_layer_flat, n_hidden, activation_fn=hidden_activation,
            weights_initializer=initializer)
        outputs = fully_connected(
            hidden, n_outputs, activation_fn=None,
            weights_initializer=initializer)
    trainable_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                      scope=scope.name)
    trainable_vars_by_name = {var.name[len(scope.name):]: var
                              for var in trainable_vars}
    return outputs, trainable_vars_by_name

The first part of this code defines the hyperparameters of the DQN architecture. The `q_network()` function then builds the DQN, and the `trainable_vars_by_name` dictionary gathers all of its trainable variables, keyed by name with the scope prefix stripped. This will help us implement the copy operation later on.
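The value of `n_hidden_in` can be double-checked by hand. With `"SAME"` padding, a convolution's output size along each axis is the input size divided by the stride, rounded up. The `conv_output_size()` helper below is introduced here just for this check:

```python
import math

def conv_output_size(size, stride):
    """Output length along one axis for a "SAME"-padded convolution."""
    return math.ceil(size / stride)

height, width = 88, 80  # input_height, input_width
for stride in (4, 2, 1):  # conv_strides
    height = conv_output_size(height, stride)
    width = conv_output_size(width, stride)

n_flat = 64 * height * width  # 64 feature maps in the last conv layer
print(height, width, n_flat)
```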

Now let's make the input placeholder, the two DQNs, and the operation to copy the critic DQN to the actor DQN:

In [4]:
X_state = tf.placeholder(tf.float32, shape=[None, input_height, input_width,
                                            input_channels])
actor_q_values, actor_vars = q_network(X_state, scope="q_networks/actor")
critic_q_values, critic_vars = q_network(X_state, scope="q_networks/critic")

copy_ops = [actor_var.assign(critic_vars[var_name])
            for var_name, actor_var in actor_vars.items()]
copy_critic_to_actor = tf.group(*copy_ops)

The actor DQN can now play Ms. Pac-Man, but you'll want to combine it with the _ε-greedy policy_ or another exploration strategy.

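What about the critic? In short, it learns by trying to make its Q-Value predictions match the Q-Values estimated by the actor through its experience of the game. The actor plays for a while and stores its experiences in a _replay memory_. At regular intervals we sample a batch of memories, estimate the target Q-Values from them, and train the critic to predict these targets using supervised learning techniques. Every few training iterations, the critic is copied to the actor.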

Refer to book for equation.
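As a rough guide, the target Q-Value the critic is trained on is commonly written as below. The notation is assumed here, but it matches how `y_val` is computed in the training loop later: $r$ is the observed reward, $\gamma$ the discount rate, and $Q_{\text{actor}}$ the actor DQN's estimates for the next state $s'$.

```latex
y(s, a) = r + \gamma \cdot \max_{a'} Q_{\text{actor}}(s', a')
```

When the game ends on that step, the second term is dropped, which is what multiplying by the `continue` flag does in the code.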

> While you can omit the replay memory, it's highly recommended that you don't. Without it, you would train the critic DQN on consecutive experiences, which can be very correlated. That would introduce a lot of bias and slow down convergence.

Let's add the critic DQN's training ops. First we need to be able to compute its predicted Q-Value for each state-action pair in the memory batch.

In [5]:
X_action = tf.placeholder(tf.int32, shape=[None])
q_value = tf.reduce_sum(critic_q_values * tf.one_hot(X_action, n_outputs),
                        axis=1, keepdims=True)


Next, we add the training ops:

In [6]:
y = tf.placeholder(tf.float32, shape=[None, 1])
cost = tf.reduce_mean(tf.square(y - q_value))
global_step = tf.Variable(0, trainable=False, name='global_step')
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(cost, global_step=global_step)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

Before we get to the execution phase, let's build the replay memory and some other tools:

In [7]:
from collections import deque
import numpy.random as rnd

replay_memory_size = 10000
replay_memory = deque([], maxlen=replay_memory_size)

def sample_memories(batch_size):
    # pick batch_size distinct memories at random
    indices = rnd.permutation(len(replay_memory))[:batch_size]
    cols = [[], [], [], [], []] # state, action, reward, next_state, continue
    for idx in indices:
        memory = replay_memory[idx]
        for col, value in zip(cols, memory):
            col.append(value)
    cols = [np.array(col) for col in cols]
    return (cols[0], cols[1], cols[2].reshape(-1, 1), cols[3],
            cols[4].reshape(-1, 1))

Next, we'll create the ε-greedy policy for the agent, that decreases ε from 1.0 to 0.05 in 50,000 steps:

In [8]:
eps_min = 0.05
eps_max = 1.0
eps_decay_steps = 50000

def epsilon_greedy(q_values, step):
    epsilon = max(eps_min, eps_max - (eps_max-eps_min) * step/eps_decay_steps)
    if rnd.rand() < epsilon:
        return rnd.randint(n_outputs) # random action
    else:
        return np.argmax(q_values) # optimal action
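To see how the schedule behaves, here is a standalone sketch that evaluates the same formula at a few training steps. The `epsilon_at()` helper is introduced here for illustration; it repeats the expression used inside `epsilon_greedy()`.

```python
eps_min, eps_max, eps_decay_steps = 0.05, 1.0, 50000

def epsilon_at(step):
    """Linear decay from eps_max to eps_min, then held constant at eps_min."""
    return max(eps_min, eps_max - (eps_max - eps_min) * step / eps_decay_steps)

for step in (0, 25000, 50000, 100000):
    print(step, round(epsilon_at(step), 3))
```

The agent acts fully at random at step 0, about half the time at step 25,000, and 5% of the time from step 50,000 onward.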

Now to start training! First the variables:

In [9]:
n_steps = 10000 
training_start = 1000
training_interval = 3
save_steps = 50
copy_steps = 25
discount_rate = 0.95
skip_start = 90
batch_size = 50
iteration = 0
checkpoint_path = "./my_dqn.ckpt"
done = True

Now for the main training loop:

In [None]:
import os
import numpy.random as rnd

with tf.Session() as sess:
    if os.path.isfile(checkpoint_path):
        saver.restore(sess, checkpoint_path)
    else:
        init.run()
    while True:
        step = global_step.eval()
        if step >= n_steps:
            break
        iteration += 1
        if done: # game over, restart
            obs = env.reset()
            for skip in range(skip_start): # skip start
                obs, reward, done, info = env.step(0)
            state = preprocess_observation(obs)
        
        # actor evaluate
        q_values = actor_q_values.eval(feed_dict={X_state: [state]})
        action = epsilon_greedy(q_values, step)
        
        # actor play
        obs, reward, done, info = env.step(action)
        env.render() # For fun!
        next_state = preprocess_observation(obs)
        
        # save action to memory
        replay_memory.append((state, action,reward, next_state, 1.0 - done))
        state = next_state
        
        if iteration < training_start or iteration % training_interval != 0:
            continue
            
        # Critic learns
        X_state_val, X_action_val, rewards, X_next_state_val, continues = (
            sample_memories(batch_size))
        next_q_values = actor_q_values.eval(
            feed_dict={X_state: X_next_state_val})
        max_next_q_values = np.max(next_q_values, axis=1, keepdims=True)
        y_val = rewards + continues * discount_rate * max_next_q_values
        training_op.run(feed_dict={X_state: X_state_val,
                                   X_action: X_action_val, y: y_val})
        
        # Regularly copy critic to actor
        if step % copy_steps == 0:
            copy_critic_to_actor.run()
            
        # And save regularly!
        if step % save_steps == 0:
            saver.save(sess, checkpoint_path)