# Human-level control through deep reinforcement learning


* Deep Learning Seminar: Code Review [1, 2, 8]
* 김무성

# Contents
* Paper Review
* Code Review

# Paper Review

#### References
* [3] Playing Atari With Deep Reinforcement Learning (NIPS 2013) paper review - http://sanghyukchun.github.io/90/
* [4] RL slide - https://computing.ece.vt.edu/~f15ece6504/slides/L26_RL.pdf
* [12] Deep Reinforcement Learning - ICLR 2015 tutorial - http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf
* [11] Deep Q-Learning - http://www.slideshare.net/nikolaypavlov/deep-qlearning

<font color="red">Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning.</font>

<img src="http://sanghyukchun.github.io/images/post/90-6.png" width=600 />

<img src="http://image.slidesharecdn.com/rl-presentation-160522151115/95/deep-qlearning-9-1024.jpg?cb=1463930058" width=600 />

<img src="http://image.slidesharecdn.com/rl-presentation-160522151115/95/deep-qlearning-10-1024.jpg?cb=1463930058" width=600 />

#### action-value function

We use a deep convolutional neural network to approximate the optimal action-value function

<img src="figures/eqn.1.png" width=600 />
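For reference when the figure is unavailable, the optimal action-value function as stated in the Nature paper [2] is transcribed here. The transcription comes from the published text, not from the image file:

$$
Q^*(s,a) = \max_{\pi} \mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots \mid s_t = s,\ a_t = a,\ \pi \right]
$$

This is the maximum expected sum of rewards $r_t$, discounted by $\gamma$ at each time-step, achievable by a behaviour policy $\pi = P(a \mid s)$ after observing $s$ and taking action $a$.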

#### experience replay

To perform experience replay we store the agent’s experiences $e_t = (s_t, a_t, r_t, s_{t+1})$ at each time-step $t$ in a data set $D_t = \{e_1, \ldots, e_t\}$. During learning, we apply Q-learning updates on samples (or minibatches) of experience $(s, a, r, s') \sim U(D)$, drawn uniformly at random from the pool of stored samples.
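
The storage and uniform sampling described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's or the reviewed repository's implementation: the class name `UniformReplayBuffer`, the `Transition` tuple, and the extra `done` field are assumptions made for the example.

```python
import random
from collections import deque, namedtuple

# One stored experience e_t = (s_t, a_t, r_t, s_{t+1}). The `done` flag is an
# illustrative addition and is not part of the tuple defined in the text.
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"])


class UniformReplayBuffer:
    """Fixed-capacity pool D; the oldest experiences are dropped when full."""

    def __init__(self, capacity, seed=None):
        self.pool = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state, done=False):
        self.pool.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # (s, a, r, s') ~ U(D): every stored experience is equally likely.
        return self.rng.sample(list(self.pool), batch_size)

    def __len__(self):
        return len(self.pool)


buffer = UniformReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.store(state=t, action=0, reward=1.0, next_state=t + 1)
batch = buffer.sample(batch_size=2)
```

With a capacity of 3, only the three most recent experiences survive the five `store` calls, so every sampled transition comes from time-steps 2 to 4.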

#### loss function

<img src="figures/capn.1.png" width=600>

The Q-learning update at iteration i uses the following loss function:

<img src="figures/eqn.2.png" width=600 />
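As a rough NumPy sketch of this loss on one minibatch: the target is $r + \gamma \max_{a'} Q(s',a';\theta_i^-)$, computed from a separate target network, and the loss is the mean squared difference to $Q(s,a;\theta_i)$. The function name `dqn_loss` and the array layout are assumptions for illustration, and the handling of terminal transitions follows the paper's Algorithm 1 rather than the loss equation itself.

```python
import numpy as np


def dqn_loss(q_pred, q_target_next, actions, rewards, terminals, gamma=0.99):
    """Mean squared TD error over a minibatch.

    q_pred:        Q(s, .; theta_i), shape (batch, n_actions)
    q_target_next: Q(s', .; theta_i^-) from the target network, same shape
    """
    # y = r for terminal transitions, else r + gamma * max_a' Q(s', a'; theta_i^-)
    max_next_q = q_target_next.max(axis=1)
    targets = rewards + gamma * (1.0 - terminals) * max_next_q
    # Q(s, a; theta_i) for the action that was actually taken
    q_taken = q_pred[np.arange(len(actions)), actions]
    return np.mean((targets - q_taken) ** 2)


q_pred = np.array([[1.0, 2.0], [0.5, 0.0]])
q_target_next = np.array([[0.0, 3.0], [4.0, 1.0]])
actions = np.array([1, 0])
rewards = np.array([1.0, -1.0])
terminals = np.array([0.0, 1.0])
loss = dqn_loss(q_pred, q_target_next, actions, rewards, terminals, gamma=0.5)
```

Here the targets are 2.5 and -1.0, the taken Q-values are 2.0 and 0.5, and the loss is therefore (0.25 + 2.25) / 2 = 1.25.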

<img src="figures/capn.2.png" width=600>

<img src="figures/capn.10.png" width=1000 />
<img src="figures/capn.11.png" width=600 />
<img src="figures/capn.12.png" width=600 />
<img src="figures/capn.13.png" width=1000 />

<img src="figures/capn.3.png" width=600>

<img src="figures/capn.4.png" width=600>

# METHODS

#### References
* [10] Distributed Deep Q-Learning - http://www.slideshare.net/onghaoyi/distributed-deep-qlearning

#### Preprocessing

<img src="http://image.slidesharecdn.com/dist-deep-qlearn-slides-151023232840-lva1-app6891/95/distributed-deep-qlearning-13-1024.jpg?cb=1445643171" width=600 />

#### Code availability

##### References
* [5] songrotek's code - https://github.com/songrotek/DQN-Atari-Tensorflow
* [6] asrivat1's code - https://github.com/asrivat1/DeepLearningVideoGames
* [7] gliese581gg's code - https://github.com/gliese581gg/DQN_tensorflow
* [8] devsisters' code - https://github.com/devsisters/DQN-tensorflow/

The source code can be accessed at https://sites.google.com/a/deepmind.com/dqn for non-commercial uses only.

#### Model architecture

<img src="figures/capn.1.png" width=600>

<img src="http://image.slidesharecdn.com/dist-deep-qlearn-slides-151023232840-lva1-app6891/95/distributed-deep-qlearning-14-638.jpg?cb=1445643171" width=600 />

<img src="http://sanghyukchun.github.io/images/post/90-5.png" width=600 />

#### Training details

<img src="figures/capn.9.png" width=600>

#### Evaluation procedure

#### Algorithm

<img src="figures/eqn.3.png" width=600 />

This leads to a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration $i$,

<img src="figures/eqn.4.png" width=600 />

Differentiating the loss function with respect to the weights we arrive at the following gradient:

<img src="figures/eqn.5.png" width=600 />
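The differentiation step can be written out for a single sample. Writing $y = r + \gamma \max_{a'} Q(s',a';\theta_i^-)$ for the target, and noting that $y$ depends only on the frozen parameters $\theta_i^-$ and is therefore treated as a constant:

$$
\nabla_{\theta_i} \left( y - Q(s,a;\theta_i) \right)^2 = -2 \left( y - Q(s,a;\theta_i) \right) \nabla_{\theta_i} Q(s,a;\theta_i)
$$

Taking the expectation over $(s,a,r,s')$ gives, up to the constant factor and sign, the gradient form stated in the paper [2]:

$$
\nabla_{\theta_i} L(\theta_i) = \mathbb{E}_{s,a,r,s'} \left[ \left( r + \gamma \max_{a'} Q(s',a';\theta_i^-) - Q(s,a;\theta_i) \right) \nabla_{\theta_i} Q(s,a;\theta_i) \right]
$$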

#### Training algorithm for deep Q-networks

<img src="figures/eqn.6.png" width=600 />
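The overall loop (ε-greedy acting, storing transitions, sampling minibatches, and periodically copying the online weights into the target network) can be sketched on a toy problem. Everything below is an assumption for illustration: the five-state chain environment, the hyperparameters, and above all the replacement of the convolutional network by a plain table, which keeps the example dependency-light but means it is tabular Q-learning with replay and a target copy, not a deep network.

```python
import random
import numpy as np

N_STATES, N_ACTIONS = 5, 2          # toy chain: action 1 moves right, 0 moves left
GAMMA, LR = 0.9, 0.1
EPS_START, EPS_END, ANNEAL_STEPS = 1.0, 0.1, 1000
SYNC_EVERY, BATCH_SIZE, CAPACITY = 20, 8, 200


def env_step(state, action):
    """Hypothetical environment: reward 1 for reaching the right end."""
    if action == 1:
        next_state = min(state + 1, N_STATES - 1)
    else:
        next_state = max(state - 1, 0)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(n_steps=2000, seed=0):
    rng = random.Random(seed)
    theta = np.zeros((N_STATES, N_ACTIONS))   # online Q (a table in this sketch)
    theta_target = theta.copy()               # target Q-hat with frozen values
    replay = []                               # replay memory D
    state = 0
    for step in range(n_steps):
        # Linearly anneal epsilon, then act epsilon-greedily.
        epsilon = max(EPS_END,
                      EPS_START - (EPS_START - EPS_END) * step / ANNEAL_STEPS)
        if rng.random() < epsilon:
            action = rng.randrange(N_ACTIONS)
        else:
            action = int(np.argmax(theta[state]))
        next_state, reward, done = env_step(state, action)
        # Store the transition, dropping the oldest one when the memory is full.
        replay.append((state, action, reward, next_state, done))
        if len(replay) > CAPACITY:
            replay.pop(0)
        state = 0 if done else next_state
        # Sample a minibatch uniformly and step each entry toward its target y.
        for s, a, r, s2, d in rng.sample(replay, min(BATCH_SIZE, len(replay))):
            y = r if d else r + GAMMA * theta_target[s2].max()
            theta[s, a] += LR * (y - theta[s, a])
        # Every SYNC_EVERY steps, reset Q-hat = Q.
        if (step + 1) % SYNC_EVERY == 0:
            theta_target = theta.copy()
    return theta, theta_target


theta, theta_target = train()
```

Because rewards lie in [0, 1] and each update is a convex combination of the old value and the target, all learned values stay within [0, 1] in this toy setting.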

<img src="figures/capn.5.png" width=600>

<img src="figures/capn.6.png" width=600>

<img src="figures/capn.7.png" width=600>

<img src="figures/capn.8.png" width=600>

<img src="figures/capn.14.png" width=600 />

<img src="figures/capn.15.png" width=600 />

# Code Review

We review the following code:
* [8] devsisters' code - https://github.com/devsisters/DQN-tensorflow/

In [1]:
!ls ../

LICENSE				assets	     config.pyc  main.py
README.md			checkpoints  dqn	 paper
Render_OpenAI_gym_as_GIF.ipynb	config.py    logs


## To train a model for Breakout:

In [None]:
python main.py --env_name=Breakout-v0 --is_train=True --display=True

## main.py

main.py<br>
 -- dqn/agent.py<br>
 -- dqn/environment.py<br> 
 -- config.py

In [None]:
import random
import tensorflow as tf

from dqn.agent import Agent
from dqn.environment import GymEnvironment, SimpleGymEnvironment
from config import get_config

flags = tf.app.flags

# Model
flags.DEFINE_string('model', 'm1', 'Type of model')
flags.DEFINE_boolean('dueling', False, 'Whether to use dueling deep q-network')
flags.DEFINE_boolean('double_q', False, 'Whether to use double q-learning')

# Environment
flags.DEFINE_string('env_name', 'Breakout-v0', 'The name of gym environment to use')
flags.DEFINE_integer('action_repeat', 4, 'The number of action to be repeated')

# Etc
flags.DEFINE_boolean('use_gpu', True, 'Whether to use gpu or not')
flags.DEFINE_string('gpu_fraction', '1/1', 'idx / # of gpu fraction e.g. 1/3, 2/3, 3/3')
flags.DEFINE_boolean('display', False, 'Whether to do display the game screen or not')
flags.DEFINE_boolean('is_train', True, 'Whether to do training or testing')
flags.DEFINE_integer('random_seed', 123, 'Value of random seed')

FLAGS = flags.FLAGS

# Set random seed
tf.set_random_seed(FLAGS.random_seed)
random.seed(FLAGS.random_seed)

if FLAGS.gpu_fraction == '':
  raise ValueError("--gpu_fraction should be defined")

def calc_gpu_fraction(fraction_string):
  idx, num = fraction_string.split('/')
  idx, num = float(idx), float(num)

  fraction = 1 / (num - idx + 1)
  print " [*] GPU : %.4f" % fraction
  return fraction

def main(_):
  gpu_options = tf.GPUOptions(
      per_process_gpu_memory_fraction=calc_gpu_fraction(FLAGS.gpu_fraction))

  with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
    config = get_config(FLAGS) or FLAGS

    if config.env_type == 'simple':
      env = SimpleGymEnvironment(config)
    else:
      env = GymEnvironment(config)

    if not FLAGS.use_gpu:
      config.cnn_format = 'NHWC'

    agent = Agent(config, env, sess)

    if FLAGS.is_train:
      agent.train()
    else:
      agent.play()

if __name__ == '__main__':
  tf.app.run()
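
To make the `--gpu_fraction` arithmetic concrete, the helper can be restated in Python 3 (without the `print`) and evaluated on a few inputs. The formula is copied from `calc_gpu_fraction` above; the choice of sample inputs is an assumption, and the source does not explain why the fraction is defined this way.

```python
def calc_gpu_fraction(fraction_string):
    """Python 3 restatement of the helper above: 1 / (num - idx + 1)."""
    idx, num = (float(part) for part in fraction_string.split("/"))
    return 1 / (num - idx + 1)


fractions = {s: calc_gpu_fraction(s) for s in ("1/1", "1/3", "2/3", "3/3")}
```

The default `'1/1'` yields 1.0, while `'1/3'`, `'2/3'`, and `'3/3'` yield one third, one half, and 1.0. In other words, `'2/3'` does not request two thirds of the GPU memory.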


### config.py

In [None]:
class AgentConfig(object):
  scale = 10000
  display = False

  max_step = 5000 * scale
  memory_size = 100 * scale

  ...


class EnvironmentConfig(object):
  ...


class DQNConfig(AgentConfig, EnvironmentConfig):
  model = ''
  pass


class M1(DQNConfig):
  ...


def get_config(FLAGS):
  if FLAGS.model == 'm1':
    config = M1
  elif FLAGS.model == 'm2':
    config = M2

  for k, v in FLAGS.__dict__['__flags'].items():
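    # NOTE: main.py above defines the flag as 'use_gpu', not 'gpu', so this
    # branch never matches those flags; main.py sets cnn_format = 'NHWC'
    # itself when --use_gpu is False.
    if k == 'gpu':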
      if v == False:
        config.cnn_format = 'NHWC'
      else:
        config.cnn_format = 'NCHW'

    if hasattr(config, k):
      setattr(config, k, v)

  return config
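
The override pattern in `get_config` (copy a flag onto the config only if the config already has an attribute of that name) can be tried in isolation. In this sketch `DemoConfig` is a hypothetical stand-in for `M1`, whose body is elided above, and a plain dict stands in for the TensorFlow flags object.

```python
class AgentConfig(object):
    scale = 10000
    max_step = 5000 * scale      # 50,000,000 steps
    memory_size = 100 * scale    # 1,000,000 transitions


class DemoConfig(AgentConfig):   # hypothetical stand-in for M1
    env_name = "Breakout-v0"
    cnn_format = "NCHW"


def apply_overrides(config, overrides):
    """Copy every override whose key already exists on the config class."""
    for key, value in overrides.items():
        if hasattr(config, key):
            setattr(config, key, value)
    return config


config = apply_overrides(DemoConfig, {"env_name": "Pong-v0", "unknown_flag": 1})
```

Note that, as in `get_config`, the returned object is the class itself rather than an instance, so the overrides mutate class attributes and are visible everywhere the class is used.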


### dqn/agent.py

agent.py <br>
-- base.py <br>
-- history.py <br>
-- ops.py <br>
-- replay_memory.py <br>
-- utils.py 

In [None]:
import numpy as np
import tensorflow as tf
from tqdm import tqdm

from .base import BaseModel
from .history import History
from .ops import linear, conv2d
from .replay_memory import ReplayMemory
from utils import get_time, save_pkl, load_pkl

class Agent(BaseModel):
  def __init__(self, config, environment, sess):
    super(Agent, self).__init__(config)
    self.sess = sess
    self.weight_dir = 'weights'

    self.env = environment
    self.history = History(self.config)
    self.memory = ReplayMemory(self.config, self.model_dir)

    ...

    self.build_dqn()

  def train(self):
    
    ...

    num_game, self.update_count, ep_reward = 0, 0, 0.
    total_reward, self.total_loss, self.total_q = 0., 0., 0.
    max_avg_ep_reward = 0
    ep_rewards, actions = [], []

    screen, reward, action, terminal = self.env.new_random_game()

    for _ in range(self.history_length):
      self.history.add(screen)

    for self.step in tqdm(range(start_step, self.max_step), ncols=70, initial=start_step):
      
      ...
        
      # 1. predict
      action = self.predict(self.history.get())
      # 2. act
      screen, reward, terminal = self.env.act(action, is_training=True)
      # 3. observe
      self.observe(screen, reward, action, terminal)

      if terminal:
        screen, reward, action, terminal = self.env.new_random_game()

        num_game += 1
        ep_rewards.append(ep_reward)
        ep_reward = 0.
      else:
        ep_reward += reward

      actions.append(action)
      total_reward += reward

      if self.step >= self.learn_start:
        if self.step % self.test_step == self.test_step - 1:
          avg_reward = total_reward / self.test_step
          avg_loss = self.total_loss / self.update_count
          avg_q = self.total_q / self.update_count

          try:
            max_ep_reward = np.max(ep_rewards)
            min_ep_reward = np.min(ep_rewards)
            avg_ep_reward = np.mean(ep_rewards)
          except:
            max_ep_reward, min_ep_reward, avg_ep_reward = 0, 0, 0

          ...
        
        
          num_game = 0
          total_reward = 0.
          self.total_loss = 0.
          self.total_q = 0.
          self.update_count = 0
          ep_reward = 0.
          ep_rewards = []
          actions = []

  def predict(self, s_t, test_ep=None):
    ...

    return action

  def observe(self, screen, reward, action, terminal):
    ...

  def q_learning_mini_batch(self):
    ...
    
  def build_dqn(self):
    self.w = {}
    self.t_w = {}

    #initializer = tf.contrib.layers.xavier_initializer()
    initializer = tf.truncated_normal_initializer(0, 0.02)
    activation_fn = tf.nn.relu

    # training network
    with tf.variable_scope('prediction'):
       ...
    
    # target network
    with tf.variable_scope('target'):
       ...
    
    with tf.variable_scope('pred_to_target'):
       ...
    
    
    # optimizer
    with tf.variable_scope('optimizer'):
      self.optim = tf.train.RMSPropOptimizer(
          self.learning_rate_op, momentum=0.95, epsilon=0.01).minimize(self.loss)

    with tf.variable_scope('summary'):
      scalar_summary_tags = ['average.reward', 'average.loss', 'average.q', \
          'episode.max reward', 'episode.min reward', 'episode.avg reward', 'episode.num of game', 'training.learning_rate']

      ...
        
      histogram_summary_tags = ['episode.rewards', 'episode.actions']

      ...

      self.writer = tf.train.SummaryWriter('./logs/%s' % self.model_dir, self.sess.graph)

    tf.initialize_all_variables().run()

    self._saver = tf.train.Saver(self.w.values() + [self.step_op], max_to_keep=30)

    self.load_model()
    self.update_target_q_network()

    
 
  def play(self, n_step=10000, n_episode=100, test_ep=None, render=False):
    if test_ep == None:
      test_ep = self.ep_end

    test_history = History(self.config)

    if not self.display:
      gym_dir = '/tmp/%s-%s' % (self.env_name, get_time())
      self.env.env.monitor.start(gym_dir)

    best_reward, best_idx = 0, 0
    for idx in xrange(n_episode):
      screen, reward, action, terminal = self.env.new_random_game()
      current_reward = 0

      for _ in range(self.history_length):
        test_history.add(screen)

      for t in tqdm(range(n_step), ncols=70):
        # 1. predict
        action = self.predict(test_history.get(), test_ep)
        # 2. act
        screen, reward, terminal = self.env.act(action, is_training=False)
        # 3. observe
        test_history.add(screen)

        current_reward += reward
        if terminal:
          break

      if current_reward > best_reward:
        best_reward = current_reward
        best_idx = idx

      print "="*30
      print " [%d] Best reward : %d" % (best_idx, best_reward)
      print "="*30

    if not self.display:
      self.env.env.monitor.close()
      #gym.upload(gym_dir, writeup='https://github.com/devsisters/DQN-tensorflow', api_key='')
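
The body of `predict` is elided above, so the following is only a generic sketch of ε-greedy action selection, the rule that a `test_ep` argument would plausibly parameterize. The function name `epsilon_greedy` and its signature are assumptions and are not taken from the repository.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Return a random action with probability epsilon, else the argmax action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(123)
q_values = [0.1, 0.9, 0.3]
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
random_actions = [epsilon_greedy(q_values, epsilon=1.0, rng=rng)
                  for _ in range(200)]
```

With `epsilon=0.0` the highest-valued action (index 1) is always chosen; with `epsilon=1.0` every action is drawn uniformly at random. Keeping a small non-zero ε during evaluation, as `play` does by defaulting `test_ep` to `self.ep_end`, keeps some randomness in the agent's test-time behaviour.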
        

#### base.py 

In [None]:
class BaseModel(object):
  """Abstract object representing an Reader model."""
  ...

  def save_model(self, step=None):
    ...
    
  def load_model(self):
    ...
    
  @property
  def checkpoint_dir(self):
    return os.path.join('checkpoints', self.model_dir)

  @property
  def model_dir(self):
    ...
    return model_dir + '/'

  @property
  def saver(self):
    ...
    return self._saver


#### history.py 

In [None]:
import numpy as np

class History:
  def __init__(self, config):
    self.cnn_format = config.cnn_format

    batch_size, history_length, screen_height, screen_width = \
        config.batch_size, config.history_length, config.screen_height, config.screen_width

    self.history = np.zeros(
        [history_length, screen_height, screen_width], dtype=np.float32)

  def add(self, screen):
    self.history[:-1] = self.history[1:]
    self.history[-1] = screen

  def reset(self):
    self.history *= 0

  def get(self):
    if self.cnn_format == 'NHWC':
      return np.transpose(self.history, (1, 2, 0))
    else:
      return self.history
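
The rolling behaviour of `History.add` is easy to check on tiny arrays. The sketch below reuses the same two assignment statements on a standalone array; the 2x2 "screens" and the frame values are assumptions chosen to make the result readable.

```python
import numpy as np

# history_length = 4, with tiny 2x2 "screens" instead of real frames
history = np.zeros([4, 2, 2], dtype=np.float32)


def add(history, screen):
    history[:-1] = history[1:]   # shift older frames toward index 0
    history[-1] = screen         # the newest frame goes last


for frame_id in range(1, 7):
    add(history, np.full((2, 2), frame_id, dtype=np.float32))

frame_ids = history[:, 0, 0].tolist()
nhwc = np.transpose(history, (1, 2, 0))   # same layout change as History.get()
```

After six frames have been added, only the last four (3, 4, 5, 6) remain, ordered oldest to newest, and the transposed view has the frame axis last, as required for `'NHWC'`.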


#### ops.py 

In [None]:
import tensorflow as tf

def conv2d(x,
           output_dim,
           kernel_size,
           stride,
           initializer=tf.contrib.layers.xavier_initializer(),
           activation_fn=tf.nn.relu,
           data_format='NHWC',
           padding='VALID',
           name='conv2d'):
  with tf.variable_scope(name):
    if data_format == 'NCHW':
      stride = [1, 1, stride[0], stride[1]]
      kernel_shape = [kernel_size[0], kernel_size[1], x.get_shape()[1], output_dim]
    elif data_format == 'NHWC':
      stride = [1, stride[0], stride[1], 1]
      kernel_shape = [kernel_size[0], kernel_size[1], x.get_shape()[-1], output_dim]

    w = tf.get_variable('w', kernel_shape, tf.float32, initializer=initializer)
    conv = tf.nn.conv2d(x, w, stride, padding, data_format=data_format)

    b = tf.get_variable('biases', [output_dim], initializer=tf.constant_initializer(0.0))
    out = tf.nn.bias_add(conv, b, data_format)

  if activation_fn != None:
    out = activation_fn(out)

  return out, w, b

def linear(input_, output_size, stddev=0.02, bias_start=0.0, activation_fn=None, name='linear'):
  shape = input_.get_shape().as_list()

  with tf.variable_scope(name):
    w = tf.get_variable('Matrix', [shape[1], output_size], tf.float32,
        tf.random_normal_initializer(stddev=stddev))
    b = tf.get_variable('bias', [output_size],
        initializer=tf.constant_initializer(bias_start))

    out = tf.nn.bias_add(tf.matmul(input_, w), b)

    if activation_fn != None:
      return activation_fn(out), w, b
    else:
      return out, w, b
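
Since `conv2d` defaults to `padding='VALID'`, the spatial size shrinks at every layer. The sketch below applies the standard output-size formula to the three convolutional layers described in the Nature paper [2] (8x8 stride 4, 4x4 stride 2, 3x3 stride 1, with 64 filters in the last layer) on an 84x84 input. Whether the reviewed code uses exactly these layer settings is not shown in the excerpt, so treat them as an assumption.

```python
def valid_conv_output(size, kernel, stride):
    """Spatial output size of a convolution with 'VALID' padding."""
    return (size - kernel) // stride + 1


size = 84
sizes = []
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = valid_conv_output(size, kernel, stride)
    sizes.append(size)

# number of inputs to the first fully connected layer (64 filters in conv3)
flattened = sizes[-1] * sizes[-1] * 64
```

The feature maps go from 84 to 20, 9, and finally 7 pixels per side, so the flattened input to the first `linear` layer has 7 * 7 * 64 = 3136 units.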


#### replay_memory.py 

In [None]:
"""Code from https://github.com/tambetm/simple_dqn/blob/master/src/replay_memory.py"""

import random

import numpy as np


class ReplayMemory:
  def __init__(self, config, model_dir):
    self.model_dir = model_dir

    self.cnn_format = config.cnn_format
    self.memory_size = config.memory_size
    self.actions = np.empty(self.memory_size, dtype = np.uint8)
    self.rewards = np.empty(self.memory_size, dtype = np.integer)
    self.screens = np.empty((self.memory_size, config.screen_height, config.screen_width), dtype = np.float16)
    self.terminals = np.empty(self.memory_size, dtype = np.bool)
    self.history_length = config.history_length
    self.dims = (config.screen_height, config.screen_width)
    self.batch_size = config.batch_size
    self.count = 0
    self.current = 0

    # pre-allocate prestates and poststates for minibatch
    self.prestates = np.empty((self.batch_size, self.history_length) + self.dims, dtype = np.float16)
    self.poststates = np.empty((self.batch_size, self.history_length) + self.dims, dtype = np.float16)

  def add(self, screen, reward, action, terminal):
    assert screen.shape == self.dims
    # NB! screen is post-state, after action and reward
    self.actions[self.current] = action
    self.rewards[self.current] = reward
    self.screens[self.current, ...] = screen
    self.terminals[self.current] = terminal
    self.count = max(self.count, self.current + 1)
    self.current = (self.current + 1) % self.memory_size

  def getState(self, index):
    assert self.count > 0, "replay memory is empty, use at least --random_steps 1"
    # normalize index to expected range, allows negative indexes
    index = index % self.count
    # if is not in the beginning of matrix
    if index >= self.history_length - 1:
      # use faster slicing
      return self.screens[(index - (self.history_length - 1)):(index + 1), ...]
    else:
      # otherwise normalize indexes and use slower list based access
      indexes = [(index - i) % self.count for i in reversed(range(self.history_length))]
      return self.screens[indexes, ...]

  def sample(self):
    # memory must include poststate, prestate and history
    assert self.count > self.history_length
    # sample random indexes
    indexes = []
    while len(indexes) < self.batch_size:
      # find random index 
      while True:
        # sample one index (ignore states wrapping over)
        index = random.randint(self.history_length, self.count - 1)
        # if wraps over current pointer, then get new one
        if index >= self.current and index - self.history_length < self.current:
          continue
        # if wraps over episode end, then get new one
        # NB! poststate (last screen) can be terminal state!
        if self.terminals[(index - self.history_length):index].any():
          continue
        # otherwise use this index
        break
      
      # NB! having index first is fastest in C-order matrices
      self.prestates[len(indexes), ...] = self.getState(index - 1)
      self.poststates[len(indexes), ...] = self.getState(index)
      indexes.append(index)

    actions = self.actions[indexes]
    rewards = self.rewards[indexes]
    terminals = self.terminals[indexes]

    if self.cnn_format == 'NHWC':
      return np.transpose(self.prestates, (0, 2, 3, 1)), actions, \
        rewards, np.transpose(self.poststates, (0, 2, 3, 1)), terminals
    else:
      return self.prestates, actions, rewards, self.poststates, terminals

  def save(self):
    ...
    
  def load(self):
    ...
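
The two rejection tests inside `sample` decide which indexes may be drawn. They can be isolated into a small predicate and run on a hand-made buffer. The helper name `is_valid_index` and the example buffer (12 slots, an episode end at slot 6, write pointer at 10) are assumptions for illustration.

```python
import numpy as np


def is_valid_index(index, current, history_length, terminals):
    """Mirror the two rejection tests in `sample` above."""
    # The frames from index - history_length up to index must not straddle
    # the write pointer, where old and new data meet.
    if index >= current and index - history_length < current:
        return False
    # The frames before `index` must not contain an episode end; the sampled
    # index itself (the post-state) is allowed to be terminal.
    if terminals[(index - history_length):index].any():
        return False
    return True


terminals = np.zeros(12, dtype=bool)
terminals[6] = True      # an episode ended at slot 6
current = 10             # next slot to be overwritten
valid = [i for i in range(4, 12)
         if is_valid_index(i, current, 4, terminals)]
```

Indexes 7 to 10 are rejected because their history would span the episode end at slot 6, and 10 and 11 are rejected because they straddle the write pointer, which leaves 4, 5, and 6 as valid samples.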

#### utils.py

In [None]:
def timeit(f):
  ...

def get_time():
  ...

@timeit
def save_pkl(obj, path):
  ...

@timeit
def load_pkl(path):
  ...

@timeit
def save_npy(obj, path):
  ...

@timeit
def load_npy(path):
  ...


### dqn/environment.py

In [None]:
import cv2
import gym
import random
import numpy as np

class Environment(object):
  def __init__(self, config):
    ...
    
  def new_game(self, from_random_game=False):
    ...
    return self.screen, 0, 0, self.terminal

  def new_random_game(self):
    ...
    return self.screen, 0, 0, self.terminal

  def _step(self, action):
    self._screen, self.reward, self.terminal, _ = self.env.step(action)

  def _random_step(self):
    action = self.env.action_space.sample()
    self._step(action)

  @property
  def screen(self):
    return cv2.resize(cv2.cvtColor(self._screen, cv2.COLOR_RGB2GRAY)/255., self.dims)
    #return cv2.resize(cv2.cvtColor(self._screen, cv2.COLOR_BGR2YCR_CB)/255., self.dims)[:,:,0]

  @property
  def action_size(self):
    return self.env.action_space.n

  @property
  def lives(self):
    return self.env.ale.lives()

  @property
  def state(self):
    return self.screen, self.reward, self.terminal

  def render(self):
    if self.display:
      self.env.render()

  def after_act(self, action):
    self.render()

class GymEnvironment(Environment):
  def __init__(self, config):
    super(GymEnvironment, self).__init__(config)

  def act(self, action, is_training=True):
    ...

class SimpleGymEnvironment(Environment):
  def __init__(self, config):
    super(SimpleGymEnvironment, self).__init__(config)

  def act(self, action, is_training=True):
    ...
    return self.state
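
The `screen` property converts an RGB frame to grayscale, scales it to [0, 1], and resizes it. A dependency-free approximation of that pipeline is sketched below. The luminance weights are the ones OpenCV documents for `COLOR_RGB2GRAY`; the nearest-neighbour `nearest_resize` is only a stand-in for `cv2.resize` (which interpolates bilinearly by default), and the 84x84 target size is an assumption because `self.dims` is not shown in the excerpt.

```python
import numpy as np


def to_gray(rgb):
    """Luminance of an RGB frame, scaled to [0, 1]."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb.astype(np.float64).dot(weights) / 255.0


def nearest_resize(image, out_h, out_w):
    """Nearest-neighbour resize; a simple stand-in for cv2.resize."""
    rows = np.arange(out_h) * image.shape[0] // out_h
    cols = np.arange(out_w) * image.shape[1] // out_w
    return image[rows][:, cols]


# A random frame with the Atari screen shape (210 x 160 RGB).
frame = np.random.RandomState(0).randint(0, 256, size=(210, 160, 3)).astype(np.uint8)
screen = nearest_resize(to_gray(frame), 84, 84)
```

The paper's preprocessing additionally takes a pixel-wise maximum over consecutive frames to remove flicker, which this sketch does not cover.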

# References
* [1] Playing Atari With Deep Reinforcement Learning - http://arxiv.org/abs/1312.5602
* [2] Human-level control through deep reinforcement learning - http://home.uchicago.edu/~arij/journalclub/papers/2015_Mnih_et_al.pdf
* [3] Playing Atari With Deep Reinforcement Learning (NIPS 2013) paper review - http://sanghyukchun.github.io/90/
* [4] RL slide - https://computing.ece.vt.edu/~f15ece6504/slides/L26_RL.pdf
* [5] songrotek's code - https://github.com/songrotek/DQN-Atari-Tensorflow
* [6] asrivat1's code - https://github.com/asrivat1/DeepLearningVideoGames
* [7] gliese581gg's code - https://github.com/gliese581gg/DQN_tensorflow
* [8] devsisters' code - https://github.com/devsisters/DQN-tensorflow/
* [9] Reinforcement Learning and OpenAI - http://www.modulabs.co.kr/RL_library/3237
* [10] Distributed Deep Q-Learning - http://www.slideshare.net/onghaoyi/distributed-deep-qlearning
* [11] Deep Q-Learning - http://www.slideshare.net/nikolaypavlov/deep-qlearning
* [12] Deep Reinforcement Learning - ICLR 2015 tutorial - http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf
* [13] Dueling Deep Q-Networks - http://torch.ch/blog/2016/04/30/dueling_dqn.html