# Reinforcement Learning Reference Collections

### Status labels
[.]:copied [-]:in progress [=]:written [%]:completed [@]:memorized

## Contents


1. [QLEARNING] : q-table-simple learning
    - [%][Q-learning tensorflow](#Q-learning tensorflow)  
2. [RL_PBA] : policy based learning  
    - [-][Policy Based Agent tensorflow](#Policy Based Agent tensorflow)  
3. [RL_MBA] : model based learning  
    - [-][Model Based Agent tensorflow](#Model Based Agent tensorflow)  
4. [DQN] : deep Q-learning network  
    - [%][Deep Q-Learning Network tensorflow](#Deep Q-Learning Network tensorflow)  
5. [DDQN] : deep Q-learning network + dueling, double  
    - [=][Double Dueling Deep Q-Learning Network tensorflow](#Double-Dualing Deep Q-Learning Network tensorflow)  
6. [A3C](#A3C) : asynchronous advantage actor-critic model [paper](https://arxiv.org/pdf/1602.01783.pdf)  
    - [-][Asynchronous Advantage Actor-Critic Model tensorflow on Breakout-v0](#Asynchronous Advantages Actor-Critic Model tensorflow)  
7. [Meta_RL](#Meta_RL) : meta reinforcement learning [paper1](https://arxiv.org/pdf/1611.05763.pdf) [paper2](https://arxiv.org/abs/1611.02779)  
    - [][]()


---

[][DRQN] : deep recurrent q-network  

Partially observable Markov Decision Process to maximize cumulative reward  
Decision problems,  
Imitation learning  
- behavioral cloning  
- inverse reinforcement learning  
DAgger (Dataset Aggregation)  
Q-learning  


Policy gradient learning : observation → action  

Q-learning : long term reward : state with reward → action  

Experience replay  
Freezing target network  
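
The two DQN ingredients noted above can be sketched without any deep-learning library. This is a minimal illustration only: `ReplayBuffer` and `sync_target` are hypothetical names, and the weights are a plain dict standing in for network parameters.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        # deque with maxlen drops the oldest transition automatically
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # uniform sampling breaks the correlation between consecutive steps
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def sync_target(main_weights, target_weights):
    """Hard copy main -> target; the target stays frozen between calls."""
    target_weights.clear()
    target_weights.update(main_weights)


buf = ReplayBuffer(capacity=3)
for step in range(5):
    buf.add((step, 0, 0.0, step + 1, False))

main_w = {"w": 0.5}      # stand-in for the online network's weights
target_w = {"w": 0.0}    # stand-in for the target network's weights
sync_target(main_w, target_w)
main_w["w"] = 0.9        # further training of main does not move the target
batch = buf.sample(2)
```

The buffer keeps only the three most recent transitions here, and the target weights change only when `sync_target` is called again.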

 

## Reference  


[Arthur Juliani](https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0#.u07v7laru)  

[Arthur Juliani github](https://github.com/awjuliani/DeepRL-Agents)

  Aside
  [Neural Stack Machine](https://iamtrask.github.io/2016/02/25/deepminds-neural-stack-machine/)

In [1]:
DATASET_DIR = './dataset/'
PROJECT_DIR = './projects/RL_collections/'
SUMMARY_DIR = PROJECT_DIR+'summaries/'
SAVER_DIR = PROJECT_DIR+'models/'
CHECKPOINT_DIR = PROJECT_DIR+'checkpoints/'
RESULT_DIR = PROJECT_DIR+'results/'

In [1]:
# NOTE: In the verbose version, there is example code with docstrings inside functions; these are not doctests.
# NOTE: For educational purposes, this code is not well organized.
# TODO: RL_PBA agent version NOT DONE, DQN: , DDQN: MEM error

<a id='Q-learning tensorflow'></a>
## Q-learning tensorflow

In [3]:
def QLEARNING():
    """ 20 mins """
    import time

    import numpy as np
    import tensorflow as tf
    import gym

    # HYPER PARAMS
    env = gym.make("FrozenLake-v0")
    learning_rate = 0.4

    input_size = env.observation_space.n # 16
    output_size = env.action_space.n # 4
    max_episode = 2000
    dis = 0.99

    # PARAMS
    X = tf.placeholder(tf.float32, [1, input_size]) # 1x16
    W = tf.Variable(tf.random_uniform([input_size, output_size], 0, 0.01)) # 16x4

    Qpred = tf.matmul(X, W) # 1x4
    Y = tf.placeholder(tf.float32, [1, output_size])

    loss = tf.reduce_mean(tf.square(Y - Qpred))
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(loss)

    rList = []

    def one_hot(x, size):
        # 1 x size row vector with a single 1 at index x
        return np.identity(size)[x:x+1]

    start = time.time()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for episode in range(max_episode):
            # e-greedy
            e = 1. / ((episode / 50.) + 10.)
            s = env.reset() # 1
            rewardAll = 0
            done = False
            step = 1
            while not done:
                Qs = sess.run(Qpred, feed_dict={X: one_hot(s, input_size)})
                if np.random.rand(1) < e:
                    action = env.action_space.sample()
                else:
                    action = np.argmax(Qs)

                sn, reward, done, info = env.step(action)

                if done:
                    Qs[0, action] = reward
                else:
                    Qsn = sess.run(Qpred, feed_dict={X: one_hot(sn, input_size)})
                    Qs[0, action] = reward + dis*np.max(Qsn)
                sess.run(optimizer, feed_dict={X: one_hot(s, input_size), Y: Qs})
                rewardAll += reward
                s = sn
                step += 1
            print("episode : {:d}, reward : {:.5f}".format(episode, rewardAll))
            rList.append(rewardAll)
        end = time.time()
        print("acc : " + str(sum(rList)/max_episode * 100) + "%")
        print("In time : {} s".format(end-start))
    return None

QLEARNING()

[2017-03-15 22:09:21,129] Making new env: FrozenLake-v0


episode : 0, reward : 0.00000
episode : 1, reward : 0.00000
episode : 2, reward : 0.00000
episode : 3, reward : 0.00000
episode : 4, reward : 0.00000
episode : 5, reward : 1.00000
episode : 6, reward : 0.00000
episode : 7, reward : 0.00000
episode : 8, reward : 1.00000
episode : 9, reward : 0.00000
episode : 10, reward : 0.00000
episode : 11, reward : 0.00000
episode : 12, reward : 0.00000
episode : 13, reward : 0.00000
episode : 14, reward : 1.00000
episode : 15, reward : 0.00000
episode : 16, reward : 0.00000
episode : 17, reward : 1.00000
episode : 18, reward : 0.00000
episode : 19, reward : 0.00000
episode : 20, reward : 0.00000
episode : 21, reward : 0.00000
episode : 22, reward : 0.00000
episode : 23, reward : 1.00000
episode : 24, reward : 0.00000
episode : 25, reward : 0.00000
episode : 26, reward : 1.00000
episode : 27, reward : 1.00000
episode : 28, reward : 0.00000
episode : 29, reward : 0.00000
episode : 30, reward : 0.00000
episode : 31, reward : 1.00000
episode : 32, rewa
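
The cell above trains a one-layer network, but the exploration schedule and the target it regresses to are the same as in tabular Q-learning. A minimal sketch of those two pieces, assuming a hypothetical 2-state, 2-action table (`choose_action` and `q_target` are illustrative names, not part of the cell):

```python
import random


def epsilon(episode):
    """Exploration rate from the cell above; it shrinks as episodes go by."""
    return 1. / ((episode / 50.) + 10.)


def choose_action(q_row, e, rng=random):
    """e-greedy: random action with probability e, otherwise the greedy one."""
    if rng.random() < e:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


def q_target(reward, q_next_row, done, dis=0.99):
    """Target for the chosen action: r at terminal steps, else r + dis * max Q(s', .)."""
    if done:
        return reward
    return reward + dis * max(q_next_row)


# hypothetical table, only to exercise the functions
Q = [[0.0, 0.0], [0.2, 0.5]]
lr = 0.4
s, a, r, sn, done = 0, 1, 0.0, 1, False

# tabular analogue of the gradient step: move Q(s, a) toward the target
Q[s][a] += lr * (q_target(r, Q[sn], done) - Q[s][a])
print(epsilon(0), epsilon(2000), Q[s][a])
```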

<a id='Policy Based Agent tensorflow'></a>
## Policy Based Agent

<a id='Model Based Agent tensorflow'></a>
## Model Based Agent

In [None]:
def RL_MBA():
    import math
    import matplotlib.pyplot as plt

    import numpy as np
    import tensorflow as tf	

    from tensorflow.python.framework import dtypes
    from tensorflow.python.framework import ops
    from tensorflow.python.ops import (
                array_ops,
                control_flow_ops,
                math_ops,
                nn_ops,
                rnn, rnn_cell,
                variable_scope
        )
    from tensorflow.contrib.layers import xavier_initializer

    import gym

    env = gym.make("CartPole-v0")

    # HYPER PARAMS	
    max_episode_num = 5000

    learning_rate = 1e-2
    dis = .99
    decay_rate = .99
    resume = False

    model_batch_size = 3
    real_batch_size = 3

    input_dim = 4
    n_hidden = 4	

    # POLICY NETWORK
    tf.reset_default_graph()
    class Policy_Network():
        def __init__(self, session, input_size, output_size, structure_dict=None, name="pol_main"):
            self.session = session
            self.input_size = input_size
            self.output_size = output_size
            self.name = name
            self._build_network()

        def _build_network(self, lr=1e-2):
            #TODO: structure dict contains layer structure
            with tf.variable_scope("policy_network"):
                self._obs = tf.placeholder(tf.float32, [None, self.input_size], name="input_x")
                # 1 if action 0 was taken, else 0
                self._y = tf.placeholder(tf.float32, [None, 1], name="input_y")

                w1 = tf.get_variable("w1", shape=[self.input_size, n_hidden], initializer=xavier_initializer())
                w2 = tf.get_variable("w2", shape=[n_hidden, 1], initializer=xavier_initializer())

                h1 = tf.nn.relu(tf.matmul(self._obs, w1))
                # probability of taking action 1
                self._pred = tf.nn.sigmoid(tf.matmul(h1, w2))

            # only the policy weights are trained by this optimizer
            tvars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="policy_network")

            self._advantages = tf.placeholder(tf.float32, name="reward_signal")
            # log-probability of the action that was actually taken
            self._loglik = tf.log(self._y*(self._y - self._pred) + (1-self._y)*(self._y + self._pred))
            self._loss = -tf.reduce_mean(self._loglik*self._advantages)
            self._optm = tf.train.AdamOptimizer(learning_rate=lr).minimize(self._loss, var_list=tvars)

        def predict(self, state):
            x = np.reshape(state, [1, self.input_size])
            return self.session.run(self._pred, feed_dict={self._obs: x})

        def update(self, x_stack, y_stack, adv_stack):
            return self.session.run([self._loss, self._optm],
                        feed_dict={self._obs: x_stack, self._y: y_stack,
                                   self._advantages: adv_stack})


    class Model_Network():
        def __init__(self, session, input_size, output_size,
                        sdict={'m_hidden':256}, name="model_main"):
            """
                input_size : [env.observation_space.shape, env.action_space.n]
                output_size : state size + 2 (next observation, reward, done)
            """
            self.session = session
            self.input_size = input_size
            self.s_size = int(np.prod(input_size[0]))
            self.a_size = input_size[1]
            self.output_size = output_size
            self.m_hidden = sdict['m_hidden']
            self.name = name

            self.print_network_info()
            self._build_network()

        def _build_network(self, lr=1e-3):#TODO: change to structure dict
            in_dim = self.s_size + 1 # observation plus one scalar action
            with tf.variable_scope("model_network"):
                m_w1 = tf.get_variable("m_w1", shape=[in_dim, self.m_hidden],
                                                initializer=xavier_initializer())
                m_b1 = tf.Variable(tf.zeros([self.m_hidden]), name="m_b1")

                m_w2 = tf.get_variable("m_w2", shape=[self.m_hidden, self.m_hidden],
                                                initializer=xavier_initializer())
                m_b2 = tf.Variable(tf.zeros([self.m_hidden]), name="m_b2")

                m_w_obs = tf.get_variable("m_w_obs", shape=[self.m_hidden, self.s_size],
                                                initializer=xavier_initializer())
                m_w_reward = tf.get_variable("m_w_reward", shape=[self.m_hidden, 1],
                                                initializer=xavier_initializer())
                m_w_done = tf.get_variable("m_w_done", shape=[self.m_hidden, 1],
                                                initializer=xavier_initializer())

                m_b_obs = tf.Variable(tf.zeros([self.s_size]), name="m_b_obs")
                m_b_reward = tf.Variable(tf.zeros([1]), name="m_b_reward")
                m_b_done = tf.Variable(tf.zeros([1]), name="m_b_done")

                self.prev_state = tf.placeholder(tf.float32, [None, in_dim])
                m_h1 = tf.nn.relu(tf.matmul(self.prev_state, m_w1) + m_b1)
                m_out = tf.nn.relu(tf.matmul(m_h1, m_w2) + m_b2)

            pred_obs = tf.matmul(m_out, m_w_obs, name="pred_obs") + m_b_obs
            pred_reward = tf.matmul(m_out, m_w_reward, name="pred_reward") + m_b_reward
            pred_done = tf.sigmoid(tf.matmul(m_out, m_w_done, name="pred_done") + m_b_done)

            # TF >= 1.0 argument order: values first, then axis
            self.pred_state = tf.concat([pred_obs, pred_reward, pred_done], 1)

            self._true_obs = tf.placeholder(tf.float32, [None, self.s_size], name="true_observation")
            self._true_reward = tf.placeholder(tf.float32, [None, 1], name="true_reward")
            self._true_done = tf.placeholder(tf.float32, [None, 1], name="true_done")

            # squared error for observation and reward, cross-entropy for done
            obs_loss = tf.square(self._true_obs - pred_obs)
            reward_loss = tf.square(self._true_reward - pred_reward)
            done_loss = -tf.log(pred_done*self._true_done + (1-pred_done)*(1-self._true_done))
            self._loss = tf.reduce_mean(obs_loss + reward_loss + done_loss)
            mvars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="model_network")
            self._optm = tf.train.AdamOptimizer(learning_rate=lr).minimize(self._loss, var_list=mvars)

        def predict(self, state_action):
            x = np.reshape(state_action, [1, self.s_size + 1])
            return self.session.run(self.pred_state, feed_dict={self.prev_state: x})

        def update(self, x_stack, y_stack):
            # y_stack columns: [next observation (s_size), reward, done]
            s = self.s_size
            return self.session.run([self._loss, self._optm], feed_dict={
                        self.prev_state: x_stack,
                        self._true_obs: y_stack[:, :s],
                        self._true_reward: y_stack[:, s:s+1],
                        self._true_done: y_stack[:, s+1:]})

        def stepModel(self, sess, xs, action):
            """ one simulated CartPole step from the last observation in xs """
            feed_dict = {self.prev_state: np.reshape(np.hstack([xs[-1][0], np.array(action)]),
                        [1, 5])}
            myPredict = sess.run([self.pred_state], feed_dict=feed_dict)
            reward = myPredict[0][0, 4]
            obs = myPredict[0][:, 0:4]
            obs[:, 0] = np.clip(obs[:, 0], -2.4, 2.4)
            obs[:, 2] = np.clip(obs[:, 2], -0.4, 0.4)

            doneP = np.clip(myPredict[0][:, 5], 0, 1)
            if doneP > 0.1 or len(xs) >= 300:
                done = True
            else:
                done = False
            return obs, reward, done

        def print_network_info(self):#TODO: infos
            print("Building Network.." + "\n" + "="*10 + self.name + "="*10)


    # PARAMS

    def discounted_rewards(r):
        """ discounted sum of future rewards for every step of one episode """
        discounted = np.zeros_like(r, dtype=np.float32)
        running_add = 0.
        for t in reversed(range(len(r))):
            running_add = running_add*dis + r[t]
            discounted[t] = running_add
        return discounted

    xs, drs, ys, ds = [], [], [], []

    running_reward = None
    reward_sum = 0
    episode_number = 1
    real_episodes = 1
    switch_point = 1
    batch_size = real_batch_size

    drawFromModel = False # True: step the learned model instead of the real env
    trainTheModel = True
    trainThePolicy = False

    with tf.Session() as sess:
        rendering = False

        # build both networks before initializing their variables
        policy_net = Policy_Network(sess, input_dim, 1)
        model_net = Model_Network(sess, [env.observation_space.shape, env.action_space.n], input_dim + 2)
        sess.run(tf.global_variables_initializer())
        print("SET NETWORK")

        obs = env.reset()
        print("SET ENVIRONMENT")

        while episode_number <= max_episode_num:
            if (reward_sum/batch_size > 140 and drawFromModel == False) or rendering == True:
                env.render()
                rendering = True

            x = np.reshape(obs, [1, 4])
            tfprob = policy_net.predict(x)

            # take action 1 with probability tfprob, otherwise action 0
            action = 1 if np.random.uniform() < tfprob else 0

            xs.append(x)
            ys.append(1 if action == 0 else 0) # label expected by the policy loss

            if drawFromModel == False:
                obs, reward, done, info = env.step(action)
            else:
                obs, reward, done = model_net.stepModel(sess, xs, action)

            reward_sum += reward

            ds.append(done*1)
            drs.append(reward)

            if done:
                if drawFromModel == False:
                    real_episodes += 1
                episode_number += 1
                if episode_number % 50 == 0:
                    print("in " + str(episode_number) + "th episode...")

                epx = np.vstack(xs)
                epy = np.vstack(ys)
                epr = np.vstack(drs)
                epd = np.vstack(ds)
                xs, drs, ys, ds = [], [], [], []

                if trainTheModel == True:
                    # inputs: (state, action); targets: (next state, reward, done)
                    actions = np.abs(epy[:-1] - 1)
                    state_prevs = np.hstack([epx[:-1, :], actions])
                    state_nextsAll = np.hstack([epx[1:, :], epr[1:, :], epd[1:, :]])
                    model_net.update(state_prevs, state_nextsAll)
                    pstate = sess.run(model_net.pred_state,
                                feed_dict={model_net.prev_state: state_prevs})

                if trainThePolicy == True:
                    print("trainThePolicy in " + str(episode_number))
                    discounted_epr = discounted_rewards(epr).astype('float32')
                    discounted_epr -= np.mean(discounted_epr)
                    discounted_epr /= np.std(discounted_epr) + 1e-8

                    sess.run(policy_net._optm, feed_dict={
                                policy_net._obs: epx,
                                policy_net._y: epy,
                                policy_net._advantages: discounted_epr
                            })

                if switch_point + batch_size == episode_number:
                    switch_point = episode_number

                    running_reward = reward_sum if running_reward is None else running_reward*.99 + reward_sum*.01
                    if drawFromModel == False:
                        print(("World Perf: Episode {:d}, Reward {:.5f}, Action {:.5f}, "
                               "Mean Reward {:.5f}").format(real_episodes,
                                    reward_sum/real_batch_size,
                                    action,
                                    running_reward/real_batch_size))
                        if reward_sum/batch_size > 200:
                            break
                    reward_sum = 0
                    # after 100 episodes, alternate between real and simulated batches
                    if episode_number > 100:
                        drawFromModel = not drawFromModel
                        trainTheModel = not trainTheModel
                        trainThePolicy = not trainThePolicy

                if drawFromModel == True:
                    obs = np.random.uniform(-0.1, 0.1, [4])
                    batch_size = model_batch_size
                else:
                    obs = env.reset()
                    batch_size = real_batch_size


    plt.figure(figsize=(8, 12))	
    for i in range(6):
        plt.subplot(6, 2, 2*i+1)
        plt.plot(pstate[:, i])
        plt.subplot(6, 2, 2*i+1)
        plt.plot(state_nextsAll[:, i])	
    plt.tight_layout()
    plt.show()

    return None
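
The log-likelihood expression in `Policy_Network` is compact enough to be easy to misread. A small sketch in plain Python, assuming (as in the policy code) that the label is 1 when action 0 was taken and that the network outputs the probability of action 1; `action_likelihood` and `policy_loss` are illustrative names, not part of the cell:

```python
import math


def action_likelihood(y, p):
    """y = 1 if action 0 was taken, else 0; p = P(action = 1).

    The expression reduces to (1 - p) when y == 1 and to p when y == 0,
    i.e. the probability of the action that was actually taken.
    """
    return y * (y - p) + (1 - y) * (y + p)


def policy_loss(ys, ps, advantages):
    """Negative mean of log-likelihood weighted by the advantage."""
    logliks = [math.log(action_likelihood(y, p)) for y, p in zip(ys, ps)]
    return -sum(l * a for l, a in zip(logliks, advantages)) / len(logliks)


# two hypothetical steps with the same network output p = 0.7
ys = [1, 0]              # step 0 took action 0, step 1 took action 1
ps = [0.7, 0.7]
advantages = [1.0, -1.0]
loss = policy_loss(ys, ps, advantages)
print(action_likelihood(1, 0.7), action_likelihood(0, 0.7), loss)
```

Minimizing this loss raises the probability of actions with positive advantage and lowers it for actions with negative advantage.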

<a id='Deep Q-Learning Network tensorflow'></a>
## Deep Q-Learning Network tensorflow

<a id='Double-Dualing Deep Q-Learning Network tensorflow'></a>
## Double Dueling Deep Q-Learning Network tensorflow

In [None]:
def DDQN():
    """ 120 mins  """
    import os
    import functools
    import random
    import matplotlib.pyplot as plt
    import scipy.misc

    import numpy as np
    import tensorflow as tf
    import tensorflow.contrib.slim as slim

    import gym

    env = gym.make("Breakout-v0")

    #	input_state_n = functools.reduce(lambda x,y : x*y, env.observation_space.shape)
    input_state_n = np.prod(env.observation_space.shape)
    input_obs_shape = list(env.observation_space.shape)
    output_action_n = env.action_space.n

    max_episode_num = 1000
    max_step_num = 500

    pre_train_step_num = 100

    starte = 1
    ende = 0.1
    explore_num = min(max_step_num, 300)
    ed = (starte - ende)/explore_num
    e = starte

    replay_buffer_size = 100
    update_freq = 5
    h_size = 512

    batch_size = 32
    tau = 0.001

    dis = .99

    load_model = False
    model_path = "./ddqn_model"

    class qnetwork():
        def __init__(self, h_size):
            self.scalarInput = tf.placeholder(tf.float32, [None, input_state_n])
            self.imageIn = tf.reshape(self.scalarInput, shape=[-1]+input_obs_shape)
            self.conv1 = slim.convolution2d(
                            inputs=self.imageIn,
                            num_outputs=32,
                            kernel_size=[8, 8],
                            stride=[4, 4],
                            padding='VALID',
                            activation_fn=tf.nn.relu,
                            biases_initializer=None
                        )
            self.conv2 = slim.convolution2d(
                            inputs=self.conv1,
                            num_outputs=64,
                            kernel_size=[3, 3],
                            stride=[1, 1],
                            padding='VALID',
                            activation_fn=tf.nn.relu,
                            biases_initializer=None
                        )
            self.conv3 = slim.convolution2d(
                            inputs=self.conv2,
                            num_outputs=512,
                            kernel_size=[7, 7],
                            stride=[1, 1],
                            padding='VALID',
                            activation_fn=tf.nn.relu,
                            biases_initializer=None
                        )

            # Dueling
            # TF >= 1.0 argument order: value, number of splits, axis
            self.streamAC, self.streamVC = tf.split(self.conv3, 2, 3)
            self.streamA = slim.flatten(self.streamAC)
            self.streamV = slim.flatten(self.streamVC)

            self.streamA = slim.fully_connected(self.streamA, 256)
            self.streamV = slim.fully_connected(self.streamV, 256)

            self.AW = tf.Variable(tf.random_normal([h_size//2, output_action_n]))
            self.VW = tf.Variable(tf.random_normal([h_size//2, 1]))

            self.Advantage = tf.matmul(self.streamA, self.AW)
            self.Value = tf.matmul(self.streamV, self.VW)

            self.Qout = self.Value + tf.subtract(self.Advantage, tf.reduce_mean(self.Advantage, reduction_indices=1, keep_dims=True))
            self.predict = tf.argmax(self.Qout, 1)

            self.targetQ = tf.placeholder(tf.float32, [None])
            self.actions = tf.placeholder(tf.int32, [None])
            self.actions_onehot = tf.one_hot(self.actions, output_action_n, dtype=tf.float32)

            self.Q = tf.reduce_sum(tf.multiply(self.Qout, self.actions_onehot), reduction_indices=1)

            self.td_error = tf.square(self.targetQ - self.Q)
            self.loss = tf.reduce_mean(self.td_error)
            self.trainer = tf.train.AdamOptimizer(learning_rate=1e-3)
            self.updateModel = self.trainer.minimize(self.loss)

    class experience_buffer():
        """ s, a, r, sn, d """
        def __init__(self, buffer_size=50000):
            self.buffer = []
            self.buffer_size = buffer_size

        def add(self, experience):
            # drop just enough of the oldest entries to stay within buffer_size
            if len(self.buffer) + len(experience) >= self.buffer_size:
                self.buffer[0:(len(experience) + len(self.buffer)) - self.buffer_size] = []
            self.buffer.extend(experience)

        def sample(self, size):
            return np.reshape(np.array(random.sample(self.buffer, size)), [size, 5])

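    def updateTargetGraph(tfVars, tau):
        # first half of tfVars belongs to the main network, second half to the target
        total_vars = len(tfVars)
        op_holder = []
        for idx, var in enumerate(tfVars[0:total_vars//2]):
            target_var = tfVars[idx + total_vars//2]
            # soft update: target <- tau*main + (1 - tau)*target
            op_holder.append(target_var.assign(var.value()*tau + (1-tau)*target_var.value()))
        return op_holder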

    def updateTarget(op_holder, sess):
        for op in op_holder:
            sess.run(op)

    def processState(states):
        return np.reshape(states, input_state_n)

    tf.reset_default_graph()
    mainqn = qnetwork(h_size)
    targetqn = qnetwork(h_size)

    init = tf.global_variables_initializer()
    saver = tf.train.Saver()
    tvars = tf.trainable_variables()

    targetOps = updateTargetGraph(tvars, tau)
    mybuffer = experience_buffer(replay_buffer_size)

    stepList = []
    rList = []
    total_steps = 0

    if not os.path.exists(model_path):
        os.makedirs(model_path)

    with tf.Session() as sess:
        # initialize first so that a restored checkpoint is not overwritten
        sess.run(init)
        if load_model == True:
            print("Loading Model..")
            ckpt = tf.train.get_checkpoint_state(model_path)
            saver.restore(sess, ckpt.model_checkpoint_path)
        updateTarget(targetOps, sess)

        for episode in range(max_episode_num):
            episode_buffer = experience_buffer(replay_buffer_size)
            s = env.reset()
            s = processState(s)
            d = False
            rAll = 0
            step = 0

            while step < max_step_num:
                #e = 1. / ((episode + 50.) + 10.)
                if np.random.rand(1) < e or total_steps < pre_train_step_num:
                    a = np.random.randint(0, output_action_n)
                else:
                    a = sess.run(mainqn.predict, feed_dict={mainqn.scalarInput: [s]})[0]

                sn, r, d, _ = env.step(a)
                sn = processState(sn)

                total_steps += 1
                step+=1
                # dtype=object keeps the mixed (array, scalar) transition as one row of 5 fields
                episode_buffer.add(np.reshape(np.array([s, a, r, sn, d], dtype=object), [1, 5]))

                if total_steps > pre_train_step_num:
                    if e > ende:
                        e -= ed

                    # mybuffer is only filled at episode end, so wait until it holds a full batch
                    if total_steps % update_freq == 0 and len(mybuffer.buffer) >= batch_size:
                        trainBatch = mybuffer.sample(batch_size)
                        # Double Q: main network picks the action, target network scores it
                        Q1 = sess.run(mainqn.predict, feed_dict={mainqn.scalarInput: np.vstack(trainBatch[:, 3])})
                        Q2 = sess.run(targetqn.Qout, feed_dict={targetqn.scalarInput: np.vstack(trainBatch[:, 3])})
                        end_multiplier = -(trainBatch[:, 4] - 1) # flip done
                        doubleQ = Q2[range(batch_size), Q1]
                        targetQ = trainBatch[:, 2] + (dis*doubleQ*end_multiplier) # reward + gamma * (collect not done q)
                        _ = sess.run(mainqn.updateModel,
                                feed_dict={
                                    mainqn.scalarInput: np.vstack(trainBatch[:, 0]),
                                    mainqn.targetQ: targetQ,
                                    mainqn.actions: trainBatch[:, 1]
                                })
                        updateTarget(targetOps, sess)
                rAll += r
                s = sn
                if d == True:
                    break
            mybuffer.add(episode_buffer.buffer)
            stepList.append(step)
            rList.append(rAll)
            if episode % 1000 == 0:
                saver.save(sess, model_path + "/model-"+str(episode)+".ckpt")
                print("Saved Model..")
            if len(rList) % 10 == 0:
                print(total_steps, np.mean(rList[-10:]), e)
        saver.save(sess, model_path+"/model-"+str(episode)+".ckpt")
        #TODO: bot_play(mainqn) is not defined yet

    print("Mean reward per episode : " + str(sum(rList)/max_episode_num))

    
    return None
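
Three small formulas carry most of the DDQN cell: the dueling aggregation, the double-Q target, and the soft target update. A sketch of each in plain Python, under the assumption that lists stand in for tensors; the function names and the sample numbers are hypothetical:

```python
def dueling_q(value, advantages):
    """Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + adv - mean_adv for adv in advantages]


def double_q_targets(rewards, dones, next_actions_main, next_q_target, dis=0.99):
    """Main network chooses the next action, target network evaluates it."""
    targets = []
    for r, d, a, q_row in zip(rewards, dones, next_actions_main, next_q_target):
        end_multiplier = 0.0 if d else 1.0   # no bootstrap after a terminal step
        targets.append(r + dis * q_row[a] * end_multiplier)
    return targets


def soft_update(main_weights, target_weights, tau):
    """target <- tau * main + (1 - tau) * target, element by element."""
    return [tau * m + (1 - tau) * t for m, t in zip(main_weights, target_weights)]


q_values = dueling_q(1.0, [1.0, 2.0, 3.0])
targets = double_q_targets(
    rewards=[1.0, 0.0],
    dones=[False, True],
    next_actions_main=[1, 0],              # argmax of the main network on s'
    next_q_target=[[0.2, 0.5], [0.9, 0.1]],  # target network's Q(s', .)
)
new_target = soft_update([1.0], [0.0], tau=0.001)
print(q_values, targets, new_target)
```

With a small `tau` the target network trails the main network slowly, which plays the same stabilizing role as freezing it between periodic copies.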

<a id='A3C'></a>
## Asynchronous Advantage Actor-Critic Model

The action-value function and the optimal policy are written, respectively, as
\begin{eqnarray}
Q^\pi (s, a) &=& \textbf{E}_{s_{t+k} \sim P, r_{t+k} \sim R, a_{t+k} \sim \pi} \left[ \sum_{k=1}^{\infty} \gamma^k r_{t+k} \, \Big| \, s_t = s, a_t = a \right] \\
\pi^* &=& \underset{\pi}{argmax} \; \textbf{E}_{s_0 \sim p_0, a_0 \sim \pi} \left[ Q^\pi (s_0, a_0) \right] \\
\rightarrow Q^\pi &=& \underset{Q}{argmin} \; \textbf{E}_{s_t, a_t \sim \pi} \left[ D \left( \textbf{E}_{s_{t+1}, r_t, a_{t+1}} \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) \right] \, \| \, Q(s_t, a_t) \right) \right]
\end{eqnarray}
Here $D(\cdot \| \cdot)$ denotes a divergence between the bootstrapped target and the current estimate; it is written $D$ to keep it apart from the transition distribution $P$.

Now it becomes a bi-level optimization problem,
\begin{eqnarray}
\mathcal{F} (Q, \pi) &=& \textbf{E}_{s_t, a_t \sim \pi} \left[ D \left( \textbf{E}_{s_{t+1}, r_t, a_{t+1}} \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) \right] \, \| \, Q(s_t, a_t) \right) \right] \\
f (Q, \pi) &=& -\textbf{E}_{s_0 \sim p_0, a_0 \sim \pi} \left[ Q^\pi (s_0, a_0) \right]
\end{eqnarray}
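
As a rough numerical reading of the two objectives, the sketch below takes the divergence to be squared error and replaces every expectation with an average over sampled transitions. Both choices are assumptions for illustration, and `critic_objective`, `actor_objective` and the toy table `q` are hypothetical.

```python
def critic_objective(transitions, q, gamma=0.99):
    """Sample estimate of F(Q, pi) with squared error as the divergence.

    Each transition is (s, a, r, s_next, a_next); a single sampled next
    step stands in for the inner expectation.
    """
    errors = []
    for s, a, r, s_next, a_next in transitions:
        target = r + gamma * q[(s_next, a_next)]
        errors.append((target - q[(s, a)]) ** 2)
    return sum(errors) / len(errors)


def actor_objective(start_pairs, q):
    """Sample estimate of f(Q, pi) = -E[Q(s0, a0)] over sampled start pairs."""
    return -sum(q[(s, a)] for s, a in start_pairs) / len(start_pairs)


# toy tabular Q with two state-action pairs
q = {("s0", "a"): 1.0, ("s1", "a"): 2.0}
F = critic_objective([("s0", "a", 0.5, "s1", "a")], q)
f = actor_objective([("s0", "a")], q)
print(F, f)
```

The critic lowers `F` by changing `q`, while the actor lowers `f` by changing which start actions are sampled, which is what makes the problem bi-level.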

<a id='Asynchronous Advantages Actor-Critic Model tensorflow'></a>
## Asynchronous Advantage Actor-Critic Model tensorflow

In [1]:
DATASET_DIR = './dataset/'
PROJECT_DIR = './projects/RL_collections/'
SUMMARY_DIR = PROJECT_DIR+'summaries/'
SAVER_DIR = PROJECT_DIR+'models/'
CHECKPOINT_DIR = PROJECT_DIR+'checkpoints/'
RESULT_DIR = PROJECT_DIR+'results/'

<a id='Meta_RL'></a>
## Meta_RL

#### from https://github.com/awjuliani/Meta-RL, https://hackernoon.com/learning-policies-for-learning-policies-meta-reinforcement-learning-rl%C2%B2-in-tensorflow-b15b592a2ddf#.2ck6nb2jp

Input to the agent at step t:
- x(t) : current observation
- r(t-1) : previous reward
- a(t-1) : previous action
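
One common way to feed these three quantities to a network is to concatenate them into a single vector, with the previous action one-hot encoded. The sketch below assumes that layout; `meta_rl_input` and the sample values are hypothetical.

```python
def one_hot(index, size):
    """List of length size with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def meta_rl_input(x_t, prev_reward, prev_action, n_actions):
    """Concatenate x(t), r(t-1) and one-hot a(t-1) into one input vector."""
    return list(x_t) + [float(prev_reward)] + one_hot(prev_action, n_actions)


# hypothetical step: 2-d observation, reward 1.0 for previous action 1 of 3
step_input = meta_rl_input([0.1, 0.2], prev_reward=1.0, prev_action=1, n_actions=3)
print(step_input)
```

Seeing its own last action and reward is what lets the agent adapt to the current task within an episode.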

[IDEA] how could one cluster the tasks with similarities... for meta rl