# Deep Reinforcement Learning
<center>Shan-Hung Wu & DataLab</center>
<center>Fall 2019</center>

In the last lab, we used tabular methods (Q-learning, SARSA) to train an agent to play *Flappy Bird* using hand-crafted features from the environment. However, tabular methods become slow and inefficient as more features are added, because the agent cannot generalize its experience to states it has not seen before. Furthermore, in realistic environments with large state/action spaces, storing every state-action pair requires a large amount of memory.  
In this lab, we introduce deep reinforcement learning, which uses function approximation to estimate the value or policy for any given state, including states that were never visited. We can achieve this with what we have learned in machine learning (e.g. regression, DNN).
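
To make the generalization argument concrete, here is a minimal sketch on a hypothetical 1-D toy problem (the states, the true value function `V(s) = 2s + 1`, and all variable names are assumptions for illustration, not part of the lab). A lookup table has no estimate for a state it never stored, while a fitted function can still produce one.

```python
import numpy as np

# Hypothetical toy problem: the true value function is V(s) = 2*s + 1.
seen_states = np.array([0.0, 1.0, 2.0, 4.0])
seen_values = 2.0 * seen_states + 1.0

# Tabular method: one stored entry per visited state, nothing else.
table = dict(zip(seen_states.tolist(), seen_values.tolist()))

# Function approximation: least-squares fit of V(s) ~ w*s + b.
X = np.stack([seen_states, np.ones_like(seen_states)], axis=1)
w, b = np.linalg.lstsq(X, seen_values, rcond=None)[0]

unseen_state = 3.0
tabular_estimate = table.get(unseen_state)   # no entry for an unseen state
approx_estimate = w * unseen_state + b       # the fitted function still answers

print("tabular:", tabular_estimate)
print("approximation:", approx_estimate)
```

The table also grows with the number of distinct states, whereas the approximator here keeps only two parameters regardless of how many states are visited.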

## Deep *Q*-Network
*Reference*: [Human-level control through deep reinforcement learning](https://www.nature.com/articles/nature14236)  
To use reinforcement learning successfully in situations approaching real-world complexity, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations.  
The original DQN takes raw frames as input instead of hand-crafted features, using the network architecture shown below. Note that the code in this notebook has been modified to feed an 8-dimensional state vector into fully connected layers; the frame-based convolutional version is kept in comments.
<img src="./src/DQN-model-architecture.png" alt="DQN-Architecture" width="750"/>

In [1]:
# First, install pip inside the conda environment, then use that pip to install packages
# conda install -c anaconda pip

# Install PLE (GitHub no longer serves the unauthenticated git:// protocol, so use https)
# !/home/ccchen/anaconda3/envs/tf2/bin/pip install git+https://github.com/ntasfi/PyGame-Learning-Environment

# However, the installed PLE package is missing the assets folder, so clone the repo and copy the folder into the environment
# !git clone https://github.com/ntasfi/PyGame-Learning-Environment.git
# !cp -r ./PyGame-Learning-Environment/ple/games/flappybird/assets /home/ccchen/anaconda3/envs/tf2/lib/python3.6/site-packages/ple/games/flappybird/

# Install other packages scikit-image, pygame, moviepy
# !conda install -c anaconda scikit-image
# !conda install -c cogsci pygame
# !conda install -c conda-forge moviepy


In [2]:
# import matplotlib.pyplot as plt

# from skimage import data, color
# from skimage.transform import rescale, resize, downscale_local_mean

# image = color.rgb2gray(data.astronaut())

# image_rescaled = rescale(image, 0.25, anti_aliasing=False)
# image_resized = resize(image, (image.shape[0] // 4, image.shape[1] // 4),
#                        anti_aliasing=True)
# image_downscaled = downscale_local_mean(image, (4, 3))

# fig, axes = plt.subplots(nrows=2, ncols=2)

# ax = axes.ravel()

# ax[0].imshow(image, cmap='gray')
# ax[0].set_title("Original image")

# ax[1].imshow(image_rescaled, cmap='gray')
# ax[1].set_title("Rescaled image (aliasing)")

# ax[2].imshow(image_resized, cmap='gray')
# ax[2].set_title("Resized image (no aliasing)")

# ax[3].imshow(image_downscaled, cmap='gray')
# ax[3].set_title("Downscaled image (no aliasing)")

# ax[0].set_xlim(0, 512)
# ax[0].set_ylim(512, 0)
# plt.tight_layout()
# plt.show()

In [3]:
import tensorflow as tf
import numpy as np

In [4]:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Restrict TensorFlow to only use the third GPU (index 2)
        tf.config.experimental.set_visible_devices(gpus[2], 'GPU')

        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

4 Physical GPUs, 1 Logical GPUs


In [5]:
import os
from ple import PLE
from ple.games.flappybird import FlappyBird
os.environ["SDL_VIDEODRIVER"] = "dummy"  # use a dummy video driver so that no game window pops up

game = FlappyBird()
env = PLE(game, fps=30, display_screen=False)  # environment interface to game
env.reset_game()

pygame 2.0.1 (SDL 2.0.14, Python 3.6.8)
Hello from the pygame community. https://www.pygame.org/contribute.html
couldn't import doomish
Couldn't import doom


### Temporal Difference Estimation
Recall that we can use TD estimation to update the Q-value with either *Q*-learning or SARSA. The basic idea of DQN is to approximate the Q-value with a neural network and train it in the fashion of *Q*-learning. We can formalize the algorithm as follows:
- Use a DNN $f_{Q^*}(s,a;\theta)$ to represent $Q^*(s,a)$.
    <img src="./src/function-approximator.PNG" alt="function-approximator" width="180"/>
- Algorithm(TD): initialize $\theta$ arbitrarily, iterate until convergence:
    1. Take action $a$ from $s$ using some exploration policy $\pi'$ derived from $f_{Q^*}$ (e.g., $\epsilon$-greedy).
    2. Observe $s'$ and reward $R(s,a,s')$, update $\theta$ using SGD:
        $$\theta\leftarrow\theta-\eta\nabla_{\theta}C,\text{where}$$
        $$C(\theta)=[\color{blue}{R(s,a,s')+\gamma\max_{a'}f_{Q^*}(s',a';\theta)}-f_{Q^*}(s,a;\theta)]^2$$
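
As a rough illustration of step 2, the sketch below applies one such update with a linear approximator in NumPy. The linear form `Q(s, .) = theta @ s`, the helper names `q_values` and `td_update`, and the toy numbers are all assumptions made for this example. As is common in practice, the blue target term is treated as a constant when taking the gradient.

```python
import numpy as np

def q_values(theta, s):
    """Linear approximator (assumed for this sketch): Q(s, .) = theta @ s."""
    return theta @ s

def td_update(theta, s, a, r, s_next, gamma=0.99, eta=0.1):
    """One SGD step on C(theta) for a single transition (s, a, r, s_next)."""
    # Target r + gamma * max_a' Q(s', a'), held fixed during differentiation.
    target = r + gamma * np.max(q_values(theta, s_next))
    td_error = target - q_values(theta, s)[a]
    theta = theta.copy()
    # dC/dtheta[a] = -2 * td_error * s, so descending the gradient adds this term.
    theta[a] += 2.0 * eta * td_error * s
    return theta, td_error

theta = np.zeros((2, 3))               # 2 actions, 3 state features
s = np.array([1.0, 0.0, 0.0])
s_next = np.array([0.0, 1.0, 0.0])
theta, err = td_update(theta, s, a=1, r=1.0, s_next=s_next)
print("TD error:", err)
print("updated theta:\n", theta)
```

Only the row of `theta` belonging to the taken action changes, because the cost depends on $f_{Q^*}(s,a;\theta)$ for that action alone.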

However, DQN based on the naive TD algorithm above diverges due to:  
1. Samples are correlated (violates i.i.d. assumption of training examples).
2. Non-stationary target (the target $\color{blue}{f_{Q^*}(s',a';\theta)}$ itself changes every time $\theta$ is updated).

### Stabilization Techniques
- Experience replay: To break the correlations present in the sequence of observations.
    1. Use a replay memory $D$ to store recently seen transitions $(s,a,r,s')$.
    2. Sample a mini-batch from $D$ and update $\theta$. 
- Delayed target network: To avoid chasing a moving target.
    1. Set the target value to the output of the network parameterized by *old* $\theta^-$.
    2. Update $\theta^-\leftarrow\theta$ every $K$ iterations.
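
The two techniques can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the class name `ReplayMemory`, the dictionary standing in for network parameters, and the values of `capacity` and `K` are hypothetical, and the "SGD step" is a placeholder increment.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size memory D; the oldest transitions are dropped automatically."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.add(t, 0, 1.0, t + 1)
batch = memory.sample(2)

# Delayed target network: theta_minus is a stale copy refreshed every K steps.
K = 2
theta = {"w": 0.0}
theta_minus = dict(theta)
sync_steps = []
for step in range(1, 6):
    theta["w"] += 1.0                  # placeholder for one SGD update of theta
    if step % K == 0:
        theta_minus = dict(theta)      # theta_minus <- theta
        sync_steps.append(step)

print(len(memory), batch, theta, theta_minus, sync_steps)
```

Between synchronizations the target parameters lag behind the online ones, which is exactly what keeps the regression target fixed for $K$ iterations.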

### Algorithm
Combining Algorithm(TD) with experience replay and a delayed target network, we can formalize the complete DQN algorithm as follows:  
- Algorithm(DQN): initialize $\theta$ arbitrarily and $\theta^-=\theta$, iterate until convergence:
    1. Take action $a$ from $s$ using some exploration policy $\pi'$ derived from $f_{Q^*}$ (e.g., $\epsilon$-greedy).
    2. Observe $s'$ and reward $R(s,a,s')$, add $(s,a,R,s')$ to $D$.
    3. Sample a mini-batch of transitions $(s,a,R,s')$ from $D$, do:
        $$\theta\leftarrow\theta-\eta\nabla_{\theta}C,\text{where}$$
        $$C(\theta)=[\color{blue}{R(s,a,s')+\gamma\max_{a'}f_{Q^*}(s',a';\color{red}{\theta^-})}-f_{Q^*}(s,a;\theta)]^2$$
    4. Update $\theta^-\leftarrow\theta$ every $K$ iterations.
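
For a mini-batch, the blue target term in step 3 can be computed in vectorized form. The NumPy sketch below is one way to do it; the function name `td_targets` and the toy arrays are assumptions for illustration. It also zeroes the bootstrap term for terminal transitions, which the algorithm above does not spell out but which the `loss` method in this notebook applies.

```python
import numpy as np

def td_targets(rewards, next_q_target, terminal, gamma=0.99):
    """Compute r + gamma * max_a' Q(s', a'; theta^-) for each transition.

    next_q_target holds the target network's Q-values for s',
    with shape (batch_size, num_action).
    """
    max_next_q = next_q_target.max(axis=1)
    # No future reward can follow a terminal state, so mask the bootstrap term.
    not_done = 1.0 - terminal.astype(np.float32)
    return rewards + gamma * not_done * max_next_q

rewards = np.array([0.0, 1.0, -5.0], dtype=np.float32)
next_q_target = np.array([[0.5, 1.0],
                          [2.0, 0.0],
                          [3.0, 4.0]], dtype=np.float32)
terminal = np.array([False, False, True])

targets = td_targets(rewards, next_q_target, terminal, gamma=0.9)
print(targets)
```

The cost $C(\theta)$ is then the mean squared difference between these targets and the online network's $f_{Q^*}(s,a;\theta)$ for the sampled actions.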

Let's implement DQN and apply it on Flappy Bird now!

In [6]:
# Define Input Size
IMG_WIDTH = 84
IMG_HEIGHT = 84
NUM_STACK = 4
# Modified: number of hand-crafted state features used as network input (replaces the stacked frames)
NUM_STATE_FEATURE = 8
# For Epsilon-greedy
MIN_EXPLORING_RATE = 0.01


In [7]:
class Agent:
    def __init__(self, name, num_action, discount_factor=0.99):
        self.exploring_rate = 0.1
        self.discount_factor = discount_factor
        self.num_action = num_action
        self.model = self.build_model(name)

    def build_model(self, name):
        # input: state
        # output: each action's Q-value 
#         screen_stack = tf.keras.Input(shape=[IMG_WIDTH, IMG_HEIGHT, NUM_STACK], dtype=tf.float32)
        # Modify
        input_data = tf.keras.Input(shape=[NUM_STATE_FEATURE], dtype=tf.float32)

#         x = tf.keras.layers.Conv2D(filters=32, kernel_size=8, strides=4)(screen_stack)
#         x = tf.keras.layers.ReLU()(x)
#         x = tf.keras.layers.Conv2D(filters=64, kernel_size=4, strides=2)(x)
#         x = tf.keras.layers.ReLU()(x)
#         x = tf.keras.layers.Conv2D(filters=64, kernel_size=3, strides=1)(x)
#         x = tf.keras.layers.ReLU()(x)
#         x = tf.keras.layers.Flatten()(x)
        # Modify
        x = tf.keras.layers.Dense(units=512)(input_data)
        x = tf.keras.layers.ReLU()(x)
        x = tf.keras.layers.Dense(units=512)(x)
        x = tf.keras.layers.ReLU()(x)
        x = tf.keras.layers.Dense(units=512)(x)
        x = tf.keras.layers.ReLU()(x)
        Q = tf.keras.layers.Dense(self.num_action)(x)

        model = tf.keras.Model(name=name, inputs=input_data, outputs=Q)

        return model
    
    def loss(self, state, action, reward, tar_Q, terminal):
        # Q(s,a,theta) for all a, shape (batch_size, num_action)
        output = self.model(state)
        index = tf.stack([tf.range(tf.shape(action)[0]), action], axis=1)
        # Q(s,a,theta) for selected a, shape (batch_size,)
        Q = tf.gather_nd(output, index)
        
        # set tar_Q as 0 if reaching terminal state
        # (cast the boolean mask with TensorFlow ops so it also works inside tf.function)
        tar_Q = tar_Q * (1.0 - tf.cast(terminal, tf.float32))

        # loss = E[(r + gamma * max(Q(s',a',theta^-)) - Q(s,a,theta))^2]
        loss = tf.reduce_mean(tf.square(reward + self.discount_factor * tar_Q - Q))

        return loss
    
    def max_Q(self, state):
        # Q(s,a,theta) for all a, shape (batch_size, num_action)
        output = self.model(state)

        # max over a' of Q(s',a',theta), shape (batch_size,)
        return tf.reduce_max(output, axis=1)
    
    def select_action(self, state):
        # epsilon-greedy
        if np.random.rand() < self.exploring_rate:
            action = np.random.choice(self.num_action)  # Select a random action
        else:
            state = np.expand_dims(state, axis = 0)
            # Q(s,a,theta) for all a, shape (batch_size, num_action)
            output = self.model(state)

            # select action with highest action-value (converted to a plain Python int)
            action = int(tf.argmax(output, axis=1)[0])

        return action
    
    def update_parameters(self, episode):
        self.exploring_rate = max(MIN_EXPLORING_RATE, min(0.5, 0.99**((episode) / 30)))

    def shutdown_explore(self):
        # make action selection greedy
        self.exploring_rate = 0

In [8]:
# init agent
num_action = len(env.getActionSet())

# agent for frequently updating
online_agent = Agent('online', num_action)

# agent for slow updating
target_agent = Agent('target', num_action)
# synchronize target model's weight with online model's weight
target_agent.model.set_weights(online_agent.model.get_weights())

In [9]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
average_loss = tf.keras.metrics.Mean(name='loss')

@tf.function
def train_step(state, action, reward, next_state, terminal):
    # Delayed Target Network
    tar_Q = target_agent.max_Q(next_state)
    with tf.GradientTape() as tape:
        loss = online_agent.loss(state, action, reward, tar_Q, terminal)
    gradients = tape.gradient(loss, online_agent.model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, online_agent.model.trainable_variables))
    
    average_loss.update_state(loss)

In [10]:
class Replay_buffer():
    def __init__(self, buffer_size=50000):
        self.experiences = []
        self.buffer_size = buffer_size

    def add(self, experience):
        if len(self.experiences) >= self.buffer_size:
            self.experiences.pop(0)
        self.experiences.append(experience)

    def sample(self, size):
        """
        sample experience from buffer
        """
        if size > len(self.experiences):
            experiences_idx = np.random.choice(len(self.experiences), size=size)
        else:
            experiences_idx = np.random.choice(len(self.experiences), size=size, replace=False)

        # from all sampled experiences, extract the fields (s, a, r, s', terminal)
        states = []
        actions = []
        rewards = []
        states_prime = []
        terminal = []
        for i in range(size):
            states.append(self.experiences[experiences_idx[i]][0])
            actions.append(self.experiences[experiences_idx[i]][1])
            rewards.append(self.experiences[experiences_idx[i]][2])
            states_prime.append(self.experiences[experiences_idx[i]][3])
            terminal.append(self.experiences[experiences_idx[i]][4])

        return states, actions, rewards, states_prime, terminal

In [11]:
# init buffer
buffer = Replay_buffer()

In [12]:
import moviepy.editor as mpy

def make_anim(images, fps=60, true_image=False):
    duration = len(images) / fps

    def make_frame(t):
        try:
            x = images[int(len(images) / duration * t)]
        except IndexError:
            # t can reach the clip duration, which maps past the last frame
            x = images[-1]

        if true_image:
            return x.astype(np.uint8)
        else:
            return ((x + 1) / 2 * 255).astype(np.uint8)

    clip = mpy.VideoClip(make_frame, duration=duration)
    clip.fps = fps
    return clip


In [13]:
# import skimage.transform

# def preprocess_screen(screen):
#     screen = skimage.transform.resize(screen, [IMG_WIDTH, IMG_HEIGHT, 1])
#     return screen

# def frames_to_state(input_frames):
#     if(len(input_frames) == 1):
#         state = np.concatenate(input_frames*4, axis=-1)
#     elif(len(input_frames) == 2):
#         state = np.concatenate(input_frames[0:1]*2 + input_frames[1:]*2, axis=-1)
#     elif(len(input_frames) == 3):
#         state = np.concatenate(input_frames + input_frames[2:], axis=-1)
#     else:
#         state = np.concatenate(input_frames[-4:], axis=-1)

#     return state

In [14]:
def make_state(env_state):
    state = np.zeros(NUM_STATE_FEATURE)
    state[0] = env_state['player_y']
    state[1] = env_state['player_vel']
    state[2] = env_state['next_pipe_dist_to_player']
    state[3] = env_state['next_pipe_top_y'] - env_state['player_y']
    state[4] = env_state['next_pipe_bottom_y'] - env_state['player_y']
    state[5] = env_state['next_next_pipe_dist_to_player']
    state[6] = env_state['next_next_pipe_top_y'] - env_state['player_y']
    state[7] = env_state['next_next_pipe_bottom_y'] - env_state['player_y']
    
    return state
    

In [45]:
from IPython.display import Image, display

update_every_iteration = 1000
print_every_episode = 500
save_video_every_episode = 5000
NUM_EPISODE = 20001
NUM_EXPLORE = 20
BATCH_SIZE = 32

iter_num = 0
for episode in range(0, NUM_EPISODE + 1):
    
    # Reset the environment
    env.reset_game()
    
    # record frame
    if episode % save_video_every_episode == 0:
        frames = [env.getScreenRGB()]
    
    # input frame
#     input_frames = [preprocess_screen(env.getScreenGrayscale())]
    
    # for every 500 episodes, shutdown exploration to see the performance of greedy action
    if episode % print_every_episode == 0:
        online_agent.shutdown_explore()
    
    # cumulate reward for this episode
    cum_reward = 0
    
    t = 0
    while not env.game_over():
        
#         state = frames_to_state(input_frames)
        # Modify
        state = make_state(game.getGameState())
        
        # feed current state and select an action
        action = online_agent.select_action(state)
        
        # execute the action and get reward
        reward = env.act(env.getActionSet()[action])
        
        # record frame
        if episode % save_video_every_episode == 0:
            frames.append(env.getScreenRGB())
        
        # record input frame
#         input_frames.append(preprocess_screen(env.getScreenGrayscale()))
        
        # cumulate reward
        cum_reward += reward
        
        # observe the result
#         state_prime = frames_to_state(input_frames)  # get next state
        # Modify
        state_prime = make_state(game.getGameState())
        
        # append experience for this episode
        if episode % print_every_episode != 0:
            buffer.add((state, action, reward, state_prime, env.game_over()))
        
        # Setting up for the next iteration
        state = state_prime
        t += 1
        
        # update agent
        if episode > NUM_EXPLORE and episode % print_every_episode != 0:
            iter_num += 1
            train_states, train_actions, train_rewards, train_states_prime, terminal = buffer.sample(BATCH_SIZE)
#             train_states = np.asarray(train_states).reshape(-1, IMG_WIDTH, IMG_HEIGHT, NUM_STACK)
#             train_states_prime = np.asarray(train_states_prime).reshape(-1, IMG_WIDTH, IMG_HEIGHT, NUM_STACK)
            # Modify
#             print(train_states)
            train_states = np.asarray(train_states)
            train_states_prime = np.asarray(train_states_prime)
            
            # convert Python object to Tensor to prevent graph re-tracing
            train_states = tf.convert_to_tensor(train_states, tf.float32)
            train_actions = tf.convert_to_tensor(train_actions, tf.int32)
            train_rewards = tf.convert_to_tensor(train_rewards, tf.float32)
            train_states_prime = tf.convert_to_tensor(train_states_prime, tf.float32)
            terminal = tf.convert_to_tensor(terminal, tf.bool)
            
            train_step(train_states, train_actions, train_rewards, train_states_prime, terminal)

        # synchronize target model's weight with online model's weight every 1000 iterations
        if iter_num % update_every_iteration == 0 and episode > NUM_EXPLORE and episode % print_every_episode != 0:
            target_agent.model.set_weights(online_agent.model.get_weights())

    # update exploring rate
    online_agent.update_parameters(episode)
    target_agent.update_parameters(episode)

    if episode % print_every_episode == 0 and episode > NUM_EXPLORE:
        print(
            "[{}] time live:{}, cumulated reward: {}, exploring rate: {}, average loss: {}".
            format(episode, t, cum_reward, online_agent.exploring_rate, average_loss.result()))
        average_loss.reset_states()

    if episode % save_video_every_episode == 0:  # for every 5000 episodes, record an animation
        clip = make_anim(frames, fps=60, true_image=True).rotate(-90)
        os.makedirs("movie_f", exist_ok=True)  # make sure the output directory exists before writing
        clip.write_videofile("movie_f/DQN_demo-{}.mp4".format(episode), fps=60)
        display(clip.ipython_display(fps=60, autoplay=1, loop=1, maxduration=120))



In [None]:
from moviepy.editor import VideoFileClip
print("DEMO Result")
clip = VideoFileClip("movie_f/DQN_demo-20000.mp4")
display(clip.ipython_display(fps=60, autoplay=1, loop=1, maxduration=120))