# Let DQN play Flappy Bird

## 1. Abstract

Flappy Bird became popular for combining very simple controls with high difficulty. The player taps the screen to make the bird flap upward, or does nothing and lets it fall, in order to steer the bird through the gaps between pipes. Although the controls are simple, scoring well is hard, which makes training an AI agent to play the game well an interesting project.

<img src="image/FlappyBird.jpg" width="200" height="100">

Our group thought it would be a good model for learning how DQN works: in this game the agent has only two actions in each state, and the state can be described by a handful of parameters. That might be why so many DQN tutorials use Flappy Bird as an example.

## 2. Game Environment

The game environment, found on GitHub, is a Python-based version of Flappy Bird. The function below is the game's main function; it updates the game state and displays the game screen. We modified its parameters and state representation to speed up training. The returned value `info` is the state used for Q-learning.

For most of the game, three pairs of pipes are on screen, which differs from the initial condition. We handle this case by checking the length of the pipe list.

In [3]:
def frame_step(self, input_actions):
        pygame.event.pump()

        reward = 1.0
        terminal = False

        # input_actions[0] == 1: do nothing
        # input_actions[1] == 1: flap the bird
        if input_actions[1] == 1:
            if self.playery > -2 * PLAYER_HEIGHT:
                self.playerVelY = self.playerFlapAcc
                self.playerFlapped = True

        # check for score
        playerMidPos = self.playerx + PLAYER_WIDTH / 2
        for pipe in self.upperPipes:
            pipeMidPos = pipe['x'] + PIPE_WIDTH / 2
            if pipeMidPos <= playerMidPos < pipeMidPos + 4:
                self.score += 1
                reward = 1.0

        # check if crash here
        isCrash= checkCrash({'x': self.playerx, 'y': self.playery,
                             'index': self.playerIndex},
                            self.upperPipes, self.lowerPipes)
        
        if isCrash:
            terminal = True
            self.__init__()
            reward = -1000

        FPSCLOCK.tick(FPS)
        
        if len(self.lowerPipes) == 2:
            info = np.array([self.playery,self.playerVelY , self.lowerPipes[0]['x'],self.lowerPipes[0]['y'],self.upperPipes[0]['y']])
        else:
            info = np.array([self.playery, self.playerVelY, self.lowerPipes[1]['x'], self.lowerPipes[1]['y'],self.upperPipes[1]['y']])
            
        return info, reward, terminal

The game environment exposes several useful hooks for DQN training. For example, raising `FPS` speeds up the game, which greatly shortens training, and the user can block "display_update" to avoid wasting GPU resources.

## 3. Design of Deep Q-Network

### 3.1 The structure of DQN

The algorithm we use is deep Q-learning with experience replay. It requires two neural networks with the same structure. One of them (the evaluation network) is trained to approximate the Q-function. The other (the target network), whose weights are updated with a delay, estimates the value of the best action in a given state and so provides the training targets.
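
The two ingredients named above, a replay memory and a target computed from the delayed network, can be sketched without any framework. This is a minimal illustration, not the code used in this project: `ReplayMemory`, `td_target`, and their parameters are hypothetical names, and the next-state Q-values are assumed to come from the target network.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, terminal) tuples."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size):
        # Sample without replacement; never ask for more than is stored.
        size = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), size)


def td_target(reward, next_q_values, terminal, gamma):
    """Q-learning target: r if the episode ended, else r + gamma * max Q(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)
```

Sampling random past transitions instead of only the latest one breaks the correlation between consecutive frames, and computing the target from a network that changes slowly keeps the regression target from shifting on every update.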

We tried several neural network models; the one that gave good results is built by the code below.

In [None]:
def _build_net(self):
    # ------------------ build evaluate_net ------------------
    self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')  # input
    self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')  # for calculating loss
    with tf.variable_scope('eval_net'):      
        # c_names(collections_names) are the collections to store variables
        c_names, n_l1, n_h0, n_h1, n_h2, w_initializer, b_initializer = \
        ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 500, 500, 500, 500, \
        tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)  # config of layers
        
        # first layer. collections is used later when assign to target net
        with tf.variable_scope('l1'):
            w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
            b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
            l1 = tf.nn.relu(tf.matmul(self.s, w1) + b1)
            
        # hidden layer 0
        with tf.variable_scope('l3'):
            w3 = tf.get_variable('w3', [n_l1, n_h0], initializer=w_initializer, collections=c_names)
            b3 = tf.get_variable('b3', [1, n_h0], initializer=b_initializer, collections=c_names)
            l3 = tf.nn.relu(tf.matmul(l1, w3) + b3)
            
        # hidden layer 1
        with tf.variable_scope('l4'):
            w4 = tf.get_variable('w4', [n_h0, n_h1], initializer=w_initializer, collections=c_names)
            b4 = tf.get_variable('b4', [1, n_h1], initializer=b_initializer, collections=c_names)
            l4 = tf.nn.relu(tf.matmul(l3, w4) + b4)
        
        # hidden layer 2
        with tf.variable_scope('l5'):
            w5 = tf.get_variable('w5', [n_h1, n_h2], initializer=w_initializer, collections=c_names)
            b5 = tf.get_variable('b5', [1, n_h2], initializer=b_initializer, collections=c_names)
            l5 = tf.nn.relu(tf.matmul(l4, w5) + b5)
            
        # output layer. collections is used later when assigning to the target net
        with tf.variable_scope('l2'):
            w2 = tf.get_variable('w2', [n_h2, self.n_actions], initializer=w_initializer, collections=c_names)
            b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
            self.q_eval = tf.matmul(l5, w2) + b2
            
        with tf.variable_scope('loss'):
            self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))
        
        with tf.variable_scope('train'):
            self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)

The network has an input layer, four hidden layers of 500 ReLU units each, and a linear output layer with one unit per action. The network structure graph generated by TensorBoard is shown below.
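
To get a feel for the size of this network, the number of trainable parameters can be counted from the layer widths alone. The helper `count_dense_params` is a hypothetical name introduced for this sketch; the layer sizes assume the 5 state features and 2 actions used elsewhere in this report.

```python
def count_dense_params(layer_sizes):
    """Weights plus biases for a chain of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# 5 state features -> four hidden layers of 500 units -> 2 action values
layer_sizes = [5, 500, 500, 500, 500, 2]
print(count_dense_params(layer_sizes))  # 755502
```

Almost all of the parameters sit in the three 500-by-500 hidden-to-hidden weight matrices, so changing the hidden width affects the model size far more than changing the number of input features.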

<img src="image/NN_structure.png" width="900" height="500">

### 3.2 The Tuning of Neural Network

Tuning the neural network's parameters is the hardest part of a deep learning project.

First, the weights and biases of each unit are initialized by TensorFlow initializer functions. We already did this when building the network.

Next, we manually tune the other parameters, including the learning rate, the penalty for a crash, the memory size, epsilon, and the update size. These parameters are defined below.

In [None]:
if __name__ == "__main__":
    # bird game
    env = game.GameState()  # define the game environment -> jump to game
    actions = 2
    features = 5
    RL = DeepQNetwork(actions, features, 
                      learning_rate= 10**-2,
                      reward_decay=1.0, e_greedy=0.6,
                      replace_target_iter= 200,
                      memory_size=50,
                      output_graph=True)
    
    time.sleep(0.5)
    run_bird()

We use TensorFlow to save every layer's parameters every 5000 steps, and the initial learning rate is 0.01. During training we continuously monitor how the model performs. Once the AI bird can pass several pipes, we pause training and continue with a smaller learning rate. The memory size can be reduced if the loss curve does not decrease properly. Although we only care about the game score, the loss function also matters for optimization.
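
The manual "pause, then continue with a smaller learning rate" procedure could also be written down as a staged schedule keyed to the best score so far. This is only a sketch of that idea: the function `staged_learning_rate` and the score thresholds in `STAGES` are hypothetical, and only the starting value of 0.01 comes from our setup.

```python
def staged_learning_rate(best_score, stages):
    """Return the rate of the last stage whose score threshold has been reached.

    `stages` is a list of (min_score, learning_rate) pairs sorted by min_score.
    """
    rate = stages[0][1]
    for min_score, stage_rate in stages:
        if best_score >= min_score:
            rate = stage_rate
    return rate


# Hypothetical schedule: start at 0.01, then shrink as the bird improves.
STAGES = [(0, 1e-2), (5, 1e-3), (50, 1e-4)]

for score in (0, 7, 120):
    print(score, staged_learning_rate(score, STAGES))
```

Writing the stages down this way makes a training run reproducible, whereas pausing by hand depends on when someone happens to look at the score.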

### 3.3 Training Model part 1

In [1]:
%%HTML
<video width="320" height="240" controls>
  <source src="video/Normal_SameHigh.mp4" type="video/mp4">
</video>

In [2]:
def getRandomPipe():
    """returns a randomly generated pipe"""
    # y of gap between upper and lower pipe
    gapYs = [20, 30, 40, 50, 60, 70, 80, 90]
    index = random.randint(0, len(gapYs)-1)
    #gapY = gapYs[index]
    gapY = gapYs[0]

    gapY += int(BASEY * 0.2)
    pipeX = SCREENWIDTH + 10

    return [
        {'x': pipeX, 'y': gapY - PIPE_HEIGHT*1.2},  # upper pipe
        {'x': pipeX, 'y': gapY + PIPEGAPSIZE*1.2},  # lower pipe
    ]

Here we first set `gapY` to a constant value. As the video shows, all the pipes are at the same height, which speeds up training. We wanted fast training so that we could easily test whether our network works.

### 3.4 Training Model part 2

In [2]:
%%HTML
<video width="320" height="240" controls>
  <source src="video/Normal_Random.mp4" type="video/mp4">
</video>

```
index = random.randint(0, len(gapYs)-1)  
gapY = gapYs[index]  
```
Here we set `gapY` to a random value, so in the video the pipes have different heights. In the previous model the position of the pipes did not matter, but here it is an important part of the state.
```
{'x': pipeX, 'y': gapY - PIPE_HEIGHT*1.2},  
{'x': pipeX, 'y': gapY + PIPEGAPSIZE*1.2},  
```
Given the time limit, we scale the pipe offsets by a factor of 1.2 compared with the original version, which enlarges the window the bird has to pass through. This also speeds up learning and leaves us time to try different parameters.
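
How much the 1.2 factor widens the opening can be checked with a little arithmetic on the two `y` expressions above. The sketch below assumes the upper pipe sprite is still drawn `PIPE_HEIGHT` pixels tall starting at its `y` coordinate; the sprite sizes are hypothetical values chosen for illustration, and `vertical_opening` is a name introduced here.

```python
def vertical_opening(pipe_height, pipe_gap_size, scale):
    """Free space between the bottom of the upper pipe and the top of the lower pipe.

    Mirrors getRandomPipe: upper y = gapY - pipe_height * scale and
    lower y = gapY + pipe_gap_size * scale, so gapY cancels out.
    """
    upper_bottom_offset = -pipe_height * scale + pipe_height  # relative to gapY
    lower_top_offset = pipe_gap_size * scale                  # relative to gapY
    return lower_top_offset - upper_bottom_offset


# Hypothetical sprite sizes, for illustration only.
PIPE_HEIGHT, PIPEGAPSIZE = 320, 100

print(vertical_opening(PIPE_HEIGHT, PIPEGAPSIZE, 1.0))
print(vertical_opening(PIPE_HEIGHT, PIPEGAPSIZE, 1.2))
```

Under these assumptions the opening equals `PIPEGAPSIZE * scale + PIPE_HEIGHT * (scale - 1)`, so scaling both offsets moves the upper pipe up as well as the lower pipe down, and the opening can grow by more than the factor itself.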

## 4. Analysis

## Reference

Original game model: https://github.com/floodsung/DRL-FlappyBird/blob/master/game/wrapped_flappy_bird.py

Original DQN model: https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/5_Deep_Q_Network/RL_brain.py

Original Q-learning idea from Jeremy's lecture notes.