*This notebook is the test part of the final model. Running it opens a pygame window in which the agent plays using the learned policy.*

#  Project in brief
- **In this project, we train a ping pong player that learns the direction and distance of its optimal movement through:**
    + A six-layered Convolutional Neural Network that recognizes ball behavior from the updated screen input
    + A reinforcement learning system that steers the learning agent toward desired behaviors (e.g. catching the ball) and away from undesired ones (e.g. getting stuck in loops)
- **Game rules:**
    + The ball is first served by the opponent AI at a random angle; upon winning, the serving right is passed to the opponent
    + The score in testing (shown on the pygame screen) increases by one point each time the opponent fails to catch the ball
- **File structure** - Project.zip contains:
    + A *pygame_player.py* for overall functions
    + A *pong.py* for defining opponent AI and screen display configurations
    + A *pong_player.py* for defining example class for playing pong which imports from pygame_player.py
    + A *main_train.ipynb* for training the learning agent
    + A *main_test.ipynb* for testing and demonstrating agent performance
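
The "updated screen input" mentioned above is, in the test class later in this notebook, a rolling stack of the four most recent binarized 80x80 frames. The sketch below illustrates that update in isolation with NumPy; the helper names `initial_state` and `next_state` are hypothetical and do not appear in the project code.

```python
import numpy as np

STATE_FRAMES = 4        # frames per state, as in the test class
FRAME_SHAPE = (80, 80)  # resized screen size, as in the test class


def initial_state(frame):
    """Repeat the first frame so the state is full from the first step."""
    return np.stack([frame] * STATE_FRAMES, axis=2)


def next_state(state, frame):
    """Drop the oldest frame and append the newest along the last axis."""
    return np.append(state[:, :, 1:], frame[:, :, np.newaxis], axis=2)


first = np.zeros(FRAME_SHAPE, dtype=np.uint8)        # an all-black frame
second = np.full(FRAME_SHAPE, 255, dtype=np.uint8)   # an all-white frame

state = initial_state(first)
state = next_state(state, second)
print(state.shape)
```

Stacking consecutive frames lets the network see motion, which a single still frame cannot convey.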


# Our design
- **Opponent player** - We designed the opponent pong player to be competent yet vulnerable:
    + Competent: So that it catches most balls, we let the opponent cheat by
        - reading the ball's direction from coordinates
        - aligning the ball's speed with its own speed
    + Vulnerable: 
        - The opponent's action is defined at a random rate
        - Because of its limited speed, it cannot catch certain serves, e.g. when the ball heads from the upper half towards an upper corner while the opponent is in the lower half of the field
- **RL process:**
    + Observation vs. exploring - We settled on 100k observation steps and 1000k exploring steps. To balance the exploration/exploitation tradeoff, we raised the number of observation steps after finding our agent trapped in a locally optimal policy of only moving downwards
    + Learning reward - We designed the reward system to reinforce several actions:
        - For the first catch in a round, whichever player catches the ball is rewarded 3 points. This design encourages our agent to catch more serves
        - For each continuing catch in the round, the reward is $3 \times 0.6^{\text{catch times}}$. This discourages our agent from settling into a tacit agreement in which both players stay put and make every catch
- **CNN network:**
    + The structure is based on the DeepMind architecture
    + We improved it with ReLU, adding one max-pooling layer for stability and one dropout layer against overfitting
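
As a quick check on the CNN described above, the following sketch walks the spatial size of an 80x80 input through the strides used in `_create_network` later in this notebook. It assumes TensorFlow's rule for `SAME` padding, where the output size is the input size divided by the stride, rounded up. The helper name `same_padding_out` is hypothetical.

```python
import math


def same_padding_out(size, stride):
    """Spatial output size of a conv or pool layer with SAME padding."""
    return math.ceil(size / stride)


size = 80                                        # resized screen is 80x80
after_conv1 = same_padding_out(size, 4)          # 8x8 conv, stride 4
after_pool1 = same_padding_out(after_conv1, 2)   # 2x2 max pool, stride 2
after_conv2 = same_padding_out(after_pool1, 2)   # 4x4 conv, stride 2
flat_units = after_conv2 * after_conv2 * 32      # conv2 has 32 output channels

print(after_conv1, after_pool1, after_conv2, flat_units)  # 20 10 5 800
```

The result matches the `5 * 5 * 32` reshape that feeds the 256-unit dense layer in the code.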

# Improvement 
1. **Model 1 Baseline:**
    - Features: 
        + Basic reward (1 point if the opponent fails to catch the ball)
        + 2-directioned serve (the ball starts from the middle line at 45 degrees towards one of the four corners)
        + 3-layered basic CNN
        + 20k observation steps
    - Performance (at 500k steps): [Video performance model 1](https://drive.google.com/open?id=1lUDO4R8-L9MkmMayr5M6QSiiJW7KhvAi)
    - Problems: 
        + Local optimum - the bar was easily trapped in a local policy of only moving downwards/upwards
        + Modest performance - learning was positive but slow, at a roughly linear rate
    - Code: [GitHub model 1](https://github.com/KaiChenColumbia/MagicBalls/tree/master/model1)
    
2. **Model 2:**
    - Improvement: To address the above problems, we -
        + increased the number of observation steps to 50000, and created checkpoints every 50000 steps to track how much performance had improved
        + raised the difficulty of the game by serving the ball at a random angle
        + improved the CNN by adopting the DeepMind structure
    - Performance (at 500k steps): [Video performance model 2](https://drive.google.com/open?id=1urqt67ETAMmA52moPIF5oTbHycFk0qib)
    - Problems: New problems emerged -
        + The two bars reached a “consensus” to stay in diagonal corners, where minimal movement ensures that no one loses the game; as a result, the policy was no longer updated
        + Still a slow learning rate (we attribute this to the harder rules and stronger opponent, which yield fewer positive rewards)
    - Code: [GitHub model 2](https://github.com/KaiChenColumbia/MagicBalls/tree/master/model2)
    
3. **Model 3 Final version:**
    - Improvement: To address the above problems, we -
        + updated the reward system to penalize repeated ball trajectories. Namely, the reward decays exponentially by a factor of 0.6 every time the ball hits the bar within a round, and the loop is forced to break after 10 hits
        + upgraded the CNN with ReLU
    - Performance (at 1000k steps): [Video performance model 3](https://drive.google.com/open?id=1othJyEePWy1O4lo9Et4UJgV19X6oJd4Z)
    - Code: [GitHub model 3](https://github.com/KaiChenColumbia/MagicBalls)
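
The decaying reward used in Model 3 can be illustrated with a short sketch. It assumes that the exponent counts the catches already made in the round (so the first catch earns the full 3 points) and that no further catch is rewarded once the 10-hit limit is reached. The function and constant names are hypothetical and are not taken from the project code.

```python
FIRST_CATCH_REWARD = 3.0   # reward for the first catch in a round
DECAY = 0.6                # multiplicative decay per additional catch
MAX_HITS_PER_ROUND = 10    # the loop is forced to break after this many hits


def catch_reward(previous_catches):
    """Reward for a catch, given how many catches came before it (assumed indexing)."""
    return FIRST_CATCH_REWARD * DECAY ** previous_catches


def round_rewards(hits):
    """Rewards handed out in one round, capped at the forced break."""
    hits = min(hits, MAX_HITS_PER_ROUND)
    return [catch_reward(n) for n in range(hits)]


for n, reward in enumerate(round_rewards(4)):
    print("catch %d: %.3f" % (n + 1, reward))
```

Because each extra catch is worth less than the previous one, a long rally in which both bars stay put quickly stops paying off.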

In [1]:
import os
# os.environ['SDL_VIDEODRIVER']='dummy'
import random
from collections import deque
from pong_player import PongPlayer
import tensorflow as tf
import numpy as np
import cv2
from pygame.constants import K_DOWN, K_UP, K_RIGHT

In [2]:
class DeepQPongPlayerTest(PongPlayer):
    ACTIONS_COUNT = 3  # number of valid actions. In this case up, still and down
    # FUTURE_REWARD_DISCOUNT = 0.99  # decay rate of past observations
    OBSERVATION_STEPS = 20000.  # time steps to observe before training
    EXPLORE_STEPS = 500000.  # frames over which to anneal epsilon
    INITIAL_RANDOM_ACTION_PROB = 0.05  # starting chance of an action being random
    FINAL_RANDOM_ACTION_PROB = 0.05  # final chance of an action being random
    MEMORY_SIZE = 50000  # number of observations to remember
    MINI_BATCH_SIZE = 100  # size of mini batches
    STATE_FRAMES = 4  # number of frames to store in the state
    RESIZED_SCREEN_X, RESIZED_SCREEN_Y = (80, 80)
    OBS_LAST_STATE_INDEX, OBS_ACTION_INDEX, OBS_REWARD_INDEX, OBS_CURRENT_STATE_INDEX, OBS_TERMINAL_INDEX = range(5)
    SAVE_EVERY_X_STEPS = 49999
    LEARN_RATE = 1e-6
    STORE_SCORES_LEN = 200.

    def __init__(self, checkpoint_path="checkpoint", playback_mode=False, verbose_logging=False):
        """
        Example of deep q network for pong

        :param checkpoint_path: directory to store checkpoints in
        :type checkpoint_path: str
        :param playback_mode: if true games runs in real time mode and demos itself running
        :type playback_mode: bool
        :param verbose_logging: If true then extra log information is printed to std out
        :type verbose_logging: bool
        """
        self.reward_history = deque()
        self.reward_memory = 0
        
        self._playback_mode = playback_mode
        super(DeepQPongPlayerTest, self).__init__(force_game_fps=8)
        self.verbose_logging = verbose_logging
        self._checkpoint_path = checkpoint_path
        if 1:
            self._session = tf.Session()
            self._input_layer, self._output_layer = DeepQPongPlayerTest._create_network()

            self._action = tf.placeholder("float", [None, self.ACTIONS_COUNT])
            self._target = tf.placeholder("float", [None])

            readout_action = tf.reduce_sum(tf.multiply(self._output_layer, self._action), reduction_indices=1)

            cost = tf.reduce_mean(tf.square(self._target - readout_action))
            #self._train_operation = tf.train.AdamOptimizer(self.LEARN_RATE).minimize(cost)

        self._observations = deque()
        self._last_scores = deque()

        # set the first action to do nothing
        self._last_action = np.zeros(self.ACTIONS_COUNT)
        self._last_action[1] = 1

        self._last_state = None
        self._probability_of_random_action = self.INITIAL_RANDOM_ACTION_PROB
        self._time = 0

        self._session.run(tf.global_variables_initializer())

        if not os.path.exists(self._checkpoint_path):
            os.mkdir(self._checkpoint_path)
        
        self._saver = tf.train.Saver()
        checkpoint = tf.train.get_checkpoint_state(self._checkpoint_path)

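        # fail with a clear message when no trained checkpoint is available
        if checkpoint and checkpoint.model_checkpoint_path:
            self._saver.restore(self._session, checkpoint.model_checkpoint_path)
            print("Loaded checkpoints %s" % checkpoint.model_checkpoint_path)
        else:
            raise IOError("No checkpoint found in %s" % self._checkpoint_path)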

    def get_keys_pressed(self, screen_array, reward, terminal):
        # scale down screen image
        screen_resized_grayscaled = cv2.cvtColor(cv2.resize(screen_array,
                                                            (self.RESIZED_SCREEN_X, self.RESIZED_SCREEN_Y)),
                                                 cv2.COLOR_BGR2GRAY)

        # binarize: pixels brighter than 1 become 255, all others become 0
        _, screen_resized_binary = cv2.threshold(screen_resized_grayscaled, 1, 255, cv2.THRESH_BINARY)

        #if reward != 0.0:
            # self._last_scores.append(reward)
            #if len(self._last_scores) > self.STORE_SCORES_LEN:
            #    self._last_scores.popleft()

        # first frame must be handled differently
        if self._last_state is None:
            # the _last_state will contain the image data from the last self.STATE_FRAMES frames
            self._last_state = np.stack(tuple(screen_resized_binary for _ in range(self.STATE_FRAMES)), axis=2)

            return DeepQPongPlayerTest._key_presses_from_action(self._last_action)

        screen_resized_binary = np.reshape(screen_resized_binary,
                                           (self.RESIZED_SCREEN_X, self.RESIZED_SCREEN_Y, 1))  # image preprocessing ends here
        
        current_state = np.append(self._last_state[:, :, 1:], screen_resized_binary, axis=2)


        # store the transition in previous_observations
        self._observations.append((self._last_state, self._last_action, reward, current_state, terminal))



        # no training step here: this test class only replays the restored policy

        self._time += 1
            
        # update the old values
        self._last_state = current_state
        
        self._last_action = self._choose_next_action()  # defined below

          
        self.reward_memory += reward
        if (self._time >= self.EXPLORE_STEPS-1):
            return [K_RIGHT]    
        return DeepQPongPlayerTest._key_presses_from_action(self._last_action)

    def _choose_next_action(self):
        new_action = np.zeros([self.ACTIONS_COUNT])

        if (random.random() <= self._probability_of_random_action):
            # choose an action randomly
            action_index = random.randrange(self.ACTIONS_COUNT)
        else:
            # choose an action given our last state
            readout_t = self._session.run(self._output_layer, feed_dict={self._input_layer: [self._last_state]})[0]
            if self.verbose_logging:
                print("Action Q-Values are %s" % readout_t)
            action_index = np.argmax(readout_t)

        new_action[action_index] = 1
        return new_action


    @staticmethod
    def _create_network():
        # network weights
        if 1:
            input_layer = tf.placeholder("float", [None, 80, 80,4])   # Input layer
            
            w_1 = tf.Variable(tf.truncated_normal([8, 8, 4, 16], stddev=0.01))
            b_1 = tf.Variable(tf.constant(0.01, shape=[16]))
            layer_conv1 = tf.nn.relu(tf.nn.conv2d(input_layer, w_1, strides=[1, 4, 4, 1], padding="SAME") + b_1,
                                name= 'Conv1')
            layer_pool1 = tf.nn.max_pool(layer_conv1, ksize=[1, 2, 2, 1],
                                                    strides=[1, 2, 2, 1], padding="SAME")

            # FIRST HIDDEN LAYER: CONV LAYER + MAX POOLING, SHAPE = (?, 10, 10, 16)
            # Note:
            # The DeepMind paper didn't specify a particular form of rectifier nonlinearity. Here we use ReLU,
            # which is often useful in computer vision topics.
             
            w_2 = tf.Variable(tf.truncated_normal([4, 4, 16, 32], stddev=0.01))
            b_2 = tf.Variable(tf.constant(0.01, shape=[32]))            
            layer_conv2 = tf.nn.relu(tf.nn.conv2d(layer_pool1, w_2, strides=[1, 2, 2, 1], padding="SAME") + b_2,
                                name = 'Conv2')            
            # SECOND HIDDEN LAYER: CONV LAYER, SHAPE =(?, 5, 5, 32)
            
            layer_2_flat = tf.reshape(layer_conv2, [-1, 5 *5 * 32])
            layer_dense = tf.layers.dense(inputs=layer_2_flat, units=256,  activation=tf.nn.relu) 
            # THIRD HIDDEN LAYER: FULLY-CONNECTED(DENSE) LAYER, SHAPE = (?, 256)
            
            dropout = tf.layers.dropout(layer_dense, rate=0.2, name = 'Dropout_layer')
            # REGULARIZATION LAYER: TO AVOID OVERFITTING, SHAPE = (?, 256)
            
            output_layer = tf.layers.dense(inputs=dropout, units=3) 
            # OUTPUT LAYER


        return input_layer, output_layer 

    @staticmethod
    def _key_presses_from_action(action_set):
        if action_set[0] == 1:
            return [K_DOWN]
        elif action_set[1] == 1:
            return []
        elif action_set[2] == 1:
            return [K_UP]
        raise Exception("Unexpected action") 

In [None]:
# starts pygame window
player = DeepQPongPlayerTest()
player.start()

In [4]:
from IPython.display import HTML
HTML("""<video width="600" height="400" controls><source src="Desktop/AML_final_project_1216/aml_model4/Test_model3.mp4" type="video/mp4"></video>""")