# Playing Pong with Deep Reinforcement Learning

---

Read the paper [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/pdf/1312.5602.pdf) (the paper is also inside the 'Papers' folder in the course materials), and implement a model that can play Atari games.

The goals of this project are the following:

- Read and understand the paper.
- Add a brief summary of the paper at the start of the notebook.
- Mention and implement the preprocessing needed; you can add your own steps if needed.
- Load an Atari environment from OpenAI Gym; start with Pong, and try with at least one more.
- Define the convolutional model needed for training.
- Apply deep Q-learning with your model.
- Use the model to play a game and show the result.

**Rubric:**

1. A summary of the paper was included. The summary covered what the paper does, and why, as well as the preprocessing steps and the model they introduced.
2. Read images from the environment, and performed the correct preprocessing steps.
3. Defined an agent class with the needed functions.
4. Defined the model within the agent class.
5. Trained the model with the Pong environment. Save the weights after each episode.
6. Test the model by making it play Pong.
7. Train and test the agent with another Atari environment of your choosing.


## Add a summary of the paper in this cell

### Basic installs and imports for Colab

In [0]:
#remove " > /dev/null 2>&1" to see what is going on under the hood
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
!apt-get update > /dev/null 2>&1
!apt-get install cmake > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1
!pip install gym[box2d] > /dev/null 2>&1

Requirement already up-to-date: setuptools in /usr/local/lib/python3.6/dist-packages (41.0.1)


In [0]:
import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor

import matplotlib
import matplotlib.pyplot as plt

import cv2
import numpy as np
import random, math

from keras import models, layers, optimizers

from collections import deque

import glob, io, base64

from IPython.display import HTML
from IPython import display as ipythondisplay
from pyvirtualdisplay import Display

gymlogger.set_level(40) #error only
%matplotlib inline

Using TensorFlow backend.


### Functions that wrap a video in Colab

In [0]:
"""
Utility functions to enable video recording of gym environment and displaying it
To enable video, just do "env = wrap_env(env)"
"""

def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env

In [0]:
display = Display(visible=0, size=(1400, 900))
display.start()

<Display cmd_param=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':1007'] cmd=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':1007'] oserror=None return_code=None stdout="None" stderr="None" timeout_happened=False>

In [0]:
!ls video

openaigym.episode_batch.4.5389.stats.json
openaigym.manifest.4.5389.manifest.json
openaigym.video.4.5389.video000000.meta.json
openaigym.video.4.5389.video000000.mp4
openaigym.video.4.5389.video000001.meta.json
openaigym.video.4.5389.video000001.mp4
openaigym.video.4.5389.video000008.meta.json
openaigym.video.4.5389.video000008.mp4
openaigym.video.4.5389.video000027.meta.json
openaigym.video.4.5389.video000027.mp4
openaigym.video.4.5389.video000064.meta.json
openaigym.video.4.5389.video000064.mp4
openaigym.video.4.5389.video000125.meta.json
openaigym.video.4.5389.video000125.mp4
openaigym.video.4.5389.video000216.meta.json
openaigym.video.4.5389.video000216.mp4


In [0]:
# Load the Pong environment
env = wrap_env(gym.make('PongDeterministic-v4'))

state_size = env.observation_space.shape[0]
action_size = env.action_space.n

print(state_size, action_size)

actions = env.unwrapped.get_action_meanings()

# In Pong, RIGHT moves the paddle up and LEFT moves it down
print(actions)

batch_size = 32

n_episodes = 10000

print(np.random.choice([2,3]))

210 6
['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']
3


In [0]:
env = wrap_env(gym.make('PongDeterministic-v4'))
observation = env.reset()

while True:
  
    env.render()
    
    #your agent goes here
    action = np.random.choice([0, 2,3])
    #action = env.action_space.sample() 
    
    observation, reward, done, info = env.step(action) 

    if done:
        break
            
env.close()
show_video()

## Paper notes

The authors present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning:

- High dimensional input: sensory inputs like vision and speech

- The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. 

- Applied to seven Atari games, outperforming all previous approaches on six of the games and surpassing a human expert on three of them.

### Issues

- Most successful RL applications that operate on high dimensional domains have relied on hand-crafted features combined with linear value functions or policy representations. Performance of such systems heavily relies on the quality of the feature representation.

- Most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states.

- In RL the data distribution changes as the algorithm learns new behaviours, which can be problematic for deep learning methods that assume a fixed underlying distribution.

To alleviate the problems of correlated data and non-stationary distributions, we use an experience replay mechanism, which randomly samples previous transitions and thereby smooths the training distribution over many past behaviors.

- Experience replay: we store the agent’s experiences at each time-step, $e_t = (s_t, a_t, r_t, s_{t+1})$, in a data-set $D = \{e_1, \dots, e_N\}$, pooled over many episodes into a replay memory.

- During the inner loop of the algorithm, we apply Q-learning updates, or minibatch updates, to samples of experience drawn at random from the pool of stored samples.
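
The replay memory described in the two points above can be sketched with the standard library alone. This is a minimal illustration, not the paper's implementation: `ReplayMemory`, `Transition`, and the extra `done` flag are names assumed here for the example.

```python
import random
from collections import deque, namedtuple

# One stored experience e_t = (s_t, a_t, r_t, s_{t+1}); `done` marks terminal transitions.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])


class ReplayMemory:
    """Fixed-capacity pool of transitions (hypothetical helper for illustration)."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once it is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up temporal correlation.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.push(t, 0, 0.0, t + 1, False)

batch = memory.sample(2)
```

With a capacity of 3, only the three most recent transitions survive, and every minibatch is drawn from those.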

### Background

Consider tasks in which an agent interacts with an environment $\varepsilon$, in this case the Atari emulator, in a sequence of actions, observations and rewards. At each time-step the agent selects an action $a_t$ from the set of legal game actions, $A = \{1, \dots, K\}$. The emulator’s internal state is not observed by the agent; instead it observes an image $x_t \in \mathbb{R}^d$ from the emulator, which is a vector of raw pixel values representing the current screen.

**Note that in general the game score may depend on the whole prior sequence of actions and observations; feedback about an action may only be received after many thousands of time-steps have elapsed.**

Since the agent only observes images of the current screen, the task is partially observed, i.e. it is impossible to fully understand the current situation from only the current screen $x_t$. We therefore consider sequences of actions and observations, $s_t = x_1, a_1, x_2, \dots, a_{t-1}, x_t$, and learn game strategies that depend upon these sequences.

All sequences in the emulator are assumed to terminate in a finite number of time-steps. This formalism gives rise to a large but finite Markov decision process (MDP) in which each sequence is a distinct state.
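
The full history $s_t$ grows with every step, so a fixed-length window is a common practical stand-in. The training cell later in this notebook keeps the last four preprocessed frames in a `deque`; the sketch below shows that idea in isolation. The helper names `make_stack` and `push_frame` are assumptions made for this example, and the frames are plain labels rather than images.

```python
from collections import deque


def make_stack(first_frame, k=4):
    """Start an episode by repeating the first frame k times."""
    return deque([first_frame] * k, maxlen=k)


def push_frame(stack, frame):
    """Append the newest frame; the oldest one falls off automatically."""
    stack.append(frame)
    return tuple(stack)


# Frames are stand-in labels here; in the notebook they would be 84x84 arrays.
stack = make_stack("x1")
state = push_frame(stack, "x2")
state = push_frame(stack, "x3")
```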

- Markov decision process (MDP): a discrete-time stochastic control process that provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying optimization problems solved via dynamic programming and reinforcement learning.

  - A Markov decision process is a 4-tuple $(S,A,P_{a},R_{a})$
  - $S$ is a finite set of states
  - $A$ is a finite set of actions (alternatively, $A_s$ is the finite set of actions available from state $s$),
  - $P_{a}(s,s')=\Pr(s_{t+1}=s' \mid s_{t}=s,a_{t}=a)$ is the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$
  - $R_{a}(s,s')$ is the immediate reward (or expected immediate reward) received after transitioning from state $s$ to state $s'$ due to action $a$
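
The 4-tuple above maps naturally onto a small data structure. The following is a sketch under assumptions: the `MDP` class, the two-state `toy` example, and its probabilities are invented purely for illustration and have nothing to do with the real Atari emulator.

```python
from typing import Dict, NamedTuple, Tuple


class MDP(NamedTuple):
    states: Tuple[str, ...]
    actions: Tuple[str, ...]
    # P[(s, a)] maps each successor s' to Pr(s' | s, a).
    P: Dict[Tuple[str, str], Dict[str, float]]
    # R[(s, a, s')] is the immediate reward for that transition (0 if absent).
    R: Dict[Tuple[str, str, str], float]


# Hypothetical two-state example: a rally continues or ends with a point scored.
toy = MDP(
    states=("rally", "scored"),
    actions=("stay", "move"),
    P={
        ("rally", "stay"): {"rally": 0.9, "scored": 0.1},
        ("rally", "move"): {"rally": 0.6, "scored": 0.4},
        ("scored", "stay"): {"scored": 1.0},
        ("scored", "move"): {"scored": 1.0},
    },
    R={("rally", "stay", "scored"): 1.0, ("rally", "move", "scored"): 1.0},
)


def is_valid(mdp, tol=1e-9):
    """Check that every (state, action) pair has a proper distribution over successors."""
    return all(
        abs(sum(mdp.P[(s, a)].values()) - 1.0) < tol
        for s in mdp.states
        for a in mdp.actions
    )
```

Calling `is_valid(toy)` confirms that each row of $P_a$ sums to one, which is the only structural constraint the definition imposes.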
  
The goal of the agent is to interact with the emulator by selecting actions in a way that maximises future rewards. We make the standard assumption that future rewards are discounted by a factor of $\gamma$ per time-step, and define the future discounted return at time $t$ as 

$R_t = \sum_{t'=t}^T \gamma^{t'-t} r_{t'}$

where $T$ is the time-step at which the game terminates.
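
As a small worked example of the discounted return, the function below evaluates the sum for $t = 0$ over a list of rewards. The name `discounted_return` and the sample rewards are assumptions for illustration.

```python
def discounted_return(rewards, gamma):
    """Return R_0 for rewards r_0 .. r_T, each weighted by gamma^t."""
    total = 0.0
    # Work backwards so that R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# A point scored three steps from now is worth gamma^3 today.
value = discounted_return([0.0, 0.0, 0.0, 1.0], gamma=0.5)
```

Here `value` is $0.5^3 = 0.125$: the further away a reward lies, the less it contributes.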

We define the optimal action-value function $Q^*(s, a)$ as the maximum expected return achievable by following any strategy, after seeing some sequence $s$ and then taking some action $a$, $Q^*(s, a) = \max_\pi \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$, where $\pi$ is a policy mapping sequences to actions (or distributions over actions).

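The optimal action-value function obeys the Bellman equation: if the optimal value $Q^*(s', a')$ of the sequence $s'$ at the next time-step were known for all possible actions $a'$, the optimal strategy would be to select the action $a'$ maximizing the expected value, $\mathbb{E}$, of $r + \gamma Q^*(s', a')$: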

$Q^*(s,a) = \mathbb{E}_{s'\sim\varepsilon}[r + \gamma \max_{a'} Q^*(s',a') \mid s, a]$
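
To see this equation at work, it can be applied repeatedly as an update on a tiny problem until the values stop changing. The sketch below does this on a hypothetical four-state chain with deterministic moves, so the expectation over $s'$ collapses to a single term. The chain, `step`, and `q_value_iteration` are assumptions made for this example.

```python
GAMMA = 0.9
N_STATES = 4          # states 0..3; state 3 is terminal
ACTIONS = (0, 1)      # 0 = left, 1 = right


def step(s, a):
    """Deterministic toy dynamics: reward 1 on entering the terminal state."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward


def q_value_iteration(sweeps=50):
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(sweeps):
        new_q = [row[:] for row in q]
        for s in range(N_STATES - 1):          # the terminal state keeps Q = 0
            for a in ACTIONS:
                s_next, r = step(s, a)
                # Bellman optimality backup: r + gamma * max_a' Q(s', a')
                new_q[s][a] = r + GAMMA * max(q[s_next])
        q = new_q
    return q


q_star = q_value_iteration()
```

Moving right is worth $1$, $0.9$ and $0.81$ from states 2, 1 and 0 respectively, and moving left is always worth less, so the greedy policy with respect to `q_star` heads straight for the goal.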

To estimate the optimal action-value function $Q^*$, it is common to use a function approximator, in this case a neural network:

$Q(s, a; \theta)\approx Q^*(s,a)$

where $\theta$ are the weights of the Q-network. The network can be trained by minimising a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration, $i$:

$L_i (\theta_i)=\mathbb{E}[(y_i - Q(s, a;\theta_i))^2]$

where $y_i = \mathbb{E}[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a]$
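
For a sampled minibatch, the target and loss reduce to a few lines of arithmetic. The sketch below uses made-up numbers in place of network outputs; `td_targets` and `squared_loss` are hypothetical helper names. It also assumes the usual convention, followed by the agent's `train` method in this notebook, that a terminal transition uses $y = r$ with no bootstrap term.

```python
def td_targets(batch, q_next, gamma):
    """y = r for terminal transitions, r + gamma * max_a' Q(s', a') otherwise."""
    targets = []
    for (reward, done), next_values in zip(batch, q_next):
        targets.append(reward if done else reward + gamma * max(next_values))
    return targets


def squared_loss(targets, predictions):
    """Mean of (y - Q(s, a))^2 over the minibatch."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)


# Two made-up transitions: (reward, done) plus stand-in Q(s', .) values.
batch = [(0.0, False), (1.0, True)]
q_next = [[0.5, 2.0], [3.0, 4.0]]       # Q(s', a'; theta_{i-1}) for each sample
q_taken = [1.0, 0.0]                    # Q(s, a; theta_i) for the actions taken

y = td_targets(batch, q_next, gamma=0.5)
loss = squared_loss(y, q_taken)
```

The first target bootstraps from the best next value ($0 + 0.5 \cdot 2 = 1$); the second ignores `q_next` because the episode ended, giving targets `[1.0, 1.0]` and a loss of `0.5`.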

### Algorithm
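
As a rough illustration of the loop structure of Q-learning with experience replay, the sketch below runs the same steps (epsilon-greedy action, store the transition, sample a random minibatch, update toward the target $y$) on a hypothetical five-state chain. It is a stand-in, not the paper's method: a Q-table replaces the convolutional network, and `env_step`, the chain, and all constants are assumptions chosen to keep the example tiny.

```python
import random
from collections import deque

random.seed(0)

GOAL, GAMMA, ALPHA = 4, 0.9, 0.5
q = [[0.0, 0.0] for _ in range(GOAL + 1)]   # Q[s][a]; a: 0 = left, 1 = right
memory = deque(maxlen=1000)                  # replay memory D
epsilon, epsilon_min, decay = 1.0, 0.1, 0.01


def env_step(s, a):
    """Toy chain: moving right from GOAL - 1 ends the episode with reward 1."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    done = s_next == GOAL
    return s_next, (1.0 if done else 0.0), done


for episode in range(200):
    s, done, steps = 0, False, 0
    while not done and steps < 100:
        # Epsilon-greedy action selection (ties go to "right").
        if random.random() < epsilon:
            a = random.randrange(2)
        else:
            a = 0 if q[s][0] > q[s][1] else 1
        s_next, r, done = env_step(s, a)
        memory.append((s, a, r, s_next, done))   # store the transition in D
        s, steps = s_next, steps + 1

        # Sample a random minibatch from D and apply a Q-learning update to each sample.
        for bs, ba, br, bn, bd in random.sample(list(memory), min(8, len(memory))):
            y = br if bd else br + GAMMA * max(q[bn])
            q[bs][ba] += ALPHA * (y - q[bs][ba])
    # Linearly anneal exploration once per episode.
    epsilon = max(epsilon_min, epsilon - decay)

greedy_policy = [0 if q[s][0] > q[s][1] else 1 for s in range(GOAL)]
```

After training, `greedy_policy` moves right in every non-terminal state. In the deep variant the table update is replaced by a gradient step on the squared error between $y$ and the network's prediction.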



## Define the Deep Q learning Agent

In [0]:
class DQNAgent:
    
    def __init__(self, state_size, action_size):
      
        self.state_size = state_size
        self.action_size = action_size
        
        # Events that are close in time are too correlated and give little additional information,
        # so we sample moves that are further separated in time
        self.max_memory = 300000
        self.memory = [] #deque(maxlen=800000)
        
        # Discount factor
        self.gamma = 0.99
        
        # Exploration
        self.epsilon = 1.0
        self.epsilon_min = 0.1
        self.epsilon_decay = (1-self.epsilon_min) / 1000
        
        print(self.epsilon_decay)
        
        self.learning_rate = 0.00025
        
        self.model = self._build_model()
        

    def _build_model(self):
        
        model = models.Sequential()
        
        model.add(layers.Conv2D(16, kernel_size = (8,8), strides=(4,4), 
                                padding = 'valid', 
                                kernel_initializer='glorot_uniform', 
                                input_shape=(84, 84, 4)))
        model.add(layers.LeakyReLU(alpha=0.3))
        model.add(layers.Conv2D(32, kernel_size = (4,4), strides=(2,2), 
                                padding = 'valid',
                                kernel_initializer='glorot_uniform'))
        model.add(layers.LeakyReLU(alpha=0.3))

        model.add(layers.Flatten())
        model.add(layers.Dense(256, kernel_initializer='glorot_uniform', 
                               activation='relu'))
        model.add(layers.Dense(self.action_size, 
                               kernel_initializer='glorot_uniform', activation='linear'))
        
        model.compile(loss='mse', optimizer= optimizers.RMSprop(lr=self.learning_rate, rho=0.95, epsilon=0.01))
        
        return model
    
    def remember(self, state, action, reward, next_state, done):
        '''
            state, action, reward at current time
            next_state is the state that occurs after the state-action
            done is if the episode ended
        '''
        if len(self.memory) > self.max_memory:
          self.memory.pop(0)
          
        self.memory.append((state, action, reward, next_state, done))
        
    def action(self, state):
        
        # Exploration mode
        if np.random.rand() <= self.epsilon:
            #return np.random.choice([0,2,3])
            return random.randrange(self.action_size)
        
        # Use what action is predicted by the model as the best choice
        act_values = self.model.predict(state)
        
        return np.argmax(act_values[0])
      
    def get_batch(self, batch_size):
        
        minibatch = random.sample(range(4, len(self.memory)), batch_size)

        batch = [
            (
                np.expand_dims(np.stack((self.memory[i-3][0], self.memory[i-2][0], self.memory[i-1][0], self.memory[i][0]), axis = 2), axis = 0),
                self.memory[i][1],
                np.sum((self.memory[i-3][2], self.memory[i-2][2], self.memory[i-1][2], self.memory[i][2])),
                np.expand_dims(np.stack((self.memory[i-3][3], self.memory[i-2][3], self.memory[i-1][3], self.memory[i][3]), axis = 2), axis = 0),
                self.memory[i][4]
            ) for i in minibatch
        ]
        
        return batch
      
    def train(self, batch_size):
        
        #minibatch = random.sample(self.memory, batch_size)
        batch = self.get_batch(batch_size)
        
        for state, action, reward, next_state, done in batch:
            
            target = reward
            
            if not done:
                target = (reward + self.gamma * np.amax(self.model.predict(next_state)[0]))
                
            target_f = self.model.predict(state)
            
            target_f[0][action] = target
            
            self.model.fit(state, target_f, epochs=1, verbose=0)
            
        if self.epsilon > self.epsilon_min:
            self.epsilon -= self.epsilon_decay
    
           
    def load(self, name):
        self.model.load_weights(name)
        
    def save(self, name):
        self.model.save_weights(name)

In [0]:
agent = DQNAgent(state_size, action_size)
agent.model.summary()

0.0009
Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 20, 20, 16)        4112      
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 20, 20, 16)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 9, 9, 32)          8224      
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU)    (None, 9, 9, 32)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 2592)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               663808    
_______________________________________________________________
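
The output shapes and parameter counts reported in the summary can be checked by hand with the standard formulas for a 'valid' convolution. The helper names `conv_out` and `conv_params` are assumptions for this example; the layer sizes are the ones passed to `_build_model`.

```python
def conv_out(size, kernel, stride):
    """Output width of a 'valid' convolution along one spatial dimension."""
    return (size - kernel) // stride + 1


def conv_params(kernel, in_ch, out_ch):
    """Weights plus one bias per output channel."""
    return kernel * kernel * in_ch * out_ch + out_ch


side1 = conv_out(84, kernel=8, stride=4)       # first conv layer
side2 = conv_out(side1, kernel=4, stride=2)    # second conv layer
flat = side2 * side2 * 32                      # Flatten output

params = [
    conv_params(8, 4, 16),      # conv2d_1
    conv_params(4, 16, 32),     # conv2d_2
    flat * 256 + 256,           # dense_1
]
```

This gives spatial sizes of 20 and 9, a flattened vector of 2592 values, and parameter counts of 4112, 8224 and 663808, matching the rows shown for `conv2d_1`, `conv2d_2`, `flatten_1` and `dense_1`.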

### Needed preprocessing steps

In [0]:
def preprocessFrame(image):
  
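  # Grayscale (Gym returns frames in RGB channel order)
  gray_img = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
  
  # Resize to 84 wide x 110 high (cv2.resize takes (width, height))
  resized_img = cv2.resize(gray_img, (84, 110))
  
  # Crop to 84x84 (rows 16 to 99)
  cropped_img = resized_img[16:100, 0:84]
  
  # Normalize pixel values to [0, 1]
  normalized_img = cv2.normalize(cropped_img, None, alpha=0, beta=1, norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_32F)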
  
  return normalized_img

## Training with the environment images

In [0]:
env = wrap_env(gym.make('PongDeterministic-v4'))

try:
    for e in range(n_episodes):
        
        state = preprocessFrame(env.reset())
        states = deque((state, state, state, state), maxlen=4)
        states_tensor = None
        total_reward = 0
        done = False
        
        while not done:
            
            #env.render()
            states_tensor = np.stack((states), axis = 2).reshape((1, 84, 84, 4))
            
            # Epsilon-greedy action: random while exploring, otherwise the model's best guess
            action = agent.action(states_tensor)
            
            next_state, reward, done, info = env.step(action)
            next_state = preprocessFrame(next_state)
            
            # Accumulate the episode score
            total_reward += reward
            
            # Store the per-step reward, not the running score
            agent.remember(state, action, reward, next_state, done)
            
            state = next_state
            states.append(next_state)
        
            if len(agent.memory) > batch_size + 4:
              agent.train(batch_size)
        
        if e%50 == 0:
          print("Episode: {}/{}, score: {}, e: {:.9}, m: {}".format(e, n_episodes, total_reward, agent.epsilon, len(agent.memory)))
        
        agent.save('max_reward_weights.hdf5')
                
        
finally:
    env.close()

# f, plots = plt.subplots(1, 4, figsize=(20,20))
# plots = [plot for plot in plots]
# imgs = [states_tensor[0,:,:,i] for i in range(4)]

# for img, plot in zip(imgs, plots):
#     plot.imshow(img, cmap='gray')
    
# preprocess = preprocessFrame(state)
#plt.imshow(states_tensor[0,:,:,0], cmap = 'gray')
# print(preprocess, preprocess.shape)

Instructions for updating:
Use tf.cast instead.
Episode: 0/10000, score: -21.0, e: 0.9991, m: 854
Episode: 50/10000, score: -21.0, e: 0.9541, m: 47180
Episode: 100/10000, score: -21.0, e: 0.9091, m: 91952
Episode: 150/10000, score: -21.0, e: 0.8641, m: 136505
Episode: 200/10000, score: -21.0, e: 0.8191, m: 180138
Episode: 250/10000, score: -21.0, e: 0.7741, m: 222323
Episode: 300/10000, score: -21.0, e: 0.7291, m: 263722
Episode: 350/10000, score: -21.0, e: 0.6841, m: 300001
Episode: 400/10000, score: -21.0, e: 0.6391, m: 300001
Episode: 450/10000, score: -21.0, e: 0.5941, m: 300001
Episode: 500/10000, score: -21.0, e: 0.5491, m: 300001
Episode: 550/10000, score: -21.0, e: 0.5041, m: 300001
Episode: 600/10000, score: -21.0, e: 0.4591, m: 300001
Episode: 650/10000, score: -21.0, e: 0.4141, m: 300001
Episode: 700/10000, score: -21.0, e: 0.3691, m: 300001
Episode: 750/10000, score: -21.0, e: 0.3241, m: 300001
Episode: 800/10000, score: -20.0, e: 0.2791, m: 300001
Episode: 850/10000, score

KeyboardInterrupt: ignored

In [0]:
!ls

### Test your model

In [0]:

env = wrap_env(gym.make('PongDeterministic-v4'))
#agent.load('0700hdf5')

try:
      # Preprocess the first frame and build the same 4-frame stack used in training
      state = preprocessFrame(env.reset())
      states = deque((state, state, state, state), maxlen=4)

      total_reward = 0
      done = False

      while not done:

          env.render()

          # The model expects input of shape (1, 84, 84, 4)
          states_tensor = np.stack((states), axis = 2).reshape((1, 84, 84, 4))

          # Action chosen by the agent (still epsilon-greedy with the current epsilon)
          action = agent.action(states_tensor)

          next_state, reward, done, info = env.step(action)

          total_reward += reward

          states.append(preprocessFrame(next_state))
        
finally:
    env.close()
    show_video()

## Train and test your agent with another Atari environment