# Playing Pong with Deep Reinforcement Learning

---

Read the paper [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/pdf/1312.5602.pdf) (the paper is also inside the 'Papers' folder in the course materials), and implement a model that can play atari games.

The goals of this project are the following:

- Read and understand the paper.
- Add a brief summary of the paper at the start of the notebook.
- Mention and implement the preprocessing needed; you can add your own steps if needed.
- Load an Atari environment from OpenAI Gym; start with Pong, and try with at least one more.
- Define the convolutional model needed for training.
- Apply deep q learning with your model.
- Use the model to play a game and show the result.

**Rubric:**

1. A summary of the paper was included. The summary covered what the paper does, and why, as well as the preprocessing steps and the model they introduced.
2. Read images from the environment, and performed the correct preprocessing steps.
3. Defined an agent class with the needed functions.
4. Defined the model within the agent class.
5. Trained the model with the Pong environment. Save the weights after each episode.
6. Test the model by making it play Pong.
7. Train and test the agent with another Atari environment of your choosing.


## Add a summary of the paper in this cell

The paper talked about the posibility to create a neuronal network that can learn to play different games without any changes in the architecture of the network, the only change was the input that was a part of the screen of the game the network tried to learn.

They described how they calculated the reward and function implemented. Also mentioned that one movement in the game may take a lot of timesteps to see that movement result and give a proper reward, either negative or positive. There was a problem with the score too, different games means different ways to increase the score and therefore different increasing values, they opted to set positive rewars to one and negative to -1 to reduce complexity.

The images were preprocessed by pasing the image to gray scale, downsampling and corping it. This results in a square image. Then they merged four of those images (differente frames) and sended to the network, so the input was NxNx4. Also there was an skipping of four frames to speed up the training and it didn't affected the trainning to much, just in space invaders where the laser blinks.

The network they used was a small one, two convolutional layers, the flatten and two dense layers, one for analisis and the other for the result.

For their experiments they used a batch size of 32, RMSProp and a linear replay memory of 1 million.

### Basic installs and imports for Colab

In [0]:
#remove " > /dev/null 2>&1" to see what is going on under the hood
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
!apt-get update > /dev/null 2>&1
!apt-get install cmake > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1
!pip install gym[box2d] > /dev/null 2>&1

Collecting setuptools
[?25l  Downloading https://files.pythonhosted.org/packages/c8/b0/cc6b7ba28d5fb790cf0d5946df849233e32b8872b6baca10c9e002ff5b41/setuptools-41.0.0-py2.py3-none-any.whl (575kB)
[K    100% |████████████████████████████████| 583kB 26.6MB/s 
[31mdatascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.[0m
[?25hInstalling collected packages: setuptools
  Found existing installation: setuptools 40.9.0
    Uninstalling setuptools-40.9.0:
      Successfully uninstalled setuptools-40.9.0
Successfully installed setuptools-41.0.0


In [0]:
import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor

import matplotlib
import matplotlib.pyplot as plt

import cv2
import numpy as np
import random, math

from keras import models, layers, optimizers

from collections import deque

import glob, io, base64

from IPython.display import HTML
from IPython import display as ipythondisplay
from pyvirtualdisplay import Display

gymlogger.set_level(40) #error only
%matplotlib inline

Using TensorFlow backend.


### Functions that wraps a video in colab

In [0]:
"""
Utility functions to enable video recording of gym environment and displaying it
To enable video, just do "env = wrap_env(env)""
"""

def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env

In [0]:
display = Display(visible=0, size=(1400, 900))
display.start()

<Display cmd_param=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':1001'] cmd=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':1001'] oserror=None return_code=None stdout="None" stderr="None" timeout_happened=False>

In [0]:
# Loads the pong environment
env = wrap_env(gym.make('Pong-v0'))

state_size = env.observation_space.shape[0]
action_size = env.action_space.n

print(state_size, action_size)

batch_size = 32

n_episodes = 1001

210 6


In [0]:
show_video()

Could not find video


## Define the Deep Q learning Agent

In [0]:
class DQNAgent:
    
    def __init__(self, state_size, action_size):
      
        self.state_size = state_size
        self.action_size = action_size
        
        self.memory = deque(maxlen=40000) #40k * 4 = 160k frames
        
        self.gamma = 0.95
        self.alpha = 0.001
        self.epsilon = 1.0
        self.epsilon_decay = 0.99997
        self.epsilon_min = 0.01
        
        self.model = self._build_model()
        

    def _build_model(self):
        
        model = models.Sequential()
        model.add(layers.Conv2D(16, (8,8), strides=4, input_shape = (4,84,84), data_format="channels_first"))
        model.add(layers.LeakyReLU(alpha=0.2))
        model.add(layers.Conv2D(32, (4,4), strides=2))
        model.add(layers.LeakyReLU(alpha=0.2))
        model.add(layers.Flatten())
        model.add(layers.Dense(256, activation='relu'))
        model.add(layers.Dense(self.action_size, activation='softmax')) #movements
        
        return model
    
    def remember(self, state, action, reward, next_state, done, screen):
        '''
            state, action, reward at current time
            next_state is the state that occurs after the state-action
            done is if the episode ended
        '''
        self.memory.append((state, action, reward, next_state, done, screen))
        
    def action(self, screen):
        
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        
        
        prediction = self.model.predict(screen)
        #print("prediction: ",prediction)
        
        return np.argmax(prediction[0])
        
    def train(self, batch_size):
        
        batch = random.sample(self.memory, batch_size)
        
        for state, action, reward, next_state, done, screen in batch:
          
            #screen = np.expand_dims(screen, axis=0)
            #print("screen: ",screen.shape)
            
            target = reward
            
            if not done:
                
                target = (reward + self.gamma * 
                          np.amax(self.model.predict(screen)[0]))
                
            target_y = self.model.predict(screen)
            #print(target_y,action)
            
            target_y[0][action] = target
            
            self.model.fit(screen, target_y, epochs=1, verbose=0)
            
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
    
           
    def load(self, name):
        self.model.load_weights(name)
        
    def save(self, name):
        self.model.save_weights(name)

In [0]:
agent = DQNAgent(state_size, action_size)
agent.model.summary()
agent.model.compile(loss='categorical_crossentropy', optimizer=optimizers.RMSprop(lr=1e-4), metrics=['accuracy'])


Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 16, 20, 20)        4112      
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 16, 20, 20)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 7, 9, 32)          10272     
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU)    (None, 7, 9, 32)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 2016)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               516352    
_________________________________________________________________
dens

### Needed preprocessing steps

In [0]:
def preprocessFrame(image): #210,160
  image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
  image = cv2.resize(image,(84,110)) #resize
  image = image[18:102, :]  # crop the useless areas
  
  return image

In [0]:
"""#env.reset()
env.step(2)
img = env.render(mode='rgb_array') #get the screen
print(preprocessFrame(img).shape)

plt.imshow(preprocessFrame(img), cmap="gray")"""

'#env.reset()\nenv.step(2)\nimg = env.render(mode=\'rgb_array\') #get the screen\nprint(preprocessFrame(img).shape)\n\nplt.imshow(preprocessFrame(img), cmap="gray")'

## Training with the environment images
instead of perform an action every four frames keep the same for another three and change it in the fourth

In [0]:
agent.load('15_9900_pong_extended_leaky_2Alpha.hdf5')
agent.epsilon=  0.02650281148520746

batch_size = 32
frames_pack = []
try:
    for e in range(10001): #epochs
        
        state = env.reset()
        
        total_reward = 0
        done = False
        action = 0 #do nothing in context of pong
        
        while not done:
            
            screen = env.render(mode='rgb_array') #get the screen
            screen = preprocessFrame(screen)
            frames_pack.append(screen) #stack images 
            screen = np.expand_dims(screen, axis=2)
            screen = np.expand_dims(screen, axis=0)
            #print("test: ", screen.shape)
        
            
            if len(frames_pack) == 4: 
              frames_pack = np.array(frames_pack)
              #frames_pack = np.reshape(frames_pack, (84,84,4))
              frames_pack = np.expand_dims(frames_pack, axis=0)
              #print(frames_pack.shape)
              
              # Takes a random action from the action space of the environment
              action = agent.action(frames_pack) #change action
              next_state, reward, done, info = env.step(action)
              # Define the reward for this problem
              reward = reward if not done else -1
              total_reward += reward
  
              agent.remember(state,action,reward,next_state,done,frames_pack)
              state = next_state
              frames_pack = []
            else:
              #repeat the action
              next_state, reward, done, info = env.step(action)
              reward = reward if not done else -1
              total_reward += reward
              state = next_state
              
            
        
        if e % 100 == 0:
            agent.save('16_{:05d}_pong_extended_leaky_2Alpha'.format(e) + '.hdf5')
            print(e, " : ",agent.epsilon, " : ",total_reward)
        if len(agent.memory) > batch_size:
          agent.train(batch_size)
                
        
finally:
    env.close()

0  :  0.02650281148520746  :  -20.0
Instructions for updating:
Use tf.cast instead.
100  :  0.026423421005152575  :  -21.0
200  :  0.026344268343199622  :  -21.0
300  :  0.026265352786952766  :  -21.0
400  :  0.026186673626150122  :  -21.0
500  :  0.026108230152657466  :  -21.0
600  :  0.026030021660461835  :  -21.0
700  :  0.02595204744566518  :  -21.0
800  :  0.025874306806477967  :  -21.0
900  :  0.02579679904321292  :  -21.0
1000  :  0.025719523458278725  :  -21.0
1100  :  0.02564247935617373  :  -21.0
1200  :  0.025565666043479693  :  -21.0
1300  :  0.025489082828855546  :  -21.0
1400  :  0.025412729023031166  :  -21.0
1500  :  0.02533660393880118  :  -21.0
1600  :  0.025260706891018746  :  -21.0
1700  :  0.025185037196589426  :  -21.0
1800  :  0.02510959417446504  :  -21.0
1900  :  0.02503437714563748  :  -21.0
2000  :  0.024959385433132694  :  -21.0
2100  :  0.02488461836200452  :  -21.0
2200  :  0.024810075259328594  :  -21.0
2300  :  0.024735755454196404  :  -21.0
2400  :  0.0

In [0]:
#!rm *

!ls

### Test your model

In [0]:
#1_9900 : 0.7430407020422212  : 40k
#2_9900  :  0.5521094848913968  :  -20.0
#3_9900  :  0.4102398192578718  :  -21.0
#4_9900  :  0.30482488330704127  :  -20.0
#5_9800  :  0.22717781765931463  :  -21.0
#6_9900  :  0.16880236512199645  :  -21.0
#7_2900  :  0.15473705986001693  :  -21.0
#8_9900  :  0.1149759335903355  :  -21.0
#9_3600  :  0.10320540371981428  :  -21.0
#10_4600  :  0.08990190606057706  :  -21.0
#11_3400  :  0.08118395371003878  :  -21.0
#12_8300  :  0.06328914808241247  :  -21.0
#13_7500  :  0.04802298438354698  :  -21.0
#14_9900  :  0.03568303203051324  :  -21.0
#15_9900  :  0.02650281148520746  :  -21.0
print(agent.epsilon)
#agent.load('12_9900_pong_extended_leaky_2Alpha.hdf5')
env = wrap_env(gym.make('Pong-v0'))
frames_pack= []

try:
      action = 0 #do nothing in context of pong

      state = env.reset()

      total_reward = 0
      done = False
      screen = env.render(mode='rgb_array') #get the screen
      screen = preprocessFrame(screen)
      screen = np.expand_dims(screen, axis=2)
      screen = np.expand_dims(screen, axis=0)

      while not done:
      #for time in range(200):

          #env.render()

          # Takes a random action from the action space of the environment
          screen = env.render(mode='rgb_array') #get the screen
          screen = preprocessFrame(screen)
          frames_pack.append(screen) #stack images 
          screen = np.expand_dims(screen, axis=2)
          screen = np.expand_dims(screen, axis=0)
          #action = agent.action(screen)
          
          if len(frames_pack) == 4:
            
            frames_pack = np.array(frames_pack)
            frames_pack = np.expand_dims(frames_pack, axis=0)


            # Takes a random action from the action space of the environment
            action = agent.action(frames_pack)
            
            next_state, reward, done, info = env.step(action)
            # Define the reward for this problem
            reward = reward if not done else -1

            total_reward += reward
            #print(total_reward)


            state = next_state
            frames_pack = []
          else:
            #repeat the action
            next_state, reward, done, info = env.step(action)
            reward = reward if not done else -1
            total_reward += reward
            state = next_state
        
finally:
    env.close()       
    show_video()

## Train and test your agent with another atari environment

In [0]:
eps = 1.0
decay = 0.99995
count = 0

while eps > 0.01:
    count += 1
    eps *= decay
    if count % 100 == 0:
      print(count," : ",eps)