# Executive Summary

I use an OpenAI Gym environment to practice training an agent to play an Atari game with reinforcement learning. In this notebook, I ***build a deep learning model with TensorFlow for reinforcement learning*** and ***train a reinforcement learning agent with Keras-RL***. I then compare the agent's performance after training for 100k timesteps and after 1 million timesteps. The results show that ***the agent trained for 1 million timesteps scores 42.7% higher on average than the agent trained for 100k timesteps*** (279 vs. 195.5), and that ***training the model with 34,852,326 trainable params for 100k timesteps on a CPU took more than 21 hours***.
 
 
### Here are my key findings during the process:

Agent's Performance ***Before Training***
- Before training, an agent taking random actions in the Atari game SpaceInvaders-v0 scored between 105 and 240 over the first 5 episodes (179 on average). (See details in Section 2. Test Random Environment with OpenAI Gym)

Agent's Performance ***After Training 100k Timesteps***
- After training for 100k timesteps with the key parameters and model structure below, the agent scored between 45 and 625 over 10 episodes, for an average of only 195.5. (See details in Section 7. Evaluate the Agent's Performance after Training for 100k Timesteps)

    Key Parameters:
    - Sequential image-based model with 3 Convolution2D layers, 1 Flatten layer and 5 Dense layers (Trainable params: 34,852,326)
    - Optimizer: Adam
    - Keras-RL Agent: DQNAgent
    - Learning rate: 1e-4
    - Number of training timesteps: 100k
    
Agent's Performance ***After Training 1 Million Timesteps***
- After training for 1 million timesteps with the same parameters as above, except that the 128-unit and 64-unit Dense layers were removed (Trainable params: 34,812,326), the agent scored between 120 and 530 over 10 episodes and the average score increased to 279. (See details in Section 9. Evaluate the Agent's Performance after Training for 1M Timesteps)
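
The averages and the 42.7% figure quoted above can be re-derived from the per-episode scores printed in Sections 2, 7 and 9. The snippet below is a small check rather than part of the original notebook; the helper name `relative_gain` is introduced here for illustration only.

```python
from statistics import mean

# Episode scores copied from the notebook outputs in Sections 2, 7 and 9.
random_scores = [220, 210, 105, 240, 120]
scores_100k = [625, 210, 145, 155, 170, 240, 110, 150, 105, 45]
scores_1m = [180, 240, 410, 230, 250, 190, 175, 120, 465, 530]


def relative_gain(new, old):
    """Percentage change of `new` relative to `old` (hypothetical helper)."""
    return (new - old) / old * 100


avg_random = mean(random_scores)
avg_100k = mean(scores_100k)
avg_1m = mean(scores_1m)

print(f"random: {avg_random}, 100k: {avg_100k}, 1M: {avg_1m}")
print(f"1M vs 100k: {relative_gain(avg_1m, avg_100k):.1f}%")
print(f"100k vs random: {relative_gain(avg_100k, avg_random):.1f}%")
```

With only 5 to 10 episodes per condition and scores ranging from 45 to 625, these averages are noisy, so the comparison is best read as indicative rather than conclusive.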

### ***Table of Contents:***

1. What is SpaceInvaders
2. Test Random Environment with OpenAI Gym
3. Create a Deep Learning Model with Keras
4. Build Agent with Keras-RL
5. Train the Agent
6. Save the Model
7. Evaluate the Agent's Performance after Training for 100k Timesteps 
8. Video Record and Play the Final Performance
9. Evaluate the Agent's Performance after Training for 1M Timesteps 

# 1. What is SpaceInvaders

<img src='http://gym.openai.com/videos/2019-10-21--mqt8Qj1mwo/SpaceInvaders-v0/poster.jpg' width='250px'/>

In this environment, the observation is an RGB image of the screen, which is an array of shape (210, 160, 3).
Each action is repeatedly performed for a duration of k frames, where k is uniformly sampled from {2,3,4}.
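
The frame-skipping behaviour described above can be pictured with a toy example. This is a minimal sketch under stated assumptions: `ToyEnv`, `step_frame` and `step_with_frameskip` are hypothetical names made up for illustration and are not Gym APIs; the real environment applies the frame skip internally inside `env.step`.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for an emulator: every frame yields a reward of 1."""

    def __init__(self, episode_frames=10):
        self.frames_left = episode_frames

    def step_frame(self, action):
        # Advance the toy game by a single frame.
        self.frames_left -= 1
        done = self.frames_left <= 0
        return 1.0, done


def step_with_frameskip(env, action, rng, skip_choices=(2, 3, 4)):
    """Repeat `action` for k frames, with k drawn uniformly from `skip_choices`."""
    k = rng.choice(skip_choices)
    total_reward = 0.0
    frames_run = 0
    for _ in range(k):
        reward, done = env.step_frame(action)
        total_reward += reward          # rewards from skipped frames are summed
        frames_run += 1
        if done:                        # stop early if the episode ends mid-skip
            break
    return total_reward, done, frames_run


rng = random.Random(0)
env = ToyEnv()
reward, done, frames = step_with_frameskip(env, action=1, rng=rng)
print(frames, reward, done)
```

One consequence is that a single agent "step" in the training logs covers a variable number of game frames, so step counts and frame counts are not interchangeable.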

# 2. Test Random Environment with OpenAI Gym

In [1]:
import gym
import random
import time
import os
import tensorflow.keras as keras
from multiprocessing import Process
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

***SpaceInvaders-v0***

- The v0 environment of Space Invaders returns an image as part of the state.
- We extract the shape of the image to pass to the structure of the neural network.

In [2]:
environment_name = 'SpaceInvaders-v0'

env = gym.make(environment_name)

height, width, channels = env.observation_space.shape

actions = env.action_space.n                             # 6 actions we can take ['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']

In [3]:
env.unwrapped.get_action_meanings()

['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']

In [4]:
episodes = 5

for episode in range(1, episodes + 1):                     # loop over episodes 1 to 5
    
    state = env.reset()                                    # initial observation: an RGB frame of shape (210, 160, 3)
    
    done = False                                           # becomes True when the episode (game) ends
    
    score = 0                                              # running score counter
    
    while not done:
        
        env.render()                                       # show the game screen (works outside Colab only)
        
        action = random.choice([0,1,2,3,4,5])               # pick one of the 6 actions uniformly at random
        
        n_state, reward, done, info = env.step(action)      # pass the random action to the environment to get back
                                                            # 1. next observation (the next RGB frame)
                                                            # 2. reward (points scored on this step)
                                                            # 3. done (True once the episode is over)
                                                            # 4. info (diagnostic information)
        
        score += reward
        
    print('Episode:{} Score:{}'.format(episode, score))    # print out score for each episode
    
env.close()



Episode:1 Score:220.0
Episode:2 Score:210.0
Episode:3 Score:105.0
Episode:4 Score:240.0
Episode:5 Score:120.0


Over these 5 games, the untrained agent, taking random actions, scored between 105 and 240 (179 on average).

# 3. Create a Deep Learning Model with Keras

In [5]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Convolution2D   
from tensorflow.keras.optimizers import Adam

In [6]:
def build_model(height, width, channels, actions):
    
    model = Sequential()
    
    model.add(Convolution2D(32,                                         # for an image-based model, start with a Convolution2D layer with 32 filters
                            (8,8),                                      # filter size 8 x 8
                            strides = (4,4),                            # 4 x 4 means the filter moves 4 pixels right, then 4 pixels down
                            activation = 'relu',                       # use the 'relu' activation function
                            input_shape = (3, height, width, channels)  # 3 stacked frames (the window_length), each of height x width x channels
                           ))
    
    model.add(Convolution2D(64,                                         # 64 convolution2D filters in the second layer
                            (4,4),                                      # Filter size 4 x 4
                            strides = (2,2),                            # 2 x 2 means move 2 pixels to the right, and then 2 pixels to downward
                            activation = 'relu'
                           ))
    
    model.add(Convolution2D(64,                                         # 64 Convolution2D filters in the third layer
                            (3,3),                                      # Filter size 3 x 3
                            strides = (1,1),                            # 1 x 1 means move 1 pixel to the right, and then 1 pixel downward
                            activation = 'relu'         
                           ))
    
    model.add(Flatten())    # Flatten the convolution output into a single vector so that we can then pass it to the next Dense layer
    
    model.add(Dense(512, activation = 'relu'))                          # compress all the above input image into 512 units of Dense layer
    
    model.add(Dense(256, activation = 'relu'))                          # compress down to 256 units of Dense layer
    
    model.add(Dense(128, activation = 'relu'))                          # compress down to 128 units of Dense layer
    
    model.add(Dense(64, activation = 'relu'))                           # compress down to 64 units of Dense layer
    
    model.add(Dense(actions, activation = 'linear'))                    # compress down to number of actions as output of the model
    
    return model

In [11]:
del model

In [12]:
model = build_model(height, width, channels, actions)

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_3 (Conv2D)            (None, 3, 51, 39, 32)     6176      
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 3, 24, 18, 64)     32832     
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 3, 22, 16, 64)     36928     
_________________________________________________________________
flatten_1 (Flatten)          (None, 67584)             0         
_________________________________________________________________
dense_6 (Dense)              (None, 512)               34603520  
_________________________________________________________________
dense_7 (Dense)              (None, 256)               131328    
_________________________________________________________________
dense_8 (Dense)              (None, 128)              
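
The output shapes and parameter counts reported by `model.summary()` can be checked by hand. The sketch below assumes the Keras default of no padding (`'valid'`) for the convolution layers and recomputes the totals quoted in the Executive Summary (34,852,326 for this model, 34,812,326 for the reduced model in Section 9). The helper names `conv_out`, `conv_params`, `dense_params` and `head_params` are introduced here for illustration.

```python
def conv_out(size, kernel, stride):
    """Output length of a 'valid' (no padding) convolution along one axis."""
    return (size - kernel) // stride + 1


def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(inputs, units):
    """Weight matrix plus one bias per unit."""
    return inputs * units + units


window, height, width, channels, actions = 3, 210, 160, 3, 6

# (kernel size, stride, filters) for the three Convolution2D layers
conv_layers = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]

conv_total = 0
h, w, c = height, width, channels
for kernel, stride, filters in conv_layers:
    conv_total += conv_params(kernel, kernel, c, filters)
    h, w, c = conv_out(h, kernel, stride), conv_out(w, kernel, stride), filters

# Flatten also folds in the leading axis of 3 stacked frames.
flat = window * h * w * c


def head_params(flat_inputs, hidden_units, n_actions):
    """Parameters of the Dense stack that follows the Flatten layer."""
    count, inputs = 0, flat_inputs
    for units in hidden_units + [n_actions]:
        count += dense_params(inputs, units)
        inputs = units
    return count


total_5_dense = conv_total + head_params(flat, [512, 256, 128, 64], actions)
total_3_dense = conv_total + head_params(flat, [512, 256], actions)
print(flat, total_5_dense, total_3_dense)
```

Almost all of the parameters sit in the first Dense layer (67,584 x 512 weights plus 512 biases = 34,603,520), which is why dropping the two small Dense layers changes the total by only about 40,000.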

# 4. Build Agent with Keras-RL

In [13]:
from rl.agents import DQNAgent                                  
from rl.memory import SequentialMemory
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy

In [14]:
def build_agent(model, actions):                       

    policy = LinearAnnealedPolicy(EpsGreedyQPolicy(),   # wrap EpsGreedyQPolicy so that its eps is annealed linearly
                                attr = 'eps',
                                value_min = .1,
                                value_max = 1., 
                                value_test = .2,
                                nb_steps = 10000       # number of steps over which eps decays from value_max to value_min
                               )
    
    memory = SequentialMemory(limit = 1000,
                              window_length = 3        # if we want to change the window length, 
                                                        # we need to change the input_shape in the DL model to match it
                             )
    
    dqn = DQNAgent(model = model,
                   memory = memory,
                   policy = policy,
                   enable_dueling_network = True,     # dueling networks split Q into a state value and per-action advantages; this helps the model learn when acting matters and when it does not,
                                                       # the two streams do not compete, it is simply a modified network
                   dueling_type = 'avg',
                   nb_actions = actions,              # agent to learn 6 actions ['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']
                   nb_steps_warmup = 10000            # warm-up lets the agent collect some experience before training kicks off
                  )
    
    return dqn
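
The exploration schedule configured in `build_agent` can be illustrated in plain Python. This is a minimal sketch of a linear decay with a floor, using the `value_max`, `value_min` and `nb_steps` values from the cell above. The helper name `linear_annealed_eps` is hypothetical and is not part of Keras-RL, whose internal implementation may differ in detail.

```python
def linear_annealed_eps(step, value_max=1.0, value_min=0.1, nb_steps=10000):
    """Epsilon after `step` steps under a linear decay that stops at `value_min`."""
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, slope * step + value_max)


# Probability of taking a random action at a few points in the run.
for step in (0, 2500, 5000, 10000, 50000):
    print(step, round(linear_annealed_eps(step), 3))
```

Because `nb_steps` equals `nb_steps_warmup` (both 10,000), epsilon has already reached its floor of 0.1 by the time learning updates begin, which is consistent with the `mean_eps: 0.100000` values in the training log of Section 5. A longer annealing period would keep the agent exploring during the early updates.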

# 5. Train the Agent

In [15]:
%%time
dqn = build_agent(model, actions)   # setup an agent

dqn.compile(Adam(lr=1e-4))          # use Adam optimizer 

dqn.fit(env, nb_steps = 100000, visualize = False, verbose = 2)  # DeepMind advises training for 10 to 40 million steps for a state-of-the-art model



Training for 100000 steps ...




   639/100000: episode: 1, duration: 14.332s, episode steps: 639, steps per second:  45, episode reward: 75.000, mean reward:  0.117 [ 0.000, 25.000], mean action: 2.521 [0.000, 5.000],  loss: --, mean_q: --, mean_eps: --
  1227/100000: episode: 2, duration: 12.870s, episode steps: 588, steps per second:  46, episode reward: 135.000, mean reward:  0.230 [ 0.000, 30.000], mean action: 2.408 [0.000, 5.000],  loss: --, mean_q: --, mean_eps: --
  2048/100000: episode: 3, duration: 17.886s, episode steps: 821, steps per second:  46, episode reward: 225.000, mean reward:  0.274 [ 0.000, 30.000], mean action: 2.317 [0.000, 5.000],  loss: --, mean_q: --, mean_eps: --
  2690/100000: episode: 4, duration: 13.928s, episode steps: 642, steps per second:  46, episode reward: 125.000, mean reward:  0.195 [ 0.000, 30.000], mean action: 2.313 [0.000, 5.000],  loss: --, mean_q: --, mean_eps: --
  3322/100000: episode: 5, duration: 14.005s, episode steps: 632, steps per second:  45, episode reward: 50.0



 10663/100000: episode: 16, duration: 569.514s, episode steps: 852, steps per second:   1, episode reward: 65.000, mean reward:  0.076 [ 0.000, 25.000], mean action: 2.390 [0.000, 5.000],  loss: 1.378905, mean_q: 5.869412, mean_eps: 0.100000
 11147/100000: episode: 17, duration: 403.949s, episode steps: 484, steps per second:   1, episode reward: 75.000, mean reward:  0.155 [ 0.000, 20.000], mean action: 2.409 [0.000, 5.000],  loss: 0.461618, mean_q: 5.665067, mean_eps: 0.100000
 11983/100000: episode: 18, duration: 696.030s, episode steps: 836, steps per second:   1, episode reward: 230.000, mean reward:  0.275 [ 0.000, 30.000], mean action: 3.126 [0.000, 5.000],  loss: 0.650514, mean_q: 5.100923, mean_eps: 0.100000
 12461/100000: episode: 19, duration: 395.450s, episode steps: 478, steps per second:   1, episode reward: 105.000, mean reward:  0.220 [ 0.000, 30.000], mean action: 2.513 [0.000, 5.000],  loss: 0.736310, mean_q: 5.237313, mean_eps: 0.100000
 13670/100000: episode: 20, du

 36603/100000: episode: 50, duration: 1030.170s, episode steps: 1242, steps per second:   1, episode reward: 350.000, mean reward:  0.282 [ 0.000, 30.000], mean action: 2.898 [0.000, 5.000],  loss: 0.357804, mean_q: 6.037867, mean_eps: 0.100000
 37103/100000: episode: 51, duration: 414.974s, episode steps: 500, steps per second:   1, episode reward: 95.000, mean reward:  0.190 [ 0.000, 20.000], mean action: 2.972 [0.000, 5.000],  loss: 0.280628, mean_q: 5.681791, mean_eps: 0.100000
 37607/100000: episode: 52, duration: 418.699s, episode steps: 504, steps per second:   1, episode reward: 45.000, mean reward:  0.089 [ 0.000, 15.000], mean action: 2.683 [0.000, 5.000],  loss: 0.138025, mean_q: 5.459441, mean_eps: 0.100000
 38124/100000: episode: 53, duration: 429.497s, episode steps: 517, steps per second:   1, episode reward: 110.000, mean reward:  0.213 [ 0.000, 30.000], mean action: 2.692 [0.000, 5.000],  loss: 0.192133, mean_q: 5.598152, mean_eps: 0.100000
 39088/100000: episode: 54, 

 59920/100000: episode: 84, duration: 436.974s, episode steps: 526, steps per second:   1, episode reward: 75.000, mean reward:  0.143 [ 0.000, 25.000], mean action: 2.211 [0.000, 5.000],  loss: 0.353090, mean_q: 7.667947, mean_eps: 0.100000
 60754/100000: episode: 85, duration: 699.374s, episode steps: 834, steps per second:   1, episode reward: 210.000, mean reward:  0.252 [ 0.000, 30.000], mean action: 2.535 [0.000, 5.000],  loss: 0.836739, mean_q: 8.080201, mean_eps: 0.100000
 61170/100000: episode: 86, duration: 355.252s, episode steps: 416, steps per second:   1, episode reward: 65.000, mean reward:  0.156 [ 0.000, 20.000], mean action: 2.399 [0.000, 5.000],  loss: 0.271210, mean_q: 7.801833, mean_eps: 0.100000
 61868/100000: episode: 87, duration: 596.259s, episode steps: 698, steps per second:   1, episode reward: 125.000, mean reward:  0.179 [ 0.000, 25.000], mean action: 2.021 [0.000, 5.000],  loss: 0.340848, mean_q: 7.728512, mean_eps: 0.100000
 62799/100000: episode: 88, du

 83124/100000: episode: 118, duration: 686.210s, episode steps: 810, steps per second:   1, episode reward: 165.000, mean reward:  0.204 [ 0.000, 30.000], mean action: 2.995 [0.000, 5.000],  loss: 0.881244, mean_q: 10.636079, mean_eps: 0.100000
 84072/100000: episode: 119, duration: 798.872s, episode steps: 948, steps per second:   1, episode reward: 245.000, mean reward:  0.258 [ 0.000, 30.000], mean action: 2.890 [0.000, 5.000],  loss: 0.726823, mean_q: 9.786154, mean_eps: 0.100000
 84765/100000: episode: 120, duration: 584.198s, episode steps: 693, steps per second:   1, episode reward: 120.000, mean reward:  0.173 [ 0.000, 30.000], mean action: 2.778 [0.000, 5.000],  loss: 0.502907, mean_q: 9.694742, mean_eps: 0.100000
 85148/100000: episode: 121, duration: 323.466s, episode steps: 383, steps per second:   1, episode reward: 65.000, mean reward:  0.170 [ 0.000, 20.000], mean action: 2.569 [0.000, 5.000],  loss: 0.692597, mean_q: 9.378683, mean_eps: 0.100000
 85882/100000: episode: 

<keras.callbacks.History at 0x7fa3f96d6940>

Time spent training the agent for 100k timesteps on a CPU: 21 hrs 10 mins
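
The reported wall-clock time can be turned into a per-step cost and compared with the `steps per second` column in the log above. The projection to 1 million steps is a naive extrapolation that assumes the same hardware and per-step cost; the notebook does not report how long the 1M-step run actually took.

```python
total_seconds = 21 * 3600 + 10 * 60      # 21 h 10 min reported for the 100k-step run
steps = 100_000

seconds_per_step = total_seconds / steps
steps_per_second = steps / total_seconds

# Naive projection: assumes the same per-step cost for a 1M-step run.
projected_hours_1m = seconds_per_step * 1_000_000 / 3600

print(f"{seconds_per_step:.3f} s/step, {steps_per_second:.2f} steps/s")
print(f"projected 1M steps: {projected_hours_1m:.0f} h")
```

The result of roughly one step per second matches the log once learning updates start after the warm-up (the first 10,000 steps ran at about 45 steps per second), so nearly all of the time is spent in the training updates.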

# 6. Save the Model

In [16]:
dqn.save_weights('Savedweights/SpaceInvaders_dqn_weights_100ktimesteps.h5f')

# 7. Evaluate the Agent's Performance after Training for 100k Timesteps 

In [19]:
scores = dqn.test(env, nb_episodes = 10, visualize = True)

print(np.mean(scores.history['episode_reward']))

Testing for 10 episodes ...




Episode 1: reward: 625.000, steps: 1356
Episode 2: reward: 210.000, steps: 814
Episode 3: reward: 145.000, steps: 937
Episode 4: reward: 155.000, steps: 806
Episode 5: reward: 170.000, steps: 927
Episode 6: reward: 240.000, steps: 785
Episode 7: reward: 110.000, steps: 704
Episode 8: reward: 150.000, steps: 643
Episode 9: reward: 105.000, steps: 637
Episode 10: reward: 45.000, steps: 406
195.5


Not much improvement: the average of 195.5 is only slightly above the random-action baseline of 179.

# 8. Video Record and Play the Final Performance

In [49]:
def generate_session(env):
    
    return dqn.test(env, nb_episodes = 10, visualize = True)   # return the test history so each session's scores are kept

In [50]:
# Record sessions

import gym.wrappers

with gym.wrappers.Monitor(gym.make("SpaceInvaders-v0"), directory="SpaceInvaders-v0_videos", force=True) as env_monitor:
    
    sessions = [generate_session(env_monitor) for _ in range(10)]

Testing for 10 episodes ...




Episode 1: reward: 260.000, steps: 991
Episode 2: reward: 120.000, steps: 764
Episode 3: reward: 80.000, steps: 384
Episode 4: reward: 365.000, steps: 1097
Episode 5: reward: 110.000, steps: 671
Episode 6: reward: 65.000, steps: 383
Episode 7: reward: 185.000, steps: 794
Episode 8: reward: 110.000, steps: 678
Episode 9: reward: 195.000, steps: 789
Episode 10: reward: 215.000, steps: 814
Testing for 10 episodes ...
Episode 1: reward: 125.000, steps: 654
Episode 2: reward: 115.000, steps: 682
Episode 3: reward: 375.000, steps: 792
Episode 4: reward: 110.000, steps: 633
Episode 5: reward: 75.000, steps: 556
Episode 6: reward: 130.000, steps: 827
Episode 7: reward: 485.000, steps: 1393
Episode 8: reward: 210.000, steps: 794
Episode 9: reward: 55.000, steps: 450
Episode 10: reward: 120.000, steps: 663
Testing for 10 episodes ...
Episode 1: reward: 440.000, steps: 1022
Episode 2: reward: 320.000, steps: 702
Episode 3: reward: 675.000, steps: 1004
Episode 4: reward: 210.000, steps: 676
Episod

In [52]:
# Play the recorded video

from pathlib import Path
from IPython.display import HTML

video_names = sorted([s for s in Path('SpaceInvaders-v0_videos').iterdir() if s.suffix == '.mp4'])


HTML("""
<video width="640" height="480" controls>
  <source src="{}" type="video/mp4">
</video>
""".format(video_names[-2]))  # Play the video

# 9. Evaluate the Agent's Performance after Training for 1M Timesteps 

In [59]:
def build_model_2(height, width, channels, actions):
    
    model = Sequential()
    
    model.add(Convolution2D(32,                                         # for an image-based model, start with a Convolution2D layer with 32 filters
                            (8,8),                                      # filter size 8 x 8
                            strides = (4,4),                            # 4 x 4 means the filter moves 4 pixels right, then 4 pixels down
                            activation = 'relu',                       # use the 'relu' activation function
                            input_shape = (3, height, width, channels)  # 3 stacked frames (the window_length), each of height x width x channels
                           ))
    
    model.add(Convolution2D(64,                                         # 64 convolution2D filters in the second layer
                            (4,4),                                      # Filter size 4 x 4
                            strides = (2,2),                            # 2 x 2 means move 2 pixels to the right, and then 2 pixels to downward
                            activation = 'relu'
                           ))
    
    model.add(Convolution2D(64,                                         # 64 Convolution2D filters in the third layer
                            (3,3),                                      # Filter size 3 x 3
                            strides = (1,1),                            # 1 x 1 means move 1 pixel to the right, and then 1 pixel downward
                            activation = 'relu'         
                           ))
    
    model.add(Flatten())    # Flatten the convolution output into a single vector so that we can then pass it to the next Dense layer
    
    model.add(Dense(512, activation = 'relu'))                          # compress all the above input image into 512 units of Dense layer
    
    model.add(Dense(256, activation = 'relu'))                          # compress down to 256 units of Dense layer
    
    #model.add(Dense(128, activation = 'relu'))                          # compress down to 128 units of Dense layer
    
    #model.add(Dense(64, activation = 'relu'))                           # compress down to 64 units of Dense layer
    
    model.add(Dense(actions, activation = 'linear'))                    # compress down to number of actions as output of the model
    
    return model

In [60]:
model_2 = build_model_2(height, width, channels, actions)

model_2.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_6 (Conv2D)            (None, 3, 51, 39, 32)     6176      
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 3, 24, 18, 64)     32832     
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 3, 22, 16, 64)     36928     
_________________________________________________________________
flatten_2 (Flatten)          (None, 67584)             0         
_________________________________________________________________
dense_12 (Dense)             (None, 512)               34603520  
_________________________________________________________________
dense_13 (Dense)             (None, 256)               131328    
_________________________________________________________________
dense_14 (Dense)             (None, 6)                

In [61]:
dqn_2 = build_agent(model_2, actions)   # setup an agent

dqn_2.compile(Adam(lr=1e-4))          # use Adam optimizer 



In [62]:
dqn_2.load_weights('Savedweights/1m/dqn_weights.h5f')

In [63]:
scores_2 = dqn_2.test(env, nb_episodes = 10, visualize = True)

print(np.mean(scores_2.history['episode_reward']))

Testing for 10 episodes ...




Episode 1: reward: 180.000, steps: 1137
Episode 2: reward: 240.000, steps: 820
Episode 3: reward: 410.000, steps: 1030
Episode 4: reward: 230.000, steps: 1081
Episode 5: reward: 250.000, steps: 1131
Episode 6: reward: 190.000, steps: 690
Episode 7: reward: 175.000, steps: 671
Episode 8: reward: 120.000, steps: 472
Episode 9: reward: 465.000, steps: 1550
Episode 10: reward: 530.000, steps: 1292
279.0


In [64]:
# Record sessions

def generate_session_2(env):
    
    return dqn_2.test(env, nb_episodes = 10, visualize = True)   # return the test history so each session's scores are kept

import gym.wrappers

with gym.wrappers.Monitor(gym.make("SpaceInvaders-v0"), directory="SpaceInvaders-v0_1Mtimesteps_videos", force=True) as env_monitor:
    
    sessions = [generate_session_2(env_monitor) for _ in range(10)]

Testing for 10 episodes ...




Episode 1: reward: 175.000, steps: 701
Episode 2: reward: 240.000, steps: 1356
Episode 3: reward: 280.000, steps: 1211
Episode 4: reward: 165.000, steps: 665
Episode 5: reward: 275.000, steps: 1125
Episode 6: reward: 245.000, steps: 860
Episode 7: reward: 140.000, steps: 848
Episode 8: reward: 310.000, steps: 736
Episode 9: reward: 230.000, steps: 800
Episode 10: reward: 165.000, steps: 632
Testing for 10 episodes ...
Episode 1: reward: 215.000, steps: 1149
Episode 2: reward: 165.000, steps: 648
Episode 3: reward: 125.000, steps: 663
Episode 4: reward: 310.000, steps: 929
Episode 5: reward: 260.000, steps: 873
Episode 6: reward: 425.000, steps: 1100
Episode 7: reward: 235.000, steps: 583
Episode 8: reward: 755.000, steps: 1583
Episode 9: reward: 215.000, steps: 1148
Episode 10: reward: 495.000, steps: 1046
Testing for 10 episodes ...
Episode 1: reward: 370.000, steps: 1072
Episode 2: reward: 210.000, steps: 864
Episode 3: reward: 275.000, steps: 924
Episode 4: reward: 210.000, steps: 6

In [65]:
# Play the recorded video

from pathlib import Path
from IPython.display import HTML

video_names = sorted([s for s in Path('SpaceInvaders-v0_1Mtimesteps_videos').iterdir() if s.suffix == '.mp4'])


HTML("""
<video width="640" height="480" controls>
  <source src="{}" type="video/mp4">
</video>
""".format(video_names[-1]))  # Play the video

***End of Page***