# Executive Summary

I'm using an OpenAI Gym environment to practice training an agent to play an Atari game with reinforcement learning. Taking 5,000 as the target score for this game, the agent trained for 3 million timesteps is doing reasonably well so far (average score of about 2,000 out of 5,000). Since DeepMind advises training for 10 million to 40 million steps for a state-of-the-art model, I feel confident that with more training timesteps the agent will be able to reach the 5,000 target in MsPacman.

### Here are my key findings during the process:

Agent's Performance ***Before Training***
- Before training, the agent (taking random actions) scored between about 100 and 570 in the first 50 games of the Atari game MsPacman-v0. (See details in Section 2. Try to play 50 episodes before training)

Agent's Performance ***After Training 1 Million Timesteps***
- After training for 1 million timesteps with the key parameters listed below, the agent's performance improved significantly starting from around step 700k (see details in Section 6. Viewing Logs in Tensorboard)
- The average episode length rose from about 900 steps at 700k timesteps to more than 1.1e+3 steps at 1M timesteps
- The average score rose from about 850 at 700k timesteps to more than 1.1e+3 at 1M timesteps

    Key Parameters:
    - Model: A2C with CnnPolicy
    - Number of environments (n_envs): 32
    - Number of stacked frames (n_stack): 4
    - Learning rate: 0.0007
    - Number of timesteps: 1 million
    - Training time spent: 58min 36s

Agent's Performance ***After Training 3 Million Timesteps***
- After training for 3 million timesteps with the same parameters as above, the agent's performance again improved significantly from around step 700k (see details in Section 8. Final Performance of the Agent trained 3 Million timesteps)
- The average episode length rose from about 1,000 steps at 1M timesteps to about 1.4e+3 steps near 3M timesteps (per the training log in Section 7)
- The average score rose from about 1,000 at 1M timesteps to about 2.0e+3 at 3M timesteps
- Training time spent: 3h 1min 18s

### ***Table of Contents:***

0. What is MsPacman
1. Import Dependencies
2. Try to play 50 episodes before training
3. Vectorise Environment and Train Model
4. Save the Model
5. Evaluate the Model
6. Viewing Logs in Tensorboard
7. Further Improvement
8. Final Performance of the Agent trained 3 Million timesteps

# 0. What is MsPacman

<img src='http://gym.openai.com/videos/2019-10-21--mqt8Qj1mwo/MsPacman-v0/poster.jpg' width='250px'/>

The goal is to maximize our score in the Atari 2600 game MsPacman.

In this environment, the observation is an RGB image of the screen, which is an array of shape (210, 160, 3).
Each action is repeatedly performed for a duration of k frames, where k is uniformly sampled from {2,3,4}.
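
To make the frame-skip rule concrete, here is a minimal sketch in plain Python. It is not the emulator's actual code: `sample_frame_skip` and `repeat_action` are hypothetical names, and the `step_fn` callback stands in for advancing the game by a single frame.

```python
import random

FRAME_SKIP_CHOICES = (2, 3, 4)  # k is drawn uniformly from this set


def sample_frame_skip(rng: random.Random) -> int:
    """Draw the number of frames for which one action is repeated."""
    return rng.choice(FRAME_SKIP_CHOICES)


def repeat_action(step_fn, action, rng: random.Random):
    """Apply `action` for k consecutive frames and sum the rewards.

    `step_fn(action)` is a hypothetical callback that returns the
    reward of a single frame.
    """
    k = sample_frame_skip(rng)
    total_reward = sum(step_fn(action) for _ in range(k))
    return k, total_reward


rng = random.Random(0)
# Toy frame function: every frame yields a reward of 10.
k, reward = repeat_action(lambda action: 10.0, action=1, rng=rng)
print(f"action repeated for {k} frames, total reward {reward}")
```

Because k is random, the same action sequence can lead to different outcomes from one episode to the next.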

# 1. Import Dependencies

In [1]:
import gym
import time
from stable_baselines3 import A2C
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_atari_env
import os

In [2]:
environment_name = 'MsPacman-v0'     # Using MsPacman version 0 in this notebook

env = gym.make(environment_name)     # create the MsPacman environment from gym

In [3]:
env.unwrapped.get_action_meanings()  # 9 actions could be taken in this environment

['NOOP',
 'UP',
 'RIGHT',
 'LEFT',
 'DOWN',
 'UPRIGHT',
 'UPLEFT',
 'DOWNRIGHT',
 'DOWNLEFT']

In [4]:
env.reset()                          # reset the environment

array([[[  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0],
        ...,
        [  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0]],

       [[228, 111, 111],
        [228, 111, 111],
        [228, 111, 111],
        ...,
        [228, 111, 111],
        [228, 111, 111],
        [228, 111, 111]],

       [[228, 111, 111],
        [228, 111, 111],
        [228, 111, 111],
        ...,
        [228, 111, 111],
        [228, 111, 111],
        [228, 111, 111]],

       ...,

       [[  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0],
        ...,
        [  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0]],

       [[  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0],
        ...,
        [  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0]],

       [[  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0],
        ...,
        [  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0]]

In [5]:
env.action_space                

Discrete(9)

In [6]:
env.observation_space        # observation space in this environment: 210x160 pixels with 3 RGB color channels

Box([[[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 ...

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]], [[[255 255 255]
  [255 255 255]
  [255 255 255]
  ...
  [255 255 255]
  [255 255 255]
  [255 255 255]]

 [[255 255 255]
  [255 255 255]
  [255 255 255]
  ...
  [255 255 255]
  [255 255 255]
  [255 255 255]]

 [[255 255 255]
  [255 255 255]
  [255 255 255]
  ...
  [255 255 255]
  [255 255 255]
  [255 255 255]]

 ...

 [[255 255 255]
  [255 255 255]
  [255 255 255]
  ...
  [255 255 255]
  [255 255 255]
  [255 255 255]]

 [[255 255 255]
  [255 255 255]
  [255 255 255]
  ...
  [255 255 255]
  [255 255 255]
  [255 255 255]]

 [[255 255 255]
  [255 255 255]
  [255 255 255]
  ...
 

# 2. Try to play 50 episodes before training

- Use the result as the baseline for the model

In [7]:
%%time
episodes = 50

for episode in range(1, episodes + 1):                     # loop over episodes 1 to 50
    
    obs = env.reset()                                      # reset the environment and get the initial observation
    
    done = False                                           # becomes True when the episode ends (game over or step limit reached)
    
    score = 0                                              # running score counter
    
    while not done:
        
        env.render()                                       # display the game screen
        
        action = env.action_space.sample()                 # choose a random action
        
        n_obs, reward, done, info = env.step(action)       # pass the random action to the environment to get back
                                                            # 1. the next observation (a 210x160x3 RGB frame)
                                                            # 2. the reward (points gained on this step)
                                                            # 3. done (True when the episode is over)
                                                            # 4. info (extra diagnostic information)
        
        score += reward                                    # accumulate the rewards of this episode into score
        
    print('Episode:{} Score:{}'.format(episode, score))    # print the score for each episode



Episode:1 Score:160.0
Episode:2 Score:220.0
Episode:3 Score:210.0
Episode:4 Score:360.0
Episode:5 Score:180.0
Episode:6 Score:220.0
Episode:7 Score:270.0
Episode:8 Score:260.0
Episode:9 Score:150.0
Episode:10 Score:280.0
Episode:11 Score:250.0
Episode:12 Score:220.0
Episode:13 Score:130.0
Episode:14 Score:200.0
Episode:15 Score:240.0
Episode:16 Score:220.0
Episode:17 Score:140.0
Episode:18 Score:120.0
Episode:19 Score:310.0
Episode:20 Score:150.0
Episode:21 Score:270.0
Episode:22 Score:300.0
Episode:23 Score:170.0
Episode:24 Score:150.0
Episode:25 Score:190.0
Episode:26 Score:260.0
Episode:27 Score:280.0
Episode:28 Score:230.0
Episode:29 Score:160.0
Episode:30 Score:200.0
Episode:31 Score:160.0
Episode:32 Score:160.0
Episode:33 Score:300.0
Episode:34 Score:230.0
Episode:35 Score:300.0
Episode:36 Score:190.0
Episode:37 Score:210.0
Episode:38 Score:340.0
Episode:39 Score:210.0
Episode:40 Score:180.0
Episode:41 Score:110.0
Episode:42 Score:280.0
Episode:43 Score:100.0
Episode:44 Score:570

Across the 50 games above, the untrained agent scored between about 100 and 570.
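
A quick way to get such a range without reading the printout by eye is to collect the scores in a list and summarise them. The sketch below uses only the first ten scores printed above as sample data; `summarize_scores` is a hypothetical helper, not part of gym.

```python
from statistics import fmean


def summarize_scores(scores):
    """Return the minimum, maximum and mean of a list of episode scores."""
    return {"min": min(scores), "max": max(scores), "mean": fmean(scores)}


# First ten scores printed by the random-play loop above.
baseline_scores = [160.0, 220.0, 210.0, 360.0, 180.0,
                   220.0, 270.0, 260.0, 150.0, 280.0]

summary = summarize_scores(baseline_scores)
print(summary)
```

In the notebook itself, the same helper could be applied to a list filled inside the episode loop.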

In [8]:
env.close()                             # close the opened environment

# 3. Vectorise Environment and Train Model

- Vectorizing the environment, i.e. running several copies of it, allows us to train the agent faster by collecting experience in parallel

***Helper Functions***
- **make_atari_env** is a helper from Stable Baselines3 that creates wrapped Atari environments
- **VecFrameStack** stacks the last few frames of each environment into a single observation, so the agent can perceive motion
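
To illustrate the idea of frame stacking, here is a minimal sketch in plain Python. It is not how Stable Baselines3 implements `VecFrameStack` (which works on image arrays and handles resets itself); `FrameStacker` is a hypothetical class, strings stand in for frames, and padding with the first frame on reset is a simplifying assumption.

```python
from collections import deque


class FrameStacker:
    """Keep the most recent `n_stack` frames as one stacked observation."""

    def __init__(self, n_stack: int, first_frame):
        # Simplifying assumption: on reset, every slot holds the first frame.
        self.frames = deque([first_frame] * n_stack, maxlen=n_stack)

    def push(self, frame):
        """Add the newest frame; the oldest one is dropped automatically."""
        self.frames.append(frame)
        return list(self.frames)


stacker = FrameStacker(n_stack=4, first_frame="f0")
for name in ["f1", "f2", "f3", "f4", "f5"]:
    stacked = stacker.push(name)

print(stacked)  # the four most recent frames, oldest first
```

With a single frame the agent cannot tell which way a ghost is moving; with four consecutive frames the direction and speed become visible in the observation.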

Policies
- Think of an agent's policy as the rule that tells it how to act in the environment

Stable Baselines3 provides several policy types, including:
- MlpPolicy: Multi-Layer Perceptron policy that implements actor-critic, using an MLP (2 layers of 64)
- CnnPolicy: Convolutional Neural Network policy that implements actor-critic, using a CNN (the Nature CNN)

In [7]:
# train on 32 environments at the same time

env = make_atari_env("MsPacman-v0", n_envs = 32, seed = 0)               # use make_atari_env to create the Atari environment
                                                                          # n_envs: how many environments run in parallel

env = VecFrameStack(env, n_stack = 4)                                    # use VecFrameStack to stack the last 4 frames into one observation

In [8]:
log_path = os.path.join('Training', 'Logs')                               # set log_path to Training/Logs

model = A2C('CnnPolicy', env, verbose = 1, tensorboard_log = log_path)    # use the A2C model and save the TensorBoard logs in log_path
                                                                          # type of policy: CnnPolicy (for image observations)

Using cpu device
Wrapping the env in a VecTransposeImage.


In [11]:
%%time
model.learn(total_timesteps = 1000000)                                    # Train 1M timesteps

Logging to Training/Logs/A2C_3
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 908      |
|    ep_rew_mean        | 691      |
| time/                 |          |
|    fps                | 300      |
|    iterations         | 100      |
|    time_elapsed       | 53       |
|    total_timesteps    | 16000    |
| train/                |          |
|    entropy_loss       | -1.74    |
|    explained_variance | 0.116    |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | -0.195   |
|    value_loss         | 4.02     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 879      |
|    ep_rew_mean        | 679      |
| time/                 |          |
|    fps                | 314      |
|    iterations         | 200      |
|    time_elapsed       | 101      |
|    total_timesteps    | 32000    |
| train

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 895      |
|    ep_rew_mean        | 769      |
| time/                 |          |
|    fps                | 325      |
|    iterations         | 1400     |
|    time_elapsed       | 688      |
|    total_timesteps    | 224000   |
| train/                |          |
|    entropy_loss       | -1.31    |
|    explained_variance | 0.857    |
|    learning_rate      | 0.0007   |
|    n_updates          | 1399     |
|    policy_loss        | -0.635   |
|    value_loss         | 4.09     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 909      |
|    ep_rew_mean        | 820      |
| time/                 |          |
|    fps                | 325      |
|    iterations         | 1500     |
|    time_elapsed       | 737      |
|    total_timesteps    | 240000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 847      |
|    ep_rew_mean        | 714      |
| time/                 |          |
|    fps                | 315      |
|    iterations         | 2800     |
|    time_elapsed       | 1417     |
|    total_timesteps    | 448000   |
| train/                |          |
|    entropy_loss       | -1.21    |
|    explained_variance | 0.875    |
|    learning_rate      | 0.0007   |
|    n_updates          | 2799     |
|    policy_loss        | 0.537    |
|    value_loss         | 4.25     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 853      |
|    ep_rew_mean        | 699      |
| time/                 |          |
|    fps                | 315      |
|    iterations         | 2900     |
|    time_elapsed       | 1470     |
|    total_timesteps    | 464000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 894      |
|    ep_rew_mean        | 837      |
| time/                 |          |
|    fps                | 318      |
|    iterations         | 4200     |
|    time_elapsed       | 2113     |
|    total_timesteps    | 672000   |
| train/                |          |
|    entropy_loss       | -0.516   |
|    explained_variance | 0.949    |
|    learning_rate      | 0.0007   |
|    n_updates          | 4199     |
|    policy_loss        | -0.416   |
|    value_loss         | 5.87     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 904      |
|    ep_rew_mean        | 854      |
| time/                 |          |
|    fps                | 318      |
|    iterations         | 4300     |
|    time_elapsed       | 2162     |
|    total_timesteps    | 688000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 1.02e+03 |
|    ep_rew_mean        | 1.07e+03 |
| time/                 |          |
|    fps                | 319      |
|    iterations         | 5600     |
|    time_elapsed       | 2800     |
|    total_timesteps    | 896000   |
| train/                |          |
|    entropy_loss       | -0.555   |
|    explained_variance | 0.972    |
|    learning_rate      | 0.0007   |
|    n_updates          | 5599     |
|    policy_loss        | -0.118   |
|    value_loss         | 2.77     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 1.04e+03 |
|    ep_rew_mean        | 1.12e+03 |
| time/                 |          |
|    fps                | 320      |
|    iterations         | 5700     |
|    time_elapsed       | 2849     |
|    total_timesteps    | 912000   |
| train/                |          |
|

<stable_baselines3.a2c.a2c.A2C at 0x7fbb00a338e0>

Training for 1M timesteps took 58min 36s, as shown above.
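
The `iterations` and `total_timesteps` rows in the log are linked by simple arithmetic. Here is a small sketch, assuming A2C's default rollout length of 5 steps per environment (`n_steps` is not set in this notebook, so that value is an assumption); the helper names are hypothetical.

```python
N_ENVS = 32   # from make_atari_env(..., n_envs = 32)
N_STEPS = 5   # assumed: A2C's default rollout length per environment


def timesteps_after(iterations: int, n_envs: int = N_ENVS,
                    n_steps: int = N_STEPS) -> int:
    """Total environment steps collected after a number of iterations."""
    return iterations * n_envs * n_steps


def iterations_needed(total_timesteps: int, n_envs: int = N_ENVS,
                      n_steps: int = N_STEPS) -> int:
    """Iterations required to reach at least `total_timesteps`."""
    per_iteration = n_envs * n_steps
    return -(-total_timesteps // per_iteration)  # ceiling division


print(timesteps_after(100))          # compare with the first log block
print(timesteps_after(5700))         # compare with the last log block shown
print(iterations_needed(1_000_000))  # iterations for the full 1M run
```

Under that assumption each iteration collects 160 timesteps, which agrees with the log: 100 iterations correspond to 16000 timesteps.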

Note that ***explained_variance*** reaches 0.97 (very close to 1), which is a good sign: the value function predicts the returns well.
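
For intuition, explained variance is commonly defined as 1 - Var(returns - predictions) / Var(returns). The sketch below implements that definition in plain Python; it is my own illustration with made-up numbers, not the library's code.

```python
from statistics import pvariance


def explained_variance(y_pred, y_true):
    """Fraction of the variance of `y_true` that `y_pred` accounts for.

    1.0 means perfect predictions, 0.0 is no better than predicting a
    constant, and negative values are worse than a constant.
    """
    var_true = pvariance(y_true)
    if var_true == 0:
        return float("nan")  # undefined when the targets never vary
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - pvariance(residuals) / var_true


# Made-up returns and value predictions, for illustration only.
returns = [1.0, 2.0, 3.0, 4.0]
good_predictions = [1.1, 1.9, 3.2, 3.8]
constant_predictions = [2.5, 2.5, 2.5, 2.5]

print(explained_variance(good_predictions, returns))
print(explained_variance(constant_predictions, returns))
```

A value near 1 therefore says the critic's value estimates track the actual returns closely; it does not by itself say the policy is good.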

# 4. Save the Model

In [12]:
a2c_path = os.path.join('Training', 'Saved Models', 'A2C_MsPacman_Model_1Mtimesteps')  # set a2c_path to Training/Saved Models/A2C_MsPacman_Model_1Mtimesteps

model.save(a2c_path)                                                                   # save the trained model



# 5. Evaluate the Model

In [14]:
env = make_atari_env("MsPacman-v0", n_envs = 1, seed = 0)     # recreate the environment with n_envs = 1 for evaluation and testing

env = VecFrameStack(env, n_stack = 4)                         # n_stack: number of frames stacked together

In [15]:
evaluate_policy(model, env, n_eval_episodes = 10, render = True)  # use evaluate_policy to evaluate the trained model
                                                                   # n_eval_episodes: number of evaluation episodes
                                                                   # returns (mean reward, standard deviation of reward)



(1014.0, 202.59318843435975)
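
The tuple above is the mean episode reward followed by its standard deviation over the evaluation episodes. As a rough sketch of how such a pair is obtained from per-episode rewards (the rewards below are hypothetical, and `mean_and_std` is my own helper, not a library function):

```python
from statistics import fmean, pstdev


def mean_and_std(episode_rewards):
    """Return the mean and population standard deviation of the rewards."""
    return fmean(episode_rewards), pstdev(episode_rewards)


# Hypothetical per-episode rewards, for illustration only.
rewards = [800.0, 1000.0, 1200.0]

mean_reward, std_reward = mean_and_std(rewards)
print(f"{mean_reward:.1f} +/- {std_reward:.1f}")
```

With only 10 evaluation episodes, a standard deviation of about 200 means the mean score is still a fairly noisy estimate.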

# 6. Viewing Logs in Tensorboard

### Two Core Evaluation Metrics we should pay attention to: 

***ep_len_mean:*** how long an episode lasted on average, in steps, before game over

***ep_rew_mean:*** the average reward that the agent accumulated per episode
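
Both metrics are running averages over recent episodes rather than over the whole run. Here is a minimal sketch of that idea, assuming a sliding window of the most recent 100 episodes (my understanding of the Stable Baselines3 default; treat the window size as an assumption). `EpisodeStats` is a hypothetical class.

```python
from collections import deque
from statistics import fmean


class EpisodeStats:
    """Track mean episode length and reward over a sliding window."""

    def __init__(self, window: int = 100):
        # deque(maxlen=...) silently discards the oldest entry when full.
        self.lengths = deque(maxlen=window)
        self.rewards = deque(maxlen=window)

    def record(self, length: int, reward: float) -> None:
        """Store the length and total reward of one finished episode."""
        self.lengths.append(length)
        self.rewards.append(reward)

    def means(self):
        """Return (ep_len_mean, ep_rew_mean) over the current window."""
        return fmean(self.lengths), fmean(self.rewards)


stats = EpisodeStats(window=2)
for length, reward in [(10, 1.0), (20, 3.0), (30, 5.0)]:
    stats.record(length, reward)

ep_len_mean, ep_rew_mean = stats.means()
print(ep_len_mean, ep_rew_mean)  # only the last two episodes count
```

Because of the window, both curves react to policy improvements with some delay and look smoother than the raw per-episode values.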

In [16]:
training_log_path = os.path.join(log_path, 'A2C_3')

!tensorboard --logdir={training_log_path}


NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.6.0 at http://localhost:6006/ (Press CTRL+C to quit)
^C


<img src='image/A2C_MsPacman_Model_1Mtimesteps.png' />

The results above show that:
- Starting from around step 700k, the agent's performance improved significantly
- The average episode length rose from about 900 steps at 700k timesteps to more than 1.1e+3 steps at 1M timesteps
- The average score rose from about 850 at 700k timesteps to more than 1.1e+3 at 1M timesteps

# 7.  Further Improvement

Our target is to maximize our score in this Atari game, MsPacman.

### Let's try to improve the agent's performance with the following action:

- Train more steps

## Train more steps 

In [9]:
%%time
# Increase total_timesteps to 3M (three times the first run)
model.learn(total_timesteps = 3000000)

Logging to Training/Logs/A2C_6
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 911      |
|    ep_rew_mean        | 665      |
| time/                 |          |
|    fps                | 285      |
|    iterations         | 100      |
|    time_elapsed       | 56       |
|    total_timesteps    | 16000    |
| train/                |          |
|    entropy_loss       | -1.73    |
|    explained_variance | 0.813    |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 0.636    |
|    value_loss         | 1.4      |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 918      |
|    ep_rew_mean        | 625      |
| time/                 |          |
|    fps                | 293      |
|    iterations         | 200      |
|    time_elapsed       | 108      |
|    total_timesteps    | 32000    |
| train

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 884      |
|    ep_rew_mean        | 741      |
| time/                 |          |
|    fps                | 296      |
|    iterations         | 1400     |
|    time_elapsed       | 754      |
|    total_timesteps    | 224000   |
| train/                |          |
|    entropy_loss       | -0.741   |
|    explained_variance | 0.854    |
|    learning_rate      | 0.0007   |
|    n_updates          | 1399     |
|    policy_loss        | -0.944   |
|    value_loss         | 6.23     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 850      |
|    ep_rew_mean        | 687      |
| time/                 |          |
|    fps                | 296      |
|    iterations         | 1500     |
|    time_elapsed       | 809      |
|    total_timesteps    | 240000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 901      |
|    ep_rew_mean        | 798      |
| time/                 |          |
|    fps                | 296      |
|    iterations         | 2800     |
|    time_elapsed       | 1508     |
|    total_timesteps    | 448000   |
| train/                |          |
|    entropy_loss       | -1.34    |
|    explained_variance | 0.93     |
|    learning_rate      | 0.0007   |
|    n_updates          | 2799     |
|    policy_loss        | 0.892    |
|    value_loss         | 3.58     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 878      |
|    ep_rew_mean        | 812      |
| time/                 |          |
|    fps                | 296      |
|    iterations         | 2900     |
|    time_elapsed       | 1563     |
|    total_timesteps    | 464000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 859      |
|    ep_rew_mean        | 808      |
| time/                 |          |
|    fps                | 296      |
|    iterations         | 4200     |
|    time_elapsed       | 2268     |
|    total_timesteps    | 672000   |
| train/                |          |
|    entropy_loss       | -0.628   |
|    explained_variance | 0.9      |
|    learning_rate      | 0.0007   |
|    n_updates          | 4199     |
|    policy_loss        | -0.00261 |
|    value_loss         | 5.25     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 921      |
|    ep_rew_mean        | 864      |
| time/                 |          |
|    fps                | 296      |
|    iterations         | 4300     |
|    time_elapsed       | 2322     |
|    total_timesteps    | 688000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 894      |
|    ep_rew_mean        | 918      |
| time/                 |          |
|    fps                | 298      |
|    iterations         | 5600     |
|    time_elapsed       | 2999     |
|    total_timesteps    | 896000   |
| train/                |          |
|    entropy_loss       | -0.896   |
|    explained_variance | 0.868    |
|    learning_rate      | 0.0007   |
|    n_updates          | 5599     |
|    policy_loss        | 0.308    |
|    value_loss         | 4.75     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 897      |
|    ep_rew_mean        | 908      |
| time/                 |          |
|    fps                | 299      |
|    iterations         | 5700     |
|    time_elapsed       | 3050     |
|    total_timesteps    | 912000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 971      |
|    ep_rew_mean        | 1.03e+03 |
| time/                 |          |
|    fps                | 302      |
|    iterations         | 7000     |
|    time_elapsed       | 3708     |
|    total_timesteps    | 1120000  |
| train/                |          |
|    entropy_loss       | -0.522   |
|    explained_variance | 0.966    |
|    learning_rate      | 0.0007   |
|    n_updates          | 6999     |
|    policy_loss        | 0.0778   |
|    value_loss         | 3.33     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 997      |
|    ep_rew_mean        | 1.11e+03 |
| time/                 |          |
|    fps                | 302      |
|    iterations         | 7100     |
|    time_elapsed       | 3758     |
|    total_timesteps    | 1136000  |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 1.04e+03 |
|    ep_rew_mean        | 1.36e+03 |
| time/                 |          |
|    fps                | 304      |
|    iterations         | 8400     |
|    time_elapsed       | 4413     |
|    total_timesteps    | 1344000  |
| train/                |          |
|    entropy_loss       | -0.573   |
|    explained_variance | 0.981    |
|    learning_rate      | 0.0007   |
|    n_updates          | 8399     |
|    policy_loss        | -0.194   |
|    value_loss         | 3.18     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 1.11e+03 |
|    ep_rew_mean        | 1.45e+03 |
| time/                 |          |
|    fps                | 304      |
|    iterations         | 8500     |
|    time_elapsed       | 4463     |
|    total_timesteps    | 1360000  |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 1.08e+03 |
|    ep_rew_mean        | 1.2e+03  |
| time/                 |          |
|    fps                | 306      |
|    iterations         | 9800     |
|    time_elapsed       | 5116     |
|    total_timesteps    | 1568000  |
| train/                |          |
|    entropy_loss       | -0.403   |
|    explained_variance | 0.995    |
|    learning_rate      | 0.0007   |
|    n_updates          | 9799     |
|    policy_loss        | 0.0422   |
|    value_loss         | 1.22     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 1.08e+03 |
|    ep_rew_mean        | 1.22e+03 |
| time/                 |          |
|    fps                | 306      |
|    iterations         | 9900     |
|    time_elapsed       | 5166     |
|    total_timesteps    | 1584000  |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 1.07e+03 |
|    ep_rew_mean        | 1.29e+03 |
| time/                 |          |
|    fps                | 308      |
|    iterations         | 11200    |
|    time_elapsed       | 5817     |
|    total_timesteps    | 1792000  |
| train/                |          |
|    entropy_loss       | -0.531   |
|    explained_variance | 0.992    |
|    learning_rate      | 0.0007   |
|    n_updates          | 11199    |
|    policy_loss        | -0.228   |
|    value_loss         | 1.53     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 1.07e+03 |
|    ep_rew_mean        | 1.22e+03 |
| time/                 |          |
|    fps                | 308      |
|    iterations         | 11300    |
|    time_elapsed       | 5868     |
|    total_timesteps    | 1808000  |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 1.07e+03 |
|    ep_rew_mean        | 1.43e+03 |
| time/                 |          |
|    fps                | 309      |
|    iterations         | 12600    |
|    time_elapsed       | 6521     |
|    total_timesteps    | 2016000  |
| train/                |          |
|    entropy_loss       | -0.328   |
|    explained_variance | 0.972    |
|    learning_rate      | 0.0007   |
|    n_updates          | 12599    |
|    policy_loss        | 0.0383   |
|    value_loss         | 3.76     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 1.09e+03 |
|    ep_rew_mean        | 1.54e+03 |
| time/                 |          |
|    fps                | 309      |
|    iterations         | 12700    |
|    time_elapsed       | 6571     |
|    total_timesteps    | 2032000  |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 1.31e+03 |
|    ep_rew_mean        | 1.77e+03 |
| time/                 |          |
|    fps                | 310      |
|    iterations         | 14000    |
|    time_elapsed       | 7223     |
|    total_timesteps    | 2240000  |
| train/                |          |
|    entropy_loss       | -0.376   |
|    explained_variance | 0.996    |
|    learning_rate      | 0.0007   |
|    n_updates          | 13999    |
|    policy_loss        | -0.101   |
|    value_loss         | 2.28     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 1.36e+03 |
|    ep_rew_mean        | 1.81e+03 |
| time/                 |          |
|    fps                | 310      |
|    iterations         | 14100    |
|    time_elapsed       | 7273     |
|    total_timesteps    | 2256000  |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 1.37e+03 |
|    ep_rew_mean        | 1.9e+03  |
| time/                 |          |
|    fps                | 310      |
|    iterations         | 15400    |
|    time_elapsed       | 7930     |
|    total_timesteps    | 2464000  |
| train/                |          |
|    entropy_loss       | -0.218   |
|    explained_variance | 0.991    |
|    learning_rate      | 0.0007   |
|    n_updates          | 15399    |
|    policy_loss        | -0.164   |
|    value_loss         | 4.53     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 1.36e+03 |
|    ep_rew_mean        | 1.88e+03 |
| time/                 |          |
|    fps                | 310      |
|    iterations         | 15500    |
|    time_elapsed       | 7981     |
|    total_timesteps    | 2480000  |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 1.39e+03 |
|    ep_rew_mean        | 1.91e+03 |
| time/                 |          |
|    fps                | 311      |
|    iterations         | 16800    |
|    time_elapsed       | 8635     |
|    total_timesteps    | 2688000  |
| train/                |          |
|    entropy_loss       | -0.226   |
|    explained_variance | 0.996    |
|    learning_rate      | 0.0007   |
|    n_updates          | 16799    |
|    policy_loss        | 0.00636  |
|    value_loss         | 1.33     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 1.33e+03 |
|    ep_rew_mean        | 1.91e+03 |
| time/                 |          |
|    fps                | 311      |
|    iterations         | 16900    |
|    time_elapsed       | 8686     |
|    total_timesteps    | 2704000  |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 1.43e+03 |
|    ep_rew_mean        | 2.1e+03  |
| time/                 |          |
|    fps                | 311      |
|    iterations         | 18200    |
|    time_elapsed       | 9363     |
|    total_timesteps    | 2912000  |
| train/                |          |
|    entropy_loss       | -0.304   |
|    explained_variance | 0.994    |
|    learning_rate      | 0.0007   |
|    n_updates          | 18199    |
|    policy_loss        | 0.0315   |
|    value_loss         | 1.5      |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 1.43e+03 |
|    ep_rew_mean        | 1.96e+03 |
| time/                 |          |
|    fps                | 311      |
|    iterations         | 18300    |
|    time_elapsed       | 9412     |
|    total_timesteps    | 2928000  |
| train/                |          |
|

<stable_baselines3.a2c.a2c.A2C at 0x7f8debb96610>

Training for 3 million timesteps took 3h 1min 18s, as shown above.

In [10]:
# Save the 3M timesteps model 
a2c_path = os.path.join('Training', 'Saved Models', 'A2C_MsPacman_Model_3Mtimesteps')  # set a2c_path to Training/Saved Models/A2C_MsPacman_Model_3Mtimesteps

model.save(a2c_path)                                                                   # save the trained model

In [11]:
# See the 3M timesteps model performance
training_log_path = os.path.join(log_path, 'A2C_6')

!tensorboard --logdir={training_log_path}


NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.6.0 at http://localhost:6006/ (Press CTRL+C to quit)
^C


In [12]:
env = make_atari_env("MsPacman-v0", n_envs = 1, seed = 0)     # recreate the environment with n_envs = 1 for evaluation and testing

env = VecFrameStack(env, n_stack = 4)                         # n_stack: number of frames stacked together

evaluate_policy(model, env, n_eval_episodes = 10, render = True)  # returns (mean reward, standard deviation of reward)



(1993.0, 224.18965185752887)

# 8. Final Performance of the Agent trained 3 Million timesteps

<img src='image/A2C_MsPacman_Model_3Mtimesteps.png'/>

The results above show that:
- The average episode length rose from about 1,000 steps at 1M timesteps to about 1.4e+3 steps near 3M timesteps (per the training log in Section 7)
- The average score rose from about 1,000 at 1M timesteps to about 2.0e+3 at 3M timesteps

***End of the Page***