# Executive Summary

I'm using the ViZDoom environment to practice training an agent with reinforcement learning. In this notebook, ***the goal of the Basic environment is to kill the monster as quickly as possible***.

### Here are my key findings during the process:

***Found Project Baseline by Using Random Action*** (Section 3)
- ***Taking random actions yields episode rewards between -365 and 66 (average -121)***.

***Using stable_baselines3 PPO model to train the Agent*** (Section 4)
- The ***mean reward reaches its peak level of around 80 at about 150k timesteps*** (about 1 hr 18 min of training).
- Compared with the average reward of -121 from random actions before learning, ***the agent performs significantly better. It kills the monster very quickly***.

### ***Table of Contents:***
1. Configs of Doom Environment
2. Converting it to a Gym Environment
3. Found Project Baseline by Using Random Action
4. Using stable_baselines3 PPO model to train the Agent 
5. Load and Test the Best Saved Agent

# 1. Configs of Doom Environment 

<img src='image/VIZDOOM_Screen.png'/>

- ***Scenarios:*** Basic.cfg
- ***Episode Start Time:*** 14
- ***Episode Timeout:*** 300
- ***Available Actions:*** 
  - Move_Left  [1,0,0]
  - Move_Right [0,1,0]
  - Attack     [0,0,1]
- ***Available Game Variables:*** AMMO2
- ***Mode:*** Player
- ***Doom Skill:*** 5
- ***Living Reward:*** -1

##### Rendering options
- screen_resolution = RES_320X240
- screen_format = CRCGCB
- render_hud = True
- render_crosshair = false
- render_weapon = true
- render_decals = false
- render_particles = false
- window_visible = true

In [1]:
from vizdoom import *          # Import vizdoom for game env

import random                  # Import random for action sampling

import time                    # Import time for sleeping

import numpy as np             # Import numpy for identity matrix

In [2]:
# Setup game
game = DoomGame()

game.load_config('VizDoom/scenarios/basic.cfg')

game.init()

In [3]:
game.get_state().screen_buffer.shape                       # screen buffer shape: (channels, height, width) = 3 RGB channels x 240 x 320

(3, 240, 320)

In [4]:
# This is the set of actions we can take in the environment
actions = np.identity(3, dtype = np.uint8)                 # actions: Move Left[1,0,0], Move Right[0,1,0], Attack[0,0,1]

random.choice(actions)

array([1, 0, 0], dtype=uint8)
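
The identity matrix above is just a compact way to build one-hot button vectors, one row per action. Here is a minimal sketch of the same mapping in plain Python, assuming the button order Move_Left, Move_Right, Attack listed in Section 1. The helper name `one_hot_action` is hypothetical and is not part of ViZDoom.

```python
# Hypothetical helper (not part of ViZDoom): build the one-hot button
# vector for a discrete action index, mirroring one row of np.identity(3).
BUTTONS = ["Move_Left", "Move_Right", "Attack"]   # order taken from Section 1

def one_hot_action(index, n_buttons=len(BUTTONS)):
    # Reject indices that do not correspond to an available button
    if not 0 <= index < n_buttons:
        raise ValueError(f"action index {index} out of range")
    return [1 if i == index else 0 for i in range(n_buttons)]

# Print the button vector for every available action
for i, name in enumerate(BUTTONS):
    print(name, one_hot_action(i))
```

This is the same index-to-vector lookup that the Gym wrapper in Section 2 performs with `actions[action]`.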

# 2. Converting it to a Gym Environment

In [16]:
!pip install opencv-python

Collecting opencv-python
  Downloading opencv_python-4.5.5.64-cp36-abi3-macosx_10_15_x86_64.whl (46.3 MB)
Installing collected packages: opencv-python
Successfully installed opencv-python-4.5.5.64


In [5]:
# Import Dependencies
from gym import Env                      # Import environment base class from OpenAI Gym

from gym.spaces import Discrete, Box     # Import gym spaces

import cv2                               # Import opencv

import numpy as np                       # Import numpy

In [8]:
class VizDoomGym(Env): 
    # Function that is called when we start the env
    def __init__(self, render=False): 
        
        # Inherit from Env
        super().__init__()
        
        # Setup the game 
        self.game = DoomGame()
        self.game.load_config('VizDoom/scenarios/basic.cfg')
        
        # Show the game window only when rendering is requested
        self.game.set_window_visible(render)
        
        # Start the game 
        self.game.init()
        
        # Create the action space and observation space
        self.observation_space = Box(low=0, high=255, shape=(100,160,1), dtype=np.uint8) 
        self.action_space = Discrete(3)
        
    # This is how we take a step in the environment
    def step(self, action):
        
        # Specify action and take step 
        actions = np.identity(3)
        reward = self.game.make_action(actions[action], 4) 
        
        # Get all the other stuff we need to return
        game_state = self.game.get_state()
        if game_state:
            state = self.grayscale(game_state.screen_buffer)
            info = game_state.game_variables[0]              # remaining ammo (AMMO2)
        else:
            # Episode finished: no state is available, so return a blank frame
            state = np.zeros(self.observation_space.shape, dtype=np.uint8)
            info = 0
        
        info = {"info":info}
        done = self.game.is_episode_finished()
        
        return state, reward, done, info 
    
    # Define how to render the game or environment 
    def render(self):
        pass
    
    # What happens when we start a new game 
    def reset(self): 
        self.game.new_episode()
        state = self.game.get_state().screen_buffer
        return self.grayscale(state)
    
    # Grayscale the game frame and resize it 
    def grayscale(self, observation):
        gray = cv2.cvtColor(np.moveaxis(observation, 0, -1), cv2.COLOR_BGR2GRAY)
        resize = cv2.resize(gray, (160,100), interpolation=cv2.INTER_CUBIC)
        state = np.reshape(resize, (100,160,1))
        return state
    
    # Call to close down the game
    def close(self): 
        self.game.close()

In [9]:
env = VizDoomGym(render = True)

In [10]:
env.step(0)

(array([[[88],
         [75],
         [91],
         ...,
         [75],
         [89],
         [76]],
 
        [[53],
         [53],
         [50],
         ...,
         [45],
         [54],
         [53]],
 
        [[26],
         [26],
         [26],
         ...,
         [25],
         [35],
         [51]],
 
        ...,
 
        [[75],
         [63],
         [62],
         ...,
         [44],
         [71],
         [60]],
 
        [[15],
         [48],
         [47],
         ...,
         [49],
         [69],
         [47]],
 
        [[22],
         [14],
         [26],
         ...,
         [57],
         [37],
         [39]]], dtype=uint8),
 -4.0,
 False,
 {'info': 50.0})

In [11]:
env.close()

### Check the Environment

In [32]:
!pip install torch

Collecting torch
  Downloading torch-1.11.0-cp39-none-macosx_10_9_x86_64.whl (129.9 MB)
Collecting typing-extensions
  Downloading typing_extensions-4.1.1-py3-none-any.whl (26 kB)
Installing collected packages: typing-extensions, torch
Successfully installed torch-1.11.0 typing-extensions-4.1.1


In [33]:
!pip install stable_baselines3

Collecting stable_baselines3
  Using cached stable_baselines3-1.5.0-py3-none-any.whl (177 kB)
Collecting pandas
  Downloading pandas-1.4.2-cp39-cp39-macosx_10_9_x86_64.whl (11.1 MB)
Collecting matplotlib
  Downloading matplotlib-3.5.1-cp39-cp39-macosx_10_9_x86_64.whl (7.3 MB)
Collecting gym==0.21
  Using cached gym-0.21.0.tar.gz (1.5 MB)
Collecting pillow>=6.2.0
  Downloading Pillow-9.1.0-cp39-cp39-macosx_10_9_x86_64.whl (3.1 MB)
Collecting cycler>=0.10
  Downloading cycler-0.11.0-py3-none-any.whl (6.4 kB)
Collecting kiwisolver>=1.0.1
  Downloading kiwisolver-1.4.2-cp39-cp39-macosx_10_9_x86_64.whl (65 kB)
Collecting fonttools>=4.22.0
  Downloading fonttools-4.31.2-py3-none-any.whl (899 kB)

# 3. Found Project Baseline by Using Random Action

In [18]:
episodes = 10                                                 # number of games to play (the loop below runs episodes + 1 = 11 games)

for episode in range(episodes+1):                             # loop over each game
    
    game.new_episode()                                        # reset the game back to its initial state
    
    while not game.is_episode_finished():                     # keep going until the game is done
        
        state = game.get_state()                              # get the current game state
        
        img = state.screen_buffer                             # screen image of the current state
        
        info = state.game_variables                           # game variables: here, the remaining ammo
        
        reward = game.make_action(random.choice(actions), 4)  # take a random action; 4 = repeat it for 4 frames before reading the reward (gives the shot time to reach the target)
        
        print('reward', reward)                               # print out reward received from last action
        
        time.sleep(0.2)                                    
        
    print('Result:', game.get_total_reward())                 # print out current total reward
    
    time.sleep(2)

reward -4.0
reward -4.0
reward -9.0
reward -4.0
reward -4.0
reward -9.0
reward -4.0
reward -4.0
reward -4.0
reward -4.0
reward -4.0
reward -4.0
reward -9.0
reward -4.0
reward -4.0
reward -4.0
reward -9.0
reward -4.0
reward -4.0
reward -4.0
reward -4.0
reward -9.0
reward -4.0
reward -4.0
reward -4.0
reward -9.0
reward -4.0
reward -4.0
reward -4.0
reward -4.0
reward -4.0
reward -4.0
reward -9.0
reward -4.0
reward -4.0
reward -9.0
reward -4.0
reward -4.0
reward -4.0
reward -4.0
reward -4.0
reward -9.0
reward -4.0
reward -4.0
reward -4.0
reward -9.0
reward -4.0
reward -4.0
reward -4.0
reward -4.0
reward 97.0
Result: -153.0
reward -4.0
reward -4.0
reward -9.0
reward -4.0
reward -4.0
reward -4.0
reward -4.0
reward 99.0
Result: 66.0
reward -4.0
reward -4.0
reward -4.0
reward -4.0
reward -4.0
reward -9.0
reward -4.0
reward -4.0
reward -4.0
reward -4.0
reward -9.0
reward -4.0
reward -4.0
reward -4.0
reward -4.0
reward -4.0
reward -9.0
reward -4.0
reward -4.0
reward -9.0
reward -4.0
reward -4.0


In [19]:
game.close()

Remark:
- The 11 episodes above show that ***random actions yield episode rewards between -365 and 66 (average -121)***.
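
The per-step values printed above can be reproduced with simple arithmetic. The sketch below assumes the living reward of -1 per tic from Section 1 and the frame skip of 4 passed to `make_action`. The shot penalty of -5 is my assumption, inferred from the -9 entries in the log (it agrees with the Basic scenario description in the ViZDoom documentation as far as I know). The function name `step_reward` is hypothetical.

```python
LIVING_REWARD = -1    # per tic, from the config in Section 1
FRAME_SKIP = 4        # tics per action, the second argument of make_action
SHOT_PENALTY = -5     # assumption: inferred from the -9 entries in the log

def step_reward(fired, tics=FRAME_SKIP):
    """Reward for one non-killing action that lasts `tics` tics."""
    reward = LIVING_REWARD * tics
    if fired:
        reward += SHOT_PENALTY
    return reward

print(step_reward(fired=False))   # moving left or right
print(step_reward(fired=True))    # attacking without a kill

# Second episode in the log above: seven non-killing steps, then the kill step
episode_2 = [-4, -4, -9, -4, -4, -4, -4, 99]
print(sum(episode_2))             # equals the printed "Result: 66.0"
```

Because every wasted step costs at least 4 points, the total reward mostly measures how quickly the monster is killed, which is why random play scores so low on average.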

# 4. Using stable_baselines3 PPO model to train the Agent 

### Setup Callback

In [29]:
# Import Dependencies
import os                                                      # Import os for file nav

from stable_baselines3.common.callbacks import BaseCallback    # Import Base Callback for saving models

from stable_baselines3 import PPO                              # Import PPO for training

In [24]:
CHECKPOINT_DIR = './train/DOOM'                                # Save the trained model into train directory

LOG_DIR = './logs/DOOM'                                        # Save the log into log directory

In [25]:
class TrainAndLoggingCallback(BaseCallback):

    def __init__(self, check_freq, save_path, verbose=1):
        
        super(TrainAndLoggingCallback, self).__init__(verbose)
        
        self.check_freq = check_freq
        
        self.save_path = save_path

    def _init_callback(self):
        
        if self.save_path is not None:
            
            os.makedirs(self.save_path, exist_ok=True)

    def _on_step(self):
        
        if self.n_calls % self.check_freq == 0:
            
            model_path = os.path.join(self.save_path, 'best_model_{}'.format(self.n_calls))
            
            self.model.save(model_path)

        return True

In [26]:
# Setup model saving callback
callback = TrainAndLoggingCallback(check_freq = 500000, save_path=CHECKPOINT_DIR)   # save the model every 500,000 steps
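
To see which checkpoint files this callback would write, here is a small sketch of the `n_calls % check_freq == 0` rule from `_on_step`. It assumes a single environment (so `n_calls` equals the number of timesteps) and a run of 1,000,000 steps, the value used for training in this section. `checkpoint_steps` is a hypothetical helper, not part of stable_baselines3.

```python
import os

CHECKPOINT_DIR = './train/DOOM'   # same directory as in the notebook

def checkpoint_steps(check_freq, total_steps):
    """Step counts n in 1..total_steps where n % check_freq == 0."""
    return list(range(check_freq, total_steps + 1, check_freq))

# Hypothetical run length of 1,000,000 steps with a save every 500,000
for n in checkpoint_steps(500000, 1000000):
    print(os.path.join(CHECKPOINT_DIR, 'best_model_{}'.format(n)))
```

Note that, despite the `best_model_` prefix, these files are periodic snapshots: the callback never compares rewards, so the latest file is not guaranteed to be the best-performing one.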

In [30]:
# Non rendered environment
env = VizDoomGym()

PPO_model = PPO('CnnPolicy',
            env,
            tensorboard_log = LOG_DIR, 
            verbose = 1, 
            learning_rate = 0.0001,
            n_steps = 4096                  # 4096 here means that 4096 sets of observations, 
                                            # actions, log probabilities & values will be stored in the buffer for one iteration 
           )

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Wrapping the env in a VecTransposeImage.


In [None]:
PPO_model.learn(total_timesteps=1000000,    # 1 million 
            callback=callback
           )

Logging to ./logs/DOOM/PPO_1
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 47.1     |
|    ep_rew_mean     | -175     |
| time/              |          |
|    fps             | 242      |
|    iterations      | 1        |
|    time_elapsed    | 16       |
|    total_timesteps | 4096     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 46.3        |
|    ep_rew_mean          | -170        |
| time/                   |             |
|    fps                  | 55          |
|    iterations           | 2           |
|    time_elapsed         | 148         |
|    total_timesteps      | 8192        |
| train/                  |             |
|    approx_kl            | 0.008129916 |
|    clip_fraction        | 0.24        |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.08       |
|    explained_variance   | 5.14e-05    |
|

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 40.4        |
|    ep_rew_mean          | -135        |
| time/                   |             |
|    fps                  | 34          |
|    iterations           | 11          |
|    time_elapsed         | 1318        |
|    total_timesteps      | 45056       |
| train/                  |             |
|    approx_kl            | 0.009580939 |
|    clip_fraction        | 0.159       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.78       |
|    explained_variance   | 0.821       |
|    learning_rate        | 0.0001      |
|    loss                 | 768         |
|    n_updates            | 100         |
|    policy_gradient_loss | 0.0031      |
|    value_loss           | 1.7e+03     |
-----------------------------------------
----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 37.9    

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 11.3        |
|    ep_rew_mean          | 50          |
| time/                   |             |
|    fps                  | 32          |
|    iterations           | 21          |
|    time_elapsed         | 2624        |
|    total_timesteps      | 86016       |
| train/                  |             |
|    approx_kl            | 0.017433833 |
|    clip_fraction        | 0.2         |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.752      |
|    explained_variance   | 0.84        |
|    learning_rate        | 0.0001      |
|    loss                 | 765         |
|    n_updates            | 200         |
|    policy_gradient_loss | -0.0114     |
|    value_loss           | 1.36e+03    |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 7.97  

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 5.9         |
|    ep_rew_mean          | 79          |
| time/                   |             |
|    fps                  | 32          |
|    iterations           | 31          |
|    time_elapsed         | 3944        |
|    total_timesteps      | 126976      |
| train/                  |             |
|    approx_kl            | 0.020463418 |
|    clip_fraction        | 0.0702      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.151      |
|    explained_variance   | 0.762       |
|    learning_rate        | 0.0001      |
|    loss                 | 31.8        |
|    n_updates            | 300         |
|    policy_gradient_loss | 0.00806     |
|    value_loss           | 34          |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 5.72  

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 5.54        |
|    ep_rew_mean          | 80.6        |
| time/                   |             |
|    fps                  | 31          |
|    iterations           | 41          |
|    time_elapsed         | 5269        |
|    total_timesteps      | 167936      |
| train/                  |             |
|    approx_kl            | 0.025296211 |
|    clip_fraction        | 0.0444      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.0482     |
|    explained_variance   | 0.783       |
|    learning_rate        | 0.0001      |
|    loss                 | 10.6        |
|    n_updates            | 400         |
|    policy_gradient_loss | 0.0062      |
|    value_loss           | 32.6        |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 5.68  

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 6.01         |
|    ep_rew_mean          | 78.6         |
| time/                   |              |
|    fps                  | 31           |
|    iterations           | 51           |
|    time_elapsed         | 6591         |
|    total_timesteps      | 208896       |
| train/                  |              |
|    approx_kl            | 0.0025788657 |
|    clip_fraction        | 0.0129       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.0316      |
|    explained_variance   | 0.928        |
|    learning_rate        | 0.0001       |
|    loss                 | 5.85         |
|    n_updates            | 500          |
|    policy_gradient_loss | 0.00497      |
|    value_loss           | 8.95         |
------------------------------------------
----------------------------------------
| rollout/                |            |
|    ep_len_mea

----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 5.54       |
|    ep_rew_mean          | 80.8       |
| time/                   |            |
|    fps                  | 31         |
|    iterations           | 61         |
|    time_elapsed         | 7906       |
|    total_timesteps      | 249856     |
| train/                  |            |
|    approx_kl            | 0.01806085 |
|    clip_fraction        | 0.0305     |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.0316    |
|    explained_variance   | 0.904      |
|    learning_rate        | 0.0001     |
|    loss                 | 4.63       |
|    n_updates            | 600        |
|    policy_gradient_loss | 0.00637    |
|    value_loss           | 12         |
----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 5.97        |
|    ep_rew_m

----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 6.19       |
|    ep_rew_mean          | 78.1       |
| time/                   |            |
|    fps                  | 31         |
|    iterations           | 71         |
|    time_elapsed         | 9226       |
|    total_timesteps      | 290816     |
| train/                  |            |
|    approx_kl            | 0.11155401 |
|    clip_fraction        | 0.0758     |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.0259    |
|    explained_variance   | 0.876      |
|    learning_rate        | 0.0001     |
|    loss                 | 2          |
|    n_updates            | 700        |
|    policy_gradient_loss | -0.0135    |
|    value_loss           | 13.7       |
----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 6.2         |
|    ep_rew_m

----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 6.22       |
|    ep_rew_mean          | 78.1       |
| time/                   |            |
|    fps                  | 31         |
|    iterations           | 81         |
|    time_elapsed         | 10553      |
|    total_timesteps      | 331776     |
| train/                  |            |
|    approx_kl            | 0.15766165 |
|    clip_fraction        | 0.175      |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.114     |
|    explained_variance   | 0.664      |
|    learning_rate        | 0.0001     |
|    loss                 | 10.2       |
|    n_updates            | 800        |
|    policy_gradient_loss | -0.00627   |
|    value_loss           | 43.1       |
----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 5.91        |
|    ep_rew_m

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 5.61        |
|    ep_rew_mean          | 80.4        |
| time/                   |             |
|    fps                  | 31          |
|    iterations           | 91          |
|    time_elapsed         | 11852       |
|    total_timesteps      | 372736      |
| train/                  |             |
|    approx_kl            | 0.037570946 |
|    clip_fraction        | 0.0194      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.0118     |
|    explained_variance   | 0.899       |
|    learning_rate        | 0.0001      |
|    loss                 | 0.992       |
|    n_updates            | 900         |
|    policy_gradient_loss | -0.000745   |
|    value_loss           | 11.3        |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 8.92  

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 27.9         |
|    ep_rew_mean          | -42.9        |
| time/                   |              |
|    fps                  | 31           |
|    iterations           | 101          |
|    time_elapsed         | 13139        |
|    total_timesteps      | 413696       |
| train/                  |              |
|    approx_kl            | 0.0047850814 |
|    clip_fraction        | 0.0171       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.0185      |
|    explained_variance   | 0.711        |
|    learning_rate        | 0.0001       |
|    loss                 | 268          |
|    n_updates            | 1000         |
|    policy_gradient_loss | 0.00292      |
|    value_loss           | 747          |
------------------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 11.4        |
|    ep_rew_mean          | 48.8        |
| time/                   |             |
|    fps                  | 31          |
|    iterations           | 111         |
|    time_elapsed         | 14415       |
|    total_timesteps      | 454656      |
| train/                  |             |
|    approx_kl            | 0.008378582 |
|    clip_fraction        | 0.0127      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.0116     |
|    explained_variance   | 0.906       |
|    learning_rate        | 0.0001      |
|    loss                 | 582         |
|    n_updates            | 1100        |
|    policy_gradient_loss | -0.000653   |
|    value_loss           | 1.24e+03    |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 11.6  

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 5.52        |
|    ep_rew_mean          | 80.8        |
| time/                   |             |
|    fps                  | 31          |
|    iterations           | 121         |
|    time_elapsed         | 15713       |
|    total_timesteps      | 495616      |
| train/                  |             |
|    approx_kl            | 0.045650378 |
|    clip_fraction        | 0.0316      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.0169     |
|    explained_variance   | 0.926       |
|    learning_rate        | 0.0001      |
|    loss                 | 2.57        |
|    n_updates            | 1200        |
|    policy_gradient_loss | -0.00407    |
|    value_loss           | 8.13        |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 5.94  

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 5.8         |
|    ep_rew_mean          | 79.7        |
| time/                   |             |
|    fps                  | 31          |
|    iterations           | 131         |
|    time_elapsed         | 17015       |
|    total_timesteps      | 536576      |
| train/                  |             |
|    approx_kl            | 0.049305268 |
|    clip_fraction        | 0.0269      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.011      |
|    explained_variance   | 0.889       |
|    learning_rate        | 0.0001      |
|    loss                 | 14.6        |
|    n_updates            | 1300        |
|    policy_gradient_loss | -0.00588    |
|    value_loss           | 13.7        |
-----------------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 5.59

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 5.61        |
|    ep_rew_mean          | 80.6        |
| time/                   |             |
|    fps                  | 31          |
|    iterations           | 141         |
|    time_elapsed         | 18320       |
|    total_timesteps      | 577536      |
| train/                  |             |
|    approx_kl            | 0.013951454 |
|    clip_fraction        | 0.0189      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.0154     |
|    explained_variance   | 0.986       |
|    learning_rate        | 0.0001      |
|    loss                 | 0.31        |
|    n_updates            | 1400        |
|    policy_gradient_loss | -0.00808    |
|    value_loss           | 1.29        |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 5.77  

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 5.82        |
|    ep_rew_mean          | 79.6        |
| time/                   |             |
|    fps                  | 31          |
|    iterations           | 151         |
|    time_elapsed         | 19612       |
|    total_timesteps      | 618496      |
| train/                  |             |
|    approx_kl            | 0.050929606 |
|    clip_fraction        | 0.0614      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.0243     |
|    explained_variance   | 0.955       |
|    learning_rate        | 0.0001      |
|    loss                 | 1.16        |
|    n_updates            | 1500        |
|    policy_gradient_loss | -0.00759    |
|    value_loss           | 3.97        |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 5.63  

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 5.4         |
|    ep_rew_mean          | 81          |
| time/                   |             |
|    fps                  | 31          |
|    iterations           | 161         |
|    time_elapsed         | 20876       |
|    total_timesteps      | 659456      |
| train/                  |             |
|    approx_kl            | 0.012026083 |
|    clip_fraction        | 0.0154      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.0165     |
|    explained_variance   | 0.952       |
|    learning_rate        | 0.0001      |
|    loss                 | 0.291       |
|    n_updates            | 1600        |
|    policy_gradient_loss | -0.000916   |
|    value_loss           | 5.29        |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 5.38  

----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 5.64       |
|    ep_rew_mean          | 80.4       |
| time/                   |            |
|    fps                  | 31         |
|    iterations           | 171        |
|    time_elapsed         | 22135      |
|    total_timesteps      | 700416     |
| train/                  |            |
|    approx_kl            | 0.00870125 |
|    clip_fraction        | 0.00535    |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.00967   |
|    explained_variance   | 0.963      |
|    learning_rate        | 0.0001     |
|    loss                 | 0.376      |
|    n_updates            | 1700       |
|    policy_gradient_loss | 0.00302    |
|    value_loss           | 4.15       |
----------------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 5.61         |
|    ep_re

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 5.7          |
|    ep_rew_mean          | 79.9         |
| time/                   |              |
|    fps                  | 31           |
|    iterations           | 181          |
|    time_elapsed         | 23390        |
|    total_timesteps      | 741376       |
| train/                  |              |
|    approx_kl            | 0.0021651373 |
|    clip_fraction        | 0.00576      |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.0105      |
|    explained_variance   | 0.987        |
|    learning_rate        | 0.0001       |
|    loss                 | 2.28         |
|    n_updates            | 1800         |
|    policy_gradient_loss | 0.00363      |
|    value_loss           | 1.3          |
------------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_m

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 6.4         |
|    ep_rew_mean          | 77.2        |
| time/                   |             |
|    fps                  | 31          |
|    iterations           | 191         |
|    time_elapsed         | 24704       |
|    total_timesteps      | 782336      |
| train/                  |             |
|    approx_kl            | 0.061906595 |
|    clip_fraction        | 0.144       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.079      |
|    explained_variance   | 0.819       |
|    learning_rate        | 0.0001      |
|    loss                 | 7.74        |
|    n_updates            | 1900        |
|    policy_gradient_loss | -0.00145    |
|    value_loss           | 27.9        |
-----------------------------------------
----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 6.21    

---------------------------------------
| rollout/                |           |
|    ep_len_mean          | 5.34      |
|    ep_rew_mean          | 81.6      |
| time/                   |           |
|    fps                  | 31        |
|    iterations           | 201       |
|    time_elapsed         | 26105     |
|    total_timesteps      | 823296    |
| train/                  |           |
|    approx_kl            | 0.0315566 |
|    clip_fraction        | 0.113     |
|    clip_range           | 0.2       |
|    entropy_loss         | -0.137    |
|    explained_variance   | 0.852     |
|    learning_rate        | 0.0001    |
|    loss                 | 6.06      |
|    n_updates            | 2000      |
|    policy_gradient_loss | -0.0142   |
|    value_loss           | 15.7      |
---------------------------------------


In [None]:
!tensorboard --logdir={os.path.join('logs/DOOM', 'PPO_1')}

<img src='image/Doom_Basic_1.png'/>
<img src='image/Doom_Basic_2.png'/>

Remark:
- We want the ***mean reward to climb as high as possible***. It ***reaches its peak level of around 80 at about 150k timesteps*** (about 1 hr 18 min of training) and, apart from a temporary drop near 414k timesteps (mean reward -42.9 in the log), stays close to that level.
- Compared with the average reward of -121 from random actions before learning, ***the agent performs significantly better. It kills the monster very quickly***.
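
The training-time figure can be cross-checked against the log. The sketch below linearly interpolates `time_elapsed` between the two logged iterations that bracket 150k timesteps (126,976 steps at 3,944 s and 167,936 steps at 5,269 s). Linear interpolation is my simplifying assumption, and `elapsed_at` is a hypothetical helper.

```python
def elapsed_at(step, p0, p1):
    """Linearly interpolate elapsed seconds between two (timesteps, seconds) log points."""
    (s0, t0), (s1, t1) = p0, p1
    return t0 + (step - s0) * (t1 - t0) / (s1 - s0)

# Log points taken from iterations 31 and 41 of the training output
seconds = elapsed_at(150_000, (126_976, 3_944), (167_936, 5_269))
hours, rem = divmod(int(seconds), 3600)
print(f"{hours} hr {rem // 60} min")   # prints "1 hr 18 min"
```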

# 5. Load and Test the Best Saved Agent

In [47]:
# Import eval policy to test agent
from stable_baselines3.common.evaluation import evaluate_policy

# Reload model from disk
model = PPO.load(CHECKPOINT_DIR+'/best_model_1000000.zip')     # input the model name here

# Create rendered environment
env = VizDoomGym(render = True)

# Evaluate mean reward over 20 games
mean_reward, _ = evaluate_policy(model, env, n_eval_episodes = 20)

print(mean_reward)

83.0


Remark:
- ***Load the saved agent*** (the checkpoint written at 1 million timesteps) to evaluate the final performance of the agent. ***It killed the monster very quickly and earned a mean reward of 83 over 20 evaluation episodes***.
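
Conceptually, `evaluate_policy` plays a number of episodes and averages the total reward of each one. The stand-in sketch below shows that loop with a hypothetical stub environment and policy. Neither is ViZDoom or stable_baselines3 code, and the reward numbers are toy values chosen only for illustration.

```python
from statistics import mean, pstdev

class StubEnv:
    """Hypothetical toy environment: the episode ends when action 2 is taken."""
    def reset(self):
        self.t = 0
        return self.t                      # the observation is just the step count

    def step(self, action):
        self.t += 1
        done = action == 2 or self.t >= 10
        reward = 97 if action == 2 else -4  # toy values, not real game rewards
        return self.t, reward, done, {}

def stub_policy(obs):
    # Hypothetical policy: take action 0 twice, then action 2
    return 2 if obs >= 2 else 0

def evaluate(policy, env, n_episodes):
    """Return the mean and standard deviation of the per-episode total reward."""
    returns = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
        returns.append(total)
    return mean(returns), pstdev(returns)

print(evaluate(stub_policy, StubEnv(), n_episodes=5))
```

A mean reward near the scenario's maximum therefore implies short episodes with few wasted steps, which is what the rendered evaluation shows.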

***End of Page***