# CS673 Deep Learning | Final Project

## Proposal -  Improved Bot Learning process on Atari games by using Transfer Learning

## Team: 
- Ching-Hao Sun
- Chia-Lin Hsieh
- Rahul Gautham Putcha

## Index
- [Abstract](#Abstract)
- [Baseline Models](#Baseline-Models:)
- [Candidate ML Models / Methods](#Candidate-ML-Models-/-Methods)
- [Project Environment Setup](#Project-Environment-Setup)
  - [Part 1: Installation of GYM](#Part-1)
  - [Part 2: Reinforcement Learning Dependencies](#Part-2)
- [Working on the Project: PART I - Learning the first game](#Working-on-the-Project:-PART-I---Learning-the-first-game)
  - [Hyperparameters](#Hyperparameters)
  - [About Reinforcement Learning](#About-Reinforcement-Learning) : (Yet-to-update)
  - [Replay Memory](#Replay-Memory)
  - [Agent](#Agent)
  - [Starting the Game Environment](#Starting-the-Game-Environment)
  - [A short demo: Of how the model predicts](#A-short-demo:-Of-how-the-model-predicts)
  - [Q-Learning](#Q-Learning)
  - [Model Checkpointing](#Model-Checkpointing)
- [Working on the Project: PART II - Learning the second game](#Working-on-the-Project:-PART-I---Learning-the-first-game)


## Abstract
Reinforcement learning algorithms require tens of thousands or millions of time steps -
which is equivalent to several weeks of training in real time to learn how to play a
single game. Having a bot trained from scratch is costly in terms of time and processing
power.

Suppose we have a pre-trained model of a bot that has already learnt to play one game. 
We intend to make use of the same trained-model for a bot in learning another game of 
a similar traits/environment, thereby improving the efficiency of learning the second 
game and expanding the bot’s knowledge in tackling multiple games in less time.

## Baseline Models:
CNN (Convolutional Neural Network) with DQN (Deep-Q-Network; a Q-Learning variant)


## Candidate ML Models / Methods
- Deep Convolutional Neural Network
- Deep Q-Network
- Transfer Learning


## Project Environment Setup

### Part 1
#### Requirements:
  - Development Environment Window (Installation procedure is similar for Linux and Mac too...)
  - Miniconda or Anaconda with conda cmd installed 


#### Install Microsoft Visual Studio 2022 (For Windows only)
  - Select Build Tools Desktop Development with C++

#### Installation process
- Open a Terminal (For example: Command prompt) ... Require Conda cmd installed by means of Miniconda installation setup
- Setup a new environment **(Recommended)**\
    <code>$ conda create -n env3</code>
    
    <code>$ conda activate env3</code>

- Install Necessary Package in our new environments

  - Install Python3.7 \
    <code>$ conda install python=3.7</code>
    
  - Install OpenAI Gym for Atari games \
    <code>$ pip install gym[atari]</code>

- With this project comes a git repository where you can download the project folder structure and the necessary file after environment setup shown above.
   - The link to [GitHub project repo](https://github.com)


- After Setting up the repo locally into your computer, put all of your atari game into the './roms' folder.
- Choose a Atari game from any of the following sources or your choice:
  - [Breakout from oldgames.sk](https://www.oldgames.sk/en/game/breakout/download/8314/)
  - [SpaceInvaders from consoleroms.com](https://www.consoleroms.com/roms/atari-2600/space-invaders)
  - [SpaceInvaders from atarimania.com](http://www.atarimania.com/game-atari-2600-vcs-space-invaders_s6947.html)

- Also, you can see by default the roms folder contains Breakout and SpaceInvaders '.bin' files in it.
- After putting all of you games that you want to run in this project, go back to the terminal where you are running conda environment.
- Here, run following cmd to load in the game into our arcade learning environment (A way for us to use the atari games using open-ai gym) \
    <code>$ ale-import-roms /roms</code>

**You are now all set to Run this project...**

If not, no need to worry. Execute below project steps sequentially to get all dependencies setup in no time.

### Making Gym[Atari] work on our localhost
At first we load the games by importing the Arcade Learning Environment package. we uploaded the games using ale-import-roms into this program and use it inside gym emulator. This is a setup tutorial, if you have already done with the setup feel free to skip and proceed to [Part 2](#Part-2).

This project requires Python 3.7 and gym[atari]==0.19.0

**Execute below line (START to END) if you are on Google Colab else execute below line in conda CLI environment such as env3 as mentioned above**

For Example,\
Conda Terminal or CMD prompt(Windows) or Terminal(Linux or Mac OS)\
<code>(env3) path> conda uninstall python</code>\
<code>(env3) path> conda install python=3.7</code>

**START**

In [None]:
!conda uninstall python
!conda install python=3.7

In [None]:
!pip uninstall gym
!pip install gym[atari]==0.19.0

**END**

**(Mandatary execution)** Executing below step will import games that are necessary for our project

In [None]:
!ale-import-roms roms

In [None]:
import warnings
from ale_py import ALEInterface
from ale_py.roms import Breakout
from ale_py.roms import SpaceInvaders

ale = ALEInterface()        # Ignore any Deprecation warnings cause by this line
ale.loadROM(Breakout)       # This line will load your Breakout game into this project
ale.loadROM(SpaceInvaders)  # This line will load your SpaceInvaders game into this project

Let Try to see the Breakout atari game inside of gym,

In [None]:
import gym
env = gym.make('Breakout-v4')

Actions moves the bot can make in this game:

In [None]:
print(f"This game supports {env.action_space.n} action moves")
print(f"The moves are {env.unwrapped.get_action_meanings()}") # Note that the NOOP means no operation or no move

A basic game play can be executed as follows:

In [None]:
env.reset() # Start the game from beginning
t=0 # timestamp (epoch)
while True: # Run the game till the game is over, for every timestep
    if HYPERPARAMS["ON_COLAB"]==False: env.render() # Print the game to the screen ...
    action = env.action_space.sample() # Random action
    observation, reward, done, info = env.step(action) # At each step try random action
    if done: # if the game is over (End of the game: can be win, lose or draw in any game) => Stop the game
        print("Episode is finished after the {} timesteps".format(t+1))
        print("Episode info: {}".format(info)) # What the reason? for the game to stop
        break
    t=t+1

env.close() # Close the window
print() # Just for format: This one just prints nothing so we can avoid the print of previous line in jupyter

You will see the game window pop up and close automatically. If you did Hurray!! We are now able to work with any game using Gym in our project.

### Part 2

Install Tensorflow and Keras,

**Execute below line (START to END) if you are on Google Colab else execute below line in conda CLI environment such as env3 as mentioned above**

**START**

In [None]:
!pip uninstall keras==2.7.0
!pip install keras

In [None]:
!pip uninstall tensorflow
!pip install --upgrade tensorflow==2.7.0

In [None]:
!pip install numpy
!pip install opencv-python
!pip install pyglet
!pip install scikit-image
# For PIL
!pip install pillow
!pip install matplotlib

**END**

## Working on the Project: PART I - Learning the first game
If you run above commands we will see below import modules to be successfully executed in our program

In [3]:
# Basic Python Libraries
import os
import sys
import itertools
import random
from collections import deque

# Gym for loading Atari Environment compatible for Reinforcement Learning
import gym
from gym.wrappers import Monitor

# Basic Data Science Libraries (Useful for Reinforcement Learning)
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image, ImageDraw
import keras

# Image Processing Libraries
import cv2
import imageio
from skimage.color import rgb2gray
from skimage.transform import resize
from skimage import img_as_ubyte

In [4]:
# For Deep Learning: Building Neural Network
import tensorflow as tf
import keras.backend as K
from keras.models import Sequential, Model
from keras.layers import Input, Flatten, Dense, Multiply, Concatenate, LeakyReLU, Lambda, Conv2D
from tensorflow.keras.optimizers import Adam, RMSprop, SGD

### Hyperparameters
Below are the hyperparameters that we are using to tune the learning process of the agent.

In [5]:
# Hyperparameters space
HYPERPARAMS = {
    # Google COLAB Setting
    "ON_COLAB": False,
    
    # Reinforcement Learning Parameters
    "ENV_NAME"        : 'BreakoutDeterministic-v4', # Name of environment to be used
    # ('SpaceInvaders-v0') # ('Assault-ram-v0')
    "MEM_SIZE"        : 250000, # Size of replay memory
    "GAMMA"           : 0.99,   # Gamma (Discount rate) of Markov decision process
    
    # Exploration vs Exploitation (for Epsilon Decay Policy)
    "EPSILON"         : 1,     # Start agent at exploration stage=1, or exploitation=0
    "EPSILON_MIN"     : 0.01,
    "EPSILON_DECAY"   : 0.9995,
    "TOTAL_FRAMES"    : 5000000,
    "EPSILON_MAX"     : 1,
    
    # Model Training Hyper Parameters
    "LEARNING_RATE"   : 0.0001,
    "MOMENTUM"        : 0.001,
    "STACK_SIZE"      : 4,
    "HIDDEN_NEURONS"  : 512,  # Number of Neurons in the Deep Neural Network
    "MINIBATCH_SIZE"  : 32,
    "NUM_EPISODES"    : 5000, # Number of episodes/gameplay for the agent training
    "RETRAIN"         : 100,  # Number of times the agent background model trains on its Replay memory before proceeding
    
    # Model used in Replay Memory for Exploitation (Learning to win) on What has be Explored (or What has been found).
    "TGT_UPDATE_FREQ" : 1500,
    "NUM_EXPLORE"     : 1000,
    
    # For demo purpose
    "VIS_DIR"               : "GIFs", # For GOOGLE COLAB Environment only
    "AUTOSAVE_CHECKPOINT"   : 100,   # Auto save model after a number of episode = 100
    "SAVED_MODEL_NAME"      : "model_dqn_breakout.h5", # Name of final model, second game "model_dqn_spaceshoot.h5"
    "TRANSFER_MODEL_NAME"   : "model_dqn_breakout.h5",
    "TMP_MODEL"             : 'tmp_model_1.h5',        # TODO: Remove it as its not used yet??
    
    # Extra Tuning
    "NOOPMAX"         : 8, # Maximum number of No operation actions taken at the beginning of the game (For using every exploration)
    
    # Testing
    "NUM_EVAL"        : 20, # Number of Evals (Test runs)
}

In [6]:
#Lists to store loss and reward value per game!
LOSS_HISTORY = []
REWARD_HISTORY = []

# Number of Frame viewed OR number of times env.step(action) called.
FRAME_COUNT = 0

model_swap =1
if not os.path.exists("models"): 
    os.mkdir("models")

### About Reinforcement Learning
(To be updated later)


### Replay Memory
Replay Memory for improving the agent model by making it play (or fitting) over it past experience, stored in it's Memory.

In [7]:
# For Boosting Experience
class Replay_Memory:
    '''
        This replay memory clas would act as a buffer in which previous experiences would be stored. 
        Agent Experience = [ state=current_state, action=current_action, reward, next_state, done]
    '''
    def __init__(self, MEM_SIZE = 2000): 
        self.memory = deque(maxlen = MEM_SIZE)
        self.max_size = MEM_SIZE
    def add(self,  state, action, reward, next_state, done): 
        self.memory.append(( state, action, reward, next_state, done))

### Agent
The agent class: contains the following attribute,
- Performs Learning using Reinforcement Learning
- Saves/Loads the model
- Contains Background model and Foreground model
- Foreground model plays the game (Exploration)
- Background model trains on its Explored Observation (or states/images/frames)
- Background model is trained by means of using Replay Memory as mentioned above.
- Foreground model is updated after every TGT_UPDATE_FREQ periods on swap count i.e.,(swap_count % TGT_UPDATE_FREQ == 0)
- **Special:** Can also do Transfer Learning
- Transfer Learning is a ability of making an agent that has learned to play one game to play another game of similar but higher complexity, in a short duration.

In [8]:
# Model/Agent
class Agent:
    '''This class contains all methods for an agent to function.'''
    def __init__(self, env):
        print("Setting up the agent ...")
        self.state_size = env.observation_space.shape[0]
        self.action_size = env.action_space.n
        self.memory = Replay_Memory(HYPERPARAMS["MEM_SIZE"])
        self.gamma = HYPERPARAMS["GAMMA"]
        self.epsilon = HYPERPARAMS["EPSILON"]
        self.epsilon_max = HYPERPARAMS["EPSILON_MAX"]
        self.epsilon_min = HYPERPARAMS["EPSILON_MIN"]
        self.total_frame = HYPERPARAMS["TOTAL_FRAMES"]
        self.slope = (self.epsilon_max - self.epsilon_min)/self.total_frame
        self.epsilon_decay = HYPERPARAMS["EPSILON_DECAY"]
        self.lr = HYPERPARAMS["LEARNING_RATE"]
        self.momentum = HYPERPARAMS["MOMENTUM"]
        self.dummy_input = np.zeros((1,self.action_size))
        self.dummy_batch = np.zeros((HYPERPARAMS["MINIBATCH_SIZE"],self.action_size))
        
        self.model = self.build_model()
        self.target_model = self.get_tgt_model()
        
        print("Agent has been Sucessfully setup ...")
        
        
    # Function for Agent Model Setup
    def lambda_out_shape(self, input_shape):
        shape = list(input_shape)
        shape[-1] = 1
        return tuple(shape)
        
    
    def build_model(self):
        '''Model to train the agent'''
        return self._build_compatible_model(self.action_size)

    def transfer_learning(self, model_pathname, old_actions_size, new_actions_size):
        '''
            Perform a transfer of knowledege about a game of similar less complex enviroment to this game.
            Please supply:
            - model_pathname: the location of model file that has already learnt to training on a game.
            - old_action_size: the estimation of size for the already learnt game.
            - new_action_size: the estimation of size for the current game being played.
        '''
        return self._build_compatible_model(old_actions_size, True, new_actions_size,model_pathname)

    def _build_compatible_model(self, actions_size, is_transfer_learning=False, new_actions_size=0, model_pathname=""):
        '''A single method to build model normally or build model after transfer learning from another game'''
        prev_actions_size = actions_size

        agent_bot = self
        input_layer = Input(shape=(84, 84, HYPERPARAMS["STACK_SIZE"]), name="image")     # Sending the stack of 4 resized image of 84*84

        # Convolution-Max Pooling parts
        conv_layer1 = Conv2D(32, (8, 8), strides=(4, 4), activation='relu', name="conv2D_1")(input_layer)
        # May Introduce Max Pooling here ...
        conv_layer2 = Conv2D(64, (4, 4), strides=(2, 2), activation='relu', name="conv2D_2")(conv_layer1)
        # May Introduce Max Pooling here ...
        conv_layer3 = Conv2D(64, (3, 3), strides=(1, 1), activation='relu', name="conv2D_3")(conv_layer2)

        # Densely connected Neural Network
        flat_feature = Flatten(name="flat_1")(conv_layer3)                               # Input Layer
        hidden_feature = Dense(HYPERPARAMS["HIDDEN_NEURONS"], name="hidden_layer_1")(flat_feature)      # Hidden Layer
        lrelu_feature = LeakyReLU(name="activation_layer")(hidden_feature)               # using Leaky-Rely activation with alpha=0.3 on Hidden Layer

        # Setting up the Output Layer
        q_value_prediction = Dense(prev_actions_size, name="q_values")(lrelu_feature)

        # Get Single Action and Target Q value
        action_one_hot = Input(shape=(prev_actions_size,), name="action")          # Take Current Action to be played
        select_q_value_of_action = Multiply()([q_value_prediction,action_one_hot]) # Checking the Q-value of current move/action
        target_q_value = Lambda(lambda x:K.max(x, axis=-1, keepdims=True),output_shape=agent_bot.lambda_out_shape)(select_q_value_of_action)

        model = Model(inputs=[input_layer,action_one_hot], outputs=[q_value_prediction, target_q_value])

        if is_transfer_learning:
            # Load model for previous game
            model.load_weights(model_pathname)
            
            # Sibling layers for learning actions
            prev_Qlayer = model.get_layer(name="q_values")
            prev_Qlayer._name = "old_action_q_values"
            new_Qlayer = Dense(new_actions_size, name="new_action_q_values")

            # Merge the Sibling layers to form one layer to estimate Q-values
            q_value_prediction = Concatenate(name="q_values")([prev_Qlayer(lrelu_feature), new_Qlayer(lrelu_feature)])

            # Get Single Action and Target Q value
            action_one_hot = Input(shape=(prev_actions_size+new_actions_size,), name="action")
            select_q_value_of_action = Multiply()([
                q_value_prediction,
                action_one_hot
            ])  
            target_q_value = Lambda(lambda x:K.max(x, axis=-1, keepdims=True),output_shape=agent_bot.lambda_out_shape)(select_q_value_of_action)

            model = Model(inputs=[input_layer,action_one_hot], outputs=[q_value_prediction, target_q_value])

        model.compile(loss=['mse','mse'], loss_weights=[0.0,1.0],optimizer=Adam(agent_bot.lr))
        return model
        
    def get_tgt_model(self):
        '''This method would clone the architecture as well as the initial weights of the base model into target model'''
        self.target_model = self.build_model()
        self.update_target_model()
        return self.target_model
    
    def update_target_model(self): 
        '''This method would update weights of target model'''
        self.target_model.set_weights(self.model.get_weights())
    
    def load_model(self, pathname): 
        '''This method would load weights of model'''
        # load weights into new model
        self.model.load_weights(pathname)
        self.target_model = self.get_tgt_model()
        print("Loaded model from disk")
        
    def save_model(self, pathname):
        '''Save method would save weights of model model'''
        # serialize weights to HDF5
        self.model.save_weights(pathname)
        self.target_model = self.get_tgt_model()
        print("Saved model to disk")
        

    # Agent Play Prediction function
    def next_action(self, state):
        '''Get the next action using epsilon greedy policy for deciding whether to exploit or explore'''
        if self.epsilon > self.epsilon_min:
            self.epsilon = self.epsilon_max - self.slope*(FRAME_COUNT)
        if (np.random.rand() <= self.epsilon):
            return env.action_space.sample()
        q_values = self.model.predict([np.expand_dims(state,axis=0),self.dummy_input])[0]
        return np.argmax(q_values[0])

    # Replay Functions
    def store_experience(self, state, action, reward, next_state, done):
        '''Store the experience in our replay memory'''
        self.memory.add(state, action, reward, next_state, done)

    def replay(self, batch_size, model_swap):
        '''
            Does the back propogation to adjust weights during exploitation action.
            - batch_size: total number of random samples that the agent can recollect from memory
            - The higher the batch_size more is the time for training process.
        '''
        # REINFORCEMENT LEARNING
        print("Game Play Paused! Model is training on it's past Memory")
        # First we set all input to NOOP or no move for every observations(stack of frames or images)
        # Dummy_Inputs_batch.shape = [(MINBATCH_SIZE = 32 images), (action_size = 4 moves for breakout)]
        dummy_batch = np.zeros((batch_size,self.action_size)) 
        
        # Experience batch set
        state_batch      = []
        action_batch     = []
        reward_batch     = []
        next_state_batch = []
        terminal_batch   = []  # recording Is_done?
        
        # Actual Move that should have played (This is also an Assumption)
        y_batch = []

        # Sample random minibatch of transition from replay memory
        minibatch = random.sample(list(self.memory.memory), batch_size)
        # For every experience thats in our Replay Memory
        for data in minibatch: # We organize the data
            state_batch.append(data[0])
            action_batch.append(data[1])
            reward_batch.append(data[2])
            next_state_batch.append(data[3])
            terminal_batch.append(data[4])
        
        # Convert the is_done to a numpy array
        terminal_batch = np.array(terminal_batch)

         
        for i in np.arange(HYPERPARAMS["RETRAIN"]):
            # Get what agent is assuming with the trained model till now. Supplying NOOP/NoAction input for every move... 
            # Model is predicting the Q-Value or we can all it as Future reward from current move (or action) made
            target_q_values_batch = self.target_model.predict([np.float32(np.array(next_state_batch)),self.dummy_batch])[0]
            # What model should assume (The Assumption is to predict its own output without any loss) Outrageous!!
            y_batch = reward_batch + (1 - terminal_batch) * self.gamma * np.max(target_q_values_batch, axis=-1)
            # (1 - terminal_batch) above is to indicate the game is done(0, return only reward) or not(1, return reward with discounted sum)
            # y_batch is also called Future Reward or the reward model is expecting to get in the future.


            a_one_hot = np.zeros((batch_size,self.action_size))
            for index,action in enumerate(action_batch):
                a_one_hot[index,action] = 1.0            # Get the Action the player performed previously

            # START TRAINING PROCESS
            # Get the loss between NoAction and the Expected Action that model suplies
            loss = self.model.train_on_batch([np.float32(np.array(state_batch)),a_one_hot],[self.dummy_batch,y_batch])

            if i == HYPERPARAMS["RETRAIN"]-1: # Append loss to it's history, only on the last re-train loop
                LOSS_HISTORY.append(loss[1])
        
            # END TRAINING PROCESS
        
        #At target network's update frequency, update the target network
        if(model_swap % HYPERPARAMS["TGT_UPDATE_FREQ"] == 0):
            self.update_target_model()  # Swap Models
            print("Target model swapped successfully with Original model!")
            
        return 

In [9]:
# Helper functions

def preprocess(image):
    '''This method downsamples and resizes the images to 84*84 and converts it to grayscale for CNN compatibility and processing efficiency'''
    # Downsample(image) & resize image into square for CNN compatibility
    resized_image = preprocess_rgb(image)
    grayscale_image = rgb2gray(resized_image)
    return grayscale_image

def preprocess_rgb(image):
    '''This method downsamples and resizes the images to 84*84 and converts it to grayscale for CNN compatibility and processing efficiency'''
    # Downsample(image) & resize image into square for CNN compatibility
    resized_image = cv2.resize(image[::2, ::2], (84, 84), interpolation = cv2.INTER_AREA)
    return resized_image

def generate_gif(frame_no, frames, reward, path, e):
    '''Utility method to generate gif from frames'''
    for idx, frame_idx in enumerate(frames): 
        frames[idx] = resize(frame_idx, (420, 320, 3), preserve_range=True, order=0).astype(np.uint8)
        
    imageio.mimsave(f'{path}{"episode_{0}_frame_{1}_reward_{2}.gif".format(e, frame_no, reward)}', frames, duration=1/100)


### Starting the Game Environment
As seen previously we have installed the environment for running any game within gym emulator setup. Now is the time to get things working in action, for the main aim of this project.

In [10]:
from ale_py import ALEInterface
from ale_py.roms import Breakout, SpaceInvaders,Tetris
ale = ALEInterface()
ale.loadROM(Breakout) #ale.loadROM(Breakout)

  for external in metadata.entry_points().get(self.group, []):
  _RESOLVED_ROMS = _resolve_roms()


In [11]:
import gym
env = gym.make(HYPERPARAMS["ENV_NAME"])

In [12]:
print(f'ENVIRONMENT {HYPERPARAMS["ENV_NAME"]}:')
print(f'This environment requires {env.action_space.n} actions.')
print(f'The actions are {env.unwrapped.get_action_meanings()}.')

ENVIRONMENT BreakoutDeterministic-v4:
This environment requires 4 actions.
The actions are ['NOOP', 'FIRE', 'RIGHT', 'LEFT'].


In [13]:
agent = Agent(env)

Setting up the agent ...
Agent has been Sucessfully setup ...


### A short demo: Of how the model predicts

For the first step we are delivering the **STACK_SIZE=4** number of images at a time in our agent model. Along with this is the current action being played. The action signifies the last move that was played, i.e. the move played by in the last image of the 4 image stack/sequence.

You may wonder why we are considering the **STACK_SIZE**. Firstly, DeepMind choose to use the past 4 frames. Why?
1. frame doesnot describe the movement of player or the enemies or any items. (Relative motion of any object)
2. frames bare minimum requirement to learn about the speed of objects. (We capture the relative position on object between 2 frames)
3. frames is necessary to infer acceleration. Why? 
   - Every frame we are received with, provides the derivative of position w.r.t time.
4. and so on...

In [None]:
env.reset()
image1, image2, image3, image4 = preprocess(env.step(1)[0]), preprocess(env.step(2)[0]), preprocess(env.step(0)[0]), preprocess(env.step(0)[0])
cv2.imshow("image1", image1)
cv2.imshow("image2", image2)
cv2.imshow("image3", image3)
cv2.imshow("image4", image4)
cv2.waitKey(0)
cv2.destroyAllWindows()

4 windows pop up showcasing how the images look like. 

**Warning: Hit Space or any button to Resume. Else, your IPython Kernel may die/crash.**

By executing below you are letting make it's first prediction. As said before the model takes STACK_SIZE=4 images and the current action (in one hot encoded format).

In [None]:
agent.model.predict([
    np.expand_dims(
        np.stack([
            image1, image2, image3, image4
        ], axis=2), 
        axis=0
    ), np.array([[0,1,0,0]],dtype='float')
])

### Q-Learning
The first four floats, shown above, are the Q-values for each action move that can be played in the game. Also, the Q-values are for the action for the current state. The maximum of these Q-values is the **target output** or y, which we use in our **REPLAY MEMORY OF AGENT** as the expected output.

Process of Reinforcement Learning using Q-Learning
- Get Q-values from Neural Network
- use Target_Qvalue_action_i = r+max(Q-values), or 
  - Future Reward for the action_i on state_s is ( current_reward + max(next_predicted_future_reward) )

In [14]:
def slow_start(env,image_stack=[], rgb_stack=[],NOOPMAX=10):
    idle_times = random.randint(4, NOOPMAX)
    
    #print(f'Agent is Staying Idle for {idle_times} times. Agent is thinking about what move to make...')
    for idle_time in range(idle_times):
        if HYPERPARAMS["ON_COLAB"]==False: env.render() # NOTE: Comment this in Google Colab
        state, reward, done, info = env.step(0) # Zero means: NOOP or No Operation
        processed_frame = preprocess(state)
        image_stack.append(processed_frame)
        rgb_stack.append(preprocess_rgb(state))
        
    return image_stack, rgb_stack

### Model Checkpointing
During the training process we found that it is really inefficient to produce the complete training model of 30hrs in a single run. It's better to work checkpoint of 5-10 hrs and save the progress in middle. For this we have devised the load_model and save_model functionality within our agent class.

Use below function to check the trained model performance after training.

In [13]:
agent.load_model('./models/model_dqn_breakout.h5')

OSError: Unable to open file (unable to open file: name = './models/model_dqn_breakout.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

**OR** run below code block to run initial exploration stage for our agent to get an understanding of the game. This is not exactly understanding, but a way for us to fill the replay memory with images of the game being played with random actions.

### Exploration (Run only if you are avoiding Model Checkpointing step)

In [15]:
EXPLORE = 1
TRAIN   = 2
TEST    = 3
def gameplay(PLAY_TYPE=TEST,MAX_EPISODE_PLAYTIME=1000000):
    global FRAME_COUNT
    #print(f'Agent is starting a new game: {e} games played.')
    
    # Reset Game
    state = env.reset()
    times_rewarded = times_penalized = 0
    last_lives = 5
    terminal_life_lost = False # False if last_lives==0 else True
    
    #print('Agent has made the start move.')
    # Start the game by 'FIRE' action, incase if it doesnot start the game without it
    state, _, _, _ = env.step(1) 
    
    # Fill agent's memory with random times of no operation played
    image_stack,rgb_stack = slow_start(env=env, NOOPMAX=HYPERPARAMS["NOOPMAX"])
    
    
    #print(f'Agent is now playing the game...')
    i, state = 0, np.stack(image_stack[-4:], axis = 2)
    while i < MAX_EPISODE_PLAYTIME:
        agent.epsilon = 1  # Agent is Exploring the game by default
        
        # If agent has lost a life then start the game with 'FIRE' again.
        if(terminal_life_lost == True):
            state, _, _, _ = env.step(1) # 'FIRE' to start the game
            slow_start(env, image_stack, rgb_stack, HYPERPARAMS["NOOPMAX"])
            state = np.stack(image_stack[-4:], axis = 2)

        FRAME_COUNT = FRAME_COUNT + 1
        action = env.action_space.sample() if PLAY_TYPE==EXPLORE else (agent.next_action(state) if PLAY_TYPE==TRAIN else np.argmax(agent.model.predict([np.expand_dims(state,axis=0),agent.dummy_input])[0]))

        if HYPERPARAMS["ON_COLAB"]==False: env.render() # NOTE: Comment this in Google Colab
        # Agent Makes random moves here...
        next_state, reward, done, info = env.step(action)
        
        rgb_stack.append(preprocess_rgb(next_state))
        
        # Agent updates it's game status here...
        terminal_life_lost = True if info['ale.lives'] < last_lives else done
        last_lives = info['ale.lives']
        
            
        if reward > 0:      times_rewarded = times_rewarded + 1
        elif reward < 0: times_penalized = times_penalized + 1
        elif terminal_life_lost: times_penalized = times_penalized + 1
        # Making the starting experience of rewards more fruitful. For our replay memory...
        reward = 10 if reward > 0 else (-30 if reward < 0 else reward)
        reward = -30 if terminal_life_lost else reward
        
        # Store the stack of images for new a experience
        processed_frame = preprocess(next_state)
        image_stack = image_stack[-3:]
        image_stack.append(processed_frame)
        
        next_state = np.stack(image_stack[-4:], axis = 2)
        if(len(image_stack) != 4): print("Something's not right!! The stack size is less than expected.")
            
        #Store experience in replay mem
        if(PLAY_TYPE==EXPLORE or PLAY_TYPE==TRAIN): 
            agent.store_experience(state, action, reward, next_state, terminal_life_lost)
        state = next_state
        
        if done: break
        i+=1
        
    REWARD_HISTORY.append(times_rewarded)
    return image_stack, times_rewarded, times_penalized, rgb_stack

In [16]:
'''
    THIS IS DONE TO POPULATE REPLAY MEMORY WITH COMPLETE EXPLORATION TO INITIALIZE THE MEMORY 
'''
total_times_rewarded=total_times_penalized=0
for e in range(HYPERPARAMS["NUM_EXPLORE"]):
    image_stack, times_rewarded, times_penalized, rgb_stack = gameplay(PLAY_TYPE=EXPLORE, MAX_EPISODE_PLAYTIME=1000)
    total_times_rewarded  = total_times_rewarded + times_rewarded
    total_times_penalized = total_times_penalized+ times_penalized
    if(e % 100 == 0): 
        print("Finished exploring for {} episodes".format(e))
        print("Total Times Rewarded: {}, Total Times Penalized: {}".format(total_times_rewarded, total_times_penalized))

print("EXPLORATION STEP COMPLETED")

Finished exploring for 0 episodes
Total Times Rewarded: 0, Total Times Penalized: 5


KeyboardInterrupt: 

### TRAINING STAGE

In [22]:
'''
TRAIN DQN ON THE GAME
'''
rew_list = []
e=0
for e in range(1, HYPERPARAMS["NUM_EPISODES"]+1):
    print("Agent is ready for gameplay...")
    image_stack, times_rewarded, times_penalized, rgb_stack = gameplay(PLAY_TYPE=TRAIN, MAX_EPISODE_PLAYTIME=1000)
            
    rew_list.append(times_rewarded)
            
    if(e % HYPERPARAMS["AUTOSAVE_CHECKPOINT"] == 0):
        print("Finished episode ", e , "/", HYPERPARAMS["NUM_EPISODES"], " Total reward = ", sum(rew_list)/len(rew_list), " eps = ", agent.epsilon)
        if HYPERPARAMS["ON_COLAB"]:
            if not os.path.exists(HYPERPARAMS["VIS_DIR"]):
                os.mkdir(HYPERPARAMS["VIS_DIR"])
            generate_gif(len(rgb_stack), rgb_stack, sum(rew_list)/(len(rew_list)), HYPERPARAMS["VIS_DIR"] + "/", e)
        
        rew_list = []
        print(f"Saving Model Checkpoint at episode {e}...")
        agent.save_model("models/tmp_model_" + str(e) + ".h5")

    # BACKPROP INITIATED AT THE END OF EVERY EPISODE AND NOT AT TGT_FREQ
    agent.replay(HYPERPARAMS["MINIBATCH_SIZE"], model_swap)
    model_swap = model_swap + 1
    
print(f"Agent is now prepared now, after training for {e} episodes.")

Agent is ready for gameplay...
Game Play Paused! Model is training on it's past Memory
Agent is ready for gameplay...
Game Play Paused! Model is training on it's past Memory
Agent is ready for gameplay...
Game Play Paused! Model is training on it's past Memory
Agent is ready for gameplay...
Game Play Paused! Model is training on it's past Memory
Agent is ready for gameplay...
Game Play Paused! Model is training on it's past Memory


KeyboardInterrupt: 

#### Saving the final model

In [None]:
agent.save_model(agent.model, "models/" + HYPERPARAMS["MODEL_NAME"])

## Testing the model

In [18]:
rew_list = []
total_times_rewarded=total_times_penalized=0
for e in range(HYPERPARAMS["NUM_EVAL"]):
    image_stack, times_rewarded, times_penalized, rgb_stack = gameplay(PLAY_TYPE=EXPLORE, MAX_EPISODE_PLAYTIME=1000)
    
    total_times_rewarded  = total_times_rewarded + times_rewarded
    total_times_penalized = total_times_penalized+ times_penalized
    rew_list.append(times_rewarded)
    
    print("Finished episode ", e , "/", HYPERPARAMS["NUM_EVAL"], " Total reward = ", times_rewarded*(10))
    if HYPERPARAMS["ON_COLAB"]:
        if not os.path.exists("test"):
            os.mkdir("test")
        generate_gif(len(rgb_stack), rgb_stack, total_times_rewarded, "test/", e)

print("Testing Complete")

Finished episode  0 / 20  Total reward =  10
Finished episode  1 / 20  Total reward =  30
Finished episode  2 / 20  Total reward =  10
Finished episode  3 / 20  Total reward =  0
Finished episode  4 / 20  Total reward =  40
Finished episode  5 / 20  Total reward =  10
Finished episode  6 / 20  Total reward =  0
Finished episode  7 / 20  Total reward =  30
Finished episode  8 / 20  Total reward =  10
Finished episode  9 / 20  Total reward =  10
Finished episode  10 / 20  Total reward =  20
Finished episode  11 / 20  Total reward =  30
Finished episode  12 / 20  Total reward =  10
Finished episode  13 / 20  Total reward =  20
Finished episode  14 / 20  Total reward =  10
Finished episode  15 / 20  Total reward =  30
Finished episode  16 / 20  Total reward =  0
Finished episode  17 / 20  Total reward =  10
Finished episode  18 / 20  Total reward =  10
Finished episode  19 / 20  Total reward =  20
Testing Complete


In [20]:
env.close()
exit(1) # Close Gym Windows