<a href="https://colab.research.google.com/github/SevioStanton/Spoon-Knife/blob/master/Battlezone_DeepQ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Import Dependencies**

In [1]:
import random
import gym
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
import os # for creating directories

#for rendering in colab
import glob
from IPython.display import HTML
from gym.wrappers import Monitor
import io
import base64
from IPython import display as ipythondisplay

Using TensorFlow backend.


**Setting up Rendering Process**

In [2]:
#pip installs for rendering
#!pip install gym pyvirtualdisplay > /dev/null 2>&1
#!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
#!apt-get update > /dev/null 2>&1
#!apt-get install cmake > /dev/null 2>&1
#!pip install --upgrade setuptools 2>&1
#!pip install ez_setup > /dev/null 2>&1
#!pip install gym[atari] 
#!pip install h5py pyyaml
#!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
#!unzip ngrok-stable-linux-amd64.zip

In [3]:
def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env

**Set Hyperparameters**

In [4]:
#env = gym.make('BattleZone-v0') # initialise environment

#because of rendering process, we must use the wrap_env function defined above
env = wrap_env(gym.make('BattleZone-v0'))

In [5]:
state_size = env.observation_space.shape[0]
state_size

#The shape attribute for numpy arrays returns the dimensions of the array.
#If Y has n rows and m columns, then Y.shape is (n,m), so Y.shape[0] is n.
#So the number of rows in the observation space is 210 for Battlezone.
#Atari observations are image frames, so these are most likely the 210 pixel rows of the screen
#(env.observation_space.shape shows the full dimensions).
#For cartpole, the entries were cart position, cart velocity, pole angle, and pole angular velocity.


210
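The comment above asks what the 210 rows contain. A minimal sketch of the arithmetic, assuming the usual Atari frame shape of (210, 160, 3), which is consistent with the `[480, state_size]` reshape used later in this notebook. The variable names are illustrative only.

```python
import math

# Assumed observation shape for Atari frames: (height, width, RGB channels).
obs_shape = (210, 160, 3)

rows = obs_shape[0]                       # what shape[0] returns: image height
values_per_frame = math.prod(obs_shape)   # total numbers in one frame

# A [480, 210] reshape keeps every value but splits one frame into
# 480 separate rows of 210 values each.
rows_after_reshape = values_per_frame // rows

print(rows, values_per_frame, rows_after_reshape)
```

Under that assumption, `shape[0]` describes only the image height, and a network that should see the whole frame at once would need an input size equal to `values_per_frame`.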

In [6]:
action_size = env.action_space.n
action_size

18

In [7]:
env.unwrapped.get_action_meanings()

##could possibly remove up and down from actions to eliminate tank movement.

['NOOP',
 'FIRE',
 'UP',
 'RIGHT',
 'LEFT',
 'DOWN',
 'UPRIGHT',
 'UPLEFT',
 'DOWNRIGHT',
 'DOWNLEFT',
 'UPFIRE',
 'RIGHTFIRE',
 'LEFTFIRE',
 'DOWNFIRE',
 'UPRIGHTFIRE',
 'UPLEFTFIRE',
 'DOWNRIGHTFIRE',
 'DOWNLEFTFIRE']
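One way the idea of dropping UP and DOWN might be sketched: keep only the action ids whose names contain neither word, and let the network choose among those. The helper name `stationary_actions` is hypothetical; the action list is copied from the output above.

```python
# Action names as listed by get_action_meanings() above.
ACTION_MEANINGS = [
    'NOOP', 'FIRE', 'UP', 'RIGHT', 'LEFT', 'DOWN',
    'UPRIGHT', 'UPLEFT', 'DOWNRIGHT', 'DOWNLEFT',
    'UPFIRE', 'RIGHTFIRE', 'LEFTFIRE', 'DOWNFIRE',
    'UPRIGHTFIRE', 'UPLEFTFIRE', 'DOWNRIGHTFIRE', 'DOWNLEFTFIRE',
]

def stationary_actions(meanings):
    """Return the env action ids whose names contain neither UP nor DOWN."""
    return [i for i, name in enumerate(meanings)
            if 'UP' not in name and 'DOWN' not in name]

allowed = stationary_actions(ACTION_MEANINGS)
print(allowed)
print([ACTION_MEANINGS[i] for i in allowed])
```

With this mapping the network would need only `len(allowed)` outputs, and output index `k` would be translated to `allowed[k]` before calling `env.step`.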

In [8]:
batch_size = 32
#number of stored experiences sampled from memory for each training update
#larger batch sizes can speed up computation because fewer iterations are needed per pass over the data
#smaller batch sizes empirically produce better results https://www.youtube.com/watch?v=O5xeyoRL95U
#According to Revisiting Small Batch Training for Deep Neural Networks (Masters and Luschi, 2018),
#a paper Yann LeCun has pointed to, batch sizes larger than 32 can hurt test error; the results generalize less well

**Info on optimal batch sizes**

From the recent Deep Learning book by Goodfellow et al., chapter 8:

Minibatch sizes are generally driven by the following factors:

- Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
- Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.
- If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.
- Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
- Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability because of the high variance in the estimate of the gradient. The total runtime can be very high as a result of the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.

Which in practice usually means "in powers of 2 and the larger the better, provided that the batch fits into your (GPU) memory".

You might also want to consult several good posts on Stack Exchange:

- Tradeoff batch size vs. number of iterations to train a neural network
- Selection of Mini-batch Size for Neural Network Regression
- How large should the batch size be for stochastic gradient descent?

Just keep in mind that the paper by Keskar et al., 'On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima', quoted by several of the posts above, has received some objections from other respected researchers in the deep learning community.

UPDATE (Dec 2017): There is a new paper by Yoshua Bengio & team, Three Factors Influencing Minima in SGD (Nov 2017); it is worth reading in the sense that it reports new theoretical & experimental results on the interplay between learning rate and batch size.
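The "less than linear returns" point can be made concrete with a small calculation. This is a sketch under the simplifying assumption that per-example gradients are independent with a fixed standard deviation, so the noise in their average shrinks with the square root of the batch size. The function name `gradient_noise` is illustrative.

```python
import math

def gradient_noise(per_example_std, batch_size):
    """Standard error of a mean taken over batch_size independent samples."""
    return per_example_std / math.sqrt(batch_size)

# Quadrupling the batch size only halves the noise in the gradient estimate.
for n in (32, 128, 512):
    print(n, gradient_noise(1.0, n))
```

Going from 32 to 128 examples costs four times the computation per update but, under this assumption, only halves the noise.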

In [9]:
n_episodes = 200 #number of games we want to play

#could probably stop early once returns diminish past some threshold.
#doing so may also help limit overfitting.
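A diminishing-returns check could be sketched as a comparison of two consecutive windows of episode scores. The helper `has_plateaued`, its window size, and its threshold are all assumptions for illustration and are not part of this notebook's training loop.

```python
def has_plateaued(scores, window=20, min_gain=0.01):
    """True when the mean of the last `window` scores beats the mean of the
    previous `window` scores by less than `min_gain` (as a fraction)."""
    if len(scores) < 2 * window:
        return False  # not enough history to compare two windows
    recent = sum(scores[-window:]) / window
    previous = sum(scores[-2 * window:-window]) / window
    if previous == 0:
        return recent <= 0
    return (recent - previous) / abs(previous) < min_gain

print(has_plateaued([1] * 10, window=5))           # flat scores
print(has_plateaued(list(range(1, 11)), window=5)) # steadily improving scores
```

In the episode loop, such a check could be called once per episode on the list of scores and used to `break` out of training early.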

In [10]:
output_dir = 'model_output/battlezone/'
if not os.path.exists(output_dir):
  os.makedirs(output_dir)

  #if file location does not exist, make it

**Define Agent**

In [11]:
class DQNAgent:
  def __init__(self, state_size, action_size):
    self.state_size = state_size
    self.action_size = action_size
    self.memory = deque(maxlen=2000)
    #deque is a double-ended queue which acts like a list, but elements can be added
    #or removed from either end; with maxlen set, the oldest entries are dropped once it is full

    self.gamma = 0.95
    #decay/discount rate: enables agent to take into account future rewards
    #in addition to the immediate ones, but gives less weight to future rewards
    #the farther out they are.

    self.epsilon = 1.0
    #exploration rate: the initial probability of exploration
    #will decrease based on the epsilon decay rate

    self.epsilon_decay = 0.990
    #decrease the exploration probability by 1.0% of its own magnitude on every call to train()
    #(epsilon is multiplied by this decay rate at the end of train())

    self.epsilon_min = 0.001
    #minimum amount of random exploratory probability

    self.learning_rate = 0.001
    #rate at which neural network adjusts model parameters via
    #stochastic gradient descent

    self.model = self._build_model() #private method
    #should only be accessible from within the class


  
  def _build_model(self):
    #neural net to approximate Q-value Function:
    model = Sequential()

    model.add(Dense(32, activation='relu', input_dim=self.state_size))
    #first hidden layer
    #states are the input, 210 of them

    model.add(Dense(32, activation = 'relu'))
    #second hidden layer

    model.add(Dense(self.action_size, activation = 'linear'))
    #18 actions, so there should be 18 output neurons

    model.compile(loss='mse', optimizer = Adam(lr = self.learning_rate))
    #consider other loss functions https://keras.io/api/losses/ to replace mse

    return model



  def remember(self, state, action, reward, next_state, done):
    self.memory.append((state, action, reward, next_state, done))
    #appends list of previous experiences
    #allows for retraining later



  def train(self, batch_size):
    #method that trains neural network with experiences sampled from memory

    minibatch = random.sample(self.memory, batch_size)
    #sample a minibatch from memory

    for state, action, reward, next_state, done in minibatch:
      #extract data for each minibatch sample
      
      target = reward
      #if done, which is a boolean describing whether or not the game is done,
      #then target equals reward

      if not done:
        target = (reward + self.gamma * np.amax(self.model.predict(next_state)[0]))
        #target is equal to reward plus discount rate times the maximum predicted Q-value of the next state

      target_f = self.model.predict(state)
      #approximately map current state to future discounted reward

      target_f[0][action] = target
      #target_f is an array of arrays
      #target is assigned to index `action` in the 0th array of target_f

      self.model.fit(state, target_f, epochs = 1, verbose = 0)
      #single epoch of training with x=state, y=target_f; fit decreases loss between
      #target_f and y_hat

    if self.epsilon > self.epsilon_min:
      self.epsilon *= self.epsilon_decay
      #reduces epsilon by 1% of itself (decay = 0.990)
      #until min threshold is reached/surpassed



  def act(self, state):
    if np.random.rand() <= self.epsilon:
      #if acting randomly, take random action
      #random number generated from 0 to 1

      return random.randrange(self.action_size)
      #returns random choice from range of action_size (0:18)

    act_values = self.model.predict(state)
    #if not acting randomly, predict reward value based on current state

    return np.argmax(act_values[0])
    #pick the action with the highest predicted Q-value
    #may cause a problem, as in Battlezone and telescopes
    #simultaneous actions are possible, i.e., move and shoot or move up and left

  

  def save(self, name):
    self.model.save_weights(name)

  def load(self, name):
    self.model.load_weights(name)
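The target computed inside `train` follows the standard Q-learning rule: the reward alone when the episode has ended, otherwise the reward plus the discounted best predicted value of the next state. A minimal sketch with made-up numbers, using plain lists in place of the network's predictions. The function name `q_target` is illustrative.

```python
def q_target(reward, next_q_values, done, gamma=0.95):
    """Target for the action taken: r if the episode ended,
    else r + gamma * max over a' of Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

# Hypothetical predicted Q-values for three actions in the next state.
next_q = [0.5, 2.0, 1.0]

print(q_target(1.0, next_q, done=False))   # reward plus discounted best next value
print(q_target(-10.0, next_q, done=True))  # terminal step: reward only
```

Only the entry for the action actually taken is overwritten with this target; the other entries keep the network's own predictions, so they contribute no error to the fit.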


**Interact with Environment**

In [12]:
agent = DQNAgent(state_size, action_size) #initialize agent
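With the agent's defaults it can help to see how far exploration actually decays. The sketch below mirrors the multiply-while-above-minimum rule in `train`, assuming one decay per call. The helper names are hypothetical.

```python
import math

def epsilon_after(n_decays, start=1.0, decay=0.990, floor=0.001):
    """Epsilon after n_decays calls, multiplying only while above the floor."""
    eps = start
    for _ in range(n_decays):
        if eps > floor:
            eps *= decay
    return eps

def decays_to_reach(floor=0.001, start=1.0, decay=0.990):
    """Smallest number of multiplications that brings epsilon to the floor."""
    return math.ceil(math.log(floor / start) / math.log(decay))

print(epsilon_after(200))   # epsilon after 200 decays
print(decays_to_reach())    # decays needed to reach epsilon_min
```

If `train` runs once per episode, 200 episodes would leave epsilon at roughly 0.13, well above `epsilon_min`, so either more episodes or a faster decay rate would be needed for mostly greedy play.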

In [13]:
episode_array = []
episode_counter = 0
time_score_array = []
game_scores=[]

for e in range(n_episodes):
  #iterate over the episodes of the game

  


  state = env.reset()
  #reset state at start of each new episode of the game
  
  tot_reward =0
  #game_scores, start_life = [], 1

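  state = np.reshape(state, [480, state_size])
  #splits the 210x160x3 frame into 480 rows of 210 values; predict(state)[0] in the agent
  #then reads the Q-values for the first of those rows only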

  done = False

  time = 0
  #time represents a frame of the episode
  #may need to edit this


  while not done:
    #env.render()
    
    action = agent.act(state)
    #choose from set of 18 actions

    next_state, reward, done, _ = env.step(action)
    #agent interacts with env, gets feedback; 210 state data points

    reward = reward if not done else reward-10
    #may not translate well from cartpole to battlezone.
    #I guess it may reward survival

    

    next_state = np.reshape(next_state, [480, state_size])

    agent.remember(state, action, reward, next_state, done)
    #remember the previous timestep's state, actions, reward, etc.

    state = next_state
    #set "current state" for upcoming iteration to the current next state
    
    time+=1  
    tot_reward+=reward
    if done:
      #if episode ends:
      print("episode: {}/{}, Time: {}, e: {:.2}, In Game Score: {}".format(e, n_episodes-1, time, agent.epsilon, tot_reward))
      #print episode#, time steps survived, agent's epsilon, and in game score (assuming in game score = reward)

      episode_counter+=1
      episode_array.append(episode_counter)
      time_score_array.append(time)
      game_scores.append(tot_reward)
  show_video()
       

  #Then you can show video after each episode or only in the end
  

  if len(agent.memory) > batch_size:
    agent.train(batch_size)
    #train the agent by replaying a random minibatch of stored experiences

  if e % 10 == 0:
    agent.save(output_dir + "weights_" + '{:04d}'.format(e) + ".hdf5")
  
  



episode: 0/199, Time: 1915, e: 1.0, In Game Score: 9990.0


episode: 1/199, Time: 1688, e: 0.99, In Game Score: 2990.0


episode: 2/199, Time: 1496, e: 0.98, In Game Score: 4990.0


episode: 3/199, Time: 2566, e: 0.97, In Game Score: 2990.0


episode: 4/199, Time: 1720, e: 0.96, In Game Score: 990.0


episode: 5/199, Time: 1625, e: 0.95, In Game Score: 1990.0


episode: 6/199, Time: 1536, e: 0.94, In Game Score: 2990.0


episode: 7/199, Time: 1799, e: 0.93, In Game Score: -10.0


KeyboardInterrupt: ignored
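The episode lengths in the log interact with the replay memory's `maxlen=2000`. A small sketch of that behavior, using integers as stand-ins for stored transitions and the frame counts of episodes 0 and 1 from the output above.

```python
from collections import deque

memory = deque(maxlen=2000)

# Store one stand-in "transition" per frame for episodes 0 and 1.
for step in range(1915 + 1688):
    memory.append(step)

# The deque silently discards the oldest entries once it is full.
print(len(memory), memory[0])
```

After two episodes the memory holds only the most recent 2000 transitions, so nearly all of episode 0 has already been discarded, and each call to `train` samples just `batch_size` of those that remain.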

In [None]:
print(episode_array)
print(time_score_array)
print(game_scores)

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.plot(episode_array,time_score_array)
plt.ylabel('Score (Time Steps Survived)')
plt.xlabel('Episode')
plt.show()

In [None]:
def normalize_scores(score_input):
  #scale every score by the maximum score so the best episode maps to 1.0
  data = np.asarray(score_input, dtype=float)
  normalizing_factor = np.amax(data)
  return (data / normalizing_factor).tolist()

normalized_time_scores = normalize_scores(time_score_array)
print(normalized_time_scores)
print(normalized_time_scores[0])


In [None]:
plt.plot(episode_array, normalized_time_scores)
plt.ylabel('Normalized Score')
plt.xlabel('Episode')
plt.show()

In [None]:
plt.plot(episode_array,time_score_array, scaley=True)
plt.ylabel('Score (Time Steps Survived)')
plt.xlabel('Episode')
plt.show()

In [None]:
from sklearn import preprocessing
import numpy as np

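data = np.array(time_score_array, dtype=float).reshape(1, -1)
#normalize expects a 2D array (samples x features), so the scores form a single row
print("Data = ", data)

# scale the row to unit L2 norm (the default for preprocessing.normalize)
normalized = preprocessing.normalize(data)
print("Normalized Data = ", normalized)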
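Note that `preprocessing.normalize` defaults to L2 normalization, which is not the same as dividing by the maximum as `normalize_scores` does. A pure-Python sketch of the difference on a two-element example. Both helper names are illustrative.

```python
import math

def l2_normalize(values):
    """Divide by the Euclidean length so the squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]

def max_scale(values):
    """Divide by the largest value so the maximum becomes 1."""
    top = max(values)
    return [v / top for v in values]

scores = [3.0, 4.0]
print(l2_normalize(scores))
print(max_scale(scores))
```

For plotting scores on a 0 to 1 axis, max scaling is usually the more readable choice; L2 normalization makes the values depend on how many episodes there are.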

In [None]:
print(time_score_array[0]/np.amax(time_score_array))

In [None]:
plt.hist(time_score_array)
plt.xlabel('Score (Time Steps Survived)')
plt.ylabel('Frequency')
plt.show()

In [None]:
plt.hist(time_score_array, cumulative = True)
plt.xlabel('Score (Time Steps Survived)')
plt.ylabel('Frequency')
plt.show()

In [None]:
plt.hist(time_score_array, cumulative = True, histtype = 'step', bins=500)
plt.xlabel('Score (Time Steps Survived)')
plt.ylabel('Frequency')
plt.show()

In [None]:
plt.hist(time_score_array, histtype = 'step', bins=500)
plt.xlabel('Score (Time Steps Survived)')
plt.ylabel('Frequency')
plt.show()

In [None]:
plt.hist(time_score_array, histtype = 'step', bins = 100)
plt.xlabel('Score (Time Steps Survived)')
plt.ylabel('Frequency')
plt.show()

In [None]:
def bin_width(width_int, data_array):
  #build bin edges 0, width, 2*width, ... so that the last edge covers the maximum value
  num_of_bins = int(np.ceil(np.amax(data_array)/width_int))
  return [width_int*i for i in range(num_of_bins+1)]

plt.hist(time_score_array, histtype = 'step', bins= bin_width(500, time_score_array))
plt.xlabel('Score (Time Steps Survived)')
plt.ylabel('Frequency')
plt.show()

In [None]:
plt.hist(time_score_array, histtype = 'step', bins= bin_width(1000, time_score_array))
plt.xlabel('Score (Time Steps Survived)')
plt.ylabel('Frequency')
plt.show()

In [None]:


plt.hist(time_score_array, histtype = 'step', bins= bin_width(100, time_score_array))
plt.xlabel('Score (Time Steps Survived)')
plt.ylabel('Frequency')
plt.show()

In [None]:
count = 0
for i in range(len(time_score_array)):
  if time_score_array[i] >= 8000:
    count+=1
print(count)

print(count/len(time_score_array)) #fraction of the episodes actually played

In [None]:
count = 0
for i in range(len(time_score_array)):
  if time_score_array[i] >= 9000:
    count+=1
print(count)

print(count/len(time_score_array)) #fraction of the episodes actually played

In [None]:
count = 0
for i in range(len(time_score_array)):
  if time_score_array[i] >= 9999:
    count+=1
print(count)

print(count/len(time_score_array)) #fraction of the episodes actually played
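The three counting cells above differ only in their threshold, so they could be folded into one helper. A sketch, with the hypothetical name `fraction_at_least`, that divides by the number of recorded episodes so an interrupted run still gives a meaningful fraction. The sample data are the eight episode lengths from the training log.

```python
def fraction_at_least(scores, threshold):
    """Fraction of recorded scores that are >= threshold."""
    if not scores:
        return 0.0  # nothing recorded yet
    return sum(1 for s in scores if s >= threshold) / len(scores)

# Time steps survived in episodes 0-7 of the log above.
logged_times = [1915, 1688, 1496, 2566, 1720, 1625, 1536, 1799]

for threshold in (1700, 2000, 8000):
    print(threshold, fraction_at_least(logged_times, threshold))
```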

In [None]:
help(env)