<a href="https://colab.research.google.com/github/ArghaSarker/reinforcement-learning-homework-summer-2023/blob/main/homework04/homework04.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Implementing ATARI-game "Breakout" as a Deep Q-Network

#General GYM commands

###Create (vector) environment
`env = gym.make("ENVIRONMENT_NAME")` <br>
`envs = gym.vector.make("ENVIRONMENT_NAME", N_ENVIRONMENTS)`

###Reset environment to initial state
`observation, info = env.reset()` <br>
***NOTE: In vectorized environment, we do not have to reset the environment when it is terminated/truncated, because whenever a sub-environment finishes, it starts again automatically. This is for efficiency, because we will end at different time points for the sub-environments, and can't afford to wait until all are terminated.

###Sample a random action in environment
`action = env.action_space.sample()`

###Take an action in environment
`observation, reward, terminated, truncated, info = env.step(action)`
- observation = new state (image)
- reward = reward (int)
- terminated = terminal state reached? (bool)
- truncated = max sequence length reached? (bool)
- info = human-readable information

#Breakout (ALE/Breakout-v5)

###Action Space (`env.action_space`)
***There are 4 different actions conducted by pressing 4 different buttoms.
- action 0 (NOOP): perform no action
- action 1 (FIRE): restart after losing a life
- action 2 (RIGHT): move platform to the right
- action 3 (LEFT): move platform to the left

###Observation Space (`env.observation_space`)
The output state / observation is an image with 210x160 pixels + 3 color channels (rgb), i.e. a vector of size **(210,160,3)**. <br>
The observation space is a vector of size **(0,255,(210,160,3),uint8)**, i.e. each data point in this space is an image of size (210,160,3), where each pixel can take an unsigned integer (with 8 Bits) - value between [0,255].

###Frameskip
Uses a frameskip of **4**: Instead of playing each frame of the game independently, we play 4 frames at once, i.e. we do the same action for all 4 frames with a probability of 0.25.

# Tool Setup

Gym environment (Atari Games): https://gymnasium.farama.org/#

In [1]:
pip install tensorflow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
pip install "gymnasium[atari, accept-rom-license]"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import gymnasium as gym    #for atari games environment
import tensorflow as tf    #for deep ANN
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
import time
import gc

#Experience Replay Buffer
= stores all collected samples, i.e. works as a dataset that is used for training our Q-Network


In [4]:
class ExperienceReplayBuffer:

  def __init__(self, max_size: int, environment_name: str, parallel_game_unrolls: int, observation_preprocessing_function: callable, unroll_steps: int):
    self.max_size = max_size                                                      #amount of max samples that can be stored
    self.environment_name = environment_name                                      #the environment we use
    self.parallel_game_unrolls = parallel_game_unrolls                            #amount of environments, in which we play in parallel
    self.observation_preprocessing_function = observation_preprocessing_function  #preprocessing function used to [...] the observations
    self.unroll_steps = unroll_steps                                              #amount of steps that we take in each sub-environment to generate data sample
    self.envs = gym.vector.make(environment_name, self.parallel_game_unrolls)     #create vectorized environment to allow sampling from multiple envs in parallel
    self.num_possible_actions = self.envs.single_action_space.n                   #amount of possible actions that can be taken in a sub-environment
    self.current_states, _ = self.envs.reset()                                    #stores the current state of each sub-env
    self.data = []                                                                #stores all tf.datasets that each represent the information of a single step at time step t in all sub-environments

  def sample_epsilon_greedy(self, dqn_network, epsilon: float):
    """ sample an action from DQN given the observation by using  epsilon greedy approach """
    #get observation / current state
    observations = self.observation_preprocessing_function(self.current_states)
    #run q-network on observation to estimate q-values
    q_values = dqn_network(observations)                                                                                                           #tensor of type tf.float32 + shape (parallel_game_unrolls, num_actions)
    #find best action [0 or 1 or 2 or 3] for each sub-environment
    greedy_actions = tf.argmax(q_values, axis=1)                                                                                                   #tensor of type tf.int64 + shape(parallel_game_unrolls,1)
    #select random action [0 or 1 or 2 or 3] for each sub-environment
    random_actions = tf.random.uniform(shape=(self.parallel_game_unrolls,1), minval=0, maxval=self.num_possible_actions, dtype=tf.int64)           #tensor of type tf.int64 + shape(parallel_game_unrolls,1)
    #create a boolean vector using epsilon [True (with prob = 1-epsilon), False (with prob = epsilon)]
    epsilon_sampling = tf.random.uniform(shape=(self.parallel_game_unrolls,1), minval=0, maxval=1, dtype=tf.float32) > epsilon                     #tensor of type tf.bool + shape(parallel_game_unrolls,1)
    #get action by applying boolean vector onto the 2 choice vectors greedy and random actions [True = greedy/vector1, False = random/vector2]
    actions = tf.where(epsilon_sampling,greedy_actions,random_actions).numpy()                                                                     #tensor of type tf.int64 + shape(parallel_game_unrolls,1)
    return actions

  def fill_with_samples(self, dqn_network, epsilon: float):
    """ add new samples into the ERB """
    states_list = []
    actions_list = []
    rewards_list = []
    subsequent_states_list = []
    terminateds_list = []

    #GENERATE DATA: conduct 'unroll_steps' steps in each sub-environment
    for i in range(self.unroll_steps):
      #choose next action using epsilon greedy
      actions = self.sample_epsilon_greedy(dqn_network, epsilon)
      #actions is a 2d array (for some reason) so flatten it to 1d array
      actions = np.array(actions).flatten()
      #conduct action
      next_states, rewards, terminateds, _, _ = self.envs.step(actions)
      #save observation, action, reward, next observation
      states_list.append(self.current_states)
      actions_list.append(actions)
      rewards_list.append(rewards)
      subsequent_states_list.append(next_states)
      terminateds_list.append(terminateds)
      #update states
      self.current_states = next_states

    #EXTRACT INFORMATION TUPLES OF EACH SUB-ENVIRONMENT
    def data_generator():
      for states_batch, actions_batch, rewards_batch, subsequent_states_batch, terminateds_batch in zip(states_list, actions_list, rewards_list, subsequent_states_list, terminateds_list):
        #for each sub-environment
        for game_idx in range(self.parallel_game_unrolls):
          #get state, action, reward, next state, terminated
          state = states_batch[game_idx,:,:,:] #state is given in high, width, color channels
          action = actions_batch[game_idx]
          reward = rewards_batch[game_idx]
          subsequent_state = subsequent_states_batch[game_idx]
          terminated = terminateds_batch[game_idx]
          yield(state,action,reward,subsequent_state,terminated)

    #FEED INFORMATION TUPLES INTO A tf.DATASET
    #give shape + data type for state,                                   action,                                  reward,                                  subsequent_state,                                 terminated
    ds_tensor_specs = (tf.TensorSpec(shape=(210,160,3), dtype=tf.uint8), tf.TensorSpec(shape=(), dtype=tf.int32), tf.TensorSpec(shape=(), dtype=tf.int32), tf.TensorSpec(shape=(210,160,3), dtype=tf.uint8), tf.TensorSpec(shape=(), dtype=tf.bool))
    #create tf.dataset that store all information of this step taken in all sub-environments
    new_samples_ds = tf.data.Dataset.from_generator(data_generator, output_signature=ds_tensor_specs)

    #ADD NEW DATAPOINT/SUB-DATASET TO OUR DATASET/ERB
    #preprocess dataset
    new_samples_ds = new_samples_ds.map(lambda state, action, reward, subsequent_state, terminated: (self.observation_preprocessing_function(state), action, reward,  self.observation_preprocessing_function(subsequent_state), terminated))
    new_samples_ds = new_samples_ds.cache().shuffle(buffer_size=self.unroll_steps * self.parallel_game_unrolls, reshuffle_each_iteration=True)
    #run through dataset once (without doing sth) to apply preprocessing steps and to make sure that cache is applied
    for elem in new_samples_ds:
      continue
    #save dataset
    self.data.append(new_samples_ds)
    #get total amount of data samples (# of datasets * # of steps considered per sample * # of sub-environments)
    datapoints_in_data = len(self.data) * self.unroll_steps * self.parallel_game_unrolls
    #check if maximum amount of data samples is exceeded
    if datapoints_in_data > self.max_size:
      #delete oldest data sample to stay below max size
      self.data.pop(0)

  def create_dataset(self):
    """ create td.data.Dataset object from the ERB """
    erb_dataset = tf.data.Dataset.sample_from_datasets(datasets=self.data, weights=[1/float(len(self.data)) for _ in self.data], stop_on_empty_dataset=False)
    return erb_dataset


  and should_run_async(code)


In [5]:
def observation_preprocessing_function(observation):
  """ convert an obsevation (state) to a tensor of type tf.float32, shape (84,84,3) """

  #reduce image size from (210,160,3) to (84,84,3) for efficiency
  observation = tf.image.resize(observation, size=(84,84), method=tf.image.ResizeMethod.NEAREST_NEIGHBOR) #ATARI has low-resolution graphics and sharp pixelated edges, so we use Nearest Neighbor (instead of Linear Interpolation) as method to preserve those characteristics and avoid smoothing/blurring artifacts
  #change data type from  tf.uint8 to tf.float32
  observation = tf.cast(observation, dtype=tf.float32)
  #zero-center, i.e. put data in the range [-1.0, 1.0]
  observation = observation / 128.0 - 1.0

  return observation


#Deep Q-Network
= a neural network that gets an image of Atari game state as input and returns the choosen action (the picked buttoms) as output

In [6]:
def create_dqn_network(num_actions: int):
  """ create deep Q-network agent using functional API """

  #create input for functional tf.model api (we reduce the image size for efficiency)
  input_layer = tf.keras.Input(shape=(84,84,3), dtype=tf.float32)
  #convolutional layers using residual/skip connections
  x = tf.keras.layers.Conv2D(filters=16, kernel_size=3, activation='relu')(input_layer)
  x = tf.keras.layers.Conv2D(filters=16, kernel_size=3, activation='relu')(x)
  #use global average pooling
  x = tf.keras.layers.GlobalAvgPool2D()(x)
  #apply densely connected layer
  x = tf.keras.layers.Dense(units=64, activation='relu')(x)
  #output layer (NO residual connection!) using no activation; creates q-values for all actions
  y = tf.keras.layers.Dense(units=num_actions, activation='linear')(x)

  model = tf.keras.Model(inputs=input_layer, outputs=y)
  return model

In [13]:
def train_dqn(train_dqn_network, target_network, dataset, optimizer, discount_factor: float, num_training_steps: int, batch_size: int=256):
  """ Train Deep Q-Network in 'num_training_steps' steps using 'optimizer' """

  #use minibatches
  dataset = dataset.batch(batch_size).prefetch(4)

  @tf.function
  def training_step(q_targets, observations, actions):
    """ A sub-function for a single training step"""
    with tf.GradientTape() as tape:
      #estimate q-values for the actions that we actually took in the sample
      q_predictions_all_actions = train_dqn_network(observations)                                   #shape of (batch_size, num_actions)
      q_predictions = tf.gather(q_predictions_all_actions, actions, batch_dims=1)
      #compute MSE loss
      loss = tf.reduce_mean(tf.square(q_predictions - q_targets))
    #calculate the gradients
    gradients = tape.gradient(loss, train_dqn_network.trainable_variables)
    #apply gradients on network
    optimizer.apply_gradients(zip(gradients, train_dqn_network.trainable_variables))
    #return loss for prediction error tracker
    return loss

  losses = []
  q_values = []
  for i, state_transition in enumerate(dataset):
    state, action, reward, subsequent_state, terminated = state_transition
    #get q-values from target network
    all_q_values = target_network(subsequent_state)
    #select max q-value of each sub-environment step
    max_q_values = tf.math.reduce_max(all_q_values, axis=1)
    #save mean of all q-values for q-value tracker
    q_values.append(np.mean(all_q_values.numpy()))
    #get bool-vector of whether we want to use subsequent states q-value (0 = if current state is terminated, 1 = if not terminated)
    use_subsequent_state = tf.where(terminated, tf.zeros_like(max_q_values, dtype=tf.float32), tf.ones_like(max_q_values, dtype=tf.float32))
    #compute q-targets
    q_targets = reward + (discount_factor * max_q_values * use_subsequent_state)
    #conduct a training step: update network parameters by gradient descent
    loss = training_step(q_targets, state, action)
    loss = loss.numpy()
    #save loss for prediction error tracker
    losses.append(loss)
    #check if training done
    if i >= num_training_steps:
      break

  #return the average over the losses for prediction error tracker + the q-values for the q-value tracker
  return np.mean(losses), np.mean(q_values)


  and should_run_async(code)


#RL application

In [8]:
def test_q_network(test_dqn_network, environment_name: str, num_parallel_tests: int, discount_factor: float, preprocessing_function: callable, test_epsilon: float=0.05):
  """
    Play n games in parallel until all are finished, to get return for return tracker
      - This is a very inefficient way of getting the return, but quite easy.
      - We need this function the get the return, because most samples (of x steps) do not cover the whole trajectory from start state to terminal state, but only a trajectory part.
  """
  #create a vectorized environment + reset to start at initial state
  envs = gym.vector.make(environment_name, num_parallel_tests)
  states, _ = envs.reset()

  done = False
  timestep = 0
  returns = np.zeros(num_parallel_tests)
  episodes_finished = np.zeros(num_parallel_tests, dtype=bool) #np vector of shape (num_parallel_tests,1) filled woth booleans, starting with all False
  num_possible_actions = envs.single_action_space.n

  #conduct a step in each sub-environment until all sub-environments are terminated
  while not done:

    #preprocess states
    states = preprocessing_function(states)
    #compute q-values
    q_values = test_dqn_network(states)
    #get best/greedy action for each sub-environment
    greedy_actions = tf.argmax(q_values, axis=1)                                                                             #tensor of type tf.int64 + shape (num_parallel_tests,1)
    #get random action for each sub-environment
    random_actions = tf.random.uniform(shape=(num_parallel_tests,1), minval=0, maxval=num_possible_actions, dtype=tf.int64)  #tensor of type tf.int64 + shape (num_parallel_tests,1)
    #choose an action using epsilon sampling
    epsilon_sampling = tf.random.uniform(shape=(num_parallel_tests,1), minval=0, maxval=1, dtype=tf.float32) > test_epsilon  #tensor of type tf.bool + shape (num_parallel_tests,1)
    actions = tf.where(epsilon_sampling, greedy_actions, random_actions).numpy()                                             #tensor of type tf.int64 + shape (num_parallel_tests,1)
    #conduct action
    next_states, rewards, terminateds, _, _ = envs.step(actions)

    #update by turning all newly terminated environments to True
    episodes_finished = np.logical_or(episodes_finished, terminateds)
    #update return (discounted sum of rewards of all time steps) by adding discounted reward of current time step
    returns += ((discount_factor**timestep) * rewards) * (np.logical_not(episodes_finished).astype(np.float32))     #NOTE: We only update non-terminated sub-environments
    #increase time step
    timestep += 1
    #check if all sub-environments terminated
    done = np.all(episodes_finished)

  #return the average return
  return np.mean(returns)


In [9]:
def polyak_averaging_weights(source_network, target_network, polyak_averaging_factor: float):
  """
    copy the weights of a source network to the a target network in a Polyak averaging way,
    i.e. average Source and Target network's weights in a weighted manner
    --> If 'polyak_averaging_factor' = 0, then we copy WHOLE source network's weights.
  """

  for target_weights, source_weights in zip(target_network.weights, source_network.weights):
        target_weights.assign(polyak_averaging_factor * source_weights + (1 - polyak_averaging_factor) * target_weights)


In [10]:
def visualize_results(results_df, step):

  #create three subplots (one for each tracker)
  fig, axis = plt.subplots(3,1)
  #include the row indexes explicitly in the results.df
  results_df['step'] = results_df.index
  #plot the average return
  sns.lineplot(x='step', y='average_return', data=results_df, ax=axis[0])
  #plot the average loss
  sns.lineplot(x='step', y='average_loss', data=results_df, ax=axis[1])
  #plot the average q-values
  sns.lineplot(x='step', y='average_q_values', data=results_df, ax=axis[2])
  #create a timestring from the timestamp
  timestring = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
  #save the figure
  plt.savefig(f'./results/{timestring}_results_step{step}.png')
  #close the figure
  plt.close(fig)


In [14]:
def dqn():

  ENVIRONMENT_NAME = "ALE/Breakout-v5"
  NUMBER_ACTIONS = gym.make(ENVIRONMENT_NAME).action_space.n
  MAX_SIZE = 10000 #100000
  PARALLEL_GAME_UNROLLS = 10 #128
  UNROLL_STEPS = 4
  PREFILL_STEPS = 100
  EPSILON = 0.2
  DISCOUNT_FACTOR = 0.995
  POLYAK_AVERAGING_FACTOR = 0.99           #the more training we do in-between updating steps, the smaller can this factor be
  OPTIMIZER = tf.keras.optimizers.Adam()
  NUM_TRAINING_STEPS_PER_ITERATION = 16
  TRAIN_BATCH_SIZE = 200 #512
  NUM_TRAINING_ITERATIONS = 51 #50000
  TEST_EVERY_N_STEPS = 50
  TEST_NUM_PARALLEL_ENVIRONMENTS = 10 #128

  #create experience replay buffer, DQN agent/network (= the network we train), Target network (= the network we use to calculate the Q-estimation targets)
  erb = ExperienceReplayBuffer(MAX_SIZE, ENVIRONMENT_NAME, PARALLEL_GAME_UNROLLS, observation_preprocessing_function, UNROLL_STEPS)
  dqn_agent = create_dqn_network(NUMBER_ACTIONS)
  target_network = create_dqn_network(NUMBER_ACTIONS)

  #initialize target network with identical weights as the DQN network
  polyak_averaging_weights(dqn_agent, target_network, polyak_averaging_factor = 0.0)

  #initialize trackers for return, prediction error, average q-values
  return_tracker = []
  dqn_prediction_error_tracker = []
  avg_q_values_tracker = []

  #prefill the replay buffer with wide-spread sample trajectories, by choosing totally random actions (no policy)
  for prefill_step in range(PREFILL_STEPS):
    erb.fill_with_samples(dqn_agent, epsilon=1.0)

  #TRAIN AGENT
  for step in range(NUM_TRAINING_ITERATIONS):
    print('Training step: ', step)

    #sample trajectories (s,a,r,s') and store them in replay buffer
    erb.fill_with_samples(dqn_agent, EPSILON)
    #create training dataset by selecting random samples from the replay buffer
    dataset = erb.create_dataset()
    #train DQN using selected samples
    average_loss, average_q_values = train_dqn(dqn_agent, target_network, dataset, OPTIMIZER, DISCOUNT_FACTOR, NUM_TRAINING_STEPS_PER_ITERATION, TRAIN_BATCH_SIZE)

    #update target network via Polyak averaging
    polyak_averaging_weights(dqn_agent, target_network, POLYAK_AVERAGING_FACTOR)

    #TEST AGENT: report return, prediction error, average q-values in N steps intervals
    if (step % TEST_EVERY_N_STEPS == 0):
      #test q-network to get average return
      average_return = test_q_network(dqn_agent, ENVIRONMENT_NAME, TEST_NUM_PARALLEL_ENVIRONMENTS, DISCOUNT_FACTOR, observation_preprocessing_function)
      #save tracked info
      return_tracker.append(average_return)
      dqn_prediction_error_tracker.append(average_loss)
      avg_q_values_tracker.append(average_q_values)
      #print average returns, losses, q-values
      print(f"TESTING: Average return: {average_return}, Average loss: {average_loss}, Average q-value-estimation: {average_q_values}")
      #put all result lists into a Pandas dataframe by transforming them into a dict first
      results_dict = {"average_return": return_tracker, "average_loss": dqn_prediction_error_tracker, "average q-values": avg_q_values_tracker}
      results_df = pd.DataFrame(results_dict)
      #visualize the results with seaborn
      visualize_results(results_df, step)
      print(results_df)


In [15]:
#test everything
if __name__ == '__main__':
  dqn()

Training step:  0


InvalidArgumentError: ignored

#Optimizations
- pre-fill training buffer before training to avoid overestimation bias
- use separate target network, which is a delayed version of Q-network, to avoid moving target problem