The tempfile module generates temporary files and directories that can be used as a 
temporary storage area for data files.

The deque class, from the collections module, creates a double-ended queue, practically 
a list where you can append items at the start or at the end. Interestingly, 
it can be set to a predefined maximum size. When full, older items are discarded in order 
to make room for new entries.
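
As a quick illustration of both utilities, here is a minimal sketch using only the standard library. The variable names (recent, scratch_dir) are made up for the example and are not used elsewhere in the project.

```python
import os
import tempfile
from collections import deque

# A deque capped at three items: appending a fourth drops the oldest one.
recent = deque(maxlen=3)
for item in [1, 2, 3, 4]:
    recent.append(item)
# recent now holds 2, 3, 4

# appendleft adds at the start instead, pushing out the item at the far end.
recent.appendleft(0)
# recent now holds 0, 2, 3

# mkdtemp creates a fresh temporary directory and returns its path.
scratch_dir = tempfile.mkdtemp()
scratch_exists = os.path.isdir(scratch_dir)
os.rmdir(scratch_dir)  # clean up the empty directory
```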

We will structure this project using a series of classes representing:

1.- the agent,

2.- the agent's brain (our DQN),

3.- the agent's memory,

4.- and the environment, which is provided by OpenAI Gym but needs to be correctly connected to the agent. It is necessary to code a class for this.

In [1]:
import gym
import os
from gym import wrappers
import numpy as np
import random
import tempfile 
from collections import deque
import tensorflow as tf

# Defining the AI brain

The first step in the project is to create a Brain class containing all the neural network code in order to compute a Q-value approximation. The class will contain the necessary initialization, the code for creating a suitable TensorFlow graph for the purpose, a simple neural network (not a complex deep learning architecture but a simple, working network for our project—you can replace it with more complex architectures), and finally, methods for fit and predict operations.

We also have to set the scope. The scope is defined by a string that helps us keep separate the networks created for different purposes; in our project we have two, one for processing the next reward and one for guessing the final reward.

In [2]:
class Brain:
    """
    A Q-Value approximation obtained using a neural network.
    This network is used for both the Q-Network and the Target Network.
    """
    def __init__(self, nS, nA, scope="estimator",
                 learning_rate=0.0001,
                 neural_architecture=None,
                 global_step=None, summaries_dir=None):
        
        self.nS = nS # the size of the state inputs
        self.nA = nA # the size of the action output
        self.global_step = global_step
        self.scope = scope
        self.learning_rate = learning_rate
        
        if not neural_architecture:
            neural_architecture = self.two_layers_network
        
        # Writes Tensorboard summaries to disk
        with tf.variable_scope(scope):
            # Build the graph
            self.create_network(network=neural_architecture,learning_rate=self.learning_rate)
            if summaries_dir:
                summary_dir = os.path.join(summaries_dir,"summaries_%s" % scope)
                if not os.path.exists(summary_dir):
                    os.makedirs(summary_dir)
                # initializes an event file in a target directory (summary_dir)
                # where we store the key measures of the learning process
                # The handle is kept in self.summary_writer, which we will be using later 
                # for storing the measures we are interested in representing during and after the training 
                # for monitoring and debugging what has been learned.
                self.summary_writer = tf.summary.FileWriter(summary_dir) 
            else:
                self.summary_writer = None
                
                

#  default neural network

The two_layers_network method is the default neural network. As input, it takes the input layer and the respective sizes of the hidden layers that we will be using. The input layer is defined by the state that we are using, which could be a vector of measurements, as in our case, or an image, as in the original DQN paper.

Such layers are simply defined using the higher level ops offered by the Layers module of TensorFlow (https://www.tensorflow.org/api_guides/python/contrib.layers). Our choice goes for the vanilla fully_connected, using the ReLU (rectifier) activation function for the two hidden layers and the linear activation of the output layer. 


In [6]:
    def two_layers_network(self, x, layer_1_nodes=32, layer_2_nodes=32):

        layer_1 = tf.contrib.layers.fully_connected(x, layer_1_nodes, activation_fn=tf.nn.relu)
        layer_2 = tf.contrib.layers.fully_connected(layer_1, layer_2_nodes,activation_fn=tf.nn.relu)
        return tf.contrib.layers.fully_connected(layer_2, self.nA, activation_fn=None)


Also, a few summaries are recorded for TensorBoard: 
    
1.- The average loss of the batch, in order to keep track of the fit during training
2.- The maximum predicted reward in the batch, in order to keep track of extreme positive predictions, 
    pointing out the best-winning moves
3.- The average predicted reward in the batch, in order to keep track of the general tendency 
    of predicting good moves    

#  create_network

The create_network method combines input, neural network, loss, and optimization. The loss is simply created by taking the difference between the original reward and the estimated result, squaring it, and taking the average through all the examples present in the batch being learned. The loss is minimized using an Adam optimizer.


In [10]:
    def create_network(self, network, learning_rate=0.0001):

        # Placeholders for states input
        self.X = tf.placeholder(shape=[None, self.nS],dtype=tf.float32, name="X")
        
        # The r target value
        self.y = tf.placeholder(shape=[None, self.nA],dtype=tf.float32, name="y")
        
        # Applying the chosen network
        self.predictions = network(self.X)
        
        # Calculating the loss
        sq_diff = tf.squared_difference(self.y, self.predictions)
        self.loss = tf.reduce_mean(sq_diff)
        
        # Optimizing parameters using the Adam optimizer
        self.train_op = tf.contrib.layers.optimize_loss(self.loss, 
                        global_step=tf.train.get_global_step(),                                      
                        learning_rate=learning_rate, 
                        optimizer='Adam')
        
        # Recording summaries for Tensorboard
        self.summaries = tf.summary.merge([
            tf.summary.scalar("loss", self.loss),
            tf.summary.scalar("max_q_value", 
                             tf.reduce_max(self.predictions)),
            tf.summary.scalar("mean_q_value", 
                             tf.reduce_mean(self.predictions))])
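
The loss computed in create_network can be reproduced by hand. The following is a small NumPy sketch with made-up numbers (two examples, two actions). It mirrors the squared difference followed by the mean, not the TensorFlow graph itself.

```python
import numpy as np

# Hypothetical batch of two states and two actions (values chosen for illustration).
y = np.array([[1.0, 2.0], [0.0, 4.0]])            # target Q-values
predictions = np.array([[1.5, 2.0], [1.0, 3.0]])  # network output

sq_diff = (y - predictions) ** 2   # element-wise squared difference
loss = sq_diff.mean()              # average over every entry in the batch
```

Note that the mean runs over every action column, not only over the action actually taken.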
        

The class is completed by a predict and a fit method. 

The fit method takes as input the state matrix, s, as the input batch and the vector of reward r as the outcome.

It also takes into account how many epochs you want to train (in the original papers it is suggested using just a single epoch per batch in order to avoid overfitting too much to each batch of observations). 

Then, in the present session, the input is fit with respect to the outcome and summaries (previously defined as we created the network). 

In [11]:
    def predict(self, sess, s):
        """
        Predicting q values for actions
        """
        return sess.run(self.predictions, {self.X: s})

    def fit(self, sess, s, r, epochs=1):
        """
        Updating the Q* function estimator
        """
        feed_dict = {self.X: s, self.y: r}
        for epoch in range(epochs):
            res = sess.run([self.summaries, self.train_op, self.loss,
                            self.predictions,tf.train.get_global_step()],feed_dict)
            
            summaries, train_op, loss, predictions, self.global_step = res

        if self.summary_writer:
            self.summary_writer.add_summary(summaries, self.global_step)

# Creating memory for experience replay

After defining the brain (the TensorFlow neural network), our next step is to define the memory, that is, the storage for the data that will power the learning process of the DQN network. At each step of each training episode, the state and the action are recorded together with the consequent state and the final reward of the episode (something that will be known only when the episode completes).

Adding a flag telling if the observation is a terminal one or not completes the set of recorded information. The idea is to connect certain moves not just to the immediate reward (which could be null or modest) but the ending reward, thus associating every move in that session to it.

The class memory is simply a queue of a certain size, which is then filled with information on the previous game experiences, and it is easy to sample and extract from it. Given its fixed size, it is important that older examples are pushed out of the queue, thus allowing the available examples to always be among the last ones.

The class comprises an initialization, where the data structure is created and its size is fixed; the len method (so we know how many examples the memory holds, which is useful, for instance, in order to postpone any training until we have plenty of examples for better randomization and variety for learning); add_memory for recording in the queue; and recall_memories for recovering all the data from it in a list format:



In [12]:
class Memory:
    """
    A memory class based on deque, a list-like container with 
    fast appends and pops on either end (from the collections 
    package)
    """
    def __init__(self, memory_size=5000):
        self.memory = deque(maxlen=memory_size)

    def __len__(self):
        return len(self.memory)

    def add_memory(self, s, a, r, s_, status):
        """
        Memorizing the tuple (s a r s_) plus the Boolean flag status,
        reminding if we are at a terminal move or not
        """
        self.memory.append((s, a, r, s_, status))

    def recall_memories(self):
        """
        Returning all the memorized data at once
        """
        return list(self.memory)
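
To see the fixed-size behavior in action, here is a minimal usage sketch. The Memory class is repeated in compact form only so that the snippet runs on its own, and integers stand in for real state vectors (an assumption made for brevity).

```python
import random
from collections import deque

class Memory:
    """Compact stand-in for the Memory class above, repeated so the sketch runs alone."""
    def __init__(self, memory_size=5000):
        self.memory = deque(maxlen=memory_size)
    def __len__(self):
        return len(self.memory)
    def add_memory(self, s, a, r, s_, status):
        self.memory.append((s, a, r, s_, status))
    def recall_memories(self):
        return list(self.memory)

memory = Memory(memory_size=4)
for step in range(6):
    # Placeholder transitions: integers stand in for state vectors.
    memory.add_memory(step, 0, 1.0, step + 1, step == 5)

stored_states = [m[0] for m in memory.recall_memories()]  # the two oldest are gone
batch = random.sample(memory.recall_memories(), 2)         # uniform sample without replacement
```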
    

# Creating the agent

The next class is the agent, which has the role of initializing and maintaining the brain (providing the Q-value function approximation) and the memory. It is the agent, moreover, that acts in the environment. Its initialization sets a series of parameters that are mostly fixed given our experience in optimizing the learning for the Lunar Lander game. They can be explicitly changed, though, when the agent is first initialized:

1.- epsilon = 1.0 is the initial value of the exploration-exploitation parameter. The 1.0 value forces the agent to completely rely on exploration, that is, random moving.

2.- epsilon_min = 0.01 sets the minimum value of the exploration-exploitation parameter: a value of 0.01 means that there is a 1% chance that the landing pod will move randomly and not based on Q function feedback. This always provides a minimum chance to find another optimal way of completing the game, without compromising it.

3.- epsilon_decay = 0.9994 is the decay that regulates the speed at which epsilon diminishes toward the minimum. In this setting, epsilon falls to about 0.05 after 5,000 episodes (0.9994 raised to the power 5,000) and would reach the 0.01 floor only after roughly 7,700 episodes; 5,000 episodes should on average provide the algorithm with at least 2 million examples to learn from.

4.- gamma = 0.99 is the reward discount factor with which the Q-value estimation weights the future reward with respect to the present reward, thus allowing the algorithm to be short- or long-sighted, according to what is best in the kind of game being played (in Lunar Lander it is better to be long-sighted because the actual reward will be experienced only when the landing pod lands on the Moon).

5.- learning_rate = 0.0001 is the learning rate for the Adam optimizer to learn the batch of examples.

6.- epochs = 1 is the training epochs used by the neural network in order to fit the batch set of examples.

7.- batch_size = 32 is the size of the batch examples.

8.- memory = Memory(memory_size=250000) is the size of the memory queue.
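
The decay schedule is easy to check numerically. The sketch below assumes that epsilon is multiplied by epsilon_decay once per episode; it reports the value of epsilon after 5,000 updates and how many updates are needed to reach epsilon_min. The helper names eps_after_5000 and updates_to_min are made up for the example.

```python
import math

epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.9994

# Closed form: epsilon after n episode-level updates is epsilon_decay ** n.
eps_after_5000 = epsilon * epsilon_decay ** 5000

# Number of updates needed to bring epsilon down to epsilon_min.
updates_to_min = math.ceil(math.log(epsilon_min / epsilon) / math.log(epsilon_decay))

print("epsilon after 5000 updates: %.4f" % eps_after_5000)
print("updates needed to reach epsilon_min: %i" % updates_to_min)
```

Running a check like this before a long training session helps confirm that the exploration schedule matches the number of episodes you plan to run.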


In [13]:
class Agent:
    def __init__(self, nS, nA, experiment_dir):
        # Initializing
        self.nS = nS
        self.nA = nA
        self.epsilon = 1.0  # exploration-exploitation ratio
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.9994
        self.gamma = 0.99  # reward decay
        self.learning_rate = 0.0001
        self.epochs = 1  # training epochs
        self.batch_size = 32
        self.memory = Memory(memory_size=250000)

        # Creating estimators
        self.experiment_dir = os.path.abspath("./experiments/{}".format(experiment_dir))
        
        self.global_step = tf.Variable(0, name='global_step', trainable=False)
        
        self.model = Brain(nS=self.nS, nA=self.nA, scope="q",
                           learning_rate=self.learning_rate,
                           global_step=self.global_step,
                           summaries_dir=self.experiment_dir)
        
        self.target_model = Brain(nS=self.nS, nA=self.nA, 
                                             scope="target_q",
                             learning_rate=self.learning_rate,
                                 global_step=self.global_step)

        # Adding an op to initialize the variables.
        init_op = tf.global_variables_initializer()
        
        # Adding ops to save and restore all the variables.
        self.saver = tf.train.Saver()

        # Setting up the session
        self.sess = tf.Session()
        self.sess.run(init_op)
        

The epsilon, which governs the share of time devoted to exploring new solutions compared to exploiting the knowledge of the network, is constantly updated with the epsilon_update method, which simply multiplies the current epsilon by epsilon_decay unless it has already reached its allowed minimum value. The save_weights and load_weights methods store and restore the network's variables through the tf.train.Saver created at initialization:

In [None]:
    def epsilon_update(self, t):
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def save_weights(self, filename):
        """
        Saving the weights of a model
        """
        save_path = self.saver.save(self.sess, 
                                    "%s.ckpt" % filename)
        print("Model saved in file: %s" % save_path)

    def load_weights(self, filename):
        """
        Restoring the weights of a model
        """
        self.saver.restore(self.sess, "%s.ckpt" % filename)
        print("Model restored from file")

The set_weights and target_model_update methods work together to update the target Q network with the weights of the Q network (set_weights is a general-purpose, reusable function you can use in your solutions, too).

Since we named the two scopes differently, it is easy to enumerate the variables of each network from the list of trainable variables. Once enumerated, the variables are joined in an assignment to be executed by the running session:

In [None]:
    def set_weights(self, model_1, model_2):
        """
        Replicates the model parameters of one 
        estimator to another.
        model_1: Estimator to copy the parameters from
        model_2: Estimator to copy the parameters to
        """
        # Enumerating and sorting the parameters 
        # of the two models
        model_1_params = [t for t in tf.trainable_variables() \
                          if t.name.startswith(model_1.scope)]
        model_2_params = [t for t in tf.trainable_variables() \
                         if t.name.startswith(model_2.scope)]
        model_1_params = sorted(model_1_params, 
                                key=lambda x: x.name)
        model_2_params = sorted(model_2_params, 
                                key=lambda x: x.name)
        # Enumerating the operations to be done
        operations = [coef_2.assign(coef_1) for coef_1, coef_2 \
                      in zip(model_1_params, model_2_params)]
        # Executing the operations to be done
        self.sess.run(operations)

    def target_model_update(self):
        """
        Copying the Q network's weights into the target network
        """
        self.set_weights(self.model, self.target_model)
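
The filter, sort, and pair logic of set_weights can be tried out without TensorFlow. In the sketch below, a plain dictionary of NumPy arrays stands in for the trainable variables; the variable names and the copy_by_scope helper are hypothetical and only imitate the "scope/layer/parameter" naming pattern.

```python
import numpy as np

# Hypothetical variable store: names mimic a "scope/layer/param:0" pattern.
variables = {
    "q/fully_connected/weights:0": np.ones((2, 2)),
    "q/fully_connected/biases:0": np.ones(2),
    "target_q/fully_connected/weights:0": np.zeros((2, 2)),
    "target_q/fully_connected/biases:0": np.zeros(2),
}

def copy_by_scope(variables, source_scope, target_scope):
    """Copy every value under source_scope onto its counterpart under target_scope."""
    source = sorted(n for n in variables if n.startswith(source_scope + "/"))
    target = sorted(n for n in variables if n.startswith(target_scope + "/"))
    # Sorting both lists by name lines up matching parameters pair by pair.
    for src_name, dst_name in zip(source, target):
        variables[dst_name] = variables[src_name].copy()

copy_by_scope(variables, "q", "target_q")
```

Appending "/" to the prefix is a design choice of this sketch: it prevents one scope name from accidentally matching another scope that merely begins with the same letters.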

The act method is the core of the policy implementation because it will decide, based on epsilon, whether to take a random move or go for the best possible one. 

If it is going for the best possible move, it will ask the trained Q network to provide a reward estimate for each of the possible next moves (represented in a binary way by pushing one of four buttons in the Lunar Lander game) and it will return the move characterized by the maximum predicted reward (a greedy approach to the solution):

In [None]:
    def act(self, s):
        """
        Having the agent act based on learned Q* function
        or by random choice (based on epsilon)
        """
        # Based on epsilon predicting or randomly 
        # choosing the next action
        if np.random.rand() <= self.epsilon:
            return np.random.choice(self.nA)
        else:
            # Estimating q for all possible actions
            q = self.model.predict(self.sess, s)[0]
            # Returning the best action
            best_action = np.argmax(q)
            return best_action
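
The same epsilon-greedy rule can be sketched in isolation. Here the Q-values are made-up numbers instead of network predictions, and the epsilon_greedy function and the seeded generator are assumptions introduced only to make the example repeatable.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() <= epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)             # seeded only to make the sketch repeatable
q_values = np.array([0.1, 0.7, 0.3, 0.2])  # made-up estimates for four actions

greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
random_actions = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(200)]
```

With epsilon at 0.0 the function picks the action with the highest estimate, while with epsilon at 1.0 every choice is random.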

The replay method completes the class. It is a crucial method because it makes learning for the DQN algorithm 
possible. We are going, therefore, to discuss thoroughly how it works. The first thing that the replay method 
does is to sample a batch (we defined the batch size at initialization) from the memories of previous game 
episodes (such memories are just the variables containing values about state, action, 
reward, next state, and a flag variable noting whether the observation is a final state or not). 

The random sampling allows the model to find the best coefficients in order to learn the Q function by 
a slow adjustment of the network's weights, batch after batch.

Then the method finds out which of the sampled transitions are final and which are not. Non-final rewards 
need to be updated in order to represent the reward that you get at the end of the game. 
This is done by using the target network, which represents a snapshot of the Q function 
network as fixed at the end of the previous episode. The Q network picks the best action for the following state, 
the target network estimates the value of that action, and the resulting value is summed, after being discounted by a gamma factor, with the present reward.
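
The update rule can be worked through on a tiny batch. The sketch below uses made-up rewards and predictions for three transitions and two actions. Selecting the action with one network and scoring it with the other is a split commonly associated with Double DQN; the numbers and array names are assumptions for illustration only.

```python
import numpy as np

gamma = 0.99
r = np.array([1.0, -100.0, 0.5])              # observed rewards
terminal = np.array([False, True, False])     # second transition ends the episode

# Made-up predictions for the next states (three samples, two actions).
q_online_next = np.array([[0.2, 0.8], [0.0, 0.0], [0.6, 0.1]])  # Q network
q_target_next = np.array([[1.0, 2.0], [0.0, 0.0], [3.0, 4.0]])  # target network

# The Q network picks the next action; the target network scores it.
best_next_action = np.argmax(q_online_next, axis=1)
next_value = q_target_next[np.arange(len(r)), best_next_action]

# Terminal transitions keep the raw reward; the others add the discounted estimate.
targets = np.where(terminal, r, r + gamma * next_value)
```

For the first transition, for instance, the target is 1.0 + 0.99 * 2.0 = 2.98, while the terminal transition keeps its reward of -100.0 unchanged.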


In [None]:
    def replay(self):
        # Picking up a random batch from memory
        # (dtype=object keeps the mixed tuple entries intact)
        batch = np.array(random.sample(
                self.memory.recall_memories(), self.batch_size),
                dtype=object)
        # Retrieving the sequence of present states
        s = np.vstack(batch[:, 0])
        # Recalling the sequence of actions
        a = np.array(batch[:, 1], dtype=int)
        # Recalling the rewards as floats
        r = np.array(batch[:, 2], dtype=float)
        # Recalling the sequence of resulting states
        s_p = np.vstack(batch[:, 3])
        # Checking if the reward is relative to 
        # a not terminal state
        status = np.where(batch[:, 4] == False)
        # We use the model to predict the rewards by 
        # our model and the target model
        next_reward = self.model.predict(self.sess, s_p)
        final_reward = self.target_model.predict(self.sess, s_p)

        if len(status[0]) > 0:
            # Non-terminal update rule using the target model
            # If a reward is not from a terminal state, 
            # the reward is just a partial one (r0)
            # We should add the remaining and obtain a 
            # final reward using target predictions
            best_next_action = np.argmax(\
                             next_reward[status, :][0], axis=1)
            # adding the discounted final reward
            r[status] += np.multiply(self.gamma,
                     final_reward[status, best_next_action][0])

        # We replace the expected rewards for actions 
        # when dealing with observed actions and rewards
        expected_reward = self.model.predict(self.sess, s)
        expected_reward[range(self.batch_size), a] = r

        # We re-fit status against predicted/observed rewards
        self.model.fit(self.sess, s, expected_reward,
                       epochs=self.epochs)

# Specifying the environment

The last class to be implemented is the Environment class. Actually, the environment is provided by the gym package, though you need a good wrapper around it in order to have it work with the previous Agent class. That's exactly what this class does. At initialization, it starts the Lunar Lander game and sets key variables such as nS, nA (dimensions of state and action), agent, and the cumulative reward (useful for testing the solution by providing an average of the last 100 episodes):


In [None]:
class Environment:
    def __init__(self, game="LunarLander-v2"):
        # Initializing
        np.set_printoptions(precision=2)
        self.env = gym.make(game)
        self.env = wrappers.Monitor(self.env, tempfile.mkdtemp(), 
                               force=True, video_callable=False)
        self.nS = self.env.observation_space.shape[0]
        self.nA = self.env.action_space.n
        self.agent = Agent(self.nS, self.nA, self.env.spec.id)

        # Cumulative reward
        self.reward_avg = deque(maxlen=100)
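
The running average over the last 100 episodes relies on the deque dropping older scores automatically. A minimal sketch, using hypothetical episode scores in place of real game results:

```python
from collections import deque

import numpy as np

reward_avg = deque(maxlen=100)

# Hypothetical episode scores: 150 episodes scoring 0, 1, ..., 149.
for episode_reward in range(150):
    reward_avg.append(episode_reward)

# Only the last 100 scores (50..149) remain, so the mean is 99.5.
rolling_mean = np.average(reward_avg)
```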

# NOTE 

Using incremental training is a bit tricky, and it requires some attention if you do not want to spoil the 
results you have obtained with your training so far. The trouble is that when we restart, the brain has 
pre-trained coefficients but the memory is actually empty (we can call this a cold restart). 

Since the memory of the agent is empty, it cannot support good learning because the examples are too few and too limited. 

Consequently, the quality of the examples being fed is far from ideal for learning 
(the examples are mostly correlated with each other and very specific to the few newly experienced episodes). 

The risk of ruining the training can be mitigated by using a very low epsilon (we suggest setting it at the minimum, 0.01):
in this way, the network will most of the time simply re-learn its own weights because it will suggest 
for each state the actions it already knows, and its performance shouldn't worsen but oscillate in a stable 
way until there are enough examples in memory and it starts improving again.



Here is the code for issuing the correct methods for training and testing:

In [None]:
    def test(self):
        self.learn(epsilon=0.0, episodes=100, 
                        trainable=False, incremental=False)

    def train(self, epsilon=1.0, episodes=1000):
        self.learn(epsilon=epsilon, episodes=episodes, 
                        trainable=True, incremental=False)

    def incremental(self, epsilon=0.01, episodes=100):
        self.learn(epsilon=epsilon, episodes=episodes, 
                        trainable=True, incremental=True)

The final method is learn, arranging all the steps for the agent to interact with and learn from the environment. The method takes the epsilon value (thus overriding any previous epsilon value the agent had), the number of episodes to run in the environment, whether it is being trained or not (a Boolean flag), and whether the training is continuing from the training of a previous model (another Boolean flag).

In the first block of code, the method loads the previously trained weights of the network for Q value approximation if we want:

    1.- to test the network and see how it works;
    2.- to carry on some previous training using further examples.

Then the method delves into a nested iteration. The outer iteration runs through the required number of episodes (each episode being a Lunar Lander game played to its conclusion), whereas the inner iteration runs through a maximum of 1,000 steps making up an episode.

At each time step in the iteration, the neural network is interrogated on the next move. If it is under test, it will always simply provide the answer about the next best move. If it is under training, there is some chance, depending on the value of epsilon, that it won't suggest the best move but will instead propose making a random move:

In [None]:
    def learn(self, epsilon=None, episodes=1000, 
              trainable=True, incremental=False):
        """
        Representing the interaction between the enviroment 
        and the learning agent
        """
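        # Restoring weights if required
        if not trainable or (trainable and incremental):
            try:
                print("Loading weights")
                self.agent.load_weights('./weights.h5')
            except Exception:
                # Falling back to training from scratch
                print("Could not load weights, training from scratch")
                trainable = True
                incremental = False
                epsilon = 1.0

        # Setting epsilon (keeping the agent's value if none is given)
        if epsilon is not None:
            self.agent.epsilon = epsilon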
        # Iterating through episodes
        for episode in range(episodes):
            # Initializing a new episode
            episode_reward = 0
            s = self.env.reset()
            # s is put at default values
            s = np.reshape(s, [1, self.nS])

            # Iterating through time frames
            for time_frame in range(1000):
                if not trainable:
                    # If not learning, representing 
                    # the agent on video
                    self.env.render()
                # Deciding on the next action to take
                a = self.agent.act(s)
                # Performing the action and getting feedback
                s_p, r, status, info = self.env.step(a)
                s_p = np.reshape(s_p, [1, self.nS])

                # Adding the reward to the cumulative reward
                episode_reward += r

                # Adding the overall experience to memory
                if trainable:
                    self.agent.memory.add_memory(s, a, r, s_p,
                                                 status)

                # Setting the new state as the current one
                s = s_p

                # Performing experience replay if memory length 
                # is greater than the batch length
                if trainable:
                    if len(self.agent.memory) > \
                           self.agent.batch_size:
                        self.agent.replay()

                # When the episode is completed, 
                # exiting this loop
                if status:
                    if trainable:
                        self.agent.target_model_update()
                    break

            # Exploration vs exploitation
            self.agent.epsilon_update(episode)

            # Running an average of the past 100 episodes
            self.reward_avg.append(episode_reward)
            print("episode: %i score: %.2f avg_score: %.2f "
                  "actions %i epsilon %.2f" % (episode,
                                        episode_reward,
                           np.average(self.reward_avg),
                                            time_frame,
                                   self.agent.epsilon))
        self.env.close()

        if trainable:
            # Saving the weights for the future
            self.agent.save_weights('./weights.h5')

After the move, all the information is gathered (initial state, chosen action, obtained reward, and consequent state) and saved into memory. At this time frame, if the memory is large enough to create a batch for the neural network approximating the Q function, then a training session is run. When all the time frames of the episode have been consumed, the weights of the DQN get stored into another network to be used as a stable reference as the DQN network is learning from a new episode.


# Running the reinforcement learning process

Finally, after all the digression on reinforcement learning and DQN and writing down the complete code for the project, you can run it using a script or a Jupyter Notebook, leveraging the Environment class that puts all the code functionalities together:



In [None]:
lunar_lander = Environment(game="LunarLander-v2")

After instantiating it, you just have to run the train method, starting from epsilon=1.0 and setting the goal to 5000 episodes (which corresponds to about 2.2 million examples of chained variables of state, action and reward). The code we provided is set up to produce a fully trained DQN model, though it may take some time, depending on your GPU's availability and its computing capabilities:

In [None]:
lunar_lander.train(epsilon=1.0, episodes=5000)

In the end, the class will complete the required training, leaving a saved model on disk (which can be run or even resumed at any time). You can even inspect the training with TensorBoard, using a simple command that can be run from a shell:

In [None]:
tensorboard --logdir=./experiments --port 6006

The plots will appear in your browser, and they will be available for inspection at the local address localhost:6006.