# Main Header

## TODOs

- MERGE LEARNING FROM OBS AND FROM BATCH, MAYBE WEIGHT OBS MORE THAN BATCH
- MERGE TWO MODELS: EXISTING DOES WORK NICE FOR OBS LEARNING, OTHER MODEL FOR BATCH LEARNING
- resume = False

- create a desc for each function
- comment code properly
- model:
    - hinton dropout keras.layers.core.Dropout(p)
    - numpy only model
- loss # model.metrics_names: ['loss', 'acc']
- append code with full model
- Computationally intensive! / Use Spark on AWS

## Getting Started

### Purpose

- Get started with neural nets, Convolutions?, Fully-connected layers, activations . not eventually but anyway
- Seeing a thing learn is exciting (05 smartcab)
- The field of ML I know least
- NOT: 2d inputs, convolutions / different input preprocessing // too: I.astype(np.float).ravel()

### Credits and Thanks

- Tambet Matiisen
https://www.nervanasys.com/demystifying-deep-reinforcement-learning/ https://github.com/tambetm/simple_dqn/blob/master/src/replay_memory.py
- Andrew Trask
https://iamtrask.github.io
- Eder Santana
http://edersantana.github.io/articles/keras_rl/
- Deep Mind
https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
- Ben Lau
https://yanpanlau.github.io/2016/07/10/FlappyBird-Keras.html
- Francois Chollet
https://github.com/fchollet/keras/tree/master/examples 
https://keras.io
- Sebastian Raschka
http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html
- Christopher Olah
http://colah.github.io/posts/2014-07-Conv-Nets-Modular/ TODO
- Karpathy
TODO 

### Main Dependencies

We are working with OpenAI Gym (https://gym.openai.com/) as a training environment for our to-be-defined AI agent.

The Lunar Lander environment (https://gym.openai.com/envs/LunarLander-v2) is particularly appealing to me. It is based on box2d, which simulates real-life physics. Charming! And, with its 1D input state vector, it is a first step for creating and tuning an AI agent, before we proceed with convolution preprocessing of 2D inputs.

The environment home page says the following:

*"Landing pad is always at coordinates (0,0). Coordinates are the first two numbers in state vector. Reward for moving from the top of the screen to landing pad and zero speed is about 100..140 points. If lander moves away from landing pad it loses reward back. Episode finishes if the lander crashes or comes to rest, receiving additional -100 or +100 points. Each leg ground contact is +10. Firing main engine is -0.3 points each frame. Solved is 200 points. Landing outside landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. Four discrete actions available: do nothing, fire left orientation engine, fire main engine, fire right orientation engine."*

Model building and training is done with Keras (https://keras.io). This modular, minimalist library makes ANN life as easy as it can get and, as a bonus, runs on both Theano and TensorFlow backends.

## Chapter 1: The Starting Point. Q-Learning, or: Evaluating Bellman Equations from data

Q learning is about revisiting states. We are in a specific state s at time t, and because the state space is sufficiently small, we might discover that the agent has already been in s before. It has therefore gathered experience for s by taking an action and collecting a reward or punishment. All this is stored in the agent's "memory" (possibly a dictionary of dictionaries, main keys being the states, values being the actions (keys) and their values). We want the agent to take advantage of this "memory": We look up the expected lifetime rewards for each possible action in s (a.k.a. action-value function, q values), select the maximum q value, and execute the chosen action.

We now have fresh evidence about the consequence of a specific action in a specific state: We know the initial state s_t, we know the selected action a_t, we know the reward r_t, and we know the new state this all led to, s_t1. We now use this knowledge to update the agent's memory: We calculate a target q value for s_t and a_t by taking the observed reward r_t and adding to it the discounted maximum q value for s_t1. The old q value for action a_t in state s_t is then moved towards this target by a fraction ALPHA of the difference between the two.
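The update can be sketched in a few lines. This is a minimal illustration, not the training code of this notebook: the state names, the reward, and the helper `q_update` are made up for the example, while `alpha` and `gamma` play the roles of the learning rate and the discount factor.

```python
# Minimal sketch of one tabular Q-learning update (illustrative values only).
def q_update(q_table, s_t, a_t, r_t, s_t1, alpha=0.1, gamma=0.99):
    """Move Q(s_t, a_t) towards the target r_t + gamma * max_a Q(s_t1, a)."""
    old_q = q_table[s_t][a_t]
    target = r_t + gamma * max(q_table[s_t1].values())
    q_table[s_t][a_t] = old_q + alpha * (target - old_q)
    return q_table[s_t][a_t]

# Hypothetical memory: two states with two actions each
q_table = {"s0": {0: 0.0, 1: 0.0}, "s1": {0: 1.0, 1: 2.0}}
new_q = q_update(q_table, "s0", a_t=1, r_t=1.0, s_t1="s1")
print(new_q)  # roughly 0.298, i.e. 0.1 * (1.0 + 0.99 * 2.0)
```

Only the entry for the action actually taken changes; all other entries of the table stay as they were.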

### Why Q-Learning Often Does Not Work: Exploding State Spaces

Revisiting states is often not possible, even in the long run, because there are simply too many combinations of relevant inputs which constitute a state. Just think of a small number of inputs, each input being a floating-point number with 4 digits. Even this small setting creates a huge number of combinations: the state space explodes. Revisiting states is very unlikely, so we would need a huge number of trials to generate memory updates. As a consequence, learning is slow, or does not happen at all.
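A back-of-the-envelope calculation makes the explosion tangible. The numbers below are assumptions for the sake of the example: eight inputs (the length I assume for the Lunar Lander state vector) and 10,000 distinguishable values per input.

```python
# Rough count of distinct states under the assumptions stated above.
n_inputs = 8                 # assumption: length of the state vector
values_per_input = 10 ** 4   # assumption: 4 digits per input

n_states = values_per_input ** n_inputs
print("Distinct states: {:.0e}".format(n_states))
```

Compared to that, a Q table with a few thousand entries covers a vanishingly small part of the state space.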

In order to illustrate the point, we are going to set up a basic q-learning algorithm:

#### Namespace

In [5]:
import numpy as np
import gym

#### Set Hyperparameters

- GAMMA is the factor by which future expected rewards are discounted
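- ALPHA is the learning rate: the fraction of the difference between the new target and the old q value that is applied in each update
- N_EPISODES denotes the maximum number of episodes an epoch will comprise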
- Q_TABLE is the agent's memory. The states are the keys, and dictionaries of actions and their respective values are the values
- VALUE_INIT is the initial value for the actions of a state, once the state is visited the first time. It is set to zero.

In [6]:
GAMMA = 0.99
ALPHA = 0.1
N_EPISODES = 100
Q_TABLE = {}
VALUE_INIT = 0

#### Prepare Environment

We create an instance ENV of the Lunar Lander environment. Its input dimensions INPUT_DIM are obtained by resetting the environment, and the number of actions from the environment attribute *action_space*.

In [7]:
ENV = gym.make("LunarLander-v2")
INPUT_DIM = ENV.reset().shape[0]
N_ACTIONS = ENV.action_space.n
ACTIONS = np.arange(0, N_ACTIONS)

[2016-10-11 15:04:19,964] Making new env: LunarLander-v2


#### Build Training Epoch

We are going to create the main building block of this exercise: The training block for one epoch.

In [4]:
def train_ql(render):
    # Start a new epoch
    revisited_state = 0
    episode = 0
    success = 0
    solved = False

    for episode in range(N_EPISODES):
        
        x_t = ENV.reset()
        s_t = tuple(x_t)
        done = False

        while not done:
            if render: ENV.render()
            
            # Look up the action with the highest value at s_t, as well as its value
            a_t, q, r_s  = best_action(s_t)
            
            # Observe after action a_t
            x_t, r_t, done, info = ENV.step(a_t)
            s_t1 = tuple(x_t)
            
            # Look up the action with the highest value at s_t1, as well as its value
            a_t1, Q_sa, _ = best_action(s_t1)
            
            # Update q for s_t and a_t, in hindsight
            Q_TABLE[s_t][a_t] = q + ALPHA * (r_t + GAMMA * Q_sa - q)
            
            # Update state and revisited state counter
            s_t = s_t1
            revisited_state += r_s
            
            # Bookkeeping
            if r_t >= 100: success += 1
            if r_t >= 200: 
                solved = True
                break
                
    print("Number of states in Q table: {}, Number of revisited states: {}, Successes {}, Solved {}".format(len(Q_TABLE), revisited_state, success, solved))

# Create helper function to initialize and query Q-table
def best_action(state):
    if state not in Q_TABLE or sum(Q_TABLE[state].values()) == 0:
        # Bookkeeping: Create revisited states counter
        revisit_state = 0
        # Initialize q function
        q_function = {}
        for A in ACTIONS: q_function[A] = VALUE_INIT
        Q_TABLE[state] = q_function
        # Do random action
        action = np.random.choice(ACTIONS, 1)[0]
    else: 
        revisit_state = 1
        # Select action according to max q
        action = max(Q_TABLE[state], key=Q_TABLE[state].get)
    # Get q value for action selected
    q = Q_TABLE[state][action]
    return action, q, revisit_state

#### Train Model

In [5]:
train_ql(False)

Number of states in Q table: 9429, Number of revisited states: 0, Successes 6, Solved False


#### Aftermath

TODO

## Chapter 2: What to Do? Do Not Update the Q Function, But the Q  Function Estimator: Deep Q Learning

In this situation, we replace the "revisiting states" by a function approximator: We let an Artificial Neural Network (ANN) estimate the q function for the state s the agent is visiting at time t.

Once we performed the action based on the maximum of the q function (just the action with the highest expected lifetime reward at time t), we know the reward, and the subsequent state. 

Based on this, we are able to update the agent's memory. But this time, we do not update the q function directly. Instead, we update the ANN, which means that we update the weights used in the ANN. Here is how, exactly:

- At time t, we already know the estimation of the q function for state s_t: We used it to pick an action a_t accordingly. Read again: this is the ESTIMATION of the q function.

- After action a_t, we know s_t, a_t, r_t and s_t1. This allows us to update the q function, BUT ONLY FOR THE ACTION TAKEN. We take r_t and add to it the discounted expected lifetime reward: we let the ANN estimate the q function for state s_t1, take its maximum, and discount it by GAMMA.

- For the action taken, we can now update the q value. The other actions were not performed, so we know neither their reward nor their subsequent state, and we cannot learn anything about them. This updated q function is the TARGET.

- We compute the error, which is the difference between the ESTIMATION and the TARGET.

- We backpropagate the error through the network, such that the weights are updated.

Next time we estimate the q function for another state s, we do so with the updated weights.
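The difference between ESTIMATION and TARGET can be illustrated with plain numpy. The q values and the reward below are hypothetical numbers standing in for the outputs of the network; only the way the target is built is the point here.

```python
import numpy as np

GAMMA = 0.99  # discount factor, same role as in the notebook

# Hypothetical network outputs (estimated q values for 4 actions)
q_s_t = np.array([0.2, 0.5, 0.1, 0.3])    # ESTIMATION for state s_t
q_s_t1 = np.array([0.4, 0.1, 0.6, 0.2])   # estimation for the next state s_t1

a_t = int(np.argmax(q_s_t))  # greedy action
r_t = 1.0                    # hypothetical observed reward

# TARGET: a copy of the estimation, changed only for the action taken
target = q_s_t.copy()
target[a_t] = r_t + GAMMA * np.max(q_s_t1)

# The error is non-zero only for the action taken
error = target - q_s_t
print(error)
```

All entries of the error vector except the one at index `a_t` are zero, so backpropagation only corrects the estimate for the action that was actually performed.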

### Understanding the Gist of ANNs

Firstly, I wanted to understand what a neural network really does. [iamtrask]'s excellent toy examples helped me understand it. I create a network with one hidden layer and an output layer, both sigmoid-activated, which returns probabilities for each of the outputs.

layer_2_wloss ("weighted loss") is where the magic happens: Each output loss is multiplied by the slope / gradient of the predicted value on the sigmoid curve.
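The weighting by the slope can be looked at in isolation. The following sketch uses made-up inputs; note that `sigmoid_derivative` expects the sigmoid output, not the raw input.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(out):
    # Expects the sigmoid OUTPUT, not the raw input
    return out * (1 - out)

# Three predictions: confidently low, uncertain, confidently high
predictions = sigmoid(np.array([-4.0, 0.0, 4.0]))
slopes = sigmoid_derivative(predictions)

# The same raw loss gets the most weight where the prediction is least certain
raw_loss = np.array([0.1, 0.1, 0.1])
weighted_loss = raw_loss * slopes
print(slopes)
print(weighted_loss)
```

The slope peaks at 0.25 for a prediction of 0.5 and shrinks towards both extremes, so confident predictions are changed less than uncertain ones.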

#### Create A Basic ANN With Only Numpy

In [12]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1 - x)

np.random.seed(324)

hidden_sizes = np.arange(1, 4)
training_steps = 100

alphas = [0.01, 0.1, 1, 10, 100, 1000, 1000]

X = np.random.randint(2, size=(4, 3))
y = np.random.randint(2, size=(4, 1))

# Test the hidden sizes
for hidden_size in hidden_sizes:
    
    # Initialize 1st set of weights
    W1 = np.random.rand(X.shape[1], hidden_size)

    # Initialize 2nd set of weights
    W2 = np.random.rand(hidden_size, y.shape[1])
        
    # Test the alphas
    for alpha in alphas:
        
        for i in range(training_steps):
            # Forward propagate
            # Initialize hidden layer (fully connected)
            layer_1 = np.dot(X, W1) 
            layer_1 = sigmoid(layer_1)# Apply sigmoid activation
            
            # Initialize output layer(fully connected)
            layer_2 = np.dot(layer_1, W2)
            layer_2 = sigmoid(layer_2) # Apply sigmoid activation 

            # Get loss
            layer_2_loss = y - layer_2

            ''' Weight the loss by the slope of the sigmoid at the prediction:
                the more certain the estimate, the less weight its loss gets, because
                the gradient at the extremes is smaller than in the middle
            '''
            layer_2_wloss = layer_2_loss * sigmoid_derivative(layer_2) # element-wise multiplication!
            
            # Backpropagate
            # Compute the effect of the hidden layer to the weighted loss
            layer_1_loss = np.dot(layer_2_wloss, W2.T)

            # Weight the hidden layer loss by the slope of the sigmoid
            layer_1_wloss = layer_1_loss * sigmoid_derivative(layer_1)
                
            # Update the weights
            W2 += alpha * np.dot(layer_1.T, layer_2_wloss)
            W1 += alpha * np.dot(X.T, layer_1_wloss)

            if i == 1: first_error = np.mean(np.abs(layer_2_loss))
            if i == training_steps - 1: print("Hidden Size {}, alpha {}: Final avg loss {}, Improvement {}".format(
                                                hidden_size, alpha, np.mean(np.abs(layer_2_loss)), 
                                                np.mean(np.abs(layer_2_loss)) - first_error))

Hidden Size 1, alpha 0.01: Final avg loss 0.332482398892, Improvement -0.0284334709819
Hidden Size 1, alpha 0.1: Final avg loss 0.173228479187, Improvement -0.156241580326
Hidden Size 1, alpha 1: Final avg loss 0.0417834158496, Improvement -0.122337346044
Hidden Size 1, alpha 10: Final avg loss 0.0114937426301, Improvement -0.0279103012607
Hidden Size 1, alpha 100: Final avg loss 0.00346337950767, Improvement -0.00744523640445
Hidden Size 1, alpha 1000: Final avg loss 0.00107518867278, Improvement -0.00221871846119
Hidden Size 1, alpha 1000: Final avg loss 0.000777709786278, Improvement -0.000287819531062
Hidden Size 2, alpha 0.01: Final avg loss 0.255131325874, Improvement -0.0323567995625
Hidden Size 2, alpha 0.1: Final avg loss 0.125350688814, Improvement -0.126584169372
Hidden Size 2, alpha 1: Final avg loss 0.0335318290234, Improvement -0.0856951767789
Hidden Size 2, alpha 10: Final avg loss 0.00919545195892, Improvement -0.0224759100392
Hidden Size 2, alpha 100: Final avg loss 0.

#### Aftermath

TODO alpha, layer size

### Deep Q Network Step by Step. Step 1: Deep Q Learning from Single Observations

#### Namespace (Extension)

In [5]:
from keras.layers import Dense
from keras.models import Sequential

Using TensorFlow backend.


#### Set Hyperparameters (Extension)

We start with a first set of static hyperparameters. Some of them will undergo changes along the way:

- D_RANGE is the range of candidate values for D, the number of most recent time steps the agent takes into account as its current state: This is the "operational" memory of the agent. I will refer to it as time step memory.
- Note that we do not use ALPHA, Q_TABLE, and VALUE_INIT anymore
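How the time step memory is filled and rolled forward can be shown with a toy example. The dimensions and observations below are made up (the notebook reads INPUT_DIM from the environment); the two numpy calls are the ones used in the training functions.

```python
import numpy as np

INPUT_DIM = 3  # assumption for the example only
D = 2          # remember the two most recent observations

x_0 = np.array([0.0, 0.1, 0.2])   # hypothetical first observation
s_t = np.tile(x_0, D)             # initial state: first observation repeated D times

x_1 = np.array([1.0, 1.1, 1.2])   # hypothetical next observation
# Put the newest observation in front and drop the oldest one at the end
s_t1 = np.concatenate((x_1, s_t[:(D - 1) * INPUT_DIM]))

print(s_t)
print(s_t1)
```

The state vector always has length D * INPUT_DIM, which is why the input shape of the network depends on D.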

In [6]:
D_RANGE = np.arange(1, 21) # Constant over one epoch

#### Build Keras Model

- one fully connected layer
- multicat output (softmax)

In [7]:
def _create_network():
    model = Sequential()
    model.add(Dense(200, init='glorot_normal', input_shape=(D*INPUT_DIM,))) #default is: init='glorot_uniform' 
    model.add(Dense(N_ACTIONS, init='glorot_normal', activation='softmax')) # default is: init='glorot_uniform' 
    model.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
    return model

#### Build Training Epoch
  
TODO modified; in what sense, compared to 00

In [21]:
def train_dqn1(model, render):
    # Start a new epoch
    episode = 0
    success = 0
    solved = False

    for episode in range(N_EPISODES):
        
        x_t = ENV.reset()
        s_t = np.tile(x_t, D)
        done = False

        while not done:
            if render: ENV.render()
    
            # Estimate rewards for each action (targets), at s_t
            q = model.predict(s_t[np.newaxis])[0]

            # Take action with highest estimated reward
            a_t = np.argmax(q) #argmax returns index

            # Observe after action a_t
            x_t, r_t, done, info = ENV.step(a_t)
        
            # Create state at t1: Append x observations, throw away the earliest
            s_t1 = np.concatenate((x_t, s_t[:(D-1) * INPUT_DIM,]), axis=0)

            # Estimate rewards for each action (targets), at s_t1 (again a forward pass)
            Q_sa = model.predict(s_t1[np.newaxis])[0]

            ''' Create reference/targets by updating estimated reward for chosen action
                For action taken, replace estimated reward by remaining cumulative lifetime reward
            ''' 
            targets = q
            targets[a_t] = r_t + GAMMA * np.max(Q_sa) if not done else r_t

            ''' Learn!
                - Again, predict q values for state s_t
                - Calculate loss by comparing predictions to targets: they will differ only for the action taken
                - backpropagate error for action taken, update weights
            ''' 
            model.fit(s_t[np.newaxis], targets[np.newaxis], nb_epoch=10, verbose=0)

            # Update state
            s_t = s_t1
        
            # Bookkeeping
            if r_t >= 100: success += 1
            if r_t >= 200: 
                solved = True
                break

    print("D {}, Successes {}, Solved {}".format(D, success, solved))

#### Train Model

What can we learn from this basic implementation? First of all, does the agent give the impression of learning at all? We should also think about the agent's time step memory. Does it make sense to let the agent know only its current state, or shall we allow it to also take into consideration some of the states before? If yes, how far back should it remember? To get a first hint, let's run a simulation. We loop over a range of D candidates, and produce some summary statistics in order to assess the performance:

In [23]:
for D in D_RANGE:
    # Initialize model
    model = _create_network()
    # Train model
    train_dqn1(model, render=True)

KeyboardInterrupt: 

#### Aftermath

That works quite nicely already! The frantic, purposeless behaviour is gone most of the time, or vanishes within a few episodes.

Some observations: 
- Too little time step memory? With only one frame in the time step memory, I often observed extensive swinging movement: the agent tries to correct with the left or right engine, fires too much, and the lander tips over to the opposite side. Then it corrects again and again, and with all these corrections it forgets to fire against the moon's gravity and crashes into the surface. TODO Which are the best?
- Local minima. There are plenty of times one can see the agent trapped in a locally optimal policy. For example, it stays on the ground, engaging left and right engine forever, perfectly stable, but not reaching the ultimate goal. Or a setting where left and right engines are engaged, but the lower engine does not fire at all, over long episodes.

### Deep Q Network Step by Step.  Step 2: *Epsilon Greedy* Deep Q Learning from Single Observations

At this point, let us tackle the issue of getting stuck in local minima. As a remedy, Reinforcement Learning makes use of the so-called *epsilon greedy* action selection policy. It allows for a random move with probability epsilon, and by that introduces the notion of exploration (random moves) vs. exploitation (acting on the estimation of the q function).

*Exploration* will reduce the probability of getting stuck in local minima, which do *not* reflect the best action given a certain experience level of the agent. It's just like fresh air for the AI brain, introducing random ideas from outside.

On the other hand, the agent needs to train and gain experience with its selected moves. It needs evidence that a decision was (or was not) the right one, in order to update its decision-finding process (the weights of the ANN). It only gets this evidence by acting according to its own decisions, undisturbed by random inputs. This is where *exploitation* comes in.

In RL, usually epsilon decreases over a certain exploration period. This reflects the idea that the agent will start with many random moves to fathom the environment by just observing. With time and growing experience, it will decrease the share of random moves, since it feels more confident in its own decisions. 

I will follow the custom of allowing random moves at a linearly decreasing exploration rate during the exploration period. The dqn paper TODO starts with epsilon = 1, i.e. complete randomness. Let us see what brings good results here. Balancing exploration and exploitation is in itself subject to learning.

Concretely, we now define an interval between the maximum and minimum epsilon allowed: EPSILON_RANGE.

In order to implement exploration and exploitation, we only need to keep track of the number of time steps. We thus stop counting episodes (N_EPISODES) and establish global settings for the total number of time steps (TOTAL_TIME_STEPS) and the total number of exploration time steps (TOTAL_EXPLORATION_STEPS).

Again following TODO the dqn paper, I assign 1/10th of the total time steps to exploration, the rest to training.

During training, the agent will run on the min epsilon constantly.
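A linearly decreasing exploration rate can be written as a function of the time step. This is only a sketch: `linear_epsilon` is a hypothetical helper that does not appear in the training code, and the numbers are chosen for illustration.

```python
def linear_epsilon(step, exploration_steps, eps_max, eps_min):
    """Hypothetical helper: exploration rate after `step` time steps."""
    if step >= exploration_steps:
        return eps_min  # exploration period is over: stay at the minimum
    fraction = float(step) / exploration_steps
    return eps_max - fraction * (eps_max - eps_min)

# Illustrative numbers only
for step in (0, 10, 20, 100):
    eps = linear_epsilon(step, exploration_steps=20, eps_max=0.5, eps_min=0.0001)
    print("step {}: epsilon {}".format(step, eps))
```

Epsilon starts at its maximum, reaches its minimum exactly at the end of the exploration period, and stays there for the rest of the training.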

#### Set Hyperparameters (Extension)

In [8]:
TOTAL_TIME_STEPS = 1e2 * 2 # TODO dqn 10**7
TOTAL_EXPLORATION_STEPS = TOTAL_TIME_STEPS / 10
EPSILON_RANGE = [0.5, 0.0001] # TODO dqn [1, 0.01]

#### Build Training Epoch

In [13]:
def train_dqn2(model, render): 
    # Start a new epoch
    episode = 0
    success = 0
    solved = False
    epsilon = EPSILON_RANGE[0] ##NEW Initialize epsilon at its maximum value

    #for episode in range(N_EPISODES): ###NEW Discard the episode loop
    while episode <= TOTAL_TIME_STEPS: ###NEW Install a loop over all time steps
        
        x_t = ENV.reset()
        s_t = np.tile(x_t, D)
        done = False

        while not done:
            if render: ENV.render()
            
            ###NEW Anneal random exploration rate epsilon linearly over exploration period
            epsilon = max(EPSILON_RANGE[1], epsilon - (EPSILON_RANGE[0] - EPSILON_RANGE[1]) / TOTAL_EXPLORATION_STEPS)
    
            # Estimate rewards for each action (targets), at s_t
            q = model.predict(s_t[np.newaxis])[0]

            # Take action with highest estimated reward 
            ###NEW Do this with probability 1-epsilon ("epsilon greedy" policy)
            a_t = np.argmax(q) if np.random.random() > epsilon else np.random.choice(ACTIONS, 1)[0]

            # Observe after action a_t
            x_t, r_t, done, info = ENV.step(a_t)
        
            # Create state at t1: Append x observations, throw away the earliest
            s_t1 = np.concatenate((x_t, s_t[:(D-1) * INPUT_DIM,]), axis=0)

            # Estimate rewards for each action (targets), at s_t1 (again a forward pass)
            Q_sa = model.predict(s_t1[np.newaxis])[0]

            ''' Create reference/targets by updating estimated reward for chosen action
                For action taken, replace estimated reward by remaining cumulative lifetime reward
            ''' 
            targets = q
            targets[a_t] = r_t + GAMMA * np.max(Q_sa) if not done else r_t

            ''' Learn!
                - Again, predict q values for state s_t
                - Calculate loss by comparing predictions to targets: they will differ only for the action taken
                - backpropagate error for action taken, update weights
            ''' 
            model.fit(s_t[np.newaxis], targets[np.newaxis], nb_epoch=10, verbose=0)

            # Update state and episode
            s_t = s_t1
            episode += 1
        
            # Bookkeeping
            if r_t >= 100: success += 1
            if r_t >= 200: 
                solved = True
                break

    print("D {}, Successes {}, Solved {}".format(D, success, solved))

#### Train Model

In [14]:
for D in D_RANGE:
    # Initialize model
    model = _create_network()
    # Train model
    train_dqn2(model, render=False)

D 1, Successes 1, Solved False


KeyboardInterrupt: 

#### Aftermath

TODO
It gets shakier. It definitely needs more time to learn. This is the price of exploration.

### Deep Q Network Step by Step. Step 3: Deep Q Learning *from Stored Experiences* 

It has been shown that learning on the fly from observations XXX TODO. dQN, TODO. Instead, the trick is to learn from a memory storage in batches, the so-called Experience Replay Memory (ERM).

We are thus going to create the main database of the agent. It is the place where the agent
- stores states and its experiences with them (transitions s_t, a_t, r_t, and s_t1)
- recalls memories: it collects a memory sample, trains on the sample, and updates the Q function estimator.

The ERM is set up once per epoch and is fed at each time step with fresh transition evidence.
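The mechanics of such a memory can be sketched with the standard library alone. The sizes and transitions below are made up for illustration, and the extra `done` flag in each transition is my assumption about what one would want to store.

```python
import random
from collections import deque

ERM_SIZE = 5     # illustrative; far smaller than any realistic memory
BATCH_SIZE = 3   # illustrative

erm = deque(maxlen=ERM_SIZE)  # once full, the oldest transition is dropped

# Feed hypothetical transitions (s_t, a_t, r_t, s_t1, done)
for t in range(8):
    erm.append((t, t % 4, 0.0, t + 1, False))

# Draw a batch uniformly, without replacement
batch = random.sample(list(erm), min(len(erm), BATCH_SIZE))

print(len(erm))  # 5: only the most recent transitions remain
print(batch)
```

Because the deque has a maximum length, old experience is forgotten automatically, and sampling at random breaks the strong correlation between consecutive time steps.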

#### Namespace (Extension)

In [9]:
from collections import deque

#### Set Hyperparameters (Extension)

ERM_SIZE sets the size of the experience replay memory. Following the recommendation of the dqn TODO, we set it equal to the number of exploration steps. BATCH_SIZE denotes the size of the sample which is drawn uniformly without replacement from the ERM at each time step. Again, its size follows the recommendations of the dqn paper TODO.

In [10]:
ERM_SIZE = int(TOTAL_EXPLORATION_STEPS) # deque(maxlen=...) requires an integer
BATCH_SIZE = 32

#### Build Training Epoch

In [None]:
def train_dqn3(model, render):
    # Start a new epoch
    episode = 0
    success = 0
    solved = False
    epsilon = EPSILON_RANGE[0]
    ERM = deque(maxlen=ERM_SIZE) ###NEW If too long, throw away the earliest (latest is ERM[-1])

    while episode <= TOTAL_TIME_STEPS:
        
        x_t = ENV.reset()
        s_t = np.tile(x_t, D)
        done = False

        while not done:
            if render: ENV.render()
            
            # Anneal random exploration rate epsilon linearly over exploration period
            epsilon = max(EPSILON_RANGE[1], epsilon - (EPSILON_RANGE[0] - EPSILON_RANGE[1]) / TOTAL_EXPLORATION_STEPS)
    
            # Estimate rewards for each action (targets), at s_t
            q = model.predict(s_t[np.newaxis])[0]

            # Take action with highest estimated reward, "epsilon greedy"
            a_t = np.argmax(q) if np.random.random() > epsilon else np.random.choice(ACTIONS, 1)[0]

            # Observe after action a_t
            x_t, r_t, done, info = ENV.step(a_t)
        
            # Create state at t1: Append x observations, throw away the earliest
            s_t1 = np.concatenate((x_t, s_t[:(D-1) * INPUT_DIM,]), axis=0)
            
            ###NEW Store transition in experience replay memory (including the terminal flag)
            ERM.append((s_t, a_t, r_t, s_t1, done))

            ###NEW Choose a batch of maximum length BATCH_SIZE, uniformly and without replacement
            batch_idx = np.random.choice(len(ERM), min(len(ERM), BATCH_SIZE), replace=False)
            minibatch = [ ERM[i] for i in batch_idx ]
            
            ###NEW Compute targets/reference for each transition in minibatch
            inputs = deque()
            targets = deque()
            for m_s_t, m_a_t, m_r_t, m_s_t1, m_done in minibatch:
                inputs.append(m_s_t) # Append s_t of batch transition m to inputs
                m_q    = model.predict(m_s_t[np.newaxis])[0] # Estimate rewards for each action (targets), at s_t
                m_Q_sa = model.predict(m_s_t1[np.newaxis])[0] # Estimate rewards for each action (targets), at s_t1
                m_targets = m_q
                # For terminal transitions there is no future reward to add
                m_targets[m_a_t] = m_r_t if m_done else m_r_t + GAMMA * np.max(m_Q_sa)
                targets.append(m_targets) # Append target of batch transition m to targets
                
            ###NEW Train the model by backpropagating the errors and update weights
            model.train_on_batch(np.array(inputs), np.array(targets))
            
            # Update state and episode
            s_t = s_t1
            episode += 1
        
            # Bookkeeping
            if r_t >= 100: success += 1
            if r_t >= 200: 
                solved = True
                break

    print("D {}, Successes {}, Solved {}".format(D, success, solved))

#### Train Model

In [None]:
for D in D_RANGE:
    # Initialize model
    model = _create_network()
    # Train model
    train_dqn3(model, render=False)

#### Aftermath

The question crossed my mind: Why don't we predict once, on the fly, at every time step, and store the prediction together with the transition? Would that not be computationally more efficient? The huge disadvantage is that we would predict with the knowledge available at time step t. That prediction might be faulty, and it would stay in the memory as target reference: the loss for the action taken would then be computed between a current prediction and a potentially long outdated target estimate. This would bias the learning process significantly. Thus, we select a batch and calculate the estimations and targets for the complete batch with the knowledge of the current time step.

### Deep Q Network Step by Step. Step 4: Deep Q Learning from Stored Experiences, *Refined*

At this point, we will finalize training by applying some more tweaks (outside the model block). The goal is to check parameter combinations by brute force. We install some extra bookkeeping, and will save a model for every training epoch.

#### Namespace (Extension)

In [11]:
from sys import stdout
import json
from keras.models import model_from_json
from os import getcwd, path

#### Set Hyperparameters (Extension)

- Discount factor gamma. We will replace the fixed hyperparameter by an arbitrary range of gamma candidates (GAMMA_RANGE).
- Take action every n-th time step (time step per action TSPA_RANGE) TODO
- Reward Treatment (R_TREATMENT). TODO
- SAVE_PATH is the working directory where the training epoch models are stored
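The three reward treatments can be compared side by side. This is a sketch of how I read 'clip' and 'normalize'; the helper `treat_reward` and the reward history are hypothetical and only serve the illustration.

```python
import numpy as np

def treat_reward(r_t, history, treatment):
    """Hypothetical helper; `history` holds the raw rewards seen so far."""
    if treatment == 'clip':
        return float(np.sign(r_t))   # -1, 0, or +1
    if treatment == 'normalize':
        std = np.std(history)
        # Guard against a zero standard deviation (e.g. a single reward so far)
        return (r_t - np.mean(history)) / std if std > 0 else 0.0
    return r_t                        # 'none': leave the reward as it is

history = [-0.3, 1.2, -100.0, 10.0]   # illustrative raw rewards
for treatment in ['clip', 'normalize', 'none']:
    print("{}: {}".format(treatment, treat_reward(-100.0, history, treatment)))
```

Clipping keeps only the sign of a reward, while normalizing keeps the relative size but depends on the rewards collected so far.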

In [12]:
GAMMA_RANGE = np.arange(0.0, 1.01, 0.01) # Constant over one epoch
TSPA_RANGE = np.arange(1, 5) # Constant over one epoch
R_TREATMENT = ['clip', 'normalize', 'none'] # Constant over one epoch
SAVE_PATH = getcwd()

#### Build Training Epoch

In [16]:
def train_dqn4(model, render):
    # Start a new epoch
    episode = 0
    success = 0
    solved = False
    epsilon = EPSILON_RANGE[0]
    ERM = deque(maxlen=ERM_SIZE)
    MH5 = path.join(SAVE_PATH, "Models", MODEL_ID+".h5") ###NEW Define path and name of h5 container
    MJS = path.join(SAVE_PATH, "Models", MODEL_ID+".json") ###NEW Define path and name of json container

    while episode <= TOTAL_TIME_STEPS:
        
        x_t = ENV.reset()
        s_t = np.tile(x_t, D)
        rs = deque() ###NEW TODO
        done = False

        while not done:
            if render: ENV.render()
            
            # Anneal random exploration rate epsilon linearly over exploration period
            epsilon = max(EPSILON_RANGE[1], epsilon - (EPSILON_RANGE[0] - EPSILON_RANGE[1]) / TOTAL_EXPLORATION_STEPS)
    
            # Estimate rewards for each action (targets), at s_t
            q = model.predict(s_t[np.newaxis])[0]

            # Take action with highest estimated reward, "epsilon greedy"
            ###NEW Act only every n-th time step
            if episode % TSPA == 0: a_t = np.argmax(q) if np.random.random() > epsilon else np.random.choice(ACTIONS, 1)[0]

            # Observe after action a_t
            x_t, r_t, done, info = ENV.step(a_t)
            
            ###NEW Treat rewards; keep the raw reward r_t for bookkeeping
            rs.append(r_t) # Collect raw rewards of the current episode
            r_train = r_t
            ###NEW Clip rewards to -1, 0 or +1 (R_T is the treatment chosen for this epoch)
            if R_T == 'clip': r_train = np.sign(r_t)
            ###NEW Normalize rewards with mean and standard deviation of the rewards collected so far
            elif R_T == 'normalize' and np.std(rs) > 0: r_train = (r_t - np.mean(rs)) / np.std(rs)
            
            # Create state at t1: Append x observations, throw away the earliest
            s_t1 = np.concatenate((x_t, s_t[:(D-1) * INPUT_DIM,]), axis=0)
            
            # Store transition in experience replay memory (including the terminal flag)
            ERM.append((s_t, a_t, r_train, s_t1, done))

            # Choose a batch of maximum length BATCH_SIZE, uniformly and without replacement
            batch_idx = np.random.choice(len(ERM), min(len(ERM), BATCH_SIZE), replace=False)
            minibatch = [ ERM[i] for i in batch_idx ]
            
            # Compute targets/reference for each transition in minibatch
            inputs = deque()
            targets = deque()
            for m_s_t, m_a_t, m_r_t, m_s_t1, m_done in minibatch:
                inputs.append(m_s_t) # Append s_t of batch transition m to inputs
                m_q    = model.predict(m_s_t[np.newaxis])[0] # Estimate rewards for each action (targets), at s_t
                m_Q_sa = model.predict(m_s_t1[np.newaxis])[0] # Estimate rewards for each action (targets), at s_t1
                m_targets = m_q
                # For terminal transitions there is no future reward to add
                m_targets[m_a_t] = m_r_t if m_done else m_r_t + GAMMA * np.max(m_Q_sa)
                targets.append(m_targets) # Append target of batch transition m to targets
                
            # Train the model by backpropagating the errors and update weights
            model.train_on_batch(np.array(inputs), np.array(targets))
                
            ###NEW Save progress (weights and architecture) every 100 iterations
            if episode % 100 == 0:
                model.save_weights(MH5, overwrite=True)
                with open(MJS, "w") as outfile: json.dump(model.to_json(), outfile)
            
            # Update state and episode
            s_t = s_t1
            episode += 1
        
            # Bookkeeping
            if r_t >= 100: success += 1
            if r_t >= 200: 
                solved = True
                break

    return success, solved ###NEW return values for epoch bookkeeping
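
The minibatch loop above calls `model.predict` twice per transition. As a rough sketch of the same target rule, `target[a_t] = r_t + GAMMA * max(Q(s_t1))`, computed for a whole batch at once: `predict_fn` is a hypothetical stand-in for `model.predict` (a fixed random linear map), and `batch_targets` is an illustrative helper, not part of the notebook. As in the loop above, terminal transitions are not treated specially.

```python
import numpy as np

def batch_targets(predict_fn, states, actions, rewards, next_states, gamma):
    """Q-learning targets for a batch of (s_t, a_t, r_t, s_t1) transitions."""
    q_now = predict_fn(states).copy()   # shape (batch, n_actions), estimates at s_t
    q_next = predict_fn(next_states)    # estimates at s_t1
    rows = np.arange(len(actions))
    # Only the entry of the action actually taken is replaced by the target
    q_now[rows, actions] = rewards + gamma * q_next.max(axis=1)
    return q_now

# Hypothetical stand-in for model.predict: 8 state inputs, 4 actions
rng = np.random.RandomState(0)
W = rng.randn(8, 4)
predict_fn = lambda s: s.dot(W)

# Made-up transitions, for illustration only
states = rng.randn(5, 8)
next_states = rng.randn(5, 8)
actions = np.array([0, 3, 1, 2, 0])
rewards = np.array([1.0, -1.0, 0.0, 1.0, -1.0])

targets = batch_targets(predict_fn, states, actions, rewards, next_states, gamma=0.9)
```

With a real network, two batched `predict` calls would replace the `2 * BATCH_SIZE` single-sample calls, and the resulting array could be passed to `train_on_batch` directly.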

#### Train Model

In [20]:
max_success = 0 ###NEW Establish counter to detect champion parameter settings
best_epochs = deque(maxlen=50) ###NEW Establish a list to catch the 50 best performing parameter settings and their performance
epoch = 0

print "Session contains {} epochs\n".format(len(D_RANGE)*len(GAMMA_RANGE)*len(TSPA_RANGE)*len(R_TREATMENT))

for D in D_RANGE:
    # Initialize model
    model = _create_network()
    for GAMMA in GAMMA_RANGE: 
        for TSPA in TSPA_RANGE:
            for R_T in R_TREATMENT:
                # Count epochs
                epoch += 1
                # Create model id
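                # str() instead of repr() so the slice is not just the quote character;
                # three letters keep 'none' and 'normalize' apart
                MODEL_ID = repr(D)+"_"+str(round(GAMMA, 2))+"_"+repr(TSPA)+"_"+str(R_T)[:3]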
                # Train model
                success, solved = train_dqn4(model, render=True) 
                stdout.write("\rEpoch: {}, D: {}, Gamma: {}, Time steps per action: {}, Reward treatment: {}, Successes: {}, Solved: {}".format(
                    epoch, D, round(GAMMA, 2), TSPA, R_T, success, solved))
                stdout.flush()
                if solved or (success >= max_success and success > 0):
                    max_success = success
                    best_epochs.append((D, GAMMA, TSPA, R_T, success))
                    print "\n\nBest Epochs:"
                    for be in best_epochs: print be
                    print "\n"
                    # Note: this break only leaves the innermost (reward treatment) loop
                    if solved: break

Session contains 24240 epochs

Epoch: 3, D: 1, Gamma: 0.0, Time steps per action: 1, Reward treatment: none, Successes: 1, Solved: False

Best Epochs:
(1, 0.0, 1, 'none', 1)


Epoch: 8, D: 1, Gamma: 0.0, Time steps per action: 3, Reward treatment: normalize, Successes: 0, Solved: False

KeyboardInterrupt: 
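
Since `break` only leaves the innermost loop, a solved run in the nested loops above would not end the session. One possible way to flatten the four loops so that a single `break` stops everything, sketched with `itertools.product`. The ranges and `fake_train` are made-up placeholders, not the notebook's values or `train_dqn4`.

```python
from itertools import product

# Placeholder ranges, chosen only for illustration
D_RANGE = [1, 2]
GAMMA_RANGE = [0.0, 0.5, 0.9]
TSPA_RANGE = [1, 3]
R_TREATMENT = ['none', 'clip', 'normalize']

def fake_train(D, GAMMA, TSPA, R_T):
    """Hypothetical stand-in for a training run: returns (success, solved)."""
    solved = (D == 2 and GAMMA == 0.5 and TSPA == 1 and R_T == 'clip')
    return int(solved), solved

# product() iterates in the same order as the nested loops (D varies slowest)
grid = list(product(D_RANGE, GAMMA_RANGE, TSPA_RANGE, R_TREATMENT))
n_epochs = len(grid)

visited = []
last_D = None
for epoch, (D, GAMMA, TSPA, R_T) in enumerate(grid, start=1):
    if D != last_D:
        # A new D changes the input size, so a fresh model would be built here
        last_D = D
    success, solved = fake_train(D, GAMMA, TSPA, R_T)
    visited.append((D, GAMMA, TSPA, R_T))
    if solved:
        break  # one flat loop, so this ends the whole session
```

The length of `grid` also gives the session's epoch count without multiplying the range lengths by hand.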

## Policy Gradients

In [1]:
import numpy as np
import gym

In [7]:
from keras.layers import Dense
from keras.models import Sequential

Using TensorFlow backend.


In [8]:
ENV = gym.make("LunarLander-v2")
INPUT_DIM = ENV.reset().shape[0]
N_ACTIONS = ENV.action_space.n
ACTIONS = np.arange(0, N_ACTIONS)

[2016-10-10 20:23:31,367] Making new env: LunarLander-v2


In [9]:
def _create_network1():
    model = Sequential()
    # Hidden layer; no activation is given, so it is linear (default init is 'glorot_uniform')
    model.add(Dense(200, init='glorot_normal', input_shape=(D*INPUT_DIM,)))
    # Output layer: softmax over the actions (default init is 'glorot_uniform')
    model.add(Dense(N_ACTIONS, init='glorot_normal', activation='softmax'))
    model.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
    return model

In [21]:
from keras.models import Model
from keras.layers import Input, Dense

i = Input(shape=(D*INPUT_DIM,))
h = Dense(200, init='glorot_normal')(i)
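# The layer must be called on h; passing the uncalled Dense layer to Model raised the exception recorded below
o = Dense(N_ACTIONS, init='glorot_normal', activation='softmax')(h)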
model = Model(input=i, output=o)

Exception: Output tensors to a Model must be Keras tensors. Found: <keras.layers.core.Dense object at 0x114a492d0>

In [11]:
D = 11
x_t = ENV.reset()
s_t = np.tile(x_t, D)
xs, hs, dlogps, drs = [],[],[],[]
model = _create_network2()

# forward the policy network and sample an action from the returned probability
aprob = model.predict(s_t[np.newaxis])[0]
# aprob, h = policy_forward(x) # return probability of taking action 2, and hidden state
# action = 2 if np.random.uniform() < aprob else 3 # roll the dice!

# record various intermediates (needed later for backprop)
xs.append(s_t)
#hs.append(h)

TypeError: add() got an unexpected keyword argument 'input_shape'
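
The commented-out lines above decide between two actions from a single probability. With a softmax output over all actions, one possible way to "roll the dice" is to draw the action index from that distribution. In this sketch `aprob` is a made-up probability vector standing in for `model.predict(s_t[np.newaxis])[0]`, four actions is an assumption for illustration, and `sample_action` is a hypothetical helper.

```python
import numpy as np

rng = np.random.RandomState(42)

# Made-up softmax output for four actions (stand-in for the network's prediction)
aprob = np.array([0.1, 0.6, 0.2, 0.1])

def sample_action(probs, rng):
    """Draw one action index according to the given probabilities."""
    probs = np.asarray(probs, dtype=np.float64)
    probs = probs / probs.sum()  # renormalize in case of float32 rounding
    return rng.choice(len(probs), p=probs)

# Draw many actions and compare empirical frequencies with aprob
draws = np.array([sample_action(aprob, rng) for _ in range(5000)])
freq = np.bincount(draws, minlength=len(aprob)) / float(len(draws))
```

Unlike the epsilon-greedy rule used for the DQN, exploration here comes from the policy's own probabilities, so no separate epsilon is needed.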