## TODOs

- create a desc for each function
- comment code properly
- model:
    - hinton dropout
    - numpy only model
- loss # model.metrics_names: ['loss', 'acc']
- append code with full model


## Purpose

- Get started with neural nets? Convolutions, Fully-connected layers, activations
- See a thing learn is exciting (05 smartcab)
- The field of ML I know least

## Credits and Thanks

- Tambet Matiisen
https://www.nervanasys.com/demystifying-deep-reinforcement-learning/ https://github.com/tambetm/simple_dqn/blob/master/src/replay_memory.py
- Andrew Trask
https://iamtrask.github.io
- Eder Santana
http://edersantana.github.io/articles/keras_rl/
- Deep Mind
https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
- Ben Lau
https://yanpanlau.github.io/2016/07/10/FlappyBird-Keras.html
- Francois Chollet
https://github.com/fchollet/keras/tree/master/examples 
https://keras.io
- Sebastian Raschka
http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html
- Christopher Olah
http://colah.github.io/posts/2014-07-Conv-Nets-Modular/ TODO
- Karpathy
TODO



## Storyline

1.	Q-Learning 
a.	Re-visit state s_t, if already existing 
b.	perform action with maximum value (action-value: q), 

2.	Too many possible states: even with lunar lander (8 inputs). What to do?
a.	ANN-Approximate / estimate q-value based on s_t, 
b.	perform action based on the argmax of the estimation of q-values, 
c.	recalculate q-value for taken action with r_t, and (another approximation, based on s_t1) expected, discounted reward based upon s_t1 (target for supervised learning problem)
d.	calculate error for taken action (mse)
e.	backprop error through ANN
3.	Too shaky! Do experience replay: train ANN on batch of transitions every x-th timestep, in stead of single observations/s_t each time step
4.	Do proper parametrization: epsilon, alpha, gamma (dropout)
5.	1D inputs vs 2D inputs: use different input preprocessing
6.	Computationally intensive! 
a.	Use AWS. 
b.	OK. But how about alternatives?: Policy Gradients TODO COMPARE
7.	SUPPORT FOR 2
a.	What is a neural net; by this meaning fully-connected layer: Create a ANN from scratch
b.	Maybe: What is a convolution? TODO

Start with 1d, maybe 2d later TODO


## Where It Starts: Implementing the Bellman Equation, or: Pure Q-Learning

Q learning is about revisiting states. We are in a specific state s at time t, and because the state space is sufficiently small, we might discover that the agent has already been in s before. It therefore has made an experience for s by taking an action, and collecting a reward or punishment. All this is stored in the agent's "memory" (possible a dictionary of dictionaries, main keys being the states, values being the actions (keys) and their values). We want the agent to take advantage of this "memory": We look up the expected lifetime rewards per each possible action in s (a.k.a. action-value function, q values), select the maximum q value, and execute the chosen action. 

We now have fresh evidence about the consequence of a specific action in a specific state: We know the initial state s_t, we know the selected action a_t, we know the reward r_t, and we know the new state this all lead to, s_t1. This knowledge we now use to update the agent's memory: We calculate a new q value for s_t by taking the observed reward r_t, and adding to it the discounted maximum q value for s_t1. The difference to the old q value is the new q value for action a_t in state s_t.

## Why It Often Does Not Work: Exploding State Spaces

Revisiting states is often not possible, because there are simply too many combinations of relevant inputs which constitute a state. Just think a small number of inputs, each input being a floating number with 4 digits. Even this small setting is creating a large amount of combinations: the state space explodes. Revisiting states is very unlikely, we will need a huge number of trials to generate memory updates. As a consequence, learning is slow, or even not happening.

#### TODO: lunar lander with pure q learning

## Deep Q Learning: Do Not Update the Q Function Directly, but the Q function Estimator

In this situation, we replace the "revisiting states" by a function approximator: We let a Artifical Neural Net (ANN) estimate the q function for the state s the agent is visiting at time t. 

Once we performed the action based on the maximum of the q function (just the action with the highest expected lifetime reward at time t), we know the reward, and the subsequent state. 

Based on this, we are able to update the agent's memory. But this time, we do not update the q function directly. Instead, we are updating the ANN, which means that we are updating the weights used in the ANN. And that is how exactly:

- At time t, we already know the estimation of the q function for state s_t: We used it to pick an action a_t accordingly. Read again: this is the ESTIMATION of the q function.

- After action a_t, we know s_t, a_t, r_t and s_t1. This allows us to update the q function, BUT ONLY FOR THE ACTION TAKEN. We take r_t, and add to it the discounted expected lifetime reward, in other words we let the ANN estimate the q function for state s_t1. 

- For the action taken, we can update the q value now. All the other actions are not performed, we do not know about the reward, or a subsequent state s_t1. So, we cannot learn for those. This updated q function is the TARGET.

- We feed the error, which is the difference between the ESTIMATION and the TARGET.

- We backpropagate the error through the network, such that the weights are updated.

Next time we estimate the q function for another state s, we have updated weights

## Prerequisites for Deep Learning: Understanding the Gist of ANNs

### Namespace

In [94]:
import numpy as np

### Create A Basic ANN With Only Numpy

In [24]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1 - x)

np.random.seed(324)

hidden_sizes = np.arange(1, 4)
training_steps = 100

alphas = [0.01, 0.1, 1, 10, 100, 1000, 1000]

X = np.random.randint(2, size=(4, 3))
y = np.random.randint(2, size=(4, 1))
#print X
#print y

# Test the hidden sizes
for hidden_size in hidden_sizes:
    
    # Initialize 1st set of weights
    W1 = np.random.rand(X.shape[1], hidden_size)

    # Initialize 2nd set of weights
    W2 = np.random.rand(hidden_size, y.shape[1])
        
    # Test the alphas
    for alpha in alphas:
        
        for i in range(training_steps):
            # Initialize hidden (fully connected) layer
            layer_1 = sigmoid(np.dot(X, W1))

            # Initialize y (fully connected) layer
            layer_2 = sigmoid(np.dot(layer_1, W2))

            # Get loss (MSE)
            layer_2_loss = y - layer_2

            ''' Apply SGD to the loss: the more certain the estimate, the less weighted it will get: 
                The gradient at the extremes is smaller than in the middle
            '''
            layer_2_wloss = layer_2_loss * sigmoid_derivative(layer_2) # element-wise multiplication!
            #print layer_2
            #print sigmoid_derivative(layer_2) 

            # Compute the effect of the hidden layer to the weighted loss
            layer_1_loss = np.dot(layer_2_wloss, W2.T)

            # Apply SGD
            layer_1_wloss = layer_1_loss * sigmoid_derivative(layer_1)

            # Update the weights
            W2 += alpha * np.dot(layer_1.T, layer_2_wloss)
            W1 += alpha * np.dot(X.T, layer_1_wloss)

            if i == 1: first_error = np.mean(np.abs(layer_2_loss))
            if i == training_steps - 1: print "Hidden Size {}, alpha {}: Final avg loss {}, Improvement {}".format(
                                                hidden_size, alpha, np.mean(np.abs(layer_2_loss)), 
                                                np.mean(np.abs(layer_2_loss)) - first_error)  

Hidden Size 1, alpha 0.01: Final avg loss 0.332482398892, Improvement -0.0284334709819
Hidden Size 1, alpha 0.1: Final avg loss 0.173228479187, Improvement -0.156241580326
Hidden Size 1, alpha 1: Final avg loss 0.0417834158496, Improvement -0.122337346044
Hidden Size 1, alpha 10: Final avg loss 0.0114937426301, Improvement -0.0279103012607
Hidden Size 1, alpha 100: Final avg loss 0.00346337950767, Improvement -0.00744523640445
Hidden Size 1, alpha 1000: Final avg loss 0.00107518867278, Improvement -0.00221871846119
Hidden Size 1, alpha 1000: Final avg loss 0.000777709786278, Improvement -0.000287819531062
Hidden Size 2, alpha 0.01: Final avg loss 0.255131325874, Improvement -0.0323567995625
Hidden Size 2, alpha 0.1: Final avg loss 0.125350688814, Improvement -0.126584169372
Hidden Size 2, alpha 1: Final avg loss 0.0335318290234, Improvement -0.0856951767789
Hidden Size 2, alpha 10: Final avg loss 0.00919545195892, Improvement -0.0224759100392
Hidden Size 2, alpha 100: Final avg loss 0.

### Aftermath

TODO alpha, layer size

## Deep Q Learning - Step by Step

We are working with Open AI Gym (https://gym.openai.com/) as a training environment for our to-be-defined AI agent.

Lunar Lander environment is particularily appealing to me due to two reasons:
1. It is based on box2d, which simulates real life physics
2. It is a starting point for creating and tuning an AI agent with a 1D vector of 8 floating numbers as a state, and four actions: left, right, upper, and lower engine fire. 

Model building and training is done with Keras (https://keras.io). This modular, minimalist library makes ANN life as easy as it can get, and in plus runs on both Theano and Tensorflow.


### Step 1: Deep Q Learning from Single Observations


#### Namespace (Extension)

In [95]:
import gym
from keras.layers import Dense
from keras.models import Sequential

#### Set Hyperparameters

We start with a first set of static hyperparameters. Some of them will undergo changes along the way:

- D_RANGE is the number of frames the agent should take into account as the current state it is in: This is the "operational" memory of the agent. I will refer to it as time step memory.
- GAMMA is the factor by which future expected rewards are discounted.
- N_EPISODES denotes the maximum number of episodes an epoch will embrace

In [96]:
D_RANGE = [1, 16]  # to loop over!
GAMMA = 0.99
N_EPISODES = 10

#### Prepare Environment

We create an instance ENV of the Lunar lander environment. Its input dimensions INPUT_DIM are obtained by resetting the environment, and the actions by the environment method *action_space*

In [97]:
ENV = gym.make("LunarLander-v2")
INPUT_DIM = ENV.reset().shape[0]
N_ACTIONS = ENV.action_space.n
ACTIONS = np.arange(0, N_ACTIONS)

[2016-10-07 16:40:12,207] Making new env: LunarLander-v2


#### Build Keras Model

- one fully connected layer
- multicat output (softmax)

In [98]:
def _create_network():
    model = Sequential()
    model.add(Dense(200, init='glorot_normal', input_shape=(D*INPUT_DIM,))) #init='glorot_normal'
    model.add(Dense(N_ACTIONS, init='glorot_normal', activation='softmax')) #init='glorot_normal'
    model.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
    return model

#### Build Training Epoch
  
We are going to create the main building block of this exercise: The training block for one epoch.

In [23]:
def train_01(model, render):
    # Start a new epoch
    episode = 0
    success = 0

    for episode in range(N_EPISODES):
        
        x_t = ENV.reset()
        s_t = np.tile(x_t, D)
        done = False

        while not done:
            if render: ENV.render()
    
            # estimate rewards for each action (targets), at s_t
            q = model.predict(s_t[np.newaxis])[0]

            # Take action with highest estimated reward
            a_t = np.argmax(q) #argmax returns index

            # Observe after action a_t
            x_t, r_t, done, info = ENV.step(a_t)
        
            # Create state at t1: Append x observations, throw away the earliest
            s_t1 = np.concatenate((x_t, s_t[:(D-1) * INPUT_DIM,]), axis=0)

            # Estimate rewards for each action (targets), at s_t1 (again a forward pass)
            Q_sa = model.predict(s_t1[np.newaxis])[0]

            ''' Create reference/targets by updating estimated reward for chosen action
                For action taken, replace estimated reward by remaining cumulative lifetime reward
            ''' 
            targets = q
            targets[a_t] = r_t + GAMMA * np.max(Q_sa) if not done else r_t

            ''' Learn!
                - Again, predict q values for state s_t
                - Calculate loss by comparing predictions to targets: they will differ only for the action taken
                - backpropagate error for action taken, update weights
            ''' 
            model.fit(s_t[np.newaxis], targets[np.newaxis], nb_epoch=10, verbose=0)

            # Update state and episode
            s_t = s_t1
            episode += 1
        
            # Bookkeeping
            if r_t > 100: success += 1

    print "Memory Length {}, Episodes {}, Number of Successes {}".format(D, episode, success)

#### Train Model

What can learn from this basic implementation? Surely, does it make the impression to learn at all? We also should think of the agent's time step memory. Does it make sense to let the agent know only its current state, or shall we allow him to take into consideration also some of the states before? If yes, how far back should it remember? To get a first hint to answer this question, let's run a simulation. We loop over a range of D candidates, and produce some summary statistics in order to assess the performance: 

In [15]:
for D in np.arange(D_RANGE[0], D_RANGE[1]):
    # Initialize model
    model = _create_network()
    # Train model
    train_01(model, render=True)

Memory Length 1, Episodes 113, Number of Successes 0
Memory Length 2, Episodes 290, Number of Successes 0
Memory Length 3, Episodes 119, Number of Successes 0
Memory Length 4, Episodes 247, Number of Successes 0
Memory Length 5, Episodes 60, Number of Successes 0
Memory Length 6, Episodes 216, Number of Successes 0
Memory Length 7, Episodes 213, Number of Successes 0
Memory Length 8, Episodes 109, Number of Successes 0
Memory Length 9, Episodes 240, Number of Successes 0
Memory Length 10, Episodes 85, Number of Successes 0
Memory Length 11, Episodes 59, Number of Successes 1
Memory Length 12, Episodes 567, Number of Successes 0
Memory Length 13, Episodes 151, Number of Successes 0
Memory Length 14, Episodes 97, Number of Successes 0
Memory Length 15, Episodes 317, Number of Successes 1


#### Aftermath

That is quite working nicely already! The frantic, purpose-less behaviour is gone most of the time.

Some observations: 
- With only one frame in the frame memory, extensive swinging movement


- TODO Schaukelbewegung bei zu grossen D? 
- Local minima: Not engaging the lower engine, and crash relentlessly
- Getting stuck in local minima: introduce epsilon greedy action selection policy. It allows for a random move with probability epsilon and by that introduces the notion of exploration (random moves) vs. exploitation (act on estimation of the q function). 

Exploitation will reduce the probability of getting stuck in local minima, which are NOT reflecting the best action given a certain state. It's just like fresh air for the AI brain, fresh ideas from outside. 

On the other hand, the agent needs to train and get experience with his selected moves. It needs evidence that one decision was (not) the right one, and to update its decicion finding process (the weights of the ANN. It only gets it by acting according to its own decisions undisturbed by random inputs. This is where exploitation comes in.

Balancing exploration and exploitation is a also subject of learning, i.e. tweaking the parameter and follow/challenge benchmark methods. I will now introduce the random exploration rate epsilon, which determines the rate of random moves at a given time 

I will follow th custom of allowing random moves at a linearly decreasing exploration rate during the so called EXPLORATION PERIOD: 

- We define a maximum and a minimum epsilon. 
- We let epsilon decrease linearly decrease from its max to min value over the period of EXPLORATION PERIOD.
Afterwards, the agent will run on the min epsilon constantly during TRAINING PERIOD, until the total number of time steps is reached.

TODO dqn [1, 0.1], i.e. introduce an observation period at the beginning where every move is random.

Again following TODO dqn example, I assign 1/10th of the total time steps to exploration, the rest to training. 

### Step 2: *Epsilon Greedy* Deep Q Learning from Single Observations

In order to implement exploration and exploitation, we need to keep track of the number of time steps/frames only. We thus stop counting episodes (N_EPISODE) and establish global settings for 
- the total number of time steps: TOTAL_TIME_STEPS
- the total number of exploration time steps: TOTAL_EXPLORATION_STEPS

Since epsilon is not a constant, but decreasing linarily over the exploration period, we create an interval between the maximum and minimum epsilon allowed: EPSILON_RANGE. 

#### Set Hyperparameters (Extension)

In [99]:
#N_EPISODES = 10
TOTAL_TIME_STEPS = 1e2 # TODO dqn 10**7
TOTAL_EXPLORATION_STEPS = TOTAL_TIME_STEPS / 10 # TODO dqn 10**7
EPSILON_RANGE = [0.5, 0.0001] # TODO dqn [1, 0.01]

#### Build Training Epoch

In [25]:
def train_02(model, render): 
    # Start a new epoch
    episode = 0
    success = 0
    epsilon = EPSILON_RANGE[0] ##NEW Initialize epsilon at its maximum value

    #for episode in range(N_EPISODES): ###NEW Discard the episode loop
    while episode <= TOTAL_TIME_STEPS: ###NEW Install a loop over all time steps
        
        x_t = ENV.reset()
        s_t = np.tile(x_t, D)
        done = False

        while not done:
            if render: ENV.render()
            
            ###NEW Anneal random exploration rate epsilon over exploration time steps
            epsilon = epsilon - epsilon / TOTAL_EXPLORATION_STEPS if epsilon > EPSILON_RANGE[1] else EPSILON_RANGE[1]
    
            # estimate rewards for each action (targets), at s_t
            q = model.predict(s_t[np.newaxis])[0]

            # Take action with highest estimated reward 
            ###NEW Do this with probability 1-epsilon ("epsilon greedy" policy)
            a_t = np.argmax(q) if np.random.random() > epsilon else np.random.choice(ACTIONS, 1)[0]

            # Observe after action a_t
            x_t, r_t, done, info = ENV.step(a_t)
        
            # Create state at t1: Append x observations, throw away the earliest
            s_t1 = np.concatenate((x_t, s_t[:(D-1) * INPUT_DIM,]), axis=0)

            # Estimate rewards for each action (targets), at s_t1 (again a forward pass)
            Q_sa = model.predict(s_t1[np.newaxis])[0]

            ''' Create reference/targets by updating estimated reward for chosen action
                For action taken, replace estimated reward by remaining cumulative lifetime reward
            ''' 
            targets = q
            targets[a_t] = r_t + GAMMA * np.max(Q_sa) if not done else r_t

            ''' Learn!
                - Again, predict q values for state s_t
                - Calculate loss by comparing predictions to targets: they will differ only for the action taken
                - backpropagate error for action taken, update weights
            ''' 
            model.fit(s_t[np.newaxis], targets[np.newaxis], nb_epoch=10, verbose=0)

            # Update state and episode
            s_t = s_t1
            episode += 1
        
            # Bookkeeping
            if r_t > 100: success += 1

    print "Memory Length {}, Episodes {}, Number of Successes {}".format(D, episode, success)

#### Train Model

In [26]:
for D in np.arange(D_RANGE[0], D_RANGE[1]):
    # Initialize model
    model = _create_network()
    # Train model
    train_02(model, render=True)

Memory Length 1, Episodes 146, Number of Successes 0
Memory Length 2, Episodes 126, Number of Successes 0
Memory Length 3, Episodes 144, Number of Successes 0
Memory Length 4, Episodes 123, Number of Successes 0
Memory Length 5, Episodes 104, Number of Successes 0
Memory Length 6, Episodes 112, Number of Successes 0
Memory Length 7, Episodes 109, Number of Successes 0
Memory Length 8, Episodes 107, Number of Successes 0
Memory Length 9, Episodes 269, Number of Successes 0
Memory Length 10, Episodes 146, Number of Successes 0
Memory Length 11, Episodes 244, Number of Successes 0
Memory Length 12, Episodes 110, Number of Successes 0
Memory Length 13, Episodes 111, Number of Successes 0
Memory Length 14, Episodes 140, Number of Successes 0
Memory Length 15, Episodes 156, Number of Successes 0


#### Aftermath

TODO
It gets shakier. It definitively needs more time to learn. This is the cost, at which exploration comes.

## Step 3: Deep Q Learning *from Stored Experiences* 

It has been shown that learning on the fly from observations XXX TODO. dQN, TODO. Instead, the trick is to learn from a memory storage in batches, the so called Experience Replay Memory (ERM). 

We are thus going to create the main database of the agent: It is the place where it 
- stores states and its experiences with the states (transitions s_t, a_t, r_t,and s_t1)
- recalls on the memory, collects a memory sample, trains on the sample, and updates the Q function estimator.

The ERM is set up once per epoch and is fed at each time step with fresh transition evidence.

#### Namespace (Extension)

In [100]:
from collections import deque

#### Set Hyperparameters (Extension)

ERM_SIZE is setting the size of the experience replay memory. Following the recommendation of the dqn TODO, we set it equal to the number of exploration steps. The BATCH_SIZE denotes the size of the sample which is drawn uniformly without replacement from the ERM at each time step. Again, its size is following the recommendations of the dqn paper TODO. 

In [101]:
ERM_SIZE = TOTAL_EXPLORATION_STEPS
BATCH_SIZE = 32

#### Build Training Epoch

In [12]:
def train_03(model, render):
    # Start a new epoch
    episode = 0
    success = 0
    epsilon = EPSILON_RANGE[0]
    ERM = deque(maxlen=ERM_SIZE) ###NEW If too long, throw away the earliest (latest is ERM[-1])

    while episode <= TOTAL_TIME_STEPS:
        
        x_t = ENV.reset()
        s_t = np.tile(x_t, D)
        done = False

        while not done:
            if render: ENV.render()
            
            # Anneal random exploration rate epsilon over exploration time steps
            epsilon = epsilon - epsilon / TOTAL_EXPLORATION_STEPS if epsilon > EPSILON_RANGE[1] else EPSILON_RANGE[1]
    
            # estimate rewards for each action (targets), at s_t
            q = model.predict(s_t[np.newaxis])[0]

            # Take action with highest estimated reward, "epsilon greedy"
            a_t = np.argmax(q) if np.random.random() > epsilon else np.random.choice(ACTIONS, 1)[0]

            # Observe after action a_t
            x_t, r_t, done, info = ENV.step(a_t)
        
            # Create state at t1: Append x observations, throw away the earliest
            s_t1 = np.concatenate((x_t, s_t[:(D-1) * INPUT_DIM,]), axis=0)
            
            ###NEW Store transition in experience replay memory
            ERM.append((s_t, a_t, r_t, s_t1))

            ###NEW Choose a batch of maximum length BATCH_SIZE
            minibatch = np.array([ ERM[i] for i in np.random.choice(np.arange(0, len(ERM)), min(len(ERM), BATCH_SIZE)) ])
            
            ###NEW Compute targets/reference for each transition in minibatch
            inputs = deque()
            targets = deque()
            for m in minibatch:
                inputs.append(m[0]) # Append s_t of batch transition m to inputs
                m_q    = model.predict(m[0][np.newaxis])[0] # Estimate rewards for each action (targets), at s_t
                m_Q_sa = model.predict(m[3][np.newaxis])[0] # Estimate rewards for each action (targets), at s_t1
                m_targets = m_q
                m_targets[m[1]] = m[2] + GAMMA * np.max(m_Q_sa)
                targets.append(m_targets) # Append target of batch transition m to targets
                
            ###NEW Train the model by backpropagating the errors and update weights
            model.train_on_batch(np.array(inputs), np.array(targets))
            
            # Update state and episode
            s_t = s_t1
            episode += 1
        
            # Bookkeeping
            if r_t > 100: success += 1

    print "Memory Length {}, Episodes {}, Number of Successes {}".format(D, episode, success)

#### Train Model

In [30]:
for D in np.arange(D_RANGE[0], D_RANGE[1]):
    # Initialize model with frame memory D
    model = _create_network()
    # Train model
    train_03(model, render=True)

Memory Length 1, Episodes 112, Number of Successes 0
Memory Length 2, Episodes 111, Number of Successes 0
Memory Length 3, Episodes 194, Number of Successes 0
Memory Length 4, Episodes 201, Number of Successes 0
Memory Length 5, Episodes 112, Number of Successes 0
Memory Length 6, Episodes 139, Number of Successes 0
Memory Length 7, Episodes 128, Number of Successes 0
Memory Length 8, Episodes 186, Number of Successes 0
Memory Length 9, Episodes 107, Number of Successes 0
Memory Length 10, Episodes 186, Number of Successes 0
Memory Length 11, Episodes 139, Number of Successes 0
Memory Length 12, Episodes 164, Number of Successes 0
Memory Length 13, Episodes 164, Number of Successes 0
Memory Length 14, Episodes 102, Number of Successes 0
Memory Length 15, Episodes 187, Number of Successes 0


#### Aftermath

The question crossed my mind: Why don't we predict beforehand, on the fly, at every time step? Would that not be computationally efficient? This has the huge disadvantage that we predict with the knowledge available at timestep t. This might be faulty, and the faulty prediction stays as target reference in the batch, and is used to compare the loss for the taken action between the prediction at timestep t and the potenitially long ago target estimation. This will bias the learning process significantly. Thus, we select a batch, calculate the estimations and targets for the complete batch, both with the knowledge of the current time steps.

TODO is np.array(minibatch) really faster

## Step 4: Deep Q Learning from Stored Experiences, *Refined*

At this point, we will finalize training by applying some more tweaks (outside the model block).

#### Set Hyperparameters (Extension)

- The learning rate decreases over the total number of time steps. 
    - We establish a hyperparameter to establish the starting alpha (ALPHA_MAX) 
    - We set up a switch ALPHA_LIN_DECREASE_FLAG, which specifies if alpha decreases linearly or non-linearly
    - If alpha decreases nonlinarly, we set up a range of denominators for the alpha decay ALPHA_DENOM_RANGE
- Discount factor gamma. We will replace the fixed hyperparameter by an arbitrary range of gamma candidates (GAMMA_range).
- Take action every n-th time step (time step per action TSPA) TODO
- Reward Clipping (R_CLIP_FLAG) TODO. Reward clipping is a boolean.

In [135]:
ALPHA_MAX = 3
ALPHA_LIN_DECREASE_FLAG = [False, True] # to loop over!
ALPHA_DENOM_RANGE = [2, 21] # to loop over!
GAMMA_RANGE = [0.1, 1.6] # to loop over!
R_CLIP_FLAG = [False, True] # to loop over!

Last but not least we will store the weights XXX TODO JSON ODER KERAS

#### Build Training Epoch

In [136]:
def train_04(model, render):
    # Start a new epoch
    episode = 0
    success = 0
    epsilon = EPSILON_RANGE[0]
    alpha = ALPHA_MAX ###NEW Initialize the learning rate at its maximum
    ERM = deque(maxlen=ERM_SIZE)

    while episode <= TOTAL_TIME_STEPS:
        
        x_t = ENV.reset()
        s_t = np.tile(x_t, D)
        done = False

        while not done:
            if render: ENV.render()
            
            # Anneal random exploration rate epsilon over exploration time steps
            epsilon = epsilon - epsilon / TOTAL_EXPLORATION_STEPS if epsilon > EPSILON_RANGE[1] else EPSILON_RANGE[1]
    
            # estimate rewards for each action (targets), at s_t
            q = model.predict(s_t[np.newaxis])[0]

            # Take action with highest estimated reward, "epsilon greedy"
            a_t = np.argmax(q) if np.random.random() > epsilon else np.random.choice(ACTIONS, 1)[0]

            # Observe after action a_t
            x_t, r_t, done, info = ENV.step(a_t)
            
            ###NEW Clip rewards
            if R_CLIP and r_t != 0: r_t = abs(r_t) / r_t
        
            # Create state at t1: Append x observations, throw away the earliest
            s_t1 = np.concatenate((x_t, s_t[:(D-1) * INPUT_DIM,]), axis=0)
            
            # Store transition in experience replay memory
            ERM.append((s_t, a_t, r_t, s_t1))

            # Choose a batch of maximum length BATCH_SIZE
            minibatch = np.array([ ERM[i] for i in np.random.choice(np.arange(0, len(ERM)), min(len(ERM), BATCH_SIZE)) ])
            
            # Compute targets/reference for each transition in minibatch
            inputs = deque()
            targets = deque()
            for m in minibatch:
                inputs.append(m[0]) # Append s_t of batch transition m to inputs
                m_q    = model.predict(m[0][np.newaxis])[0] # Estimate rewards for each action (targets), at s_t
                m_Q_sa = model.predict(m[3][np.newaxis])[0] # Estimate rewards for each action (targets), at s_t1
                m_targets = m_q
                m_targets[m[1]] = m[2] + alpha * (GAMMA * np.max(m_Q_sa)) ###NEW Establish learning at rate alpha
                targets.append(m_targets) # Append target of batch transition m to targets
                
            # Train the model by backpropagating the errors and update weights
            model.train_on_batch(np.array(inputs), np.array(targets))
            
            ###NEW Anneal the learning rate alpha over all time steps
            if ALPHA_LIN_DECREASE: alpha -= float(alpha) / TOTAL_TIME_STEPS if alpha >= 0 else 0 
            else: alpha -= float(alpha) / ALPHA_DENOM
            
            # Update state and episode
            s_t = s_t1
            episode += 1
        
            # Bookkeeping
            if r_t > 100: success += 1          

    print "Memory Length {}, Episodes {}, Successes {}".format(D, episode, success)

#### Train Model

Fully fledged training! TODO

In [138]:
for D in np.arange(D_RANGE[0], D_RANGE[1]):
    for ALPHA_LIN_DECREASE in ALPHA_LIN_DECREASE_FLAG:
        for ALPHA_DENOM in ALPHA_DENOM_RANGE:
            for R_CLIP in R_CLIP_FLAG:
                for GAMMA in np.arange(GAMMA_RANGE[0], GAMMA_RANGE[1], 0.1): 
                    # Initialize model with frame memory D
                    model = _create_network()
                    # Train model
                    train_04(model, render=True)

Memory Length 1, Episodes 116, Number of Successes 0
Memory Length 1, Episodes 117, Number of Successes 0
Memory Length 1, Episodes 122, Number of Successes 0
Memory Length 1, Episodes 222, Number of Successes 0
Memory Length 1, Episodes 136, Number of Successes 0
Memory Length 1, Episodes 130, Number of Successes 0
Memory Length 1, Episodes 107, Number of Successes 0


KeyboardInterrupt: 