# Neural Q-Learning (Precursor to DQN)
What happens when we try to do tabular Q-Learning on non-trivial environments? If the system we are trying to control has complex internal dynamics, we would need a huge table to record every possible edge case and the associated Q-values. Very quickly our memory requirements explode as we try to scale this method up. In Neural Q-Learning we use a function approximator (in this case a neural network) to learn a Q-function. Intuitively, this can be thought of as iteratively fitting an approximating curve to a function as opposed to sampling it directly. It's very similar to plotting a trendline in Excel (in fact, the idea is the same: we are just doing regression in high dimensions).

If you have a background in engineering, there are strong analogies here to Fourier series and Taylor approximations, with the difference being that instead of finding a closed-form solution we attempt to "approach" a good enough guess via gradient descent.

But this creates another massive problem - by updating some weights of our network we can change the Q-values at states that have never been visited. In the tabular case this is impossible, as we can only make changes locally by adding or updating values in our table, whereas changing weights in the neural network changes the entire function surface. In some cases this is of course what we want - the ability to generalize to states we haven't seen. However, early on, when the Bellman updates are large, this can cause catastrophic changes to the function surface, meaning our exploration strategies become extremely misinformed, subsequent Bellman updates are incorrect, and learning fails. These problems are exacerbated by complex network structures, and were the motivation behind the commonly held belief that convolutional nets and RL were simply not compatible.

Of course, this was shown in many cases to be untrue. However, "Deep" Reinforcement Learning remains uniquely challenging - in supervised methods we can always potentially recover, whereas in RL we often can't, as the current estimate informs the collection of subsequent training data. There are many methods and techniques to mitigate this problem, but it remains an active area of research and can as yet be considered "unsolved".

In this module we are going to experiment solely with using a neural network as the Q-function approximator, and analyse its behaviour in classical control problems.
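
The non-local effect of weight updates described above can be seen in a minimal NumPy sketch. This is an illustration only, not part of the original notebook: the two feature vectors, the learning rate of 0.1, and the target of 1.0 are made-up values, and a linear model stands in for the neural network.

```python
import numpy as np

# Two hypothetical states described by overlapping feature vectors.
state_a = np.array([1.0, 0.5])   # the state we update
state_b = np.array([0.8, 0.6])   # never "visited" in this sketch

# Tabular case: one entry per state, so an update touches only that entry.
table = {"a": 0.0, "b": 0.0}
table["a"] += 0.1 * (1.0 - table["a"])   # move Q(a) toward a target of 1.0

# Linear approximator: Q(s) = w . s, with weights shared by all states.
w = np.zeros(2)
q_b_before = w @ state_b
# One gradient step on 0.5 * (target - Q(a))^2, using state_a only.
error = 1.0 - w @ state_a
w += 0.1 * error * state_a
q_b_after = w @ state_b

print("tabular Q(b):", table["b"])                    # untouched by the update
print("approx. Q(b):", q_b_before, "->", q_b_after)   # moved without visiting b
```

The table entry for the unvisited state stays where it was, while the approximator's estimate for it shifts because both states share the same weights.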

# Cartpole Environment
A classic control problem. The equations of motion are:
$$(M+m)\ddot{x} +  mL\ddot{\theta}\cos{\theta} - mL\dot{\theta}^2\sin{\theta} = F$$
$$mL\ddot{x}\cos{\theta} + mL^2\ddot{\theta} - mgL\sin{\theta} = 0$$

![Cartpole diagram](./images/cartpole.png "Frictionless Cartpole")
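
The two equations of motion above are linear in $\ddot{x}$ and $\ddot{\theta}$, so they can be solved as a 2x2 system at any given state. The sketch below does this with Cramer's rule. The function name and the default values for `M`, `m`, `L` and `g` are illustrative assumptions, not values taken from this notebook or from Gym, and $\theta$ is assumed to be measured from the upright position.

```python
import math

def cartpole_accelerations(theta, theta_dot, force, M=1.0, m=0.1, L=0.5, g=9.8):
    """Solve the two equations of motion for (x_ddot, theta_ddot).

    Rearranged as A @ [x_ddot, theta_ddot] = b:
      (M+m) x_ddot      + mL cos(theta) theta_ddot = F + mL theta_dot^2 sin(theta)
      mL cos(theta) x_ddot + mL^2 theta_ddot       = mgL sin(theta)
    """
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    a11, a12 = M + m, m * L * cos_t
    a21, a22 = m * L * cos_t, m * L ** 2
    b1 = force + m * L * theta_dot ** 2 * sin_t
    b2 = m * g * L * sin_t
    det = a11 * a22 - a12 * a21          # equals m L^2 (M + m sin^2(theta)) > 0
    x_ddot = (b1 * a22 - a12 * b2) / det
    theta_ddot = (a11 * b2 - a21 * b1) / det
    return x_ddot, theta_ddot

# A pole tilted slightly from upright, with no force applied, accelerates away.
print(cartpole_accelerations(theta=0.05, theta_dot=0.0, force=0.0))
```

With the pole exactly upright and at rest, and no applied force, both accelerations are zero. This is the unstable equilibrium the controller has to maintain.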

## Step 1: Imports and default hyperparameters
Make sure the following packages are installed:
- Jupyter
- TensorFlow (1.x; the code below uses the TF1 graph API, e.g. `tf.placeholder`)
- OpenAI Gym
- NumPy (should install with TensorFlow)

```bash
pip install jupyter tensorflow gym
```

Begin by importing the following libraries and calling some initial setup code to load the gym environment (what we are going to test our algorithm on) into memory.

In [3]:
import gym
import tensorflow as tf
import numpy as np
import random
import datetime

GAMMA = 0.9  # discount factor for target Q
INITIAL_EPSILON = 0.6  # starting value of epsilon
FINAL_EPSILON = 0.05  # final value of epsilon
EPSILON_DECAY_STEPS = 200
REPLAY_SIZE = 10000  # experience replay buffer size
BATCH_SIZE = 30  # size of minibatch
TEST_FREQUENCY = 100  # How many episodes to run before visualizing test accuracy
SAVE_FREQUENCY = 1000  # How many episodes to run before saving model

HIDDEN_NODES = 20

ENV_NAME = 'CartPole-v0'
EPISODE = 2000  # Episode limitation
STEP = 200  # Step limitation in an episode
TEST = 10  # Number of test episodes to run at each evaluation

# Load the Gym environment
env = gym.make(ENV_NAME)
replay_buffer = []
time_step = 0
epsilon = INITIAL_EPSILON
STATE_DIM = env.observation_space.shape[0]
ACTION_DIM = env.action_space.n
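
The three epsilon hyperparameters above suggest an annealed epsilon-greedy exploration policy. How the notebook applies them is not shown in this section, so the following is only a sketch of one common choice, a linear decay. The helper names `decayed_epsilon` and `epsilon_greedy` are hypothetical.

```python
import random

INITIAL_EPSILON = 0.6
FINAL_EPSILON = 0.05
EPSILON_DECAY_STEPS = 200

def decayed_epsilon(step):
    """Linearly anneal epsilon from INITIAL_EPSILON to FINAL_EPSILON (assumed schedule)."""
    fraction = min(step / EPSILON_DECAY_STEPS, 1.0)
    return INITIAL_EPSILON - fraction * (INITIAL_EPSILON - FINAL_EPSILON)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

for step in (0, 100, 200, 500):
    print(step, decayed_epsilon(step))
```

Early on the agent acts randomly more than half the time. Once `EPSILON_DECAY_STEPS` have passed, it settles at a small residual exploration rate.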


## Step 2: Define our Q-network

Before we write any RL code, let's construct the network we want to use to approximate our Q-values. First, let's write a convenience function that creates a densely connected layer according to some input parameters. Note that we could just use `tf.layers.dense` here, but it's helpful for understanding to write things out explicitly.

In [4]:
def dense_layer(inp, layersize, activation=tf.nn.relu):
    inp_size = inp.get_shape().as_list()[1]
    w = tf.Variable(tf.truncated_normal(shape=[inp_size, layersize]))
    b = tf.Variable(tf.constant(0.01, shape=[layersize]))
    pre_activ = tf.matmul(inp, w) + b
    post_activ = activation(pre_activ)

    return post_activ

Now let's define the inputs and outputs we want our network to have as tf placeholders. The `None` in the shape argument leaves the batch dimension unspecified, so it is determined at runtime by our batch size.

In [5]:
state_in = tf.placeholder("float", [None, STATE_DIM])
action_in = tf.placeholder("float", [None, ACTION_DIM]) 
target_in = tf.placeholder("float", [None])

All that remains for this step is to wire these placeholders together, using our dense layer function, in order to structure the neural network we wish to use to approximate the q-function.

Try to create, at minimum, the following tensors:
- `q_values`: tensor containing the Q-values for every action
- `q_action`: Q-value for the action specified in `action_in`
- `loss`: the value the network is aiming to minimize
- `optimizer`: optimizer for the network

The suggested structure is to have each output node represent the Q-value for one action, e.g. for CartPole there will be two output nodes.

Hint: Given how q-values are used within RL, is it necessary to have output
activation functions?

In [7]:
h1 = dense_layer(
    state_in,
    HIDDEN_NODES,
    activation=tf.nn.relu
)

h2 = dense_layer(
    h1,
    HIDDEN_NODES,
    activation=tf.nn.relu
)

q_values = dense_layer(
    h2,  # feed the second hidden layer (not h1) into the output layer
    ACTION_DIM,
    activation=lambda x: x  # i.e. no activation
)

q_action = tf.reduce_sum(tf.multiply(q_values, action_in), axis=1)

We just created an MLP with two hidden layers of ReLUs and a raw (no activation) output layer. In addition, we included the `action_in` placeholder by using it to mask the raw Q-values output, which tells us the Q-value of a specific action. To train this network, however, we still need to define a loss which we can use as the starting point for backpropagation. In this example we will use the standard MSE between what we think the Q-value should be and what we got, and will use the Adam optimizer to minimize this loss. We also add a summary op to log the loss.
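
The masking step described above can be checked outside of TensorFlow. The NumPy sketch below mirrors `tf.reduce_sum(tf.multiply(q_values, action_in), ...)` on a made-up batch. The Q-values and actions are hypothetical numbers chosen for illustration.

```python
import numpy as np

ACTION_DIM = 2

# Hypothetical network outputs for a batch of 3 states (one column per action).
q_values = np.array([[0.2, 1.5],
                     [0.7, -0.3],
                     [0.0, 0.4]])
actions = np.array([1, 0, 1])              # index of the action taken in each state

# One-hot encode the actions: this is what would be fed to `action_in`.
one_hot = np.eye(ACTION_DIM)[actions]      # shape (3, 2)

# Multiplying zeroes out the other action's Q-value; summing collapses the row.
q_action = np.sum(q_values * one_hot, axis=1)   # shape (3,)
print(q_action)
```

Each entry of `q_action` is the Q-value of the action that was actually taken, which is the quantity the loss compares against the target.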

In [11]:
loss = tf.reduce_mean(tf.square(target_in - q_action))
optimizer = tf.train.AdamOptimizer(0.0003).minimize(loss)
train_cost_op = tf.summary.scalar("Loss", loss)

## Step 3: Define the target
We now have a nicely defined network structure which we can use to learn the Q-function, as well as a loss to minimize. However, this loss function, $(target - Q(s,a))^2$, still requires a "target" argument. In this simple extension of tabular Q-Learning, our "target" is simply the Q-value given by a one-step Bellman update. In other words, once we have a network (either randomly initialized, halfway through learning, or almost done), we gather some experience with it (using the Q-network to inform exploration), then run an iteration of $TD(0)$ using the collected rewards. We then train the Q-network using these new values (the $TD(0)$ updates) as the targets.
Formally, we want to perform the following update:
$$Q(s,a) \leftarrow Q(s,a) + \alpha\left(r + \gamma \max_x Q(s', x) - Q(s,a)\right)$$

where $s'$ is the state reached after taking action $a$ in state $s$, and $r + \gamma \max_x Q(s', x)$ is the $target$ in our loss function. In other words, we want to move the Q-value toward the target at a rate set by the learning rate, $\alpha$; the term inside the brackets is the gap between the target and our current estimate.
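
As a minimal sketch of how such targets might be computed for a batch, the NumPy function below bootstraps from the Q-values of the state reached after each action. The function name `td_targets` and the `dones` flag are assumptions introduced here. Dropping the bootstrap term on terminal transitions is standard Q-learning practice, but it is not stated in the text above.

```python
import numpy as np

GAMMA = 0.9  # discount factor, as in the hyperparameters above

def td_targets(rewards, next_q_values, dones, gamma=GAMMA):
    """One-step targets: r + gamma * max_x Q(next_state, x).

    next_q_values has shape (batch, num_actions) and holds the network's
    Q-values for the state reached after each action. For terminal
    transitions (dones == True) there is no next state to bootstrap from,
    so the target is just the reward (an assumption, see the note above).
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q_values = np.asarray(next_q_values, dtype=float)
    dones = np.asarray(dones, dtype=bool)
    bootstrap = gamma * next_q_values.max(axis=1)
    return rewards + np.where(dones, 0.0, bootstrap)

targets = td_targets(
    rewards=[1.0, 1.0],
    next_q_values=[[0.5, 2.0], [3.0, 1.0]],
    dones=[False, True],
)
print(targets)
```

The resulting array is what would be fed to `target_in`, alongside the matching states and one-hot actions, when running the optimizer.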