# 09. Policy Gradient Methods

Value based methods like Q-learning extract the policy out of the value function. But what do you do if the action space is to huge or has continous values? Policy Gradient Methods just like genetic algorithms work directly on the policy and don't take the detour over the value function. <br>
<br>
__Advantages of policy gradient methods:__ <br>
- value based methods can have huge oscilattions while training since very small changes in Q can result in other policies, following the gradient on the other hand should result in smooth policy updates
- Policy gradients can handle high dimensional action spaces
- Policy gradients can learn stochastic policies which is advantageous in games like stone paper scissors
<br>
<br>
__Disadvantages:__ <br>
- getting stuck in local optima
- can take more iterations for optimization

## Getting some intuition

In supervised learning you get labels with each dataset. e.g. images of cats and dogs and the job is to classify them. Lets say the network outputs a probability of [0.6, 0.4] while the label was [0, 1], the standard approach would be to subtract the prediction from the label / feed it in a loss function and backpropagate the error such that the weights get shifted rowards better predictions.

Policy gradients are very similar to this approach. Instead of a given Label we got some reward. If we recieved high rewards during the episode, all actions we took are seen as correct decisions.
In the case of low rewards we want to change the weights of our policy such that the bad decisions are taken less likely.

A simple criterium to seperate good from bad decisions could be to subtract the average reward from our current one. A positive number means we got more than usual, a negative number means the agent performed worse, and the value can be used as a weight, how much we want to change the policy. Lets call this number advantage (adv)

For every step of the episode our policy gives us the action probablity:
state --> policy --> probability to take an action (prob_a) e.g. [0.6, 0.4] 

Lets say we got a reward which was better than average and the taken action was [1.0, 0], than we can use this sample just like in supervised learning for backpropagation with model.train_on_batch(state, [1.0*adv, 0]) to change the parameter such that the output gets closer to our desired output.

In the case of a low reward (less than average), we want to lower the probability for the chosen action. This is implemented using model.train_on_batch(state, [1.0*adv, 0]) where adv is a negative number.

Lets see if this simple approach already works

In [44]:
import gym
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from collections import deque
from keras.initializers import RandomUniform
np.random.seed(3)

env_name = 'FrozenLake-v0'
#env_name = 'Taxi-v2'

env = gym.make(env_name)
state_space = env.observation_space.n
action_space = env.action_space.n

gamma = 0.999
num_games = 30000

reward_list = deque(maxlen=100)
learning_rate = 0.1

def create_model():
    model = Sequential()
    model.add(Dense(10, input_dim=state_space, activation='relu'))
    model.add(Dense(action_space, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=learning_rate))
    return model

def softmax(x):
    return 

model = create_model()
for game in range(num_games):
    state = env.reset()
    state = np.identity(state_space)[state:state+1] # transforms state into 1-hot-encoding
    done = False
    episode_reward = 0
    s = []
    a = []

    while(not done): 
        action_prob = model.predict(state)[0]
        action = np.random.choice(action_space, p=action_prob)

        state_next, reward, done, _ = env.step(action)
        state_next = np.identity(state_space)[state_next:state_next+1]
        
        s.append(state[0])
        a.append(np.identity(action_space)[action:action+1][0])
        
        episode_reward += reward

        state = state_next
        if(done):
            reward_list.append(episode_reward)
            
            R = episode_reward - np.mean(reward_list)
            
            y = np.array(a)*R
            model.train_on_batch(np.array(s), y)
            
            if(game%500 == 0):
                print('episode: ', game, 'avg reward: ', np.mean(reward_list))


episode:  0 avg reward:  0.0
episode:  500 avg reward:  0.01
episode:  1000 avg reward:  0.02
episode:  1500 avg reward:  0.02
episode:  2000 avg reward:  0.01
episode:  2500 avg reward:  0.0
episode:  3000 avg reward:  0.02
episode:  3500 avg reward:  0.02
episode:  4000 avg reward:  0.03
episode:  4500 avg reward:  0.0
episode:  5000 avg reward:  0.01
episode:  5500 avg reward:  0.03
episode:  6000 avg reward:  0.02
episode:  6500 avg reward:  0.02
episode:  7000 avg reward:  0.02
episode:  7500 avg reward:  0.01
episode:  8000 avg reward:  0.01
episode:  8500 avg reward:  0.0
episode:  9000 avg reward:  0.03
episode:  9500 avg reward:  0.02
episode:  10000 avg reward:  0.03
episode:  10500 avg reward:  0.06
episode:  11000 avg reward:  0.04
episode:  11500 avg reward:  0.04
episode:  12000 avg reward:  0.04
episode:  12500 avg reward:  0.02
episode:  13000 avg reward:  0.05
episode:  13500 avg reward:  0.02
episode:  14000 avg reward:  0.02
episode:  14500 avg reward:  0.03
episode:

Now lets try to get to something similar with some math to back it up.

When trying to derive backprop in supervised learning we first need a cost function. Since we don't want to minimize the cost but instead increase the reward this is called score function.

## Policy score function J($\theta$) / also called advantage function
The policy score function calculates the expected reward of a policy. Three methos work equally well depending on the environment.
### 1. Episodic environments: Using the mean from start to end
Calculate the mean return from start to the end of the episode <br>
$J_1(\theta) = E_\pi [R_1 + \gamma R_2 + \gamma^2 R_3 + ...] = E_\pi(v(s_1)) = V^{\pi_\theta}(s_1)$

### 2. Continous environments: Using the average value
In continous environments we can not rely on a specific start state. Instead, the states are weighted based on how often they are visited and this weight is multiplied with the expected reward from this state onwards. <br>
$J_{avg}(\theta) = E_\pi(V(s)) = \sum_s d(s)V(s)$ with $d(s) = N(s)/(\sum_{s'} N(s'))$

### 3. Continous environments: Using the average reward per time step
sum over probability to be in state s <br>
multiplied by <br>
sum over probability to take action a from that state unter that policy <br>
multiplied by <br>
Reward for that action in that state <br>
$J_{avR}(\theta) = E_\pi(r) = \sum_s d(s) \sum_a \pi_\theta (a|s) R^a_s$


### Policy gradient ascent
Instead of gradient descent for a loss function like in supervised learning, we use gradient ascent for the score function. <br>
Update: $\theta$ <-- $ \theta + \alpha \nabla_\theta J(\theta)$

$ \nabla_\pi J(\theta) = \nabla_\theta \sum_t \pi(a|s, \theta)R(t)$ <br>
$ \nabla_\pi J(\theta) =  \sum_t \nabla_\theta \pi(a|s, \theta)R(t)$ <br>

Using the Likelihood ratio trick  <br>
\begin{equation*}
\nabla log(x) =  \frac{\nabla x}{x}
\end{equation*}
We divide and multiply by $\pi(a | s, \theta)$, which results in: <br>
\begin{equation*}
\nabla_\theta \pi(a | s, \theta) = \pi(a | s, \theta)  \frac{ \nabla_\theta \pi(a | s, \theta) }{ \pi(a | s, \theta) }
\end{equation*}

$ \nabla_\pi J(\theta) =  \sum_t \nabla_\theta \pi(t, \theta)R(t) = \sum_t  \pi(t, \theta) \nabla_\theta (log (\pi(t, \theta)))R(t) $ <br>
 <br>
Since $\sum_t  \pi(t, \theta)$ is a sum over the probabilies we can convert the sum to an expectation <br>
$ \nabla_\pi J(\theta) =  E_\pi [ \nabla_\theta (log (\pi(t, \theta)))R(t) ] $ <br>
<br>
As we have seen in the monte-carlo approach, expectations can be approximated by m empirical sample episodes. <br>
$ \nabla_\pi J(\theta) =  1/m \sum_{i=1}^m \nabla_\theta (log (\pi(t_i, \theta)))R(t_i)  $ <br>
<br>
 
We can change the update rule: <br>
$\theta$ <-- $\theta + \alpha \nabla_\theta(log(\pi(a|s,\theta)))R(t)$ <br>

Keras can calculate the gradients for us and update the network but the loss function has to be implemented using Keras-Backend. 
For the rewards two tricks can be used to speed up the computation. Subtract the mean of the past rewards from the reward and devide it by the standard deviation of past rewards. Deviding by the standard deviation scales the gradients and subtracting the mean lets the agent take steps into the oposite direction (cases where the agent ) had a worse performance than usual. 

In [40]:
# seed=1, 2, 3 functions nicely
# seed=4 initializes the neural net such that the network gets unstable, 
# outputs a probability of zero to take a certain action and the log(0) is not defiend. 
import numpy as np
np.random.seed(2)

from tensorflow import set_random_seed
set_random_seed(2)
import gym
from keras.models import Model
from keras.layers import Dense, Input
from keras.optimizers import SGD, Adam, RMSprop
import keras.backend as K
from collections import deque
from keras.initializers import RandomUniform

env_name = 'CartPole-v0'

env = gym.make(env_name)
state_space = env.observation_space.shape[0]
action_space = env.action_space.n
batch_size = 10
gamma = 0.99
num_games = 300

reward_list = deque(maxlen=100)
losses = []

learning_rate = 0.01

def discount_rewards(r, gamma=gamma):
    """Takes 1d float array of rewards and computes discounted reward
    e.g. f([1, 1, 1], 0.99) -> [2.9701, 1.99, 1]
    """
    prior = 0
    out = []
    for val in r:
        new_val = val + prior * gamma
        out.append(new_val)
        prior = new_val
    return np.array(out[::-1])


def create_model(env):
    num_actions = env.reset().shape    
    inp = Input(shape=[state_space], name='input_x')
    adv = Input(shape=[1], name='advantage')
    
    hidden1 = Dense(8, input_dim=state_space, activation='relu', 
                    kernel_initializer='random_uniform', use_bias=False)(inp)
    outp = Dense(action_space, activation='softmax', 
                 kernel_initializer='random_uniform', use_bias=False)(hidden1)
    
    def custom_loss(y_true, y_pred):
        log_prob = -K.log(y_pred)
        return K.sum(log_prob*adv, keepdims=True)


    model_train = Model(inputs=[inp, adv], outputs=outp)
    model_train.compile(loss=custom_loss, optimizer=Adam(lr=learning_rate))
    model_predict = Model(inputs=[inp], outputs=outp)
    return model_train, model_predict


model_train, model_predict = create_model(env)
s = np.empty(0).reshape(0, state_space)
a = np.empty(0).reshape(0,1)
r = np.empty(0).reshape(0,1)
discounted_rewards = np.empty(0).reshape(0,1)

np.seterr(all='raise') # throws an exception if log(0) in the loss function messes up the gradient
for game in range(num_games):
    state = env.reset()
    state = np.reshape(state, [1, state_space])
    done = False
    episode_reward = 0

    while(not done): 
        action_prob = model_predict.predict(state)
        action = np.random.choice(action_space, p=action_prob[0])
        state_next, reward, done, _ = env.step(action)
        state_next = np.reshape(state_next, [1, state_space])
        s = np.vstack([s, state])
        r = np.vstack([r, reward])
        a = np.vstack([a, action])
        episode_reward += reward
        state = state_next
        
        if(done):
            discounted_rewards_episode = discount_rewards(r, gamma=gamma)
            discounted_rewards = np.vstack([discounted_rewards, discounted_rewards_episode])
            reward_list.append(episode_reward)
            r = np.empty(0).reshape(0,1)
            
            if (game+1)%10 == 0:
                print('game: ', game+1, 'avg reward: ', np.mean(reward_list))
                
            discounted_rewards -= discounted_rewards.mean()
            discounted_rewards /= (discounted_rewards.std() + 1e-5)
            discounted_rewards = discounted_rewards.squeeze()
            a = a.squeeze().astype(int)

            actions_train = np.zeros([len(a), action_space])
            actions_train[np.arange(len(a)), a] = 1
            
            loss = model_train.train_on_batch([s, discounted_rewards], actions_train)
            losses.append(loss)

            # Clear out game variables
            s = np.empty(0).reshape(0,state_space)
            a = np.empty(0).reshape(0,1)
            discounted_rewards = np.empty(0).reshape(0,1)
            

game:  10 avg reward:  20.0
game:  20 avg reward:  19.5
game:  30 avg reward:  19.0
game:  40 avg reward:  25.3
game:  50 avg reward:  30.22
game:  60 avg reward:  34.2
game:  70 avg reward:  37.17142857142857
game:  80 avg reward:  40.7625
game:  90 avg reward:  43.44444444444444
game:  100 avg reward:  44.77
game:  110 avg reward:  51.49
game:  120 avg reward:  58.11
game:  130 avg reward:  66.63
game:  140 avg reward:  74.61
game:  150 avg reward:  81.39
game:  160 avg reward:  93.38
game:  170 avg reward:  106.45
game:  180 avg reward:  118.73
game:  190 avg reward:  131.79
game:  200 avg reward:  146.12
game:  210 avg reward:  156.77
game:  220 avg reward:  167.3
game:  230 avg reward:  176.9
game:  240 avg reward:  184.37
game:  250 avg reward:  190.8
game:  260 avg reward:  192.97
game:  270 avg reward:  193.25
game:  280 avg reward:  194.38
game:  290 avg reward:  194.83
game:  300 avg reward:  194.15
