# 09. Policy Gradient Methods

Value-based methods like Q-learning derive the policy from the value function. But what do you do if the action space is too large or continuous? Policy gradient methods, like genetic algorithms, work directly on the policy and skip the detour through the value function. <br>
<br>
__Advantages of policy gradient methods:__ <br>
- Value-based methods can oscillate heavily during training, since a very small change in Q can switch the greedy action and therefore the policy; following the gradient, on the other hand, should result in smooth policy updates
- Policy gradients can handle high-dimensional action spaces
- Policy gradients can learn stochastic policies, which is advantageous in games like rock-paper-scissors
<br>
<br>
__Disadvantages:__ <br>
- getting stuck in local optima
- can take more iterations for optimization

## Getting some intuition

In supervised learning, every sample comes with a label, e.g. images of cats and dogs that have to be classified. Let's say the network outputs the probabilities [0.6, 0.4] while the label is [0, 1]. The standard approach is to feed prediction and label into a loss function (in the simplest case, the difference between the two) and backpropagate the error so that the weights are shifted towards better predictions.

Policy gradients are very similar to this approach. Instead of a given label, we get a reward. If we received a high reward during the episode, all actions we took are treated as correct decisions.
In the case of a low reward, we want to change the weights of our policy so that the bad decisions become less likely.

A simple criterion to separate good from bad decisions is to subtract the average reward from the current one. A positive number means we got more than usual, a negative number means the agent performed worse, and the value can be used as a weight for how strongly we want to change the policy. Let's call this number the advantage (adv).
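
As a small illustration of this baseline idea, the sketch below subtracts a running average from each episode reward. The function name `advantages` and the window size are assumptions chosen for the example; the reward list is made up.

```python
from collections import deque

import numpy as np


def advantages(episode_rewards, window=100):
    """Reward of each episode minus the mean of the last `window` rewards."""
    history = deque(maxlen=window)
    adv = []
    for r in episode_rewards:
        history.append(r)                 # the current reward is part of the average
        adv.append(r - np.mean(history))  # > 0: better than usual, < 0: worse
    return adv


# hypothetical rewards of four consecutive episodes
adv = advantages([0.0, 0.0, 1.0, 0.0])
print(adv)
```

The third episode gets a positive advantage because it beat the average so far, while the fourth one gets a negative advantage because the successful episode has raised the average.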

For every step of the episode, our policy gives us the action probabilities:
state --> policy --> probability of taking each action (prob_a), e.g. [0.6, 0.4]
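
Turning these probabilities into an action is just a weighted random draw. The following sketch uses NumPy's random generator; the seed and the example probabilities are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, only for reproducibility
prob_a = np.array([0.6, 0.4])    # policy output for one state

# draw one action index according to the policy's probabilities
action = rng.choice(len(prob_a), p=prob_a)

# over many draws, the empirical frequencies approach prob_a
samples = rng.choice(len(prob_a), size=10_000, p=prob_a)
freq = np.bincount(samples, minlength=len(prob_a)) / len(samples)
print(action, freq)
```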

Let's say we got a reward that was better than average and the action taken was [1.0, 0]. Then we can use this sample for backpropagation just like in supervised learning, with model.train_on_batch(state, [1.0*adv, 0]), to change the parameters so that the output gets closer to our desired output.

In the case of a low reward (less than average), we want to lower the probability for the chosen action. This is implemented using model.train_on_batch(state, [1.0*adv, 0]) where adv is a negative number.
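
Why the scaled one-hot target pushes the probability in the right direction can be checked by hand. The sketch below assumes the loss is the plain categorical cross-entropy $-\sum_i y_i \log p_i$ and replaces the network by two bare logits followed by a softmax, so it is a stand-in for the idea and not the Keras model itself. The helper `step` and the learning rate are assumptions.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()


def step(logits, action, adv, lr=0.5):
    """One gradient-descent step on cross-entropy with target onehot * adv."""
    p = softmax(logits)
    onehot = np.eye(len(logits))[action]
    # loss = -adv * log p[action]  ->  d loss / d logits = adv * (p - onehot)
    grad = adv * (p - onehot)
    return logits - lr * grad


logits = np.zeros(2)                                 # start at [0.5, 0.5]
p_good = softmax(step(logits, action=0, adv=+1.0))   # better than average
p_bad = softmax(step(logits, action=0, adv=-1.0))    # worse than average
print(p_good, p_bad)
```

With a positive advantage the probability of the chosen action rises above 0.5, with a negative advantage it drops below 0.5.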

Let's see if this simple approach already works.

In [5]:
import gym
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from collections import deque
from keras.initializers import RandomUniform
np.random.seed(42)

env_name = 'FrozenLake-v0'
#env_name = 'Taxi-v2'

env = gym.make(env_name)
state_space = env.observation_space.n
action_space = env.action_space.n

gamma = 0.999
num_games = 30000

reward_list = deque(maxlen=100)
learning_rate = 0.1

def create_model():
    model = Sequential()
    model.add(Dense(10, input_dim=state_space, activation='relu'))
    model.add(Dense(action_space, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=learning_rate))
    return model

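def softmax(x):
    # numerically stable softmax; not used below, because the last layer
    # of the network already applies a softmax activation
    e = np.exp(x - np.max(x))
    return e / e.sum()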

model = create_model()
for game in range(num_games):
    state = env.reset()
    state = np.identity(state_space)[state:state+1] # transforms state into 1-hot-encoding
    done = False
    episode_reward = 0
    s = []
    a = []

    while(not done): 
        action_prob = model.predict(state)[0]
        action = np.random.choice(action_space, p=action_prob)

        state_next, reward, done, _ = env.step(action)
        state_next = np.identity(state_space)[state_next:state_next+1]
        
        s.append(state[0])
        a.append(np.identity(action_space)[action:action+1][0])
        
        episode_reward += reward

        state = state_next
        if(done):
            reward_list.append(episode_reward)
            
            R = episode_reward - np.mean(reward_list)
            
            y = np.array(a)*R
            model.train_on_batch(np.array(s), y)
            
            if(game%500 == 0):
                print('episode: ', game, 'avg reward: ', np.mean(reward_list))


episode:  0 avg reward:  0.0
episode:  500 avg reward:  0.0
episode:  1000 avg reward:  0.0
episode:  1500 avg reward:  0.03
episode:  2000 avg reward:  0.0
episode:  2500 avg reward:  0.01
episode:  3000 avg reward:  0.0
episode:  3500 avg reward:  0.04
episode:  4000 avg reward:  0.01
episode:  4500 avg reward:  0.04
episode:  5000 avg reward:  0.03
episode:  5500 avg reward:  0.0
episode:  6000 avg reward:  0.03
episode:  6500 avg reward:  0.01
episode:  7000 avg reward:  0.04
episode:  7500 avg reward:  0.02
episode:  8000 avg reward:  0.0
episode:  8500 avg reward:  0.04
episode:  9000 avg reward:  0.01
episode:  9500 avg reward:  0.02
episode:  10000 avg reward:  0.01
episode:  10500 avg reward:  0.0
episode:  11000 avg reward:  0.04
episode:  11500 avg reward:  0.01
episode:  12000 avg reward:  0.0
episode:  12500 avg reward:  0.05
episode:  13000 avg reward:  0.01
episode:  13500 avg reward:  0.02
episode:  14000 avg reward:  0.03
episode:  14500 avg reward:  0.02
episode:  150

Now let's try to arrive at something similar, with some math to back it up.

When deriving backprop in supervised learning, we first need a cost function. Since we don't want to minimize a cost but instead maximize the reward, the function is called a score function here.

## Policy score function J($\theta$) / also called objective function
The policy score function calculates the expected reward of a policy. There are three common definitions, and which one applies depends on the environment.
### 1. Episodic environments: Using the mean from start to end
Calculate the mean return from start to the end of the episode <br>
$J_1(\theta) = E_\pi [R_1 + \gamma R_2 + \gamma^2 R_3 + ...] = E_\pi(v(s_1)) = V^{\pi_\theta}(s_1)$
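
In practice the expectation is approximated by averaging the discounted return over sampled episodes. A minimal sketch, assuming made-up reward sequences and a discount factor of 0.5 to keep the numbers readable:

```python
import numpy as np


def discounted_return(rewards, gamma=0.99):
    """R_1 + gamma*R_2 + gamma^2*R_3 + ... for one episode's reward list."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# hypothetical reward sequences of three episodes that all start in s_1
episodes = [[0, 0, 1], [0, 0, 0], [0, 1]]
returns = [discounted_return(ep, gamma=0.5) for ep in episodes]
j1_estimate = np.mean(returns)   # sample average approximates J_1(theta)
print(returns, j1_estimate)
```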

### 2. Continuous environments: Using the average value
In continuous environments we cannot rely on a specific start state. Instead, each state is weighted by how often it is visited, and this weight is multiplied by the expected return from that state onwards. <br>
$J_{avg}(\theta) = E_\pi(V(s)) = \sum_s d(s)V(s)$ with $d(s) = N(s)/(\sum_{s'} N(s'))$
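
A small numeric example of this weighting, with invented visit counts $N(s)$ and state values $V(s)$ for three states:

```python
import numpy as np

visits = np.array([50, 30, 20])       # N(s): assumed visit counts per state
values = np.array([0.1, 0.5, 1.0])    # V(s): assumed state values

d = visits / visits.sum()             # d(s) = N(s) / sum_s' N(s')
j_avg = float(d @ values)             # sum_s d(s) V(s)
print(d, j_avg)
```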

### 3. Continuous environments: Using the average reward per time step
Sum over the probability of being in state s <br>
multiplied by <br>
the sum over the probability of taking action a in that state under the policy <br>
multiplied by <br>
the reward for that action in that state <br>
$J_{avR}(\theta) = E_\pi(r) = \sum_s d(s) \sum_a \pi_\theta (s,a) R^a_s$
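
The double sum maps directly onto array operations. The state distribution, policy table and reward table below are assumptions for a toy problem with two states and two actions:

```python
import numpy as np

d = np.array([0.5, 0.5])              # d(s): probability of being in each state
pi = np.array([[0.9, 0.1],            # pi(s, a): one row per state
               [0.2, 0.8]])
R = np.array([[1.0, 0.0],             # R^a_s: reward for action a in state s
              [0.0, 2.0]])

# sum_s d(s) * sum_a pi(s, a) * R^a_s
j_avr = float(np.sum(d[:, None] * pi * R))
print(j_avr)
```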


### Policy gradient ascent
Instead of gradient descent for a loss function like in supervised learning, we use gradient ascent for the score function. <br>
Update: $\theta$ <-- $ \theta + \alpha \nabla_\theta J(\theta)$
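
To see the update rule in isolation, the sketch below applies it to a toy score function $J(\theta) = -(\theta - 3)^2$ whose maximum is known. The function and the step size are assumptions picked for illustration and have nothing to do with a real policy.

```python
def grad_j(theta):
    """Gradient of the toy score J(theta) = -(theta - 3)**2."""
    return -2.0 * (theta - 3.0)


theta = 0.0
alpha = 0.1
for _ in range(100):
    theta = theta + alpha * grad_j(theta)   # ascent: step along the gradient

print(theta)
```

The only difference from gradient descent is the plus sign in the update, which moves $\theta$ uphill towards the maximum at 3.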

$ \nabla_\theta J(\theta) = \nabla_\theta \sum_t \pi(t, \theta)R(t)$ <br>
$ \nabla_\theta J(\theta) =  \sum_t \nabla_\theta \pi(t, \theta)R(t)$ <br>

Using the likelihood ratio trick  <br>
\begin{equation*}
\nabla \log(x) =  \frac{\nabla x}{x}
\end{equation*}
we divide and multiply by $\pi(t, \theta)$, which results in: <br>
\begin{equation*}
\nabla_\theta \pi(t, \theta) = \pi(t, \theta)  \frac{ \nabla_\theta \pi(t, \theta) }{ \pi(t, \theta) } = \pi(t, \theta) \nabla_\theta \log \pi(t, \theta)
\end{equation*}

$ \nabla_\theta J(\theta) =  \sum_t \nabla_\theta \pi(t, \theta)R(t) = \sum_t  \pi(t, \theta) \nabla_\theta \log (\pi(t, \theta)) R(t) $ <br>
 <br>
Since the $\pi(t, \theta)$ are probabilities that sum to one, the weighted sum $\sum_t \pi(t, \theta)(\cdot)$ is an expectation under the policy: <br>
$ \nabla_\theta J(\theta) =  E_\pi [ \nabla_\theta \log (\pi(t, \theta)) R(t) ] $ <br>
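
The identity can be checked numerically. The sketch below assumes a softmax policy over three outcomes with parameters `theta` as logits, for which $\nabla_\theta \log \pi_a = e_a - \pi$ (with $e_a$ the one-hot vector of outcome $a$), and compares the log-probability form with a finite-difference gradient of $J$. All numbers are made up.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def score(theta, R):
    """J(theta) = sum_a pi_a(theta) * R_a."""
    return float(softmax(theta) @ R)


def grad_finite_diff(theta, R, eps=1e-6):
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (score(theta + d, R) - score(theta - d, R)) / (2 * eps)
    return g


def grad_likelihood_ratio(theta, R):
    pi = softmax(theta)
    grad_log = np.eye(len(theta)) - pi   # row a holds grad of log pi_a
    return (pi * R) @ grad_log           # sum_a pi_a * R_a * grad log pi_a


theta = np.array([0.2, -0.1, 0.5])
R = np.array([1.0, 0.0, 2.0])
g_fd = grad_finite_diff(theta, R)
g_lr = grad_likelihood_ratio(theta, R)
print(g_fd, g_lr)
```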
<br>
 
We can change the update rule: <br>
$\theta$ <-- $\theta + \alpha \nabla_\theta \log(\pi(s,a,\theta)) R(t)$ <br>
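
A minimal sketch of this sample-based update on a two-armed bandit, assuming a softmax policy with the logits as parameters and deterministic rewards of 0 and 1. The seed, learning rate and number of steps are arbitrary choices for the example.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


rng = np.random.default_rng(0)
theta = np.zeros(2)                # one logit per action
rewards = np.array([0.0, 1.0])     # assumed: action 1 is the better arm
alpha = 0.1

for _ in range(2000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)                   # sample an action from the policy
    grad_log = np.eye(2)[a] - pi              # grad of log pi(a) for a softmax
    theta += alpha * grad_log * rewards[a]    # theta <- theta + alpha * grad * R

final_policy = softmax(theta)
print(final_policy)
```

Each update uses only the action that was actually sampled, so no sum over all actions is needed; over time the policy puts almost all probability on the rewarded arm.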
<br>

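Keras can calculate the gradients for us and update the network, but the loss function has to be implemented with the Keras backend. Since Keras optimizers minimize a loss, the code below minimizes the negative of the score, $-\log(\pi(s,a,\theta)) R(t)$, averaged over the episode.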

In [10]:
# source: https://gist.githubusercontent.com/kkweon/c8d1caabaf7b43317bc8825c226045d2/raw/fb433ae27c57aa41883af613d88227c65e8fb5ab/policy_gradient.py

import gym
import numpy as np

from keras import layers
from keras.models import Model
from keras import backend as K
from keras import utils as np_utils
from keras import optimizers


class Agent(object):

    def __init__(self, input_dim, output_dim, hidden_dims=[10]):
        self.input_dim = input_dim
        self.output_dim = output_dim

        self.__build_network(input_dim, output_dim, hidden_dims)
        self.__build_train_fn()

    def __build_network(self, input_dim, output_dim, hidden_dims=[10]):
        self.X = layers.Input(shape=(input_dim,))
        net = self.X

        for h_dim in hidden_dims:
            net = layers.Dense(h_dim)(net)
            net = layers.Activation("relu")(net)

        net = layers.Dense(output_dim)(net)
        net = layers.Activation("softmax")(net)

        self.model = Model(inputs=self.X, outputs=net)

    def __build_train_fn(self):
        action_prob_placeholder = self.model.output
        action_onehot_placeholder = K.placeholder(shape=(None, self.output_dim),
                                                  name="action_onehot")
        discount_reward_placeholder = K.placeholder(shape=(None,),
                                                    name="discount_reward")

        action_prob = K.sum(action_prob_placeholder * action_onehot_placeholder, axis=1)
        log_action_prob = K.log(action_prob)

        loss = - log_action_prob * discount_reward_placeholder
        loss = K.mean(loss)

        adam = optimizers.Adam()

        updates = adam.get_updates(params=self.model.trainable_weights,
                                   loss=loss)

        self.train_fn = K.function(inputs=[self.model.input,
                                           action_onehot_placeholder,
                                           discount_reward_placeholder],
                                   outputs=[],
                                   updates=updates)

    def get_action(self, state):
        shape = state.shape

        if len(shape) == 1:
            assert shape == (self.input_dim,), "{} != {}".format(shape, self.input_dim)
            state = np.expand_dims(state, axis=0)

        elif len(shape) == 2:
            assert shape[1] == (self.input_dim), "{} != {}".format(shape, self.input_dim)

        else:
            raise TypeError("Wrong state shape is given: {}".format(state.shape))

        action_prob = np.squeeze(self.model.predict(state))
        assert len(action_prob) == self.output_dim, "{} != {}".format(len(action_prob), self.output_dim)
        return np.random.choice(np.arange(self.output_dim), p=action_prob)

    def fit(self, S, A, R):
        action_onehot = np_utils.to_categorical(A, num_classes=self.output_dim)
        discount_reward = compute_discounted_R(R)

        assert S.shape[1] == self.input_dim, "{} != {}".format(S.shape[1], self.input_dim)
        assert action_onehot.shape[0] == S.shape[0], "{} != {}".format(action_onehot.shape[0], S.shape[0])
        assert action_onehot.shape[1] == self.output_dim, "{} != {}".format(action_onehot.shape[1], self.output_dim)
        assert len(discount_reward.shape) == 1, "{} != 1".format(len(discount_reward.shape))

        self.train_fn([S, action_onehot, discount_reward])


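def compute_discounted_R(R, discount_rate=.99):
    # discounted return for every time step, accumulated from the last step backwards
    discounted_r = np.zeros_like(R, dtype=np.float32)
    running_add = 0
    for t in reversed(range(len(R))):

        running_add = running_add * discount_rate + R[t]
        discounted_r[t] = running_add

    # standardize to zero mean and unit variance; the epsilon avoids a
    # division by zero when all returns are equal (e.g. an episode without reward)
    discounted_r = (discounted_r - discounted_r.mean()) / (discounted_r.std() + 1e-8)

    return discounted_r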


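def run_episode(env, agent):
    # FrozenLake returns the state as a single integer, but the network expects
    # a vector of length input_dim, so every state is one-hot encoded first.
    # Passing the raw integer to agent.get_action raises
    # "TypeError: Wrong state shape is given: ()".
    one_hot = np.identity(agent.input_dim)

    done = False
    S = []
    A = []
    R = []

    s = one_hot[env.reset()]

    total_reward = 0

    while not done:

        a = agent.get_action(s)

        s2, r, done, info = env.step(a)
        s2 = one_hot[s2]
        total_reward += r

        S.append(s)
        A.append(a)
        R.append(r)

        s = s2

        if done:
            # train once per episode on all collected (state, action, reward) samples
            S = np.array(S)
            A = np.array(A)
            R = np.array(R)

            agent.fit(S, A, R)

    return total_reward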

env_name = 'FrozenLake-v0'


try:
    env = gym.make(env_name)
    input_dim = env.observation_space.n
    output_dim = env.action_space.n
    agent = Agent(input_dim, output_dim, [10])

    for episode in range(2000):
        reward = run_episode(env, agent)
        print(episode, reward)

finally:
    env.close()



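TypeError: Wrong state shape is given: ()

This error is raised by `get_action` when it receives a scalar state (shape `()`), which is what FrozenLake returns, instead of a vector of length `input_dim`.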