# REINFORCEMENT LEARNING (RL) – POLICY GRADIENTS

- https://theneuralperspective.com/2016/11/25/reinforcement-learning-rl-policy-gradients-i/
- https://github.com/ashutoshkrjha/Cartpole-OpenAI-Tensorflow/blob/master/cartpole.py
- https://gist.github.com/shanest/535acf4c62ee2a71da498281c2dfc4f4

### OBJECTIVE

Our model has three main components: the state, the action and the reward. The state can be thought of as a snapshot of the environment; given a state, the policy picks an action, which leads to a reward. Actions can also alter the state, and the reward is often delayed rather than being an immediate response to a given action.

```    
               -----------------
              |                 |
             \ /                |
              -                 |
            State-----------> Action-----------> Reward
              |                                    -
              |                                   / \
              |                                    |
              --------------------------------------
```

s  ---> state   
a  ---> action   
a* ---> correct action  
r  ---> reward   
$\pi$  ---> policy  
$\theta$  --->  policy weights   
R ---> total reward   
$\hat A $  ---> advantage estimate   
$\gamma$  ---> discount factor   

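In supervised learning we maximize the log likelihood of the correct labels. If we knew the correct action $a^*$ for every state, we could train the policy in exactly the same way:

$\max_\theta \sum_{n=1}^{N}\ \log P(y_n|x_n;\theta)$    
= $\max_\theta \sum_{n=1}^{N}\ \log P(a^*_n|s_n;\theta)$    
= $\min_\theta \big[- \sum_{n=1}^{N}\ \log P(a^*_n|s_n;\theta)\big]$ 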

Since we don't have the correct actions to take, the best we can do is try some actions that may turn out good or bad and then train our policy weights ($\theta$) so that we increase the chances of the good actions. One common approach is to collect a series of states, actions and the corresponding rewards (s0, a0, r0, s1, a1, r1, …) and from this calculate R, the total reward, i.e. the sum of all the rewards r. This gives us the policy gradient:

$$
\frac{\partial J}{\partial \theta} = \frac{\partial \sum \log \pi (a|s;\theta)}{\partial \theta} R
$$
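To make the expression above concrete, here is a minimal sketch in plain Python. It assumes a deliberately simplified policy: a softmax over one logit per action that ignores the state, so that $\partial \log \pi(a) / \partial \theta_j = 1[j = a] - \pi_j$. The function names, the episode and the learning rate of 0.1 are all hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    """Turn a list of logits into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. each logit: 1[j == action] - pi_j."""
    probs = softmax(logits)
    return [(1.0 if j == action else 0.0) - p for j, p in enumerate(probs)]

def policy_gradient(logits, actions, total_reward):
    """Sum the log-probability gradients over the episode and scale by R."""
    grad = [0.0] * len(logits)
    for a in actions:
        step_grad = grad_log_softmax(logits, a)
        grad = [acc + g for acc, g in zip(grad, step_grad)]
    return [total_reward * g for g in grad]

theta = [0.0, 0.0]           # hypothetical policy weights (one logit per action)
episode_actions = [1, 1, 0]  # hypothetical actions taken during one episode
R = 3.0                      # total reward of that episode

grad = policy_gradient(theta, episode_actions, R)
# Gradient ascent step: actions taken in a positively rewarded episode become more likely.
new_theta = [t + 0.1 * g for t, g in zip(theta, grad)]
print(grad, softmax(new_theta))
```

Action 1 was taken more often than action 0 in this positively rewarded episode, so the update raises its probability above 0.5.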

We will take a closer look at why the policy gradient takes this form later in this section, where we draw parallels with supervised learning.

When calculating the total reward R for an episode (a series of events), the episode will contain both good and bad actions. According to our policy gradient, however, we update the weights to favor ALL the actions in a given episode if the total reward is positive, and the magnitude of the update depends on the magnitude of the reward and of the gradient. When we repeat this over enough episodes, the good actions end up being reinforced more often than the bad ones, and the policy becomes quite precise at modeling which action to take in a given state in order to maximize the reward.

There are several additions we can make to our policy gradient, such as adding an advantage estimator ($\hat A$) to determine which specific actions were good/bad instead of just using the total reward to judge a whole episode as good/bad.  
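As one illustration of the idea, the sketch below uses about the simplest possible advantage estimate: each step's return minus the mean return, so that steps that did better than average get a positive weight and the others a negative one. This mean baseline is an assumption made for the example, not necessarily the estimator used by any particular implementation, and the per-step returns are made-up numbers.

```python
def advantages_from_returns(returns):
    """Subtract a mean baseline from per-step returns (a simple advantage estimate)."""
    baseline = sum(returns) / len(returns)
    return [r - baseline for r in returns]

# Hypothetical per-step returns for one short episode.
step_returns = [1.75, 1.5, 1.0]
advantages = advantages_from_returns(step_returns)
print(advantages)
```

With this weighting, the gradient pushes up the probability of the actions with above-average returns and pushes down the rest, instead of treating every action in the episode alike.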

So how does this supervised learning technique relate to reinforcement learning policy gradients? If you inspect the loss functions, you will see that they are exactly the same in principle. The negative log likelihood loss function above is a simplification of the multi-class cross entropy loss function below.

$$
J(\theta) = - \sum_i y_i \ln(\hat y_i)
$$

E.g.:   
Computed ($\hat y$): [0.3, 0.3, 0.4]  
Targets ($y$): [0, 0, 1]   
$$
J(\theta) = -  [0 \cdot \ln(0.3) + 0 \cdot \ln(0.3) + 1 \cdot \ln(0.4)] = - \ln(0.4)
$$
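The same arithmetic can be checked in a few lines of Python. This is only a sketch using the numbers from the example; the helper name `cross_entropy` is made up for illustration.

```python
import math

def cross_entropy(targets, predictions):
    """Multi-class cross entropy: -sum_i y_i * ln(y_hat_i)."""
    return -sum(y * math.log(p) for y, p in zip(targets, predictions))

y_hat = [0.3, 0.3, 0.4]  # computed probabilities
y = [0, 0, 1]            # one-hot targets

loss = cross_entropy(y, y_hat)
nll = -math.log(y_hat[2])  # keep only the term of the correct class
print(loss, nll)           # both are -ln(0.4), roughly 0.916
```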

With the multinomial cross entropy, you can see that we only keep the loss contribution from the correct class. With neural nets this is usually the case, since the targets are typically one-hot (just one true class). Therefore, we can rewrite our loss as just $\sum(-\log(\hat y))$, where $\hat y$ is the probability assigned to the correct class: we replace $y_i$ (true y) with 1 for the correct class, and the probabilities of the other classes don't matter because their $y_i$ is 0. This is referred to as the negative log likelihood. But for drawing the parallel between supervised learning and RL, let's keep it in the explicit cross-entropy form.

Supervised: $J(\theta) = - \sum y_i \log (\hat y_i)$    
Reinforcement: $J(\theta) = - \sum r \log \pi (a|s;\theta)$


In supervised learning, we have a prediction ($\hat y$) and a true label ($y_i$). In a typical case, only one label in $y_i$ will be 1 (true), so only the log of the true class's prediction is taken into account in the loss. The gradient, however, still takes into account the predicted probabilities of all the classes (with a softmax output, for example, they are coupled). In RL, we take the log of the probability that our policy ($\pi$) assigns to the action (a) we took, and multiply that log-probability by the reward for that action. 

Note: The action is an index into the policy's outputs (e.g. if the chosen action is 2, we take the 0.9 from [0.05, 0.03, 0.9, 0.02] to put into the log). The reward can have any magnitude and sign but, as in our supervised case, it helps determine the loss and adjusts the weights by scaling the gradient. If the reward is positive, the weights are altered via the gradient to favor the action that was taken in that state. If the reward is negative, the weights are altered to make that action less likely in that state. DO NOT draw parallels by saying the reward is like $y_i$ because, as you can see, that is not the case. 
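The effect of the reward's sign can be seen in a small sketch. The probabilities, rewards and function names below are hypothetical; the code evaluates the single-step loss $-r \log \pi(a|s)$ and its derivative with respect to the chosen action's probability, which is $-r / \pi(a|s)$.

```python
import math

def rl_loss(action_probs, action, reward):
    """Single-step policy gradient loss: -r * log pi(a|s)."""
    return -reward * math.log(action_probs[action])

def loss_grad_wrt_prob(action_probs, action, reward):
    """Derivative of the loss w.r.t. the probability of the chosen action."""
    return -reward / action_probs[action]

probs = [0.1, 0.2, 0.6, 0.1]  # hypothetical policy output for one state
chosen = 2                    # index of the action that was taken

grad_pos = loss_grad_wrt_prob(probs, chosen, reward=1.0)
grad_neg = loss_grad_wrt_prob(probs, chosen, reward=-1.0)

# Gradient descent moves against the gradient:
# a negative gradient raises the probability, a positive one lowers it.
print(rl_loss(probs, chosen, 1.0), grad_pos)
print(rl_loss(probs, chosen, -1.0), grad_neg)
```

A positive reward gives a negative derivative, so a descent step increases the probability of the chosen action; a negative reward does the opposite.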

### NUANCES

You may notice that we apply an additional operation to our rewards before feeding them in for training: we discount the rewards. The idea is that the action responsible for the current reward also influences the rewards of the subsequent events, so each reward is replaced by the sum of itself and all the rewards that follow it. Each later reward is weighted by the discount factor raised to the number of steps elapsed, $\gamma^k$, so that for $\gamma < 1$ more distant rewards count less. Each reward is therefore recalculated by the following expression, where $R_t$ denotes the discounted reward at time step $t$:

$$
R_t = \sum_{k=0}^{\infty} \gamma ^ k r_{t+k}
$$
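For a finite episode the sum stops at the last step, and it can be computed in a single backward pass. The sketch below assumes the rewards of one episode are stored in a list; the function name `discount_rewards` and the value of gamma are illustrative choices.

```python
def discount_rewards(rewards, gamma=0.99):
    """Replace each reward with the discounted sum of itself and all later rewards."""
    discounted = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of the steps after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        discounted[t] = running
    return discounted

episode_rewards = [1.0, 1.0, 1.0]
discounted = discount_rewards(episode_rewards, gamma=0.5)
print(discounted)  # [1.75, 1.5, 1.0]
```

Earlier steps get larger values because they are credited with everything that follows them, each later reward shrunk by one more factor of gamma.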


https://gym.openai.com/docs

In [1]:
import gym
import time

env = gym.make('CartPole-v0')
env.reset()
total_reward = 0
for _ in range(1000):
    env.render()
    time.sleep(0.01)  # slow the loop down so the rendering is watchable
    # take a random action; the classic gym API returns a 4-tuple from step()
    observation, reward, done, info = env.step(env.action_space.sample())
    total_reward += reward
    if done:  # stop as soon as the episode ends
        break
env.close()

print('Total Reward is:', total_reward)

[2017-09-06 19:43:36,715] Making new env: CartPole-v0


Total Reward is: 13.0


In [None]:
# sample a few of the possible actions
for i in range(20):
    print(env.action_space.sample(), end=' ')

In [None]:
print(env.action_space)
#> Discrete(2) i.e. valid actions are either 0 or 1.
print(env.observation_space)
#> Box(4,) i.e. cart position, cart velocity, pole angle, pole angular velocity
print(env.observation_space.high)
#> array([ 2.4       ,         inf,  0.20943951,         inf])
print(env.observation_space.low)
#> array([-2.4       ,        -inf, -0.20943951,        -inf])

In [None]:
# from gym import envs
# available_envs = envs.registry.all()
# for each in available_envs: print(each)

In [8]:
import gym
import time

env = gym.make('CartPole-v0')

input_initial = env.reset()  # initial observation
print('input_initial: ', input_initial)

# play 10 episodes with a random policy and report the score of each
for i_episode in range(10):
    total_reward = 0
    observation = env.reset()
    for t in range(1000):
        env.render()
        time.sleep(0.01)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            print("Episode finished after {} timesteps with score {}".format(t+1, total_reward))
            break
env.close()


[2017-09-06 19:47:39,126] Making new env: CartPole-v0


input_initial:  [-0.04124413  0.03330742  0.02792075 -0.00368043]
Episode finished after 36 timesteps with score 36.0
Episode finished after 43 timesteps with score 43.0
Episode finished after 32 timesteps with score 32.0
Episode finished after 23 timesteps with score 23.0
Episode finished after 10 timesteps with score 10.0
Episode finished after 11 timesteps with score 11.0
Episode finished after 14 timesteps with score 14.0
Episode finished after 22 timesteps with score 22.0
Episode finished after 22 timesteps with score 22.0
Episode finished after 23 timesteps with score 23.0


In [9]:
import sys
import numpy as np
import json
import os, inspect
import math
sys.path.append("../")
%load_ext autoreload
%autoreload 2
import logging
logger = logging.getLogger(__name__)

In [10]:
from dhira.data.openai.cartpole import CartPole as GymEnv
from dhira.tf.models.rl.cartpole import PolicyGradient as Model
from dhira.data.features.internal.feature_base import FeatureBase

In [None]:
# close any environment left over from the earlier cells
try:
    if env is not None:
        env.close()
except NameError:
    print('Env not yet defined')
rl = Model()
rl.compile()
env = GymEnv()

[2017-09-06 19:48:52,591] Making new env: CartPole-v0


In [None]:
rl.train_gym(env, batch_size=5,
              num_episodes=10000,
              expected_reward=200)

Episode:5 Average reward for episode:15.800000  Total average reward:15.800000
Episode:10 Average reward for episode:19.400000  Total average reward:19.400000
Episode:15 Average reward for episode:23.600000  Total average reward:23.600000
Episode:20 Average reward for episode:14.000000  Total average reward:14.000000
Episode:25 Average reward for episode:20.000000  Total average reward:20.000000
Episode:30 Average reward for episode:27.600000  Total average reward:27.600000
Episode:35 Average reward for episode:23.400000  Total average reward:23.400000
Episode:40 Average reward for episode:18.000000  Total average reward:18.000000
Episode:45 Average reward for episode:23.600000  Total average reward:23.600000
Episode:50 Average reward for episode:16.000000  Total average reward:16.000000
Episode:55 Average reward for episode:14.800000  Total average reward:14.800000
Episode:60 Average reward for episode:13.000000  Total average reward:13.000000
Episode:65 Average reward for episode:12.

Episode:520 Average reward for episode:11.800000  Total average reward:11.800000
Episode:525 Average reward for episode:10.200000  Total average reward:10.200000
Episode:530 Average reward for episode:13.800000  Total average reward:13.800000
Episode:535 Average reward for episode:10.800000  Total average reward:10.800000
Episode:540 Average reward for episode:11.000000  Total average reward:11.000000
Episode:545 Average reward for episode:11.800000  Total average reward:11.800000
Episode:550 Average reward for episode:10.200000  Total average reward:10.200000
Episode:555 Average reward for episode:11.400000  Total average reward:11.400000
Episode:560 Average reward for episode:10.400000  Total average reward:10.400000
Episode:565 Average reward for episode:11.800000  Total average reward:11.800000
Episode:570 Average reward for episode:10.400000  Total average reward:10.400000
Episode:575 Average reward for episode:10.600000  Total average reward:10.600000
Episode:580 Average reward f

Episode:1040 Average reward for episode:9.400000  Total average reward:9.400000
Episode:1045 Average reward for episode:9.000000  Total average reward:9.000000
Episode:1050 Average reward for episode:9.600000  Total average reward:9.600000
Episode:1055 Average reward for episode:8.800000  Total average reward:8.800000
Episode:1060 Average reward for episode:9.800000  Total average reward:9.800000
Episode:1065 Average reward for episode:9.600000  Total average reward:9.600000
Episode:1070 Average reward for episode:10.000000  Total average reward:10.000000
Episode:1075 Average reward for episode:9.200000  Total average reward:9.200000
Episode:1080 Average reward for episode:9.200000  Total average reward:9.200000
Episode:1085 Average reward for episode:9.600000  Total average reward:9.600000
Episode:1090 Average reward for episode:12.600000  Total average reward:12.600000
Episode:1095 Average reward for episode:10.000000  Total average reward:10.000000
Episode:1100 Average reward for ep

Episode:1550 Average reward for episode:9.600000  Total average reward:9.600000
Episode:1555 Average reward for episode:9.400000  Total average reward:9.400000
Episode:1560 Average reward for episode:8.800000  Total average reward:8.800000
Episode:1565 Average reward for episode:10.000000  Total average reward:10.000000
Episode:1570 Average reward for episode:9.600000  Total average reward:9.600000
Episode:1575 Average reward for episode:9.600000  Total average reward:9.600000
Episode:1580 Average reward for episode:9.600000  Total average reward:9.600000
Episode:1585 Average reward for episode:9.400000  Total average reward:9.400000
Episode:1590 Average reward for episode:10.200000  Total average reward:10.200000
Episode:1595 Average reward for episode:9.000000  Total average reward:9.000000
Episode:1600 Average reward for episode:9.400000  Total average reward:9.400000
Episode:1605 Average reward for episode:10.400000  Total average reward:10.400000
Episode:1610 Average reward for ep

Episode:2070 Average reward for episode:9.200000  Total average reward:9.200000
Episode:2075 Average reward for episode:9.800000  Total average reward:9.800000
Episode:2080 Average reward for episode:9.400000  Total average reward:9.400000
Episode:2085 Average reward for episode:9.200000  Total average reward:9.200000
Episode:2090 Average reward for episode:9.600000  Total average reward:9.600000
Episode:2095 Average reward for episode:9.400000  Total average reward:9.400000
Episode:2100 Average reward for episode:8.800000  Total average reward:8.800000
Episode:2105 Average reward for episode:9.000000  Total average reward:9.000000
Episode:2110 Average reward for episode:9.600000  Total average reward:9.600000
Episode:2115 Average reward for episode:10.000000  Total average reward:10.000000
Episode:2120 Average reward for episode:9.600000  Total average reward:9.600000
Episode:2125 Average reward for episode:10.200000  Total average reward:10.200000
Episode:2130 Average reward for epis

Episode:2590 Average reward for episode:9.600000  Total average reward:9.600000
Episode:2595 Average reward for episode:9.400000  Total average reward:9.400000
Episode:2600 Average reward for episode:9.600000  Total average reward:9.600000
Episode:2605 Average reward for episode:8.600000  Total average reward:8.600000


In [None]:
np.random.uniform()