# Policy Gradient on cartpole
In this notebook, we run the policy gradient algorithm on the cartpole example. 
* [You can read about the cartpole problem here.](cartpole.ipynb)
* [You can read about policy gradient here.](pg_notebook.ipynb)
* [You can see the pure code for policy gradient on cartpole here.](./cartpole/pg_on_cartpole.py)

## Summary of the algorithm
We build a (deep) network to represent the policy $\pi_\theta$, a probability distribution over the actions given the state: $\pi_\theta$ = `network(state)`.


```
network = keras.Sequential([
            keras.layers.Dense(30, input_dim=n_s, activation='relu'),
            keras.layers.Dense(30, activation='relu'),
            keras.layers.Dense(n_a, activation='softmax')])
network.compile(loss='categorical_crossentropy')
```
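
The softmax output layer makes the network output a valid probability vector over the `n_a` actions, which is what allows an action to be sampled from it. As a minimal sketch in plain Python (the helper names `softmax` and `sample_action` and the example logits are illustrative assumptions, not part of the notebook):

```python
import math
import random


def softmax(logits):
    """Map raw scores to probabilities that are positive and sum to one."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
probs = softmax([0.2, -0.1])  # hypothetical scores for the two cartpole actions

# The empirical frequencies should approach the probabilities
counts = [0, 0]
for _ in range(10_000):
    counts[sample_action(probs, rng)] += 1
print(probs, [c / 10_000 for c in counts])
```

Sampling, rather than always taking the most probable action, is what gives the policy gradient algorithm its exploration.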

Then, we iteratively improve the network. In each iteration of the algorithm, we do the following:

* i. We roll out the environment to collect data for PG by following these steps:
    * i.a. We initialize empty histories for `states=[]`, `actions=[]`, `rewards=[]`.
    * i.b. We observe the `state` $s$ and sample `action` $a$ from the policy pdf $\pi_{\theta}(s)$:
    
    `softmax_out = network(state)`
    
    `a = np.random.choice(n_a, p=softmax_out.numpy()[0])`
    
    * i.c. We drive the environment using $a$ and observe the `reward` $r$.
    * i.d. We add $s,\:a,\:r$ to the history batch `states`, `actions`, `rewards`.
    * i.e. We continue from i.b. until the episode ends.
* ii. We improve the policy by following these steps:
    * ii.a. We calculate the reward to go and standardize it. 
    * ii.b. We optimize the policy.
    
    `target_actions = tf.keras.utils.to_categorical(np.array(actions), n_a)`
    
    `loss = network.train_on_batch(states, target_actions, sample_weight=rewards_to_go)`
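
The improvement step (ii) can be sketched in plain Python without TensorFlow. The helper names `rewards_to_go`, `standardize`, and `pg_loss`, the small constant `eps`, and the toy episode are assumptions made for illustration; the notebook itself relies on Keras for the loss and the gradient step.

```python
import math


def rewards_to_go(rewards, gamma):
    """Discounted sum of the rewards from each step to the end of the episode."""
    running = 0.0
    out = []
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    out.reverse()
    return out


def standardize(values, eps=1e-8):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / (std + eps) for v in values]


def pg_loss(action_probs, actions, weights):
    """Mean of -weight * log pi(a_t | s_t) over the steps of the episode."""
    terms = [-w * math.log(p[a]) for p, a, w in zip(action_probs, actions, weights)]
    return sum(terms) / len(terms)


# A toy episode of four steps with reward 1 per step, as in cartpole
rewards = [1.0, 1.0, 1.0, 1.0]
rtg = rewards_to_go(rewards, gamma=1.0)
weights = standardize(rtg)
print(rtg)      # [4.0, 3.0, 2.0, 1.0]
print(weights)  # early steps get positive weights, late steps negative ones

# Hypothetical policy outputs: both actions equally likely at every step
probs = [[0.5, 0.5]] * 4
actions = [0, 1, 1, 0]
loss = pg_loss(probs, actions, weights)
```

After standardization, actions taken early in a long episode are reinforced and actions taken shortly before the pole falls are discouraged.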
    

## Running on Google Colab
To run on Google Colab, run the following cell. To run on your own computer, skip this cell and start from Importing libraries.

In [1]:
!git clone https://github.com/FarnazAdib/Crash_course_on_RL.git
%cd Crash_course_on_RL
!pip install .

Cloning into 'Crash_course_on_RL'...
remote: Enumerating objects: 158, done.
remote: Counting objects: 100% (158/158), done.
remote: Compressing objects: 100% (84/84), done.
remote: Total 158 (delta 88), reused 142 (delta 72), pack-reused 0
Receiving objects: 100% (158/158), 356.75 KiB | 7.75 MiB/s, done.
Resolving deltas: 100% (88/88), done.
/content/Crash_course_on_RL
Processing /content/Crash_course_on_RL
Building wheels for collected packages: Reinforcement-Learning-for-Control
  Building wheel for Reinforcement-Learning-for-Control (setup.py) ... done
  Created wheel for Reinforcement-Learning-for-Control: filename=Reinforcement_Learning_for_Control-0.0.1-cp36-none-any.whl size=3832 sha256=32692a077b39a10f8b561f82b5512f28a1feef007c2b2ee56e584b4f41650d6a
  Stored in directory: /tmp/pip-ephem-wheel-cache-9wxyqh1u/wheels/9b/1d/4f/f6799e86f243362b1f3fb259778269457c7c763ccf94aab885
Successfully built Reinforcement-Learning-for-Control
Installing collected packag

## Importing libraries

We start by importing the required libraries. If you get an error, you may have forgotten to change the kernel. See [Prepare a virtual environment](Preparation.ipynb).

In [2]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import datetime as dt
import warnings
warnings.filterwarnings('ignore')
from cartpole.dynamics import CartPole

## Saving directories
Next, we set up some paths to write data and possibly capture some videos for future investigation.

In [3]:
STORE_PATH = '/tmp/cartpole_exp1/PG'
data_path = STORE_PATH + f"/data_{dt.datetime.now().strftime('%d%m%Y%H%M')}"
agent_path = STORE_PATH + f"/agent_{dt.datetime.now().strftime('%d%m%Y%H%M')}"
train_writer = tf.summary.create_file_writer(data_path)

## Making the environment
We select the random seed and make the cartpole environment.


In [4]:
Rand_Seed = 1
env_par = {
    'Rand_Seed': Rand_Seed,
    'STORE_PATH': STORE_PATH,
    'monitor': False,
    'threshold': 195.0
}
CP = CartPole(env_par)

## Making the policy gradient agent
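We define the policy gradient class. This class receives a dictionary with the following entries:
* `num_state`: Dimension of the state, i.e. the number of inputs of the network.
* `num_actions`: Number of actions, i.e. the number of outputs of the network.
* `Rand_Seed`: The random seed for NumPy and TensorFlow.
* `hidden_size`: Number of nodes in each hidden layer.
* `GAMMA`: Discount (forgetting) factor applied to future rewards. It should be in $[0,\:1]$.
* `num_episodes`: The maximum number of episodes to run.
* `learning_rate_adam`: The learning rate for Adam optimization.
* `adam_eps`: The epsilon in Adam optimization.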
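
To make the roles of `learning_rate_adam` and `adam_eps` concrete, here is a minimal sketch of one textbook Adam update for a single scalar parameter. The function `adam_step` and the decay rates `beta1` and `beta2` are illustrative assumptions that do not appear in the notebook, and the Keras optimizer may differ from this form in implementation details.

```python
import math


def adam_step(theta, grad, m, v, t, learning_rate=0.001, eps=1e-7,
              beta1=0.9, beta2=0.999):
    """One textbook Adam update for a scalar parameter (hypothetical helper)."""
    m = beta1 * m + (1.0 - beta1) * grad        # running mean of the gradient
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # running mean of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)              # bias corrections for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    # The learning rate scales the step; eps keeps the division well defined
    theta = theta - learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Minimize f(theta) = theta**2, whose gradient is 2 * theta
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t)
    print(t, theta)
```

In this sketch each step moves the parameter by roughly `learning_rate`, regardless of the gradient magnitude, which is why the learning rate is the main knob to tune.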



In [5]:
class PG:
    def __init__(self, hparams):
        self.hparams = hparams
        np.random.seed(hparams['Rand_Seed'])
        tf.random.set_seed(hparams['Rand_Seed'])

        # The policy network
        self.network = keras.Sequential([
            keras.layers.Dense(self.hparams['hidden_size'], input_dim=self.hparams['num_state'], activation='relu',
                               kernel_initializer=keras.initializers.he_normal(), dtype='float64'),
            keras.layers.Dense(self.hparams['hidden_size'], activation='relu',
                               kernel_initializer=keras.initializers.he_normal(), dtype='float64'),
            keras.layers.Dense(self.hparams['num_actions'], activation='softmax', dtype='float64')
        ])
        self.network.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(
            epsilon=self.hparams['adam_eps'], learning_rate=self.hparams['learning_rate_adam']))

    def get_action(self, state, env):
        
        # Building the pdf for the given state
        softmax_out = self.network(state.reshape((1, -1)))
        
        # Sampling an action according to the pdf
        selected_action = np.random.choice(self.hparams['num_actions'], p=softmax_out.numpy()[0])
        return selected_action

    def update_network(self, states, actions, rewards):
        # Reward to go: discounted sum of the rewards from each step to the end of the episode
        reward_sum = 0
        rewards_to_go = []
        for reward in rewards[::-1]:  # iterate over the rewards backwards in time
            reward_sum = reward + self.hparams['GAMMA'] * reward_sum
            rewards_to_go.append(reward_sum)
        rewards_to_go.reverse()
        rewards_to_go = np.array(rewards_to_go)
        # Standardize the rewards to go (the small constant guards against a zero standard deviation)
        rewards_to_go -= np.mean(rewards_to_go)
        rewards_to_go /= (np.std(rewards_to_go) + 1e-8)
        states = np.vstack(states)
        # One-hot targets for the taken actions; the rewards to go weight the cross-entropy of each sample
        target_actions = tf.keras.utils.to_categorical(np.array(actions), self.hparams['num_actions'])
        loss = self.network.train_on_batch(states, target_actions, sample_weight=rewards_to_go)
        return loss


We have configured our network by selecting its structure and the cost function to be minimized. The last step is to feed the network with `states`, `actions`, and `rewards` and update the parameters of the network. This is done by the function `update_network(self, states, actions, rewards)`.
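
It may help to see why a cross-entropy loss weighted by the rewards to go performs a policy gradient step. The symbols $y_t$ (one-hot encoding of the taken action $a_t$), $\hat{R}_t$ (standardized reward to go), and $T$ (episode length) are notation introduced here for illustration, and the averaging over the batch is an assumption about how the per-sample losses are combined. For a one-hot target, the categorical cross-entropy at step $t$ reduces to the negative log-probability of the taken action:

$$\ell_t(\theta) = -\sum_{k} y_{t,k}\,\log \pi_\theta(k \mid s_t) = -\log \pi_\theta(a_t \mid s_t).$$

Weighting each sample by $\hat{R}_t$ and averaging over the episode gives

$$L(\theta) = -\frac{1}{T}\sum_{t=1}^{T} \hat{R}_t\,\log \pi_\theta(a_t \mid s_t), \qquad \nabla_\theta L(\theta) = -\frac{1}{T}\sum_{t=1}^{T} \hat{R}_t\,\nabla_\theta \log \pi_\theta(a_t \mid s_t).$$

Up to the normalization constant, $-\nabla_\theta L(\theta)$ is the usual sample estimate of the policy gradient, so minimizing $L$ by gradient descent ascends the expected reward.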


Now that we have defined the policy gradient algorithm, it is enough to build an object and iterate. You can change the following hyperparameters if you like:

* `hidden_size`: Number of nodes in each hidden layer.
* `GAMMA`: Discount (forgetting) factor applied to future rewards. It should be in $[0,\:1]$.
* `num_episodes`: The maximum number of episodes to run.
* `learning_rate_adam`: The learning rate for Adam optimization.
* `adam_eps`: The epsilon in Adam optimization.

In [6]:
agent_par = {
    'num_state': CP.env.observation_space.shape[0],
    'num_actions': CP.env.action_space.n,
    'Rand_Seed': Rand_Seed,
    'hidden_size': 30,
    'GAMMA': 1.0,
    'num_episodes': 2000,
    'learning_rate_adam': 0.001,
    'adam_eps': 1e-7,
}
policy = PG(agent_par)

## Start learning
Now, we start the learning loop. The learning loop runs for at most `num_episodes` episodes. In each iteration:
* The agent drives the environment for one episode to collect data for PG.
* We update the agent with the policy gradient algorithm using the recorded data.
* We check if the problem is solved.
* We write the data.

At the end of the learning loop, we close the environment.
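
The "problem is solved" check is a running mean over a window of 100 episodes compared with the threshold. A minimal sketch of that idea follows; the helper `solved_after` and the synthetic reward sequence are assumptions for illustration, and the notebook's own loop organizes the check slightly differently.

```python
from collections import deque


def solved_after(episode_rewards, threshold=195.0, window=100):
    """Index of the first episode at which the mean reward of the last
    `window` episodes exceeds `threshold`, or None if that never happens."""
    recent = deque(maxlen=window)  # old rewards fall out automatically
    for episode, reward in enumerate(episode_rewards):
        recent.append(reward)
        if len(recent) == window and sum(recent) / window > threshold:
            return episode
    return None


# Synthetic history: 50 poor episodes followed by 150 good ones
history = [20.0] * 50 + [200.0] * 150
print(solved_after(history))
```

Averaging over a window makes the stopping rule robust to a single lucky episode, since individual rollouts of a stochastic policy vary a lot.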

In [7]:
tot_rews = []
mean_100ep = 0
for episode in range(agent_par['num_episodes']):

    # Do one rollout
    states, actions, rewards, _, _ = CP.one_rollout(policy)

    # Update the network
    loss = policy.update_network(states, actions, rewards)

    # Check if the problem is solved: mean reward over the last 100 completed episodes
    if episode >= 100:
        mean_100ep = np.mean(tot_rews[-100:])

    tot_reward = sum(rewards)
    tot_rews.append(tot_reward)
    print(f"Episode: {episode}, Reward: {tot_reward}, Mean of 100 cons episodes: {mean_100ep}")
    if mean_100ep > env_par['threshold']:
        print("Problem is solved.")
        policy.network.save(agent_path)
        break

    # Save data
    with train_writer.as_default():
        tf.summary.scalar('reward', tot_reward, step=episode)

# Close the environment
CP.env.close()

# Print the summary of the solution
if mean_100ep > env_par['threshold']:
    print(f"\n\nProblem is solved after {episode} episodes with a mean reward of {mean_100ep} over the last 100 episodes")


Episode: 0, Reward: 11.0, Mean of 100 cons episodes: 0
Episode: 1, Reward: 28.0, Mean of 100 cons episodes: 0
Episode: 2, Reward: 26.0, Mean of 100 cons episodes: 0
Episode: 3, Reward: 19.0, Mean of 100 cons episodes: 0
Episode: 4, Reward: 20.0, Mean of 100 cons episodes: 0
Episode: 5, Reward: 14.0, Mean of 100 cons episodes: 0
Episode: 6, Reward: 12.0, Mean of 100 cons episodes: 0
Episode: 7, Reward: 12.0, Mean of 100 cons episodes: 0
Episode: 8, Reward: 10.0, Mean of 100 cons episodes: 0
Episode: 9, Reward: 37.0, Mean of 100 cons episodes: 0
Episode: 10, Reward: 14.0, Mean of 100 cons episodes: 0
Episode: 11, Reward: 42.0, Mean of 100 cons episodes: 0
Episode: 12, Reward: 15.0, Mean of 100 cons episodes: 0
Episode: 13, Reward: 10.0, Mean of 100 cons episodes: 0
Episode: 14, Reward: 9.0, Mean of 100 cons episodes: 0
Episode: 15, Reward: 11.0, Mean of 100 cons episodes: 0
Episode: 16, Reward: 28.0, Mean of 100 cons episodes: 0
Episode: 17, Reward: 14.0, Mean of 100 cons episodes: 0
Epi

## Results
It takes around 1-2 minutes to run the above cell. You will probably see some warnings or error messages; some of them are caused by incompatibilities between libraries, so don't panic. The problem will probably be solved after 800-1200 episodes.