# Exercise 9.1 Policy Gradient on Continuous CartPole

## Goal

- understand policy gradient and implement it
- understand how each hyperparameter contributes to the learning process

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import gym
import numpy as np
import chula_rl as rl
from chula_rl.env.cartpolecont import ContinuousCartPoleEnv

# Step 1: Env

In [None]:
def make_env():
    env = ContinuousCartPoleEnv()
    env = rl.env.wrapper.EpisodeSummary(env)
    return env

## 1.1 Parallel Env (VecEnv)

This kind of env takes a vector of actions and returns a vector of states. Running several envs in parallel greatly stabilizes (and also speeds up) training, especially in on-policy learning.

Example of 2 parallel envs (you could use any number):

In [None]:
env = rl.env.DummyVecEnv([make_env] * 2)
s = env.reset()
print('s.shape:', s.shape)

You should see `(2, 4)`, which means 2 envs with 4 features each (the usual CartPole state).

An interesting property of the parallel env is that it "resets" each underlying env automatically when its episode is done. This means we can always keep taking actions and do not need to manage the underlying environments ourselves.
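The auto-reset behaviour can be illustrated with a toy stand-in. This is only a sketch: `ToyEnv` and `ToyAutoResetVecEnv` are hypothetical classes written for this example, and the real `DummyVecEnv` in `chula_rl` may be implemented differently.

```python
import numpy as np


class ToyEnv:
    """Hypothetical stand-in env: an episode ends after `horizon` steps."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(4, dtype=np.float32)

    def step(self, action):
        self.t += 1
        state = np.full(4, self.t, dtype=np.float32)
        done = self.t >= self.horizon
        return state, 1.0, done, {}


class ToyAutoResetVecEnv:
    """Steps every env with its own action and resets finished envs at once."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return np.stack([env.reset() for env in self.envs])

    def step(self, actions):
        states, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            s, r, done, _ = env.step(action)
            if done:
                # Replace the terminal state with the first state of a new episode.
                s = env.reset()
            states.append(s)
            rewards.append(r)
            dones.append(done)
        return np.stack(states), np.array(rewards), np.array(dones)


vec_env = ToyAutoResetVecEnv([ToyEnv(horizon=2), ToyEnv(horizon=3)])
s = vec_env.reset()
dones_seen = []
for _ in range(4):
    # We never call reset() again: finished envs restart on their own.
    s, r, done = vec_env.step(np.zeros((2, 1)))
    dones_seen.append(done.tolist())
```

The caller still receives the `done` flags, which matters later: returns must not be bootstrapped across an episode boundary.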

## 1.2 Continuous CartPole

This is the same as a normal CartPole. The only difference is that the action space is continuous: each action is a single float within (-1, 1).

Example of taking an action in a parallel env:

Each action has 1 dimension, so a batch of parallel actions has 2 dimensions (one row per env, one column per action dimension).

In [None]:
ss, r, done, info = env.step(np.array([[-0.8], [1.0]]))
print('ss.shape:', ss.shape)
print('r.shape:', r.shape)
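When actions come from a policy rather than being typed by hand, they first need to be brought into this shape and range. The helper below is a minimal sketch: `to_action_batch` is a hypothetical name, and clipping to the bounds is an assumption (the env may already handle out-of-range values on its own).

```python
import numpy as np


def to_action_batch(raw_actions, low=-1.0, high=1.0):
    """Clip raw samples to the action range and shape them as (n_env, 1)."""
    raw_actions = np.asarray(raw_actions, dtype=np.float32)
    return np.clip(raw_actions, low, high).reshape(-1, 1)


# One raw sample per env; the second one is out of range and gets clipped.
actions = to_action_batch([-0.8, 1.7])
print('actions.shape:', actions.shape)
```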

# Step 2: Vec n-step Explorer

In a parallel environment setting, we also need a compatible parallel explorer. The code is straightforward enough that we have already implemented it for you, but you are welcome to read it.

Go see `chula_rl.explorer.vec_many_step_explorer`

In policy gradient, we usually use an n-step return of some kind because it has lower variance than a full Monte Carlo return, which makes training more stable!
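As a rough sketch of what an n-step return looks like for parallel envs, the function below computes bootstrapped returns over a rollout of shape `(n_step, n_env)`. The function name, argument layout, and the way `dones` is used are assumptions for illustration; check the explorer's actual output format before reusing it.

```python
import numpy as np


def n_step_returns(rewards, dones, last_value, gamma=0.99):
    """Bootstrapped n-step returns for a rollout of shape (n_step, n_env).

    rewards[t] and dones[t] belong to the transition taken at step t;
    last_value is V(s) for the state after the final step, shape (n_env,).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    returns = np.zeros_like(rewards)
    running = np.asarray(last_value, dtype=np.float64)
    for t in reversed(range(rewards.shape[0])):
        # A finished episode cuts the bootstrap: nothing flows back across it.
        running = rewards[t] + gamma * (1.0 - dones[t]) * running
        returns[t] = running
    return returns


rewards = np.ones((3, 2))
dones = np.array([[0, 0], [0, 1], [0, 0]])  # env 1 finishes at step 1
last_value = np.array([10.0, 10.0])
returns = n_step_returns(rewards, dones, last_value, gamma=0.5)
```

Subtracting the critic's estimate `V(s_t)` from these returns gives an n-step TD residual, which can serve as the advantage.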

In [None]:
n_step, n_max_interaction = 20, 100_000  # example values only (assumed, untuned)
exp = rl.explorer.VecManyStepExplorer(n_step, n_max_interaction, env)

# Step 3: Advantage Actor-Critic (A2C) policy + n-step TD residual advantage

A2C requires two components: 
- Actor (policy)
- Critic (value function) 

Both are implemented as neural nets. We leave this section to you. 

Your A2C should subclass `chula_rl.policy.base_policy.BasePolicy`. 
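To make the two components concrete, here is a sketch of the quantities an A2C update typically needs, assuming a Gaussian policy. It is written in plain NumPy and only evaluates the loss values; a real implementation would use an autodiff framework and stop the gradient through the advantage. The function names are hypothetical, and the `BasePolicy` interface is not shown because the text above does not specify it.

```python
import numpy as np


def gaussian_log_prob(actions, mean, log_std):
    """Log-density of actions under a diagonal Gaussian, summed over action dims."""
    std = np.exp(log_std)
    z = (actions - mean) / std
    per_dim = -0.5 * z ** 2 - log_std - 0.5 * np.log(2.0 * np.pi)
    return per_dim.sum(axis=-1)


def a2c_loss_terms(log_prob, values, returns):
    """Actor and critic losses for one batch; all inputs have shape (batch,)."""
    advantages = returns - values  # n-step TD residual
    # The advantage is treated as a constant in the actor loss.
    actor_loss = -(log_prob * advantages).mean()
    # The critic regresses its value estimate towards the n-step return.
    critic_loss = 0.5 * (advantages ** 2).mean()
    return actor_loss, critic_loss


mean = np.zeros((2, 1))      # actor output: mean of the action distribution
log_std = np.zeros((2, 1))   # log of the policy std (std = 1 here)
actions = np.array([[0.0], [1.0]])
log_prob = gaussian_log_prob(actions, mean, log_std)
actor_loss, critic_loss = a2c_loss_terms(
    log_prob, values=np.array([1.0, 2.0]), returns=np.array([2.0, 1.0]))
```

Parameterizing the policy by `log_std` rather than `std` keeps the standard deviation positive without extra constraints, which is one common design choice among several.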

## Words of advice: 

- Your code will surely contain bugs! Developing it entirely in a Jupyter notebook might not be a good idea.
- There are a ton of hyperparameters, and finding the right values is no easy task.
- Finding the right values may require analyzing how the code behaves, which is hard if you don't log enough.
- So log EVERYTHING, and use TensorBoard to your advantage.
- For example, log the std of the policy and the current output of the value function. These will be invaluable when debugging.
- "Why is it so fragile ~" is a sentence that describes this section.
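One way to follow the logging advice is to gather all diagnostics of an update into a flat dict of scalars. The sketch below is an assumption about what is worth tracking; `collect_diagnostics` and the tag names are hypothetical.

```python
import numpy as np


def collect_diagnostics(log_std, values, advantages):
    """Summarise one update as a flat dict of scalars ready for logging."""
    return {
        'policy/std': float(np.exp(log_std).mean()),
        'value/mean': float(np.mean(values)),
        'advantage/mean': float(np.mean(advantages)),
        'advantage/std': float(np.std(advantages)),
    }


stats = collect_diagnostics(
    log_std=np.zeros(1),
    values=np.array([1.0, 3.0]),
    advantages=np.array([-1.0, 1.0]))
for tag, value in stats.items():
    print(tag, value)
```

With TensorBoard, each entry could be passed to a writer instead of `print`, for example `SummaryWriter.add_scalar(tag, value, global_step)` from `torch.utils.tensorboard`.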

## Run it

In case you forgot how to run it, here is how:

```
while True:
    data = exp.step(policy)
    policy.optimize_step(data)
```

## Extra: A2C + n-step Generalized Advantage

You are invited to implement the same A2C but using generalized advantage estimation (GAE) instead. Legend has it this is a better advantage estimate! 😎
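For reference, the generalized advantage can be sketched as an exponentially weighted sum of one-step TD residuals. The code below is a minimal NumPy version under assumed conventions: rollouts of shape `(n_step, n_env)`, and `dones[t]` marking that the episode ended on step `t`. The function name and arguments are hypothetical.

```python
import numpy as np


def gae_advantages(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for a rollout of shape (n_step, n_env).

    values[t] is V(s_t); last_value is V(s) after the final step, shape (n_env,).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    advantages = np.zeros_like(rewards)
    next_value = np.asarray(last_value, dtype=np.float64)
    running = np.zeros_like(next_value)
    for t in reversed(range(rewards.shape[0])):
        not_done = 1.0 - dones[t]
        # One-step TD residual; no bootstrap across an episode boundary.
        delta = rewards[t] + gamma * not_done * next_value - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = np.ones((2, 1))
values = np.zeros((2, 1))
dones = np.zeros((2, 1))
adv = gae_advantages(rewards, values, dones, last_value=np.zeros(1),
                     gamma=0.5, lam=0.5)
```

Setting `lam=1` recovers the n-step return minus the value estimate, while `lam=0` reduces to the one-step TD residual, so `lam` trades variance against bias.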