# REINFORCE - Monte Carlo Policy Gradient

The REINFORCE algorithm performs gradient ascent with the following update rule:

$$
\mathbf{\theta}_{t+1} \doteq \mathbf{\theta}_t + \alpha G_t \nabla \ln \pi(A_t \vert S_t, \mathbf{\theta}_t)
$$

The symbols are defined as follows:

* $\theta$: policy parameters
* $\alpha$: learning rate
* $G_t$: return at time step $t$
* $\pi$: policy
* $A_t$: action selected by following policy $\pi$ at time step $t$
* $S_t$: state at time step $t$
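
As a minimal sketch of a single update, the snippet below applies the rule to a softmax policy over two actions whose logits play the role of $\theta$. The values of `alpha` and `G` and the chosen action are illustrative assumptions, not values taken from this notebook.

```python
import torch
from torch.distributions import Categorical

# Hypothetical setup: theta holds the logits of a 2-action softmax policy.
theta = torch.zeros(2, requires_grad=True)
alpha = 0.1                 # learning rate (assumed value)
G = 2.0                     # return observed from time step t (assumed value)
action = torch.tensor(1)    # action A_t that was taken (assumed)

# log pi(A_t | S_t, theta) and its gradient with respect to theta
log_pi = Categorical(logits=theta).log_prob(action)
log_pi.backward()

# Gradient ascent step: theta <- theta + alpha * G * grad log pi
with torch.no_grad():
    theta_next = theta + alpha * G * theta.grad

print(theta_next)
```

With a positive return, the logit of the chosen action increases and the other one decreases, so the chosen action becomes more likely.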

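A baseline can be added to REINFORCE. The advantage is that it leaves the expected value of the update unchanged while reducing its variance, which can be expected to make training more stable. Any baseline works as long as it does not depend on the action $A_t$.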
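
To see why the expectation is unchanged, here is a short sketch for a discrete action space, assuming the baseline $b(s)$ does not depend on the action:

$$
\mathbb{E}_{A \sim \pi}\Big[ b(s) \nabla \ln \pi(A \vert s, \mathbf{\theta}) \Big]
= b(s) \sum_a \pi(a \vert s, \mathbf{\theta}) \frac{\nabla \pi(a \vert s, \mathbf{\theta})}{\pi(a \vert s, \mathbf{\theta})}
= b(s) \nabla \sum_a \pi(a \vert s, \mathbf{\theta})
= b(s) \nabla 1 = 0
$$

The subtracted term therefore contributes nothing on average.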

Applying a baseline $b(S_t)$ to the REINFORCE update gives:

$$
\mathbf{\theta}_{t+1} \doteq \mathbf{\theta}_t + \alpha \Big(G_t - b(S_t) \Big) \nabla \ln \pi(A_t \vert S_t, \mathbf{\theta}_t)
$$
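
The variance effect can be illustrated with a toy problem that is small enough to evaluate exactly. The sketch below assumes a single state, two actions, a uniform softmax policy, and made-up returns of 10 and 11. None of these come from the notebook, and `estimator_stats` is a hypothetical helper.

```python
import torch

# Hypothetical one-state problem with two actions and a uniform softmax policy.
pi = torch.tensor([0.5, 0.5])          # pi(a | s, theta) for theta = [0, 0]
returns = torch.tensor([10.0, 11.0])   # return obtained after each action (assumed)

# For a softmax policy over logits, grad log pi(a) = onehot(a) - pi.
grad_log_pi = torch.eye(2) - pi        # row a is the gradient for action a

def estimator_stats(baseline: float):
    """Exact mean and variance of (G - b) * grad log pi under the policy."""
    samples = (returns - baseline).unsqueeze(1) * grad_log_pi  # one row per action
    mean = (pi.unsqueeze(1) * samples).sum(dim=0)
    var = (pi.unsqueeze(1) * (samples - mean) ** 2).sum(dim=0)
    return mean, var

mean_no_b, var_no_b = estimator_stats(0.0)
mean_b, var_b = estimator_stats(10.5)  # baseline = average return

print(mean_no_b, var_no_b)
print(mean_b, var_b)
```

In this toy case both estimators have the same mean, `[-0.25, 0.25]`, while the per-component variance drops from 27.5625 to 0 once the baseline is subtracted.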

This exercise uses the mean of the returns within a single episode as the baseline, so $b$ is one scalar per episode rather than a function of the state.

## Implement REINFORCE

### REINFORCE Agent

* Build the neural network
* Select an action
* Compute the loss

In [2]:
from typing import Tuple
import torch
import torch.nn as nn
from torch.distributions import Categorical

class REINFORCE(nn.Module):
    def __init__(self, obs_features: int, 
                 num_actions: int,
                 gamma: float = 0.99) -> None:
        super().__init__()
        
        self.gamma = gamma # discount factor
        
        # simple fully connected layer
        self.layer = nn.Sequential(
            nn.Linear(obs_features, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, num_actions)
        )
    
    def select_action(self, obs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Select an action from the agent's observation.

        Returns:
            Tuple[torch.Tensor, torch.Tensor]: selected_action, log_pi
        """
        pdparam = self.layer(obs)
        dist = Categorical(logits=pdparam)
        action = dist.sample()
        log_pi = dist.log_prob(action)
        return action, log_pi
    
    def compute_return(self, 
                       reward: torch.Tensor) -> torch.Tensor:
        """
        Compute the return G. `reward` is a 1-D tensor whose length equals the episode length.
        """
        returns = torch.empty_like(reward)
        G = 0.0
        T = len(reward)
        for t in reversed(range(T)):
            G = reward[t] + self.gamma * G
            returns[t] = G
        return returns
    
    def compute_reinforce_loss(self,
                               returns: torch.Tensor,
                               log_pi: torch.Tensor) -> torch.Tensor:
        """
        Compute the REINFORCE loss.
        """
        # REINFORCE with baseline
        returns = returns - returns.mean()
        # negate so that gradient descent on the loss performs gradient ascent on the objective
        return -(returns * log_pi).mean()
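
As a worked example of the return computation, the sketch below repeats the same backward recursion in a standalone function (`discounted_returns` is a hypothetical name) and applies it to an assumed 3-step episode with `gamma=0.5`, followed by the mean-of-returns baseline.

```python
import torch

def discounted_returns(reward: torch.Tensor, gamma: float) -> torch.Tensor:
    """Backward recursion G_t = r_t + gamma * G_{t+1}, with r_t = reward[t]."""
    returns = torch.empty_like(reward)
    G = 0.0
    for t in reversed(range(len(reward))):
        G = reward[t] + gamma * G
        returns[t] = G
    return returns

rewards = torch.tensor([1.0, 1.0, 1.0])    # assumed 3-step episode
returns = discounted_returns(rewards, gamma=0.5)
centered = returns - returns.mean()         # mean-of-returns baseline

print(returns)   # tensor([1.7500, 1.5000, 1.0000])
print(centered)
```

The centered returns sum to zero, so early steps with above-average returns are reinforced while later steps with below-average returns are discouraged.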

### Make CartPole Environment

In [3]:
import gym

env = gym.make("CartPole-v1") # for training
inference_env = gym.make("CartPole-v1", render_mode="human") # for inference

#### Observation Features

In [4]:
obs_features = env.observation_space.shape
obs_features

(4,)

#### Number of Actions

In [5]:
num_actions = env.action_space.n
num_actions

2

### Instantiate Policy

In [6]:
import torch.optim as optim

agent = REINFORCE(obs_features[0], num_actions)
optimizer = optim.Adam(agent.parameters(), lr=0.001)

### Check Outputs

In [7]:
obs, _ = env.reset()
obs

array([-0.00751555, -0.02465002, -0.00998014,  0.04450459], dtype=float32)

In [8]:
action, log_pi = agent.select_action(torch.from_numpy(obs))
(action, log_pi)

(tensor(1), tensor(-0.7011, grad_fn=<SqueezeBackward1>))

In [9]:
env.step(action.numpy())

(array([-0.00800855,  0.17061362, -0.00909005, -0.25131038], dtype=float32),
 1.0,
 False,
 False,
 {})

### Training Start!

In [10]:
from torch.utils.tensorboard import SummaryWriter

logger = SummaryWriter("results/CartPole-v1_REINFORCE")

total_episodes = 501
inference_freq = 50

for episode in range(total_episodes):
    reward_buffer, log_pi_buffer = [], []
    
    obs, _ = env.reset()
    terminated = False
    cumulative_reward = 0
    
    while not terminated:
        # take action and observe
        action, log_pi = agent.select_action(torch.from_numpy(obs))
        next_obs, reward, terminated, truncated, _ = env.step(action.numpy())
        terminated = terminated | truncated
        
        # update buffer
        reward_buffer.append(reward) # scalar
        log_pi_buffer.append(log_pi) # 0-dim tensor
        
        # update info
        cumulative_reward += reward
        obs = next_obs
        
    # buffer to tensor
    reward_tensor = torch.tensor(reward_buffer)
    log_pi_tensor = torch.stack(log_pi_buffer)
    reward_buffer.clear()
    log_pi_buffer.clear()
    
    # compute loss
    returns = agent.compute_return(reward_tensor)
    loss = agent.compute_reinforce_loss(returns, log_pi_tensor)
    
    # gradient step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # log data
    logger.add_scalar("Cumulative Reward", cumulative_reward, episode)
    logger.add_scalar("Policy Loss", loss.item(), episode)
    
    # inference
    if episode % inference_freq == 0:
        obs, _ = inference_env.reset()
        terminated = False
        inference_cumulative_reward = 0
        while not terminated:
            with torch.no_grad():
                action, _ = agent.select_action(torch.from_numpy(obs))
            next_obs, reward, terminated, truncated, _ = inference_env.step(action.numpy())
            terminated = terminated | truncated

            inference_cumulative_reward += reward
            obs = next_obs
            
        print(f"inference - episode: {episode}, cumulative reward: {inference_cumulative_reward}")

logger.flush()
logger.close()

env.close()
inference_env.close()

inference - episode: 0, cumulative reward: 16.0
inference - episode: 50, cumulative reward: 22.0
inference - episode: 100, cumulative reward: 79.0
inference - episode: 150, cumulative reward: 228.0
inference - episode: 200, cumulative reward: 189.0
inference - episode: 250, cumulative reward: 500.0
inference - episode: 300, cumulative reward: 252.0
inference - episode: 350, cumulative reward: 500.0
inference - episode: 400, cumulative reward: 232.0
inference - episode: 450, cumulative reward: 473.0
inference - episode: 500, cumulative reward: 500.0


Run the following command in an Anaconda shell to view the results:

```
$ tensorboard --logdir=results
```