
# Policy Gradient
Reference: https://towardsdatascience.com/policy-gradient-reinforce-algorithm-with-baseline-e95ace11c1c4

## REINFORCE
Policy gradient methods learn the policy directly by parameterizing it.

<img src ="https://miro.medium.com/max/258/0*FyVhASxbAyPKGl0Y" />

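For optimization, however, we still need the value function V(θ) as the objective function.

Goal: maximize V(θ), the expected total reward over trajectories τ generated by following the policy π.

Note that the value function V(θ) is computed from the parameterized policy π(θ); it is not itself directly parameterized by θ.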

<img src="https://miro.medium.com/max/653/1*fu9HGx2iEqvh89-LyCs3VA.png" />

Here τ is a state-action trajectory.

<img src="https://miro.medium.com/max/596/1*1G97-c15iPXFPiWuR2nzRQ.png" />

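The goal is to find the policy parameters θ that maximize V(θ).

To do so, we search for the maximum of V(θ) by gradient ascent, following the gradient of V(θ) with respect to the policy parameters θ.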

<img src ="https://miro.medium.com/max/518/1*Vmmy0Aq-owGqYvldSE9PPw.png" />

The policy π(θ) is usually modeled with a softmax, a Gaussian, or a neural network, so that it is differentiable.
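
As a minimal sketch of what a differentiable policy looks like, the snippet below builds a linear-softmax policy in NumPy and computes the gradient of log π(a | s) analytically. The function names (`softmax`, `log_prob`, `grad_log_prob`), the parameter matrix `theta`, and the example numbers are assumptions made for illustration; the implementation later in this notebook uses a neural network instead.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


def log_prob(theta, state, action):
    """log pi_theta(action | state) for a linear-softmax policy (illustrative)."""
    logits = state @ theta  # shape: (num_actions,)
    return np.log(softmax(logits)[action])


def grad_log_prob(theta, state, action):
    """Analytic gradient of log pi_theta(action | state) with respect to theta."""
    probs = softmax(state @ theta)
    one_hot = np.eye(theta.shape[1])[action]
    # d log pi(a|s) / d logits = one_hot(a) - probs,
    # then the chain rule through logits = state @ theta gives an outer product.
    return np.outer(state, one_hot - probs)


# Made-up example: 3 state features, 2 actions.
theta = np.array([[0.1, -0.2], [0.0, 0.3], [0.5, 0.5]])
state = np.array([1.0, 2.0, -1.0])

print(softmax(state @ theta))                 # action probabilities, sum to 1
print(grad_log_prob(theta, state, action=1))  # same shape as theta
```

Because the gradient of the log-probability is available in closed form (or through automatic differentiation for a neural network), the policy parameters can be updated by gradient ascent.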

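Here we implement REINFORCE, a popular variant of the vanilla policy gradient, which exploits the temporal structure of a trajectory to compute the gradient: each action is credited only with the rewards that come after it.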

<img src ="https://miro.medium.com/max/525/1*LzsbTMrlsKhxNIKeqUsOXw.png" />
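
For readers who cannot load the figure, the REINFORCE gradient estimate is commonly written as follows. This is the usual textbook form and may differ in notation from the figure; the batch size $m$, the episode length $T$, and the discount factor $\gamma$ are symbols assumed here for illustration.

$$
\nabla_\theta V(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, G_t^{(i)},
\qquad
G_t^{(i)} = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_k^{(i)}
$$

Here $G_t^{(i)}$ is the discounted return that follows step $t$ in episode $i$, so each action is weighted only by the rewards collected after it.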


## Training procedure
<img src="https://miro.medium.com/max/755/1*MhgPkwPnEN2ytvN9mLRxfA.png" />

    loop over training iterations
      loop over the episodes collected in this iteration
        for each step t in the episode
          compute the return G_t
          update θ by gradient ascent on log π(a_t | s_t) * G_t
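
To make the loop above concrete, here is a hedged toy sketch of the REINFORCE update on a two-armed bandit, where every episode has a single step so that G_t = r_t. The bandit, its rewards, the learning rate, and the random seed are all made up for illustration; this is not the CartPole implementation that follows later.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()


arm_rewards = np.array([0.0, 1.0])  # hypothetical toy problem: arm 1 always pays 1
logits = np.zeros(2)                # policy parameters theta (one logit per arm)
learning_rate = 0.5

for episode in range(500):
    probs = softmax(logits)
    action = rng.choice(2, p=probs)              # sample a_t from pi_theta
    g_t = arm_rewards[action]                    # one-step episode, so G_t = r_t
    grad_log_pi = np.eye(2)[action] - probs      # gradient of log pi_theta(a_t) w.r.t. logits
    logits += learning_rate * grad_log_pi * g_t  # gradient ascent step

final_probs = softmax(logits)
print(final_probs)  # most of the probability mass ends up on arm 1
```

Only pulls of the rewarding arm change the parameters here, so the probability of that arm can only grow. With noisy rewards the same update still works on average, but it becomes much noisier.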

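## Policy Gradient with Baseline
One drawback of policy gradient methods is the high variance caused by the empirical returns.
- A common way to reduce the variance is to subtract a baseline b(s) from the returns in the policy gradient.
- The baseline is essentially a proxy for the expected actual return, and it must not introduce bias into the policy gradient.
- In practice, the value function itself is a good candidate for the baseline. The new term obtained after subtracting the baseline is defined as the advantage A_t.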

<img src ="https://miro.medium.com/max/610/1*j4yoEArIPogOiS96mvICtQ.png" />

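- As a side note, there is another popular policy gradient variant, the actor-critic method. Instead of the empirical return G_t, it uses another parameterized model Q(s, a) to approximate the advantage. This also helps reduce the variance, at the cost of increased bias.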
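
The claim that a baseline does not bias the gradient can be checked numerically. The sketch below assumes a softmax policy over three actions with made-up probabilities and a made-up baseline value. It shows that the expected baseline term vanishes as long as b(s) does not depend on the action, and then computes the advantage for a short episode.

```python
import numpy as np

probs = np.array([0.2, 0.5, 0.3])  # pi(a | s) for one state, 3 actions (made-up numbers)
baseline = 4.0                     # b(s): any value that does not depend on the action

# Score function of a softmax policy: d log pi(a) / d logits = one_hot(a) - probs.
scores = np.eye(3) - probs         # row a holds the score for action a

# Expected contribution of the baseline: sum_a pi(a) * score(a) * b(s).
baseline_term = (probs[:, None] * scores * baseline).sum(axis=0)
print(baseline_term)               # numerically zero in every component

# Advantage A_t = G_t - b(s_t) for a short, made-up episode.
returns = np.array([3.0, 2.0, 1.0])  # empirical returns G_t
values = np.array([2.5, 2.5, 0.5])   # baseline predictions b(s_t)
advantages = returns - values
print(advantages)
```

Since the baseline term averages to zero, subtracting it changes only the variance of the gradient estimate and leaves its expectation unchanged.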

The baseline is a parameterized value function. It can be learned by minimizing the mean squared error between the empirical returns and the baseline's predictions.

<img src="https://miro.medium.com/max/425/1*-deBYHAB5na7tZnYW6twLQ.png" />
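
As a rough illustration of this objective, the sketch below fits a linear baseline b(s) = s · w to synthetic returns by gradient descent on the mean squared error. The data, the linear model, and the hyperparameters are assumptions chosen to keep the example small; the notebook's own baseline is a neural network trained with Adam.

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(200, 4))  # made-up batch of 4-dimensional states
true_w = np.array([1.0, -2.0, 0.5, 3.0])
returns = states @ true_w + 0.1 * rng.normal(size=200)  # synthetic "empirical returns"

w = np.zeros(4)                     # parameters of the linear baseline b(s) = s . w
learning_rate = 0.1
losses = []
for step in range(200):
    predictions = states @ w
    error = predictions - returns
    losses.append(np.mean(error ** 2))            # MSE between returns and predictions
    grad = 2.0 * states.T @ error / len(returns)  # gradient of the MSE with respect to w
    w -= learning_rate * grad                     # gradient descent step

print(losses[0], losses[-1])        # the loss drops toward the noise level
```

The fitted predictions then play the role of b(s) when the advantage is computed.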

With a baseline, the new training loop has two additional steps: computing the advantage and updating the baseline model.

<img src="https://miro.medium.com/max/841/1*6065jhiJnZ3EcnPt7TfvdQ.png" />

# Python implementation (TensorFlow 2)



## 1. Policy Net

In [1]:
import tensorflow_probability as tfp
from tensorflow import keras
from tensorflow.keras import layers


class PolicyNet():
    def __init__(self, input_size, output_size):
        self.model = keras.Sequential(
            layers=[
                keras.Input(shape=(input_size,)),
                layers.Dense(64, activation="relu", name="relu_layer"),
                layers.Dense(output_size, activation="linear", name="linear_layer")
            ],
            name="policy")

    def action_distribution(self, observations):
        logits = self.model(observations)
        return tfp.distributions.Categorical(logits=logits)

    def sampel_action(self, observations):
        sampled_actions = self.action_distribution(observations).sample().numpy()
        return sampled_actions

## 2. Baseline network

The value function is used as the baseline.
- input: current state
- output: predicted value V(s)



In [2]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers


class BaselineNet():
    def __init__(self, input_size, output_size):
        self.model = keras.Sequential(
            layers=[
                keras.Input(shape=(input_size,)),
                layers.Dense(64, activation="relu", name="relu_layer"),
                layers.Dense(output_size, activation="linear", name="linear_layer")
            ],
            name="baseline")
        self.optimizer = tf.keras.optimizers.Adam(learning_rate=3e-2)

    def forward(self, observations):
        output = tf.squeeze(self.model(observations))
        return output

    # target: the empirical returns collected while playing the games.
    # The baseline network is updated by minimizing the mean squared error
    # between those returns and its own predictions.
    def update(self, observations, target):
        with tf.GradientTape() as tape:
            predictions = self.forward(observations)
            loss = tf.keras.losses.mean_squared_error(y_true=target, y_pred=predictions)
        grads = tape.gradient(loss, self.model.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_weights))


## 3. PolicyGradient class


In [3]:
import numpy as np
import os

import tensorflow as tf
from gym import wrappers
import matplotlib.pyplot as plt

# from model.baseline_net import BaselineNet
# from model.policy_net import PolicyNet


def export_plot(ys, ylabel, title, filename):
    plt.figure()
    plt.plot(range(len(ys)), ys)
    plt.xlabel("Training Episode")
    plt.ylabel(ylabel)
    plt.title(title)
    plt.savefig(filename)
    plt.close()


class PolicyGradient(object):
    # Initialize everything the model needs, including the policy and baseline networks defined above.
    def __init__(self, env, num_iterations=300, batch_size=2000, max_ep_len=200, output_path="../results/"):
        self.output_path = output_path
        if not os.path.exists(output_path):
            os.makedirs(output_path)
        self.env = env
        self.observation_dim = self.env.observation_space.shape[0]
        self.action_dim = self.env.action_space.n
        self.gamma = 0.99
        self.num_iterations = num_iterations
        self.batch_size = batch_size
        self.max_ep_len = max_ep_len
        self.optimizer = tf.keras.optimizers.Adam(learning_rate=3e-2)
        self.policy_net = PolicyNet(input_size=self.observation_dim, output_size=self.action_dim)
        self.baseline_net = BaselineNet(input_size=self.observation_dim, output_size=1)

    # We play several roll-outs to collect data:
    # 1. sample an action from the policy network;
    # 2. play a step to get the reward and the next state;
    # 3. at the end of each episode, store the trajectory of
    #    states, actions, and rewards.
    # These trajectories are used to update the policy network and the baseline network.
    def play_games(self, env=None, num_episodes = None):
        episode = 0
        episode_rewards = []
        paths = []
        t = 0
        if not env:
            env = self.env

        while (num_episodes or t < self.batch_size):
            state = env.reset()
            states, actions, rewards = [], [], []
            episode_reward = 0

            for step in range(self.max_ep_len):
                states.append(state)
                action = self.policy_net.sampel_action(np.atleast_2d(state))[0]
                state, reward, done, _ = env.step(action)
                actions.append(action)
                rewards.append(reward)
                episode_reward += reward
                t += 1

                if (done or step == self.max_ep_len-1):
                    episode_rewards.append(episode_reward)
                    break
                if (not num_episodes) and t == self.batch_size:
                    break

            path = {"observation": np.array(states),
                    "reward": np.array(rewards),
                    "action": np.array(actions)}
            paths.append(path)
            episode += 1
            if num_episodes and episode >= num_episodes:
                break
        return paths, episode_rewards

    # With the collected trajectories we can compute the return of every state.
    # Summing all discounted future rewards from step t to the end of the episode,
    # separately for every t, costs O(n*n). To avoid that, we accumulate the
    # returns backwards instead:
    # 1. start from the last step, whose return is G_t = r_t;
    # 2. walk the episode in reverse order, using the relationship
    #    G_t = r_t + gamma * G_t+1.

    # Once we have the returns of all episodes, we flatten them into one batch.
    # The `target` parameter of BaselineNet.update() is exactly these returns.

    def get_returns(self, paths):
        all_returns = []
        for path in paths:
            rewards = path["reward"]
            # Pre-allocate the array instead of inserting at the front of a list,
            # so that one episode really costs O(n).
            returns = np.zeros(len(rewards))
            g_t = 0.0
            # Walk the episode backwards: G_t = r_t + gamma * G_t+1
            for i in reversed(range(len(rewards))):
                g_t = rewards[i] + self.gamma * g_t
                returns[i] = g_t
            all_returns.append(returns)
        return np.concatenate(all_returns)

    # With the forecasted values V(s) from the baseline network, 
    # and the empirical returns, we can also get the advantages.
    def get_advantage(self, returns, observations):
        values = self.baseline_net.forward(observations).numpy()
        advantages = returns - values
        advantages = (advantages-np.mean(advantages)) / np.sqrt(np.sum(advantages**2))
        return advantages

    # Finally, we have all the pieces needed to update the policy network. 
    # Remember in policy gradient, 
    # the goal is to maximize the value we obtained by following the policy, 
    # which is equivalent to minimizing the negative values (loss).
    def update_policy(self, observations, actions, advantages):
        observations = tf.convert_to_tensor(observations)
        actions = tf.convert_to_tensor(actions)
        advantages = tf.convert_to_tensor(advantages)
        with tf.GradientTape() as tape:
            log_prob = self.policy_net.action_distribution(observations).log_prob(actions)
            loss = -tf.math.reduce_mean(log_prob * tf.cast(advantages, tf.float32))
        grads = tape.gradient(loss, self.policy_net.model.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.policy_net.model.trainable_weights))

    # Let's put everything together and train the model.
    # The complete training logic is implemented in train().
    # Each iteration repeats the same procedure:
    # 1. call play_games() to collect several episodic trajectories;
    # 2. flatten the trajectories (a list of lists) into one batch;
    # 3. compute returns and advantages from the batch;
    # 4. update both the baseline network and the policy network.
    def train(self):
        all_total_rewards = []
        averaged_total_rewards = []
        for t in range(self.num_iterations):
            paths, total_rewards = self.play_games()
            all_total_rewards.extend(total_rewards)
            observations = np.concatenate([path["observation"] for path in paths])
            actions = np.concatenate([path["action"] for path in paths])
            returns = self.get_returns(paths)
            advantages = self.get_advantage(returns, observations)
            self.baseline_net.update(observations=observations, target=returns)
            self.update_policy(observations, actions, advantages)
            avg_reward = np.mean(total_rewards)
            averaged_total_rewards.append(avg_reward)
            print("Average reward for batch {}: {:04.2f}".format(t,avg_reward))
        print("Training complete")
        np.save(self.output_path+ "rewards.npy", averaged_total_rewards)
        export_plot(averaged_total_rewards, "Reward", "CartPole-v0", self.output_path + "rewards.png")

    # Two more helper functions evaluate the trained policy and
    # record a video of its performance on the CartPole environment.
    def eval(self, env, num_episodes=1):
        paths, rewards = self.play_games(env, num_episodes)
        avg_reward = np.mean(rewards)
        print("Average eval reward: {:04.2f}".format(avg_reward))
        return avg_reward

    def make_video(self):
        env = wrappers.Monitor(self.env, self.output_path+"videos", force=True)
        self.eval(env=env, num_episodes=1)
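
The backward recursion described in the comments before `get_returns` can be checked on its own with a tiny worked example. The standalone function `discounted_returns` and the numbers below are assumptions made for illustration; they mirror the recursion G_t = r_t + gamma * G_t+1 without depending on the class above.

```python
import numpy as np


def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1} for one episode in a single backward pass."""
    returns = np.zeros(len(rewards))
    g_t = 0.0
    for i in reversed(range(len(rewards))):
        g_t = rewards[i] + gamma * g_t
        returns[i] = g_t
    return returns


rewards = np.array([1.0, 1.0, 1.0])
# Backwards: G_2 = 1, G_1 = 1 + 0.5 * 1 = 1.5, G_0 = 1 + 0.5 * 1.5 = 1.75
print(discounted_returns(rewards, gamma=0.5))
```

The result matches the direct definition, in which every G_t sums all discounted rewards from step t to the end of the episode, but it needs only one pass over the rewards.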

Let's run the code. Uncomment the last line to render a video once training is done.

In [None]:
import gym

if __name__ == '__main__':
    env = gym.make("CartPole-v0")
    model = PolicyGradient(env)
    model.train()
    #model.make_video()