[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MaxMitre/DeepLearning/blob/main/Semana11_ProximalPolicyOptimization.ipynb)

# Proximal Policy Optimization

In [None]:
!sudo apt-get install xvfb
!pip install pyvirtualdisplay > /dev/null 2>&1
!pip install git+https://github.com/tensorflow/docs > /dev/null 2>&1

In [None]:
!pip show tensorflow

# Motivation

Some of the methods covered earlier have shortcomings. Among the problems they can show are:

1. **Instability in policy updates**: Depending on the size of the gradients, a single update can move the policy too far.

2. **Inefficient use of data**: Updates ignore the history of the data; samples are "used and discarded in the same iteration."

In [None]:
import numpy as np
import tensorflow as tf
import gym
import tensorflow_probability as tfp
import tensorflow.keras.losses as kls
from tensorflow.compat.v1.losses import mean_squared_error

# Idea

Use how much the policy changes between updates to control the policy-improvement step.

In policy gradient methods we use the following objective function (differentiating it gives the update rule for the parameters):
$$
L^{PG} = \hat{\mathbb{E}}_t [ \log \pi_{\theta}(a_t \vert s_t) \, \hat{A}_t]
$$

Now we use the ratio $r_t(\theta)$, which measures how much the policy changes when the parameter $\theta$ is modified:

$$
\hat{\mathbb{E}}_t [ \dfrac{\pi_{\theta}(a_t \vert s_t)}{\pi_{\theta_{old}}(a_t \vert s_t)} \, \hat{A}_t] = \hat{\mathbb{E}}_t [r_t(\theta) \hat{A}_t]
$$

In addition, we restrict this ratio when it moves too far from 1, since that could update the policy with very large gradients and destabilize training. We do so as follows:

$$
L^{CLIP} (\theta) = \hat{\mathbb{E}}_t [ \min(r_t(\theta) \hat{A}_t , \operatorname{clip}(r_t(\theta), 1 - \epsilon, 1+ \epsilon) \hat{A}_t )]
$$
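
To make the effect of the clipping concrete, here is a minimal NumPy sketch of the per-sample term inside $L^{CLIP}$. The function name `clipped_surrogate` and the sample ratios are illustrative assumptions, not part of the notebook's agent; $\epsilon = 0.2$ matches the value the agent uses later.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = np.asarray(ratio, dtype=np.float64)
    advantage = np.asarray(advantage, dtype=np.float64)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return np.minimum(unclipped, clipped)

# Hypothetical ratios: policy moved away from, stayed at, and moved toward the action
ratios = np.array([0.5, 1.0, 1.5])
positive = clipped_surrogate(ratios, advantage=1.0)
negative = clipped_surrogate(ratios, advantage=-1.0)
print(positive)  # [0.5 1.  1.2]
print(negative)  # [-0.8 -1.  -1.5]
```

With a positive advantage the gain is capped once the ratio exceeds $1 + \epsilon$; with a negative advantage the term is bounded from above once the ratio falls below $1 - \epsilon$, so in both cases there is no incentive to push the ratio further out of the interval.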

This is an auxiliary term. The final objective function has two more parts: one for the error of the value function (as used in Actor-Critic) and a third that involves the entropy of the policy.

**Question**: What does the entropy of a policy represent?



In [None]:
probs = [0.1,0.2,0.7,0]  # a peaked distribution; the entries must sum to 1

In [None]:
entropia=0
for i in probs:
  if i!=0:
    entropia -= i*np.log(i)
print(entropia)

In [None]:
probs = [0.25,0.25,0.25,0.25]

In [None]:
entropia=0
for i in probs:
  if i!=0:
    entropia -= i*np.log(i)
print(entropia)

The final form of our objective function is shown below. The entropy term enters with a positive sign because it acts as an exploration bonus:

$$
L_t^{CLIP + VF + S} (\theta) = \hat{\mathbb{E}}_t  [ L_{t}^{CLIP} (\theta) - c_1 L_{t}^{VF} (\theta) + c_2 S[\pi_{\theta}] (s_t) ]
$$
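
As a rough illustration of how the three parts fit together, the sketch below evaluates the combined objective on a small batch with NumPy. The entropy is added as a bonus, as in the PPO paper and in the agent's `actor_loss` further down. The function name `ppo_objective` and the coefficients `c1=0.5`, `c2=0.01` are assumptions chosen for the example, not values taken from the notebook.

```python
import numpy as np

def ppo_objective(ratio, advantage, value_pred, value_target, probs,
                  epsilon=0.2, c1=0.5, c2=0.01):
    """Scalar objective: L_CLIP - c1 * L_VF + c2 * entropy (to be maximized)."""
    ratio = np.asarray(ratio, dtype=np.float64)
    advantage = np.asarray(advantage, dtype=np.float64)
    value_pred = np.asarray(value_pred, dtype=np.float64)
    value_target = np.asarray(value_target, dtype=np.float64)
    probs = np.asarray(probs, dtype=np.float64)

    # Clipped surrogate, averaged over the batch
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    l_clip = np.mean(np.minimum(ratio * advantage, clipped * advantage))

    # Value-function error: mean squared difference to the target returns
    l_vf = np.mean((value_pred - value_target) ** 2)

    # Entropy per state (sum over actions), averaged over the batch;
    # the small constant avoids log(0)
    entropy = np.mean(-np.sum(probs * np.log(probs + 1e-10), axis=1))

    return l_clip - c1 * l_vf + c2 * entropy

# Hypothetical batch of two states with two actions each
objective = ppo_objective(
    ratio=[1.0, 1.0],
    advantage=[1.0, -1.0],
    value_pred=[1.0, 2.0],
    value_target=[1.0, 2.0],
    probs=[[0.5, 0.5], [0.5, 0.5]],
)
print(objective)
```

In this batch the surrogate and the value error are both zero, so only the entropy bonus of the uniform policy remains. A training loop would minimize the negative of this quantity.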

In [None]:
env= gym.make("CartPole-v0")
low = env.observation_space.low
high = env.observation_space.high

# Neural networks

Critic: returns the estimated value of the state.

Actor: returns the probability of each action (the policy).

In [None]:
class critic(tf.keras.Model):
  def __init__(self):
    super().__init__()
    self.d1 = tf.keras.layers.Dense(128,activation='relu')
    self.v = tf.keras.layers.Dense(1, activation = None)

  def call(self, input_data):
    x = self.d1(input_data)
    v = self.v(x)
    return v


class actor(tf.keras.Model):
  def __init__(self):
    super().__init__()
    self.d1 = tf.keras.layers.Dense(128,activation='relu')
    self.a = tf.keras.layers.Dense(2,activation='softmax')

  def call(self, input_data):
    x = self.d1(input_data)
    a = self.a(x)
    return a

Definition of the agent and its methods

Actor Loss:

- The actor loss takes the current probabilities, actions, advantages, old probabilities, and the critic loss (the value-function loss) as inputs.

- First, the entropy of the policy is computed and averaged.

- Next, we loop over the probabilities, advantages, and old probabilities to compute the ratios and the clipped ratios, and append the resulting terms to lists.

- Finally, we compute the loss. Recall that this is gradient ascent, because we look for the policy that gives the largest objective value, so the objective is negated before it is handed to the optimizer.
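
The ratio step in the list above can also be written without a Python loop. The following NumPy sketch is an assumption about how one might vectorize it, not the notebook's implementation: it picks the probability of the action actually taken in each row and divides new by old. The name `action_ratios` and the sample values are hypothetical.

```python
import numpy as np

def action_ratios(new_probs, old_probs, actions):
    """Ratio pi_new(a_t|s_t) / pi_old(a_t|s_t) for the action taken in each row."""
    new_probs = np.asarray(new_probs, dtype=np.float64)
    old_probs = np.asarray(old_probs, dtype=np.float64)
    # Column vector of action indices, one per sample
    idx = np.asarray(actions, dtype=np.int64)[:, None]
    new_p = np.take_along_axis(new_probs, idx, axis=1)[:, 0]
    old_p = np.take_along_axis(old_probs, idx, axis=1)[:, 0]
    return new_p / old_p

# Hypothetical batch: two states, two actions
new_probs = [[0.6, 0.4], [0.3, 0.7]]
old_probs = [[0.5, 0.5], [0.5, 0.5]]
actions = [0, 1]

ratios = action_ratios(new_probs, old_probs, actions)
clipped = np.clip(ratios, 1.0 - 0.2, 1.0 + 0.2)
print(ratios)   # [1.2 1.4]
print(clipped)  # [1.2 1.2]
```

The second sample's ratio of 1.4 lies outside $[0.8, 1.2]$, so the clipped version replaces it with 1.2.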

In [None]:
class agent():
    def __init__(self, gamma = 0.99):
        self.gamma = gamma
        # self.a_opt = tf.keras.optimizers.Adam(learning_rate=1e-5)
        # self.c_opt = tf.keras.optimizers.Adam(learning_rate=1e-5)
        self.a_opt = tf.keras.optimizers.Adam(learning_rate=7e-3)
        self.c_opt = tf.keras.optimizers.Adam(learning_rate=7e-3)
        self.actor = actor()
        self.critic = critic()
        self.clip_pram = 0.2


    def act(self,state):
        prob = self.actor(np.array([state]))
        prob = prob.numpy()
        dist = tfp.distributions.Categorical(probs=prob, dtype=tf.float32)
        action = dist.sample()
        return int(action.numpy()[0])



    def actor_loss(self, probs, actions, adv, old_probs, closs):
        # Mean of -p * log(p) over all entries, used as an entropy bonus
        probability = probs
        entropy = tf.reduce_mean(tf.math.negative(tf.math.multiply(probability,tf.math.log(probability))))
        sur1 = []
        sur2 = []

        for pb, t, op, a in zip(probability, adv, old_probs, actions):
            t = tf.constant(t)
            # Ratio r_t: new vs. old probability of the action that was taken
            ratio = tf.math.divide(pb[a], op[a])
            # Unclipped and clipped surrogate terms
            s1 = tf.math.multiply(ratio, t)
            s2 = tf.math.multiply(tf.clip_by_value(ratio, 1.0 - self.clip_pram, 1.0 + self.clip_pram), t)
            sur1.append(s1)
            sur2.append(s2)

        sr1 = tf.stack(sur1)
        sr2 = tf.stack(sur2)

        # Negate the objective: the optimizer minimizes, PPO maximizes
        loss = tf.math.negative(tf.reduce_mean(tf.math.minimum(sr1, sr2)) - closs + 0.001 * entropy)
        return loss

    def learn(self, states, actions,  adv , old_probs, discnt_rewards):
        discnt_rewards = tf.reshape(discnt_rewards, (len(discnt_rewards),))
        adv = tf.reshape(adv, (len(adv),))

        old_p = old_probs

        old_p = tf.reshape(old_p, (len(old_p),2))
        with tf.GradientTape() as tape1, tf.GradientTape() as tape2:
            p = self.actor(states, training=True)
            v =  self.critic(states,training=True)
            v = tf.reshape(v, (len(v),))
            td = tf.math.subtract(discnt_rewards, v)
            c_loss = 0.5 * mean_squared_error(discnt_rewards, v)
            a_loss = self.actor_loss(p, actions, adv, old_probs, c_loss)

        grads1 = tape1.gradient(a_loss, self.actor.trainable_variables)
        grads2 = tape2.gradient(c_loss, self.critic.trainable_variables)
        self.a_opt.apply_gradients(zip(grads1, self.actor.trainable_variables))
        self.c_opt.apply_gradients(zip(grads2, self.critic.trainable_variables))
        return a_loss, c_loss

In [None]:
def test_reward(env):
  total_reward = 0
  state = env.reset()
  done = False
  while not done:
    action = np.argmax(agentoo7.actor(np.array([state])).numpy())
    next_state, reward, done, _ = env.step(action)
    state = next_state
    total_reward += reward

  return total_reward

In [None]:

def preprocess1(states, actions, rewards, done, values, gamma):
    g = 0
    lmbda = 0.95
    returns = []
    for i in reversed(range(len(rewards))):
       # done[i] is 1 while the episode continues and 0 at its last step
       delta = rewards[i] + gamma * values[i + 1] * done[i] - values[i]
       g = delta + gamma * lmbda * done[i] * g
       returns.append(g + values[i])

    returns.reverse()
    adv = np.array(returns, dtype=np.float32) - values[:-1]
    adv = (adv - np.mean(adv)) / (np.std(adv) + 1e-10)
    states = np.array(states, dtype=np.float32)
    actions = np.array(actions, dtype=np.int32)
    returns = np.array(returns, dtype=np.float32)
    return states, actions, returns, adv


tf.random.set_seed(336699)
agentoo7 = agent()
steps = 5000
ep_reward = []
total_avgr = []
target = False
best_reward = 0
avg_rewards_list = []


for s in range(steps):
  if target == True:
          break

  done = False
  state = env.reset()
  all_aloss = []
  all_closs = []
  rewards = []
  states = []
  actions = []
  probs = []
  dones = []
  values = []
  print("new episode")

  for e in range(128):

    action = agentoo7.act(state)
    value = agentoo7.critic(np.array([state])).numpy()
    next_state, reward, done, _ = env.step(action)
    dones.append(1-done)
    rewards.append(reward)
    states.append(state)
    #actions.append(tf.one_hot(action, 2, dtype=tf.int32).numpy().tolist())
    actions.append(action)
    prob = agentoo7.actor(np.array([state]))
    probs.append(prob[0])
    values.append(value[0][0])
    state = next_state
    if done:
      # Start the next episode from the freshly reset state
      state = env.reset()

  value = agentoo7.critic(np.array([state])).numpy()
  values.append(value[0][0])
  probs = np.stack(probs, axis=0)

  states, actions,returns, adv  = preprocess1(states, actions, rewards, dones, values, 1)

  for epocs in range(10):
      al,cl = agentoo7.learn(states, actions, adv, probs, returns)
      # print(f"al{al}")
      # print(f"cl{cl}")

  avg_reward = np.mean([test_reward(env) for _ in range(5)])
  print(f"total test reward is {avg_reward}")
  avg_rewards_list.append(avg_reward)
  if avg_reward > best_reward:
        print('best reward=' + str(avg_reward))
        agentoo7.actor.save('model_actor_{}_{}.keras'.format(s, avg_reward))
        agentoo7.critic.save('model_critic_{}_{}.keras'.format(s, avg_reward))
        best_reward = avg_reward
  if best_reward == 200:
        target = True
  env.reset()

env.close()
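
The advantage computation in `preprocess1` above follows the pattern of generalized advantage estimation: a TD error per step, accumulated backwards with weight $\gamma \lambda$ and cut at episode ends. The standalone NumPy sketch below isolates that recursion on a toy rollout. The function name `gae`, its default parameters, and the sample data are assumptions made for the example; the advantage normalization done in the notebook is left out.

```python
import numpy as np

def gae(rewards, values, not_done, gamma=0.99, lam=0.95):
    """Backward GAE recursion.

    values holds len(rewards) + 1 entries (the last one bootstraps the tail);
    not_done[i] is 0 when step i ended an episode and 1 otherwise.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    not_done = np.asarray(not_done, dtype=np.float64)

    adv = np.zeros_like(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        # One-step TD error; the bootstrap term vanishes at episode ends
        delta = rewards[i] + gamma * values[i + 1] * not_done[i] - values[i]
        # Discounted sum of TD errors, reset at episode ends
        running = delta + gamma * lam * not_done[i] * running
        adv[i] = running

    returns = adv + values[:-1]
    return adv, returns

# Toy rollout: three steps with reward 1, a critic that predicts 0 everywhere,
# and an episode that ends at the last step
adv, returns = gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.0, 0.0, 0.0, 0.0],
    not_done=[1, 1, 0],
    gamma=1.0,
    lam=1.0,
)
print(adv)      # [3. 2. 1.]
print(returns)  # [3. 2. 1.]
```

With $\gamma = \lambda = 1$ and a zero critic, the advantages reduce to the plain reward-to-go, which makes the recursion easy to check by hand.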



In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.plot( range(len(avg_rewards_list)),avg_rewards_list,'b')
plt.title("Avg Test Reward Vs Test Episodes")
plt.xlabel("Test Episodes")
plt.ylabel("Average Test Reward")
plt.grid(True)
plt.show()

# Visualization

In [None]:
from IPython import display as ipythondisplay
from PIL import Image
from pyvirtualdisplay import Display

max_steps_per_episode = 100
display = Display(visible=0, size=(400, 300))
display.start()


def render_episode(env: gym.Env, model: agent, max_steps: int):
  # Reset first so the initial frame shows the start of the episode
  state = env.reset()
  screen = env.render(mode='rgb_array')
  im = Image.fromarray(screen)

  images = [im]

  for i in range(1, max_steps + 1):
    # Greedy action: the most probable action under the current policy
    action = np.argmax(model.actor(np.array([state])).numpy())

    state, _, done, _ = env.step(action)

    # Render screen every 10 steps
    if i % 10 == 0:
      screen = env.render(mode='rgb_array')
      images.append(Image.fromarray(screen))

    if done:
      break

  return images


# Save GIF image
images = render_episode(env, agentoo7, max_steps_per_episode)
image_file = 'cartpole-v0.gif'
# loop=0: loop forever, duration=1: play each frame for 1ms
images[0].save(
    image_file, save_all=True, append_images=images[1:], loop=0, duration=1)

In [None]:
import tensorflow_docs.vis.embed as embed
embed.embed_file(image_file)

# References

- [*Proximal Policy Optimization Algorithms*, OpenAI](https://arxiv.org/pdf/1707.06347.pdf)
- https://towardsdatascience.com/proximal-policy-optimization-ppo-with-tensorflow-2-x-89c9430ecc26