<a href="https://colab.research.google.com/github/FelipeSotoG/Desafio-4/blob/main/Desafio_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Boxing usando Actor Crítico (PPO)
Basado en [este algoritmo](https://keras.io/examples/rl/ppo_cartpole/)

Instalamos paquetes necesarios

In [1]:
!pip3 install gdown > /dev/null 2>&1 
!gdown https://drive.google.com/uc?id=1OylV6F7zJAesWKhNqFx7e2__zJkfVSEE > /dev/null 2>&1 
!unzip files2.zip 
!pip install gym-retro > /dev/null 2>&1 
!python3 -m retro.import .  > /dev/null 2>&1 

Archive:  files2.zip
 extracting: agent.png               
 extracting: bullet.png              
 extracting: bullet2.png             
 extracting: bunker.png              
 extracting: enemy_0.png             
 extracting: enemy_00.png            
 extracting: enemy_1.png             
 extracting: enemy_11.png            
 extracting: enemy_2.png             
 extracting: enemy_22.png            
 extracting: enemy_3.png             
 extracting: enemy_33.png            
 extracting: enemy_4.png             
 extracting: enemy_44.png            
 extracting: enemy_5.png             
 extracting: enemy_55.png            
  inflating: Space Invaders (1983) (CCE) (C-820).bin  


In [2]:
!pip install ale-py

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ale-py
  Downloading ale_py-0.7.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
[K     |████████████████████████████████| 1.6 MB 22.0 MB/s 
Installing collected packages: ale-py
Successfully installed ale-py-0.7.5


In [3]:
!pip install gym[all]
!pip install autorom[accept-rom-license]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting mujoco-py<2.0,>=1.50
  Downloading mujoco-py-1.50.1.68.tar.gz (120 kB)
[K     |████████████████████████████████| 120 kB 11.9 MB/s 
Collecting box2d-py~=2.3.5
  Downloading box2d_py-2.3.8-cp37-cp37m-manylinux1_x86_64.whl (448 kB)
[K     |████████████████████████████████| 448 kB 32.1 MB/s 
Collecting glfw>=1.4.0
  Downloading glfw-2.5.3-py2.py27.py3.py30.py31.py32.py33.py34.py35.py36.py37.py38-none-manylinux2014_x86_64.whl (206 kB)
[K     |████████████████████████████████| 206 kB 35.5 MB/s 
Collecting lockfile>=0.12.2
  Downloading lockfile-0.12.2-py2.py3-none-any.whl (13 kB)
Building wheels for collected packages: mujoco-py
  Building wheel for mujoco-py (setup.py) ... [?25lerror
[31m  ERROR: Failed building wheel for mujoco-py[0m
[?25h  Running setup.py clean for mujoco-py
Failed to build mujoco-py
Installing collected packages: lockfile, glfw, mujoco-py, box2d-py
    Ru

In [4]:
!apt-get install -y \
    libgl1-mesa-dev \
    libgl1-mesa-glx \
    libglew-dev \
    libosmesa6-dev \
    software-properties-common > /dev/null 2>&1 

!apt-get install -y patchelf > /dev/null 2>&1 

In [None]:
# Librerías para el entorno de gym
!pip install colabgymrender > /dev/null 2>&1 
!pip install free-mujoco-py > /dev/null 2>&1 

## Librerías necesarias

In [5]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import gym
import scipy.signal
import time

## Parámetros del algoritmo

In [6]:
n_frames= 4
steps_per_epoch = 1000
epochs = 10

hidden_sizes = (64, 64)
policy_learning_rate = 3e-4
value_function_learning_rate = 1e-3
train_policy_iterations = 80
train_value_iterations = 80
clip_ratio = 0.2
gamma = 0.99
lam = 0.97
target_kl = 0.01


## Buffer para almacenar las muestras

In [101]:
class Buffer:
    # Buffer for storing trajectories
    def __init__(self, observation_dimensions, size, gamma=0.99, lam=0.95):
        # Buffer initialization
        self.observation_buffer = np.zeros(
            (size, observation_dimensions), dtype=np.float32
        )
        self.action_buffer = np.zeros(size, dtype=np.int32)
        self.advantage_buffer = np.zeros(size, dtype=np.float32)
        self.reward_buffer = np.zeros(size, dtype=np.float32)
        self.return_buffer = np.zeros(size, dtype=np.float32)
        self.value_buffer = np.zeros(size, dtype=np.float32)
        self.logprobability_buffer = np.zeros(size, dtype=np.float32)
        self.gamma, self.lam = gamma, lam
        self.pointer, self.trajectory_start_index = 0, 0

    def store(self, observation, action, reward, value, logprobability):
        # Append one step of agent-environment interaction
        self.observation_buffer[self.pointer] = observation
        self.action_buffer[self.pointer] = action
        self.reward_buffer[self.pointer] = reward
        self.value_buffer[self.pointer] = value
        self.logprobability_buffer[self.pointer] = logprobability
        self.pointer += 1

    def finish_trajectory(self, last_value=0):
        # Finish the trajectory by computing advantage estimates and rewards-to-go
        path_slice = slice(self.trajectory_start_index, self.pointer)
        rewards = np.append(self.reward_buffer[path_slice], last_value)
        values = np.append(self.value_buffer[path_slice], last_value)

        deltas = rewards[:-1] + self.gamma * values[1:] - values[:-1]

        self.advantage_buffer[path_slice] = discounted_cumulative_sums(
            deltas, self.gamma * self.lam
        )
        self.return_buffer[path_slice] = discounted_cumulative_sums(
            rewards, self.gamma
        )[:-1]

        self.trajectory_start_index = self.pointer

    def get(self):
        # Get all data of the buffer and normalize the advantages
        self.pointer, self.trajectory_start_index = 0, 0
        advantage_mean, advantage_std = (
            np.mean(self.advantage_buffer),
            np.std(self.advantage_buffer),
        )
        self.advantage_buffer = (self.advantage_buffer - advantage_mean) / advantage_std
        return (
            self.observation_buffer,
            self.action_buffer,
            self.advantage_buffer,
            self.return_buffer,
            self.logprobability_buffer,
        )

## Creamos los modelos para el actor y el crítico

In [100]:
import retro
# Initialize the environment and get the dimensionality of the
# observation space and the number of possible actions
# Create our environment

if 'env' in locals(): env.close()

env = gym.make('Boxing-v4')
#env = retro.make(game='SpaceInvaders-Atari2600')
#env = retro.make(game='Boxing-Atari2600')
#observation_dimensions = 14*n_frames # 14 objects
observation_dimensions = 122880
num_actions = env.action_space.n
possible_actions = np.array(np.identity(num_actions,dtype=int).tolist())


print("The size of our frame is: ", env.observation_space)

# Here we create an hot encoded version of our actions
# possible_actions = [[1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0]...]
possible_actions = np.array(np.identity(num_actions,dtype=int).tolist())


def mlp(x, sizes, activation=tf.tanh, output_activation=None):    
    # Build a feedforward neural network
    for size in sizes[:-1]:
        x = layers.Dense(units=size, activation=activation)(x)
    return layers.Dense(units=sizes[-1], activation=output_activation)(x)

# Train the policy by maxizing the PPO-Clip objective
@tf.function
def train_policy(
    observation_buffer, action_buffer, logprobability_buffer, advantage_buffer
):

    with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
        ratio = tf.exp(
            logprobabilities(actor(observation_buffer), action_buffer)
            - logprobability_buffer
        )
        min_advantage = tf.where(
            advantage_buffer > 0,
            (1 + clip_ratio) * advantage_buffer,
            (1 - clip_ratio) * advantage_buffer,
        )

        policy_loss = -tf.reduce_mean(
            tf.minimum(ratio * advantage_buffer, min_advantage)
        )
    policy_grads = tape.gradient(policy_loss, actor.trainable_variables)
    policy_optimizer.apply_gradients(zip(policy_grads, actor.trainable_variables))

    kl = tf.reduce_mean(
        logprobability_buffer
        - logprobabilities(actor(observation_buffer), action_buffer)
    )
    kl = tf.reduce_sum(kl)
    return kl


# Train the value function by regression on mean-squared error
@tf.function
def train_value_function(observation_buffer, return_buffer):
    with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
        value_loss = tf.reduce_mean((return_buffer - critic(observation_buffer)) ** 2)
    value_grads = tape.gradient(value_loss, critic.trainable_variables)
    value_optimizer.apply_gradients(zip(value_grads, critic.trainable_variables))

# Initialize the actor and the critic as keras models
observation_input = keras.Input(shape=(observation_dimensions,), dtype=tf.float32)
logits = mlp(observation_input, list(hidden_sizes) + [num_actions], tf.tanh, None)
actor = keras.Model(inputs=observation_input, outputs=logits)
value = tf.squeeze(
    mlp(observation_input, list(hidden_sizes) + [1], tf.tanh, None), axis=1
)
critic = keras.Model(inputs=observation_input, outputs=value)

# Initialize the policy and the value function optimizers
policy_optimizer = keras.optimizers.Adam(learning_rate=policy_learning_rate)
value_optimizer = keras.optimizers.Adam(learning_rate=value_function_learning_rate)

The size of our frame is:  Box(0, 255, (210, 160, 3), uint8)


In [None]:
retro.data.list_games()

In [70]:
def discounted_cumulative_sums(x, discount):
    # Discounted cumulative sums of vectors for computing rewards-to-go and advantage estimates
    return scipy.signal.lfilter([1], [1, float(-discount)], x[::-1], axis=0)[::-1]



def logprobabilities(logits, a):
    # Compute the log-probabilities of taking actions a by using the logits (i.e. the output of the actor)
    logprobabilities_all = tf.nn.log_softmax(logits)
    logprobability = tf.reduce_sum(
        tf.one_hot(a, num_actions) * logprobabilities_all, axis=1
    )
    return logprobability


# Sample action from actor
@tf.function
def sample_action(observation):
    logits = actor(observation)
    action = tf.squeeze(tf.random.categorical(logits, 1), axis=1)
    return logits, action

## Extractor de objetos

La función `image2state` es la encargada de convertir un estado (imagen) en un vector de posiciones de objetos.

![image](https://i.imgur.com/fYygQNe.png)

## Función para "stackear" frames

Para que el agente entienda la velocidad y movimiento de los objetos, consideramos que un estado considera un **conjunto de observaciones** y no sólo la última retornada por el ambiente.

La función `stacked_state` mantiene un estado compuesto por los últimos `n_frames=4` estados. Este será el estado que observe el agente.

In [99]:
from collections import deque

stacked_frames = None

def stacked_state(state, is_new_episode):  
    global stacked_frames
    if is_new_episode or stacked_frames==None:

        # Clear our stacked_frames
        stacked_frames = deque([np.zeros((observation_dimensions), dtype=np.int) for i in range(n_frames)], maxlen=n_frames)  
        # Because we're in a new episode, copy the same frame 4x
        for i in range(n_frames):
          stacked_frames.append(state)
        
        # Stack the state
        stacked_state = np.stack(stacked_frames, axis=2)
        
    else:
        # Append frame to deque, automatically removes the oldest frame
        if state is not None: stacked_frames.append(state)
        else: stacked_frames.append(np.zeros((observation_dimensions), dtype=np.int))

        # Build the stacked state (first dimension specifies different frames)
        stacked_state = np.stack(stacked_frames, axis=2) 
    
    return stacked_state

## Funciones para interactuar con ambiente

Se encapsulan las funciones del ambiente para realizar un par de modificaciones:

- `env_step`: Permite repetir la acción por `iters` frames y retorna el stacked state resultante junto a la recompensa acumulada de los 5 frames. Esto permite que los episodios se realicen más rápidamente. Además, el episodio termina al perder la primera vida.

- `env_reset`: Además de resetear el ambiente, avanza 130 iteraciones "saltándose" la introducción del juego.

In [12]:
from keras.applications.vgg16 import VGG16
from keras.applications.mobilenet import MobileNet
from keras.applications.resnet import ResNet50

In [None]:
VGG_model = VGG16(weights='imagenet', include_top=False, input_shape=(210, 160, 3),classes=6)
for layer in VGG_model.layers:
	layer.trainable = False

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5


In [None]:
VGG_model = MobileNet(weights='imagenet', include_top=False, input_shape=(210, 160, 3), classes=2)
for layer in VGG_model.layers:
	layer.trainable = False



In [98]:
VGG_model = MobileNet(weights='imagenet', include_top=False, input_shape=(210, 160, 3))
for layer in VGG_model.layers:
	layer.trainable = False



In [13]:
VGG_model = ResNet50(weights='imagenet', include_top=False, input_shape=(210, 160, 3), classes=6)
for layer in VGG_model.layers:
	layer.trainable = False

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5


In [102]:
def env_reset(env):
  env.reset()
  state, reward, done, _ = env_step(env, 0, iters=130, new_episode=True) #starting
  return state

images = None
def env_step(env, action, iters=5, new_episode=False, show=False):
  reward=0.
  action = possible_actions[action][action]
  if iters==1:
    obs, r, done, info = env.step(action)
    obs=obs.reshape((1,210,160,3))
    next_state = VGG_model.predict(obs)
    return next_state, r, done, _
  for i in range(iters):
    obs, r, done, info = env.step(action)
    reward+=r
    if done: break
  if(show): images.append(obs)
  obs=obs.reshape((1,210,160,3))
  next_state = VGG_model.predict(obs) ## arreglo de posiciones
  
  state = stacked_state(next_state,new_episode)

  return state, reward, done, _

In [39]:
env.reset()
state, reward, done, info = env_step(env, 0, iters=130, new_episode=True)

In [None]:
print(done)
print(info)

False



## Entrenamiento (AC3)

El entrenamiento consiste en lo siguiente:

- Se recolectan datos para entrenar modelos
<estado, acción, recompensa>. Estos datos son guardados en el buffer (línea 29)

- Para la recolección de datos, se sigue política softmax usando predicciones del actor (línea 17).

- Al terminar un episodio, se calculan retornos y advantages para cada estado. Advantages son calculados usando crítico (función `finish_trajectory`, línea 38).

- Se ajustan los modelos (líneas 55-66):
  
  - Crítico predice retorno esperado a partir de cada estado. 
  - Actor predice advantages para cada estado-acción.

- Se repite todo el proceso.

![image](https://i.imgur.com/rd5tda1.png)

In [104]:
# Initialize the buffer
buffer = Buffer(observation_dimensions, steps_per_epoch)

# Initialize the observation, episode return and episode length
observation, episode_return, episode_length = env_reset(env), 0, 0

# Iterate over the number of epochs
for epoch in range(epochs):
    # Initialize the sum of the returns, lengths and number of episodes for each epoch
    sum_return = 0;   sum_length = 0;   num_episodes = 0;   episode_max_return = 0

    # Iterate over the steps of each epoch
    for t in range(steps_per_epoch):
        # Get the logits, action, and take one step in the environment
        observation = observation.reshape(1, -1)
        
        logits, action = sample_action(observation)
        observation_new, reward, done, _ = env_step(env, action[0].numpy(), iters=5)
        episode_return += reward
        episode_length += 1

        # Get the value and log-probability of the action
        value_t = critic(observation)
        logprobability_t = logprobabilities(logits, action)

        # Store obs, act, rew, v_t, logp_pi_t
        buffer.store(observation, action, reward, value_t, logprobability_t)

        # Update the observation
        observation = observation_new

        # Finish trajectory if reached to a terminal state
        terminal = done
        if terminal or (t == steps_per_epoch - 1):
            last_value = 0 if done else critic(observation.reshape(1, -1))
            buffer.finish_trajectory(last_value)
            sum_return += episode_return
            sum_length += episode_length
            num_episodes += 1
            if episode_return > episode_max_return: episode_max_return=episode_return
            observation, episode_return, episode_length = env_reset(env), 0, 0 
            
    # Get values from the buffer
    (
        observation_buffer,
        action_buffer,
        advantage_buffer,
        return_buffer,
        logprobability_buffer,
    ) = buffer.get()

    # Update the policy and implement early stopping using KL divergence
    for _ in range(train_policy_iterations):
        kl = train_policy(
            observation_buffer, action_buffer, logprobability_buffer, advantage_buffer
        )
        if kl > 1.5 * target_kl:
            # Early Stopping
            break

    # Update the value function
    for _ in range(train_value_iterations):
        train_value_function(observation_buffer, return_buffer)

    # Print mean return and length for each epoch
    print(
        f" Epoch: {epoch + 1}. MeanReturn: {sum_return / num_episodes}. MeanLength: {sum_length / num_episodes}. MaxReturn: {episode_max_return}"
    )

 Epoch: 1. MeanReturn: -22.5. MeanLength: 250.0. MaxReturn: 0
 Epoch: 2. MeanReturn: -22.5. MeanLength: 250.0. MaxReturn: 0
 Epoch: 3. MeanReturn: -22.5. MeanLength: 250.0. MaxReturn: 0
 Epoch: 4. MeanReturn: -22.5. MeanLength: 250.0. MaxReturn: 0
 Epoch: 5. MeanReturn: -22.5. MeanLength: 250.0. MaxReturn: 0
 Epoch: 6. MeanReturn: -22.5. MeanLength: 250.0. MaxReturn: 0
 Epoch: 7. MeanReturn: -22.5. MeanLength: 250.0. MaxReturn: 0
 Epoch: 8. MeanReturn: -22.5. MeanLength: 250.0. MaxReturn: 0
 Epoch: 9. MeanReturn: -22.5. MeanLength: 250.0. MaxReturn: 0
 Epoch: 10. MeanReturn: -22.5. MeanLength: 250.0. MaxReturn: 0


### Simulación

In [107]:

episode_return=0.0
i=0
while episode_return < 500 and i<10:
  observation, episode_return, episode_length = env_reset(env), 0, 0
  images = []
  for t in range(steps_per_epoch):
      # Get the logits, action, and take one step in the environment
      observation = observation.reshape(1, -1)
      
      logits, action = sample_action(observation)
      observation_new, reward, done, _ = env_step(env, action[0].numpy(), iters=5, show=True)

      episode_return += reward

      # Update the observation
      observation = observation_new

      # Finish trajectory if reached to a terminal state
      if done:
        observation = env_reset(env)
        print(episode_return)
        break
  i+=1
episode_return

-30.0
-30.0
-30.0
-30.0
-30.0
-30.0
-30.0
-30.0
-30.0
-30.0


-30.0

### Creación del video

In [20]:
# Librearías requeridas
!apt-get install x11-utils > /dev/null 2>&1 
!pip install pyglet > /dev/null 2>&1 
!apt-get install -y xvfb python-opengl > /dev/null 2>&1
!pip install gym pyvirtualdisplay > /dev/null 2>&1

##Space invader 10 epocas y 1000 iteraciones Mobile Net

In [29]:
import cv2
from moviepy.editor import *

def create_video(images, filename):
  res=(160,210) #resulotion
  out = cv2.VideoWriter(filename,cv2.VideoWriter_fourcc('M','J','P','G'), 40.0, res)
  for image in images:
      out.write(image)

  out.release()
  return VideoFileClip(filename)

clip=create_video(images, "tmp.avi")
clip.ipython_display(width=400)

100%|██████████| 862/862 [00:00<00:00, 1270.12it/s]


## Boxing v4 10 epocas y 1000 ireraciones Mobile Net

In [48]:
import cv2
from moviepy.editor import *

def create_video(images, filename):
  res=(160,210) #resulotion
  out = cv2.VideoWriter(filename,cv2.VideoWriter_fourcc('M','J','P','G'), 10, res)
  for image in images:
      out.write(image)

  out.release()
  return VideoFileClip(filename)

clip=create_video(images, "tmp.avi")
clip.ipython_display(width=400)

100%|█████████▉| 332/333 [00:00<00:00, 751.77it/s]


Con el cambio de stack frames

In [109]:
import cv2
from moviepy.editor import *

def create_video(images, filename):
  res=(160,210) #resulotion
  out = cv2.VideoWriter(filename,cv2.VideoWriter_fourcc('M','J','P','G'), 20, res)
  for image in images:
      out.write(image)

  out.release()
  return VideoFileClip(filename)

clip=create_video(images, "tmp.avi")
clip.ipython_display(width=400)

100%|█████████▉| 332/333 [00:00<00:00, 1284.02it/s]


Fallo por como funciona el stacking de frames

#Modelo con convoluciones

Step para convoluciones

In [110]:
def env_reset(env):
  env.reset()
  state, reward, done, _ = env_step(env, 0, iters=130, new_episode=True) #starting
  return state

images = None
def env_step(env, action, iters=5, new_episode=False, show=False):
  reward=0.
  action = possible_actions[action][action]
  for i in range(iters):
    obs, r, done, info = env.step(action)
    reward+=r
    if done: break
  if(show): images.append(obs)
  
  state = stacked_state(obs,new_episode)

  return state, reward, done, _

Buffer para convolucion

In [153]:
class Buffer:
    # Buffer for storing trajectories
    def __init__(self, observation_dimensions, size, gamma=0.99, lam=0.95):
        # Buffer initialization
        self.observation_buffer = np.zeros(
            (size,4,observation_dimensions[0],observation_dimensions[1],observation_dimensions[2]), dtype=np.float32
        )
        self.action_buffer = np.zeros(size, dtype=np.int32)
        self.advantage_buffer = np.zeros(size, dtype=np.float32)
        self.reward_buffer = np.zeros(size, dtype=np.float32)
        self.return_buffer = np.zeros(size, dtype=np.float32)
        self.value_buffer = np.zeros(size, dtype=np.float32)
        self.logprobability_buffer = np.zeros(size, dtype=np.float32)
        self.gamma, self.lam = gamma, lam
        self.pointer, self.trajectory_start_index = 0, 0

    def store(self, observation, action, reward, value, logprobability):
        # Append one step of agent-environment interaction
        self.observation_buffer[self.pointer] = observation
        self.action_buffer[self.pointer] = action
        self.reward_buffer[self.pointer] = reward
        self.value_buffer[self.pointer] = value
        self.logprobability_buffer[self.pointer] = logprobability
        self.pointer += 1

    def finish_trajectory(self, last_value=0):
        # Finish the trajectory by computing advantage estimates and rewards-to-go
        path_slice = slice(self.trajectory_start_index, self.pointer)
        rewards = np.append(self.reward_buffer[path_slice], last_value)
        values = np.append(self.value_buffer[path_slice], last_value)

        deltas = rewards[:-1] + self.gamma * values[1:] - values[:-1]

        self.advantage_buffer[path_slice] = discounted_cumulative_sums(
            deltas, self.gamma * self.lam
        )
        self.return_buffer[path_slice] = discounted_cumulative_sums(
            rewards, self.gamma
        )[:-1]

        self.trajectory_start_index = self.pointer

    def get(self):
        # Get all data of the buffer and normalize the advantages
        self.pointer, self.trajectory_start_index = 0, 0
        advantage_mean, advantage_std = (
            np.mean(self.advantage_buffer),
            np.std(self.advantage_buffer),
        )
        self.advantage_buffer = (self.advantage_buffer - advantage_mean) / advantage_std
        return (
            self.observation_buffer,
            self.action_buffer,
            self.advantage_buffer,
            self.return_buffer,
            self.logprobability_buffer,
        )

In [141]:
def discounted_cumulative_sums(x, discount):
    # Discounted cumulative sums of vectors for computing rewards-to-go and advantage estimates
    return scipy.signal.lfilter([1], [1, float(-discount)], x[::-1], axis=0)[::-1]



def logprobabilities(logits, a):
    # Compute the log-probabilities of taking actions a by using the logits (i.e. the output of the actor)
    logprobabilities_all = tf.nn.log_softmax(logits)
    logprobability = tf.reduce_sum(
        tf.one_hot(a, num_actions) * logprobabilities_all, axis=1
    )
    return logprobability


# Sample action from actor
@tf.function
def sample_action(observation):
    logits = actor(observation)
    action = tf.squeeze(tf.random.categorical(logits, 1), axis=1)
    return logits, action

stack para Convoluciones

In [125]:
from collections import deque

stacked_frames = None

def stacked_state(state, is_new_episode):  
    global stacked_frames
    if is_new_episode or stacked_frames==None:

        # Clear our stacked_frames
        stacked_frames = deque([np.zeros((observation_dimensions), dtype=np.int) for i in range(n_frames)], maxlen=n_frames)  
        # Because we're in a new episode, copy the same frame 4x
        for i in range(n_frames):
          stacked_frames.append(state)
        
        # Stack the state
        stacked_state = np.stack(stacked_frames, axis=0)
        
    else:
        # Append frame to deque, automatically removes the oldest frame
        if state is not None: stacked_frames.append(state)
        else: stacked_frames.append(np.zeros(observation_dimensions, dtype=np.int))

        # Build the stacked state (first dimension specifies different frames)
        stacked_state = np.stack(stacked_frames, axis=0) 
    
    return stacked_state

Convolution version

Este es un bloque de convolucion basado en Resnet

In [121]:
def conv_block(X,f,d=0.1):
  c = tf.keras.layers.Conv2D(f[0], (1, 1), activation='relu', kernel_initializer='he_normal', padding='same')(X)
  c = tf.keras.layers.BatchNormalization(axis=3)(c)
  c = tf.keras.layers.Dropout(d)(c)
  c = tf.keras.layers.Conv2D(f[1], (3, 3), activation='relu', kernel_initializer='he_normal', padding='same')(c)
  c = tf.keras.layers.BatchNormalization(axis=3)(c)
  c = tf.keras.layers.Dropout(d)(c)
  c = tf.keras.layers.Conv2D(f[2], (1, 1), kernel_initializer='he_normal', padding='same')(c)
  c = tf.keras.layers.BatchNormalization(axis=3)(c)
  s = tf.keras.layers.Conv2D(f[2], (1, 1), kernel_initializer='he_normal', padding='same')(X)
  s = tf.keras.layers.BatchNormalization(axis=3)(s)
  c = tf.keras.layers.Add()([s,c])
  c = tf.keras.layers.ReLU()(c)
  return c

In [155]:
import retro
# Initialize the environment and get the dimensionality of the
# observation space and the number of possible actions
# Create our environment

if 'env' in locals(): env.close()

env = gym.make('Boxing-v4')
#env = retro.make(game='SpaceInvaders-Atari2600')
#env = retro.make(game='Boxing-Atari2600')
#observation_dimensions = 14*n_frames # 14 objects
observation_dimensions = (210, 160, 3)
num_actions = env.action_space.n
possible_actions = np.array(np.identity(num_actions,dtype=int).tolist())


print("The size of our frame is: ", env.observation_space)

# Here we create an hot encoded version of our actions
# possible_actions = [[1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0]...]
possible_actions = np.array(np.identity(num_actions,dtype=int).tolist())


def mlp(x, sizes, activation=tf.tanh, output_activation=None):    
    # Build a feedforward neural network
    for size in sizes[:-1]:
        x = layers.Dense(units=size, activation=activation)(x)
    return layers.Dense(units=sizes[-1], activation=output_activation)(x)

# Train the policy by maxizing the PPO-Clip objective
@tf.function
def train_policy(
    observation_buffer, action_buffer, logprobability_buffer, advantage_buffer
):

    with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
        ratio = tf.exp(
            logprobabilities(actor(observation_buffer), action_buffer)
            - logprobability_buffer
        )
        min_advantage = tf.where(
            advantage_buffer > 0,
            (1 + clip_ratio) * advantage_buffer,
            (1 - clip_ratio) * advantage_buffer,
        )

        policy_loss = -tf.reduce_mean(
            tf.minimum(ratio * advantage_buffer, min_advantage)
        )
    policy_grads = tape.gradient(policy_loss, actor.trainable_variables)
    policy_optimizer.apply_gradients(zip(policy_grads, actor.trainable_variables))

    kl = tf.reduce_mean(
        logprobability_buffer
        - logprobabilities(actor(observation_buffer), action_buffer)
    )
    kl = tf.reduce_sum(kl)
    return kl


# Train the value function by regression on mean-squared error
@tf.function
def train_value_function(observation_buffer, return_buffer):
    with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
        value_loss = tf.reduce_mean((return_buffer - critic(observation_buffer)) ** 2)
    value_grads = tape.gradient(value_loss, critic.trainable_variables)
    value_optimizer.apply_gradients(zip(value_grads, critic.trainable_variables))

# Topologia de convolucion basada en parte por resnet50 y Unet

observation_input = keras.Input(shape=observation_dimensions, dtype=tf.float32)
c1 = conv_block(observation_input,(64,64,256))
c1 = tf.keras.layers.MaxPooling2D((2, 2))(c1)
c2 = conv_block(c1,(128,128,512),d=0.2)
c2 = tf.keras.layers.MaxPooling2D((2, 2))(c2)
c3 = conv_block(c2,(256,256,1024),d=0.3)
c3 = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(c3)
c3 = tf.keras.layers.GlobalAveragePooling2D()(c3)


# Initialize the actor and the critic as keras models
logits = mlp(c3, list(hidden_sizes) + [num_actions], tf.tanh, None)
actor = keras.Model(inputs=observation_input, outputs=logits)
value = mlp(c3, list(hidden_sizes) + [1], tf.tanh, None)

critic = keras.Model(inputs=observation_input, outputs=value)

# Initialize the policy and the value function optimizers
policy_optimizer = keras.optimizers.Adam(learning_rate=policy_learning_rate)
value_optimizer = keras.optimizers.Adam(learning_rate=value_function_learning_rate)

The size of our frame is:  Box(0, 255, (210, 160, 3), uint8)


In [142]:
actor.summary()

Model: "model_27"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_23 (InputLayer)          [(None, 210, 160, 3  0           []                               
                                )]                                                                
                                                                                                  
 conv2d_73 (Conv2D)             (None, 210, 160, 64  256         ['input_23[0][0]']               
                                )                                                                 
                                                                                                  
 batch_normalization_72 (BatchN  (None, 210, 160, 64  256        ['conv2d_73[0][0]']              
 ormalization)                  )                                                          

In [158]:
# Initialize the buffer
buffer = Buffer(observation_dimensions, steps_per_epoch)

# Initialize the observation, episode return and episode length
observation, episode_return, episode_length = env_reset(env), 0, 0

# Iterate over the number of epochs
for epoch in range(epochs):
    # Initialize the sum of the returns, lengths and number of episodes for each epoch
    sum_return = 0;   sum_length = 0;   num_episodes = 0;   episode_max_return = 0

    # Iterate over the steps of each epoch
    for t in range(steps_per_epoch):
        # Get the logits, action, and take one step in the environment
        print(observation.shape)
        
        logits, action = sample_action(observation)
        observation_new, reward, done, _ = env_step(env, action[0].numpy(), iters=5)
        episode_return += reward
        episode_length += 1

        # Get the value and log-probability of the action
        value_t = critic(observation)
        logprobability_t = logprobabilities(logits, action)
        # Store obs, act, rew, v_t, logp_pi_t
        buffer.store(observation, action, reward, value_t, logprobability_t)

        # Update the observation
        observation = observation_new

        # Finish trajectory if reached to a terminal state
        terminal = done
        if terminal or (t == steps_per_epoch - 1):
            last_value = 0 if done else critic(observation.reshape(1, -1))
            buffer.finish_trajectory(last_value)
            sum_return += episode_return
            sum_length += episode_length
            num_episodes += 1
            if episode_return > episode_max_return: episode_max_return=episode_return
            observation, episode_return, episode_length = env_reset(env), 0, 0 
            
    # Get values from the buffer
    (
        observation_buffer,
        action_buffer,
        advantage_buffer,
        return_buffer,
        logprobability_buffer,
    ) = buffer.get()

    # Update the policy and implement early stopping using KL divergence
    for _ in range(train_policy_iterations):
        kl = train_policy(
            observation_buffer, action_buffer, logprobability_buffer, advantage_buffer
        )
        if kl > 1.5 * target_kl:
            # Early Stopping
            break

    # Update the value function
    for _ in range(train_value_iterations):
        train_value_function(observation_buffer, return_buffer)

    # Print mean return and length for each epoch
    print(
        f" Epoch: {epoch + 1}. MeanReturn: {sum_return / num_episodes}. MeanLength: {sum_length / num_episodes}. MaxReturn: {episode_max_return}"
    )

(4, 210, 160, 3)


ValueError: ignored

Aqui al final no tubimos tiempo de arreglas los problemas de formato que trajieron el cambiar el modo de procesar el input del modelo.

### Simulación

In [None]:
episode_return=0.0
i=0
while episode_return < 500 and i<10:
  observation, episode_return, episode_length = env_reset(env), 0, 0
  images = []
  for t in range(steps_per_epoch):
      # Get the logits, action, and take one step in the environment
      observation = observation.reshape(1, -1)
      
      logits, action = sample_action(observation)
      observation_new, reward, done, _ = env_step(env, action[0].numpy(), iters=1, show=True)

      episode_return += reward

      # Update the observation
      observation = observation_new

      # Finish trajectory if reached to a terminal state
      if done:
        observation = env_reset(env)
        print(episode_return)
        break
  i+=1
episode_return

KeyboardInterrupt: ignored