In [1]:
import gymnasium as gym
import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa
from IPython.display import clear_output
import matplotlib.pyplot as plt

from keras.callbacks import TensorBoard  # to visualize the training process
import os
import datetime
import pygame

physical_devices = tf.config.list_physical_devices('GPU') 
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, True)





TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 

 The versions of TensorFlow you are currently using is 2.15.0 and is not supported. 
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version. 
You can find the compatibility matrix in TensorFlow Addon's readme:
https://github.com/tensorflow/addons


## Proximal Policy Optimization


PPO is a policy-gradient actor-critic algorithm. The policy model, the **actor** network, produces a stochastic policy: it maps the state to a probability distribution over the set of possible actions. The **critic** network approximates the value function, from which the advantage is calculated:

$$
A_\Phi (s_t, a_t) = q_\Phi (s_t,a_t) - v_\Phi (s_t) = R_t + \gamma v_{\Phi'} (s_{t+1}) - v_\Phi (s_t)
$$
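
As a minimal sketch of the advantage formula above, the following NumPy snippet computes the one-step TD advantage for a small batch. The function name `td_advantage` and the `dones` mask (which drops the bootstrap term at terminal states) are assumptions made for illustration; they are not taken from the project's `PPOAgentDiscrete` code.

```python
import numpy as np

def td_advantage(rewards, values, next_values, dones, gamma=0.99):
    """One-step TD advantage: A_t = R_t + gamma * v'(s_{t+1}) - v(s_t).

    `next_values` are assumed to come from the frozen target critic.
    The `dones` mask is an illustrative assumption: no bootstrapping
    from the next state once an episode has terminated.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    next_values = np.asarray(next_values, dtype=np.float64)
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)
    td_target = rewards + gamma * not_done * next_values
    return td_target - values

# Two transitions; the second one ends the episode.
adv = td_advantage(
    rewards=[-1.0, -1.0],
    values=[-5.0, -4.0],
    next_values=[-4.0, 0.0],
    dones=[False, True],
)
print(adv)
```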

The critic $v_\Phi$ is trained in the same manner as the DQN model and the critic of DDPG: with TD learning and a "frozen", periodically updated target critic network $v_{\Phi'}$. Instead of approximating a q-value, it approximates the state value.
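
The critic update described above can be sketched as follows. This is only an illustration under stated assumptions: a mean-squared TD error as the critic loss and a hard copy of the weights as the periodic update. The names `critic_td_loss` and `sync_frozen` are hypothetical, and terminal-state handling is left out for brevity.

```python
import numpy as np

def critic_td_loss(values, rewards, next_values_frozen, gamma=0.99):
    """Mean-squared TD error between v(s_t) and R_t + gamma * v'(s_{t+1}).

    `next_values_frozen` are predictions of the frozen target critic and
    are treated as constants (no gradient would flow through them).
    """
    values = np.asarray(values, dtype=np.float64)
    rewards = np.asarray(rewards, dtype=np.float64)
    next_values_frozen = np.asarray(next_values_frozen, dtype=np.float64)
    td_target = rewards + gamma * next_values_frozen
    return float(np.mean((td_target - values) ** 2))

def sync_frozen(online_weights):
    """Periodic update: copy the online weights into the frozen network."""
    return [np.array(w, copy=True) for w in online_weights]

loss = critic_td_loss(values=[1.0, 2.0], rewards=[0.0, 1.0],
                      next_values_frozen=[2.0, 0.0], gamma=0.5)

online = [np.array([1.0, 2.0])]
frozen = sync_frozen(online)
online[0][0] = 9.0   # further training of the online net leaves the frozen copy untouched
```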

To train the actor, PPO uses the ratio of two policies:
- the current policy $\pi_\Theta$, which is being learned
- a baseline policy $\pi_{\Theta'}$, an earlier version of the policy

$$
r^t (\Theta)=r_\Theta (s_t,a_t) = \frac{\pi_\Theta (a_t | s_t)}{\pi_{\Theta'} (a_t | s_t)}
$$

It is the ratio of the probability of selecting $a_t$ under $\pi_\Theta$ to the probability of selecting the same action under $\pi_{\Theta'}$.
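
A small sketch of this ratio for a discrete action space with three actions, as in `MountainCar-v0`. The probability tables are made-up example values and `policy_ratio` is a hypothetical helper; going through log-probabilities is a common choice for numerical stability, not something the text above prescribes.

```python
import numpy as np

def policy_ratio(new_probs, old_probs, actions):
    """r = pi_new(a_t | s_t) / pi_old(a_t | s_t) for each sample in the batch."""
    idx = np.arange(len(actions))
    log_new = np.log(new_probs[idx, actions])   # log pi_Theta(a_t | s_t)
    log_old = np.log(old_probs[idx, actions])   # log pi_Theta'(a_t | s_t)
    return np.exp(log_new - log_old)

# Action probabilities for two states (rows) and three actions (columns).
new_probs = np.array([[0.5, 0.3, 0.2],
                      [0.1, 0.6, 0.3]])
old_probs = np.array([[0.4, 0.4, 0.2],
                      [0.2, 0.6, 0.2]])
actions = np.array([0, 2])   # actions that were actually taken

ratio = policy_ratio(new_probs, old_probs, actions)
print(ratio)
```

A ratio above 1 means the current policy selects the taken action more often than the baseline policy did; a ratio below 1 means it selects it less often.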

When multiplied by the approximated advantage, calculated using the critic network, the ratio can be used as the objective function, which is maximized with stochastic gradient ascent (SGA). Negated, it serves as the loss to minimize:

$$
loss_{actor} = - r_\Theta (s_t, a_t) A_\Phi (s_t, a_t)
$$

This works because:
- when the advantage is positive, meaning that selecting the action would increase the value, the probability of selecting this action increases
- when the advantage is negative, meaning that selecting the action would decrease the value, the probability of selecting this action decreases
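
A short sketch of why this holds, assuming that the advantage and the baseline policy are treated as constants with respect to $\Theta$:

$$
\nabla_\Theta \left[ r_\Theta (s_t, a_t) A_\Phi (s_t, a_t) \right] = \frac{A_\Phi (s_t, a_t)}{\pi_{\Theta'} (a_t | s_t)} \nabla_\Theta \pi_\Theta (a_t | s_t)
$$

Since $\pi_{\Theta'} (a_t | s_t) > 0$, a gradient ascent step moves $\pi_\Theta (a_t | s_t)$ in the direction given by the sign of the advantage.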

Instead of using the unclipped loss directly, PPO limits the size of the policy optimization step to stabilize training. The loss is extended in a pessimistic way by clipping the ratio:

$$
loss_{actor} = - \min [r_\Theta (s_t, a_t) A_\Phi (s_t, a_t), \text{clip}(r_\Theta (s_t, a_t), 1-\epsilon, 1+\epsilon) A_\Phi (s_t, a_t)]
$$
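
The clipped loss can be sketched in NumPy as shown below. Two points are assumptions of this sketch: the result is negated so that an optimizer minimizing the loss maximizes the surrogate objective (consistent with the sign of the unclipped loss above), and the per-sample values are averaged over the batch. The function name `clipped_actor_loss` is hypothetical.

```python
import numpy as np

def clipped_actor_loss(ratio, advantage, epsilon=0.1):
    """Pessimistic (clipped) PPO surrogate, negated for minimization."""
    ratio = np.asarray(ratio, dtype=np.float64)
    advantage = np.asarray(advantage, dtype=np.float64)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # The element-wise minimum keeps the more pessimistic of the two terms.
    return float(-np.mean(np.minimum(unclipped, clipped)))

ratio = np.array([1.25, 0.5, 1.0])
advantage = np.array([2.0, -1.0, 0.5])
loss = clipped_actor_loss(ratio, advantage, epsilon=0.1)
print(loss)
```

In the first sample the ratio 1.25 is clipped to 1.1, so a positive advantage cannot reward moving the policy arbitrarily far. In the second sample the clipped term (0.9 times the negative advantage) is the smaller one, so the minimum again selects it.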

PPO uses two main models. The actor network learns the stochastic policy and maps the state to a probability distribution over the set of possible actions. The critic network learns the value function and maps the state to a scalar.

## Add a Connection to TensorBoard (online visualization)

In [10]:
# Directory for log data and model data; the model files are written further below
jetzt = datetime.datetime.now()
datum_uhrzeit = jetzt.strftime("%Y%m%d_%H%M%S")
savedir = f'model\\MountainCar_discret_{datum_uhrzeit}'
os.makedirs('model', exist_ok=True)
os.makedirs(savedir, exist_ok=True)

In [11]:
log_dir1 = f"{savedir}\\log"
os.makedirs(log_dir1, exist_ok=True)

if os.path.exists(log_dir1):
    print(f"The directory {log_dir1} exists.")
    absolute_path = os.path.abspath(log_dir1)
    print(absolute_path)
else:
    print(f"The directory {log_dir1} does not exist.")


The directory model\MountainCar_discret_20240108_204123\log exists.
c:\Users\Mathias\Documents\StudiumMaster\Semester1\Roboterprogrammierung_Hein\Projektarbeit_PPO\02_Code\model\MountainCar_discret_20240108_204123\log
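
The two cells above hard-code the Windows path separator. A portable variant could look like the following sketch, which builds the same directory layout with `os.path.join`. The helper `make_run_dirs` and its parameters are hypothetical and not part of the project code.

```python
import datetime
import os

def make_run_dirs(base="model", prefix="MountainCar_discret", now=None):
    """Create <base>/<prefix>_<timestamp>/log and return (run_dir, log_dir)."""
    now = now or datetime.datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    run_dir = os.path.join(base, f"{prefix}_{stamp}")
    log_dir = os.path.join(run_dir, "log")
    # makedirs creates all missing parent directories in one call.
    os.makedirs(log_dir, exist_ok=True)
    return run_dir, log_dir
```

It could be used as `savedir, log_dir1 = make_run_dirs()`. TensorBoard can then be pointed at the printed log directory with `tensorboard --logdir`.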


## Parameters / Hyperparameters

In [12]:
# Parameters for the actor and critic networks
actor_learning_rate = 0.00025   # learning rate for the actor
critic_learning_rate = 0.001    # learning rate for the critic

# Parameters for the agent
gamma = 0.99                    # discount factor
epsilon = 0.1                   # clip range for the actor loss function

# Parameters for training
epochs = 1                      # number of learning iterations
n_rollouts = 1                  # number of episodes/rollouts to collect experience (alternative value: 5)
batch_size = 8                  # number of samples per learning step
learn_steps = 1                 # number of learning steps per epoch (alternative value: 16)

## Initialize the environment

In [13]:
from CustomMtnCarEnvironments import CustomMountainCarEnv_acceleration

env = gym.make('MountainCar-v0', render_mode='rgb_array')  # 'human' is for the pygame GUI -> very laggy!
env = CustomMountainCarEnv_acceleration(env)

## Initialize the PPO agent

In [14]:
from PPOAgentDiscrete import PPOAgentDiscrete as PPOAgent
agent = PPOAgent(env.action_space, env.observation_space, gamma, epsilon, actor_learning_rate, critic_learning_rate)

## Train the PPO agent

In [15]:
from train_agent import training
training(env, agent, log_dir1, epochs, n_rollouts, batch_size, learn_steps, render=False)

start training


collecting experience in rollouts finished, start learning phase
update online nets, learn step 0 of 1 finished
update frozen nets, epoche 0 of 1 finished
===> epoch 1, total_timesteps 200, actor loss -0.03959915041923523, critic loss 3.139740467071533, avg_epoch_return 64.0, sum_epoch_terminations 0
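
The log output above suggests a schedule of rollout collection, learning steps on the online networks, and one update of the frozen networks per epoch. The following sketch only reproduces that order of events. It is inferred from the log messages and the hyperparameter comments, it is not the actual implementation of `train_agent.training`, and the name `training_schedule` is hypothetical.

```python
def training_schedule(epochs, n_rollouts, learn_steps):
    """Return the assumed order of events as a list of tuples."""
    events = []
    for epoch in range(epochs):
        # 1) collect experience with the current policy
        for rollout in range(n_rollouts):
            events.append(("rollout", epoch, rollout))
        # 2) update the online actor and critic on sampled batches
        for step in range(learn_steps):
            events.append(("learn", epoch, step))
        # 3) copy the online weights into the frozen networks
        events.append(("sync_frozen", epoch))
    return events

for event in training_schedule(epochs=1, n_rollouts=1, learn_steps=1):
    print(event)
```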


## Storing and loading models

In [16]:
# save the model to h5 format
filepath_actor = f"{savedir}\\actor.h5"
filepath_critic = f"{savedir}\\critic.h5"
agent.save_models(filepath_actor, filepath_critic)

In [17]:
# load the model from h5 format -> use a new agent in a new instance of the environment to prevent overwriting
load_env = gym.make("MountainCar-v0", render_mode='rgb_array')

load_agent = PPOAgent(load_env.action_space, load_env.observation_space)
load_agent._init_networks()

# filepath_actor = f"... .h5"
# filepath_critic = f"... .h5"

load_agent.load_models(filepath_actor, filepath_critic)

Model loaded sucessful


## Rendering with pygame

In [18]:
from render_GUI import render_GUI


# Set up the environment and load the trained agent from directory
render_env = gym.make('MountainCar-v0', render_mode = 'human')
render_agent = PPOAgent(render_env.action_space, render_env.observation_space)

# filepath_actor = f"... .h5"
# filepath_critic = f"... .h5"

# Call the function
render_GUI(render_env, render_agent, filepath_actor, filepath_critic)


Model loaded sucessful
Episode 0 finished
Episode 1 finished
Episode 2 finished
Closed Rendering sucessful
