This repository contains the code for the paper "NLBAC: A Neural ODE-based Algorithm for State-Wise Stable and Safe Reinforcement Learning through Augmented Lagrangian Method". The code is developed based on https://github.com/yemam3/Mod-RL-RCBF and https://github.com/LiqunZhao/A-Barrier-Lyapunov-Actor-Critic-Reinforcement-Learning-Approach-for-Safe-and-Stable-Control (the code for the paper "Stable and Safe Reinforcement Learning via a Barrier-Lyapunov Actor-Critic Approach").
This repository contains only the code, with clear comments, for the Neural ordinary differential equations-based Lyapunov Barrier Actor Critic (NLBAC) algorithm. For the other algorithms, please refer to:
- SAC-RCBF: https://github.com/yemam3/Mod-RL-RCBF
- MBPPO-Lagrangian: https://github.com/akjayant/mbppol
- LAC: https://github.com/hithmh/Actor-critic-with-stability-guarantee
- CPO, PPO-Lagrangian and TRPO-Lagrangian: https://github.com/openai/safety-starter-agents
Three environments, called Unicycle, SimulatedCars (Simulated Car Following) and Planar Vertical Take-Off and Landing (PVTOL), are provided in this repository. In Unicycle, a unicycle is required to arrive at the desired location, i.e., the destination, while avoiding collisions with obstacles. SimulatedCars involves a chain of five cars following each other on a straight road, where the goal is to control the acceleration of the fourth car in the chain. In PVTOL, a quadcopter is required to reach a destination while avoiding obstacles, keeping within a specified range along the Y-axis, and staying within a specific distance from a safety pilot along the X-axis. Detailed descriptions of the three environments can be found in the last part of this page, together with comparisons of modeling performance between neural ODEs and a conventional neural network. Interested readers can also explore the option of using their own customized environments; detailed instructions can be found below.
The experiments are run with PyTorch, and wandb (https://wandb.ai/site) is used to save the data and draw the graphs. To run the experiments, the required packages are listed below with the versions used in my conda environment:
- python: 3.6.13
- pytorch: 1.10.2
- numpy: 1.17.5
- wandb: 0.12.11
- gym: 0.15.7
- torchdiffeq: 0.2.3
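As a quick sanity check before running the experiments, the short sketch below prints the installed versions of these packages. It is only a convenience snippet, not part of the repository.

```python
# Sanity-check sketch: print the versions of the required packages.
import torch, numpy, gym, wandb, torchdiffeq

for mod in (torch, numpy, gym, wandb, torchdiffeq):
    print(f"{mod.__name__}: {getattr(mod, '__version__', 'unknown')}")

# The run commands below pass --cuda, so a CUDA-enabled build is expected.
print("CUDA available:", torch.cuda.is_available())
```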
First, instructions on how to run the code for the Unicycle, SimulatedCars (Simulated Car Following) and Planar Vertical Take-Off and Landing (PVTOL) environments are provided. Following that, instructions on applying the NLBAC framework to your own customized environment are given.
For Unicycle, you can follow the steps below to run the RL-training part directly, since a pre-trained model has been provided:
- Navigate to the directory `Neural-ordinary-differential-equations-based-Lyapunov-Barrier-Actor-Critic-NLBAC/NLBAC_Unicycle_RL_training/Unicycle_RL_training`
- Run the command `python main.py --env Unicycle --gamma_b 50 --max_episodes 200 --cuda --updates_per_step 2 --batch_size 128 --seed 0 --start_steps 1000`
Here are the results obtained on my machine:
The experiment in which a neural barrier certificate is trained and used is in the folder `neural_barrier_certificate`.
For SimulatedCars (Simulated Car Following), you can follow the steps below to run the RL-training part directly, since a pre-trained model has been provided:
- Navigate to the directory `Neural-ordinary-differential-equations-based-Lyapunov-Barrier-Actor-Critic-NLBAC/NLBAC_SimulatedCarsFollowing_RL_training/Simulated_Car_Following_RL_training`
- Run the command `python main.py --env SimulatedCars --gamma_b 0.5 --max_episodes 200 --cuda --updates_per_step 2 --batch_size 256 --seed 0 --start_steps 200`
Here are the results obtained on my machine:
For Planar Vertical Take-Off and Landing (PVTOL), you can follow the steps below to run the RL-training part directly, since a pre-trained model has been provided:
- Navigate to the directory `Neural-ordinary-differential-equations-based-Lyapunov-Barrier-Actor-Critic-NLBAC/NLBAC_pvtol_RL_training/Pvtol_RL_training`
- Run the command `python main.py --env Pvtol --gamma_b 0.8 --max_episodes 400 --cuda --updates_per_step 1 --batch_size 256 --seed 10 --start_steps 1000`
Here are the results obtained on my machine:
The experiment in which a neural barrier certificate is trained and used is in the folder `neural_barrier_certificate`.
To apply the NLBAC framework to your own customized environment, the whole process is similar:
- Copy the folder `Unicycle` and rename it after your customized environment, e.g. `Your_customized_environment`.
- Prepare your own customized environment and make some adjustments. The key point is the set of outputs of your own customized `env.step` function (see the sketch after this list). Besides `next_obs`, `reward`, `done` and `info`, which are commonly used in the RL literature, we also need:
  - `constraint`: the difference between the current state and the desired state, which is required to decrease. It is also used to approximate the Lyapunov network.
  - Some lists used as inputs of the Lyapunov network (if `obs` and `next_obs` are not used as inputs of the Lyapunov network directly). See the aforementioned environments for examples.
  - Other info, such as the number of safety violations and the value of the safety cost (usually used in algorithms like CPO, PPO-Lagrangian and TRPO-Lagrangian), and the barrier signal if a neural barrier certificate needs to be learned.
- Add the new customized environment in the file `build_env.py`, and change some `if` statements regarding `dynamics_mode` in `sac_cbf_clf.py`.
- Change the replay buffer, since the outputs of `env.step` have changed.
- Tune the hyperparameters, such as the batch size and the number of hidden states, if necessary.
- Change the input and output dimensions in `Neural-ordinary-differential-equations-based-Lyapunov-Barrier-Actor-Critic-NLBAC/Your_customized_environment/Your_customized_environment_RL_training/sac_cbf_clf/sac_cbf_clf.py`.
- Rewrite the functions `get_policy_loss_2` and `backup_get_policy_loss_2` in the file `Neural-ordinary-differential-equations-based-Lyapunov-Barrier-Actor-Critic-NLBAC/Your_customized_environment/Your_customized_environment_RL_training/sac_cbf_clf/sac_cbf_clf.py` to formulate your own CBF and CLF constraints (using either pre-defined CBFs or a neural barrier certificate).
- Navigate to the directory `Neural-ordinary-differential-equations-based-Lyapunov-Barrier-Actor-Critic-NLBAC/Your_customized_environment/Your_customized_environment_RL_training`.
- Change some contents in the file `main.py`, such as when to engage and when to release the backup controller, according to your customized environment.
- Run the command `python main.py --env Your_customized_environment --gamma_b 0.5 --max_episodes 200 --cuda --updates_per_step 2 --batch_size 256 --seed 0 --start_steps 200`. Change the arguments if necessary.
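As a reference for preparing the `env.step` outputs listed above, here is a minimal sketch of a customized environment. The class name, dynamics, and the exact `info` keys are illustrative assumptions; the actual return signature must match what `main.py` and the replay buffer in this repository expect.

```python
import numpy as np
import gym

class YourCustomizedEnv(gym.Env):
    """Illustrative environment; replace the dynamics and signals with your own."""

    def __init__(self):
        self.state = np.zeros(3)
        self.desired_state = np.array([1.0, 1.0, 0.0])

    def step(self, action):
        self.state = self._dynamics(self.state, np.asarray(action))
        next_obs = self.state.copy()
        reward = -float(np.linalg.norm(action))  # example: penalize control effort
        done = False
        # Difference between the current state and the desired state; it is
        # required to decrease and is used to approximate the Lyapunov network.
        constraint = float(np.linalg.norm(self.state - self.desired_state))
        info = {
            "num_safety_violation": 0,  # e.g. incremented when an obstacle is hit
            "safety_cost": 0.0,         # used by baselines such as CPO and PPO-Lagrangian
            "barrier_signal": 0.0,      # only needed when a neural barrier certificate is learned
        }
        return next_obs, reward, done, constraint, info

    def _dynamics(self, state, action):
        return state  # placeholder: plug in your own dynamics here
```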
Here we present comparisons of modeling performance between Neural ODEs, which model system dynamics, and a common neural network baseline (labeled as "Standard NN") that directly outputs the predicted next state. The ground truth is the output of the gym environment and is labeled as "gym".
Unicycle: In this environment, with the prior knowledge that the system is control-affine, we utilize two separate networks to represent the two terms of the control-affine dynamics. The mean squared error of the NODE-based model, computed by the `nn.MSELoss` function, is 0.0012, and the mean squared error of the common NN-based model computed in the same way is 1.1023.
Simulated Car Following: In this environment, lacking prior knowledge about whether the system is control-affine or not, we directly use a single network to represent the full dynamics. The mean squared error of the NODE-based model, computed by the `nn.MSELoss` function, is 0.3682, and the mean squared error of the common NN-based model computed in the same way is 1.5534.
Planar Vertical Take-Off and Landing (PVTOL): In this environment, with the prior knowledge that the system is control-affine, we utilize two separate networks to represent the two terms of the control-affine dynamics. The mean squared error of the NODE-based model, computed by the `nn.MSELoss` function, is 0.1258, and the mean squared error of the common NN-based model computed in the same way is 2.1180.
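To make this comparison concrete, below is a minimal sketch of the two model classes, assuming a control-affine system $\dot{x} = f(x) + g(x)u$. The network sizes, time step `dt`, and all names are illustrative, not the repository's actual code.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint

state_dim, act_dim, dt = 3, 2, 0.02

# NODE-based model: two networks for the control-affine terms f(x) and g(x).
f_net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, state_dim))
g_net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, state_dim * act_dim))

def node_predict(x, u):
    """Predict the next state by integrating x_dot = f(x) + g(x) u over one step."""
    def dynamics(t, x_t):
        gx = g_net(x_t).view(-1, state_dim, act_dim)
        return f_net(x_t) + torch.bmm(gx, u.unsqueeze(-1)).squeeze(-1)
    return odeint(dynamics, x, torch.tensor([0.0, dt]))[-1]

# "Standard NN" baseline: map (x, u) directly to the predicted next state.
standard_nn = nn.Sequential(nn.Linear(state_dim + act_dim, 64), nn.Tanh(),
                            nn.Linear(64, state_dim))

x, u = torch.randn(8, state_dim), torch.randn(8, act_dim)
x_next_true = torch.randn(8, state_dim)  # stands in for the gym ground truth

mse = nn.MSELoss()
print("NODE MSE:", mse(node_predict(x, u), x_next_true).item())
print("NN   MSE:", mse(standard_nn(torch.cat([x, u], dim=-1)), x_next_true).item())
```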
Unicycle: In this experiment setup, a unicycle is tasked with reaching the designated location, i.e., the destination, while avoiding collisions with obstacles. The real dynamics of the system are the standard unicycle kinematics:

$$\dot{x} = v\cos\theta, \qquad \dot{y} = v\sin\theta, \qquad \dot{\theta} = \omega$$

Here, $(x, y)$ is the position of the unicycle, $\theta$ is its heading angle, and the control inputs $v$ and $\omega$ are its linear and angular velocities.
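As a quick illustration of these kinematics, the sketch below integrates them with forward Euler; the repository's environment may use a different integration scheme and time step.

```python
import numpy as np

def unicycle_step(state, action, dt=0.02):
    """One forward-Euler step of the unicycle kinematics above."""
    x, y, theta = state
    v, omega = action  # linear and angular velocity inputs
    return np.array([x + dt * v * np.cos(theta),
                     y + dt * v * np.sin(theta),
                     theta + dt * omega])

state = np.zeros(3)
for _ in range(50):  # drive forward while turning left
    state = unicycle_step(state, (1.0, 0.5))
print(state)
```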
The reward signal is formulated to encourage the unicycle to move toward the destination.
If the stability constraint is violated due to the inability to satisfy safety and stability constraints simultaneously, the unicycle can become trapped close to obstacles. In such cases, the backup controller takes over from the primary controller. The primary controller is reinstated when the unicycle moves a long distance away from the trapped position, or when the predefined time threshold for using the backup controller is exceeded.
Simulated Car Following: This environment simulates a chain of five cars following each other on a straight road. The objective is to control the acceleration of the fourth car in the chain.
Each state of the system is denoted as $s = [p_1, v_1, p_2, v_2, \ldots, p_5, v_5]^\top$, where $p_i$ and $v_i$ are the position and velocity of the $i$-th car. The model of the $i$-th car is the double integrator $\dot{p}_i = v_i$, $\dot{v}_i = a_i$, where $a_i$ is the acceleration of the $i$-th car.
The reward signal is defined to minimize the overall control effort, and an additional reward of 2.0 is granted at timesteps when the desired state is achieved.
Note that when we use NODEs to model this system, we assume that we do not have the prior information that this system is control-affine. The input of the network is therefore the concatenation of the current state and the control input, and the output is the predicted state derivative used to compute future states.
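A minimal sketch of such a single-network NODE model is shown below; the dimensions (five cars with a position and a velocity each, one acceleration input) follow the description above, while the network size and names are illustrative.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint

state_dim, act_dim, dt = 10, 1, 0.02  # 5 cars x (position, velocity); 1 acceleration input

# Without the control-affine assumption, a single network maps the
# concatenated state and action directly to the state derivative.
net = nn.Sequential(nn.Linear(state_dim + act_dim, 64), nn.Tanh(),
                    nn.Linear(64, state_dim))

def predict_next_state(x, u):
    def dynamics(t, x_t):
        return net(torch.cat([x_t, u], dim=-1))
    return odeint(dynamics, x, torch.tensor([0.0, dt]))[-1]

x_next = predict_next_state(torch.randn(1, state_dim), torch.randn(1, act_dim))
print(x_next.shape)  # torch.Size([1, 10])
```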
Planar Vertical Take-Off and Landing (PVTOL): In this experiment, a quadcopter is required to reach a destination while avoiding obstacles, keeping within a specified range along the Y-axis, and staying within a specific distance from a safety pilot along the X-axis. The state of the system consists of the positions and velocities along the X-axis and Y-axis together with the roll angle and roll rate, and the control inputs are the thrust and the rolling moment.
The reward signal is defined to minimize the distance from the destination, and an additional reward of 1500 will be given if the quadcopter reaches a small neighborhood of the destination. The cost signal is the current distance from the destination, and the safety pilot tracks the quadcopter via a proportional controller along the X-axis. Collision avoidance and confinement within the specified ranges along the X-axis and Y-axis are ensured by pre-defined CBFs, when given, following an approach similar to the previous two environments. The relative degree, and therefore the planning horizon for NODE predictions, is 3. When no pre-defined CBFs are available, the neural barrier certificate is learned jointly with the controller by using additional barrier signals.
For NODE-based models, the inputs of the two networks are the current states, and their outputs are the two terms of the control-affine dynamics, which are then used to predict the future states.
If you have any questions regarding the code or the paper, please do not hesitate to contact me by email at liqun.zhao@eng.ox.ac.uk.