# Set-Based Reinforcement Learning

This notebook implements a set-based reinforcement learning algorithm based on the paper [Training Verifiably Robust Agents using Set-Based Reinforcement Learning](https://arxiv.org/abs/2408.09112).

In [1]:
import torch
import sys
sys.path.append('..')
from SBML import ZonoTorch as zt
from SBML import SBRL as sbrl

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

seed = torch.random.manual_seed(0)



System dynamics of the 1D quadrotor with position $z$, control input $u$, mass $m$, and gravitational acceleration $g$:
$$\begin{bmatrix}\dot z\\ \ddot z\end{bmatrix} = \begin{bmatrix}\dot z\\ \frac{u+1}{2m}-g\end{bmatrix}$$
Reward function:
$$r = -|z| - 0.1\,|\dot z|$$

In [2]:
m = 0.05
g = 9.81

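def dynamics(x,u):
    # x is indexed as a (1, 2) row: x[0,0] = z (position), x[0,1] = dz/dt (velocity); u[0,0] is the control input.
    dx = torch.tensor([[x[0,1],(u[0,0]+1)/(2*m)-g]]).to(device=DEVICE)
    return dx

def reward(x,u,x_next):
    # r = -|z| - 0.1*|dz/dt|, evaluated on the current state x; u and x_next are unused.
    r = - torch.sum(torch.abs(x*torch.tensor([[1,0.1]]).to(device=DEVICE)))
    return r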

init_state = zt.set.Interval(torch.tensor([[-4,4],[0,0]]).to(device=DEVICE))
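
As a quick sanity check of the dynamics and reward above, the following sketch re-states them for plain Python floats and computes the hover input, i.e. the value of $u$ for which $\ddot z = 0$. The helper names (`dynamics_scalar`, `reward_scalar`, `euler_step`) and the Euler step size `h` are hypothetical and chosen for illustration only; they are not part of SBML.

```python
m = 0.05   # mass, same value as in the cell above
g = 9.81   # gravitational acceleration, same value as in the cell above

def dynamics_scalar(z, z_dot, u):
    """Right-hand side of the 1D quadrotor ODE, written for plain floats."""
    return z_dot, (u + 1) / (2 * m) - g

def reward_scalar(z, z_dot):
    """r = -|z| - 0.1 * |z_dot|, as in the reward formula above."""
    return -abs(z) - 0.1 * abs(z_dot)

# Hover input: (u + 1) / (2 m) = g  =>  u = 2 m g - 1
u_hover = 2 * m * g - 1
_, accel_at_hover = dynamics_scalar(0.0, 0.0, u_hover)

def euler_step(z, z_dot, u, h):
    """One explicit Euler step with (hypothetical) step size h."""
    dz, dz_dot = dynamics_scalar(z, z_dot, u)
    return z + h * dz, z_dot + h * dz_dot

# Apply full thrust (u = 1) for 10 Euler steps, starting at rest from z = -4.
z, z_dot = -4.0, 0.0
for _ in range(10):
    z, z_dot = euler_step(z, z_dot, 1.0, 0.01)

print(f"hover input u = {u_hover:.3f}")
print(f"state after 10 steps: z = {z:.4f}, z_dot = {z_dot:.4f}")
print(f"reward at that state: {reward_scalar(z, z_dot):.4f}")
```

With $m = 0.05$ the hover input is close to $-0.019$, so the admissible range $u \in [-1, 1]$ gives roughly symmetric authority: an acceleration of about $+10.19$ at $u = 1$ and $-9.81$ at $u = -1$.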


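The actor and critic are implemented in PyTorch as feedforward networks with two hidden layers (64 and 32 units, ReLU activations). The actor maps the 2-dimensional state to a single action that a final tanh squashes into $[-1, 1]$. The critic takes 3 inputs (presumably the state concatenated with the action) and outputs a scalar Q-value.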

In [3]:
actor = torch.nn.Sequential(
    torch.nn.Linear(2, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
    torch.nn.Tanh()
)

critic = torch.nn.Sequential(
    torch.nn.Linear(3, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1)
)
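
To make the input and output shapes concrete, and to illustrate what evaluating a network on a set of states means, here is a minimal sketch in plain PyTorch. It rebuilds the two networks, assumes the critic input is the state concatenated with the action, and propagates the box $[-4, 4] \times [0, 0]$ through the actor with interval arithmetic. Interval arithmetic is a coarser stand-in used here for illustration; it is not the set representation or API of ZonoTorch, and `interval_forward` is a hypothetical helper.

```python
import torch

torch.manual_seed(0)

# Same architectures as in the cell above.
actor = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 1), torch.nn.Tanh(),
)
critic = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)

# Point evaluation: a batch of two states (z, z_dot).
x = torch.tensor([[-4.0, 0.0], [4.0, 0.0]])
with torch.no_grad():
    u = actor(x)                           # shape (2, 1), values in [-1, 1]
    q = critic(torch.cat([x, u], dim=1))   # assumption: critic sees [state, action]

def interval_forward(net, lower, upper):
    """Propagate an axis-aligned box through Linear / ReLU / Tanh layers."""
    for layer in net:
        if isinstance(layer, torch.nn.Linear):
            center = (lower + upper) / 2
            radius = (upper - lower) / 2
            center = center @ layer.weight.T + layer.bias
            radius = radius @ layer.weight.abs().T
            lower, upper = center - radius, center + radius
        elif isinstance(layer, (torch.nn.ReLU, torch.nn.Tanh)):
            # Both activations are monotone, so the bounds map elementwise.
            lower, upper = layer(lower), layer(upper)
        else:
            raise TypeError(f"unsupported layer: {type(layer).__name__}")
    return lower, upper

# Set evaluation: bounds on the action over the whole initial box.
lo = torch.tensor([[-4.0, 0.0]])
hi = torch.tensor([[4.0, 0.0]])
with torch.no_grad():
    u_lo, u_hi = interval_forward(actor, lo, hi)

print("action shape:", tuple(u.shape), "Q shape:", tuple(q.shape))
print("action bounds over the box:", u_lo.item(), u_hi.item())
```

The resulting bounds enclose the actor output for every state in the box, although they are usually loose. Tighter enclosures are the reason set-based methods tend to use richer set representations than intervals.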

In [4]:
env_options = {
    'ct': 0.1,
    'dt': 0.01,
    'max_step': 30,
    'initial_ops': 'uniform',
}

senv = sbrl.SetEnvironmnent(init_state,env_options,dynamics,reward,device=DEVICE)

ddpg_ops = {
    'actor_lr': 1e-4,
    'actor_train_mode': 'set',
    'critic_lr': 1e-3,
    'critic_train_mode': 'set',
    'gamma': 0.99,
    'tau': 0.005,
    'buffer_size': 10000,
    'batch_size': 64,
    'exp_noise': 0.2,
    'action_ub': 1,
    'action_lb': -1,
    'noise': .1,
    'actor_eta': 0.01,
}
agent = sbrl.DDPG(actor,critic,ddpg_ops,DEVICE)
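
For orientation, the sketch below shows the role that `gamma` and `tau` play in the standard DDPG formulation: `gamma` discounts the bootstrapped critic target, and `tau` controls the soft (Polyak) update of the target networks. This assumes `sbrl.DDPG` follows the standard scheme; its internals are not shown in this notebook, and `soft_update` and `td_target` are hypothetical helper names.

```python
import copy
import torch

torch.manual_seed(0)

gamma = 0.99   # discount factor, as in ddpg_ops
tau = 0.005    # soft-update rate, as in ddpg_ops

def td_target(reward, q_next, gamma):
    """Bootstrapped critic target y = r + gamma * Q_target(s', mu_target(s'))."""
    return reward + gamma * q_next

def soft_update(target_net, source_net, tau):
    """Polyak averaging: theta_target <- (1 - tau) * theta_target + tau * theta."""
    with torch.no_grad():
        for p_target, p in zip(target_net.parameters(), source_net.parameters()):
            p_target.mul_(1 - tau).add_(tau * p)

# Toy demonstration on a single linear layer.
net = torch.nn.Linear(3, 1)
target = copy.deepcopy(net)

# Pretend one training step shifted every weight of `net` by 1.
with torch.no_grad():
    net.weight.add_(1.0)

old_weight = target.weight.clone()
soft_update(target, net, tau)

y = td_target(torch.tensor(-1.0), torch.tensor(-2.0), gamma)
print("target weight shift:", (target.weight - old_weight).mean().item())
print("TD target:", y.item())
```

With `tau = 0.005` the target network moves only 0.5% of the way toward the trained network per update, which keeps the critic target slowly varying.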

In [5]:
agent.train(senv,1000,True)

Reinforcment Learning Parameters:
Standard-RL Options:
--------------------
Discount Factor (gamma): 0.99
Buffer Size: 10000
Batch Size: 64
Episodes: 1000
Device: cpu

Actor Options:
--------------
Learning Rate: 0.0001
Training Mode: set
Eta: 0.01
Omega: 0.5
Noise: 0.1

Critic Options:
---------------
Learning Rate: 0.001
Training Mode: set
Eta: 0.01


Training Information:
|Episode	|Elapsed Time	|Reward	|Q-Value	|Critic-Loss	|Actor-Loss
|-------	|----------	|----------	|-----------	|-----------	|----------
|0	|1.2911546230316162	|-365.95947265625	|-0.11112702637910843	|1.1340760922431945	|0.4561907829344273
|1	|4.103039741516113	|-331.2398986816406	|-2.3122174739837646	|0.340932719707489	|4.110089967250824
|2	|6.805701017379761	|-296.4527893066406	|-0.31406527757644653	|0.3338134071230888	|5.251364207267761
|3	|9.638569355010986	|-190.66384887695312	|-2.3686912059783936	|0.41944458842277527	|6.440931539535523
|4	|12.279626369476318	|-423.9900817871094	|-2.1137397289276123	|0.49585912

KeyboardInterrupt: 