# Set-Based Reinforcement Learning

This notebook implements a set-based reinforcement learning algorithm based on the paper [Training Verifiably Robust Agents using Set-Based Reinforcement Learning](https://arxiv.org/abs/2408.09112).

In [1]:
import torch
import sys
sys.path.append('..')
from SBML import ZonoTorch as zt
from SBML import SBRL as sbrl

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

seed = torch.random.manual_seed(0)

System dynamics of the 1D quadrotor:
$$\begin{bmatrix}\dot z\\ \ddot z\end{bmatrix} = \begin{bmatrix}\dot z\\ \dfrac{u+1}{2m}-g\end{bmatrix}$$
Reward function:
$$r = -|z| - 0.1\,|\dot z|$$
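
As a quick sanity check on the dynamics above, setting $\ddot z = 0$ gives the hover input $u = 2mg - 1$. The sketch below evaluates it for `m = 0.05` and `g = 9.81`, the values used in this notebook. `hover_input` is a hypothetical helper name introduced here for illustration and is not part of `SBML`.

```python
def hover_input(m: float, g: float) -> float:
    """Input u for which (u + 1) / (2 * m) - g = 0, i.e. zero vertical acceleration."""
    return 2.0 * m * g - 1.0


m, g = 0.05, 9.81
u_hover = hover_input(m, g)

# Plug the result back into the second row of the dynamics as a check.
accel = (u_hover + 1.0) / (2.0 * m) - g
print(u_hover, accel)
```

The hover input is about $-0.019$, which lies inside the action range $[-1, 1]$. At the bounds, $u = 1$ gives an acceleration of $1/m - g = 10.19$ and $u = -1$ gives $-g = -9.81$.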

In [2]:
m = 0.05
g = 9.81

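def dynamics(x,u):
    # x: state of shape (1, 2) holding [z, z_dot]; u: action of shape (1, 1)
    # returns the time derivative [z_dot, z_ddot] with the same shape as x
    dx = torch.tensor([[x[0,1],(u[0,0]+1)/(2*m)-g]]).to(device=DEVICE)
    return dx

def reward(x,u,x_next):
    # r = -|z| - 0.1*|z_dot|, evaluated on the current state x
    r = - torch.sum(torch.abs(x*torch.tensor([[1,0.1]]).to(device=DEVICE)))
    return r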

init_state = zt.set.Interval(torch.tensor([[-4,4],[0,0]]).to(device=DEVICE))
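
To see how `dynamics` and `reward` fit together, the following minimal sketch rolls a single point state forward with explicit Euler steps. It rests on assumptions: the integrator, the step size `dt = 0.01`, and the helper name `euler_step` are chosen here for illustration and may differ from what `SBRL` does internally. Both functions are redefined on the CPU so that the block is self-contained.

```python
import torch

m, g = 0.05, 9.81


def dynamics(x, u):
    # x has shape (1, 2) = [z, z_dot]; u has shape (1, 1).
    return torch.stack([x[0, 1], (u[0, 0] + 1) / (2 * m) - g]).unsqueeze(0)


def reward(x, u, x_next):
    # r = -|z| - 0.1 * |z_dot|, evaluated on the current state x.
    return -torch.sum(torch.abs(x * torch.tensor([[1.0, 0.1]])))


def euler_step(x, u, dt=0.01):
    # Hypothetical helper: one explicit Euler step of the dynamics.
    return x + dt * dynamics(x, u)


x = torch.tensor([[1.0, 0.0]])       # start at z = 1, at rest
u = torch.tensor([[2 * m * g - 1]])  # hover input, so z_ddot is (numerically) zero
for _ in range(10):
    x_next = euler_step(x, u)
    r = reward(x, u, x_next)
    x = x_next
print(x, r)
```

Under the hover input the state stays at $z = 1$ up to floating-point error, so every step is penalized with a reward of roughly $-1$.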


The actor and critic are implemented in PyTorch as simple feedforward neural networks with two hidden layers each (64 and 32 units, ReLU activations). The actor ends in a `Tanh` layer, which keeps its output in $[-1, 1]$.

In [3]:
actor = torch.nn.Sequential(
    torch.nn.Linear(2, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
    torch.nn.Tanh()
)

critic = torch.nn.Sequential(
    torch.nn.Linear(3, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1)
)
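
The input sizes follow from the problem dimensions: the actor consumes the 2-dimensional state, and the critic presumably consumes the state concatenated with the 1-dimensional action (2 + 1 = 3 inputs). The shape check below is a sketch under that assumption. How `sbrl.DDPG` actually assembles the critic input is not shown in this notebook.

```python
import torch

# Same architectures as above, rebuilt so the block runs on its own.
actor = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 1), torch.nn.Tanh(),
)
critic = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)

# Batch of three states [z, z_dot] taken from the initial interval.
states = torch.tensor([[-4.0, 0.0], [0.0, 0.0], [4.0, 0.0]])
with torch.no_grad():
    actions = actor(states)  # shape (3, 1), bounded by tanh
    # Assumed critic input: state and action concatenated along the feature axis.
    q_values = critic(torch.cat([states, actions], dim=1))
print(actions.shape, q_values.shape)
```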

In [4]:
env_options = {
    'ct': 0.1,
    'dt': 0.01,
    'max_step': 30,
    'initial_ops': 'uniform',
}

senv = sbrl.SetEnvironmnent(init_state,env_options,dynamics,reward,device=DEVICE)

ddpg_ops = {
    'actor_lr': 0.001,
    'actor_train_mode': 'set',
    'critic_lr': 0.01,
    'critic_train_mode': 'point',
    'gamma': 0.99,
    'tau': 0.005,
    'buffer_size': 1000,
    'batch_size': 64,
    'exp_noise': 0.1,
    'action_ub': 1,
    'action_lb': -1,
}
agent = sbrl.DDPG(actor,critic,ddpg_ops,DEVICE)
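
In standard DDPG, `tau` is the Polyak averaging coefficient for the target networks and `gamma` is the discount factor in the bootstrapped critic target. Assuming `sbrl.DDPG` follows this convention (its internals are not shown here), the soft target update with `tau = 0.005` can be sketched as follows. `soft_update` is a hypothetical helper, not a function of the library.

```python
import copy

import torch


def soft_update(target, source, tau):
    # theta_target <- tau * theta_source + (1 - tau) * theta_target
    with torch.no_grad():
        for p_t, p_s in zip(target.parameters(), source.parameters()):
            p_t.mul_(1.0 - tau).add_(p_s, alpha=tau)


net = torch.nn.Linear(3, 1)
target_net = copy.deepcopy(net)

# Pretend a training step shifted every online parameter by exactly 1.
with torch.no_grad():
    for p in net.parameters():
        p.add_(1.0)

soft_update(target_net, net, tau=0.005)

# Remaining distance between online and target weights after one soft update.
gap = (net.weight - target_net.weight).abs().max().item()
print(gap)
```

With `tau = 0.005` the target moves only 0.5% of the way toward the online network per update, which is the usual way to stabilize the critic's bootstrap target.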

In [5]:
agent.train(senv,1000,True)

Reinforcment Learning Parameters:
Standard-RL Options:
--------------------
Discount Factor (gamma): 0.99
Buffer Size: 1000
Batch Size: 64
Episodes: 1000
Device: cuda

Actor Options:
--------------
Learning Rate: 0.001
Training Mode: set
Eta: 0.01
Omega: 0.5
Noise: 0.1

Critic Options:
---------------
Learning Rate: 0.01
Training Mode: point


Training Information:
|Episode	|Elapsed Time	|Reward	|Q-Value	|Critic-Loss	|Actor-Loss
|-------	|----------	|----------	|-----------	|-----------	|----------
|0	|0.6068704128265381	|-478.73883056640625	|-0.10723283886909485	|0.36090936977416277	|1.4464703750610353
|1	|1.449418306350708	|-3331.372314453125	|-2.94315767288208	|0.8142036758596077	|12.264281144142151
|2	|2.2633895874023438	|-1686.52978515625	|-5.619321346282959	|16.098948203362525	|28.694989967346192
|3	|3.1050291061401367	|-4848.80322265625	|-2.057060718536377	|42.94912091493607	|36.22862781524658
|4	|3.9404890537261963	|nan	|-3.5866775512695312	|nan	|nan
|5	|4.770686626434326	|nan	

KeyboardInterrupt: 