# Set-Based Reinforcement Learning: Hopper Environment

In this notebook, we train a set-based reinforcement learning agent on the Hopper environment, following the paper [Training Verifiably Robust Agents using Set-Based Reinforcement Learning](https://arxiv.org/abs/2408.09112).
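
To make the idea of a set-based forward pass concrete, here is a minimal sketch that pushes a box of perturbed observations through one linear layer and a ReLU using interval arithmetic. Intervals are a simpler stand-in for the zonotope representation that the `ZonoTorch` module name suggests. The helper names (`linear_bounds`, `relu_bounds`) and the toy weights are hypothetical and are not part of `SBML`.

```python
def linear_bounds(W, b, lo, hi):
    """Propagate the box [lo, hi] through y = W x + b, one output at a time."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = h = bias
        for w, a, c in zip(row, lo, hi):
            # A positive weight maps lower to lower; a negative weight swaps the bounds.
            l += w * a if w >= 0 else w * c
            h += w * c if w >= 0 else w * a
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi


def relu_bounds(lo, hi):
    """ReLU is monotone, so it can be applied to each bound separately."""
    return [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]


# Toy layer with 2 inputs and 2 outputs (hypothetical weights).
W = [[1.0, -2.0], [0.5, 0.5]]
b = [0.0, -1.0]

# Nominal observation with an L-infinity perturbation radius of 0.1.
x = [1.0, 0.5]
radius = 0.1
lo = [v - radius for v in x]
hi = [v + radius for v in x]

z_lo, z_hi = linear_bounds(W, b, lo, hi)   # pre-activation bounds
y_lo, y_hi = relu_bounds(z_lo, z_hi)       # post-activation bounds
```

Every output the layer can produce for an observation inside the box lies between `y_lo` and `y_hi`. A set-based training scheme can then penalize the size of such an output set in addition to the usual point-based loss.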

In [1]:
import torch
import matplotlib.pyplot as plt
import numpy as np
import os
import sys
sys.path.append('..')
from SBML import ZonoTorch as zt
from SBML import SBRL as sbrl

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

seed = 4

# Seed both libraries for reproducibility (np.random.seed returns None, so nothing is assigned)
torch.manual_seed(seed)
np.random.seed(seed)

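The actor and critic are implemented in PyTorch as simple feedforward networks with two hidden layers (400 and 300 units). The actor maps the 11-dimensional Hopper observation to a 3-dimensional action, which the final Tanh bounds to [-1, 1]. The critic takes a 14-dimensional input, which matches the observation concatenated with the action (11 + 3), and outputs a scalar Q-value.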

In [2]:
actor = torch.nn.Sequential(
    torch.nn.Linear(11, 400),
    torch.nn.ReLU(),
    torch.nn.Linear(400, 300),
    torch.nn.ReLU(),
    torch.nn.Linear(300, 3),
    torch.nn.Tanh()
)

critic = torch.nn.Sequential(
    torch.nn.Linear(14, 400),
    torch.nn.ReLU(),
    torch.nn.Linear(400, 300),
    torch.nn.ReLU(),
    torch.nn.Linear(300, 1)
)

In [3]:
env_options = {
    'max_step': 1000,
}

senv = sbrl.GymEnvironment('Hopper-v3',env_options, DEVICE)

ddpg_ops = {
    'actor_lr': 1e-4,
    'actor_train_mode': 'set',
    'critic_lr': 1e-3,
    'critic_l2': 0.01,
    'critic_train_mode': 'point',
    'gamma': 0.99,
    'tau': 0.001,
    'buffer_size': 1e6,
    'batch_size': 64,
    'exp_noise': 0.2,
    'action_ub': 1,
    'action_lb': -1,
    'noise': .1,
    'actor_eta': 0.001,
    'actor_omega': 0.5,
}
agent = sbrl.DDPG(actor,critic,ddpg_ops,DEVICE)
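
In standard DDPG, `tau` is the rate at which the target-network weights track the learned weights. The sketch below shows that update on plain Python lists. Whether `sbrl.DDPG` applies `tau` in exactly this way is an assumption, and `soft_update` is a hypothetical helper, not part of `SBML`.

```python
def soft_update(target, source, tau):
    """Polyak averaging: target <- tau * source + (1 - tau) * target."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]


target_w = [0.0, 0.0]    # toy target-network weights
source_w = [1.0, -1.0]   # toy learned weights, held fixed here
tau = 0.001              # same value as in ddpg_ops

# Apply 1000 consecutive soft updates.
for _ in range(1000):
    target_w = soft_update(target_w, source_w, tau)
```

With `tau = 0.001`, the target closes roughly 63% of the gap to fixed source weights after 1000 updates, which is why such a small value keeps the critic's bootstrap targets slowly moving.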


In [4]:
agent.train(senv,3e6,True)

Reinforcment Learning Parameters:
Standard-RL Options:
--------------------
Discount Factor (gamma): 0.99
Buffer Size: 1000000.0
Batch Size: 64
Steps: 3000000.0
Device: cuda

Actor Options:
--------------
Learning Rate: 0.0001
Training Mode: set
Eta: 0.001
Omega: 0.5
Noise: 0.1

Critic Options:
---------------
Learning Rate: 0.001
Training Mode: point


Found 4 GPUs for rendering. Using device 0.
Training Information:
|Step     |Time |Reward   |Q-Value  |Critic-Loss |Actor-Loss |
|---------|-----|---------|---------|------------|-----------|
|1.00e+00 |0.0  |1.00e+00 |7.15e-02 |0.00e+00    |0.00e+00   |
|1.00e+03 |0.3  |6.18e+00 |2.07e+00 |2.86e-02    |-2.09e+00  |
|2.00e+03 |0.5  |3.10e+00 |2.99e+00 |2.97e-01    |-4.04e+00  |


KeyboardInterrupt: 

In [None]:
# Extract reward history
rewards = np.array(agent.learn_hist['reward'])

# Compute moving average with a window of 10 evaluations
window_size = 10
moving_avg = np.convolve(rewards, np.ones(window_size) / window_size, mode='valid')

# One reward entry is assumed per 1000 training steps (matching the training log)
steps_per_entry = 1000
steps_executed = steps_per_entry * np.arange(len(rewards))

# Plot the reward history; the moving average starts once a full window is available
plt.figure()
plt.plot(steps_executed, rewards, label='Raw Reward', alpha=0.3)
plt.plot(steps_executed[window_size - 1:], moving_avg, label='Moving Average (10 evals)', color='red')
plt.title('Reward')
plt.xlabel('Steps')
plt.ylabel('Reward')
plt.legend()

# Create the output directory if it does not exist yet
os.makedirs('figures', exist_ok=True)

plt.savefig('figures/LearnHist.png')
plt.show()
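
The plotting cell offsets the x-axis of the moving average by `window_size - 1` because `mode='valid'` only returns windows that fit entirely inside the data. The following pure-Python sketch reproduces that behavior on a tiny hypothetical reward list; `moving_average_valid` is an illustrative helper, not a NumPy function.

```python
def moving_average_valid(values, window):
    """Mean of each full window; returns len(values) - window + 1 entries."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


rewards = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy data
window = 3
avg = moving_average_valid(rewards, window)

# Index of the last sample in each window, used as the x-coordinate.
x = list(range(window - 1, len(rewards)))
```

Here `avg` has three entries for five rewards, and `x` starts at index 2, so each average is plotted at the position of the most recent reward it covers.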


In [None]:
# Save the model:
torch.save(agent.actor.to('cpu').state_dict(), 'actor.pth')
torch.save(agent.critic.to('cpu').state_dict(), 'critic.pth')

# Save the reward history
np.save('rewards.npy', rewards)
np.save('reward_mvg.npy', moving_avg)