# PPO

Proximal Policy Optimization

### Policy gradient methods

Policy gradient methods are a family of RL methods that estimate the gradient of the expected return with respect to the policy parameters, then improve the policy by gradient ascent. In practice, this means alternating between sampling and optimization: the current policy collects a batch of samples, the batch is used to estimate the policy gradient, the policy is updated, and more samples are taken with the updated policy. There is a large variety of policy gradient methods; they all apply gradient ascent to a policy gradient estimate and differ in how they build that estimate and how they limit each update.
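To make the sample/optimize loop concrete, here is a minimal sketch of a plain policy gradient (REINFORCE-style) update on a two-armed bandit, using only the standard library. The bandit, the function names `softmax` and `train_bandit`, and all hyperparameter values are assumptions chosen for illustration; this is not PPO and not part of the notebook's training code.

```python
import math
import random


def softmax(logits):
    """Convert logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(rewards, iterations=200, batch_size=64, lr=0.5, seed=0):
    """Alternate between sampling actions and one gradient-ascent step."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)  # policy parameters
    for _ in range(iterations):
        probs = softmax(logits)
        grad = [0.0] * len(logits)
        # Sampling phase: draw actions from the current policy.
        for _ in range(batch_size):
            action = rng.choices(range(len(probs)), weights=probs)[0]
            reward = rewards[action]
            # d/d logits of log pi(action) is one_hot(action) - probs.
            for i in range(len(logits)):
                indicator = 1.0 if i == action else 0.0
                grad[i] += reward * (indicator - probs[i]) / batch_size
        # Optimization phase: one gradient-ascent step on the estimate.
        logits = [w + lr * g for w, g in zip(logits, grad)]
    return softmax(logits)


# Arm 1 always pays 1.0 and arm 0 pays nothing, so the policy should
# shift most of its probability mass onto arm 1.
final_probs = train_bandit(rewards=[0.0, 1.0])
print(final_probs)
```

Each iteration throws away its samples after a single update, which is the inefficiency that PPO's multiple epochs per batch are meant to reduce.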

### Actor-critic structure



### How does PPO do this?

From the Proximal Policy Optimization paper by OpenAI: https://doi.org/10.48550/arXiv.1707.06347

![Alt text](PPODescription.png)
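The central idea of the PPO paper is the clipped surrogate objective, $L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$, where $r_t(\theta)$ is the probability ratio between the new and old policy and $\hat{A}_t$ is the advantage estimate. The sketch below evaluates that expression for a single sample in plain Python. The function name `clipped_surrogate` is hypothetical and this is not the `ClipPPOLoss` implementation from torchrl.

```python
def clipped_surrogate(ratio, advantage, clip_epsilon=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_epsilon, min(ratio, 1.0 + clip_epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage, ratio above 1 + eps: the gain is capped at (1 + eps) * A.
capped_gain = clipped_surrogate(ratio=1.5, advantage=2.0)

# Negative advantage, ratio below 1 - eps: the objective stays at (1 - eps) * A,
# so pushing the ratio further down earns nothing extra.
capped_loss = clipped_surrogate(ratio=0.5, advantage=-1.0)

# Negative advantage, ratio above 1 + eps: no clipping, the full penalty applies.
full_penalty = clipped_surrogate(ratio=1.5, advantage=-1.0)

print(capped_gain, capped_loss, full_penalty)
```

Taking the minimum makes the objective a pessimistic bound: the policy is never rewarded for moving the ratio outside the clip range, but it is still penalized when such a move makes things worse.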

### Why PPO?

PPO has been shown to perform well in continuous environments, which are the environments we face in robotics. It strikes a good balance between reliability and sample efficiency, outperforming Deep Q-Networks in a variety of scenarios. Because of this, PPO is available in many of the frameworks included in IsaacLab, which we will be using next.

#### Dependencies

In [5]:
%pip install torch
%pip install torchrl
%pip install gym[mujoco]
%pip install tqdm

Note: you may need to restart the kernel to use updated packages.


In [6]:
import warnings
warnings.filterwarnings("ignore")
from torch import multiprocessing


from collections import defaultdict

import matplotlib.pyplot as plt
import torch
from tensordict.nn import TensorDictModule
from tensordict.nn.distributions import NormalParamExtractor
from torch import nn
from torchrl.collectors import SyncDataCollector
from torchrl.data.replay_buffers import ReplayBuffer
from torchrl.data.replay_buffers.samplers import SamplerWithoutReplacement
from torchrl.data.replay_buffers.storages import LazyTensorStorage
from torchrl.envs import (Compose, DoubleToFloat, ObservationNorm, StepCounter,
                          TransformedEnv)
from torchrl.envs.libs.gym import GymEnv
from torchrl.envs.utils import check_env_specs, ExplorationType, set_exploration_type
from torchrl.modules import ProbabilisticActor, TanhNormal, ValueOperator
from torchrl.objectives import ClipPPOLoss
from torchrl.objectives.value import GAE
from tqdm import tqdm

#### Define hyperparameters

In [9]:
is_fork = multiprocessing.get_start_method() == "fork"
device = (
    torch.device(0)
    if torch.cuda.is_available() and not is_fork
    else torch.device("cpu")
)
print(device)
num_cells = 256  # number of cells in each layer i.e. output dim.
lr = 3e-4
max_grad_norm = 1.0

# Data collection parameters

frames_per_batch = 1000
# For a complete training, bring the number of frames up to 1M
total_frames = 50_000

# PPO parameters
sub_batch_size = 64  # size of each sub-batch sampled from the collected batch in the inner loop
num_epochs = 10  # number of optimization passes (epochs) over each collected batch
clip_epsilon = 0.2  # clip range for the PPO loss: see the equation in the intro for more context
gamma = 0.99  # discount factor
lmbda = 0.95  # GAE lambda: trades bias against variance in the advantage estimate
entropy_eps = 1e-4  # weight of the entropy bonus in the loss

cuda:0
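The `gamma` and `lmbda` values above feed the advantage estimate (the notebook imports `GAE` from torchrl for this). As a rough sketch of what Generalized Advantage Estimation computes, the plain-Python function below walks one trajectory backwards and accumulates discounted TD errors. The name `gae_advantages` and the assumption that the trajectory does not terminate inside the window are mine; torchrl's `GAE` module also handles done flags and batching, which this sketch omits.

```python
def gae_advantages(rewards, values, next_value, gamma=0.99, lmbda=0.95):
    """Generalized Advantage Estimation for one non-terminating trajectory.

    rewards[t] and values[t] are the reward and value estimate at step t;
    next_value is the value estimate of the state after the last step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        v_next = next_value if t == len(rewards) - 1 else values[t + 1]
        delta = rewards[t] + gamma * v_next - values[t]  # one-step TD error
        running = delta + gamma * lmbda * running  # discounted sum of TD errors
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
advantages = gae_advantages(rewards, values, next_value=0.5)
print(advantages)
```

With `lmbda=0` the estimate reduces to the one-step TD error (low variance, more bias); with `lmbda=1` it becomes the discounted return minus the value baseline (less bias, more variance). Values such as 0.95 sit between the two.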


#### Define environment

In [10]:
base_env = GymEnv("InvertedDoublePendulum-v4", device=device)

Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
Users of this version of Gym should be able to simply replace 'import gym' with 'import gymnasium as gym' in the vast majority of cases.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.


#### Normalization parameters

In [None]:
env = TransformedEnv(
    base_env,
    Compose(
        # normalize observations
        ObservationNorm(in_keys=["observation"]),
        DoubleToFloat(),
        StepCounter(),
    ),
)

env.transform[0].init_stats(num_iter=1000, reduce_dim=0, cat_dim=0)
print("normalization constant shape:", env.transform[0].loc.shape)

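Conceptually, the `init_stats` call above gathers observations from the environment and derives per-dimension location and scale statistics so that normalized observations are roughly zero-mean and unit-variance. The sketch below shows that idea in plain Python on made-up data. The helper names `fit_normalizer` and `normalize` and the sample values are hypothetical, and the exact parameterization of `loc` and `scale` inside torchrl's `ObservationNorm` may differ from this.

```python
import statistics


def fit_normalizer(observations):
    """Compute the per-dimension mean and standard deviation of a batch."""
    columns = list(zip(*observations))  # one tuple per observation dimension
    means = [statistics.fmean(c) for c in columns]
    # Fall back to 1.0 for constant dimensions to avoid dividing by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stds


def normalize(obs, means, stds):
    """Standardize one observation with the fitted statistics."""
    return [(x - m) / s for x, m, s in zip(obs, means, stds)]


# Made-up observations with two dimensions on very different scales.
samples = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, stds = fit_normalizer(samples)
normalized = [normalize(o, means, stds) for o in samples]
print(normalized)
```

Fitting the statistics once, before training, keeps the input scale fixed for the policy and value networks for the rest of the run.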