### Step 0 : Init RoboFeeder Env
This cell changes the working directory to the root of the gym4ReaL repository, so that the relative paths used in the following cells resolve correctly.

In [None]:
import os

cwd = os.getcwd()  # named `cwd` to avoid shadowing the built-in `dir`
if 'gym4ReaL' in cwd:
    # Move to the repository root, wherever inside the repository the notebook was started
    os.chdir(cwd.split('gym4ReaL')[0] + 'gym4ReaL')
else:
    print("Please set the working directory to the root of the gym4ReaL repository")

# Display the current working directory to confirm it is the repository root
os.getcwd()
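
The same lookup can be expressed with `pathlib`, which matches whole directory names instead of substrings. This is only a minimal sketch: `find_repo_root` is a hypothetical helper (not part of gym4ReaL), and it assumes the repository folder is named exactly `gym4ReaL`.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path, repo_name: str = "gym4ReaL") -> Optional[Path]:
    """Return `start` or its closest ancestor whose directory name is `repo_name`."""
    for candidate in (start, *start.parents):
        if candidate.name == repo_name:
            return candidate
    return None  # not inside the repository


root = find_repo_root(Path.cwd())
if root is None:
    print("Please start the notebook from inside the gym4ReaL repository")
else:
    # Passing `root` to os.chdir() would reproduce the effect of the cell above
    print(f"Repository root: {root}")
```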

### Step 1 : Import Required Modules
This cell adds the repository root to `sys.path` so that the `gym4real` package can be imported. It then imports the RoboFeeder picking environment (`robotEnv`) and the Stable Baselines3 components used to train the model.

In [None]:
import sys
sys.path.append(os.getcwd())  # <-- path to the *parent* of gym4real

from stable_baselines3 import PPO
from gym4real.envs.robofeeder.rf_picking_v0 import robotEnv

from stable_baselines3.common.vec_env import SubprocVecEnv, DummyVecEnv
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.env_checker import check_env
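
Before running the imports, it can help to check which of the packages used in this notebook are actually installed. The snippet below is a small sketch using only the standard library; `missing_modules` is a hypothetical helper name, and the list of packages is taken from the imports that appear in this notebook.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on the current path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["stable_baselines3", "gymnasium", "torch", "gym4real"]
missing = missing_modules(required)
print("Missing modules:", missing if missing else "none")
```

If `gym4real` is reported as missing, the working directory or `sys.path` entry from the previous steps is most likely not pointing at the repository root.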

### Step 2 : Test Simulator
Copy the simulator's default configuration file to an editable file. The environment reads its run parameters from this file.
Relevant parameters to adapt:

#### ObjectToPick
    NUMBER_OF_OBJECTS: 1           # (int) Number of objects to pick
    SHUFFLE_OBJECTS: True          # (bool) Shuffle object positions at reset
    OBJ_CORRECT_ORIENTATION: True  # (bool) Ensure objects have correct orientation

#### Simulator Setting
    IS_SIMULATION_REAL_TIME: False   # (bool) Run simulation in real time
    IS_SIMULATION_SHOWED: True       # (bool) Show simulation window
    IS_SIMULATION_RECORD: False      # (bool) Record simulation video
    RECORD_FOLDER: "."               # (str) Folder to save recorded videos


In [None]:
import shutil

# Copy the default configuration file to a new editable file
default_config_file = os.getcwd() + "/gym4real/envs/robofeeder/configuration.yaml"
config_file = os.getcwd() + "/examples/robofeeder/notebooks/configuration_editable.yaml"
shutil.copy(default_config_file, config_file)
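
Once the editable copy exists, individual parameters can be changed without opening an editor. The sketch below rewrites a single `KEY: value` line while keeping the trailing comment. It assumes each parameter sits on its own line in that form and that the value contains no `#` character; `set_config_value` is a hypothetical helper, not part of gym4ReaL.

```python
import re


def set_config_value(text: str, key: str, value: str) -> str:
    """Replace the value of `key` in YAML-like text, preserving indentation and comments."""
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*:[ \t]*)([^#\n]*?)([ \t]*(?:#.*)?)$",
        re.MULTILINE,
    )
    new_text, count = pattern.subn(lambda m: f"{m.group(1)}{value}{m.group(3)}", text)
    if count == 0:
        raise KeyError(f"{key} not found in configuration text")
    return new_text


# Demonstration on an in-memory sample that mirrors the parameters listed above
sample = (
    "NUMBER_OF_OBJECTS: 1           # (int) Number of objects to pick\n"
    "IS_SIMULATION_SHOWED: True       # (bool) Show simulation window\n"
)
updated = set_config_value(sample, "IS_SIMULATION_SHOWED", "False")
print(updated)
```

To apply it to the copied file, read `config_file` into a string, call `set_config_value`, and write the result back. Disabling `IS_SIMULATION_SHOWED` is a plausible choice when several environments run in parallel, since each one would otherwise open its own window.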


### Step 3 : Define Environment Creation Function

This cell defines a helper function `make_env` that creates and returns a new instance of the `robotEnv` environment using the specified configuration file. This function is used to generate multiple environments for parallel training with vectorized environments.


In [3]:
# Function to generate the environment to stack in a vectorEnv
def make_env(config_file):
    def _init():
        env = robotEnv(config_file=config_file)
        return env
    return _init
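
The nested-function pattern matters because a vectorized environment expects a list of callables, each of which builds its own environment when called. The toy sketch below uses plain dictionaries as stand-ins for environments (`make_factory` and the `tag` field are purely illustrative) to show that a factory function binds its argument at creation time, whereas a `lambda` written inside a loop reads the loop variable only when it is called.

```python
def make_factory(tag):
    """Return a zero-argument callable that builds an object bound to `tag`."""
    def _init():
        return {"tag": tag}
    return _init


# Each factory remembers the value it was created with
factories = [make_factory(i) for i in range(1, 3)]
built = [f() for f in factories]

# Lambdas share the loop variable, so both see its final value
late = [lambda: {"tag": i} for i in range(1, 3)]
late_built = [f() for f in late]

print(built)       # [{'tag': 1}, {'tag': 2}]
print(late_built)  # [{'tag': 2}, {'tag': 2}]
```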

### Step 4 : Create Vectorized Environment

This cell sets up a vectorized environment using `SubprocVecEnv` to enable parallel simulation of multiple environments. It uses the `make_env` function to create separate instances of the `robotEnv` environment, each with its own process. The number of parallel environments is determined by `num_cpu`. This setup is essential for efficient training of reinforcement learning models.


In [None]:
num_cpu = 2  # Number of processes, one environment per process
env = SubprocVecEnv([make_env(config_file) for i in range(1, num_cpu + 1)])  # Create the vectorized env (ROS_ID starts at 1)
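
Because every environment runs in its own process, requesting more environments than there are CPU cores rarely helps. The sketch below caps the requested count at the number of cores reported by the operating system; `pick_num_envs` is a hypothetical helper, and the cap itself is an assumption about what is sensible rather than a requirement of the library.

```python
import os
from typing import Optional


def pick_num_envs(requested: int, available: Optional[int] = None) -> int:
    """Clamp the requested number of environments to the range [1, available cores]."""
    if available is None:
        available = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(requested, available))


num_envs = pick_num_envs(2)
print(f"Using {num_envs} parallel environment(s)")
```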




### Step 5 : Define Custom CNN Feature Extractor and Policy

This cell defines a custom convolutional neural network (CNN) feature extractor by subclassing `BaseFeaturesExtractor` from Stable Baselines3. The custom extractor processes image observations for the reinforcement learning agent. It also sets up the policy architecture and optimizer parameters for the PPO agent, specifying the network layers for the policy (`pi`) and value function (`vf`), as well as other relevant hyperparameters.


In [5]:
from gymnasium import spaces
import torch as th
import torch.nn as nn
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

class CustomCNN(BaseFeaturesExtractor):
    def __init__(self, observation_space: spaces.Box, features_dim: int = 256):
        super().__init__(observation_space, features_dim)

        n_input_channels = observation_space.shape[0]
        ks = 3

        self.cnn = nn.Sequential(
            nn.Conv2d(n_input_channels, 16, kernel_size=ks, stride=2, padding=1),
            nn.LeakyReLU(),
            nn.Conv2d(16, 32, kernel_size=ks, stride=2, padding=1),
            nn.LeakyReLU(),
            nn.Conv2d(32, 64, kernel_size=ks, stride=2, padding=1),
            nn.LeakyReLU(),
            nn.Conv2d(64, 64, kernel_size=ks, stride=3, padding=1),
            nn.LeakyReLU(),
            nn.Flatten(),
        )

        # Dynamically calculate CNN output size
        with th.no_grad():
            dummy_input = th.zeros(1, *observation_space.shape)
            flat_output = self.cnn(dummy_input)
            cnn_output_dim = flat_output.shape[1]
            #print(f"Raw CNN output dim: {cnn_output_dim}")

        # Final projection to fixed feature dim
        self.linear = nn.Sequential(
            nn.Linear(cnn_output_dim, features_dim),
            nn.ReLU()
        )

        self._features_dim = features_dim

    def forward(self, observations: th.Tensor) -> th.Tensor:
        features = self.cnn(observations)
        return self.linear(features)


pi = [256,256,128]
vf = [256,256,128]

features_dim = 256
optimizer_kwargs = dict(weight_decay=1e-5)


policy_kwargs = dict(normalize_images=False,
                     features_extractor_class=CustomCNN,
                     features_extractor_kwargs=dict(features_dim=features_dim),
                     net_arch=dict(pi=pi, vf=vf),
                     optimizer_kwargs=optimizer_kwargs
                     )
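
The extractor measures its flattened output size with a dummy forward pass, but the same number can be worked out by hand, which is useful when choosing strides. For a convolution with kernel size `k`, stride `s`, and padding `p`, an input side of `n` pixels becomes `floor((n + 2p - k) / s) + 1`. The sketch below applies this to the four layers above; the 64x64 input is a hypothetical size chosen for illustration, not the actual RoboFeeder observation shape.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Spatial output size of a 2D convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


strides = [2, 2, 2, 3]   # strides of the four Conv2d layers in CustomCNN
last_channels = 64       # output channels of the final Conv2d layer

side = 64                # hypothetical square input, 64x64 pixels
for s in strides:
    side = conv_out(side, stride=s)

flat_dim = last_channels * side * side
print(side, flat_dim)    # 3 576
```

Under that assumption the final linear layer would map 576 values to `features_dim`.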


### Step 6 : Initialize PPO Model

This cell initializes the Proximal Policy Optimization (PPO) model with the `CnnPolicy` class and the `policy_kwargs` defined earlier, so the custom CNN serves as the feature extractor. Each rollout collects `n_steps` transitions per environment, and `batch_size` is set to `n_steps * num_cpu` so that a single minibatch covers the whole rollout. The remaining arguments set the number of epochs, learning rate, clipping range, entropy coefficient, random seed, and TensorBoard log directory.


In [None]:
n_steps = 3

model = PPO(
    "CnnPolicy",
    env,
    n_steps=n_steps,
    batch_size=n_steps * num_cpu,
    n_epochs=20,
    learning_rate=0.003,
    clip_range=0.3,
    #gamma=0.95,
    ent_coef=0.01,
    #vf_coef=0.5,
    #max_grad_norm=.5,
    verbose=0,
    seed=123,
    tensorboard_log=".",
    policy_kwargs=policy_kwargs,
)
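
The relationship between `n_steps`, the number of environments, and `batch_size` can be checked with a few lines of arithmetic. The sketch below reuses the values from this notebook and relies on how PPO in Stable Baselines3 sizes its rollout buffer (`n_steps` transitions from each environment).

```python
n_steps = 3
num_cpu = 2
batch_size = n_steps * num_cpu
n_epochs = 20

rollout_size = n_steps * num_cpu                      # transitions collected before each update
minibatches_per_epoch = rollout_size // batch_size    # full minibatches per pass over the rollout
gradient_steps_per_update = n_epochs * minibatches_per_epoch

print(rollout_size, minibatches_per_epoch, gradient_steps_per_update)  # 6 1 20
```

With these settings every policy update performs 20 gradient steps on the same 6 transitions, which is a very small sample; larger `n_steps` values are the usual choice for real training runs.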

### Step 7 : Train the PPO Model

This cell trains the PPO model on the vectorized RoboFeeder environment by calling `learn` with `total_timesteps=100`, a very small budget that is enough to confirm the pipeline runs end to end. The `reset_num_timesteps=False` argument keeps the model's existing timestep counter instead of resetting it to zero, which matters when `learn` is called more than once, and `progress_bar=True` displays a progress bar during training.

In [7]:
model.learn(total_timesteps=100, reset_num_timesteps=False, progress_bar=True)

<stable_baselines3.ppo.ppo.PPO at 0x7052cb777910>
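
To get a feel for how much training the call above performs, the number of rollouts can be estimated from the timestep budget. The sketch assumes a freshly created model (timestep counter at zero) and that training proceeds in whole rollouts of `n_steps * num_cpu` transitions until the counter reaches `total_timesteps`, which is how on-policy algorithms in Stable Baselines3 iterate.

```python
import math

total_timesteps = 100
rollout_size = 3 * 2  # n_steps * num_cpu from the earlier cells

n_rollouts = math.ceil(total_timesteps / rollout_size)  # rollouts (and policy updates) performed
timesteps_run = n_rollouts * rollout_size               # actual environment steps taken

print(n_rollouts, timesteps_run)  # 17 102
```

The run therefore slightly overshoots the requested budget, because the last rollout is always completed.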