<a href="https://colab.research.google.com/github/LeograndeCode/AW1/blob/main/colab_template/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install and load all dependencies (first time only) \
NOTE: you may need to restart the runtime afterwards (CTRL+M .).

In [1]:
!apt-get install -y \
    libgl1-mesa-dev \
    libgl1-mesa-glx \
    libglew-dev \
    libosmesa6-dev \
    software-properties-common

!apt-get install -y patchelf

!pip install gym
!pip install free-mujoco-py

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
software-properties-common is already the newest version (0.99.22.9).
The following additional packages will be installed:
  libegl-dev libgl-dev libgles-dev libgles1 libglu1-mesa libglu1-mesa-dev libglvnd-core-dev
  libglvnd-dev libglx-dev libopengl-dev libosmesa6
The following NEW packages will be installed:
  libegl-dev libgl-dev libgl1-mesa-dev libgl1-mesa-glx libgles-dev libgles1 libglew-dev
  libglu1-mesa libglu1-mesa-dev libglvnd-core-dev libglvnd-dev libglx-dev libopengl-dev libosmesa6
  libosmesa6-dev
0 upgraded, 15 newly installed, 0 to remove and 49 not upgraded.
Need to get 4,013 kB of archives.
After this operation, 19.4 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 libglx-dev amd64 1.4.0-1 [14.1 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 libgl-dev amd64 1.4.0-1 [101 kB]
Get:3 http://archive.ubuntu.com/ubuntu 

Set up the custom Hopper environment and provided util functions



1.   Upload `custom_hopper.zip` to the current session's file storage
2.   Un-zip it by running cell below


In [1]:
!unzip custom_hopper.zip

Archive:  custom_hopper.zip
   creating: env/
  inflating: env/__init__.py         
  inflating: env/custom_hopper.py    
  inflating: env/mujoco_env.py       
   creating: env/assets/
  inflating: env/assets/hopper.xml   




---




**Test a random policy on the Gym Hopper environment**

\



Play around with this code to get familiar with the
Hopper environment.

For example, what happens if you don't reset the environment
even after the episode is over?
When exactly is the episode over?
What is an action here?

In [7]:
!pip install git+https://github.com/DLR-RM/stable-baselines3.git
!pip install 'shimmy>=2.0' # Install shimmy for Gym compatibility

  and should_run_async(code)


Collecting git+https://github.com/DLR-RM/stable-baselines3.git
  Cloning https://github.com/DLR-RM/stable-baselines3.git to /tmp/pip-req-build-elv_xtjf
  Running command git clone --filter=blob:none --quiet https://github.com/DLR-RM/stable-baselines3.git /tmp/pip-req-build-elv_xtjf
  Resolved https://github.com/DLR-RM/stable-baselines3.git to commit ee8a77defb0ea8c02d3f1096ea24aa3556452030
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting gymnasium<1.1.0,>=0.29.1 (from stable_baselines3==2.5.0)
  Downloading gymnasium-1.0.0-py3-none-any.whl.metadata (9.5 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium<1.1.0,>=0.29.1->stable_baselines3==2.5.0)
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl.metadata (558 bytes)
Downloading gymnasium-1.0.0-py3-none-any.whl (958 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m958.1/958.1

# Training

In [2]:
import gym
import torch
import numpy as np
from stable_baselines3 import PPO, SAC
from stable_baselines3.common.evaluation import evaluate_policy
from env.custom_hopper import CustomHopper

  from jax import xla_computation as _xla_computation
  from distutils.dep_util import newer, newer_group


In [None]:


class BalanceHopperEnv(CustomHopper):
    def __init__(self):
        super().__init__(domain='source')

    def step(self, action):
        # Perform action in environment
        obs, _, done, info = super().step(action)
        # Reward for maintaining balance
        reward = -np.abs(self.sim.data.qpos[2])  # Reward for keeping torso upright
        return obs, reward, done, info
class ThrustHopperEnv(CustomHopper):
    def __init__(self):
        super().__init__(domain='source')

    def step(self, action):
        # Perform action in environment
        obs, _, done, info = super().step(action)
        # Reward for achieving thrust/jumping
        reward = self.sim.data.qpos[1]  # Reward based on height or distance
        return obs, reward, done, info
class LandingHopperEnv(CustomHopper):
    def __init__(self):
        super().__init__(domain='source')

    def step(self, action):
        # Perform action in environment
        obs, _, done, info = super().step(action)
        # Reward based on landing impact
        reward = -np.square(self.sim.data.qvel[0])  # Penalize fast landing velocity
        return obs, reward, done, info


class HierarchicalHopperEnv(gym.Env):
    def __init__(self, env, balance_policy, thrust_policy, landing_policy):
        self.env = env
        self.balance_policy = balance_policy
        self.thrust_policy = thrust_policy
        self.landing_policy = landing_policy
        self.action_space = gym.spaces.Discrete(3)  # 3 actions: Balance, Thrust, Landing
        self.observation_space = env.observation_space

    def step(self, action):
        """Selects a policy based on high-level decision."""
        if action == 0:
            chosen_policy = self.balance_policy
        elif action == 1:
            chosen_policy = self.thrust_policy
        else:
            chosen_policy = self.landing_policy

        # Use the policy to generate an action
        obs = self.env.get_obs()  # Corrected observation retrieval
        action_from_policy, _ = chosen_policy.predict(obs, deterministic=True)
        obs, reward, done, info = self.env.step(action_from_policy)

        return obs, reward, done, info

    def reset(self):
        return self.env.reset()

def train_low_level_policy(env, policy_name, device, total_timesteps=200000 ):
    """Trains and saves a SAC policy."""
    model = SAC("MlpPolicy", env, verbose=1, device = device)
    model.learn(total_timesteps=total_timesteps)
    model.save(policy_name)
    return model

def main():
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    # Create randomized source environment
    env = gym.make("CustomHopper-source-v0")

    # Train Low-Level Policies (SAC)
    balance_policy = train_low_level_policy(BalanceHopperEnv(), "balance_policy", device)
    thrust_policy = train_low_level_policy(ThrustHopperEnv(), "thrust_policy", device)
    landing_policy = train_low_level_policy(LandingHopperEnv(), "landing_policy", device)


    # Load trained low-level policies
    balance_policy = SAC.load("balance_policy", device=device)
    thrust_policy = SAC.load("thrust_policy", device=device)
    landing_policy = SAC.load("landing_policy", device=device)


    # Create Hierarchical Environment
    hierarchical_env = HierarchicalHopperEnv(env, balance_policy, thrust_policy, landing_policy)

    # Train High-Level Policy (PPO)
    high_level_policy = PPO("MlpPolicy", hierarchical_env, verbose=1, device=device)
    high_level_policy.learn(total_timesteps=200000)
    high_level_policy.save("high_level_policy")



if __name__ == "__main__":
    main()


Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 17.2     |
|    ep_rew_mean     | -1.32    |
| time/              |          |
|    episodes        | 4        |
|    fps             | 1808     |
|    time_elapsed    | 0        |
|    total_timesteps | 69       |
---------------------------------


  deprecation(
  deprecation(


---------------------------------
| rollout/           |          |
|    ep_len_mean     | 19       |
|    ep_rew_mean     | -1.48    |
| time/              |          |
|    episodes        | 8        |
|    fps             | 153      |
|    time_elapsed    | 0        |
|    total_timesteps | 152      |
| train/             |          |
|    actor_loss      | -3.84    |
|    critic_loss     | 0.646    |
|    ent_coef        | 0.985    |
|    ent_coef_loss   | -0.0754  |
|    learning_rate   | 0.0003   |
|    n_updates       | 51       |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 31.6     |
|    ep_rew_mean     | -2.63    |
| time/              |          |
|    episodes        | 12       |
|    fps             | 100      |
|    time_elapsed    | 3        |
|    total_timesteps | 379      |
| train/             |          |
|    actor_loss      | -5.26    |
|    critic_loss     | 0.264    |
|    ent_coef 

#Evaluation


In [3]:
# Sim-to-Real Transfer Testing
device = 'cuda'
target_env = gym.make("CustomHopper-target-v0")
high_level_policy = PPO.load("high_level_policy", device=device)
mean_reward, std_reward = evaluate_policy(high_level_policy, target_env, n_eval_episodes=50)
print(f"Sim-to-Real Transfer Results: Mean Reward = {mean_reward} ± {std_reward}")

  deprecation(
  deprecation(
  logger.warn(
  logger.warn(
  logger.warn(
  logger.deprecation(
  if not isinstance(done, (bool, np.bool8)):


Sim-to-Real Transfer Results: Mean Reward = 39.45588179826736 ± 0.993433031804001
