# Robotics Capstone: Teaching a Robot to Walk

The grand finale! We'll use everything we've learned to train a simulated robot to walk.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                          THE GOAL: LOCOMOTION                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│     Start: Robot falls over          End: Robot walks forward!              │
│                                                                             │
│         ┌───┐                              ┌───┐                            │
│        /│   │\                            /│   │\                           │
│       X │   │ X                          / │   │ \    --> direction         │
│         └───┘                           /  └───┘  \                         │
│           │                            /     │     \                        │
│          ═══                        ──┘      │      └──                     │
│         (oops)                     /         │         \                    │
│                                ───┘          │          └───                │
│                                           (walking!)                        │
│                                                                             │
│  Using: Ray + RLlib + PPO + MuJoCo                                          │
└─────────────────────────────────────────────────────────────────────────────┘
```

## What We'll Cover

1. **MuJoCo Physics** - Industry-standard robot simulation
2. **Continuous Control** - Real-valued joint torques (not discrete actions)
3. **PPO Algorithm** - A widely used, stable baseline for locomotion
4. **Training at Scale** - Multiple parallel environments
5. **Evaluation & Recording** - Watch your robot walk!

## Setup

```bash
pip install "ray[rllib]" "gymnasium[mujoco]" mujoco torch numpy matplotlib
```

In [None]:
# Suppress warnings
import warnings
import logging
warnings.filterwarnings("ignore")
logging.getLogger("ray").setLevel(logging.ERROR)

import ray
from ray.rllib.algorithms.ppo import PPOConfig
import gymnasium as gym
import numpy as np
import time

# Initialize Ray
ray.init(
    num_cpus=4,  # Cap Ray at 4 CPUs (e.g., on an M1 laptop)
    object_store_memory=1 * 1024 * 1024 * 1024,  # 1GB object store
    ignore_reinit_error=True,
)
print(f"Ray initialized with {ray.cluster_resources()}")

## 1. Understanding the Ant Environment

The MuJoCo Ant is a 4-legged robot with 8 actuated joints:

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                             ANT ANATOMY                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│                              TORSO                                          │
│                             ┌─────┐                                         │
│                            /│     │\                                        │
│           Hip joint ─────>/ │     │ \<───── Hip joint                       │
│                          /  │     │  \                                      │
│                         /   └─────┘   \                                     │
│        Ankle joint ───>/       │       \<─── Ankle joint                    │
│                       /        │        \                                   │
│                   ───┘         │         └───                               │
│                  /             │             \                               │
│             Foot               │              Foot                          │
│                                │                                            │
│                          (same for back legs)                               │
│                                                                             │
│  Observation (27 dims):                                                     │
│  - Torso height (z), orientation, and velocity                              │
│  - Joint angles and angular velocities                                      │
│  - Contact forces: off by default (use_contact_forces=True -> 111 dims)     │
│                                                                             │
│  Action (8 dims):                                                           │
│  - Torque for each of the 8 joints (continuous, range [-1, 1])              │
│                                                                             │
│  Reward:                                                                    │
│  - Forward velocity (move in +x direction)                                  │
│  - Alive bonus (don't fall over)                                            │
│  - Control cost penalty (don't waste energy)                                │
│  - Contact cost penalty (only with use_contact_forces=True)                 │
└─────────────────────────────────────────────────────────────────────────────┘
```
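The reward terms in the box can be combined as in the following minimal sketch. It is not Gymnasium's actual implementation: the function name `ant_step_reward` is hypothetical, and the weights (`healthy_reward=1.0`, `ctrl_cost_weight=0.5`, `dt=0.05`) are assumed to match the Ant-v4 defaults, so check them against your installed version. The contact cost is left out to keep the example short.

```python
def ant_step_reward(x_before, x_after, action,
                    dt=0.05, healthy_reward=1.0, ctrl_cost_weight=0.5,
                    is_healthy=True):
    """Illustrative per-step reward: forward progress + alive bonus - control cost."""
    forward_reward = (x_after - x_before) / dt            # velocity along +x
    alive_bonus = healthy_reward if is_healthy else 0.0   # paid while upright
    ctrl_cost = ctrl_cost_weight * sum(a * a for a in action)
    return forward_reward + alive_bonus - ctrl_cost


# Same forward progress (0.01 m in one step), different torque magnitudes
small = ant_step_reward(0.0, 0.01, [0.1] * 8)   # gentle torques on all 8 joints
full = ant_step_reward(0.0, 0.01, [1.0] * 8)    # maximum torques on all 8 joints
print(f"small torques: {small:.2f}, full torques: {full:.2f}")
```

Under these assumed weights, large torques cost more than the alive bonus pays, which is one reason a purely random policy tends to score poorly.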

In [None]:
# Let's explore the environment
env = gym.make("Ant-v4")

print("=" * 60)
print("ANT ENVIRONMENT")
print("=" * 60)
print(f"Observation space: {env.observation_space}")
print(f"  - Shape: {env.observation_space.shape}")
print(f"  - Range: [{env.observation_space.low.min():.1f}, {env.observation_space.high.max():.1f}]")
print()
print(f"Action space: {env.action_space}")
print(f"  - Shape: {env.action_space.shape}")
print(f"  - Range: [{env.action_space.low.min():.1f}, {env.action_space.high.max():.1f}]")
print()
print("This is CONTINUOUS control - actions are real numbers, not discrete choices!")

env.close()
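
With a continuous action space, a policy typically outputs a mean and a standard deviation per joint and samples the torques from that distribution. A minimal sketch, assuming a diagonal Gaussian policy and simple clipping to the action bounds; `sample_action` and the numbers used here are illustrative, not RLlib's internals:

```python
import math
import random


def sample_action(means, log_stds, low=-1.0, high=1.0, rng=random):
    """Sample one torque per joint from a diagonal Gaussian, then clip to bounds."""
    action = []
    for mean, log_std in zip(means, log_stds):
        raw = rng.gauss(mean, math.exp(log_std))   # unbounded real number
        action.append(min(high, max(low, raw)))    # keep torque in [low, high]
    return action


rng = random.Random(0)        # fixed seed so the example is repeatable
means = [0.0] * 8             # one mean per actuated joint
log_stds = [-0.5] * 8         # one log standard deviation per joint
action = sample_action(means, log_stds, rng=rng)
print([round(a, 2) for a in action])
```

A discrete-action policy would instead pick one index out of a fixed set; here every joint gets its own real-valued torque.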

In [None]:
# Watch random actions (the robot flails and makes no real progress)
env = gym.make("Ant-v4", render_mode="human")

obs, info = env.reset()
total_reward = 0.0

print("Random policy (no learning yet):")
for step in range(200):
    action = env.action_space.sample()  # Random joint torques
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward

    if terminated or truncated:
        print(f"  Episode ended at step {step}")
        break

env.close()
print(f"\nRandom policy reward: {total_reward:.1f} (this is bad!)")

## 2. Configure PPO for Locomotion

PPO (Proximal Policy Optimization) is a common first choice for continuous control:

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      WHY PPO FOR LOCOMOTION?                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  1. STABLE: Clipped objective prevents destructive large updates            │
│                                                                             │
│     Policy Update                                                           │
│     ─────────────                                                           │
│        │                                                                    │
│        │  Too big? ──> Clip it!                                             │
│        │  ┌─────────────────┐                                               │
│        └─>│ max update = ε  │  (ε ≈ 0.2)                                    │
│           └─────────────────┘                                               │
│                                                                             │
│  2. SAMPLE EFFICIENT: Multiple epochs on same batch                         │
│                                                                             │
│     Collect ──> Train ──> Train ──> Train ──> Collect again                 │
│      data       epoch1    epoch2    epoch3     new data                     │
│                                                                             │
│  3. PARALLELIZABLE: Many workers collect experience simultaneously          │
│                                                                             │
│     ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐                         │
│     │Worker 1│  │Worker 2│  │Worker 3│  │Worker 4│                         │
│     │  Ant   │  │  Ant   │  │  Ant   │  │  Ant   │                         │
│     └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘                         │
│         └───────────┴─────┬─────┴───────────┘                               │
│                           v                                                 │
│                    ┌─────────────┐                                          │
│                    │   Trainer   │                                          │
│                    │  (updates)  │                                          │
│                    └─────────────┘                                          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
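The clipping idea from point 1 can be written down for a single sample. A minimal sketch of the PPO clipped surrogate objective, where `ratio` is the new policy's probability of the action divided by the old policy's, and `advantage` says how much better the action was than expected; the function name `clipped_surrogate` is illustrative:

```python
def clipped_surrogate(ratio, advantage, clip_param=0.2):
    """PPO clipped objective for one sample (the optimizer maximizes this)."""
    clipped_ratio = max(1.0 - clip_param, min(1.0 + clip_param, ratio))
    # Taking the minimum removes any incentive to push the ratio
    # outside [1 - clip_param, 1 + clip_param].
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, policy moved too far: the gain is capped at 1.2 * advantage
print(f"{clipped_surrogate(ratio=1.5, advantage=2.0):.2f}")   # 2.40
# Good action, small policy change: no clipping applies
print(f"{clipped_surrogate(ratio=1.1, advantage=2.0):.2f}")   # 2.20
# Bad action, probability already cut a lot: the bound 0.8 * advantage applies
print(f"{clipped_surrogate(ratio=0.5, advantage=-1.0):.2f}")  # -0.80
```

In training, this value is averaged over a minibatch and combined with value-function and entropy terms, which this sketch omits.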

In [None]:
# Configure PPO for the Ant
HIDDEN_LAYERS = [256, 256]  # Two hidden layers

config = (
    PPOConfig()
    .environment("Ant-v4")
    .framework("torch")

    # Parallel rollouts - more ants = faster learning!
    .env_runners(
        num_env_runners=4,           # 4 parallel workers
        num_envs_per_env_runner=1,   # 1 env per worker
    )

    # PPO hyperparameters tuned for locomotion
    .training(
        lr=3e-4,                     # Learning rate
        gamma=0.99,                  # Discount factor (long-term thinking)
        lambda_=0.95,                # GAE parameter
        clip_param=0.2,              # PPO clipping
        train_batch_size=4000,       # Samples per training batch
        minibatch_size=256,          # Minibatch size (formerly sgd_minibatch_size)
        num_epochs=10,               # Epochs per batch (formerly num_sgd_iter)
    )

    # Neural network architecture (set via rl_module on the new API stack)
    .rl_module(
        model_config={
            "fcnet_hiddens": HIDDEN_LAYERS,
            "fcnet_activation": "tanh",   # Tanh works well for locomotion
        },
    )

    # Learner resources (replaces resources(num_gpus=...) on the new API stack)
    .learners(
        num_gpus_per_learner=0,  # Set to 1 if you have a GPU
    )
)

print("PPO Configuration:")
print(f"  - Workers: {config.num_env_runners}")
print(f"  - Batch size: {config.train_batch_size}")
print(f"  - Learning rate: {config.lr}")
print(f"  - Network: {HIDDEN_LAYERS}")
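
The `gamma` and `lambda_` settings in the config feed into Generalized Advantage Estimation (GAE), which turns rewards and value predictions into the advantages PPO trains on. A minimal sketch for one trajectory segment with no episode end inside it; `compute_gae` is an illustrative helper, not RLlib's implementation:

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lambda_=0.95):
    """Generalized Advantage Estimation over one trajectory segment.

    rewards[t] and values[t] belong to step t; last_value is the value
    estimate for the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD error
        running = delta + gamma * lambda_ * running           # discounted sum of TD errors
        advantages[t] = running
        next_value = values[t]
    return advantages


adv = compute_gae(rewards=[1.0, 1.0, 1.0], values=[0.5, 0.5, 0.5], last_value=0.0)
print([round(a, 4) for a in adv])
```

With `lambda_=0` each advantage reduces to the one-step TD error (low variance, more bias); values closer to 1 look further ahead at the cost of higher variance.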

In [None]:
# Build the algorithm
algo = config.build_algo()  # Use new API
print("PPO algorithm built successfully!")

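## 3. Train the Robot!

The timeline below is a rough guide, not a guarantee: returns vary with the random seed, library versions, and hyperparameters, and reaching the later stages can take more than the 100 iterations used here.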

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         TRAINING PROGRESS                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Iteration 0-10:    Robot falls over immediately                            │
│                     Reward: ~0-500                                          │
│                                                                             │
│  Iteration 10-50:   Robot learns to stay upright                            │
│                     Starts twitching forward                                │
│                     Reward: ~500-1500                                       │
│                                                                             │
│  Iteration 50-100:  Robot develops a walking gait                           │
│                     Movement becomes smoother                               │
│                     Reward: ~1500-3000                                      │
│                                                                             │
│  Iteration 100+:    Robot walks efficiently                                 │
│                     Stable, fast locomotion                                 │
│                     Reward: ~3000-5000+                                     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
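To put the iteration counts in perspective, it helps to convert them into environment steps and gradient updates. A back-of-the-envelope sketch using this tutorial's settings (100 iterations, batch size 4000, minibatch size 256, 10 epochs per batch); `training_budget` is a hypothetical helper, and RLlib's exact minibatch handling may differ slightly:

```python
import math


def training_budget(num_iterations, train_batch_size, minibatch_size, num_epochs):
    """Return (total env steps, updates per iteration, total gradient updates)."""
    env_steps = num_iterations * train_batch_size
    minibatches_per_epoch = math.ceil(train_batch_size / minibatch_size)
    updates_per_iteration = minibatches_per_epoch * num_epochs
    return env_steps, updates_per_iteration, updates_per_iteration * num_iterations


env_steps, per_iter, total_updates = training_budget(100, 4000, 256, 10)
print(f"Environment steps:      {env_steps:,}")      # 400,000
print(f"Updates per iteration:  {per_iter}")         # 160
print(f"Total gradient updates: {total_updates:,}")  # 16,000
```

Doubling `NUM_ITERATIONS` doubles both the environment steps and the gradient updates, which is why longer runs are the simplest way to get a better gait.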

In [None]:
# Training loop
NUM_ITERATIONS = 100  # Increase for better results (200+ recommended)

print("=" * 70)
print("TRAINING THE ANT TO WALK")
print("=" * 70)
print(f"{'Iter':>5} | {'Reward (mean)':>15} | {'Reward (max)':>12} | {'Episodes':>10}")
print("-" * 70)

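best_reward = -float('inf')
checkpoint = None  # Stays None if no iteration reports a finished episode
rewards_history = []

for i in range(NUM_ITERATIONS):
    result = algo.train()

    # Use new API key names; early iterations may not have finished an episode yet
    env_metrics = result.get('env_runners', {})
    mean_reward = env_metrics.get('episode_return_mean', float('nan'))
    max_reward = env_metrics.get('episode_return_max', float('nan'))
    episodes = env_metrics.get('num_episodes', 0)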
    
    rewards_history.append(mean_reward)
    
    # Track best
    if mean_reward > best_reward:
        best_reward = mean_reward
        checkpoint = algo.save()
        marker = " *NEW BEST*"
    else:
        marker = ""
    
    # Print progress every 10 iterations
    if (i + 1) % 10 == 0 or i == 0:
        print(f"{i+1:>5} | {mean_reward:>15.1f} | {max_reward:>12.1f} | {episodes:>10}{marker}")

print("-" * 70)
print(f"Training complete! Best mean reward: {best_reward:.1f}")
print(f"Best checkpoint saved to: {checkpoint}")

In [None]:
# Plot training progress
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(rewards_history)
plt.xlabel('Iteration')
plt.ylabel('Mean Episode Reward')
plt.title('Training Progress')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
# Smoothed version
window = 10
smoothed = np.convolve(rewards_history, np.ones(window)/window, mode='valid')
plt.plot(smoothed)
plt.xlabel('Iteration')
plt.ylabel('Mean Episode Reward (smoothed)')
plt.title(f'Training Progress (smoothed, window={window})')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 4. Watch Your Robot Walk!

The moment of truth - let's see if our training paid off!

In [None]:
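# Evaluate the trained policy
import torch

env = gym.make("Ant-v4", render_mode="human")
module = algo.get_module()  # Trained RLModule (new API stack)
action_dim = env.action_space.shape[0]

print("=" * 60)
print("EVALUATING TRAINED POLICY")
print("=" * 60)

num_episodes = 3
total_rewards = []

for episode in range(num_episodes):
    obs, info = env.reset()
    episode_reward = 0
    steps = 0

    while True:
        # Get action from trained policy. compute_single_action() is not
        # available on the new API stack, so query the RLModule directly:
        # batch the observation (B=1) and use the Gaussian mean as the action.
        obs_batch = torch.from_numpy(obs).float().unsqueeze(0)
        with torch.no_grad():
            outputs = module.forward_inference({"obs": obs_batch})
        dist_inputs = outputs["action_dist_inputs"][0].numpy()  # [means, log-stds]
        action = np.clip(dist_inputs[:action_dim], env.action_space.low, env.action_space.high)
        obs, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        steps += 1

        if terminated or truncated:
            break

    total_rewards.append(episode_reward)
    print(f"Episode {episode + 1}: Reward = {episode_reward:.1f}, Steps = {steps}")

env.close()

print("-" * 60)
print(f"Average reward over {num_episodes} episodes: {np.mean(total_rewards):.1f}")
print("Compare to the random-policy reward printed earlier")
print("\nIf the average is well above the random baseline, the robot learned to walk!")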

## 5. What Did the Robot Learn?

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      EMERGENT WALKING BEHAVIOR                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Trained policies often show patterns like these (yours may differ):        │
│                                                                             │
│  1. BALANCE: Keep center of mass over support polygon                       │
│                                                                             │
│     Good: ┌───┐           Bad:    ┌───┐                                     │
│          / COM \                   │COM│ <- falling!                        │
│         /   │   \                  │   │                                    │
│        ─────┼─────                 └───┴──                                  │
│             │                        /                                      │
│       ══════╬══════              ═══╱                                       │
│        (stable)                  (unstable)                                 │
│                                                                             │
│  2. GAIT PATTERN: Alternating diagonal legs (like a trotting dog)           │
│                                                                             │
│     Phase 1:    Phase 2:                                                    │
│     X───X       ───X                                                        │
│     │   │       X   X                                                       │
│     ───X       X───                                                         │
│                                                                             │
│  3. ENERGY EFFICIENCY: Smooth, coordinated movements                        │
│     (penalized by control cost in reward)                                   │
│                                                                             │
│  4. FORWARD MOMENTUM: Lean slightly forward to initiate movement            │
│                                                                             │
│  The network learned all of this from scratch - no human demonstration!     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
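The balance principle can be checked geometrically: project the center of mass (COM) onto the ground and test whether it falls inside the polygon spanned by the feet that are touching down. A minimal sketch of that static test, assuming a convex support polygon with feet listed counter-clockwise; `is_statically_stable` and the coordinates are illustrative, and the trained policy never computes this explicitly:

```python
def is_statically_stable(com_xy, feet_xy):
    """True if the COM's ground projection lies inside the support polygon.

    feet_xy: ground-contact points of a convex polygon, in counter-clockwise order.
    """
    n = len(feet_xy)
    for i in range(n):
        x1, y1 = feet_xy[i]
        x2, y2 = feet_xy[(i + 1) % n]
        # Cross product sign tells which side of the edge the COM is on
        cross = (x2 - x1) * (com_xy[1] - y1) - (y2 - y1) * (com_xy[0] - x1)
        if cross < 0:   # COM is to the right of this edge, so it is outside
            return False
    return True


four_feet = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
three_feet = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0)]   # one leg lifted

print(is_statically_stable((0.0, 0.0), four_feet))     # True: COM is centered
print(is_statically_stable((1.5, 0.0), four_feet))     # False: COM beyond the feet
print(is_statically_stable((-0.5, 0.5), three_feet))   # False: COM over the lifted leg
```

Walking is dynamic, so a real gait can pass through statically unstable poses briefly; the test only captures the standing-still intuition from the diagram.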

## 6. Next Steps

Congratulations! You've completed the tutorial series. Here's what you can try next:

### More Robots
```python
# Try other MuJoCo environments
"HalfCheetah-v4"    # 2D running robot (easier)
"Hopper-v4"         # 1-legged hopping robot
"Walker2d-v4"       # 2D bipedal walker
"Humanoid-v4"       # 3D humanoid (hardest!)
```

### Improve Training
- **More iterations**: 200-500 for better policies
- **More workers**: Scale to 8-16 for faster training
- **GPU training**: Give the learner a GPU (`num_gpus_per_learner=1` via `.learners(...)` on RLlib's new API stack) for faster updates
- **Ray Tune**: Hyperparameter search for optimal performance

### Advanced Topics
- **Curriculum learning**: Start with easier tasks
- **Domain randomization**: Vary physics for robustness
- **Sim-to-real**: Transfer to real robots
- **Multi-agent**: Multiple robots cooperating

In [None]:
# Cleanup
algo.stop()
ray.shutdown()

print("\n" + "=" * 60)
print("TUTORIAL COMPLETE!")
print("=" * 60)
print("""
You've learned:

  [x] Ray Core: Tasks, Actors, Object Store
  [x] RL Fundamentals: MDPs, Q-learning, Policy Gradients
  [x] RLlib: PPO, SAC, DQN algorithms
  [x] Custom Environments: Gymnasium interface
  [x] Training at Scale: Distributed workers
  [x] Robotics: Continuous control with PPO

You trained a robot to walk from scratch using reinforcement learning!

Next: Try Humanoid-v4 for the ultimate challenge.
""")