# Reinforcement learning for legged robots

In this notebook, 

## Setup

Before we start, you will need to update your conda environment to use Gymnasium (maintained) rather than OpenAI Gym (discontinued). You can simply run:

```
conda activate robotics-mva
conda install gymnasium tensorboard
```

Import Gymnasium to check that everything is working:

In [23]:
import gymnasium as gym
import numpy as np

# Inverted pendulum environment



In [None]:
with gym.make("InvertedPendulum-v4", render_mode="human") as env:
    action = 0.0 * env.action_space.sample()
    observation, _ = env.reset()
    for step in range(500):
        observation, reward, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            observation, _ = env.reset()

# Bonus: train for a real robot

In this section we follow the same training pipeline but with the open source software of [Upkie](https://hackaday.io/project/185729-upkie-wheeled-biped-robots). It is completely optional and will only work on Linux or macOS.

## Setup

<img src="https://user-images.githubusercontent.com/1189580/170496331-e1293dd3-b50c-40ee-9c2e-f75f3096ebd8.png" style="height: 100px" align="right" />

First, make sure you have a C++ compiler (setup one-liners: [Fedora](https://github.com/upkie/upkie/discussions/100), [Ubuntu](https://github.com/upkie/upkie/discussions/101)). You can run an Upkie simulation right from the command line. It won't install anything on your machine, everything will run locally from the repository:

```console
git clone https://github.com/upkie/upkie.git
cd upkie
./start_simulation.sh
```

We will use the Python API of the robot to test things from this notebook, or from custom scripts. Install it from PyPI in your Conda environment:

```
pip install upkie
```

## Stepping the environment

If everything worked well, you should be able to step an environment as follows:

In [30]:
import gymnasium as gymW
import upkie.envs

upkie.envs.register()

episode_return = 0.0
with gym.make("UpkieGroundVelocity-v1", frequency=200.0) as env:
    observation, _ = env.reset()  # connects to the spine (simulator or real robot)
    action = 0.0 * env.action_space.sample()
    for step in range(1000):
        pitch = observation[0]
        action[0] = 10.0 * pitch  # 1D action: [ground_velocity]
        observation, reward, terminated, truncated, _ = env.step(action)
        episode_return += reward
        if terminated or truncated:
            observation, _ = env.reset()

print(f"We have stepped the environment {step + 1} times")
print(f"The return of our episode is {episode_return}")



We have stepped the environment 1000 times
The return of our episode is 1000.0


(If you see a message "Waiting for spine /vulp to start", it means the simulation is not running.)

We can double-check the last observation from the episode:

In [32]:
def report_last_observation(observation):
    print("The last observation of the episode is:")
    print(f"- Pitch from torso to world: {observation[0]:.2} rad")
    print(f"- Ground position: {observation[1]:.2} m")
    print(f"- Angular velocity from torso to world in torso: {observation[2]:.2} rad/s")
    print(f"- Ground velocity: {observation[3]:.2} m/s")
    
report_last_observation(observation)

The last observation of the episode is:
- Pitch from torso to world: -0.061 rad
- Ground position: -0.71 m
- Angular velocity from torso to world in torso: -0.042 rad/s
- Ground velocity: -0.52 m/s


## Question B1: PID balancer

Adapt your code from Question 1 to this environment:

In [28]:
def policy_b1(observation):
    return np.array([0.0])  # replace with your solution


def run(policy, nb_steps: int):
    episode_return = 0.0
    with gym.make("UpkieGroundVelocity-v1", frequency=200.0) as env:
        observation, _ = env.reset()  # connects to the spine (simulator or real robot)
        for step in range(nb_steps):
            action = policy_b1(observation)
            observation, reward, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                print("Fall detected!")
                return episode_return
    report_last_observation(observation)
    return episode_return


episode_return = run(policy_b1, 1000)
print(f"The return of our episode is {episode_return}")



Fall detected!
The return of our episode is 0.0


## Training a new policy

The Upkie repository ships three agents based on PID control, model predictive control and reinforcement learning. We now focus on the latter, called the "PPO balancer".

Check that you can run the training part by running, from the root of the repository:

```
./tools/bazel run //agents/ppo_balancer:train -- --nb-envs 1 --show
```

A simulation window should pop, and verbose output from SB3 should be printed to your terminal.

By default, training data will be logged to `/tmp`. You can select a different output path by setting the `UPKIE_TRAINING_PATH` environment variable in your shell. For instance:

```
export UPKIE_TRAINING_PATH="${HOME}/src/upkie/training"
```

Run TensorBoard from the training directory:

```
tensorboard --logdir ${UPKIE_TRAINING_PATH}  # or /tmp if you keep the default
```

### Selecting the number of processes

We can increase the number of parallel CPU environments ``--nb-envs`` to a value suitable to your computer. Let training run for a minute and check `time/fps`. Increase the number of environments and compare the stationary regime of `time/fps`. You should see a performance increase when adding the first few environments, followed by a declined when there are two many parallel processes compared to your number of CPU cores. Pick the value that works best for you.