# Reinforcement learning for legged robots

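In this notebook, we balance an inverted pendulum in Gymnasium with a PID controller. An optional bonus section then applies the same tools to the open source software of the Upkie wheeled biped.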

## Setup

Before we start, you will need to update your conda environment to use Gymnasium (maintained) rather than OpenAI Gym (discontinued). You can simply run:

```
conda activate robotics-mva
conda install gymnasium imageio mujoco=2.3.7 tensorboard
```

Import Gymnasium to check that everything is working:

In [None]:
import gymnasium as gym

Let's import the usual suspects as well:

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.ion()

# Inverted pendulum environment

The inverted pendulum model is not just a toy model reproducing the properties of real robot models for balancing: as it turns out, the inverted pendulum appears in the dynamics of *any* mobile robot, that is, a model with a floating-base joint at the root of the kinematic tree. (If you are curious: the inverted pendulum is a limit case of the [Newton-Euler equations](https://scaron.info/robotics/newton-euler-equations.html) corresponding to floating-base coordinates in the equations of motion $M \ddot{q} + h = S^T \tau + J_c^T f$, in the limit where the robot [does not vary its angular momentum](https://scaron.info/robotics/point-mass-model.html).) Thus, while we work on a simplified inverted pendulum in this notebook, concepts and tools are those used as-is on real robots, as you can verify by exploring the bonus section.

Gymnasium is mainly a single-agent reinforcement learning API, but it also comes with simple environments, including an inverted pendulum sliding on a linear guide:

In [None]:
with gym.make("InvertedPendulum-v4", render_mode="human") as env:
    action = 0.0 * env.action_space.sample()
    observation, _ = env.reset()
    episode_return = 0.0
    for step in range(200):
        # action[0] = 5.0 * observation[1] + 0.3 * observation[0]
        observation, reward, terminated, truncated, _ = env.step(action)
        episode_return += reward
        if terminated or truncated:
            observation, _ = env.reset()
            episode_return = 0.0
    print(f"Return of the episode: {episode_return}")

The structure of the action and observation vectors is documented in [Inverted Pendulum - Gymnasium Documentation](https://gymnasium.farama.org/environments/mujoco/inverted_pendulum/). The observation, in particular, is a NumPy array with four coordinates that we recall here for reference:

| Num | Observation | Min | Max | Unit |
|-----|-------------|-----|-----|------|
|   0 | position of the cart along the linear surface | -Inf | Inf | position (m) |
|   1 | vertical angle of the pole on the cart | -Inf | Inf | angle (rad) |
|   2 | linear velocity of the cart | -Inf | Inf | linear velocity (m/s) |
|   3 | angular velocity of the pole on the cart | -Inf | Inf | angular velocity (rad/s) |

Check out the documentation for the definitions of the action and rewards.
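
To avoid mixing up indices later on, it can help to attach a name to each coordinate. The snippet below is a minimal sketch based on the observation table above; the helper `describe_observation` and the constant `OBSERVATION_LABELS` are hypothetical names introduced for illustration, not part of Gymnasium, and the example values are made up.

```python
import numpy as np

# Labels follow the observation table above (indices 0 to 3)
OBSERVATION_LABELS = (
    "cart position (m)",
    "pole angle (rad)",
    "cart linear velocity (m/s)",
    "pole angular velocity (rad/s)",
)


def describe_observation(observation: np.ndarray) -> dict:
    """Map each coordinate of an observation to its label."""
    return {
        label: float(value)
        for label, value in zip(OBSERVATION_LABELS, observation)
    }


example = np.array([0.1, -0.02, 0.0, 0.3])  # made-up values for illustration
for label, value in describe_observation(example).items():
    print(f"{label}: {value:+.3f}")
```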

## PID balancer

A *massively* used class of policies is the [PID controller](https://en.wikipedia.org/wiki/Proportional%E2%80%93integral%E2%80%93derivative_controller). Let's say we have a reference observation, like $o^* = [0\ 0\ 0\ 0]$ for the inverted pendulum. Denoting by $e = o^* - o$ the *error* of the system when it observes a given state, a PID controller will apply the action:

$$
a(t) = K_p^T e(t) + K_d^T \dot{e}(t) + K_i^T \int_0^t e(\tau) \mathrm{d} \tau
$$

where $K_{p}, K_i, K_d \in \mathbb{R}^4$ are constants called *gains* and tuned by the user. In discrete time:

$$
a_k = K_p^T e_k + K_d^T \frac{e_k - e_{k-1}}{\delta t} + K_i^T \sum_{i=0}^{k} e_i {\delta t}
$$
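
The discrete-time formula can be sketched as a small callable class. This is one possible implementation under stated assumptions, not a reference one: the class name `PIDController` and its parameters are hypothetical, the time step `dt` has to be supplied by the user, and the derivative term is set to zero at the first step since $e_{-1}$ is not defined.

```python
import numpy as np


class PIDController:
    """Discrete-time PID controller on a vector-valued error (illustrative sketch)."""

    def __init__(self, kp, ki, kd, dt, reference=None):
        self.kp = np.asarray(kp, dtype=float)
        self.ki = np.asarray(ki, dtype=float)
        self.kd = np.asarray(kd, dtype=float)
        self.dt = float(dt)  # assumption: time step provided by the user
        self.reference = (
            np.zeros_like(self.kp)
            if reference is None
            else np.asarray(reference, dtype=float)
        )
        self.integral = np.zeros_like(self.kp)
        self.previous_error = None  # no previous error at the first step

    def __call__(self, observation):
        # Error e_k = o^* - o_k
        error = self.reference - np.asarray(observation, dtype=float)
        # Integral term: running sum of e_i * dt
        self.integral += error * self.dt
        # Derivative term: (e_k - e_{k-1}) / dt, assumed zero at k = 0
        if self.previous_error is None:
            derivative = np.zeros_like(error)
        else:
            derivative = (error - self.previous_error) / self.dt
        self.previous_error = error
        action = (
            self.kp @ error + self.kd @ derivative + self.ki @ self.integral
        )
        return np.array([action])


# Placeholder gains, not tuned for any environment
controller = PIDController(
    kp=[1.0, 2.0, 0.0, 0.0], ki=np.zeros(4), kd=np.zeros(4), dt=0.1
)
print(controller(np.array([1.0, 0.5, 0.0, 0.0])))
```

Note that with the reference $o^* = 0$ the error is $e_k = -o_k$, so a positive gain on the error is a negative gain on the observation. The value of `dt` should match the time step of the environment being controlled.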

Let's refactor the rolling out of our episode into a standalone function:

In [None]:
def rollout(policy, show: bool = True):
    episode = []
    kwargs = {"render_mode": "human"} if show else {}
    with gym.make("InvertedPendulum-v4", **kwargs) as env:
        observation, _ = env.reset()
        episode.append(observation)
        for step in range(1000):
            action = policy(observation)
            observation, reward, terminated, truncated, _ = env.step(action)
            episode.extend([action, reward, observation])
            if terminated or truncated:
                return episode
    return episode
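
The list returned by `rollout` interleaves observations, actions and rewards as `[o_0, a_0, r_0, o_1, a_1, r_1, o_2, ...]`, so each quantity can be recovered by slicing with a stride of three. Here is a minimal sketch on a synthetic episode with the same layout; the values are made up and no environment is needed to run it.

```python
import numpy as np

# Synthetic episode with the same layout as the list returned by rollout():
# [o_0, a_0, r_0, o_1, a_1, r_1, o_2, ...]
episode = [np.zeros(4)]
for k in range(3):
    action = np.array([0.1 * k])
    reward = 1.0
    next_observation = np.full(4, float(k + 1))
    episode.extend([action, reward, next_observation])

observations = np.array(episode[0::3])  # o_0 to o_3
actions = np.array(episode[1::3])  # a_0 to a_2
rewards = np.array(episode[2::3])  # r_0 to r_2
episode_return = rewards.sum()

print(observations.shape, actions.shape, rewards.shape, episode_return)
```

There is always one more observation than there are actions or rewards, since the episode starts with the observation returned by `env.reset()`.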

## Question 1: Write a PID controller that balances the inverted pendulum

You can use global variables to store the (discrete) derivative and integral terms. This is acceptable here because we only roll out a single trajectory:

In [None]:
previous_observation = np.zeros(4)
integral = np.zeros(4)

def pid_control(observation: np.ndarray) -> np.ndarray:
    global previous_observation, integral
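    # Finite difference and running sum of observations, available for D and I terms
    derivative = observation - previous_observation
    previous_observation = observation.copy()
    integral += observation
    # Feedback on pole angle [1], cart position [0] and cart velocity [2]
    my_action_value: float = 5.0 * observation[1] + 0.15 * observation[0] - 0.01 * observation[2]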
    return np.array([my_action_value])

episode = rollout(pid_control, show=False)

You can look at the system using `show=True`, but intuition usually builds faster when looking at relevant plots:

In [None]:
observations = np.array(episode[::3])

plt.plot(observations)
plt.legend(("position", "pitch", "linear_velocity", "angular_velocity"))

Can you reach the maximum return of 1000, that is, a reward of 1 at each of the 1000 steps?

In [None]:
print(f"Return of the episode: {sum(episode[2::3])}")

# Bonus: train for a real robot

This section is entirely optional and will only work on Linux or macOS. In this part, we follow the same training pipeline but with the open source software of [Upkie](https://hackaday.io/project/185729-upkie-wheeled-biped-robots).

## Setup

<img src="https://user-images.githubusercontent.com/1189580/170496331-e1293dd3-b50c-40ee-9c2e-f75f3096ebd8.png" style="height: 100px" align="right" />

First, make sure you have a C++ compiler (setup one-liners: [Fedora](https://github.com/upkie/upkie/discussions/100), [Ubuntu](https://github.com/upkie/upkie/discussions/101)). You can then run an Upkie simulation right from the command line. It won't install anything on your machine: everything will run locally from the repository.

```console
git clone https://github.com/upkie/upkie.git
cd upkie
./start_simulation.sh
```

We will use the Python API of the robot to test things from this notebook, or from custom scripts. Install it from PyPI in your Conda environment:

```
pip install upkie
```

## Stepping the environment

If everything worked well, you should be able to step an environment as follows:

In [None]:
import gymnasium as gym
import upkie.envs

upkie.envs.register()

episode_return = 0.0
with gym.make("UpkieGroundVelocity-v1", frequency=200.0) as env:
    observation, _ = env.reset()  # connects to the spine (simulator or real robot)
    action = 0.0 * env.action_space.sample()
    for step in range(1000):
        pitch = observation[0]
        action[0] = 10.0 * pitch  # 1D action: [ground_velocity]
        observation, reward, terminated, truncated, _ = env.step(action)
        episode_return += reward
        if terminated or truncated:
            observation, _ = env.reset()

print(f"We have stepped the environment {step + 1} times")
print(f"The return of our episode is {episode_return}")

(If you see a message "Waiting for spine /vulp to start", it means the simulation is not running.)

We can double-check the last observation from the episode:

In [None]:
def report_last_observation(observation):
    print("The last observation of the episode is:")
    print(f"- Pitch from torso to world: {observation[0]:.2} rad")
    print(f"- Ground position: {observation[1]:.2} m")
    print(f"- Angular velocity from torso to world in torso: {observation[2]:.2} rad/s")
    print(f"- Ground velocity: {observation[3]:.2} m/s")
    
report_last_observation(observation)

## Question B1: PID balancer

Adapt your code from Question 1 to this environment:

In [None]:
def policy_b1(observation):
    return np.array([0.0])  # replace with your solution


def run(policy, nb_steps: int):
    episode_return = 0.0
    with gym.make("UpkieGroundVelocity-v1", frequency=200.0) as env:
        observation, _ = env.reset()  # connects to the spine (simulator or real robot)
        for step in range(nb_steps):
            action = policy(observation)
            observation, reward, terminated, truncated, _ = env.step(action)
            episode_return += reward
            if terminated or truncated:
                print("Fall detected!")
                return episode_return
    report_last_observation(observation)
    return episode_return


episode_return = run(policy_b1, 1000)
print(f"The return of our episode is {episode_return}")

## Training a new policy

The Upkie repository ships three agents based on PID control, model predictive control and reinforcement learning. We now focus on the last one, called the "PPO balancer".

Check that you can start training by running the following command from the root of the repository:

```
./tools/bazel run //agents/ppo_balancer:train -- --nb-envs 1 --show
```

A simulation window should pop up, and verbose output from Stable Baselines3 (SB3) should be printed to your terminal.

By default, training data will be logged to `/tmp`. You can select a different output path by setting the `UPKIE_TRAINING_PATH` environment variable in your shell. For instance:

```
export UPKIE_TRAINING_PATH="${HOME}/src/upkie/training"
```

Run TensorBoard from the training directory:

```
tensorboard --logdir ${UPKIE_TRAINING_PATH}  # or /tmp if you keep the default
```

### Selecting the number of processes

We can increase the number of parallel CPU environments `--nb-envs` to a value suitable for your computer. Let training run for a minute and check `time/fps`. Then increase the number of environments and compare the stationary regime of `time/fps`. You should see a performance increase when adding the first few environments, followed by a decline when there are too many parallel processes compared to your number of CPU cores. Pick the value that works best for you.
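
To pick values to try, a reasonable starting point is the number of CPU cores reported by Python. The sketch below lists powers of two up to that number; the helper `candidate_nb_envs` is a hypothetical name introduced for illustration, and `os.cpu_count()` counts logical cores, which may differ from physical ones.

```python
import os


def candidate_nb_envs(nb_cores: int) -> list:
    """Powers of two below the number of cores, plus the core count itself."""
    candidates = []
    n = 1
    while n < nb_cores:
        candidates.append(n)
        n *= 2
    candidates.append(nb_cores)
    return candidates


nb_cores = os.cpu_count() or 1  # os.cpu_count() may return None
print(f"Values of --nb-envs to try: {candidate_nb_envs(nb_cores)}")
```

Each candidate still needs to be checked against `time/fps` in TensorBoard, as the best value depends on the machine.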