# Energy-based PPO
#### Team MIRAM
---

Makar Korchagin, Ilya Zherebtsov, Rinat Prochii, Aibek Akhmetkazy, Mikhail Gubanov

For the third project we chose the Inverted Double Pendulum environment with a custom reward function. The aim of the environment is to balance the inverted double pendulum by controlling the movement of the cart to which the pendulum is attached.

![Inverted double pendulum](https://gymnasium.farama.org/_images/inverted_double_pendulum.gif)



## State space

![Double pendulum system](https://www.researchgate.net/profile/Slavka-Jadlovska-2/publication/258795979/figure/fig2/AS:392622284787714@1470619855067/Classical-double-inverted-pendulum-system-scheme-and-basic-nomenclature.png)

The system has the state space of 12 continuous variables:

$x \in \mathbb{R}$ - is the position of the cart with mass $m$;

$\dot{x} \in \mathbb{R}$ - is the speed of the cart with mass $m$;

$\ddot{x} \in \mathbb{R}$ - is the acceleration of the cart with mass $m$;

$\theta_1 \in [0, 2\pi]$ - is the angle of the hinge with mass $m_1$ w.r.t. the vertical axis;

$\theta_2 \in [0, 2\pi]$ - is the angle of the hinge with mass $m_2$ w.r.t. the vertical axis;

$\dot{\theta}_1 \in \mathbb{R}$ - is the angular velocity of the hinge with mass $m_1$;

$\dot{\theta}_2 \in \mathbb{R}$ - is the angular velocity of the hinge with mass $m_2$;

$\ddot{\theta}_1 \in \mathbb{R}$ - is the angular acceleration of the hinge with mass $m_1$;

$\ddot{\theta}_2 \in \mathbb{R}$ - is the angular acceleration of the hinge with mass $m_2$;

$f_0, f_1, f_2 \in \mathbb{R}$ - the friction forces for each degree of freedom.

Aside from the state vector, the system has 5 hyperparameters:

$m$ - the mass of the cart;

$m_1, m_2$ - the masses of the first and second hinges respectively;

$J_1, J_2$ - the moments of inertia of the first and second poles respectively.


### Mathematical model


Using the state variables and hyperparameters defined above, we can derive the second-order differential equation that describes the behaviour of the system:

$$H_1(z)\ddot{z}=H_2(z,\dot{z})\dot{z}+h_3(z)+h_0u,$$

where

$$z = (x, \theta_1, \theta_2)^T;$$

$$H_1(z) =
\begin{bmatrix}
a_0 & a_1\cos\theta_1 & a_2\cos\theta_2 \\
a_1\cos\theta_1 & b_1 & a_2l_1\cos(\theta_2-\theta_1) \\
a_2\cos\theta_2 & a_2l_1\cos(\theta_2 - \theta_1) & b_2 \\
\end{bmatrix};
$$

$$H_2(z, \dot{z}) =
\begin{bmatrix}
-f_0 & a_1\sin\theta_1\dot{\theta}_1 & a_2\sin\theta_2\dot{\theta}_2 \\
0 & - f_1 - f_2 & a_2l_1\sin\theta_2 \dot{\theta}_2 \\
0 & -a_2l_1\sin(\theta_2 - \theta_1)\dot{\theta}_1 + f_2 & - f_2 - f_3 \\
\end{bmatrix};
$$

$$
h_3(z) =
\begin{bmatrix}
0 & a_1g\sin\theta_1 & a_2g\sin\theta_2
\end{bmatrix}^T.
$$

The constants above are defined as follows:

$$a_0 = m + m_1 + m_2;$$

$$a_1 = m_1l_1 + m_2l_2;$$

$$a_2 = m_2l_2;$$

$$b_1 = J_1 + m_1l_1^2 + m_2l_2^2;$$

$$b_2 = J_2 + m_2l_2^2.$$
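
As a rough illustration, the constants and the matrix $H_1(z)$ can be transcribed into NumPy exactly as they are written above. The function names `model_constants` and `inertia_matrix` are hypothetical, and the unit parameter values in the usage lines are placeholders rather than the values used by the environment.

```python
import numpy as np


def model_constants(m, m1, m2, l1, l2, J1, J2):
    """Return (a0, a1, a2, b1, b2), transcribed from the formulas in the text."""
    a0 = m + m1 + m2
    a1 = m1 * l1 + m2 * l2
    a2 = m2 * l2
    b1 = J1 + m1 * l1**2 + m2 * l2**2
    b2 = J2 + m2 * l2**2
    return a0, a1, a2, b1, b2


def inertia_matrix(theta1, theta2, consts, l1):
    """Build H_1(z) for z = (x, theta1, theta2); H_1 does not depend on x."""
    a0, a1, a2, b1, b2 = consts
    c1 = np.cos(theta1)
    c2 = np.cos(theta2)
    c21 = np.cos(theta2 - theta1)
    return np.array([
        [a0,      a1 * c1,       a2 * c2],
        [a1 * c1, b1,            a2 * l1 * c21],
        [a2 * c2, a2 * l1 * c21, b2],
    ])


# Placeholder parameters (assumption): every mass, length and inertia set to 1.
consts = model_constants(m=1.0, m1=1.0, m2=1.0, l1=1.0, l2=1.0, J1=1.0, J2=1.0)
H1 = inertia_matrix(theta1=0.0, theta2=0.0, consts=consts, l1=1.0)
```

Given a right-hand side vector `rhs` assembled from $H_2$, $h_3$ and $h_0u$, the accelerations $\ddot{z}$ could then be obtained with `np.linalg.solve(H1, rhs)` instead of inverting $H_1$ explicitly.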


### Initial state

The initial positions of the cart and poles are sampled from a uniform distribution, while the initial velocities are sampled from a normal distribution. The second derivatives and the remaining variables are zero in the initial state:

$$s_0 =
\begin{cases}
(\theta_1, \theta_2, x) \sim \mathcal{U}(-0.1 \times I_3, 0.1 \times I_3); \\
(\dot{\theta}_1, \dot{\theta}_2, \dot{x}) \sim \mathcal{N}(0_3, 0.1 \times I_3); \\
(\ddot{\theta}_1, \ddot{\theta}_2, \ddot{x}, f_0, f_1, f_2) = 0_6.
\end{cases}
$$
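
A minimal sketch of this sampling scheme in NumPy is given below. The function name `sample_initial_state` and the ordering of the state vector are assumptions made for illustration. The factor $0.1$ in the normal distribution is interpreted here as a standard deviation; if it is meant as a variance, the scale would be $\sqrt{0.1}$ instead.

```python
import numpy as np


def sample_initial_state(rng):
    """Sample s_0 as a 12-dimensional vector.

    Assumed ordering (not fixed by the text):
    [theta1, theta2, x, theta1_dot, theta2_dot, x_dot,
     theta1_ddot, theta2_ddot, x_ddot, f0, f1, f2]
    """
    # Positions: uniform on [-0.1, 0.1] for each coordinate.
    positions = rng.uniform(-0.1, 0.1, size=3)
    # Velocities: zero-mean normal, 0.1 treated as the standard deviation (assumption).
    velocities = 0.1 * rng.standard_normal(3)
    # Accelerations and friction forces start at zero.
    rest = np.zeros(6)
    return np.concatenate([positions, velocities, rest])


rng = np.random.default_rng(0)
s0 = sample_initial_state(rng)
```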

### State space outline

In summary, the Inverted Double Pendulum environment has a state vector $s \in \mathbb{S} \subset \mathbb{R}^{12}$ with 10 unbounded real variables and two angles bounded to the interval $[0, 2\pi]$.

## Action space

The action space is a single continuous variable $f \in [-1, 1]$ denoting the force [N] applied to the cart along the $X$ axis.

## Observation space

The environment supports two types of observation: a kinematics vector and an RGB image of the system. In this project we use the first option.

The observation vector of kinematics type includes 11 continuous variables:

$x \in \mathbb{R}$ - is the position of the cart along x axis;

$\sin\theta_1 \in [-1, 1]$ - is the sine of the angle of the first hinge;

$\sin\theta_2 \in [-1, 1]$ - is the sine of the angle of the second hinge;

$\cos\theta_1 \in [-1, 1]$ - is the cosine of the angle of the first hinge;

$\cos\theta_2 \in [-1, 1]$ - is the cosine of the angle of the second hinge;

$\dot{x} \in \mathbb{R}$ - is the velocity of the cart;

$\dot{\theta}_1 \in \mathbb{R}$ - is the angular velocity of the first hinge;

$\dot{\theta}_2 \in \mathbb{R}$ - is the angular velocity of the second hinge;

$f_1, f_2, f_3 \in \mathbb{R}$ - are the constraint forces for each degree of freedom (cart position, first and second hinge angles respectively).
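
The mapping from the physical quantities to the kinematics observation can be sketched as follows. The helper names `kinematics_observation` and `angles_from_observation` are hypothetical, and the component order simply follows the list above.

```python
import numpy as np


def kinematics_observation(x, theta1, theta2, x_dot, theta1_dot, theta2_dot, constraints):
    """Assemble the 11-dimensional observation in the order listed in the text."""
    return np.array([
        x,
        np.sin(theta1), np.sin(theta2),
        np.cos(theta1), np.cos(theta2),
        x_dot, theta1_dot, theta2_dot,
        *constraints,  # three constraint terms, one per degree of freedom
    ])


def angles_from_observation(obs):
    """Recover theta1 and theta2 (in (-pi, pi]) from their sine and cosine."""
    theta1 = np.arctan2(obs[1], obs[3])
    theta2 = np.arctan2(obs[2], obs[4])
    return theta1, theta2


obs = kinematics_observation(0.05, 0.3, -0.2, 0.0, 0.1, -0.1, (0.0, 0.0, 0.0))
theta1, theta2 = angles_from_observation(obs)
```

Encoding each angle by its sine and cosine avoids the discontinuity at the wrap-around point, and the angle itself remains recoverable through `np.arctan2`.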


## Reward

The original reward gives a constant bonus at every step, which is reduced by penalties for the displacement of the pendulum's tip from its upright position and for high pole velocities:

$$r(s,a) = 10 - (0.01 x^2 + (y-2)^2) - 0.001 v_1^2 - 0.005 v_2^2,$$

where:

$x, y$ - are the coordinates of the free tip of the pendulum;

$v_1, v_2$ - are the absolute velocities of the poles' centres of mass.

We decided to replace this reward with one based on physical reasoning. Keeping the pendulum upright amounts to maximizing the potential energy $V$ while minimizing the kinetic energy $T$:

$$
\begin{cases}
V \rightarrow \max{}; \\
T \rightarrow 0.
\end{cases}
$$

Essentially, we propose to set the reward function equal to the negative Lagrangian of the system:

$$r(s,a) = -L = \sum_{i=0}^2 \left( V_i - T_i \right),$$

where:

$$T_0 = \frac{m\dot{x}^2}{2}, \quad V_0 = 0;$$

$$T_1 = \frac{m_1v_1^2}{2} + \frac{J_1 \dot{\theta}_1^2}{2}=\frac{7}{24}m_1l_1^2\dot{\theta}_1^2, \quad V_1 = \frac{1}{2}m_1gl_1\cos\theta_1;$$

$$T_2 = \frac{m_2v_2^2}{2} + \frac{J_2 \dot{\theta}_2^2}{2}= \frac{m_2}{2} \left( l_1^2\dot{\theta}_1^2 + \frac{1}{4} l_2^2\dot{\theta}_2^2 + l_1l_2\dot{\theta}_1\dot{\theta}_2\cos(\theta_1 - \theta_2)\right) + \frac{m_2l_2^2\dot{\theta}_2^2}{6}, \quad V_2 = m_2g\left(l_1\cos\theta_1 + \frac{1}{2}l_2\cos\theta_2 \right).$$

Moreover, the proposed reward can be calculated using only the observation vector and the hyperparameter values.
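
A minimal sketch of this reward, summing $V_i - T_i$ over the cart and both poles, is shown below. The function name `energy_reward`, the default parameter values and the value of `G` are assumptions made for illustration. The velocity of the second pole's centre of mass is obtained by differentiating the position used in $V_2$, which coincides with the expression above when $l_1 = l_2$.

```python
import numpy as np

G = 9.81  # gravitational acceleration in m/s^2 (assumed value)


def energy_reward(theta1, theta2, x_dot, theta1_dot, theta2_dot,
                  m=1.0, m1=1.0, m2=1.0, l1=1.0, l2=1.0, g=G):
    """Negative Lagrangian: sum of V_i - T_i for i = 0, 1, 2."""
    # Cart: kinetic energy only, V_0 = 0.
    t0 = 0.5 * m * x_dot**2
    # First pole: closed form of T_1 and its potential energy V_1.
    t1 = 7.0 / 24.0 * m1 * l1**2 * theta1_dot**2
    v1 = 0.5 * m1 * g * l1 * np.cos(theta1)
    # Second pole: squared speed of the centre of mass, then T_2 and V_2.
    speed2_sq = (
        (l1 * theta1_dot) ** 2
        + 0.25 * (l2 * theta2_dot) ** 2
        + l1 * l2 * theta1_dot * theta2_dot * np.cos(theta1 - theta2)
    )
    t2 = 0.5 * m2 * speed2_sq + m2 * l2**2 * theta2_dot**2 / 6.0
    v2 = m2 * g * (l1 * np.cos(theta1) + 0.5 * l2 * np.cos(theta2))
    return (v1 + v2) - (t0 + t1 + t2)


# Angles can be recovered from the observation with arctan2(sin, cos);
# here the function is evaluated directly for an upright pendulum at rest.
r_upright = energy_reward(0.0, 0.0, 0.0, 0.0, 0.0)
```

The reward is largest when both poles are upright and every velocity is zero, and any motion of the cart or the poles lowers it through the kinetic terms.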

For simplicity, the reward function can be evaluated with unit masses $m_1 = m_2 = m = 1$ and unit pole lengths $l_1 = l_2 = 1$. Substituting these values into the sum above gives

$$
r(s,a) = \frac{3}{2}g\cos\theta_1 + \frac{1}{2}g\cos\theta_2 - \frac{1}{2}\dot{x}^2 - \frac{19}{24}\dot{\theta}_1^2 - \frac{7}{24}\dot{\theta}_2^2 - \frac{1}{2}\dot{\theta}_1\dot{\theta}_2\cos(\theta_1 - \theta_2).
$$



