# Optimal control course project
# Solving the CartPole-v1 problem

by **Artem Petrov** and **Ivan Kudriakov**

Solving CartPole-v1 from OpenAI Gym using methods of optimal control theory.


# Mathematical model

The CartPole-v1 environment is simulated according to the model described here: https://coneural.org/florian/papers/05_cart_pole.pdf

([See the corresponding code of CartPole-v1](https://github.com/openai/gym/blob/2dddaf722acccfd0412d745890c40dcd972586d5/gym/envs/classic_control/cartpole.py#L126))

Below, we give a brief explanation of the model.

<img src="presentation_media/forces_model.png" alt="drawing" width="500"/>

<!-- Here $G_c$ and $G_p$ are the forces of gravity acting on cart and pole respectively.

Here $N_c$ - reaction force and $F_f$ - friction force between the cart and the rail -->

Here are the parameters of the model:

$m_c$ - the mass of the cart 

$m_p$ - the mass of the pole 

$l$ - half-length of the pole

The original model also accounts for friction forces; in our model, however, friction is neglected.

<!-- 
$\mu_c$ - friction coefficient between the cart and the rail

$\mu_p$ - friction coefficient between the cart and the pole

Side note: the initial model assumes Lubricated friction between the pole and the cart which is somewhat counterintuitive. -->

From the frictionless Newton equations, the following equations of motion can be derived:

$$
\ddot{\theta} = \frac{g \sin \theta + \cos \theta \left( \frac{- F - m_p l \dot{\theta}^2\sin \theta}{m_c + m_p} \right)}
{l\left(\frac{4}{3} - \frac{m_p \cos ^2 \theta}{m_c + m_p}\right)}
\label{eq:theta} \tag{1}
$$

$$
\ddot{x} = \frac{F + m_p l\left(\dot{\theta}^2 \sin \theta - \ddot{\theta}\cos\theta\right)}
{m_c + m_p}
\label{eq:x} \tag{2}
$$

where $F = \pm F_0$ is the horizontal force applied to the cart by the agent.

Readers interested in the full derivation of the equations above can consult [the original source](https://coneural.org/florian/papers/05_cart_pole.pdf).
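As a minimal sketch, equations (1) and (2) can be evaluated numerically as follows. The function name `accelerations` and the constant names are our own choices for illustration; the numeric values are the ones listed for CartPole-v1 in the environment's source code.

```python
import math

# Parameter values as listed for CartPole-v1 in the environment's source code.
G = 9.8       # gravitational acceleration g
M_CART = 1.0  # cart mass m_c
M_POLE = 0.1  # pole mass m_p
L_HALF = 0.5  # half-length of the pole l


def accelerations(theta, theta_dot, force):
    """Return (theta_ddot, x_ddot) from equations (1) and (2)."""
    total_mass = M_CART + M_POLE
    sin_t, cos_t = math.sin(theta), math.cos(theta)

    # Equation (1): angular acceleration of the pole.
    numerator = G * sin_t + cos_t * (
        (-force - M_POLE * L_HALF * theta_dot**2 * sin_t) / total_mass
    )
    denominator = L_HALF * (4.0 / 3.0 - M_POLE * cos_t**2 / total_mass)
    theta_ddot = numerator / denominator

    # Equation (2): linear acceleration of the cart, reusing theta_ddot from (1).
    x_ddot = (
        force + M_POLE * L_HALF * (theta_dot**2 * sin_t - theta_ddot * cos_t)
    ) / total_mass
    return theta_ddot, x_ddot
```

For an upright pole at rest pushed with $F = +10$, this evaluates to $\ddot{\theta} = -600/41 \approx -14.63$ and $\ddot{x} = 400/41 \approx 9.76$: the cart accelerates in the direction of the force while the pole starts to fall the opposite way.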


#### Now, let's pose the problem in terms of optimal control theory.

The goal is to keep the pole upright for as long as possible.

Hence, one might suggest using one of the following simple functionals:

$$
\left[
\begin{array}{l}
    J_1 = \int\limits_{0}^{\infty} \theta^2 \, dt \rightarrow \min \\
    J_2 = \int\limits_{0}^{T} \theta^2 \, dt \rightarrow \min \\
    J_3 = \int\limits_{0}^{T} |\theta| \, dt \rightarrow \min
\end{array}
\right.
$$


However, if we want to maximize the agent's reward, which grows with the time $T$ the episode lasts, the following functional must be used:

$$ J = T \rightarrow \max $$
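To see why, recall that the environment grants a reward of $+1$ for every simulation step survived. A short derivation, assuming an episode that lasts $N$ steps of fixed duration $\tau$ ($N$ and the total reward $R$ are our notation for this sketch):

$$
R = \sum_{k=0}^{N-1} 1 = N = \frac{T}{\tau}
$$

Since $\tau$ is constant, maximizing the total reward $R$ is equivalent to maximizing $T$.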




Let $\mathbf{q} = (x, \dot{x}, \theta, \dot{\theta})^T$ be the state vector.

$F = u F_0$, where $u \in \{-1, +1\}$ is the control.

Then, the dynamic constraints can be written as follows:

$$\mathbf{\dot{q}} = 
\begin{bmatrix}
    \dot{x} \\
    \ddot{x} \\
    \dot{\theta} \\
    \ddot{\theta}
\end{bmatrix} = 
\begin{bmatrix}
    \dot{x} \\
    \frac{F + m_p l\left(\dot{\theta}^2 \sin \theta - \ddot{\theta}\cos\theta\right)}{m_c + m_p} \\
    \dot{\theta} \\
    \frac{g \sin \theta + \cos \theta \left( \frac{- F - m_p l \dot{\theta}^2\sin \theta}{m_c + m_p} \right)}
{l\left(\frac{4}{3} - \frac{m_p \cos ^2 \theta}{m_c + m_p}\right)}
\end{bmatrix} = 
\begin{bmatrix}
    \dot{x} \\
    \frac{uF_0 + m_p l\left(\dot{\theta}^2 \sin \theta - \frac{g \sin \theta + \cos \theta \left( \frac{- uF_0 - m_p l \dot{\theta}^2\sin \theta}{m_c + m_p} \right)}
{l\left(\frac{4}{3} - \frac{m_p \cos ^2 \theta}{m_c + m_p}\right)}\cos\theta\right)}{m_c + m_p} \\
    \dot{\theta} \\
    \frac{g \sin \theta + \cos \theta \left( \frac{- u F_0 - m_p l \dot{\theta}^2\sin \theta}{m_c + m_p} \right)}
{l\left(\frac{4}{3} - \frac{m_p \cos ^2 \theta}{m_c + m_p}\right)}
\end{bmatrix} = 
\mathbf{f}(\mathbf{q}, u)$$
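The state-space form above lends itself directly to numerical integration. The following is a minimal sketch, assuming an explicit Euler update with step $\tau = 0.02$ s; the names `f`, `euler_step` and the constants are our own, and other integration schemes could be substituted.

```python
import math

# Parameter values as listed for CartPole-v1 in the environment's source code.
G, M_CART, M_POLE, L_HALF, F0 = 9.8, 1.0, 0.1, 0.5, 10.0
TAU = 0.02  # assumed integration step, in seconds


def f(q, u):
    """Right-hand side of q_dot = f(q, u) for q = (x, x_dot, theta, theta_dot)."""
    _, x_dot, theta, theta_dot = q
    force = u * F0
    total_mass = M_CART + M_POLE
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    theta_ddot = (
        G * sin_t
        + cos_t * (-force - M_POLE * L_HALF * theta_dot**2 * sin_t) / total_mass
    ) / (L_HALF * (4.0 / 3.0 - M_POLE * cos_t**2 / total_mass))
    x_ddot = (
        force + M_POLE * L_HALF * (theta_dot**2 * sin_t - theta_ddot * cos_t)
    ) / total_mass
    return (x_dot, x_ddot, theta_dot, theta_ddot)


def euler_step(q, u, tau=TAU):
    """Advance the state by one explicit Euler step: q + tau * f(q, u)."""
    return tuple(qi + tau * dqi for qi, dqi in zip(q, f(q, u)))


# Example: push right (u = +1) for three steps, starting from rest.
q = (0.0, 0.0, 0.0, 0.0)
for _ in range(3):
    q = euler_step(q, u=+1)
```

Keeping `f` separate from the integrator means the same right-hand side can be reused with a different scheme or passed to an optimal control solver.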



We must also take into account that, in this particular implementation, the simulation stops as soon as one of the following conditions is met:

- $|\theta| > 12^\circ$  (which contradicts the [frontpage](https://gym.openai.com/envs/CartPole-v1/), but is consistent with the [actual code](https://github.com/openai/gym/blob/2dddaf722acccfd0412d745890c40dcd972586d5/gym/envs/classic_control/cartpole.py#L90))

- $|x| > 2.4$

- $t > 500\tau$, where $\tau = 0.02$ s is the length of one simulation step

These conditions can be formulated as the path constraints:

$$
\begin{cases}
    |q_1| = |x| \le 2.4\\
    |q_3| = |\theta| \le 12^\circ \\
    t \le 500\tau
\end{cases}
$$
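These constraints can be checked for a given state and time as in the sketch below. The helper name `is_feasible` is hypothetical, and we assume the angle $\theta$ is stored in radians, so the $12^\circ$ bound is converted accordingly.

```python
import math

X_MAX = 2.4                     # bound on the cart position |x|
THETA_MAX = math.radians(12.0)  # bound on the pole angle |theta|, in radians
TAU = 0.02                      # length of one simulation step, in seconds
T_MAX = 500 * TAU               # time limit


def is_feasible(q, t):
    """Return True if state q = (x, x_dot, theta, theta_dot) at time t
    satisfies all three constraints."""
    x, _, theta, _ = q
    return abs(x) <= X_MAX and abs(theta) <= THETA_MAX and t <= T_MAX
```

For example, `is_feasible((0.0, 0.0, 0.2, 0.0), 1.0)` holds because $0.2$ rad is just below $12^\circ \approx 0.2094$ rad, whereas a cart position of $2.5$ violates the first constraint.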

#### Parameters of the environment

The model's parameters can be found in [the source code](https://github.com/openai/gym/blob/2dddaf722acccfd0412d745890c40dcd972586d5/gym/envs/classic_control/cartpole.py#L80) of the CartPole-v1 environment.

$g = 9.8$

$m_c = 1.0$ 

$m_p = 0.1$

$l = 0.5$ 

$F_0 = 10.0$
