In the nonlinear case, our model is given by:
\begin{gather}
\dot{x} = f(x) + g(x) \cdot u \\
x \in \mathbb{R}^n , u \in \mathbb{R}^m
\end{gather}

The optimal controller is calculated by solving the following optimization problem:
\begin{gather}
J(u^*) = \min_{u} \int_{0}^{\infty} \left( x^T Q x + u^T R u \right) dt \\
\text{subject to:} \\
h(0) = 0 , R > 0 , Q > 0
\end{gather}

The optimal controller is then given by:
\begin{gather}
u^* = -\frac{1}{2} R^{-1} g^T(x) \cdot \nabla V(x)
\end{gather}

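Here $V(x)$ is the value function of the system, which satisfies the Hamilton-Jacobi-Bellman (HJB) equation (the state penalty is written as $h^T(x) h(x)$, which corresponds to $x^T Q x$ in the cost above):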
\begin{gather}
\nabla V^T(x) f(x) + h^T(x)\cdot h(x) - \frac{1}{4} \nabla V^T(x) g(x) R^{-1} g^T(x) \nabla V(x) = 0
\end{gather}
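As a sanity check of the HJB equation and the controller formula above, the sketch below evaluates both on a small example. The system `f`, `g`, the weights `Q`, `R`, and the candidate value function $V(x) = \frac{1}{2} x_1^2 + x_2^2$ are assumptions chosen for illustration and do not come from this notebook; the state penalty is taken as $h^T(x) h(x) = x^T Q x$.

```python
import numpy as np

# Hypothetical example system (an assumption, not taken from the text).
def f(x):
    c = np.cos(2.0 * x[0]) + 2.0
    return np.array([-x[0] + x[1],
                     -0.5 * x[0] - 0.5 * x[1] * (1.0 - c ** 2)])

def g(x):
    return np.array([[0.0], [np.cos(2.0 * x[0]) + 2.0]])

Q = np.eye(2)            # state weight, so h^T h = x^T Q x (assumption)
R = np.array([[1.0]])    # input weight

def grad_V(x):
    # Gradient of the candidate value function V(x) = 0.5*x1^2 + x2^2.
    return np.array([x[0], 2.0 * x[1]])

def u_star(x):
    # u* = -1/2 R^{-1} g^T(x) grad V(x)
    return -0.5 * np.linalg.solve(R, g(x).T @ grad_V(x))

def hjb_residual(x):
    # grad V^T f + x^T Q x - 1/4 grad V^T g R^{-1} g^T grad V
    dV = grad_V(x)
    gx = g(x)
    quad = dV @ gx @ np.linalg.solve(R, gx.T @ dV)
    return dV @ f(x) + x @ Q @ x - 0.25 * quad

rng = np.random.default_rng(0)
samples = rng.uniform(-2.0, 2.0, size=(5, 2))
residuals = np.array([hjb_residual(x) for x in samples])
print(residuals)  # expected to be numerically zero for this candidate V
```

For this particular example the residual vanishes at every sampled state, which is what the HJB equation requires of the value function.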

Based on Kleinman's algorithm, we can write the corresponding equations for the nonlinear case:
\begin{gather}
V_i (x(t)) = \int_{t}^{\infty} \left( x^T Q x + u_i^T R u_i \right) d\tau \\
\dot{V_i} = \nabla V_i^T(x)\cdot \dot x = -x^T Q x - u_i^T R u_i
\end{gather}

If we substitute the model equation for $\dot x$ in $\nabla V_i^T(x) \cdot \dot x$, we obtain the $\textcolor{red}{\text{Bellman equation}}$:

\begin{gather}
\nabla V_i^T \left[ f(x) + g(x) u_i \right] = -x^T Q x - u_i^T R u_i \\
\downarrow \\
\nabla V_i^T \left[ f(x) + g(x) u_i \right] + x^T Q x + u_i^T R u_i = 0\\
\end{gather}

The equation above is also called $\textcolor{red}{\text{Policy Evaluation}}$.


For a given $u_i$, we can solve it for $V_i(x)$ with the boundary condition $V_i(0)=0$.

The policy evaluation equation is analogous to the Lyapunov equation from the linear case.
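The analogy can be checked numerically. In the linear case with $f(x) = Ax$, $g(x) = B$, $u_i = K_i x$ and $V_i(x) = x^T P_i x$, the policy evaluation equation reduces to the Lyapunov equation $(A + B K_i)^T P_i + P_i (A + B K_i) + Q + K_i^T R K_i = 0$. The sketch below solves it with SciPy and evaluates the Bellman residual; the matrices `A`, `B`, `Q`, `R` and the gain `K_i` are hypothetical values chosen only for illustration.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Hypothetical linear plant and weights (assumptions for illustration).
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
K_i = np.array([[-1.0, -1.0]])   # stabilizing policy u_i = K_i x

# Policy evaluation: solve (A + B K_i)^T P + P (A + B K_i) = -(Q + K_i^T R K_i).
A_cl = A + B @ K_i
P_i = solve_continuous_lyapunov(A_cl.T, -(Q + K_i.T @ R @ K_i))

def bellman_residual(x):
    # grad V_i^T [f(x) + g(x) u_i] + x^T Q x + u_i^T R u_i with V_i = x^T P_i x
    grad_V = 2.0 * P_i @ x
    u = K_i @ x
    return grad_V @ (A @ x + B @ u) + x @ Q @ x + u @ R @ u

rng = np.random.default_rng(1)
test_points = rng.normal(size=(5, 2))
residuals = np.array([bellman_residual(x) for x in test_points])
print(residuals)  # expected to be numerically zero
```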

Based on the new $V_i(x)$ calculated from the policy evaluation equation, we can calculate the improved controller using the relation between $V_i(x)$ and $u_{i+1}$:

\begin{gather}
u_{i+1}(x) = -\frac{1}{2}\cdot R^{-1} g^T(x) \cdot \nabla V_i(x)
\end{gather}

The equation above is called $\textcolor{red}{\text{Policy Improvement}}$.

This equation is analogous to the controller update in the linear case:

\begin{gather}
K_{i+1} = -R^{-1} B^T P_i
\end{gather}

Where $P_i$ is the solution of the Lyapunov equation.
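A minimal sketch of the linear iteration, alternating policy evaluation (a Lyapunov solve) and policy improvement (the gain update above), is given below. It uses the convention $u_i = K_i x$, so that $K_{i+1} = -R^{-1} B^T P_i$. The plant matrices, the weights, and the zero initial gain are assumptions for illustration; the zero gain is admissible only because this particular `A` is already stable.

```python
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

# Hypothetical linear plant and weights (assumptions for illustration).
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

K = np.zeros((1, 2))   # initial stabilizing gain (A itself is Hurwitz here)
for i in range(20):
    # Policy evaluation: Lyapunov equation for the current gain.
    A_cl = A + B @ K
    P = solve_continuous_lyapunov(A_cl.T, -(Q + K.T @ R @ K))
    # Policy improvement: K_{i+1} = -R^{-1} B^T P_i.
    K_next = -np.linalg.solve(R, B.T @ P)
    converged = np.linalg.norm(K_next - K) < 1e-10
    K = K_next
    if converged:
        break

# Reference solution from the algebraic Riccati equation.
P_are = solve_continuous_are(A, B, Q, R)
print(np.max(np.abs(P - P_are)))
```

Comparing `P` with the Riccati solution `P_are` gives a quick check that the iteration has reached the optimal value function for this example.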

The data-based algorithm we will use here is called off-policy reinforcement learning.

It is called off-policy because the data we use to update the policy was not generated by the previous policy.

# Deriving the new model:

We will write the model as we did in the linear case:

\begin{gather}
\dot{x} = f(x) + g(x) \cdot u_i + g(x) \left[u - u_i \right] \\
x \in \mathbb{R}^n , u \in \mathbb{R}^m
\end{gather}

Taking the derivative of the value function $V_i(x)$ with respect to time along the trajectories of this model, we have:

\begin{gather}
\dot{V_i} = \nabla V_i^T(x)\cdot \dot x \\
\nabla V_i^T(x) \cdot \dot x= \nabla V_i^T(x) \left[f(x) + g(x) u_i \right] + \nabla V_i^T(x) \cdot g(x) \left[u - u_i \right]
\end{gather}

We can use the Bellman equation to replace the term containing $f(x)$ and $g(x) u_i$ in the equation above:

\begin{gather}
\nabla V_i^T(x) \dot x = \nabla V_i^T(x) \left[f(x) + g(x) u_i \right] + \nabla V_i^T(x) \cdot g(x) \left[u - u_i \right] = -x^T Q x - u_i^T R u_i +  \nabla V_i^T(x) \cdot g(x) \left[u - u_i \right]\\
\end{gather}

Using the policy improvement equation, which gives $\nabla V_i^T(x) g(x) = -2 u_{i+1}^T(x) R$, we can write:

\begin{gather}
\dot V_i(x(t)) = \nabla V_i^T(x) \dot x = -x^T Q x - u_i^T R u_i - 2 u_{i+1}^T(x) R \left[u-u_i \right] \\
\end{gather}

## Integrating both sides:

\begin{gather}
\int_{t}^{t+T} \dot V_i(x(\tau)) d\tau = - \int_{t}^{t+T} x^T Q x \, d\tau - \int_{t}^{t+T} u_i^T R u_i \, d\tau - 2 \int_{t}^{t+T} u_{i+1}^T(x) R \left[u-u_i \right] d\tau \\
\downarrow \\
V_i(x(t+T)) - V_i(x(t)) = - \int_{t}^{t+T} x^T Q x \, d\tau - \int_{t}^{t+T} u_i^T R u_i \, d\tau - 2 \int_{t}^{t+T} u_{i+1}^T(x) R \left[u-u_i \right] d\tau
\end{gather}
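The integral relation above holds for any input $u$ applied to the system, not only for $u = u_i$. The sketch below checks this numerically on a linear example, where $V_i(x) = x^T P_i x$ is known in closed form. The plant, the weights, the evaluated gain `K_i`, and the sinusoidal input in `behavior_input` are all hypothetical choices for illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.linalg import solve_continuous_lyapunov

# Hypothetical linear plant and weights (assumptions for illustration).
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

K_i = np.array([[-1.0, -1.0]])                       # policy being evaluated
P_i = solve_continuous_lyapunov((A + B @ K_i).T, -(Q + K_i.T @ R @ K_i))
K_next = -np.linalg.solve(R, B.T @ P_i)              # improved policy u_{i+1}

def behavior_input(t, x):
    # Exploratory input actually applied to the plant; unrelated to u_i.
    return np.array([np.sin(3.0 * t) + 0.5 * np.cos(7.0 * t)])

def augmented_dynamics(t, z):
    # z = [x, integral of running cost, integral of cross term]
    x = z[:2]
    u = behavior_input(t, x)
    u_i = K_i @ x
    u_next = K_next @ x
    running_cost = x @ Q @ x + u_i @ R @ u_i
    cross_term = 2.0 * u_next @ R @ (u - u_i)
    return np.concatenate([A @ x + B @ u, [running_cost, cross_term]])

x0 = np.array([1.0, -0.5])
T = 0.5
sol = solve_ivp(augmented_dynamics, (0.0, T),
                np.concatenate([x0, [0.0, 0.0]]), rtol=1e-10, atol=1e-12)
xT = sol.y[:2, -1]
cost_int, cross_int = sol.y[2:, -1]

lhs = xT @ P_i @ xT - x0 @ P_i @ x0    # V_i(x(t+T)) - V_i(x(t))
rhs = -cost_int - cross_int
print(lhs, rhs)
```

Both sides agree up to integration error even though the data was generated with an input different from $u_i$, which is the property the off-policy algorithm relies on.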



In the linear case we knew that $V_i(x) = x^T P_i x$. In the nonlinear case we do not know the form of $V_i(x)$.

A neural network can approximate a nonlinear function, so we will use the following approximations:

\begin{gather}
\hat V_i(x) = \sum_{j=1}^{N_1} \hat c_{i,j} \phi_j(x) = \hat C_i \phi(x) \\
\hat u_i(x) = \sum_{j=1}^{N_2} \hat w_{i,j} \psi_j(x) = \hat W_i \psi(x) \\
\end{gather}
    

$\phi(x)$ and $\psi(x)$ are the basis functions of the neural network.

$\hat C_i$ and $\hat W_i$ are the weights of the neural network.
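A minimal sketch of such linear-in-the-weights approximators is shown below. The polynomial basis functions `phi` and `psi`, their sizes ($N_1 = 3$, $N_2 = 2$), and the weight values are hypothetical choices for a system with two states and one input; they are not taken from this notebook.

```python
import numpy as np

def phi(x):
    # Value-function basis (N1 = 3): quadratic monomials, so phi(0) = 0
    # and the approximation respects V_i(0) = 0.
    x1, x2 = x
    return np.array([x1 ** 2, x1 * x2, x2 ** 2])

def psi(x):
    # Controller basis (N2 = 2): linear features.
    x1, x2 = x
    return np.array([x1, x2])

# Hypothetical weights: C_hat has shape (N1,), W_hat has shape (m, N2) with m = 1.
C_hat = np.array([0.5, 0.0, 1.0])
W_hat = np.array([[0.0, -1.0]])

def V_hat(x):
    # V_hat(x) = sum_j c_j * phi_j(x)
    return C_hat @ phi(x)

def u_hat(x):
    # u_hat(x) = sum_j w_j * psi_j(x), one row of W_hat per input channel
    return W_hat @ psi(x)

x = np.array([1.0, 2.0])
print(V_hat(x), u_hat(x))
```

Because both approximators are linear in their weights, identifying $\hat C_i$ and $\hat W_i$ from data becomes a linear regression problem once the basis functions are fixed.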
    
Substituting the approximations into the integral equation above, we have:

\begin{gather}
\sum_{j=1}^{N_1} \hat c_{i,j} \left[\phi_j(x(t+T)) - \phi_j(x(t)) \right]= - \int_{t}^{t+T} \left( x^T Q x +u_i^T R u_i \right) d\tau - 2 \int_{t}^{t+T} \left( \sum_{j=1}^{N_2} \hat w_{i+1,j} \psi_j(x) \right)^T R \left[u-u_i \right] d\tau
\end{gather}

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import math
import scipy as sp
import control as ct
import control.matlab as matlab
from HelperFunctions import *
from scipy.signal import butter, lfilter

%matplotlib notebook