# Reinforcement Learning to Control a Cartpole

We use two basic RL methods, _Policy Gradient (PG)_ and $Q$-_learning_, to control a cartpole environment.

## Cartpole environment
We consider the cartpole, a classical toy problem in control; see [barto83](http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf) for the cartpole dynamics. The cartpole system represents a simplified model of a harbor crane, and it is simple enough to be solved in a couple of minutes on an ordinary PC.

<center>A harbor crane</center> | <center>Cartpole</center>
- | - 
![harbor.jpg](attachment:harbor.jpg)  | ![cartpole.jpg](attachment:cartpole.jpg)
<center>Photo credit: [http://rhm.rainbowco.com.cn/](http://rhm.rainbowco.com.cn/product/141.html)</center>|<center>Photo credit: [https://gym.openai.com/](https://gym.openai.com/envs/CartPole-v0/)</center>

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every time step that the pole remains upright. We use the [cartpole environment provided by OpenAI Gym](https://gym.openai.com/envs/CartPole-v0/).

The episode ends when
* the pole is more than 15 degrees from vertical or
* the cart moves more than 2.4 units from the center or
* the pole has remained upright for 200 time steps (the episode time limit).
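
The termination conditions above can be sketched as a small predicate. This is only an illustration: the function name `episode_done` and its parameters are hypothetical, the thresholds are copied from the list above, and the angle is taken in degrees for readability, whereas the Gym environment reports the pole angle in radians and may use different internal thresholds.

```python
def episode_done(cart_position, pole_angle_deg, step_count,
                 max_angle_deg=15.0, max_position=2.4, max_steps=200):
    """Return True if any termination condition from the list above holds.

    All names and default thresholds are assumptions taken from the text,
    not read from the Gym source code.
    """
    pole_fell = abs(pole_angle_deg) > max_angle_deg        # pole too far from vertical
    cart_out_of_bounds = abs(cart_position) > max_position  # cart too far from the center
    time_limit_reached = step_count >= max_steps            # episode time limit
    return pole_fell or cart_out_of_bounds or time_limit_reached
```

For instance, `episode_done(0.0, 3.0, 50)` is `False`, while `episode_done(2.5, 0.0, 50)` is `True` because the cart has left the allowed range.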

### Reward 
In each step, the cartpole environment returns an immediate reward $r_t$:

\begin{align}
r_t = \begin{cases}
1,\quad \text{if the pendulum is upright}\\
0,\quad \text{otherwise}
\end{cases}
\end{align}

_It is up to us how to define the total reward to maximize:_

\begin{align}
R = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]
\end{align}
For $\gamma = 1$, this is the total reward, and for $0<\gamma<1$ it is a discounted reward.
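
As a minimal sketch, the sum inside the expectation can be computed for a single episode from its list of rewards. The function name `discounted_return` is hypothetical, and the expectation in $R$ would be approximated by averaging this quantity over many episodes.

```python
def discounted_return(rewards, gamma=1.0):
    """Compute sum_t gamma**t * rewards[t] for one episode.

    Iterating backwards avoids computing powers of gamma explicitly:
    G_t = r_t + gamma * G_{t+1}.
    """
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three steps with reward 1 each.
undiscounted = discounted_return([1, 1, 1])           # gamma = 1: plain sum
discounted = discounted_return([1, 1, 1], gamma=0.5)  # 1 + 0.5 + 0.25
```

With $\gamma = 1$ the result is the episode length (3 here), and with $\gamma = 0.5$ later rewards count less, giving 1.75.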

### Solvability Criterion
CartPole-v0 defines "solving" as achieving an average total reward of at least 195.0 over 100 consecutive trials.
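
One way to check this criterion during training is to keep the total reward of every episode and test the average over the most recent 100. The sketch below assumes a hypothetical helper `is_solved` and a plain list of per-episode total rewards; it is not part of the Gym API.

```python
def is_solved(episode_returns, window=100, threshold=195.0):
    """Return True if the mean of the last `window` episode returns
    is at least `threshold`.

    `episode_returns` is assumed to hold one total reward per episode,
    in the order the episodes were played.
    """
    if len(episode_returns) < window:
        return False  # not enough episodes yet to evaluate the criterion
    recent = list(episode_returns)[-window:]
    return sum(recent) / window >= threshold
```

Calling `is_solved(returns)` after each episode gives a simple stopping condition for a training loop.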

### Why is cartpole an interesting setup?

Cartpole is an interesting setup to study for several reasons:

* The problem is small so it can be solved in a couple of minutes.
* The time horizon of this problem is finite, so it gives us a good idea of how to solve a finite-time optimal control problem with RL.
* The state space is continuous while the action space is discrete.
* This is a classical control problem. We love to study it 😊.

## What is next?
Here is a summary of things to do/read:
* [Prepare a virtual environment](Preparation.ipynb)
* [Policy Gradient on cartpole](pg_cartpole_notebook.ipynb)
* [$Q$-learning on cartpole](q_cartpole_notebook.ipynb)
* [Experience replay $Q$-learning on cartpole](replay_q_cartpole_notebook.ipynb)