# <font color='blue'> <center> Classic Control: Control theory problems from the classic RL literature </center> </font>

<br><br>

In this notebook we present some classic environments from Reinforcement Learning research. These environments have continuous state spaces (i.e., infinitely many possible states), so tabular methods cannot be applied to them directly. To tackle these environments (and more complex ones) we have two tools:

- Extend the tabular methods with the techniques of discretization and tile coding
- Use function approximators (Neural Networks)
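
As a rough illustration of the first tool, the sketch below maps a continuous state to a tuple of bin indices that could index a Q-table. The helper names (`make_bins`, `discretize`) and the example bounds are hypothetical and chosen only for illustration; only the standard library is assumed.

```python
from bisect import bisect_right


def make_bins(low, high, n_bins):
    """Return the n_bins - 1 interior edges that split [low, high] into equal bins."""
    width = (high - low) / n_bins
    return [low + i * width for i in range(1, n_bins)]


def discretize(state, bins_per_dim):
    """Map a continuous state to a tuple of bin indices, one per dimension."""
    # bisect_right counts how many edges are <= x, which is the bin index.
    # Values outside the range fall into the first or the last bin.
    return tuple(bisect_right(edges, x) for x, edges in zip(state, bins_per_dim))


# Hypothetical 2-D state space: first dimension in [-1, 1], second in [0, 10].
bins_per_dim = [make_bins(-1.0, 1.0, 4), make_bins(0.0, 10.0, 4)]

print(discretize((0.25, 5.1), bins_per_dim))     # (2, 2)
print(discretize((-3.0, 100.0), bins_per_dim))   # (0, 3)
```

The resulting tuple can be used directly as a dictionary key or as an index into a multi-dimensional array of action values. Finer bins give a better approximation at the cost of a larger table.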

<br>

## <font color='#2874A6'> Table of Contents </font>

1. [Modules](#1)
2. [Auxiliary Functions](#2)
3. [Examples](#3)
    - 3.1. [CartPole](#3.1)
    - 3.2. [Acrobot](#3.2)
    - 3.3. [Mountain Car](#3.3)
    - 3.4. [Pendulum](#3.4)

<a name="1"></a>
## <font color='#0E6655'> 1. Modules </font> 

In [2]:
#!pip install -qq gym==0.23.0


import matplotlib
from matplotlib import animation
from IPython.display import HTML

import gym
import numpy as np
from IPython import display
from matplotlib import pyplot as plt
%matplotlib inline

<a name="2"></a>
## <font color='#0E6655'> 2. Auxiliary Functions </font> 

In [3]:
def display_video(frames):
    # Copied from: https://colab.research.google.com/github/deepmind/dm_control/blob/master/tutorial.ipynb
    orig_backend = matplotlib.get_backend()
    matplotlib.use('Agg')
    fig, ax = plt.subplots(1, 1, figsize=(5, 5))
    matplotlib.use(orig_backend)
    ax.set_axis_off()
    ax.set_aspect('equal')
    ax.set_position([0, 0, 1, 1])
    im = ax.imshow(frames[0])
    def update(frame):
        im.set_data(frame)
        return [im]
    anim = animation.FuncAnimation(fig=fig, func=update, frames=frames,
                                    interval=50, blit=True, repeat=False)
    return HTML(anim.to_html5_video())


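def test_env(environment, episodes=10):
    """Run random-action episodes and return a video of the rendered frames."""
    frames = []
    for episode in range(episodes):
        state = environment.reset()
        done = False
        # Capture the initial frame before any action is taken.
        frames.append(environment.render(mode="rgb_array"))

        while not done:
            # Random policy: sample an action from the action space.
            action = environment.action_space.sample()
            # Gym 0.23 API: step returns (observation, reward, done, info).
            next_state, reward, done, extra_info = environment.step(action)
            img = environment.render(mode="rgb_array")
            frames.append(img)
            state = next_state

    return display_video(frames)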

<a name="3"></a>
## <font color='#0E6655'> 3. Examples </font> 

<a name="3.1"></a>
### <font color='green'> 3.1. Cart Pole </font> 

The cart moves horizontally and the goal is to keep the pole upright.

The episode ends when the pole tilts past the maximum allowed angle.

In [4]:
env = gym.make('CartPole-v1')
test_env(env, 1)
env.close()

##### The state

The state of the CartPole task is a vector of four real numbers: the cart position, the cart velocity, the pole angle (in radians), and the pole angular velocity:

        Num     Observation               Min                     Max
        0       Cart Position             -4.8                    4.8
        1       Cart Velocity             -Inf                    Inf
        2       Pole Angle                -0.418 rad (-24 deg)    0.418 rad (24 deg)
        3       Pole Angular Velocity     -Inf                    Inf

In [5]:
env.observation_space

Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)

##### The actions available

We can perform two actions in this environment: push the cart to the left or push it to the right.

        0     Push cart to the left.
        1     Push cart to the right.

In [6]:
env.action_space

Discrete(2)

The reward is positive (+1) for every step the pole stays up, so the agent is encouraged to keep it vertical for as long as possible.

<a name="3.2"></a>
### <font color='green'> 3.2. Acrobot </font> 

Acrobot is a swinging double pendulum; the goal is to swing its free end up to a horizontal line located above it.

In [22]:
env = gym.make('Acrobot-v1')
test_env(env, 1)
env.close()

##### The state

The states of the Acrobot task are represented by a vector of six real numbers. The first two are the cosine and sine of the first joint angle. The next two are the cosine and sine of the second joint angle. The last two are the angular velocities of each joint.
    
$\cos(\theta_1), \sin(\theta_1), \cos(\theta_2), \sin(\theta_2), \dot\theta_1, \dot\theta_2$

In [8]:
env.observation_space

Box([ -1.        -1.        -1.        -1.       -12.566371 -28.274334], [ 1.        1.        1.        1.       12.566371 28.274334], (6,), float32)
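
Encoding each angle as a cosine/sine pair avoids the jump that a raw angle would have when it wraps around at $\pm\pi$. If the angles themselves are needed (for example, before discretizing), they can be recovered with `atan2`. The sketch below assumes the observation layout listed above; the helper name `acrobot_angles` is hypothetical.

```python
import math


def acrobot_angles(observation):
    """Recover (theta1, theta2) in radians from a six-component Acrobot observation."""
    cos1, sin1, cos2, sin2, _vel1, _vel2 = observation
    # atan2 returns an angle in [-pi, pi] with the correct quadrant.
    theta1 = math.atan2(sin1, cos1)
    theta2 = math.atan2(sin2, cos2)
    return theta1, theta2


# Build an observation by hand from known angles and check the round trip.
obs = (math.cos(0.3), math.sin(0.3), math.cos(-2.0), math.sin(-2.0), 0.0, 0.0)
theta1, theta2 = acrobot_angles(obs)
print(round(theta1, 6), round(theta2, 6))
```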

##### The actions available

We can perform three actions in this environment:

    0    Apply -1 torque on the joint between the links.
    1    Apply 0 torque on the joint between the links.
    2    Apply +1 torque on the joint between the links.

In [11]:
env.action_space

Discrete(3)

The reward is negative (-1 per step) until the goal is reached.

<a name="3.3"></a>
### <font color='green'> 3.3. MountainCar: Reach the goal from the bottom of the valley</font> 

The goal is for the car to reach the flag that marks the goal.

In [13]:
env = gym.make('MountainCar-v0')
test_env(env, 1)
env.close()

##### The state

The observation space consists of the car position $\in [-1.2, 0.6]$ and car velocity $\in [-0.07, 0.07]$

In [15]:
env.observation_space

Box([-1.2  -0.07], [0.6  0.07], (2,), float32)
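
Because both state variables are bounded, the tile coding mentioned in the introduction can be applied directly to these limits. The following is a minimal sketch, not a reference implementation: it uses several uniform grids, each shifted by a fraction of one tile width, and the function name and default parameters (`tile_indices`, `n_tilings`, `tiles_per_dim`) are hypothetical.

```python
def tile_indices(state, lows, highs, n_tilings=4, tiles_per_dim=8):
    """Return one tuple (tiling, i, j, ...) of tile coordinates per tiling."""
    active = []
    for t in range(n_tilings):
        coords = [t]
        for x, lo, hi in zip(state, lows, highs):
            tile_width = (hi - lo) / tiles_per_dim
            # Each tiling is shifted by a different fraction of one tile width.
            offset = t * tile_width / n_tilings
            idx = int((x - lo + offset) // tile_width)
            # Because of the shift, indices run from 0 to tiles_per_dim inclusive.
            coords.append(min(max(idx, 0), tiles_per_dim))
        active.append(tuple(coords))
    return active


# MountainCar bounds taken from the observation space shown above.
lows, highs = (-1.2, -0.07), (0.6, 0.07)

near_a = set(tile_indices((-0.50, 0.010), lows, highs))
near_b = set(tile_indices((-0.49, 0.011), lows, highs))
far = set(tile_indices((0.40, -0.050), lows, highs))

print(len(near_a & near_b) > 0)  # nearby states share tiles
print(len(near_a & far) == 0)    # distant states share none
```

Nearby states activate overlapping sets of tiles, which is what lets a linear value function over these features generalize between them.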

##### The actions available


There are three available actions:

    0    Accelerate to the left.
    1    Don't accelerate.
    2    Accelerate to the right.

In [16]:
env.action_space

Discrete(3)
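
To make the three actions concrete, the sketch below simulates the car without Gym. It assumes the update rule and constants of Gym's `MountainCar-v0` implementation (force 0.001, gravity 0.0025, goal at position 0.5); treat these values as assumptions rather than a specification. It illustrates why always accelerating to the right from the bottom of the valley is not enough to reach the goal.

```python
import math

# Constants assumed from Gym's MountainCar-v0 implementation.
FORCE, GRAVITY = 0.001, 0.0025
MIN_POS, MAX_POS, MAX_SPEED, GOAL_POS = -1.2, 0.6, 0.07, 0.5


def mountain_car_step(position, velocity, action):
    """Advance the car one step; action is 0 (left), 1 (none) or 2 (right)."""
    # (action - 1) turns the action index into a direction in {-1, 0, +1}.
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POS, min(MAX_POS, position))
    # The car stops when it hits the left wall.
    if position == MIN_POS and velocity < 0:
        velocity = 0.0
    return position, velocity


# Start near the bottom of the valley and always accelerate to the right.
position, velocity = -0.5, 0.0
max_position = position
for _ in range(200):
    position, velocity = mountain_car_step(position, velocity, action=2)
    max_position = max(max_position, position)

reached_goal = max_position >= GOAL_POS
print(reached_goal)
```

The engine is weaker than gravity on the steep part of the slope, so the car has to swing back and forth to build up momentum before it can climb to the goal.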

The reward is negative (-1 per step) until the car reaches the goal.

<a name="3.4"></a>
### <font color='green'> 3.4. Pendulum: swing it up and keep it upright</font> 

The goal is to swing the pendulum up and keep it vertical.

It is the only environment covered here that has continuous actions.


In [17]:
env = gym.make('Pendulum-v1')
test_env(env, 1)
env.close()

##### The state

The state is represented by a vector of three values: $\cos(\theta)$, $\sin(\theta)$ and the angular velocity $\dot\theta$ ($\theta$ is the angle of the pendulum).

In [18]:
env.observation_space

Box([-1. -1. -8.], [1. 1. 8.], (3,), float32)

##### The actions available

The action is a real number in the interval $[-2, 2]$ that represents the torque applied on the pendulum.

In [20]:
env.action_space

Box(-2.0, 2.0, (1,), float32)
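
With a continuous action space, a function approximator has to produce a value inside the torque interval. One common option, sketched below under the assumption that the policy emits a single unbounded real number, is to squash that number with `tanh` and rescale it to the bounds shown above. The helper name `to_pendulum_action` is hypothetical.

```python
import math

# Torque bounds taken from the action space shown above.
TORQUE_LOW, TORQUE_HIGH = -2.0, 2.0


def to_pendulum_action(raw_output):
    """Map an unbounded real number to a one-element action within the torque bounds."""
    squashed = math.tanh(raw_output)  # squashed into [-1, 1]
    center = (TORQUE_HIGH + TORQUE_LOW) / 2
    half_range = (TORQUE_HIGH - TORQUE_LOW) / 2
    # The action space has shape (1,), so the torque is wrapped in a list.
    return [center + half_range * squashed]


print(to_pendulum_action(0.0))  # [0.0]
print(TORQUE_LOW <= to_pendulum_action(5.0)[0] <= TORQUE_HIGH)  # True
```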

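The reward is never positive: at each step it equals $-(\theta^2 + 0.1\,\dot\theta^2 + 0.001\,u^2)$, where $\theta$ is the angle measured from the upright position and $u$ is the applied torque. It is closest to zero while the pendulum stays upright and still.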