## In this notebook, you will code from scratch your third Reinforcement Learning agent, playing Mountain Car with Q-Learning

<img src="https://www.gymlibrary.dev/_images/mountain_car.gif" alt="Environments"/>

### 🎮 Environments:

- [Mountain Car](https://www.gymlibrary.dev/environments/classic_control/mountain_car/)


### 📚 RL-Library:

- Python and NumPy
- [Gym](https://www.gymlibrary.dev/)

## Install dependencies and create a virtual display 🔽


In [None]:
!pip install gym==0.24
!pip install pygame
!pip install numpy

!pip install huggingface_hub
!pip install pickle5
!pip install pyyaml==6.0
!pip install imageio
!pip install imageio_ffmpeg
!pip install pyglet==1.5.1
!pip install tqdm

In [None]:
%%capture
!apt update
!apt install ffmpeg xvfb
!pip install xvfbwrapper
!pip install pyvirtualdisplay

To make sure the newly installed libraries are used, **it is sometimes necessary to restart the notebook runtime**. The next cell, once uncommented, forces the **runtime to crash, so you will need to reconnect and run the code from here on**.

In [None]:
#import os
#os.kill(os.getpid(), 9)

## Importing packages 📦

In addition to the installed libraries, we also use:

- `random`: to generate random numbers (useful for the epsilon-greedy policy).
- `imageio`: to generate a replay video.
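
As a minimal sketch of the epsilon-greedy policy mentioned above, the snippet below picks a random action with probability `epsilon` and the highest-valued action otherwise. The function name `epsilon_greedy_action` and the sample Q-values are hypothetical, chosen only for illustration; they are not part of Gym.

```python
import random

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: any action index, uniformly at random
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value (first one on ties)
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical Q-values for the three Mountain Car actions
q_values = [-1.5, -0.2, -0.9]
print(epsilon_greedy_action(q_values, epsilon=0.0))  # always greedy: prints 1
```

With `epsilon=1.0` every action is random, which is how training starts later in this notebook.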

In [None]:
import gym
import tensorflow as tf
import numpy as np
from tensorflow import keras

from collections import deque
import time
import random

from IPython import display
from IPython.display import HTML
import pygame
from base64 import b64encode
import matplotlib.pyplot as plt
import imageio
from time import sleep
import tqdm
from tqdm.notebook import tqdm
import os

In [None]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

In [None]:
RANDOM_SEED = 5
tf.random.set_seed(RANDOM_SEED)

In [None]:
env = gym.make('MountainCar-v0')
env.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

In [None]:
print("Action Space: {}".format(env.action_space))
print("State space: {}".format(env.observation_space))

### Check the environment:


In [None]:
print("_____OBSERVATION SPACE_____ \n")
print("Observation Space", env.observation_space)
print("Sample observation", env.observation_space.sample()) # Get a random observation

In [None]:
print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", env.action_space.n)
print("Action Space Sample", env.action_space.sample()) # Take a random action

In [None]:
state_space = env.observation_space
# The observation space is a continuous Box, so its states cannot be counted
print("The observation space is ", state_space)

action_space = env.action_space.n
print("There are ", action_space, " possible actions")

Looking more closely at the observation space, the environment documentation gives the following description:

Given an action, the mountain car follows the following transition dynamics:

$$\text{velocity}_{t+1} = \text{velocity}_t + (\text{action} - 1) \cdot \text{force} - \cos(3 \cdot \text{position}_t) \cdot \text{gravity}$$

$$\text{position}_{t+1} = \text{position}_t + \text{velocity}_{t+1}$$

where force = 0.001 and gravity = 0.0025. The collisions at either end are inelastic with the velocity set to 0 upon collision with the wall. The position is clipped to the range [-1.2, 0.6] and velocity is clipped to the range [-0.07, 0.07].
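
To make the dynamics concrete, here is a small sketch that applies the two equations and the clipping rules exactly as described in the text above. The function name `mountain_car_step` is hypothetical, and this is not Gym's own implementation; it only mirrors the description, setting the velocity to 0 whenever the position hits either bound.

```python
import math

FORCE = 0.001
GRAVITY = 0.0025
MIN_POSITION, MAX_POSITION = -1.2, 0.6
MAX_SPEED = 0.07

def mountain_car_step(position, velocity, action):
    """Advance one time step; action is 0 (left), 1 (no push) or 2 (right)."""
    velocity += (action - 1) * FORCE - math.cos(3 * position) * GRAVITY
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))        # clip velocity
    position += velocity
    position = max(MIN_POSITION, min(MAX_POSITION, position))   # clip position
    if position in (MIN_POSITION, MAX_POSITION):
        velocity = 0.0  # inelastic collision with a wall, as the text describes
    return position, velocity

# Push right from near the valley floor, starting at rest
print(mountain_car_step(-0.5, 0.0, action=2))
```

Because the push (0.001) is smaller than the maximum pull of gravity (0.0025), the car cannot simply drive up the hill and has to build momentum first.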


In [None]:
min_v = env.observation_space.low
print("Min possible values ", min_v)

In [None]:
max_v = env.observation_space.high
print("Max possible values ", max_v)

In [None]:
state = env.reset()

In [None]:
env.step(env.action_space.sample())
img = env.render(mode='rgb_array')

In [None]:
plt.imshow(img)

## The main challenge of this notebook is finding a way to discretize the car's velocity and position into states so we can build our Q-table
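
One possible approach, sketched here in plain Python, is to split each dimension into equal-width bins and use the pair of bin indices as the discrete state. The bounds are the ones quoted in the environment description and the 20 x 20 grid matches `DISCRETE_OS_SIZE` used later in this notebook; the function name `discretize` is hypothetical.

```python
# Bounds from the environment description: (position, velocity)
LOW = (-1.2, -0.07)
HIGH = (0.6, 0.07)
BINS = (20, 20)  # number of bins per dimension

def discretize(state, low=LOW, high=HIGH, bins=BINS):
    """Map a continuous (position, velocity) pair to a tuple of bin indices."""
    indices = []
    for value, lo, hi, n in zip(state, low, high, bins):
        width = (hi - lo) / n                 # width of one bin
        index = int((value - lo) / width)     # which bin the value falls into
        indices.append(min(max(index, 0), n - 1))  # keep the index inside the grid
    return tuple(indices)

print(discretize((-0.5, 0.01)))  # (7, 11)
print(discretize((0.6, 0.07)))   # upper bounds land in the last bin: (19, 19)
```

With this grid the Q-table needs 20 x 20 x 3 entries, one per (position bin, velocity bin, action) triple.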


Validation process
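
The training cell below relies on the tabular Q-learning update. As a sketch of a single update in isolation, using the same `LEARNING_RATE` and `DISCOUNT` values as the notebook, the snippet below blends the old estimate with the bootstrapped target. The function name `q_update` and the sample numbers are hypothetical.

```python
LEARNING_RATE = 0.1
DISCOUNT = 0.95

def q_update(current_q, reward, max_future_q, lr=LEARNING_RATE, gamma=DISCOUNT):
    """Return the new Q-value after one Q-learning step."""
    target = reward + gamma * max_future_q       # bootstrapped estimate of the return
    return (1 - lr) * current_q + lr * target    # move a fraction lr toward the target

# Mountain Car gives a reward of -1 on every step
new_q = q_update(current_q=-1.0, reward=-1.0, max_future_q=-0.5)
print(round(new_q, 4))  # -1.0475
```

A small learning rate means each visit nudges the estimate only slightly, which helps smooth out the noise from exploration.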

In [None]:
import gym
import numpy as np
from IPython.display import display, HTML
from base64 import b64encode
import imageio

# Settings and constants
LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 5000
SHOW_EVERY = 500
STATS_EVERY = 1

# Epsilon decay variables
epsilon = 1
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2
epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

# Environment initialization
env = gym.make("MountainCar-v0")
env.reset()

# Q-table initialization
DISCRETE_OS_SIZE = [20, 20]
discrete_os_win_size = (env.observation_space.high - env.observation_space.low) / DISCRETE_OS_SIZE
q_table = np.random.uniform(low=-2, high=0, size=(DISCRETE_OS_SIZE + [env.action_space.n]))

# Statistics variables
ep_rewards = []
aggr_ep_rewards = {'ep': [], 'avg': [], 'min': [], 'max': []}

# List that stores the rendered frames
images = []

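def get_discrete_state(state):
    """Map a continuous state to a tuple of discrete bin indices."""
    discrete_state = (state - env.observation_space.low) / discrete_os_win_size
    # np.int was removed in NumPy 1.24; the builtin int is the replacement
    indices = discrete_state.astype(int)
    # Clip so a value exactly at the upper bound falls in the last bin
    return tuple(np.clip(indices, 0, np.array(DISCRETE_OS_SIZE) - 1))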

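def play_episode(images, video_path):
    """Encode the collected frames into a video file at video_path."""
    # macro_block_size=1 keeps the original frame size instead of resizing it
    imageio.mimsave(video_path, [np.array(img) for img in images], fps=15, macro_block_size=1)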

for episode in range(EPISODES):
    episode_reward = 0
    discrete_state = get_discrete_state(env.reset())
    done = False

    if episode % SHOW_EVERY == 0:
        render = True
        print(f"Episode: {episode}")
    else:
        render = False

    while not done:
        if render:
            images.append(env.render(mode='rgb_array'))

        # Epsilon-greedy: exploit the Q-table with probability 1 - epsilon, otherwise explore
        if np.random.random() > epsilon:
            action = np.argmax(q_table[discrete_state])
        else:
            action = np.random.randint(0, env.action_space.n)

        new_state, reward, done, _ = env.step(action)
        episode_reward += reward

        new_discrete_state = get_discrete_state(new_state)

        if not done:
            # Q-learning update: blend the old estimate with the bootstrapped target
            max_future_q = np.max(q_table[new_discrete_state])
            current_q = q_table[discrete_state + (action, )]
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
            q_table[discrete_state + (action, )] = new_q

        elif new_state[0] >= env.goal_position:
            # Goal reached: 0 is the best possible value, since every step is rewarded with -1
            q_table[discrete_state + (action, )] = 0

        discrete_state = new_discrete_state

    if render:
        video_path = "replay.mp4"
        play_episode(images, video_path)
        with open(video_path, 'rb') as f:
            mp4 = f.read()
        data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
        display(HTML(f"""
        <video width=400 controls>
              <source src="{data_url}" type="video/mp4">
        </video>
        """))
        images = []

    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_decay_value

    ep_rewards.append(episode_reward)

    if episode % STATS_EVERY == 0:
        average_reward = sum(ep_rewards[-STATS_EVERY:])/STATS_EVERY
        aggr_ep_rewards['ep'].append(episode)
        aggr_ep_rewards['avg'].append(average_reward)
        aggr_ep_rewards['min'].append(min(ep_rewards[-STATS_EVERY:]))
        aggr_ep_rewards['max'].append(max(ep_rewards[-STATS_EVERY:]))
        print(f"Episode: {episode}, average: {average_reward}, min: {min(ep_rewards[-STATS_EVERY:])}, max: {max(ep_rewards[-STATS_EVERY:])}")

env.close()
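
The linear epsilon schedule used in the training loop can also be written in closed form. The sketch below reuses the notebook's constants; the function name `epsilon_at` is hypothetical, and, unlike the loop (whose inclusive bounds apply one extra decrement and let epsilon dip slightly below 0), it clamps the result at 0.

```python
EPISODES = 5000
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2

def epsilon_at(episode, start_epsilon=1.0):
    """Epsilon in effect during `episode` under a clamped linear decay."""
    span = END_EPSILON_DECAYING - START_EPSILON_DECAYING
    decay = start_epsilon / span
    # Number of decrements already applied before this episode starts
    steps = min(max(episode - START_EPSILON_DECAYING, 0), span)
    return max(0.0, start_epsilon - steps * decay)

print(epsilon_at(0))     # 1.0, fully random at the start
print(epsilon_at(1250))  # roughly halfway through the decay
print(epsilon_at(4000))  # decay finished, the agent acts greedily
```

Ending the decay at `EPISODES // 2` leaves the second half of training for refining the greedy policy.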


In [None]:
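video_path = "replay.mp4"
# `images` is emptied after each rendered episode, so re-encode only if new frames exist;
# otherwise show the replay.mp4 written by the last rendered episode
if images:
    play_episode(images, video_path)
with open(video_path, 'rb') as f:
    mp4 = f.read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
display(HTML(f"""
<video width=400 controls>
      <source src="{data_url}" type="video/mp4">
</video>
"""))
images = []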