# Reinforcement Learning
Let's describe the "taxi problem". We want to build a self-driving taxi that can pick up passengers at one of a set of fixed locations, drop them off at another location, and get there in the shortest possible time while avoiding obstacles.

OpenAI Gym lets us create this environment quickly:

In [7]:
import gym # OpenAI Gym, which provides the Taxi-v3 environment
import random # Python's random number generator

random.seed(1234)

streets = gym.make("Taxi-v3").env
streets.reset()
print(streets.render())

+---------+
|R: | : :[34;1m[43mG[0m[0m|
| : | : : |
| : : : : |
| | : | : |
|Y| : |[35mB[0m: |
+---------+

None


Let's break down what we're seeing here:

- R, G, B, and Y are pickup or dropoff locations.
- The BLUE letter indicates where we need to pick someone up from.
- The MAGENTA letter indicates where that passenger wants to go to.
- The solid lines represent walls that the taxi cannot cross.
- The filled rectangle represents the taxi itself - it's yellow when empty, and green when carrying a passenger.

Our little world here, which we've called "streets", is a 5x5 grid. The state of this world at any time can be defined by:

- Where the taxi is (one of 5x5 = 25 locations)
- What the current destination is (4 possibilities)
- Where the passenger is (5 possibilities: at one of the destinations, or inside the taxi)

So there are a total of 25 x 4 x 5 = 500 possible states that describe our world.
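
The state count can be checked with a few lines of arithmetic. The sketch below also packs the components into a single integer, taxi row and column first, then passenger location, then destination, which is how Taxi-v3's `encode` method is understood to work. The helper name `encode_state` and the constants are hypothetical and are not part of Gym.

```python
GRID_ROWS, GRID_COLS = 5, 5
PASSENGER_LOCATIONS = 5  # four pickup spots plus "inside the taxi"
DESTINATIONS = 4


def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four components into one integer (hypothetical helper)."""
    index = taxi_row
    index = index * GRID_COLS + taxi_col
    index = index * PASSENGER_LOCATIONS + passenger_loc
    index = index * DESTINATIONS + destination
    return index


total_states = GRID_ROWS * GRID_COLS * PASSENGER_LOCATIONS * DESTINATIONS

# Enumerate every combination to confirm each one gets a distinct index.
all_indices = {
    encode_state(r, c, p, d)
    for r in range(GRID_ROWS)
    for c in range(GRID_COLS)
    for p in range(PASSENGER_LOCATIONS)
    for d in range(DESTINATIONS)
}

print(total_states)      # 500
print(len(all_indices))  # 500
```

Because every combination maps to a distinct integer from 0 to 499, a state can be used directly as a row index into a table.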

For each state, there are six possible actions:

- Move South, East, North, or West
- Pick up a passenger
- Drop off a passenger

Q-Learning will take place using the following rewards and penalties at each state:

- A successful drop-off yields +20 points
- Every time step yields a -1 point penalty, whether or not a passenger is on board, so shorter trips score higher
- Picking up or dropping off at an illegal location yields a -10 point penalty
- Moving across a wall just isn't allowed at all.
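
To see how these numbers add up over a whole trip, here is a small sketch that totals the reward for a hypothetical sequence of outcomes. The outcome labels and the `trip_return` helper are made up for illustration, and the sketch assumes the successful drop-off step earns +20 in place of the usual -1.

```python
# Reward for each kind of outcome, taken from the list above.
REWARDS = {
    "step": -1,
    "illegal_pickup_or_dropoff": -10,
    "successful_dropoff": 20,
}


def trip_return(outcomes):
    """Sum the rewards of a sequence of outcome labels (hypothetical helper)."""
    return sum(REWARDS[outcome] for outcome in outcomes)


# Nine ordinary time steps followed by the drop-off.
clean_trip = ["step"] * 9 + ["successful_dropoff"]
# The same trip with one illegal drop-off attempt along the way.
sloppy_trip = ["step"] * 9 + ["illegal_pickup_or_dropoff", "successful_dropoff"]

print(trip_return(clean_trip))   # 11
print(trip_return(sloppy_trip))  # 1
```

A single mistake costs as much as ten ordinary steps, which is what pushes the learned policy toward short, error-free routes.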

Let's define an initial state, with the taxi at location (2, 3), the passenger at pickup location 2, and the destination at location 0:

In [2]:
initial_state = streets.encode(2, 3, 2, 0) # taxi at row 2, column 3; passenger at location 2; destination at location 0

streets.s = initial_state # put the environment into that state

streets.render() # draw the environment

+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : :[43m [0m: |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+



In [3]:
streets.P[initial_state]
# the dictionary below shows the possible actions the taxi can take in this state

{0: [(1.0, 368, -1, False)],
 1: [(1.0, 168, -1, False)],
 2: [(1.0, 288, -1, False)],
 3: [(1.0, 248, -1, False)],
 4: [(1.0, 268, -10, False)],
 5: [(1.0, 268, -10, False)]}

Here's how to interpret this - each row corresponds to a potential action at this state, in this order: move South, North, East, or West, pick up, or drop off. The four values in each row are the probability of that outcome, the next state that results from that action, the reward for that action, and whether that action ends the run with a successful dropoff.

So for example, moving North from this state (action 1) would put us into state number 168, incur a penalty of -1 for taking up time, and does not result in a successful dropoff.
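
The same reading can be done in code. The sketch below copies the dictionary printed above into a plain Python literal and labels each row; the `ACTION_NAMES` list and the `describe` helper are illustrative names, not part of Gym.

```python
# Action indices in the order Taxi-v3 uses them.
ACTION_NAMES = ["South", "North", "East", "West", "Pickup", "Dropoff"]

# Copied from the output of streets.P[initial_state] above.
transitions = {
    0: [(1.0, 368, -1, False)],
    1: [(1.0, 168, -1, False)],
    2: [(1.0, 288, -1, False)],
    3: [(1.0, 248, -1, False)],
    4: [(1.0, 268, -10, False)],
    5: [(1.0, 268, -10, False)],
}


def describe(table):
    """Return one readable line per (action, outcome) pair."""
    lines = []
    for action, outcomes in table.items():
        for probability, next_state, reward, done in outcomes:
            lines.append(
                f"{ACTION_NAMES[action]:<8} p={probability} "
                f"-> state {next_state}, reward {reward}, done={done}"
            )
    return lines


for line in describe(transitions):
    print(line)
```

Note that the pickup and dropoff rows lead back to state 268, the state we started in: an illegal pickup or dropoff costs 10 points and changes nothing else.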

So, let's do Q-learning! First we need to train our model. At a high level, we'll train over 10,000 simulated taxi runs. For each run, we'll step through time, with a 10% chance at each step of making a random, exploratory step instead of using the learned Q values to guide our actions.
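
Before running the full loop, it may help to see its two building blocks in isolation: the exploratory coin flip and a single Q-value update. This is a minimal pure-Python sketch, not the notebook's own code; the function names, the list-based table, and the learning rate of 0.1 and discount factor of 0.6 are assumptions chosen for illustration.

```python
import random


def choose_action(q_row, exploration, rng):
    """Epsilon-greedy choice: explore with probability `exploration`."""
    if rng.random() < exploration:
        return rng.randrange(len(q_row))  # random exploratory action
    # Otherwise exploit: index of the largest Q-value in this row.
    return max(range(len(q_row)), key=lambda a: q_row[a])


def q_update(q_table, state, action, reward, next_state,
             learning_rate, discount_factor):
    """Apply one Q-learning update in place and return the new value."""
    prev_q = q_table[state][action]
    next_max_q = max(q_table[next_state])
    new_q = ((1 - learning_rate) * prev_q
             + learning_rate * (reward + discount_factor * next_max_q))
    q_table[state][action] = new_q
    return new_q


rng = random.Random(0)
q_table = [[0.0] * 6 for _ in range(500)]  # 500 states x 6 actions, all zeros

action = choose_action(q_table[268], exploration=0.1, rng=rng)

# One update for "move West from state 268 to state 248, reward -1".
new_q = q_update(q_table, 268, 3, -1, 248,
                 learning_rate=0.1, discount_factor=0.6)
print(new_q)  # approximately -0.1
```

With an all-zero table, the first update simply moves the entry one tenth of the way toward the observed reward of -1. Repeating this over many runs is what lets the future +20 drop-off reward propagate back to earlier states.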

In [4]:
import numpy as np

q_table = np.zeros([streets.observation_space.n, streets.action_space.n]) # Q-table: one row per state, one column per action, initialized to zeros

learning_rate = 0.1 # how far each update moves the old Q-value toward the new estimate
discount_factor = 0.6 # how much future rewards count relative to immediate ones
exploration = 0.1 # chance of choosing a random action instead of the best known one
epochs = 10000 # number of simulated taxi runs

for taxi_run in range(epochs):
    state = streets.reset() # start each run from a fresh random state
    done = False
    
    while not done:
        random_value = random.uniform(0, 1) # draw a random value between 0 and 1
        if (random_value < exploration): # explore: take a random action
            action = streets.action_space.sample()
        else: # exploit: take the best known action for this state
            action = np.argmax(q_table[state])
        
        # step() executes the chosen action and returns the next state, the reward, and whether the run is over
        next_state, reward, done, info = streets.step(action)

        prev_q = q_table[state, action] # current Q-value for this state and action
        next_max_q = np.max(q_table[next_state]) # best Q-value available from the next state
        new_q = (1 - learning_rate) * prev_q + learning_rate * (reward + discount_factor * next_max_q) # blend the old value with the new estimate
        q_table[state, action] = new_q # store the updated Q-value in the table

        state = next_state

So now we have a table of Q-values that can be quickly used to determine the optimal next step for any given state! Let's check the table for our initial state above:

In [5]:
q_table[initial_state]

array([-2.38438861, -2.40342946, -2.42231272, -2.3639511 , -8.9724024 ,
       -7.31330384])

The highest Q-value here corresponds to the action "go West", which makes sense - that's the most direct route toward the passenger waiting at Y from that point. It seems to work! Let's see it in action!
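
Reading the preferred action off a Q-table row is a single argmax, which is what the animation loop relies on at every step. Here is a minimal sketch using the values printed above; the `ACTION_NAMES` list is an illustrative label set, assuming the action order South, North, East, West, pickup, dropoff.

```python
ACTION_NAMES = ["South", "North", "East", "West", "Pickup", "Dropoff"]

# Copied from the output of q_table[initial_state] above.
q_values = [-2.38438861, -2.40342946, -2.42231272,
            -2.3639511, -8.9724024, -7.31330384]

# The greedy action is the index of the largest Q-value.
best = max(range(len(q_values)), key=lambda a: q_values[a])

# Rank all actions from most to least promising.
ranked = sorted(zip(ACTION_NAMES, q_values), key=lambda pair: pair[1], reverse=True)

print(ACTION_NAMES[best])  # West
for name, value in ranked:
    print(f"{name:<8} {value:.4f}")
```

The four movement actions score close together, while pickup and dropoff sit far below them, reflecting the -10 penalty for attempting either one where it is not allowed.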

In [None]:
from IPython.display import clear_output
from time import sleep

# Lets us watch the taxi move through the environment
for tripnum in range(1, 11):
    state = streets.reset()

    done = False

    while not done:
        action = np.argmax(q_table[state])
        next_state, reward, done, info = streets.step(action)
        clear_output(wait=True) # clear the previous frame (for smooth animation)
        print("Trip number " + str(tripnum))
        # Render and print the visual state of the environment;
        # mode='ansi' returns a string with a text representation of the environment
        print(streets.render(mode='ansi'))
        sleep(.5)
        # Advance the current state to the next state
        state = next_state

    # Pause for 2 seconds between trips
    sleep(2)

Trip number 4
+---------+
|R: | : :[35m[34;1m[43mG[0m[0m[0m|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

