# Reinforcement Learning


We will use the gym library to solve a shortest-route problem: a taxi must pick up a passenger and take them to the requested destination.

In [45]:
import gym
import random

random.seed(1234)

streets = gym.make("Taxi-v3", render_mode='ansi').env #New versions keep getting released; if -v3 doesn't work, try -v2 or -v4
streets.reset()
print("\n" + streets.render())


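+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(Colors are lost in plain text: here R is shown in blue, G in magenta, and the taxi is the yellow cell at the right end of the third row.)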




## Rules

Let's break down what we're seeing here:

-  R, G, B, and Y are pickup or dropoff locations.
-  The BLUE letter indicates where we need to pick someone up from.
-  The MAGENTA letter indicates where that passenger wants to go to.
-  The solid lines represent walls that the taxi cannot cross.
-  The filled rectangle represents the taxi itself - it's yellow when empty, and green when carrying a passenger.

Our little world here, which we've called "streets", is a 5x5 grid. The state of this world at any time can be defined by:

-  Where the taxi is (one of 5x5 = 25 locations)
-  What the current destination is (4 possibilities)
-  Where the passenger is (5 possibilities: at one of the destinations, or inside the taxi)

So there are a total of 25 x 4 x 5 = 500 possible states that describe our world.
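The count can be checked with a small sketch that packs the state components into a single index. The `encode` and `decode` helpers below are illustrative re-implementations, not the library's own code. The ordering (taxi row, taxi column, passenger location, destination) is an assumption chosen to match the `encode` call used later in this notebook.

```python
# Illustrative mixed-radix encoding of a Taxi state (assumed layout, see above).
N_ROWS, N_COLS, N_PASSENGER_LOCS, N_DESTINATIONS = 5, 5, 5, 4

def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four components into one integer in [0, 500)."""
    index = taxi_row
    index = index * N_COLS + taxi_col
    index = index * N_PASSENGER_LOCS + passenger_loc
    index = index * N_DESTINATIONS + destination
    return index

def decode(index):
    """Invert encode(): recover (taxi_row, taxi_col, passenger_loc, destination)."""
    index, destination = divmod(index, N_DESTINATIONS)
    index, passenger_loc = divmod(index, N_PASSENGER_LOCS)
    taxi_row, taxi_col = divmod(index, N_COLS)
    return taxi_row, taxi_col, passenger_loc, destination

n_states = N_ROWS * N_COLS * N_PASSENGER_LOCS * N_DESTINATIONS
print(n_states)            # 500
print(encode(2, 3, 2, 0))  # 268
print(decode(268))         # (2, 3, 2, 0)
```

Every combination of components maps to a distinct integer, which is what lets a Q-table use the state directly as a row index.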

For each state, there are six possible actions:

-  Move South, East, North, or West
-  Pick up a passenger
-  Drop off a passenger

Q-Learning will take place using the following rewards and penalties at each state:

-  A successful drop-off yields +20 points
-  Every other time step yields a -1 point penalty, whether or not a passenger is on board
-  Picking up or dropping off at an illegal location yields a -10 point penalty

Moving across a wall just isn't allowed at all.
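As a worked example of how these rewards add up, the sketch below totals the score for one trip. The `trip_score` helper is hypothetical. It assumes that every action other than the final drop-off costs 1 point, and that each illegal pickup or drop-off costs 10 points instead of 1.

```python
def trip_score(n_actions, n_illegal=0):
    """Total reward for a trip that ends with a successful drop-off.

    n_actions counts every action taken, including the final drop-off.
    n_illegal counts the illegal pickup/drop-off attempts among them.
    """
    ordinary_steps = n_actions - 1 - n_illegal  # each of these costs 1 point
    return 20 - ordinary_steps - 10 * n_illegal

print(trip_score(10))               # 20 - 9 = 11
print(trip_score(10, n_illegal=1))  # 20 - 8 - 10 = 2
```

Shorter trips score higher, so maximizing reward pushes the agent toward the shortest legal route.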



Set the initial state with the taxi at position (2, 3), the passenger at location 2, and the destination at location 0.

In [46]:
initial_state = streets.encode(2, 3, 2, 0)

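# Set the state on the underlying TaxiEnv. On recent gym versions `streets` is
# still a wrapper, so assigning to `streets.s` would not reach the environment.
streets.unwrapped.s = initial_state

print("\n" + streets.render())


+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+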




In [47]:
streets.P[initial_state]

{0: [(1.0, 368, -1, False)],
 1: [(1.0, 168, -1, False)],
 2: [(1.0, 288, -1, False)],
 3: [(1.0, 248, -1, False)],
 4: [(1.0, 268, -10, False)],
 5: [(1.0, 268, -10, False)]}
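Each entry above maps an action index to a list of `(probability, next_state, reward, done)` tuples. The sketch below copies the table from the output and unpacks each next state into its components. The action names and the `decode` helper are assumptions for illustration: the names are inferred from how the taxi's row and column change, and `decode` assumes the state index is a mixed-radix number over (row, column, passenger location, destination).

```python
# Transition table copied from the output above.
transitions = {
    0: [(1.0, 368, -1, False)],
    1: [(1.0, 168, -1, False)],
    2: [(1.0, 288, -1, False)],
    3: [(1.0, 248, -1, False)],
    4: [(1.0, 268, -10, False)],
    5: [(1.0, 268, -10, False)],
}

ACTION_NAMES = ["South", "North", "East", "West", "Pickup", "Dropoff"]  # assumed ordering

def decode(index):
    """Split a state index into (taxi_row, taxi_col, passenger_loc, destination)."""
    index, destination = divmod(index, 4)
    index, passenger_loc = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_loc, destination

for action, outcomes in transitions.items():
    for probability, next_state, reward, done in outcomes:
        row, col, passenger_loc, destination = decode(next_state)
        print(f"{ACTION_NAMES[action]:8s} -> taxi at ({row}, {col}), reward {reward}")
```

Under these assumptions, the four moves shift the taxi by one cell and cost 1 point each. Pickup and drop-off leave the state at 268 and cost 10 points, because the taxi is not at the passenger's location.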


Now let's train our model.

At each step we pick the action with the highest Q-value, with a small chance of taking a random action instead.
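This selection rule is commonly called epsilon-greedy. A minimal sketch, using a plain list of Q-values for a single state; the function name `choose_action` and its parameters are hypothetical, and the Q-values are made up for illustration.

```python
import random

def choose_action(q_values, exploration, rng=random):
    """Return a random action with probability `exploration`, else the best one."""
    if rng.random() < exploration:
        return rng.randrange(len(q_values))  # explore: any action, uniformly
    # exploit: index of the largest Q-value (the first one wins ties)
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [-2.4, -2.1, -2.6, -2.3, -6.9, -8.1]  # illustrative values only
print(choose_action(q_values, exploration=0.0))  # 1: with no exploration the choice is greedy
```

With `exploration=0.1`, roughly one decision in ten is random, which keeps the agent trying actions whose Q-values are still poorly estimated.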

We then take the action and compute a new Q-value from the reward it returns.
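The update blends the old estimate with the reward plus the discounted best value of the next state. A minimal sketch with scalar inputs; `q_update` is a hypothetical helper, and its default values are the ones used in the training cell of this notebook.

```python
def q_update(prev_q, reward, next_max_q, learning_rate=0.1, discount_factor=0.6):
    """One Q-learning update for a single (state, action) pair."""
    target = reward + discount_factor * next_max_q
    return (1 - learning_rate) * prev_q + learning_rate * target

# First visit to a state: every Q-value is still zero and the step costs 1 point.
print(q_update(prev_q=0.0, reward=-1, next_max_q=0.0))  # -0.1
```

A small learning rate moves each estimate only a fraction of the way toward the new target, so Q-values settle gradually over many episodes.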

In [52]:
import numpy as np

q_table = np.zeros([streets.observation_space.n, streets.action_space.n])

learning_rate = 0.1    # fraction of each new estimate blended into the old Q-value
discount_factor = 0.6  # weight given to future rewards
exploration = 0.1      # chance of taking a random action
epochs = 10000         # number of training episodes

for taxi_run in range(epochs):
    state, _ = streets.reset()  # reset() returns (observation, info)
    done = False
    while not done:
        random_value = random.uniform(0, 1)
        if random_value < exploration:
            action = streets.action_space.sample()  # explore: take a random action
        else:
            action = np.argmax(q_table[state])  # exploit: index of the highest Q-value

        # step() returns (observation, reward, terminated, truncated, info)
        next_state, reward, terminated, truncated, info = streets.step(action)
        done = terminated or truncated

        prev_q = q_table[state, action]
        next_max_q = np.max(q_table[next_state])
        new_q = (1 - learning_rate) * prev_q + learning_rate * (reward + discount_factor * next_max_q)
        q_table[state, action] = new_q

        state = next_state
        
        

In [50]:
q_table[initial_state]

array([-2.41462954, -2.40720313, -2.40565082, -2.3639511 , -6.86286427,
       -8.09578865])
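To read this row, take the index of the largest value: that is the action the greedy policy chooses in `initial_state`. The sketch copies the values from the output above. The action names are an assumption about how the environment numbers its actions.

```python
# Q-values for initial_state, copied from the output above.
q_row = [-2.41462954, -2.40720313, -2.40565082, -2.3639511, -6.86286427, -8.09578865]
ACTION_NAMES = ["South", "North", "East", "West", "Pickup", "Dropoff"]  # assumed ordering

best_action = max(range(len(q_row)), key=lambda a: q_row[a])
print(best_action, ACTION_NAMES[best_action])  # 3 West
```

The two lowest values sit at indices 4 and 5. If those are pickup and drop-off, this reflects the 10-point penalty for attempting either one away from a valid location.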

Let's look at the results.

In [51]:
from IPython.display import clear_output
from time import sleep

for tripnum in range(1, 11):
    state, _ = streets.reset()  # reset() returns (observation, info)
    done = False
    trip_length = 0

    # Follow the greedy policy, capping each trip at 25 steps
    while not done and trip_length < 25:
        action = np.argmax(q_table[state])
        # step() returns (observation, reward, terminated, truncated, info)
        next_state, reward, terminated, truncated, info = streets.step(action)
        done = terminated or truncated
        clear_output(wait=True)
        print("Trip number " + str(tripnum) + " Step " + str(trip_length))
        print(streets.render())
        sleep(.5)
        state = next_state
        trip_length += 1

    sleep(2)
    

Trip number 10 Step 9
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|[35m[34;1m[43mY[0m[0m[0m| : |B: |
+---------+
  (Dropoff)

