
# Q-Learning Algorithm in OpenAI Gym
This notebook implements the Q-learning algorithm for reinforcement learning in OpenAI Gym's Taxi environment.

## Imports and Setup


In [1]:
import numpy as np
import random # Exploration vs. exploitation
from IPython.display import clear_output
import gym # Environment
from time import sleep

The Taxi environment is a 5x5 grid in which the player controls a taxi that must pick up a passenger at one of 4 marked locations (R, G, Y, B) and deliver them to the desired location.

In [4]:
env = gym.make("Taxi-v3").env
env.render()

print('There are {} states (5x5 grid * 5 passenger locations * 4 destinations)'.format(env.observation_space.n))
print('There are {} actions (4 directions + pickup + dropoff)'.format(env.action_space.n))

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

There are 500 states (5x5 grid * 5 passenger locations * 4 destinations)
There are 6 actions (4 directions + pickup + dropoff)
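
The count of 500 follows from how Taxi-v3 packs the taxi position, passenger location, and destination into a single integer. Below is a minimal sketch of that packing in plain Python, assuming the usual ordering (row, column, passenger location, destination) and the location indices 0=R, 1=G, 2=Y, 3=B, 4=in the taxi. The names `encode_state` and `decode_state` are illustrative and are not part of the notebook.

```python
def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into one index in [0, 500)."""
    index = taxi_row                   # 5 possible rows
    index = index * 5 + taxi_col       # 5 possible columns
    index = index * 5 + passenger_loc  # 4 locations + "in the taxi"
    index = index * 4 + destination    # 4 possible destinations
    return index


def decode_state(index):
    """Invert encode_state: return (taxi_row, taxi_col, passenger_loc, destination)."""
    destination = index % 4
    index //= 4
    passenger_loc = index % 5
    index //= 5
    taxi_col = index % 5
    taxi_row = index // 5
    return taxi_row, taxi_col, passenger_loc, destination


print(encode_state(3, 1, 2, 0))  # 328
print(decode_state(410))         # (4, 0, 2, 2)
```

Decoding state 410 gives a taxi at row 4, column 0 (the Y cell), with both the passenger and the destination at Y, which is what a completed drop-off at Y looks like.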


## Variable Setup

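*   Alpha: learning rate, how strongly each update overwrites the old Q-value
*   Gamma: discount factor, how much future rewards count relative to immediate ones
*   Epsilon: probability of taking a random action instead of the best known one
*   Episode: one run from reset until the passenger is dropped off

The Q-table maps each (state, action) pair to an estimate of the expected discounted reward of taking that action in that state. It is initialized to zeros.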
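
To make the roles of alpha and gamma concrete, here is a small worked example of a single Q-value update, using the same blend of old value and new target that the training loop applies. It is a sketch only: `q_update` is a hypothetical helper, the defaults mirror the values set in this notebook (alpha = 0.1, gamma = 0.6), and the rewards assume Taxi's usual scheme of -1 per step and +20 for a successful drop-off.

```python
def q_update(q_value, reward, max_next_q, alpha=0.1, gamma=0.6):
    """Return the new Q-value after observing one transition."""
    # Target: immediate reward plus the discounted best value of the next state.
    target = reward + gamma * max_next_q
    # Move the old estimate a fraction alpha of the way toward the target.
    return (1 - alpha) * q_value + alpha * target


# Early in training every entry is 0 and an ordinary move costs -1.
print(q_update(0.0, -1, 0.0))   # -0.1

# A successful drop-off (+20) from an entry that is still 0.
print(q_update(0.0, 20, 0.0))   # 2.0
```

With alpha = 0.1 each update moves the stored value only 10% of the way toward the target, so an entry needs many visits before it settles.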



In [5]:
alpha = 0.1
gamma = 0.6
epsilon = 0.1
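nEpisodes = 10000
ckpEpisode = 1000 # Record the frames of every ckpEpisode-th episode
nCkp = nEpisodes // ckpEpisode
q = np.zeros([env.observation_space.n, env.action_space.n]) # One row per state, one column per action

frames = [[] for _ in range(nCkp)] # One list of frames per recorded episode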

## Learning

In [9]:
for episode in range(nEpisodes):
  state = env.reset() # Set environment to an initial state

  reward = 0
  terminated = False
  action = 0

  while not terminated:
    if random.uniform(0,1) < epsilon: # Explore with probability epsilon
      action = env.action_space.sample() # Take a random action
    else:
      action = np.argmax(q[state]) # Exploit: take the best known action

    nextState, reward, terminated, info = env.step(action) # Do the action

    # Update Q based on formula
    qValue = q[state, action]
    maxValue = np.max(q[nextState])
    qValue = (1 - alpha) * qValue + alpha * (reward + gamma * maxValue)
    q[state, action] = qValue

    state = nextState

    # Record every frame of each ckpEpisode-th episode
    if (episode + 1) % ckpEpisode == 0:
      epInd = (episode + 1) // ckpEpisode - 1
      frames[epInd].append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
        }
      )
  
  if((episode + 1) % ckpEpisode == 0):
    clear_output(wait=True)
    print("({}%) Episode {} of {}".format(((episode+1)/nEpisodes)*100,episode+1,nEpisodes))

print('='*30)
print('\tTraining Over!')
print('='*30)

(100.0%) Episode 10000 of 10000
	Training Over!
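
The explore/exploit choice at the top of the training loop is an epsilon-greedy policy. The sketch below isolates that step in plain Python so it can be tried without Gym. `epsilon_greedy` and the sample row of Q-values are illustrative assumptions, not taken from the trained table.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))  # Explore
    # Exploit; ties go to the lowest index, as with np.argmax.
    return max(range(len(q_row)), key=lambda a: q_row[a])


rng = random.Random(0)  # Seeded so repeated runs give the same draws
q_row = [-2.0, -1.5, -3.0, -2.5, -10.0, -10.0]  # Made-up values for one state
picks = [epsilon_greedy(q_row, 0.1, rng) for _ in range(1000)]

# Action 1 has the highest value, so it should be chosen in roughly
# 0.9 + 0.1/6 of the draws (the random branch can also land on it).
print(picks.count(1) / len(picks))
```

Setting epsilon to 0 makes the choice fully greedy.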


In [11]:
def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(0.4)
        
print_frames(frames[9])

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Timestep: 25
State: 410
Action: 5
Reward: 20
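
The action shown in each frame is an integer index. A small sketch of how a recorded frame could be printed with readable action names, assuming Taxi's usual action order (0=South, 1=North, 2=East, 3=West, 4=Pickup, 5=Dropoff). `ACTION_NAMES` and `describe_frame` are hypothetical names introduced here for illustration.

```python
# Assumed action order for Taxi-v3; index 5 matches the "(Dropoff)" label above.
ACTION_NAMES = ["South", "North", "East", "West", "Pickup", "Dropoff"]


def describe_frame(frame):
    """Format one recorded frame dict as a single readable line."""
    action_name = ACTION_NAMES[frame["action"]]
    return f"state={frame['state']} action={action_name} reward={frame['reward']}"


# The final frame printed above, without its rendered grid.
last_frame = {"state": 410, "action": 5, "reward": 20}
print(describe_frame(last_frame))  # state=410 action=Dropoff reward=20
```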


In [13]:
totalEpochs = 0
totalPenalties = 0
totalTimeoffs = 0
nEpisodes = 100

# Run the greedy policy (no exploration) for nEpisodes and count penalties and timeouts
for episode in range(nEpisodes):
  state = env.reset()
  
  epochs = 0
  penalties = 0
  timeoffs = 0
  reward = 0
  terminated = False
  limit = 500

  while not terminated:
    action = np.argmax(q[state])
    state, reward, terminated, info = env.step(action)

    if(reward == -10):
      penalties += 1

    epochs += 1
    if(epochs == limit):
      timeoffs += 1
      break

  totalPenalties += penalties
  totalEpochs += epochs
  totalTimeoffs += timeoffs

print('*'*25)
print("\tResults")
print('*'*25)
print("Epochs per episode: {}".format(totalEpochs / nEpisodes))
print("Penalties per episode: {}".format(totalPenalties / nEpisodes))
print("Timeoffs per episode: {}".format(totalTimeoffs / nEpisodes))

*************************
	Results
*************************
Epochs per episode: 13.09
Penalties per episode: 0.0
Timeoffs per episode: 0.0




---


## Credits

Made by Carlos Bravo

GitHub: https://github.com/CeHaga

Telegram: @CeHaga