<a href="https://colab.research.google.com/github/Akshay-A-Kulkarni/Reinforcement-Learning/blob/master/Q_Learning_with_OpenAI_Taxi_v3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Step 0: Import the dependencies 
First, we import the libraries <b>we need to create our agent.</b><br>
We use three libraries:
- `NumPy` for our Q-table
- `OpenAI Gym` for our Taxi environment
- `random` to generate random numbers

In [0]:
import numpy as np
import gym
import random

## Step 1: Create the environment 
- Here we'll create the Taxi environment. 
- OpenAI Gym is a library <b>composed of many environments that we can use to train our agents.</b>

In [44]:
env = gym.make("Taxi-v3")
env.render()

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+



### Random / Brute-Force Animation



In [45]:
from IPython.display import clear_output
from time import sleep

# env.s = 328  # set environment to illustration's state

epochs = 0
penalties, reward = 0, 0

frames = [] # for animation

done = False

env.reset()
while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    if reward == -10:
        penalties += 1
    
    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
        }
    )

    epochs += 1
    
    
print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))

Timesteps taken: 200
Penalties incurred: 56


In [66]:
def animate_frames(frames, wait_time):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(wait_time)
        
animate_frames(frames, 0.05)

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)

Timestep: 200
State: 188
Action: 2
Reward: -1


## Step 2: Create the Q-table and initialize it 🗄️
- Now we'll create our Q-table. To know how many rows (states) and columns (actions) it needs, we must determine `action_size` and `state_size`.
- OpenAI Gym provides both: `env.action_space.n` and `env.observation_space.n`

In [47]:
action_size = env.action_space.n
print("Action size ", action_size)

state_size = env.observation_space.n
print("State size ", state_size)

Action size  6
State size  500
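
The 500 states are consistent with 25 taxi positions (a 5×5 grid) × 5 passenger locations (the four landmarks plus "in the taxi") × 4 destinations. As a minimal sketch, assuming the usual Taxi encoding `((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination`, the hypothetical helpers `encode_state` and `decode_state` below (illustrative names, not taken from this notebook) show how one integer maps to those four components:

```python
def encode_state(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four Taxi components into one integer in [0, 500).

    Assumed layout: 5 rows, 5 columns, 5 passenger locations, 4 destinations.
    """
    index = taxi_row
    index = index * 5 + taxi_col
    index = index * 5 + passenger_location
    index = index * 4 + destination
    return index


def decode_state(index):
    """Invert encode_state: recover (row, col, passenger, destination)."""
    destination = index % 4
    index //= 4
    passenger_location = index % 5
    index //= 5
    taxi_col = index % 5
    taxi_row = index // 5
    return taxi_row, taxi_col, passenger_location, destination


print(encode_state(3, 1, 2, 0))  # 328
print(decode_state(328))         # (3, 1, 2, 0)
```

Under this assumption, each row of the Q-table corresponds to exactly one combination of taxi position, passenger location, and destination.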


In [48]:
Qtable = np.zeros((state_size,action_size))
Qtable

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       ...,
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

## Step 3: Create the hyperparameters
Here, we'll specify the hyperparameters that control Q-learning.

In [0]:
# Episode Params
total_episodes = 50000        
total_test_episodes = 100      
max_steps = 100 


# Learning & Discount(gamma) Rates
lr = 0.7
gamma = 0.618

# Exploration Parameter 
epsilon = 1.0       # Exploration rate
max_eps = 1.0
min_eps = 0.01
decay_rate = 0.01   # Exponential decay rate for exploration prob
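
To get a feel for the exploration parameters, the sketch below evaluates the exponential schedule `min_eps + (max_eps - min_eps) * exp(-decay_rate * episode)` that the training loop applies after each episode. The function name `epsilon_at` is a hypothetical helper introduced only for this illustration:

```python
import math

max_eps = 1.0
min_eps = 0.01
decay_rate = 0.01


def epsilon_at(episode):
    """Exploration rate after the given number of episodes."""
    return min_eps + (max_eps - min_eps) * math.exp(-decay_rate * episode)


# Print the exploration rate at a few points in training.
for episode in (0, 100, 500, 1000):
    print(episode, round(epsilon_at(episode), 4))
```

With these values epsilon falls from 1.0 to roughly 0.37 after 100 episodes and sits very close to `min_eps` after 1,000, so the large majority of the 50,000 training episodes are almost entirely greedy.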

## Step 4: The Q-learning algorithm 🧠
- Now we implement the Q-learning algorithm:


<img src="https://cdn-media-1.freecodecamp.org/images/1*jmcVWHHbzCxDc-irBy9JTw.png" alt="Q algo" width = "95%"/>
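
Before running the full loop, it can help to apply the update rule to a single transition by hand. The sketch below uses the `lr` and `gamma` values from Step 3; `q_update` is a hypothetical helper name, and the two transitions are made-up numbers chosen only to illustrate the arithmetic:

```python
lr = 0.7       # learning rate from Step 3
gamma = 0.618  # discount rate from Step 3


def q_update(q_sa, reward, max_q_next):
    """One Q-learning update for a single (state, action) entry."""
    td_target = reward + gamma * max_q_next  # r + gamma * max_a' Q(s', a')
    return q_sa + lr * (td_target - q_sa)


# Fresh table entry, ordinary step (reward -1), next state still all zeros.
first = q_update(0.0, -1, 0.0)

# Same entry later receives the drop-off reward (+20) in a terminal transition.
second = q_update(first, 20, 0.0)

print(round(first, 4), round(second, 4))  # -0.7 13.79
```

Because `lr` is 0.7, each update moves the entry 70% of the way toward the new target, which is why a single large reward shifts the value so strongly.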

In [0]:
# Train for total_episodes episodes (a fixed iteration limit standing in for convergence).
for episode in range(total_episodes):
  # reset the world
  state = env.reset() 
  step = 0 
  done = False

  for step in range(max_steps):
    # Choose an Action in the current world state (s)

    # First draw a random number in [0, 1]
    exp_exp_tradeoff = random.uniform(0,1)

    # If this number is greater than our exploration parameter epsilon,
    # take the action with the biggest Q-value for this state (exploit)
    if exp_exp_tradeoff > epsilon:
      action = np.argmax(Qtable[state,:])

    # Otherwise take a random choice and explore
    else:
      action = env.action_space.sample()
    
    # Given action, observe the outcome state [s'] and reward [r]
    new_state, reward, done, info = env.step(action)

    # Update using the Q-learning algo formula
    # Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)]

    Qtable[state, action] = Qtable[state, action] + \
                              lr *(reward + gamma * np.max(Qtable[new_state,:]) 
                                - Qtable[state, action])
                             
    
    # Set new state as current State
    state = new_state

    # If done, i.e. the destination was reached, finish the episode.
    if done:
      break
  
  # Decay epsilon so the agent explores less as the Q-table improves
  epsilon = min_eps + (max_eps - min_eps)*np.exp(-decay_rate*episode)
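
The action-selection step inside the loop above is an epsilon-greedy rule. As a standalone sketch, it can be isolated into a small function; `choose_action` and its `rng` parameter are hypothetical names used here for illustration, and a plain Python list stands in for one row of `Qtable`:

```python
import random


def choose_action(q_row, epsilon, rng=random):
    """Epsilon-greedy choice over one row of Q-values.

    Exploits (picks the first maximal entry, like np.argmax) when a
    uniform draw exceeds epsilon; otherwise picks a random action.
    """
    if rng.uniform(0, 1) > epsilon:
        return q_row.index(max(q_row))
    return rng.randrange(len(q_row))


rng = random.Random(0)  # seeded only to make the demo repeatable
q_row = [-2.0, 0.5, -1.0]

print(choose_action(q_row, epsilon=0.0, rng=rng))  # greedy: index of 0.5
print(choose_action(q_row, epsilon=1.0, rng=rng))  # always a random action
```

Early in training epsilon is near 1.0, so nearly every action is random; as epsilon decays, the greedy branch takes over.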
  

In [55]:
# This is our learned Q-table.
display(Qtable)

array([[  0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ],
       [ -2.50421537,  -2.43487347,  -2.50427491,  -2.43480242,
         -2.32039715, -11.43371452],
       [ -1.84041247,  -1.35777503,  -1.84267184,  -1.35777017,
         -0.57891593, -10.35777131],
       ...,
       [ -2.21309646,   0.68136372,  -2.25741178,  -2.154706  ,
         -7.        , -10.50803626],
       [ -2.25741178,  -2.13752647,  -2.32131447,  -2.22861473,
         -9.1       ,  -7.        ],
       [ -1.476412  ,  -1.5036658 ,  -1.21282   ,  11.36      ,
        -10.477222  , -10.123666  ]])

## Step 5: Use our Q-table to play Taxi  🚖
- After 50 000 episodes, our Q-table can be used as a "cheatsheet" or guide to play Taxi.
- By running this cell you can see our agent playing Taxi.
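
Using the Q-table as a "cheatsheet" means reading off, for each state, the action with the largest Q-value. The sketch below does this on a tiny made-up table (the numbers and the name `greedy_policy` are hypothetical, and plain lists stand in for the NumPy array):

```python
# Toy Q-table: 3 states x 2 actions, values invented for illustration.
q_table = [
    [0.0, 0.0],    # never-visited state: all entries still zero
    [-2.5, -0.6],  # action 1 is the less costly choice
    [11.4, -1.2],  # action 0 leads toward the reward
]


def greedy_policy(table):
    """Best action per state; ties resolve to the lowest index, like np.argmax."""
    return [row.index(max(row)) for row in table]


print(greedy_policy(q_table))  # [0, 1, 0]
```

Note that a state whose row is still all zeros, like the first row of the learned table above, simply defaults to action 0, since the agent has learned nothing about it.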

In [69]:
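env.reset()
rewards = []
test_frames = []

for episode in range(total_test_episodes):
  state = env.reset()
  done = False
  total_rewards = 0

  for step in range(max_steps):
    # Render the state the agent is about to act from.
    frame = env.render(mode='ansi')

    # Take the action (index) with the highest expected future reward in this state.
    a = np.argmax(Qtable[state,:])

    n_state, r, done, info = env.step(a)

    # Store the frame with this step's action and reward
    # (not the stale `action`/`reward` left over from the training loop).
    test_frames.append({
        'frame': frame,
        'state': state,
        'action': a,
        'reward': r
        }
    )

    total_rewards += r

    if done:
      break
    state = n_state

  # Record the return even if the episode hit max_steps without finishing,
  # so the average below covers all total_test_episodes episodes.
  rewards.append(total_rewards)

env.close()
print("Score over time: " + str(sum(rewards)/total_test_episodes))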

animate_frames(test_frames, wait_time = 0.005)


+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)

Timestep: 1289
State: 16
Action: 5
Reward: 20
