# Q-learning taxi game

## Imports

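<font size=4>First, let's make sure the OpenAI Gym package is installed. The version is pinned to 0.21.0; the code below relies on that release's API, in which `reset()` returns only the state and `step()` returns four values.
</font>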

In [None]:
!pip install gym==0.21.0
!pip install pygame

In [None]:
import numpy as np
import gym
import ipywidgets as widgets
from ipywidgets import interact
from tqdm.notebook import tqdm

## Initialisation

<font size=4>Let's create an environment for the taxi game. This environment object keeps
track of its state, and provides methods for performing actions in the environment and returning
the rewards for those actions.</font>

In [None]:
env = gym.make("Taxi-v3")

<font size=4>The `observation_space` attribute shows that there are 500 possible discrete states in this environment (25 taxi positions × 5 passenger locations × 4 destinations).</font>

In [None]:
env.observation_space

In [None]:
env.observation_space.n
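
The count of 500 comes from combining 25 taxi positions, 5 passenger locations (four landmarks plus "in the taxi") and 4 destinations. As a rough illustration, the sketch below redoes that index arithmetic in plain Python. The names `encode_state` and `decode_state` are made up for this example, and the field ordering is an assumption about how Taxi-v3 packs its state.

```python
def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into a single index in [0, 500)."""
    # 5 rows x 5 columns x 5 passenger locations x 4 destinations = 500
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination


def decode_state(index):
    """Invert encode_state: (taxi_row, taxi_col, passenger_loc, destination)."""
    index, destination = divmod(index, 4)
    index, passenger_loc = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_loc, destination


print(encode_state(4, 4, 4, 3))  # largest index: 499
print(decode_state(0))           # (0, 0, 0, 0)
```

In the notebook itself, `list(env.unwrapped.decode(state))` should give a comparable breakdown for a state index returned by `reset()` or `step()`.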

<font size=4>The reset() method sets the environment to a random state and returns the index of that state.</font>

In [None]:
env.reset()  # This is a useful function - you will need it!

<font size=4>The `render()` method shows what a state of the environment looks like:
 - The yellow rectangle is the taxi
 - The |'s are walls
 - The :'s are road
 - The goal is to pick up the passenger at the blue letter, then drop them off at the pink letter</font>

In [None]:
env.render()

<font size=4>Note that there are 6 possible actions the taxi can take at any time step with the indices:
 - 0: down
 - 1: up
 - 2: right
 - 3: left
 - 4: pick up
 - 5: drop off</font>

In [None]:
env.action_space

In [None]:
env.action_space.n

<font size=4>To sample a random action from the action space, which you will need to do, you can use the `env.action_space.sample()` method.</font>

In [None]:
env.action_space.sample()

<font size=4>Experiment with taking actions...
    
- The argument to the `step( )` function is the index of the action you want to take

- The `new_state` returned by the `step( )` function is the index of the new state we are in after performing that action. 
 
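- `reward` is the reward received for taking that action (in Taxi-v3: -1 per step, +20 for a successful drop-off, -10 for an illegal pick-up or drop-off)

- `done` is `True` once the episode has ended and `info` holds extra diagnostic information; you do not need either of them for the training loop in this exercise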
</font>

In [None]:
# The argument to this function is the index of the action you want to take.
# 0: down, 1: up, 2: right, 3: left, 4: pickup, 5: dropoff

ACTION_INDEX = 0

new_state, reward, done, info = env.step(ACTION_INDEX) # This function is important!

env.render()
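
To get a feel for the `step()` return values over a whole episode, a random rollout can help. The helper below is only a sketch: `random_rollout` and its `max_steps` argument are hypothetical names, and it assumes the gym 0.21 API used in this notebook, where `reset()` returns a state index and `step()` returns four values.

```python
def random_rollout(env, max_steps=50):
    """Take random actions until the episode ends or max_steps is reached.

    Returns the total reward collected and the number of steps taken.
    max_steps must be at least 1.
    """
    state = env.reset()
    total_reward = 0
    for step_count in range(1, max_steps + 1):
        action = env.action_space.sample()  # uniformly random action index
        state, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward, step_count


# Example use in the notebook (not run here):
# total_reward, steps = random_rollout(env)
```

A random taxi rarely completes a drop-off in 50 steps, so the total reward will usually be quite negative.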

# Perform tabular Q-learning

<font size=4>Initialize the action-value function `Q(state, action)` and choose a learning rate $\alpha$</font>

In [None]:
# Initialise your Q(s,a) table. This should be a 2-D array of floats - use a numpy array for this.
# What shape should the table be?
# You can find the size of the state space with env.observation_space.n
# and the size of the action space with env.action_space.n
# Try initialising all values to 0 at first, or play with random initialisation


Q = ?


# Choose a sensible learning rate
alpha = 0.1
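
If you get stuck on the table shape, here is one possible layout, shown as a sketch rather than the intended answer. The sizes are hard-coded as an assumption; in the notebook they would come from `env.observation_space.n` and `env.action_space.n`. The name `Q_example` is used to avoid clashing with your own `Q`.

```python
import numpy as np

# Assumed sizes for Taxi-v3 (500 states, 6 actions).
n_states, n_actions = 500, 6

# One row per state, one column per action, all values starting at 0.
Q_example = np.zeros((n_states, n_actions))

print(Q_example.shape)  # (500, 6)
print(Q_example[0])     # the 6 action values for state 0
```

With this layout, `Q_example[state]` is the row of action values for a state and `Q_example[state, action]` is a single entry.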

<font size=4>Play the game for a certain number of episodes, updating the Q-function after each action</font>
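
For reference, the standard tabular Q-learning update is usually written as below. The discount factor $\gamma$ is not defined anywhere in this notebook, so treat it as an extra hyperparameter you would need to choose (values a little below 1 are common), and check the form against your slides.

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

Here $s$ is the current state, $a$ the action taken, $r$ the reward received and $s'$ the new state.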

In [None]:
# Choose a number of episodes to run
n_episodes = 100000
print_freq = 10  # logarithmic
prev_freq = 0

avg_step_count = 0
for episode in tqdm(range(1, n_episodes + 1)):

    # Initialise the environmental state by resetting the environment. 
    # Make sure to save the index of the state returned by the reset function
    # so that you know where you are starting.
    state = ?

    # Let's create a variable to track how many steps each episode takes.
    step_count = 0

    # Let epsilon decay with the episode number. Try experimenting with this!
    # (With this schedule epsilon >= 1 for the first 10 episodes, so those are fully random.)
    epsilon = 10 / episode

    # Continue until taxi performs the correct pick-up and drop-off (thus earning 20 points)
    reward = 0
    while reward != 20:

        # Choose action epsilon-greedily. You might have a function left over from the k-armed bandit 
        # notebook that can help you do this. Remember that Q(s,a) is a 2D table - when deciding
        # which action to use when exploiting, we need to pick the row of 6 Q(s,a) values corresponding 
        # to the state that the environment is currently in
        action = ?

        # Perform the action you chose, and receive the new state and reward
        # what function should go here?
        ? = env.?

        step_count += 1

        # Update action-value function according to Q-learning algorithm
        # Refer to slide for update equation!

        # Q-update equation goes here

        # Update state and proceed to next action
        state = new_state

    # Track stats
    # Keep a running mean of the steps per episode since the last report, and
    # print it at episodes 10, 100, 1000, ... to show how many steps on average
    # it takes your q-learning agent to complete the task.
    avg_step_count += (step_count - avg_step_count) / (episode - prev_freq)
    if episode % print_freq == 0:
        print(
            "Episode: {}, Average Step Count: {:.2f}".format(
                episode, avg_step_count
            )
        )
        avg_step_count = 0
        prev_freq = print_freq
        print_freq *= 10
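
The same loop structure can be tried out on a much smaller problem before filling in the blanks above. The sketch below is not the intended solution: `epsilon_greedy`, `q_update` and `toy_step` are hypothetical helpers, `gamma` is an assumed discount factor that this notebook never defines, and the 5-state corridor stands in for the taxi environment so that the code runs without gym.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the run is repeatable


def epsilon_greedy(Q, state, epsilon):
    """With probability epsilon explore at random, otherwise act greedily."""
    if rng.uniform() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))


def q_update(Q, state, action, reward, new_state, alpha, gamma):
    """Apply one tabular Q-learning update to Q in place."""
    td_target = reward + gamma * np.max(Q[new_state])
    Q[state, action] += alpha * (td_target - Q[state, action])


def toy_step(state, action):
    """Corridor of states 0..4: action 1 moves right, action 0 moves left.

    Reaching state 4 pays +20; every other step costs -1.
    """
    new_state = min(state + 1, 4) if action == 1 else max(state - 1, 0)
    reward = 20 if new_state == 4 else -1
    return new_state, reward


Q_toy = np.zeros((5, 2))
for episode in range(1, 201):
    state, reward = 0, 0
    epsilon = 10 / episode  # same decay schedule as above
    while reward != 20:
        action = epsilon_greedy(Q_toy, state, epsilon)
        new_state, reward = toy_step(state, action)
        q_update(Q_toy, state, action, reward, new_state, alpha=0.1, gamma=0.9)
        state = new_state

print(np.argmax(Q_toy[:4], axis=1))  # greedy action in each non-goal state
```

The goal state's row is never updated because the episode stops there, so it stays at zero and the final step's target is just the reward. That only holds for a zero-initialised table, which is worth remembering for the random-initialisation extension.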

# Results

### Let's visualise the behaviour of the agent before and after learning.

Functions used for interactive results

In [None]:
def run_episode(Q, epsilon):
    states = []
    actions = []
    state = env.reset()
    states.append(state)
    done = False
    while not done:
        if np.random.uniform() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])
        actions.append(action)
        state, reward, done, info = env.step(action)
        states.append(state)
    return states, actions


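def snapshot(t):
    # Render the environment as it looked after t steps of the recorded episode.
    # `states` and `actions` are the globals returned by run_episode.
    if t == 0:
        env.reset()
        env.unwrapped.s = states[0]
    else:
        # Restore the previous state and replay the action so that render()
        # also shows which action was taken.
        env.unwrapped.s = states[t - 1]
        env.step(actions[t - 1])
    env.render()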

Here is an example of a fully random strategy (epsilon = 1) in action

In [None]:
states, actions = run_episode(Q, 1)
interact(
    snapshot, t=widgets.IntSlider(min=0, max=len(states) - 1, step=1, value=0)
);

Here is the learned greedy policy (epsilon = 0) in action

In [None]:
states, actions = run_episode(Q, 0)
interact(
    snapshot, t=widgets.IntSlider(min=0, max=len(states) - 1, step=1, value=0)
);

## Extensions

### What happens to the convergence rate if you randomly initialise your Q table instead of setting it to 0?
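
One way to set up this experiment is sketched below. The sizes are hard-coded as an assumption (in the notebook they would come from `env.observation_space.n` and `env.action_space.n`), and the value range is an arbitrary choice made for illustration.

```python
import numpy as np

n_states, n_actions = 500, 6  # assumed Taxi-v3 sizes
rng = np.random.default_rng(seed=0)  # seeded so runs are repeatable

# Small random values instead of zeros; the range [-1, 0) is arbitrary.
Q_random = rng.uniform(low=-1.0, high=0.0, size=(n_states, n_actions))

print(Q_random.shape)  # (500, 6)
```

Keep in mind that with random initial values, the row for a state the agent never acts from is never corrected, so its random entries can leak into the update target. A common remedy is to drop the bootstrap term when the episode has ended.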

## Try repeating this for a different game in the OpenAI gym! https://gym.openai.com/envs/FrozenLake8x8-v0/ is a good choice