## Step 0: Import the dependencies 📚
First, we need to import the libraries <b>that we'll need to create our agent.</b></br>
We use 3 libraries:
- `Numpy` for our Qtable
- `OpenAI Gym` for our Taxi Environment
- `Random` to generate random numbers

In [1]:
import numpy as np
import gym
import random as rnd 

## Step 1: Create the environment 🎮
- Here we'll create the Taxi environment. 
- OpenAI Gym is a library <b> composed of many environments that we can use to train our agents.</b>

In [2]:
# create and reset the Taxi-v2 environment 
taxi_env = gym.make('Taxi-v2')
taxi_env.render()

+---------+
|[34;1mR[0m: | : :G|
| :[43m [0m: : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+



## Step 2: Create the Q-table and initialize it 🗄️
- Now, we'll create our Q-table, to know how much rows (states) and columns (actions) we need, we need to calculate the action_size and the state_size
- OpenAI Gym provides us a way to do that: `env.action_space.n` and `env.observation_space.n`

In [3]:
# create a q-table with all zeros 
states = taxi_env.observation_space.n
actions = taxi_env.action_space.n
print('num states: {}; num actions: {}'.format(states, actions))

qtable = np.zeros(states * actions).reshape(states, actions)
print('shape of q-table: {}'.format(qtable.shape))

print(qtable)

num states: 500; num actions: 6
shape of q-table: (500, 6)
[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 ...
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]


the possible actions (encoded) to take in the game are: 
- 0: move one down
- 1: move one up
- 2: move one right
- 3: move one left
- 4: pickup passenger - location in blue 
- 5: dropoff passenger - destination in pink 

## Step 3: Create the hyperparameters ⚙️
Here, we'll specify the hyperparameters.

In [37]:
learning_rate = 0.1         # alpha
discount_rate = 0.9         # gamma
max_exploration = 1.0       # max epsilon
min_exploration = 0.01      # min epsilon
epsilon_decay = 0.01
epsilon = max_exploration

episodes = 50000            # total episodes to run
test_episodes = 100         # epsiodes to test the trained agent
steps_per_episode = 1000    # steps the agent takes per episodes 

## Step 4: The Q learning algorithm 🧠
- Now we implement the Q learning algorithm:
<img src="qtable_algo.png" alt="Q algo"/>

In [38]:
# repeat learning until episodes over 
for episode in range(episodes):
    state = taxi_env.reset()
    done = False
    
    for i in range(steps_per_episode):
        if done: 
            break 
        else:
            # choose action 
            exploration_exploitation = rnd.random()
            if np.sum(qtable[state, :]) == 0 or exploration_exploitation < epsilon:
                # exploration - choose random action
                action = taxi_env.action_space.sample()
            else:
                # exploitation - choose best action 
                action = np.argmax(qtable[state, : ])
            
            # make the step 
            new_state, reward, done, _ = taxi_env.step(action)
            
            # update q values 
            current_qvalue = qtable[state, action]
            # print('current q-value {}'.format(current_qvalue))
            qtable[state, action] = current_qvalue + learning_rate * \
                                    (reward + discount_rate * (np.max(qtable[new_state, :]) - current_qvalue))
            
            epsilon -= epsilon_decay
            state = new_state
    
    # print qtable after every episode 
    # print('Q-table after episode {}\n{}'.format(episode, qtable))
    
    # reducing epsilon per episode to have decreasing exploration and more exploitation 
    epsilon = min_exploration + (max_exploration-min_exploration)*np.exp(-epsilon_decay*episode)
    
# print final qtable
print(qtable)

[[ 0.          0.          0.          0.          0.          0.        ]
 [ 8.12911143  6.93392846  6.75149188  4.84122725 14.44444444  0.47761513]
 [ 6.77708775  9.36985359  4.22359705  5.64308228 16.66666667  3.1190741 ]
 ...
 [-1.49144559 -1.31895383 -1.56861604  2.8108381  -4.56983725 -2.93619   ]
 [-2.40242294 -2.93033622 -2.54044449  3.20761683 -3.63048845 -3.        ]
 [-0.1119009   0.22046414  1.36808042 16.8606016  -1.75191169 -1.        ]]


## Step 5: Use our Q-table to play Taxi ! 🚖
- After 50 000 episodes, our Q-table can be used as a "cheatsheet" to play Taxi.
- By running this cell you can see our agent playing Taxi.

In [39]:
taxi_env.reset()
rewards = np.zeros(test_episodes)

for run in range(test_episodes):
    reward_acc = 0
    state = taxi_env.reset()
    done = False
    
    # run until done 
    while not done: 
        # taxi_env.render()
        action = np.argmax(qtable[state, :])
        new_state, reward, done, _ = taxi_env.step(action)
        reward_acc += reward
        
        state = new_state
    # print('rewards of round {}: {}'.format(run, reward_acc))
    rewards[run] = reward_acc

taxi_env.close()

avg_reward = np.sum(rewards)/test_episodes
print('average reward per round: {}'.format(avg_reward))

average reward per round: 8.69
