In this exercise, I use OpenAI Gym's Taxi-v2 environment to design an algorithm that teaches a taxi agent to navigate a small gridworld.

The environment is described in subsection 3.1 of this [paper](https://arxiv.org/abs/cs/9905014).
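For orientation, each Taxi state is a single integer that combines the taxi's row and column on the 5x5 grid, the passenger location (one of the landmarks R, G, Y, B, or inside the taxi), and the destination landmark, which gives 5 × 5 × 5 × 4 = 500 discrete states. The sketch below reproduces that encoding in plain Python, as I understand Gym's implementation. The helper names `encode_state` and `decode_state` are illustrative and not part of the Gym API.

```python
def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into one integer in [0, 500)."""
    # passenger_loc: 0-3 = waiting at R, G, Y, B; 4 = riding in the taxi
    # destination:   0-3 = R, G, Y, B
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination


def decode_state(state):
    """Invert encode_state: (taxi_row, taxi_col, passenger_loc, destination)."""
    state, destination = divmod(state, 4)
    state, passenger_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_loc, destination


# Example: taxi at row 3, column 3; passenger waiting at R; destination B.
s = encode_state(3, 3, 0, 3)
print(s, decode_state(s))
```

A flat integer state like this is what makes a simple lookup table of action values practical for this environment.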

In [12]:
from agent import Agent
from monitor import interact
import gym
import numpy as np

In [13]:
def simulate(agent, env):
    """Run one episode with the given agent, rendering the grid after every step."""
    state = env.reset()
    env.render()
    total_reward = 0
    while True:
        # Ask the agent for an action and apply it to the environment.
        action = agent.select_action(state)
        state, reward, done, _ = env.step(action)
        env.render()
        total_reward += reward
        print('total rewards:', total_reward)
        if done:
            break

In [14]:
env = gym.make('Taxi-v2')
agent = Agent(alpha=1)
avg_rewards, best_avg_reward = interact(env, agent, num_episodes=10000)

Alpha:1
Episode 10000/10000 || Best average reward 9.366
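The `Agent` class lives in `agent.py` and is not shown in this notebook. As a rough guide to what `alpha` controls, here is a minimal sketch of a tabular Q-learning agent, assuming `alpha` is the step size of the value update. The class name `SketchAgent`, the `step` method, and the `gamma`, `epsilon`, and `seed` parameters are my assumptions for illustration, so the real `Agent` may differ.

```python
import random
from collections import defaultdict


class SketchAgent:
    """Tabular Q-learning agent (illustrative; not the notebook's agent.py)."""

    def __init__(self, nA=6, alpha=1.0, gamma=1.0, epsilon=0.1, seed=0):
        self.nA = nA            # Taxi has six actions
        self.alpha = alpha      # step size of the update
        self.gamma = gamma      # discount factor (assumed)
        self.epsilon = epsilon  # exploration rate (assumed)
        self.Q = defaultdict(lambda: [0.0] * nA)
        self.rng = random.Random(seed)

    def select_action(self, state):
        """Epsilon-greedy choice over the current action values."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.nA)
        values = self.Q[state]
        return values.index(max(values))

    def step(self, state, action, reward, next_state, done):
        """Move Q(state, action) toward the bootstrapped one-step target."""
        best_next = 0.0 if done else max(self.Q[next_state])
        target = reward + self.gamma * best_next
        self.Q[state][action] += self.alpha * (target - self.Q[state][action])


agent_sketch = SketchAgent(alpha=1.0)
agent_sketch.step(state=0, action=2, reward=-1, next_state=1, done=False)
print(agent_sketch.Q[0])
```

With `alpha=1`, each update overwrites the old estimate with the latest target. Taxi's transitions and rewards are deterministic, which may be why such a large step size still learns well here.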



### Let's see what the agent can do now

In [15]:
simulate(agent, env)

+---------+
|[34;1mR[0m: | : :G|
| : : : : |
| : : : : |
| | : |[43m [0m: |
|Y| : |[35mB[0m: |
+---------+

+---------+
|[34;1mR[0m: | : :G|
| : : : : |
| : : :[43m [0m: |
| | : | : |
|Y| : |[35mB[0m: |
+---------+
  (North)
total rewards: -1
+---------+
|[34;1mR[0m: | : :G|
| : : : : |
| : :[43m [0m: : |
| | : | : |
|Y| : |[35mB[0m: |
+---------+
  (West)
total rewards: -2
+---------+
|[34;1mR[0m: | : :G|
| : : : : |
| :[43m [0m: : : |
| | : | : |
|Y| : |[35mB[0m: |
+---------+
  (West)
total rewards: -3
+---------+
|[34;1mR[0m: | : :G|
| : : : : |
|[43m [0m: : : : |
| | : | : |
|Y| : |[35mB[0m: |
+---------+
  (West)
total rewards: -4
+---------+
|[34;1mR[0m: | : :G|
|[43m [0m: : : : |
| : : : : |
| | : | : |
|Y| : |[35mB[0m: |
+---------+
  (North)
total rewards: -5
+---------+
|[34;1m[43mR[0m[0m: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |[35mB[0m: |
+---------+
  (North)
total rewards: -6
+---------+
|[42mR[0m: | : :G|
| : : : : 

Every action looks reasonable: the taxi drives straight to the passenger at R in six moves, the fewest possible from its starting cell. Excellent.

## Fine-tune the parameters

In [10]:
# Track the best-performing alpha across the sweep.
recorder = {'best_avg': -np.inf,
            'alpha': -1,
            'best_agent': None}
# Five evenly spaced values of alpha, from 1 down to 0.001.
alphas = np.linspace(1, 0.001, num=5, endpoint=True)
for a in alphas:
    agent = Agent(alpha=a)
    _, best_avg_reward = interact(env, agent)
    if best_avg_reward > recorder['best_avg']:
        recorder['best_avg'] = best_avg_reward
        recorder['alpha'] = a
        recorder['best_agent'] = agent

Alpha:1.0
Episode 20000/20000 || Best average reward 9.372

Alpha:0.75025
Episode 20000/20000 || Best average reward 9.278

Alpha:0.5005
Episode 20000/20000 || Best average reward 9.211

Alpha:0.25075000000000003
Episode 20000/20000 || Best average reward 9.229

Alpha:0.001
Episode 20000/20000 || Best average reward -203.75
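The smallest value stands out in the results above. One plausible contributor, assuming the agent uses an update of the form Q ← Q + alpha · (target − Q), is that a tiny step size needs many updates before an estimate moves noticeably: with a fixed target, the remaining error after n updates is (1 − alpha)^n of the initial error. The sketch below works through that arithmetic. The function name `updates_to_shrink_error` is hypothetical, and this is a back-of-the-envelope illustration rather than a diagnosis of this particular run.

```python
import math


def updates_to_shrink_error(alpha, fraction=0.1):
    """Updates needed so the remaining error is at most `fraction` of the start.

    Assumes repeated updates Q <- Q + alpha * (target - Q) with a fixed target,
    so the error shrinks by a factor of (1 - alpha) on every update.
    """
    if alpha >= 1.0:
        return 1  # a single update lands exactly on the target
    return math.ceil(math.log(fraction) / math.log(1.0 - alpha))


for a in [1.0, 0.75025, 0.5005, 0.25075, 0.001]:
    print(a, updates_to_shrink_error(a))
```

Under this simplified model, the larger step sizes need only a handful of visits per state-action pair, while `alpha=0.001` needs a few thousand visits to close 90% of the gap to a single target.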



In [11]:
recorder

{'best_avg': 9.3699999999999992,
 'alpha': 1.0,
 'best_agent': <agent.Agent at 0x110409c88>}