# 5. Q-Learning

In contrast to Sarsa, which is an on-policy method (it learns to improve the policy while following it), Q-learning is off-policy (it improves Q independently of the policy being followed).
The algorithm is nearly identical to Sarsa; the only difference is the update target: <br>
Sarsa: $Q(s,a) \leftarrow Q(s,a) + \alpha [R + \gamma Q(s',a') - Q(s,a)]$ <br>
Q-learning: $Q(s,a) \leftarrow Q(s,a) + \alpha [R + \gamma \max_{a'} Q(s',a') - Q(s,a)]$ <br>
The term $\max_{a'} Q(s',a')$, written `amax(Q(s',:))` in array notation, is the best action value available at the next state, regardless of which action the policy actually takes there.
In Sutton & Barto, Sarsa performs better on the cliff-walking task (its online reward is higher because it accounts for exploratory actions), but in the "Taxi-v2" and "FrozenLake" environments Q-learning outperforms Sarsa.
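
To make the difference between the two update targets concrete, here is a minimal sketch on a hypothetical two-state Q-table. The table values, the sampled next action `a_next`, and the helper names `sarsa_target` and `q_learning_target` are illustrative assumptions, not taken from the environments used in this section.

```python
import numpy as np

# Hypothetical Q-table: 2 states x 3 actions (values chosen for illustration only).
q = np.array([[0.0, 0.0, 0.0],
              [1.0, 5.0, 2.0]])

gamma = 0.9    # discount factor (assumed)
reward = 1.0   # reward observed for the transition (assumed)
s_next = 1     # next state
a_next = 2     # action the behavior policy happened to pick in s_next

def sarsa_target(q, reward, s_next, a_next, gamma):
    """On-policy target: uses the action actually selected in s_next."""
    return reward + gamma * q[s_next, a_next]

def q_learning_target(q, reward, s_next, gamma):
    """Off-policy target: uses the best action value available in s_next."""
    return reward + gamma * np.amax(q[s_next, :])

print(sarsa_target(q, reward, s_next, a_next, gamma))  # 1 + 0.9 * 2
print(q_learning_target(q, reward, s_next, gamma))     # 1 + 0.9 * 5
```

The two targets coincide only when the behavior policy happens to pick the greedy action in the next state; whenever it explores, Sarsa's target reflects that exploratory choice while Q-learning's does not.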

Input: the behavior policy $\pi$ (e.g. $\epsilon$-greedy with respect to $Q$) <br>
Initialize $Q(s,a)$ arbitrarily <br>
Repeat (for each episode): <br>
&emsp;    Initialize $S$ <br>
&emsp;    Repeat (for each step of episode): <br>
&emsp;&emsp;        $A \leftarrow$ action given by $\pi$ for $S$ <br>
&emsp;&emsp;        Take action $A$; observe reward $R$ and next state $S'$ <br>
&emsp;&emsp;        $Q(S,A) \leftarrow Q(S,A) + \alpha [R + \gamma \max_{a} Q(S',a) - Q(S,A)]$ <br>
&emsp;&emsp;        $S \leftarrow S'$ <br>
&emsp;    until $S$ is terminal <br>
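
The steps above can be traced in a dependency-free sketch. The corridor environment, the `step` and `run_q_learning` helpers, and all hyperparameter values below are assumptions made for illustration. The behavior policy is uniformly random here, which highlights the off-policy property: the greedy policy read off from Q can still be good even though the agent never follows it during training.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2   # hypothetical corridor: states 0..4, actions 0=left, 1=right

def step(state, action):
    """Toy environment (an assumption for illustration): reaching state 4 ends
    the episode with reward +1; every other transition gives reward 0."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def run_q_learning(episodes=200, alpha=0.5, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, N_ACTIONS))            # Initialize Q(s, a)
    for _ in range(episodes):                      # Repeat (for each episode)
        state, done = 0, False                     # Initialize S
        while not done:                            # Repeat (for each step of episode)
            action = int(rng.integers(N_ACTIONS))  # A <- behavior policy (uniformly random)
            next_state, reward, done = step(state, action)
            # Q-learning update: bootstrap from the best action value in S'
            target = reward + gamma * np.max(q[next_state])
            q[state, action] += alpha * (target - q[state, action])
            state = next_state                     # S <- S'
    return q

q = run_q_learning()
greedy_actions = np.argmax(q, axis=1)
print(greedy_actions[:-1])   # greedy action in each non-terminal state
```

The row of `q` for the terminal state is never updated, so it stays at zero and the target for the final transition reduces to the reward alone.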

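import gym
import numpy as np
from collections import deque
np.random.seed(42) # seeds NumPy's RNG, which drives the epsilon-greedy policy

env_name = 'Taxi-v2'
#env_name = 'FrozenLake-v0'

env = gym.make(env_name)
state_space = env.observation_space.n # number of discrete states
action_space = env.action_space.n # number of discrete actions

alpha = 0.85 # learning rate
gamma = 0.999 # discount factor
epsilon = 1.0 # amount of exploration
epsilon_decay = 0.99 # exploration decay (applied once per game)
num_games = 1500

q = np.zeros([state_space, action_space]) # tabular action-value estimates
reward_list = deque(maxlen=100) # total rewards of the most recent 100 games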

def choose_action(q, state, epsilon):
    """Epsilon-greedy policy (explore with probability epsilon)."""
    if np.random.uniform() < epsilon:
        action = np.random.choice(action_space) # exploration
    else:
        action = np.argmax(q[state, :]) # exploitation
    return action

for game in range(num_games):
    state = env.reset()
    action = choose_action(q, state, epsilon)
    epsilon *= epsilon_decay
    done = False
    episode_reward = 0

    while not done:
        state_next, reward, done, _ = env.step(action)
        episode_reward += reward
        # the behavior policy picks the next action; the update below does not use it
        action_next = choose_action(q, state_next, epsilon)
        # Q-learning update: bootstrap from the best action value in the next state
        q[state, action] += alpha * (reward + gamma * np.amax(q[state_next, :]) - q[state, action])
        state = state_next
        action = action_next

    # episode finished: record its total reward and report a running average
    reward_list.append(episode_reward)
    if game % 100 == 0:
        print('avg reward: ', np.mean(reward_list))



avg reward:  -803.0
avg reward:  -494.46
avg reward:  -77.6
avg reward:  -4.91
avg reward:  3.68
avg reward:  6.45
avg reward:  7.78
avg reward:  8.03
avg reward:  8.18
avg reward:  8.29
avg reward:  8.42
avg reward:  7.61
avg reward:  8.12
avg reward:  7.79
avg reward:  8.1
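
The exploration schedule helps explain the shape of this output: `epsilon` is multiplied by `epsilon_decay` once per game, so after `n` games it equals `epsilon_decay ** n` times its starting value. A small sketch of that arithmetic, using the values from the code above (the helper name `epsilon_after` is an assumption for illustration):

```python
def epsilon_after(n_games, epsilon0=1.0, decay=0.99):
    """Exploration rate after n_games multiplicative decays."""
    return epsilon0 * decay ** n_games

for n in (100, 300, 500):
    print(n, round(epsilon_after(n), 4))
```

After about 100 games roughly a third of the actions are still random, and by 500 games fewer than 1% are, which is consistent with the average reward levelling off in the later rows of the output.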
