# Q-Learning

Q-Learning is an off-policy Temporal Difference (TD) control algorithm. Its update rule is:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t + 1} + \gamma \max_{a}Q(S_{t + 1}, a) - Q(S_t, A_t)]$$
where $Q$ is the learned action-value function, $S_t$ is the state, $A_t$ is the action, $R_{t+1}$ is the reward, $\alpha$ is the step size, $\gamma$ is the discount rate, and $t$ is the time step.

The learned action-value function $Q$ directly approximates the optimal action-value function $q_*$ independent of the policy being followed. For more information on the Q-Learning algorithm see section 6.5 of [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/RLbook2018.pdf).
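
To make the update rule concrete, here is a minimal sketch of a single update on a hand-made table. The table values, the transition, and the choices `alpha = 0.5` and `gamma = 0.9` are assumptions picked for easy arithmetic, not values used elsewhere in this notebook.

```python
import numpy as np

# Hypothetical toy table: 3 states, 2 actions (illustrative values only)
Q = np.array([
    [0.0, 0.0],
    [1.0, 3.0],
    [0.0, 0.0],
])
alpha, gamma = 0.5, 0.9  # assumed step size and discount rate

# One observed transition: (S_t, A_t, R_{t+1}, S_{t+1})
state, action, reward, next_state = 0, 1, -1.0, 1

# The TD target uses the greedy value of the next state,
# regardless of which action the behaviour policy takes next
td_target = reward + gamma * np.max(Q[next_state, :])  # -1 + 0.9 * 3 = 1.7
td_error = td_target - Q[state, action]                # 1.7 - 0 = 1.7
Q[state, action] += alpha * td_error                   # 0 + 0.5 * 1.7 = 0.85

print(Q[state, action])
```

The `max` over the next state's actions is what makes the method off-policy: the target follows the greedy policy even when the agent explores.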

In [8]:
import gym
import numpy as np

In [9]:
# Type of environment, options include:
# Taxi-v3, CliffWalking-v0, FrozenLake-v1
env_type = "Taxi-v3"
# Create the environment
env = gym.make(env_type, render_mode=None)

# Number of possible states and actions
num_states = env.observation_space.n 
num_actions = env.action_space.n

# Action-value function, 
# initialized to 0 for all states and actions
Q = np.zeros([num_states, num_actions])

## Create An $\epsilon$-Greedy Policy 

In [10]:
def policy_fn(state, Q, epsilon, num_actions):
    # Create a distribution of actions and divide the epsilon probability between all actions
    action_dist = np.ones(num_actions, dtype=float) * epsilon / num_actions
    # Find the best action
    best_action = np.argmax(Q[state, :])
    # Add the remaining (1 - epsilon) probability mass to the best action,
    # so it ends up with (1 - epsilon) + epsilon / num_actions
    action_dist[best_action] += (1.0 - epsilon)
    return action_dist
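
As a quick sanity check, the sketch below calls the policy on a hand-made table and samples one action. The function is repeated so the snippet runs on its own. `Q_demo` and the seeded `rng` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def policy_fn(state, Q, epsilon, num_actions):
    # Same epsilon-greedy distribution as defined above
    action_dist = np.ones(num_actions, dtype=float) * epsilon / num_actions
    best_action = np.argmax(Q[state, :])
    action_dist[best_action] += (1.0 - epsilon)
    return action_dist

# Hypothetical table: 1 state, 4 actions; action 2 has the highest value
Q_demo = np.array([[0.0, 1.0, 5.0, 2.0]])
dist = policy_fn(state=0, Q=Q_demo, epsilon=0.1, num_actions=4)

# Every action gets 0.1 / 4 = 0.025; the greedy action gets an extra 0.9
print(dist)

# Sample an action from the distribution (seeded for reproducibility)
rng = np.random.default_rng(0)
action = rng.choice(4, p=dist)
```

Note that the greedy action is chosen with probability `1 - epsilon + epsilon / num_actions`, slightly more than `1 - epsilon`, because the uniform exploration share also covers it.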

## Learn The Optimal Action-Value Function

In [11]:
# Number of episodes to train on
episodes = 1000
# Step size
alpha = 0.618
# Discount rate (1.0 leaves future rewards undiscounted)
gamma = 1.0
# Exploration rate: with this probability the action is drawn uniformly at random
epsilon = 0.1

for episode in range(1, episodes+1):
    terminated, truncated = False, False
    # Return (accumulation of all rewards over the episode)
    G = 0
    state, info = env.reset()
    # Note: only `terminated` ends the loop; `truncated` (time limit) is ignored
    while not terminated:
        # Select the next action following the epsilon-greedy policy
        action_dist = policy_fn(state, Q, epsilon, num_actions)
        action = np.random.choice(np.arange(num_actions), p=action_dist)
        # Take the action and observe reward and next state
        state2, reward, terminated, truncated, info = env.step(action)
        # Q-Learning update: move Q[state, action] toward the TD target
        Q[state, action] += alpha * (reward + gamma * np.max(Q[state2, :]) - Q[state, action])
        G += reward
        state = state2

    if episode % 10 == 0:
        print(f'Episode: {episode} Total Reward: {G}')

Episode: 10 Total Reward: -2405
Episode: 20 Total Reward: -270
Episode: 30 Total Reward: -72
Episode: 40 Total Reward: -6
Episode: 50 Total Reward: -151
Episode: 60 Total Reward: -160
Episode: 70 Total Reward: -132
Episode: 80 Total Reward: -449
Episode: 90 Total Reward: -108
Episode: 100 Total Reward: -67
Episode: 110 Total Reward: -7
Episode: 120 Total Reward: -20
Episode: 130 Total Reward: 3
Episode: 140 Total Reward: 5
Episode: 150 Total Reward: -253
Episode: 160 Total Reward: -1
Episode: 170 Total Reward: -134
Episode: 180 Total Reward: -19
Episode: 190 Total Reward: -10
Episode: 200 Total Reward: -13
Episode: 210 Total Reward: 0
Episode: 220 Total Reward: -63
Episode: 230 Total Reward: 3
Episode: 240 Total Reward: 4
Episode: 250 Total Reward: -17
Episode: 260 Total Reward: 1
Episode: 270 Total Reward: -25
Episode: 280 Total Reward: -4
Episode: 290 Total Reward: -6
Episode: 300 Total Reward: 7
Episode: 310 Total Reward: 6
Episode: 320 Total Reward: 0
Episode: 330 Total Reward: 0
...
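
To see the same training loop without a Gym dependency, here is a minimal sketch on a hypothetical 5-state chain. The environment (`chain_step`), its reward of -1 per step, and the hyperparameters are assumptions made for illustration. The epsilon-greedy choice is written as a branch, which gives the same action distribution as `policy_fn`.

```python
import numpy as np

# Hypothetical 5-state chain: start in state 0, state 4 is terminal.
# Actions: 0 = left, 1 = right. Every step costs -1.
NUM_STATES, NUM_ACTIONS, GOAL = 5, 2, 4

def chain_step(state, action):
    """Deterministic transition for the toy chain (not part of Gym)."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    return next_state, -1.0, next_state == GOAL

rng = np.random.default_rng(0)
Q_chain = np.zeros((NUM_STATES, NUM_ACTIONS))
alpha, gamma, epsilon = 0.5, 1.0, 0.1  # assumed hyperparameters

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy action selection
        if rng.random() < epsilon:
            action = int(rng.integers(NUM_ACTIONS))
        else:
            action = int(np.argmax(Q_chain[state]))
        next_state, reward, done = chain_step(state, action)
        # Q-learning update; the terminal row is never updated and stays at 0,
        # so the target for the final step reduces to the reward alone
        target = reward + gamma * np.max(Q_chain[next_state])
        Q_chain[state, action] += alpha * (target - Q_chain[state, action])
        state = next_state

# Greedy action for each non-terminal state
greedy_actions = np.argmax(Q_chain[:GOAL], axis=1)
print(greedy_actions)
```

On this chain the shortest path is to always move right, so the learned greedy actions should all be `1` once training has settled.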

## Run $q_*$ On An Example

In [12]:
# Act greedily with respect to the learned Q (no exploration, no updates)
# and observe the agent's performance
env = gym.make(env_type, render_mode="human")
state, info = env.reset(seed=64)
G = 0
num_steps = 0
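terminated, truncated = False, False

# Also stop if the environment truncates the episode (e.g. a time limit)
while not (terminated or truncated):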
    action = np.argmax(Q[state, :]) 
    state, reward, terminated, truncated, info = env.step(action)
    G += reward
    num_steps += 1

print(f'Total Reward: {G}, Steps Taken: {num_steps}')

env.close()

Total Reward: 4, Steps Taken: 17
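
The rollout above reads one greedy action per step. As a small sketch, the whole greedy policy and the corresponding state values can also be read off a table in one call each. `Q_demo` is a hypothetical table with made-up values, standing in for the learned `Q`.

```python
import numpy as np

# Hypothetical learned table: 4 states, 3 actions (illustrative values only)
Q_demo = np.array([
    [-2.0, -1.0, -3.0],
    [ 0.5,  0.2,  0.1],
    [-1.0, -1.0,  4.0],
    [ 0.0,  0.0,  0.0],
])

greedy_policy = np.argmax(Q_demo, axis=1)  # best action index per state
state_values = np.max(Q_demo, axis=1)      # value of acting greedily per state

print(greedy_policy)  # [1 0 2 0]
print(state_values)
```

`np.argmax` resolves ties by returning the lowest index, which is why the all-zero last row maps to action `0`.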
